# Predicting High Risk Breast Cancer 2022 (Phase 1)

Predicting High Risk Breast Cancer: a Nightingale OS & AHLI data challenge

## Overview

Every year, 40 million women get a mammogram; some go on to have an invasive biopsy to better examine a concerning area. Underneath these routine tests lies a deep—and disturbing—mystery. Since the 1990s, we have found far more ‘cancers’, which has in turn prompted vastly more surgical procedures and chemotherapy. But death rates from metastatic breast cancer have hardly changed.

When a pathologist looks at a biopsy slide, she is looking for known signs of cancer: tubules, cells with atypical looking nuclei, evidence of rapid cell division. These features, first identified in 1928, still underlie critical decisions today: which women must receive urgent treatment with surgery and chemotherapy? And which can be prescribed “watchful waiting”, sparing them invasive procedures for cancers that would not harm them?

There is already evidence that algorithms can predict which cancers will metastasize and harm patients on the basis of the biopsy image. Fascinatingly, these algorithms also home in on features that humans neglect, for example, the nature of the non-cancerous tissue surrounding the tumor. But to date, the datasets linking biopsy images to patient outcomes (metastasis, death) have been far smaller than what is needed to apply modern approaches.

This dataset contains images and outcomes for 72,400 biopsy slides that correspond to 4,200 cases ranging from 2014 to 2020. Please refer to the full version of the dataset documentation as you get started to learn more about the cohort and key variables for this challenge including mortality and cancer stage.

Providence St. Joseph, Nightingale OS, and The Association for Health Learning and Inference (AHLI) developed this challenge in order to catalyze the development of algorithms that find new signal in digital pathology images, ultimately providing new insights into which patients may be at risk and need preventive treatment.

The goal of this challenge is to predict the stage of a patient’s cancer, using only the slide images generated by a breast biopsy.

Cancer staging is a complex, multidisciplinary task: while it does take into account some features of the biopsy, it also integrates a wide variety of external information: the size of the lesion biopsied, its appearance and location on imaging, and a variety of other tests (imaging and more) to determine whether the cancer has spread to other locations in the body. This important contextual information, most of which is not present in the whole slide image, serves as our ground-truth label for the challenge. By linking features of the whole slide image to this label, algorithmic approaches have the potential to find new sources of signal—beyond the tubules, atypical nuclei, and cell division markers pathologists consider today—that can identify patients with benign or deadly cancers.

Building on successful work in this challenge, a particularly interesting next step is to identify predictable “outliers”: patients whose cancer is far more, or far less, benign than it appears to the pathologists. Researchers at Providence, who have access to rich and granular data on pathologists’ judgments, are eager to collaborate on this exciting follow-on work.

## Dataset

This dataset contains whole slide images from 4,335 breast biopsies, in 3,425 patients, over the years 2014 to 2020. For our purposes, an observation in this dataset corresponds to a biopsy (i.e., performance in the hold-out set will be evaluated at the biopsy level).

Images: Each biopsy generates between one and one hundred physical slides (processed with hematoxylin and eosin stain). The slides have been digitized at 40x magnification with a Hamamatsu slide scanner, yielding whole slide images (WSI). These images have a resolution of around 100,000 x 150,000 pixels, and each is stored as a single NDPI file (average size ~2GB). An NDPI file is a TIFF-like file, and libraries like openslide can be used to read it. The 4,335 biopsies in this dataset generate 69,606 whole slide images, with a median of 13 WSI per biopsy.
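As a rough illustration of how an image of this size might be traversed, the sketch below computes tile coordinates for a slide and reads a single tile with openslide. The 1024-pixel tile size, the helper names `tile_origins` and `read_tile`, and any file path passed in are assumptions made for illustration; only `OpenSlide`, `read_region`, and `close` come from the openslide Python API.

```python
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, tile: int = 1024) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of each tile covering a width x height image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y


def read_tile(path: str, x: int, y: int, tile: int = 1024):
    """Read one RGB tile at full resolution (level 0) from a whole slide image."""
    import openslide  # imported lazily so the grid helper works without it

    slide = openslide.OpenSlide(path)
    try:
        # read_region takes level-0 coordinates and returns an RGBA PIL image
        return slide.read_region((x, y), 0, (tile, tile)).convert("RGB")
    finally:
        slide.close()


# A slide of roughly 100,000 x 150,000 pixels needs about 14,000 tiles of 1024 px
n_tiles = sum(1 for _ in tile_origins(100_000, 150_000))
```

Reading tiles on demand avoids loading a whole ~2GB image into memory; in practice many tiles are background and could be skipped, which this sketch does not attempt.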

Labels: The primary label is the cancer stage associated with a biopsy. Table 1 shows that not all cancers are staged: only those with an initial diagnosis and first round of treatment at Providence will have a stage assigned (by convention, cases are staged and reported at the time of diagnosis, by the institution at which the diagnosis was made; 94% of staging judgments in this dataset are made within one month of biopsy).

Dataset splits: The dataset has been split randomly at the patient level with 75% of the data made available. The 25% holdout will be used for validation purposes. Refer to Table 1 for what is expected to be made available.

#### Table 1

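| | Train | Holdout |
|---|---|---|
| N biopsies | 3258 | 1077 |
| N patients | 2569 | 856 |
| N images | 52707 | 16899 |
| N biopsies with stage | 2722 | 886 |
| Stage 0 | 15% | redacted |
| Stage I | 52% | redacted |
| Stage II | 24% | redacted |
| Stage III | 6.4% | redacted |
| Stage IV | 2.2% | redacted |
| N unstaged biopsies | 536 | 191 |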
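When carving a local validation set out of the training data, it may help to mirror the patient-level split described above, so that biopsies from one patient never appear on both sides. The following is a minimal sketch under stated assumptions: the function name `split_by_patient` and the record keys `patient_id` and `biopsy_id` are hypothetical, and the actual field names are given in the dataset documentation.

```python
import random
from typing import Dict, List, Tuple


def split_by_patient(
    biopsies: List[Dict[str, str]], holdout_frac: float = 0.25, seed: int = 0
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split biopsy records so that all biopsies of a patient land on the same side."""
    patients = sorted({b["patient_id"] for b in biopsies})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_holdout = round(len(patients) * holdout_frac)
    holdout_patients = set(patients[:n_holdout])
    train = [b for b in biopsies if b["patient_id"] not in holdout_patients]
    holdout = [b for b in biopsies if b["patient_id"] in holdout_patients]
    return train, holdout


# Toy records: 8 patients with two biopsies each (all IDs are made up)
records = [
    {"patient_id": f"p{i}", "biopsy_id": f"b{i}-{j}"} for i in range(8) for j in range(2)
]
train, holdout = split_by_patient(records)
```

Splitting on patients rather than biopsies keeps a local validation score from being inflated by near-duplicate tissue from the same person.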

Model performance measurement: The primary metric by which we will evaluate model performance is the prediction of cancer stage among staged cases in the holdout set. More detailed information on the exact scoring methodology is below. Like the training dataset, the holdout set contains staged and unstaged biopsies (see Table 1), but only staged biopsies will be used to calculate the primary Challenge metric. We will award additional prizes for other aspects of performance, reflecting both the clinical utility of models and Nightingale Open Science’s commitment to equity.

You are free to use images from unstaged cases in any way you’d like in the training process. You are also free to use any other available information in the training process: detailed demographic information including age, sex, and self-reported race; and other information about the cancer and its progression beyond stage, including mortality and ICD codes for metastatic disease (though keep in mind the many caveats of this information, as noted for example here). However, note that none of this contextual information will be provided in the holdout set—only the slide images. Please refer to the full version of the dataset documentation as you get started to learn more about the cohort and key variables.

## Important Challenge Dates

The contest will have two phases. The dataset is the same for each phase, but each phase is completely independent, with its own teams and prizes.

#### Phase 1

• Wed June 1, 2022 00:00 UTC - start date
• Fri August 12, 2022 00:00 UTC - submission deadline
• Fri-Sat August 26-27 - Portland conference TBA

#### Phase 2

• Sat August 27, 2022 00:00 UTC (during Portland conference) - start date
• Mon November 14, 2022 00:00 UTC - submission deadline
• Mon November 28, 2022 - Machine Learning for Health (ML4H) and NeurIPS 2022 TBA

## Rules

1. Registration required. Only registered Nightingale OS users may participate in the contest.
   1. Registration is open to anyone worldwide who has an active affiliation with an accredited academic institution.
   2. Your registration must be approved by Nightingale before you can access contest data or any other Nightingale OS resources.
   3. You don’t have to participate in Phase 1 to be eligible for Phase 2.
2. Teams and collaboration
   1. Size limit. Teams can be any size. As a general rule, contests often limit team sizes to about 5. Because this is a multi-disciplinary research area, Nightingale recognizes that teams may be larger from time to time.
   2. No merges. You can only be a member of one team during a given phase.
      1. If you entered the competition as an individual, you may join someone else’s project only if you have not submitted any entries for scoring during the contest’s current phase.
      2. You may join a different team in the second phase.
   3. No sharing. Sharing code between teams is prohibited during a given phase unless the information shared is free and publicly available.
   4. Public discussion. You may not publicly describe the methodology you are using for the competition until after the submission deadline for a given phase. You may describe your methodology in the context of other datasets as long as you don’t also indicate this is the methodology you used in competition.
   5. Publication. Please use the recommended citation found on the dataset documentation page. Although the dataset is available to all Nightingale users, please do not submit contest-related work to other conferences or journals until the current contest phase has ended. Our goal is to provide all competitors with the opportunity to discuss methods and findings in a single forum at the Phase 1 conference in Portland or ML4H in the fall.
3. Scoring
   1. Predictions CSV. Teams submit entries for scoring in the form of a predictions file. Nightingale will score the entry according to the methodology described in the Scoring methodology section.
   2. Entry limit. Each team can submit 1 entry per day.
   3. Public leaderboard. During the competition period, each team’s current ranking will be visible on the competition public leaderboard. After the phase end date, the leaderboard will reveal each team’s score and model description.
   4. High score. The best score from all your team’s submissions will be used in the leaderboard, along with your description of that entry, if any. See Scoring for details.
   5. Tiebreaker. In the event of a tie, the first submission will outrank subsequent submissions.
   6. Submission requirements. Nightingale may make efforts to help you validate your predictions file and fix any issues before scoring. However, invalid submissions may result in no score and may count against your team’s entry limit.
4. Resources
   1. Use of non-public data or software. Use of free and publicly available external data is allowed, including pre-trained models. If you use any proprietary software or data that is not free and publicly available, such as a model pre-trained on private data, then you should declare this by including “[NON-PUBLIC]” in your Nightingale project description field. Teams using non-public resources will have their rankings and scores published on the leaderboard and are encouraged to contribute to the conferences, but they will not be eligible for any prizes.
   2. Billing. Nightingale OS provides limited free computing. Teams are responsible for all costs incurred when they use non-free resources.

## Scoring

#### Scoring methodology

Cancer stage takes on discrete values (0, I, II, III, IV), and clinically, some errors are more costly than others: in broad strokes, metastatic disease (stage IV) is managed very differently from locoregional disease, and carries a much higher mortality rate. That said, we felt that mean squared error was a good approximation of the clinical loss function.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\text{predicted stage}_i - \text{actual stage}_i\right)^2$$

We will accept continuous predictions from 0 to 4: we want to reward getting as close to the recorded stage as possible, rather than asking participants to convert predictions to whole numbers.
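To make the metric concrete, here is a minimal sketch of how the score could be computed locally. The numeric encoding of stages (0, I, II, III, IV mapped to 0 through 4) is an assumption implied by the accepted prediction range, and the function name and biopsy IDs are made up; the official scoring code may differ in its details.

```python
from typing import Dict

# Assumed numeric encoding of the recorded stage, implied by the 0-to-4 prediction range
STAGE_TO_NUMBER = {"0": 0, "I": 1, "II": 2, "III": 3, "IV": 4}


def mean_squared_error(predicted: Dict[str, float], actual: Dict[str, str]) -> float:
    """MSE over the biopsies that have a recorded stage; other predictions are ignored."""
    errors = [
        (predicted[biopsy_id] - STAGE_TO_NUMBER[stage]) ** 2
        for biopsy_id, stage in actual.items()
    ]
    return sum(errors) / len(errors)


# Made-up biopsy IDs: "c" is unstaged, so it does not enter the score
actual = {"a": "I", "b": "IV"}
predicted = {"a": 1.5, "b": 3.0, "c": 2.0}
score = mean_squared_error(predicted, actual)  # (0.5**2 + 1.0**2) / 2 = 0.625
```

Because errors are squared, a prediction that is off by two stages costs four times as much as one that is off by a single stage, which is why a continuous value between two plausible stages can score better than a hard guess.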

Nightingale randomly samples a set of patients to create a hold-out set, on which entries will be scored. You can see the images in the dataset directory in the holdout subdirectory, but of course, stage and mortality fields are excluded. We have included both the staged and unstaged biopsies for the holdout. A table is included in the holdout that indicates which biopsies have been staged. When submitting the prediction file, submit predictions for all staged cases and any unstaged cases.

~/datasets/brca-psj-path/holdout

To submit an entry, predict the stage for each biopsy and write results to a CSV file.
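Since a biopsy can have many slides, a model that scores individual slides needs some way to produce one prediction per biopsy. The sketch below averages slide-level predictions; this is only one simple choice and is not prescribed by the challenge. The function name and IDs are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def aggregate_by_biopsy(slide_predictions: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average slide-level stage predictions into one prediction per biopsy."""
    grouped = defaultdict(list)
    for biopsy_id, stage in slide_predictions:
        grouped[biopsy_id].append(stage)
    # Keep the result inside the accepted 0-to-4 range
    return {b: min(4.0, max(0.0, mean(v))) for b, v in grouped.items()}


# Made-up IDs and values: two slides for one biopsy, one slide for another
per_slide = [("biopsy-1", 1.0), ("biopsy-1", 2.0), ("biopsy-2", 3.5)]
per_biopsy = aggregate_by_biopsy(per_slide)
```

Other aggregations, such as taking the maximum or learning the pooling step, may work better; the mean is shown only because it is easy to reason about.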

#### Predictions CSV file format

• The file needs to be located in your project directory. Specifically, a predictions file should be in ~/project or any subdirectory.

• The predictions file can have any name.

• For subsequent submissions, it doesn’t matter whether you use the same file or create a new one.

The CSV file should have no header row.

Each line should have the following schema:

• Biopsy ID (string)
• Cancer stage (float)

Example CSV

```
47ba1eb2-0d3b-4752-80d3-6d318001751e,0.1234
e4235769-c290-4bce-bf3a-9b98c7ef80b5,1.2345
d9bd5e69-98ce-4736-a108-fd64234ffb05,3.7890
...
```
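A file in this format can be produced with Python's standard `csv` module. The sketch below is illustrative: the function name `write_predictions` is hypothetical, the four-decimal formatting simply mirrors the example, and a temporary directory stands in for a location under ~/project so the snippet runs anywhere.

```python
import csv
import os
import tempfile
from typing import Dict


def write_predictions(predictions: Dict[str, float], path: str) -> None:
    """Write one 'biopsy_id,stage' row per biopsy, with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for biopsy_id, stage in predictions.items():
            writer.writerow([biopsy_id, f"{stage:.4f}"])


# For a real entry, the path would be somewhere under ~/project instead
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
write_predictions(
    {
        "47ba1eb2-0d3b-4752-80d3-6d318001751e": 0.1234,
        "e4235769-c290-4bce-bf3a-9b98c7ef80b5": 1.2345,
    },
    out_path,
)
```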

#### Submission description (recommended)

• You may optionally add a description for each submission. The description cannot be changed after you submit. We strongly recommend you use this field to note the version of the model that generated the predictions, so that scores can be traced back to a particular model.
• You will see descriptions in your Submissions list inside your team’s Nightingale project.
• The public will see the description for your best submission after the contest end date.

#### How to submit an entry

```python
>>> import ngsci
>>> ngsci.submit_contest_entry("path/to/your.csv", description="our model")
(<Result.SUCCESS: 1>, 'success')
```

After you submit a file, Nightingale will attempt to validate the schema of your CSV. If it fails validation, you will see one or more errors in the Submissions tab in your Nightingale OS project (not in your JupyterLab editor). Any validation errors detected automatically will not count against your submission limit.

After a predictions file passes validation, it will be submitted for scoring. The result will be posted in your project Submissions view, and if the result is a new high score for your team, the public leaderboard will also be updated.
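Because validation errors show up in the Submissions tab rather than in JupyterLab, it can be convenient to run a quick local check before submitting. The checks below are a guess at reasonable ones based on the file format described above (two columns, a numeric stage between 0 and 4, no repeated biopsy IDs); Nightingale's own validation may apply different or additional rules, and `check_predictions_file` is a hypothetical helper, not part of `ngsci`.

```python
import csv
import os
import tempfile
from typing import List


def check_predictions_file(path: str) -> List[str]:
    """Return a list of problems found in a predictions CSV (empty if none)."""
    problems = []
    seen = set()
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != 2:
                problems.append(f"line {line_no}: expected 2 columns, got {len(row)}")
                continue
            biopsy_id, stage = row
            if biopsy_id in seen:
                problems.append(f"line {line_no}: duplicate biopsy ID {biopsy_id}")
            seen.add(biopsy_id)
            try:
                value = float(stage)
            except ValueError:
                problems.append(f"line {line_no}: stage {stage!r} is not a number")
                continue
            if not 0.0 <= value <= 4.0:
                problems.append(f"line {line_no}: stage {value} is outside 0 to 4")
    return problems


# Demo file with made-up IDs: line 2 is not numeric, line 3 is out of range
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w") as f:
    f.write("id-1,0.5\nid-2,five\nid-3,4.2\n")
problems = check_predictions_file(demo_path)
```

An accidental header row would also be caught here, since its second column would not parse as a number.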

## How to form a team

After you have registered for Nightingale and a Nightingale admin has admitted you, you can enter the currently active contest phase. For example: Predicting High Risk Breast Cancer 2022 (Phase 1).

Collaboration in Nightingale OS happens inside projects. In the case of a contest, your project is your team. After you create your team, you can add teammates by adding them as members of your project.

• Only registered users are eligible to be added to your project/team.
• You can only be a member of one team per contest phase.

## Host organizations

This is a challenge jointly hosted by Nightingale Open Science; The Association for Health Learning and Inference; and Providence St. Joseph Health.

Nightingale Open Science is a platform connecting researchers with deidentified, cutting edge medical datasets. The Nightingale OS team works closely with health systems around the world to create and curate datasets of medical images linked to ground-truth labels, and make them freely available to academic researchers. Nightingale OS launched at the 2021 NeurIPS conference with five anchor datasets spanning different disease areas.

The Association for Health Learning and Inference (AHLI) is a not-for-profit organization dedicated to building a transdisciplinary machine learning and health community. AHLI works with its partners to advance health data quality and access, knowledge discovery, and meaningful use of complex health data. AHLI was founded in September 2021 with generous support from Schmidt Futures.

Providence St. Joseph Health is a not-for-profit health care system operating in seven states and serves as the parent organization for 100,000 caregivers. The combined system includes 51 hospitals, 829 clinics, and other health, education and social services across Washington, Oregon, California, Alaska, Montana, New Mexico, and Texas.

Together, our three teams are thrilled to collaborate on this contest to spur collaboration and competition in the field of computational medicine.

## Acknowledgements

We thank our generous funders: Schmidt Futures, The Gordon and Betty Moore Foundation, and Ken Griffin. We would also like to acknowledge the team that conceived of and created this dataset, and worked with us to make this challenge possible: Carlo Bifulco, MD, Director of Molecular Pathology and Pathology Informatics; Brian Piening, PhD, Technical Director of Clinical Genomics; and Tucker Bower, Bioinformatics Scientist. Many thanks also for the leadership of Ari Robicsek, Chief Medical Analytics Officer at Providence; Bill Wright, VP of Health Innovation at Providence; and Raina Tamakawna, Enterprise and GME Research Program Manager at Providence.

We also thank Hamamatsu, developers of the NanoZoomer 360 platform, who supported this work with a grant from their Product Marketing Division and partnered with Providence to ensure a seamless start to the dataset creation process.

## Prizes

We’ll have a variety of prizes – ranging from cash to compute credit awards – for the different phases and categories of the challenge. You must be a registered user on the Nightingale platform to be eligible for prizes.

Final winners will be announced after Phase 2 concludes, at the Machine Learning for Health (ML4H) Symposium co-located with NeurIPS in November 2022.