torch_em.data.datasets.histopathology.orion_crc

ORION-CRC contains paired H&E and multiplex immunofluorescence images of colorectal cancer tissue.

This loader supports the processed ORION-CRC tile dataset released with MIPHEI-ViT: https://zenodo.org/records/15340874. The source ORION-CRC dataset is available from https://zenodo.org/records/7637988 and is described in https://doi.org/10.1038/s43018-023-00576-1.

The processed release provides H&E tiles, mIF tiles, Cellpose-generated nucleus instance masks, and per-cell CSV metadata with nucleus positions and cell types. Tiles are grouped per slide and stored in per-slide HDF5 files with:

  • raw/he: H&E tiles stacked as (3, N, H, W).
  • raw/mif: mIF tiles stacked as (C, N, H, W).
  • labels/nucleus/instances: Cellpose nucleus instance masks as (N, H, W).
  • labels/nucleus/semantic: derived cell-type semantic masks as (N, H, W), if per-cell CSV data is available.
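
As a rough illustration of this layout, the sketch below builds small in-memory NumPy arrays with the same axis order and pulls out a single tile. The array sizes and the `get_tile` helper are made up for the example and are not part of the loader; datasets opened with h5py accept the same slicing.

```python
import numpy as np

# Hypothetical sizes for illustration only: 4 tiles of 8x8 pixels, 5 mIF channels.
n_tiles, height, width, n_mif_channels = 4, 8, 8, 5

# Stand-ins for the datasets of one per-slide file, using the documented axis order.
layout = {
    "raw/he": np.zeros((3, n_tiles, height, width), dtype="uint8"),
    "raw/mif": np.zeros((n_mif_channels, n_tiles, height, width), dtype="uint8"),
    "labels/nucleus/instances": np.zeros((n_tiles, height, width), dtype="uint32"),
}


def get_tile(layout, index):
    """Return (he, mif, instances) for one tile.

    The tile axis is axis 1 for the raw data and axis 0 for the labels.
    """
    he = layout["raw/he"][:, index]
    mif = layout["raw/mif"][:, index]
    instances = layout["labels/nucleus/instances"][index]
    return he, mif, instances


he, mif, instances = get_tile(layout, 2)
print(he.shape, mif.shape, instances.shape)
```

Keeping the channel axis first means that a single tile comes out channel-first, as `(C, H, W)`.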

The nucleus instance labels are generated by Cellpose on the DAPI channel in the processed ORION release, so they are algorithm-generated instance segmentations, not manual annotations. The semantic labels are not native pixel annotations either: they are derived by assigning the CSV cell type of each nucleus coordinate to the corresponding nucleus instance region. Background is 0 and class IDs are stored in semantic_label_mapping.csv in the preprocessed output directory.
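
The derivation of the semantic labels can be sketched with a lookup table from instance ID to class ID. This is a simplified illustration and not the loader's exact code: the function name `semantic_from_instances` is hypothetical, and the coordinates are assumed to be already tile-local (the loader additionally subtracts the tile origin when it is known).

```python
import numpy as np


def semantic_from_instances(instances, xs, ys, class_ids):
    """Paint each nucleus instance with the class of the cell whose coordinate falls inside it."""
    height, width = instances.shape
    cols = np.round(xs).astype(int)
    rows = np.round(ys).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)

    # Instance ID under each in-bounds coordinate (0 means the point landed on background).
    hit_ids = instances[rows[inside], cols[inside]]
    hit_classes = class_ids[inside]
    on_nucleus = hit_ids > 0

    # Lookup table from instance ID to class ID; index 0 keeps background at 0.
    lookup = np.zeros(int(instances.max()) + 1, dtype="uint16")
    lookup[hit_ids[on_nucleus]] = hit_classes[on_nucleus]
    return lookup[instances]


instances = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [3, 0, 0, 0],
])
xs = np.array([0.2, 3.0, 9.0])  # the third cell lies outside the tile
ys = np.array([0.9, 1.4, 9.0])
class_ids = np.array([5, 7, 4], dtype="uint16")

semantic = semantic_from_instances(instances, xs, ys, class_ids)
print(semantic)
```

In this toy example instance 3 has no matching coordinate, so it stays at 0 and is indistinguishable from background in the semantic mask.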

Please cite the ORION-CRC and MIPHEI-ViT publications if you use this dataset for your research.

"""ORION-CRC contains paired H&E and multiplex immunofluorescence images of colorectal cancer tissue.

This loader supports the processed ORION-CRC tile dataset released with MIPHEI-ViT:
https://zenodo.org/records/15340874. The source ORION-CRC dataset is available from
https://zenodo.org/records/7637988 and is described in https://doi.org/10.1038/s43018-023-00576-1.

The processed release provides H&E tiles, mIF tiles, Cellpose-generated nucleus instance masks, and per-cell CSV
metadata with nucleus positions and cell types. Tiles are grouped per slide and stored in per-slide HDF5 files with:

- `raw/he`: H&E tiles stacked as `(3, N, H, W)`.
- `raw/mif`: mIF tiles stacked as `(C, N, H, W)`.
- `labels/nucleus/instances`: Cellpose nucleus instance masks as `(N, H, W)`.
- `labels/nucleus/semantic`: derived cell-type semantic masks as `(N, H, W)`, if per-cell CSV data is available.

The nucleus instance labels are generated by Cellpose on the DAPI channel in the processed ORION release, so they
are algorithm-generated instance segmentations, not manual annotations. The semantic labels are not native pixel
annotations either: they are derived by assigning the CSV cell type of each nucleus coordinate to the corresponding
nucleus instance region. Background is 0 and class IDs are stored in `semantic_label_mapping.csv` in the
preprocessed output directory.

Please cite the ORION-CRC and MIPHEI-ViT publications if you use this dataset for your research.
"""

import os
import re
from concurrent.futures import ThreadPoolExecutor
from glob import glob
from multiprocessing import Pool
from typing import List, Literal, Optional, Tuple, Union

import imageio.v3 as imageio
import numpy as np
import pandas as pd

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


URL = "https://zenodo.org/api/records/15340874/files/ORIONCRC_dataset_tile_20x.zip/content"
ZIP_NAME = "ORIONCRC_dataset_tile_20x.zip"
SPLITS = ("train", "val", "test")

CELL_TYPE_COLUMNS = ("cell_type", "celltype", "cell_type_pred", "predicted_cell_type", "phenotype", "class", "label")
X_COLUMNS = ("x", "X", "centroid_x", "nucleus_x", "nuclei_x", "center_x")
Y_COLUMNS = ("y", "Y", "centroid_y", "nucleus_y", "nuclei_y", "center_y")
TILE_X_COLUMNS = ("tile_x", "x_start", "xmin", "min_x", "left")
TILE_Y_COLUMNS = ("tile_y", "y_start", "ymin", "min_y", "top")


def _find_file(path, name):
    matches = glob(os.path.join(path, "**", name), recursive=True)
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1:
        return sorted(matches)[0]
    return None


def _resolve_path(root, metadata_path, value):
    value = str(value)
    candidates = [
        os.path.join(os.path.dirname(metadata_path), value),
        os.path.join(root, value),
        value,
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return candidates[0]


def _find_column(columns, candidates):
    lower_to_column = {column.lower(): column for column in columns}
    for candidate in candidates:
        if candidate.lower() in lower_to_column:
            return lower_to_column[candidate.lower()]
    return None


def _get_metadata(root, split):
    metadata_path = _find_file(root, f"{split}_dataframe.csv")
    if metadata_path is None:
        raise RuntimeError(f"Could not find {split}_dataframe.csv in {root}.")
    metadata = pd.read_csv(metadata_path)
    return metadata_path, metadata


def _get_slide_csv_paths(root):
    slide_dataframe_path = _find_file(root, "slide_dataframe.csv")
    if slide_dataframe_path is None:
        return {}
    slide_dataframe = pd.read_csv(slide_dataframe_path)
    slide_name_col = _find_column(slide_dataframe.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None or "nuclei_csv_path" not in slide_dataframe.columns:
        return {}
    return {
        row[slide_name_col]: _resolve_path(root, slide_dataframe_path, row["nuclei_csv_path"])
        for _, row in slide_dataframe.iterrows()
    }


def _get_slide_id_map(root):
    slide_df_path = _find_file(root, "slide_dataframe.csv")
    if slide_df_path is None:
        return {}
    slide_df = pd.read_csv(slide_df_path)
    slide_name_col = _find_column(slide_df.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None or "orion_slide_id" not in slide_df.columns:
        return {}
    return dict(zip(slide_df[slide_name_col], slide_df["orion_slide_id"]))


def _parse_tile_origin(path):
    stem = os.path.splitext(os.path.basename(path))[0]
    numbers = [int(n) for n in re.findall(r"\d+", stem)]
    # Tile filenames follow the pattern *_x_y_z_width_height.*, so origin is at [-5], [-4].
    if len(numbers) >= 5:
        return numbers[-5], numbers[-4]
    return None


def _get_tile_origin(row, image_path):
    x_column = _find_column(row.index, TILE_X_COLUMNS)
    y_column = _find_column(row.index, TILE_Y_COLUMNS)
    if x_column is not None and y_column is not None:
        return int(row[x_column]), int(row[y_column])
    return _parse_tile_origin(image_path)


def _get_cell_type_mapping(csv_tables, cell_type_column):
    cell_types = set()
    for table in csv_tables.values():
        cell_types.update(str(value) for value in table[cell_type_column].dropna().unique())
    return {cell_type: label_id for label_id, cell_type in enumerate(sorted(cell_types), start=1)}


def _read_image(path):
    image = imageio.imread(path)
    if image.ndim == 3:
        image = image.transpose(2, 0, 1)
    return image


def _read_label(path):
    label = imageio.imread(path)
    if label.ndim == 3:
        label = label[..., 0]
    return label


def _collect_cell_tables(root):
    tables = {}
    for slide_name, csv_path in _get_slide_csv_paths(root).items():
        if os.path.exists(csv_path):
            tables[slide_name] = pd.read_csv(csv_path)
    return tables


def _infer_cell_columns(cell_tables):
    if not cell_tables:
        return None
    first_table = next(iter(cell_tables.values()))
    cell_type_column = _find_column(first_table.columns, CELL_TYPE_COLUMNS)
    x_column = _find_column(first_table.columns, X_COLUMNS)
    y_column = _find_column(first_table.columns, Y_COLUMNS)
    if cell_type_column is None or x_column is None or y_column is None:
        return None
    return cell_type_column, x_column, y_column


def _write_cell_type_mapping(output_root, mapping):
    mapping_path = os.path.join(output_root, "semantic_label_mapping.csv")
    if os.path.exists(mapping_path):
        return
    os.makedirs(output_root, exist_ok=True)
    pd.DataFrame(
        [{"label_id": label_id, "cell_type": cell_type} for cell_type, label_id in mapping.items()]
    ).to_csv(mapping_path, index=False)


def _make_semantic_label_from_instances(row, image_path, nuclei, cell_table, cell_type_mapping, cell_columns):
    cell_type_column, x_column, y_column = cell_columns
    origin = _get_tile_origin(row, image_path)
    tile_h, tile_w = nuclei.shape

    valid_mask = cell_table[cell_type_column].notna()
    if not valid_mask.any():
        return np.zeros(nuclei.shape, dtype="uint16")

    cells = cell_table[valid_mask]
    xs = cells[x_column].to_numpy(dtype=float)
    ys = cells[y_column].to_numpy(dtype=float)
    class_ids = np.array([cell_type_mapping[str(v)] for v in cells[cell_type_column]], dtype="uint16")

    if origin is not None:
        lx = np.round(xs - origin[0]).astype(int)
        ly = np.round(ys - origin[1]).astype(int)
        in_bounds = (lx >= 0) & (lx < tile_w) & (ly >= 0) & (ly < tile_h)
        inst_ids = np.zeros(len(xs), dtype=nuclei.dtype)
        inst_ids[in_bounds] = nuclei[ly[in_bounds], lx[in_bounds]]

        needs_fallback = ~in_bounds | (inst_ids == 0)
        if needs_fallback.any():
            lx_raw = np.round(xs).astype(int)
            ly_raw = np.round(ys).astype(int)
            fb = needs_fallback & (lx_raw >= 0) & (lx_raw < tile_w) & (ly_raw >= 0) & (ly_raw < tile_h)
            inst_ids[fb] = nuclei[ly_raw[fb], lx_raw[fb]]
    else:
        lx = np.round(xs).astype(int)
        ly = np.round(ys).astype(int)
        in_bounds = (lx >= 0) & (lx < tile_w) & (ly >= 0) & (ly < tile_h)
        inst_ids = np.zeros(len(xs), dtype=nuclei.dtype)
        inst_ids[in_bounds] = nuclei[ly[in_bounds], lx[in_bounds]]

    hit = inst_ids > 0
    if not hit.any():
        return np.zeros(nuclei.shape, dtype="uint16")

    inst_to_class = np.zeros(int(nuclei.max()) + 1, dtype="uint16")
    inst_to_class[inst_ids[hit]] = class_ids[hit]
    return inst_to_class[nuclei]


def _preprocess_slide(
    root, metadata_path, slide_name, group, output_path, cell_tables, cell_columns, cell_type_mapping
):
    import h5py

    if os.path.exists(output_path):
        return

    has_cell_table = cell_columns is not None and slide_name in cell_tables
    tmp_path = output_path + ".tmp"
    n = 0
    N = len(group)
    tile_h = tile_w = None
    he_ds = mif_ds = inst_ds = sem_ds = None

    with h5py.File(tmp_path, "w") as f:
        f.attrs["slide_name"] = slide_name

        for _, row in group.iterrows():
            he_path = _resolve_path(root, metadata_path, row["image_path"])
            mif_path = _resolve_path(root, metadata_path, row["target_path"])
            nucleus_path = _resolve_path(root, metadata_path, row["nuclei_path"])
            if not (os.path.exists(he_path) and os.path.exists(mif_path) and os.path.exists(nucleus_path)):
                continue

            with ThreadPoolExecutor(max_workers=3) as ex:
                he_f = ex.submit(_read_image, he_path)
                mif_f = ex.submit(_read_image, mif_path)
                nuc_f = ex.submit(_read_label, nucleus_path)
                he = he_f.result()
                mif = mif_f.result()
                nuclei = nuc_f.result()

            if he.ndim == 2:
                he = he[None]

            if tile_h is None:
                tile_h, tile_w = he.shape[-2:]
            elif he.shape[-2:] != (tile_h, tile_w):
                continue

            if mif.ndim == 2:
                mif = mif[None]

            if he_ds is None:
                C_he, C_mif = he.shape[0], mif.shape[0]
                he_ds = f.create_dataset(
                    "raw/he", shape=(C_he, N, tile_h, tile_w),
                    maxshape=(C_he, None, tile_h, tile_w),
                    compression="lzf", chunks=(C_he, 1, tile_h, tile_w), dtype=he.dtype
                )
                mif_ds = f.create_dataset(
                    "raw/mif", shape=(C_mif, N, tile_h, tile_w),
                    maxshape=(C_mif, None, tile_h, tile_w),
                    compression="lzf", chunks=(C_mif, 1, tile_h, tile_w), dtype=mif.dtype
                )
                inst_ds = f.create_dataset(
                    "labels/nucleus/instances", shape=(N, tile_h, tile_w),
                    maxshape=(None, tile_h, tile_w),
                    compression="lzf", chunks=(1, tile_h, tile_w), dtype=nuclei.dtype
                )
                if has_cell_table:
                    sem_ds = f.create_dataset(
                        "labels/nucleus/semantic", shape=(N, tile_h, tile_w),
                        maxshape=(None, tile_h, tile_w),
                        compression="lzf", chunks=(1, tile_h, tile_w), dtype="uint16"
                    )

            he_ds[:, n] = he
            mif_ds[:, n] = mif
            inst_ds[n] = nuclei

            if has_cell_table and sem_ds is not None:
                sem_ds.resize(n + 1, axis=0)
                sem_ds[n] = _make_semantic_label_from_instances(
                    row, he_path, nuclei, cell_tables[slide_name], cell_type_mapping, cell_columns
                )

            n += 1

        if he_ds is not None and n < N:
            he_ds.resize(n, axis=1)
            mif_ds.resize(n, axis=1)
            inst_ds.resize(n, axis=0)
            if sem_ds is not None:
                sem_ds.resize(n, axis=0)

    if n == 0:
        os.remove(tmp_path)
        return

    os.rename(tmp_path, output_path)


def _preprocess_split(root, split, preprocessing_workers=8):
    metadata_path, metadata = _get_metadata(root, split)
    expected_columns = {"image_path", "target_path", "nuclei_path"}
    missing_columns = expected_columns - set(metadata.columns)
    if missing_columns:
        raise RuntimeError(f"Missing columns in {metadata_path}: {sorted(missing_columns)}.")

    output_root = os.path.join(root, "preprocessed", "orion_crc")
    split_root = os.path.join(output_root, split)
    os.makedirs(split_root, exist_ok=True)

    slide_id_map = _get_slide_id_map(root)
    cell_tables = _collect_cell_tables(root)
    cell_columns = _infer_cell_columns(cell_tables)
    cell_type_mapping = None
    if cell_columns is not None:
        cell_type_mapping = _get_cell_type_mapping(cell_tables, cell_columns[0])
        _write_cell_type_mapping(output_root, cell_type_mapping)

    slide_name_col = _find_column(metadata.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None:
        raise RuntimeError(f"Could not find slide name column in {metadata_path}.")

    tasks = []
    for slide_name, group in metadata.groupby(slide_name_col):
        slide_id = slide_id_map.get(slide_name, slide_name.split(".")[0])
        output_path = os.path.join(split_root, f"{slide_id}.h5")
        tasks.append(
            (root, metadata_path, slide_name, group, output_path, cell_tables, cell_columns, cell_type_mapping)
        )

    n_workers = min(preprocessing_workers, len(tasks))
    if n_workers > 1:
        with Pool(n_workers) as pool:
            pool.starmap(_preprocess_slide, tasks)
    else:
        for args in tasks:
            _preprocess_slide(*args)

    return output_root


def get_orion_crc_data(
    path: Union[os.PathLike, str],
    split: Optional[Literal["train", "val", "test"]] = None,
    download: bool = False,
    preprocessing_workers: int = 8,
) -> str:
    """Download and preprocess the processed ORION-CRC tile dataset.

    The archive is large, about 127 GB, so only use `download=True` when you really want to fetch it.
    Alternatively, download and extract `ORIONCRC_dataset_tile_20x.zip` manually into `path`.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to preprocess. By default all available splits are preprocessed.
        download: Whether to download the data if it is not present.
        preprocessing_workers: Number of parallel workers for preprocessing slides.

    Returns:
        Filepath where preprocessed HDF5 files are stored.
    """
    os.makedirs(path, exist_ok=True)
    if _find_file(path, "train_dataframe.csv") is None:
        zip_path = os.path.join(path, ZIP_NAME)
        if os.path.exists(zip_path):
            util.unzip(zip_path, path, remove=False)
        elif download:
            util.download_source(zip_path, URL, download=download, checksum=None)
            util.unzip(zip_path, path, remove=False)
        else:
            raise RuntimeError(
                f"Could not find the processed ORION-CRC data in {path}. "
                f"Please download {ZIP_NAME} from https://zenodo.org/records/15340874 and extract it there, "
                "or pass `download=True` to download the 127 GB archive."
            )

    splits = SPLITS if split is None else (split,)
    for this_split in splits:
        output_root = _preprocess_split(path, this_split, preprocessing_workers=preprocessing_workers)
    return output_root


def get_orion_crc_paths(
    path: Union[os.PathLike, str],
    split: Literal["train", "val", "test"],
    download: bool = False,
    preprocessing_workers: int = 8,
) -> List[str]:
    """Get paths to preprocessed per-slide ORION-CRC HDF5 files.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to use. Either "train", "val" or "test".
        download: Whether to download the data if it is not present.
        preprocessing_workers: Number of parallel workers for preprocessing slides.

    Returns:
        List of preprocessed per-slide HDF5 filepaths.
    """
    if split not in SPLITS:
        raise ValueError(f"'{split}' is not a valid split choice. Choose from {SPLITS}.")
    output_root = get_orion_crc_data(path, split=split, download=download, preprocessing_workers=preprocessing_workers)
    paths = sorted(glob(os.path.join(output_root, split, "*.h5")))
    if not paths:
        raise RuntimeError("Could not find any preprocessed ORION-CRC slides for the requested settings.")
    return paths


def get_orion_crc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> Dataset:
    """Get the processed ORION-CRC dataset for nucleus instance or semantic segmentation.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either "train", "val" or "test".
        modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
        label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
        download: Whether to download the data if it is not present.
        resize_inputs: Whether to resize the input images.
        preprocessing_workers: Number of parallel workers for preprocessing slides.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    if modality not in ("he", "mif"):
        raise ValueError(f"'{modality}' is not a valid modality. Choose from 'he' or 'mif'.")
    if label_type not in ("instances", "semantic"):
        raise ValueError(f"'{label_type}' is not a valid label type. Choose from 'instances' or 'semantic'.")

    paths = get_orion_crc_paths(path, split, download, preprocessing_workers=preprocessing_workers)

    if label_type == "semantic":
        output_root = os.path.dirname(os.path.dirname(paths[0]))
        if not os.path.exists(os.path.join(output_root, "semantic_label_mapping.csv")):
            raise RuntimeError(
                "Semantic labels are not available for this ORION-CRC data. "
                "They require per-cell CSV metadata with cell types and nucleus coordinates."
            )

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": modality == "he"}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    # Raw shape is (C, N, H, W), label shape is (N, H, W).
    # Prepend 1 to patch_shape to extract one full tile at a time from the N dimension.
    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key=f"raw/{modality}",
        label_paths=paths,
        label_key=f"labels/nucleus/{label_type}",
        is_seg_dataset=True,
        patch_shape=(1,) + tuple(patch_shape),
        with_channels=True,
        **kwargs
    )


def get_orion_crc_loader(
    path: Union[os.PathLike, str],
    batch_size: int,
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> DataLoader:
    """Get the processed ORION-CRC dataloader for nucleus instance or semantic segmentation.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        batch_size: The batch size for training.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either "train", "val" or "test".
        modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
        label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
        download: Whether to download the data if it is not present.
        resize_inputs: Whether to resize the input images.
        preprocessing_workers: Number of parallel workers for preprocessing slides.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_orion_crc_dataset(
        path, patch_shape, split, modality=modality, label_type=label_type, download=download,
        resize_inputs=resize_inputs, preprocessing_workers=preprocessing_workers, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)

def get_orion_crc_data(path: Union[os.PathLike, str], split: Optional[Literal['train', 'val', 'test']] = None, download: bool = False, preprocessing_workers: int = 8) -> str:

Download and preprocess the processed ORION-CRC tile dataset.

The archive is large, about 127 GB, so only use download=True when you really want to fetch it. Alternatively, download and extract ORIONCRC_dataset_tile_20x.zip manually into path.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • split: The split to preprocess. By default all available splits are preprocessed.
  • download: Whether to download the data if it is not present.
  • preprocessing_workers: Number of parallel workers for preprocessing slides.
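Returns:
  • Path to the preprocessed output folder (path/preprocessed/orion_crc), which contains one subfolder of per-slide HDF5 files for each preprocessed split.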
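
The output location follows from how the module source joins paths. The sketch below only reproduces that path logic; the helper names `preprocessed_root` and `slide_file` and the example folder and slide ID are hypothetical and are not functions or values of the library.

```python
import os


def preprocessed_root(path):
    """Hypothetical helper: the folder that is returned after preprocessing."""
    return os.path.join(path, "preprocessed", "orion_crc")


def slide_file(path, split, slide_id):
    """Hypothetical helper: where the HDF5 file for one slide of a split is written."""
    return os.path.join(preprocessed_root(path), split, f"{slide_id}.h5")


root = preprocessed_root("data/orion")
example_file = slide_file("data/orion", "train", "slide_01")
print(root)
print(example_file)
```

The class ID table `semantic_label_mapping.csv`, when it is written, sits directly in this root folder rather than in a split subfolder.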

def get_orion_crc_paths(path: Union[os.PathLike, str], split: Literal['train', 'val', 'test'], download: bool = False, preprocessing_workers: int = 8) -> List[str]:

Get paths to preprocessed per-slide ORION-CRC HDF5 files.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • split: The split to use. Either "train", "val" or "test".
  • download: Whether to download the data if it is not present.
  • preprocessing_workers: Number of parallel workers for preprocessing slides.
Returns:
  • List of preprocessed per-slide HDF5 filepaths.
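
The lookup step amounts to validating the split and collecting the `.h5` files of that split in sorted order. A minimal sketch on a temporary folder, where `list_slide_files` and the file names are hypothetical stand-ins and no download or preprocessing happens:

```python
import os
import tempfile
from glob import glob

SPLITS = ("train", "val", "test")


def list_slide_files(output_root, split):
    """Hypothetical stand-in: validate the split, then collect its HDF5 files in sorted order."""
    if split not in SPLITS:
        raise ValueError(f"'{split}' is not a valid split choice. Choose from {SPLITS}.")
    paths = sorted(glob(os.path.join(output_root, split, "*.h5")))
    if not paths:
        raise RuntimeError("No preprocessed slides found.")
    return paths


with tempfile.TemporaryDirectory() as output_root:
    # Create two empty placeholder slide files and one unrelated file.
    os.makedirs(os.path.join(output_root, "val"))
    for name in ("slide_b.h5", "slide_a.h5", "notes.txt"):
        open(os.path.join(output_root, "val", name), "w").close()
    found = [os.path.basename(p) for p in list_slide_files(output_root, "val")]

print(found)
```

Sorting makes the order of the returned files reproducible across runs and file systems.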

def get_orion_crc_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int], split: Literal['train', 'val', 'test'], modality: Literal['he', 'mif'] = 'he', label_type: Literal['instances', 'semantic'] = 'instances', download: bool = False, resize_inputs: bool = False, preprocessing_workers: int = 8, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the processed ORION-CRC dataset for nucleus instance or semantic segmentation.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
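  • patch_shape: The 2D patch shape (height, width) to use for training. A leading 1 is prepended internally so that each patch is taken from a single tile.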
  • split: The split to use. Either "train", "val" or "test".
  • modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
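  • label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels. Requesting "semantic" raises a RuntimeError if semantic_label_mapping.csv is missing from the preprocessed output directory.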
  • download: Whether to download the data if it is not present.
  • resize_inputs: Whether to resize the input images to patch_shape.
  • preprocessing_workers: Number of parallel workers for preprocessing slides.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
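
The argument checks and the HDF5 keys used by this function can be tried out in isolation. The following is a minimal sketch that mirrors the validation, key construction and patch-shape handling from the source above without importing torch_em. The helper name resolve_orion_crc_keys and the two module-level constants are hypothetical and not part of the module.

```python
from typing import Dict, Sequence

# Allowed values, copied from the checks in get_orion_crc_dataset.
VALID_MODALITIES = ("he", "mif")
VALID_LABEL_TYPES = ("instances", "semantic")


def resolve_orion_crc_keys(modality: str, label_type: str, patch_shape: Sequence[int]) -> Dict[str, object]:
    """Hypothetical helper: return the HDF5 keys and the patch shape passed to the dataset."""
    if modality not in VALID_MODALITIES:
        raise ValueError(f"'{modality}' is not a valid modality. Choose from 'he' or 'mif'.")
    if label_type not in VALID_LABEL_TYPES:
        raise ValueError(f"'{label_type}' is not a valid label type. Choose from 'instances' or 'semantic'.")

    return {
        "raw_key": f"raw/{modality}",
        "label_key": f"labels/nucleus/{label_type}",
        # A leading 1 selects a single tile along the N dimension.
        "patch_shape": (1,) + tuple(patch_shape),
    }


settings = resolve_orion_crc_keys("mif", "semantic", (512, 512))
print(settings)
# {'raw_key': 'raw/mif', 'label_key': 'labels/nucleus/semantic', 'patch_shape': (1, 512, 512)}
```

An unsupported value such as modality="dapi" raises a ValueError before any file is touched, which matches the order of the checks in the function.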

def get_orion_crc_loader(path: Union[os.PathLike, str], batch_size: int, patch_shape: Tuple[int, int], split: Literal['train', 'val', 'test'], modality: Literal['he', 'mif'] = 'he', label_type: Literal['instances', 'semantic'] = 'instances', download: bool = False, resize_inputs: bool = False, preprocessing_workers: int = 8, **kwargs) -> torch.utils.data.dataloader.DataLoader:
def get_orion_crc_loader(
    path: Union[os.PathLike, str],
    batch_size: int,
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> DataLoader:
    """Get the processed ORION-CRC dataloader for nucleus instance or semantic segmentation.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        batch_size: The batch size for training.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either "train", "val" or "test".
        modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
        label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
        download: Whether to download the data if it is not present.
        resize_inputs: Whether to resize the input images.
        preprocessing_workers: Number of parallel workers for preprocessing slides.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_orion_crc_dataset(
        path, patch_shape, split, modality=modality, label_type=label_type, download=download,
        resize_inputs=resize_inputs, preprocessing_workers=preprocessing_workers, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
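
The first statement of the function body separates the extra keyword arguments into those meant for the dataset and those meant for the DataLoader. The sketch below shows one way such a split can work, assuming it is decided by the parameter names in the target function's signature. It is not torch_em's implementation: split_kwargs_by_signature and make_dataset are hypothetical stand-ins introduced only for illustration.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_kwargs_by_signature(function: Callable, **kwargs: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into those named in the signature of `function` and the rest."""
    parameter_names = set(inspect.signature(function).parameters)
    matched = {name: value for name, value in kwargs.items() if name in parameter_names}
    remaining = {name: value for name, value in kwargs.items() if name not in parameter_names}
    return matched, remaining


def make_dataset(raw_paths, label_paths, n_samples=None, sampler=None):
    """Hypothetical dataset factory; its parameters are NOT the torch_em signature."""
    return None


ds_kwargs, loader_kwargs = split_kwargs_by_signature(
    make_dataset, n_samples=100, shuffle=True, num_workers=4
)
print(ds_kwargs)      # {'n_samples': 100}
print(loader_kwargs)  # {'shuffle': True, 'num_workers': 4}
```

Under this assumption, an argument whose name matches a dataset parameter never reaches the DataLoader, so a misspelled dataset argument would be forwarded to the DataLoader instead and fail there.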

Get the processed ORION-CRC dataloader for nucleus instance or semantic segmentation.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • batch_size: The batch size for training.
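  • patch_shape: The 2D patch shape (height, width) to use for training. A leading 1 is prepended internally so that each patch is taken from a single tile.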
  • split: The split to use. Either "train", "val" or "test".
  • modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
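  • label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels. Requesting "semantic" raises a RuntimeError if semantic_label_mapping.csv is missing from the preprocessed output directory.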
  • download: Whether to download the data if it is not present.
  • resize_inputs: Whether to resize the input images to patch_shape.
  • preprocessing_workers: Number of parallel workers for preprocessing slides.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
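
A typical call might look like the sketch below. The wrapper name build_orion_crc_train_loader, the data folder ./data/orion_crc, the batch size and the patch shape are illustrative assumptions; in particular, the patch shape assumes that the tiles are at least 256 x 256 pixels. The shuffle argument relies on the kwargs description above, which states that extra keyword arguments may be forwarded to the PyTorch DataLoader. The import is deferred into the function so that defining it does not require torch_em or trigger a download.

```python
def build_orion_crc_train_loader(data_root: str = "./data/orion_crc"):
    """Create a training loader for H&E tiles with nucleus instance labels."""
    # Deferred import: torch_em is only needed once the loader is actually built.
    from torch_em.data.datasets.histopathology.orion_crc import get_orion_crc_loader

    return get_orion_crc_loader(
        path=data_root,          # assumed location for the downloaded data
        batch_size=2,
        patch_shape=(256, 256),  # assumes tiles are at least this large
        split="train",
        modality="he",
        label_type="instances",
        download=True,
        shuffle=True,            # assumed to be forwarded to the DataLoader
    )
```

Calling build_orion_crc_train_loader() downloads and preprocesses the data on first use if it is not already present under data_root, so the first call can take considerably longer than later ones.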