torch_em.data.datasets.histopathology.orion_crc
ORION-CRC contains paired H&E and multiplex immunofluorescence images of colorectal cancer tissue.
This loader supports the processed ORION-CRC tile dataset released with MIPHEI-ViT: https://zenodo.org/records/15340874. The source ORION-CRC dataset is available from https://zenodo.org/records/7637988 and is described in https://doi.org/10.1038/s43018-023-00576-1.
The processed release provides H&E tiles, mIF tiles, Cellpose-generated nucleus instance masks, and per-cell CSV metadata with nucleus positions and cell types. Tiles are grouped per slide and stored in per-slide HDF5 files with:
- raw/he: H&E tiles stacked as (3, N, H, W).
- raw/mif: mIF tiles stacked as (C, N, H, W).
- labels/nucleus/instances: Cellpose nucleus instance masks as (N, H, W).
- labels/nucleus/semantic: derived cell-type semantic masks as (N, H, W), if per-cell CSV data is available.
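As an illustration of this layout, the following sketch reads a single tile and its nucleus labels from one per-slide file. The helper name `read_tile` is hypothetical and not part of the loader. It assumes `h5py` is installed, and that the file was written with the dataset keys listed above.

```python
def read_tile(h5_path, index, modality="he", label_type="instances"):
    """Return (raw, labels) for one tile of a preprocessed per-slide HDF5 file.

    Hypothetical helper for illustration only.
    raw has shape (C, H, W) and labels has shape (H, W).
    """
    # Imported lazily, so that the helper can be defined without h5py installed.
    import h5py

    with h5py.File(h5_path, "r") as f:
        # Raw data is stored channel-first as (C, N, H, W): select tile `index` on axis 1.
        raw = f[f"raw/{modality}"][:, index]
        # Labels are stored as (N, H, W): select tile `index` on axis 0.
        # Note: "labels/nucleus/semantic" only exists if per-cell CSV data was available.
        labels = f[f"labels/nucleus/{label_type}"][index]
    return raw, labels
```

Selecting a single index on the `N` axis reads only that tile, which keeps access cheap even for slides with many tiles.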
The nucleus instance labels are generated by Cellpose on the DAPI channel in the processed ORION release, so they
are algorithm-generated instance segmentations, not manual annotations. The semantic labels are not native pixel
annotations either: they are derived by assigning the CSV cell type of each nucleus coordinate to the corresponding
nucleus instance region. Background is 0 and class IDs are stored in semantic_label_mapping.csv in the
preprocessed output directory.
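The derivation described above can be sketched in plain Python. This is a simplified illustration, not the loader's implementation: the function name `derive_semantic_mask` is hypothetical, coordinates are assumed to be already tile-local, and nested lists stand in for image arrays.

```python
def derive_semantic_mask(instances, cells, class_ids):
    """Assign a class ID to every pixel of each nucleus instance hit by a cell coordinate.

    instances: 2D list of ints, 0 is background and each nucleus has a unique positive ID.
    cells: iterable of (x, y, cell_type) with tile-local pixel coordinates (an assumption).
    class_ids: dict mapping cell_type to a positive integer class ID.
    """
    height = len(instances)
    width = len(instances[0]) if height else 0

    # Look up which instance lies under each cell coordinate.
    instance_to_class = {}
    for x, y, cell_type in cells:
        col, row = round(x), round(y)
        if 0 <= row < height and 0 <= col < width:
            instance_id = instances[row][col]
            if instance_id > 0:
                instance_to_class[instance_id] = class_ids[cell_type]

    # Paint the whole instance region with its class; unmatched instances stay background (0).
    return [[instance_to_class.get(value, 0) for value in row] for row in instances]


instances = [
    [0, 1, 1],
    [0, 0, 2],
    [3, 3, 2],
]
cells = [(1.2, 0.0, "tumor"), (2.0, 1.6, "immune"), (5.0, 5.0, "tumor")]
class_ids = {"immune": 1, "tumor": 2}  # hypothetical cell types
semantic = derive_semantic_mask(instances, cells, class_ids)
```

In this toy example instance 3 has no cell coordinate inside it, so it ends up as background in the semantic mask. The same can happen in the real data, which is one reason to treat the semantic labels as derived rather than annotated.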
Please cite the ORION-CRC and MIPHEI-ViT publications if you use this dataset for your research.
"""ORION-CRC contains paired H&E and multiplex immunofluorescence images of colorectal cancer tissue.

This loader supports the processed ORION-CRC tile dataset released with MIPHEI-ViT:
https://zenodo.org/records/15340874. The source ORION-CRC dataset is available from
https://zenodo.org/records/7637988 and is described in https://doi.org/10.1038/s43018-023-00576-1.

The processed release provides H&E tiles, mIF tiles, Cellpose-generated nucleus instance masks, and per-cell CSV
metadata with nucleus positions and cell types. Tiles are grouped per slide and stored in per-slide HDF5 files with:

- `raw/he`: H&E tiles stacked as `(3, N, H, W)`.
- `raw/mif`: mIF tiles stacked as `(C, N, H, W)`.
- `labels/nucleus/instances`: Cellpose nucleus instance masks as `(N, H, W)`.
- `labels/nucleus/semantic`: derived cell-type semantic masks as `(N, H, W)`, if per-cell CSV data is available.

The nucleus instance labels are generated by Cellpose on the DAPI channel in the processed ORION release, so they
are algorithm-generated instance segmentations, not manual annotations. The semantic labels are not native pixel
annotations either: they are derived by assigning the CSV cell type of each nucleus coordinate to the corresponding
nucleus instance region. Background is 0 and class IDs are stored in `semantic_label_mapping.csv` in the
preprocessed output directory.

Please cite the ORION-CRC and MIPHEI-ViT publications if you use this dataset for your research.
"""

import os
import re
from concurrent.futures import ThreadPoolExecutor
from glob import glob
from multiprocessing import Pool
from typing import List, Literal, Optional, Tuple, Union

import imageio.v3 as imageio
import numpy as np
import pandas as pd

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


URL = "https://zenodo.org/api/records/15340874/files/ORIONCRC_dataset_tile_20x.zip/content"
ZIP_NAME = "ORIONCRC_dataset_tile_20x.zip"
SPLITS = ("train", "val", "test")

CELL_TYPE_COLUMNS = ("cell_type", "celltype", "cell_type_pred", "predicted_cell_type", "phenotype", "class", "label")
X_COLUMNS = ("x", "X", "centroid_x", "nucleus_x", "nuclei_x", "center_x")
Y_COLUMNS = ("y", "Y", "centroid_y", "nucleus_y", "nuclei_y", "center_y")
TILE_X_COLUMNS = ("tile_x", "x_start", "xmin", "min_x", "left")
TILE_Y_COLUMNS = ("tile_y", "y_start", "ymin", "min_y", "top")


def _find_file(path, name):
    matches = glob(os.path.join(path, "**", name), recursive=True)
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1:
        return sorted(matches)[0]
    return None


def _resolve_path(root, metadata_path, value):
    value = str(value)
    candidates = [
        os.path.join(os.path.dirname(metadata_path), value),
        os.path.join(root, value),
        value,
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return candidates[0]


def _find_column(columns, candidates):
    lower_to_column = {column.lower(): column for column in columns}
    for candidate in candidates:
        if candidate.lower() in lower_to_column:
            return lower_to_column[candidate.lower()]
    return None


def _get_metadata(root, split):
    metadata_path = _find_file(root, f"{split}_dataframe.csv")
    if metadata_path is None:
        raise RuntimeError(f"Could not find {split}_dataframe.csv in {root}.")
    metadata = pd.read_csv(metadata_path)
    return metadata_path, metadata


def _get_slide_csv_paths(root):
    slide_dataframe_path = _find_file(root, "slide_dataframe.csv")
    if slide_dataframe_path is None:
        return {}
    slide_dataframe = pd.read_csv(slide_dataframe_path)
    slide_name_col = _find_column(slide_dataframe.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None or "nuclei_csv_path" not in slide_dataframe.columns:
        return {}
    return {
        row[slide_name_col]: _resolve_path(root, slide_dataframe_path, row["nuclei_csv_path"])
        for _, row in slide_dataframe.iterrows()
    }


def _get_slide_id_map(root):
    slide_df_path = _find_file(root, "slide_dataframe.csv")
    if slide_df_path is None:
        return {}
    slide_df = pd.read_csv(slide_df_path)
    slide_name_col = _find_column(slide_df.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None or "orion_slide_id" not in slide_df.columns:
        return {}
    return dict(zip(slide_df[slide_name_col], slide_df["orion_slide_id"]))


def _parse_tile_origin(path):
    stem = os.path.splitext(os.path.basename(path))[0]
    numbers = [int(n) for n in re.findall(r"\d+", stem)]
    # Tile filenames follow the pattern *_x_y_z_width_height.*, so origin is at [-5], [-4].
    if len(numbers) >= 5:
        return numbers[-5], numbers[-4]
    return None


def _get_tile_origin(row, image_path):
    x_column = _find_column(row.index, TILE_X_COLUMNS)
    y_column = _find_column(row.index, TILE_Y_COLUMNS)
    if x_column is not None and y_column is not None:
        return int(row[x_column]), int(row[y_column])
    return _parse_tile_origin(image_path)


def _get_cell_type_mapping(csv_tables, cell_type_column):
    cell_types = set()
    for table in csv_tables.values():
        cell_types.update(str(value) for value in table[cell_type_column].dropna().unique())
    return {cell_type: label_id for label_id, cell_type in enumerate(sorted(cell_types), start=1)}


def _read_image(path):
    image = imageio.imread(path)
    if image.ndim == 3:
        image = image.transpose(2, 0, 1)
    return image


def _read_label(path):
    label = imageio.imread(path)
    if label.ndim == 3:
        label = label[..., 0]
    return label


def _collect_cell_tables(root):
    tables = {}
    for slide_name, csv_path in _get_slide_csv_paths(root).items():
        if os.path.exists(csv_path):
            tables[slide_name] = pd.read_csv(csv_path)
    return tables


def _infer_cell_columns(cell_tables):
    if not cell_tables:
        return None
    first_table = next(iter(cell_tables.values()))
    cell_type_column = _find_column(first_table.columns, CELL_TYPE_COLUMNS)
    x_column = _find_column(first_table.columns, X_COLUMNS)
    y_column = _find_column(first_table.columns, Y_COLUMNS)
    if cell_type_column is None or x_column is None or y_column is None:
        return None
    return cell_type_column, x_column, y_column


def _write_cell_type_mapping(output_root, mapping):
    mapping_path = os.path.join(output_root, "semantic_label_mapping.csv")
    if os.path.exists(mapping_path):
        return
    os.makedirs(output_root, exist_ok=True)
    pd.DataFrame(
        [{"label_id": label_id, "cell_type": cell_type} for cell_type, label_id in mapping.items()]
    ).to_csv(mapping_path, index=False)


def _make_semantic_label_from_instances(row, image_path, nuclei, cell_table, cell_type_mapping, cell_columns):
    cell_type_column, x_column, y_column = cell_columns
    origin = _get_tile_origin(row, image_path)
    tile_h, tile_w = nuclei.shape

    valid_mask = cell_table[cell_type_column].notna()
    if not valid_mask.any():
        return np.zeros(nuclei.shape, dtype="uint16")

    cells = cell_table[valid_mask]
    xs = cells[x_column].to_numpy(dtype=float)
    ys = cells[y_column].to_numpy(dtype=float)
    class_ids = np.array([cell_type_mapping[str(v)] for v in cells[cell_type_column]], dtype="uint16")

    if origin is not None:
        lx = np.round(xs - origin[0]).astype(int)
        ly = np.round(ys - origin[1]).astype(int)
        in_bounds = (lx >= 0) & (lx < tile_w) & (ly >= 0) & (ly < tile_h)
        inst_ids = np.zeros(len(xs), dtype=nuclei.dtype)
        inst_ids[in_bounds] = nuclei[ly[in_bounds], lx[in_bounds]]

        needs_fallback = ~in_bounds | (inst_ids == 0)
        if needs_fallback.any():
            lx_raw = np.round(xs).astype(int)
            ly_raw = np.round(ys).astype(int)
            fb = needs_fallback & (lx_raw >= 0) & (lx_raw < tile_w) & (ly_raw >= 0) & (ly_raw < tile_h)
            inst_ids[fb] = nuclei[ly_raw[fb], lx_raw[fb]]
    else:
        lx = np.round(xs).astype(int)
        ly = np.round(ys).astype(int)
        in_bounds = (lx >= 0) & (lx < tile_w) & (ly >= 0) & (ly < tile_h)
        inst_ids = np.zeros(len(xs), dtype=nuclei.dtype)
        inst_ids[in_bounds] = nuclei[ly[in_bounds], lx[in_bounds]]

    hit = inst_ids > 0
    if not hit.any():
        return np.zeros(nuclei.shape, dtype="uint16")

    inst_to_class = np.zeros(int(nuclei.max()) + 1, dtype="uint16")
    inst_to_class[inst_ids[hit]] = class_ids[hit]
    return inst_to_class[nuclei]


def _preprocess_slide(
    root, metadata_path, slide_name, group, output_path, cell_tables, cell_columns, cell_type_mapping
):
    import h5py

    if os.path.exists(output_path):
        return

    has_cell_table = cell_columns is not None and slide_name in cell_tables
    tmp_path = output_path + ".tmp"
    n = 0
    N = len(group)
    tile_h = tile_w = None
    he_ds = mif_ds = inst_ds = sem_ds = None

    with h5py.File(tmp_path, "w") as f:
        f.attrs["slide_name"] = slide_name

        for _, row in group.iterrows():
            he_path = _resolve_path(root, metadata_path, row["image_path"])
            mif_path = _resolve_path(root, metadata_path, row["target_path"])
            nucleus_path = _resolve_path(root, metadata_path, row["nuclei_path"])
            if not (os.path.exists(he_path) and os.path.exists(mif_path) and os.path.exists(nucleus_path)):
                continue

            with ThreadPoolExecutor(max_workers=3) as ex:
                he_f = ex.submit(_read_image, he_path)
                mif_f = ex.submit(_read_image, mif_path)
                nuc_f = ex.submit(_read_label, nucleus_path)
                he = he_f.result()
                mif = mif_f.result()
                nuclei = nuc_f.result()

            if he.ndim == 2:
                he = he[None]

            if tile_h is None:
                tile_h, tile_w = he.shape[-2:]
            elif he.shape[-2:] != (tile_h, tile_w):
                continue

            if mif.ndim == 2:
                mif = mif[None]

            if he_ds is None:
                C_he, C_mif = he.shape[0], mif.shape[0]
                he_ds = f.create_dataset(
                    "raw/he", shape=(C_he, N, tile_h, tile_w),
                    maxshape=(C_he, None, tile_h, tile_w),
                    compression="lzf", chunks=(C_he, 1, tile_h, tile_w), dtype=he.dtype
                )
                mif_ds = f.create_dataset(
                    "raw/mif", shape=(C_mif, N, tile_h, tile_w),
                    maxshape=(C_mif, None, tile_h, tile_w),
                    compression="lzf", chunks=(C_mif, 1, tile_h, tile_w), dtype=mif.dtype
                )
                inst_ds = f.create_dataset(
                    "labels/nucleus/instances", shape=(N, tile_h, tile_w),
                    maxshape=(None, tile_h, tile_w),
                    compression="lzf", chunks=(1, tile_h, tile_w), dtype=nuclei.dtype
                )
                if has_cell_table:
                    sem_ds = f.create_dataset(
                        "labels/nucleus/semantic", shape=(N, tile_h, tile_w),
                        maxshape=(None, tile_h, tile_w),
                        compression="lzf", chunks=(1, tile_h, tile_w), dtype="uint16"
                    )

            he_ds[:, n] = he
            mif_ds[:, n] = mif
            inst_ds[n] = nuclei

            if has_cell_table and sem_ds is not None:
                sem_ds.resize(n + 1, axis=0)
                sem_ds[n] = _make_semantic_label_from_instances(
                    row, he_path, nuclei, cell_tables[slide_name], cell_type_mapping, cell_columns
                )

            n += 1

        if he_ds is not None and n < N:
            he_ds.resize(n, axis=1)
            mif_ds.resize(n, axis=1)
            inst_ds.resize(n, axis=0)
            if sem_ds is not None:
                sem_ds.resize(n, axis=0)

    if n == 0:
        os.remove(tmp_path)
        return

    os.rename(tmp_path, output_path)


def _preprocess_split(root, split, preprocessing_workers=8):
    metadata_path, metadata = _get_metadata(root, split)
    expected_columns = {"image_path", "target_path", "nuclei_path"}
    missing_columns = expected_columns - set(metadata.columns)
    if missing_columns:
        raise RuntimeError(f"Missing columns in {metadata_path}: {sorted(missing_columns)}.")

    output_root = os.path.join(root, "preprocessed", "orion_crc")
    split_root = os.path.join(output_root, split)
    os.makedirs(split_root, exist_ok=True)

    slide_id_map = _get_slide_id_map(root)
    cell_tables = _collect_cell_tables(root)
    cell_columns = _infer_cell_columns(cell_tables)
    cell_type_mapping = None
    if cell_columns is not None:
        cell_type_mapping = _get_cell_type_mapping(cell_tables, cell_columns[0])
        _write_cell_type_mapping(output_root, cell_type_mapping)

    slide_name_col = _find_column(metadata.columns, ["slide_name", "in_slide_name"])
    if slide_name_col is None:
        raise RuntimeError(f"Could not find slide name column in {metadata_path}.")

    tasks = []
    for slide_name, group in metadata.groupby(slide_name_col):
        slide_id = slide_id_map.get(slide_name, slide_name.split(".")[0])
        output_path = os.path.join(split_root, f"{slide_id}.h5")
        tasks.append(
            (root, metadata_path, slide_name, group, output_path, cell_tables, cell_columns, cell_type_mapping)
        )

    n_workers = min(preprocessing_workers, len(tasks))
    if n_workers > 1:
        with Pool(n_workers) as pool:
            pool.starmap(_preprocess_slide, tasks)
    else:
        for args in tasks:
            _preprocess_slide(*args)

    return output_root


def get_orion_crc_data(
    path: Union[os.PathLike, str],
    split: Optional[Literal["train", "val", "test"]] = None,
    download: bool = False,
    preprocessing_workers: int = 8,
) -> str:
    """Download and preprocess the processed ORION-CRC tile dataset.

    The archive is large, about 127 GB, so only use `download=True` when you really want to fetch it.
    Alternatively, download and extract `ORIONCRC_dataset_tile_20x.zip` manually into `path`.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to preprocess. By default all available splits are preprocessed.
        download: Whether to download the data if it is not present.
        preprocessing_workers: Number of parallel workers for preprocessing slides.

    Returns:
        Filepath where preprocessed HDF5 files are stored.
    """
    os.makedirs(path, exist_ok=True)
    if _find_file(path, "train_dataframe.csv") is None:
        zip_path = os.path.join(path, ZIP_NAME)
        if os.path.exists(zip_path):
            util.unzip(zip_path, path, remove=False)
        elif download:
            util.download_source(zip_path, URL, download=download, checksum=None)
            util.unzip(zip_path, path, remove=False)
        else:
            raise RuntimeError(
                f"Could not find the processed ORION-CRC data in {path}. "
                f"Please download {ZIP_NAME} from https://zenodo.org/records/15340874 and extract it there, "
                "or pass `download=True` to download the 127 GB archive."
            )

    splits = SPLITS if split is None else (split,)
    for this_split in splits:
        output_root = _preprocess_split(path, this_split, preprocessing_workers=preprocessing_workers)
    return output_root


def get_orion_crc_paths(
    path: Union[os.PathLike, str],
    split: Literal["train", "val", "test"],
    download: bool = False,
    preprocessing_workers: int = 8,
) -> List[str]:
    """Get paths to preprocessed per-slide ORION-CRC HDF5 files.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to use. Either "train", "val" or "test".
        download: Whether to download the data if it is not present.
        preprocessing_workers: Number of parallel workers for preprocessing slides.

    Returns:
        List of preprocessed per-slide HDF5 filepaths.
    """
    if split not in SPLITS:
        raise ValueError(f"'{split}' is not a valid split choice. Choose from {SPLITS}.")
    output_root = get_orion_crc_data(path, split=split, download=download, preprocessing_workers=preprocessing_workers)
    paths = sorted(glob(os.path.join(output_root, split, "*.h5")))
    if not paths:
        raise RuntimeError("Could not find any preprocessed ORION-CRC slides for the requested settings.")
    return paths


def get_orion_crc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> Dataset:
    """Get the processed ORION-CRC dataset for nucleus instance or semantic segmentation.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either "train", "val" or "test".
        modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
        label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
        download: Whether to download the data if it is not present.
        resize_inputs: Whether to resize the input images.
        preprocessing_workers: Number of parallel workers for preprocessing slides.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    if modality not in ("he", "mif"):
        raise ValueError(f"'{modality}' is not a valid modality. Choose from 'he' or 'mif'.")
    if label_type not in ("instances", "semantic"):
        raise ValueError(f"'{label_type}' is not a valid label type. Choose from 'instances' or 'semantic'.")

    paths = get_orion_crc_paths(path, split, download, preprocessing_workers=preprocessing_workers)

    if label_type == "semantic":
        output_root = os.path.dirname(os.path.dirname(paths[0]))
        if not os.path.exists(os.path.join(output_root, "semantic_label_mapping.csv")):
            raise RuntimeError(
                "Semantic labels are not available for this ORION-CRC data. "
                "They require per-cell CSV metadata with cell types and nucleus coordinates."
            )

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": modality == "he"}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    # Raw shape is (C, N, H, W), label shape is (N, H, W).
    # Prepend 1 to patch_shape to extract one full tile at a time from the N dimension.
    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key=f"raw/{modality}",
        label_paths=paths,
        label_key=f"labels/nucleus/{label_type}",
        is_seg_dataset=True,
        patch_shape=(1,) + tuple(patch_shape),
        with_channels=True,
        **kwargs
    )


def get_orion_crc_loader(
    path: Union[os.PathLike, str],
    batch_size: int,
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> DataLoader:
    """Get the processed ORION-CRC dataloader for nucleus instance or semantic segmentation.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        batch_size: The batch size for training.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either "train", "val" or "test".
        modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
        label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
        download: Whether to download the data if it is not present.
        resize_inputs: Whether to resize the input images.
        preprocessing_workers: Number of parallel workers for preprocessing slides.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_orion_crc_dataset(
        path, patch_shape, split, modality=modality, label_type=label_type, download=download,
        resize_inputs=resize_inputs, preprocessing_workers=preprocessing_workers, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
def get_orion_crc_data(
    path: Union[os.PathLike, str],
    split: Optional[Literal["train", "val", "test"]] = None,
    download: bool = False,
    preprocessing_workers: int = 8,
) -> str:
Download and preprocess the processed ORION-CRC tile dataset.
The archive is large, about 127 GB, so only use download=True when you really want to fetch it.
Alternatively, download and extract ORIONCRC_dataset_tile_20x.zip manually into path.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- split: The split to preprocess. By default all available splits are preprocessed.
- download: Whether to download the data if it is not present.
- preprocessing_workers: Number of parallel workers for preprocessing slides.
Returns:
Filepath where preprocessed HDF5 files are stored.
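Because the archive is so large, it can be useful to check for local data before calling `get_orion_crc_data`. The sketch below mirrors the check visible in the module source, a recursive search for `train_dataframe.csv`. The helper names `has_extracted_orion_crc` and `prepare_orion_crc` are hypothetical, and the call at the end assumes `torch_em` is installed.

```python
import os
from glob import glob

ZIP_NAME = "ORIONCRC_dataset_tile_20x.zip"


def has_extracted_orion_crc(path):
    """Return True if an extracted copy of the processed release is found below `path`."""
    pattern = os.path.join(path, "**", "train_dataframe.csv")
    return len(glob(pattern, recursive=True)) > 0


def prepare_orion_crc(path, split="val"):
    """Preprocess one split from local data only, never triggering the 127 GB download."""
    zip_path = os.path.join(path, ZIP_NAME)
    if not (has_extracted_orion_crc(path) or os.path.exists(zip_path)):
        raise FileNotFoundError(f"Neither extracted data nor {ZIP_NAME} found in {path}.")

    # Imported lazily so the check above also works without torch_em installed.
    from torch_em.data.datasets.histopathology.orion_crc import get_orion_crc_data

    # Returns the folder that contains the per-slide HDF5 files.
    return get_orion_crc_data(path, split=split, download=False)
```

Passing a single `split` limits preprocessing to that split. With `split=None` the function preprocesses all three.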
def get_orion_crc_paths(
    path: Union[os.PathLike, str],
    split: Literal["train", "val", "test"],
    download: bool = False,
    preprocessing_workers: int = 8,
) -> List[str]:
Get paths to preprocessed per-slide ORION-CRC HDF5 files.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- split: The split to use. Either "train", "val" or "test".
- download: Whether to download the data if it is not present.
- preprocessing_workers: Number of parallel workers for preprocessing slides.
Returns:
List of preprocessed per-slide HDF5 filepaths.
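According to the module source, each returned file is named after its slide ID (`<slide_id>.h5`) inside a per-split folder. A minimal sketch for indexing the returned paths by slide ID follows. The helper name `index_slides_by_id` and the example paths are hypothetical.

```python
import os


def index_slides_by_id(paths):
    """Map slide ID (the HDF5 file name without extension) to its file path."""
    return {os.path.splitext(os.path.basename(p))[0]: p for p in paths}


# Hypothetical paths, shaped like the output of get_orion_crc_paths(root, split="val").
example_paths = [
    "/data/orion/preprocessed/orion_crc/val/slide_a.h5",
    "/data/orion/preprocessed/orion_crc/val/slide_b.h5",
]
slides = index_slides_by_id(example_paths)
```

Such an index makes it easy to hold out individual slides or to inspect one slide's tiles without listing the folder again.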
def get_orion_crc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> Dataset:
Get the processed ORION-CRC dataset for nucleus instance or semantic segmentation.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- patch_shape: The patch shape to use for training.
- split: The split to use. Either "train", "val" or "test".
- modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
- label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
- download: Whether to download the data if it is not present.
- resize_inputs: Whether to resize the input images.
- preprocessing_workers: Number of parallel workers for preprocessing slides.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
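When `label_type="semantic"` is used, the integer values in the labels only become meaningful together with `semantic_label_mapping.csv`. The sketch below reads that file with the standard library. It assumes the two columns `label_id` and `cell_type` that the module source writes; the helper name `load_semantic_label_mapping` is hypothetical.

```python
import csv
import os


def load_semantic_label_mapping(preprocessed_root):
    """Read semantic_label_mapping.csv into a dict of {label_id: cell_type}.

    preprocessed_root is the folder returned by get_orion_crc_data,
    i.e. <path>/preprocessed/orion_crc according to the module source.
    Background (0) is not listed in the file.
    """
    mapping_path = os.path.join(preprocessed_root, "semantic_label_mapping.csv")
    with open(mapping_path, newline="") as f:
        return {int(row["label_id"]): row["cell_type"] for row in csv.DictReader(f)}
```

The number of entries plus one (for background) gives the number of output classes a semantic segmentation model would need.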
def get_orion_crc_loader(
    path: Union[os.PathLike, str],
    batch_size: int,
    patch_shape: Tuple[int, int],
    split: Literal["train", "val", "test"],
    modality: Literal["he", "mif"] = "he",
    label_type: Literal["instances", "semantic"] = "instances",
    download: bool = False,
    resize_inputs: bool = False,
    preprocessing_workers: int = 8,
    **kwargs
) -> DataLoader:
Get the processed ORION-CRC dataloader for nucleus instance or semantic segmentation.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- batch_size: The batch size for training.
- patch_shape: The patch shape to use for training.
- split: The split to use. Either "train", "val" or "test".
- modality: The raw modality to load. Either "he" for H&E or "mif" for multiplex immunofluorescence.
- label_type: The nucleus label target. Either "instances" or "semantic" for derived cell-type labels.
- download: Whether to download the data if it is not present.
- resize_inputs: Whether to resize the input images.
- preprocessing_workers: Number of parallel workers for preprocessing slides.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
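A typical call might look like the sketch below, which builds a training and a validation loader for H&E tiles with instance labels. The wrapper name `make_orion_crc_loaders`, the patch shape, the batch size and the worker count are illustrative assumptions, not recommended settings. It assumes `torch_em` is installed and the data is already available locally.

```python
def make_orion_crc_loaders(root, patch_shape=(256, 256), batch_size=4):
    """Build train and val loaders for H&E nucleus instance segmentation (illustrative)."""
    # Imported lazily so this sketch can be defined without torch_em installed.
    from torch_em.data.datasets.histopathology.orion_crc import get_orion_crc_loader

    common = dict(
        patch_shape=patch_shape,   # assumed to fit inside a tile; adjust to the actual tile size
        modality="he",
        label_type="instances",
        download=False,            # avoid fetching the 127 GB archive implicitly
    )
    # shuffle and num_workers are not dataset arguments, so they are passed on to the DataLoader.
    train_loader = get_orion_crc_loader(root, batch_size, split="train", shuffle=True, num_workers=4, **common)
    val_loader = get_orion_crc_loader(root, batch_size, split="val", shuffle=False, num_workers=4, **common)
    return train_loader, val_loader
```

Switching to `modality="mif"` or `label_type="semantic"` only changes the two corresponding entries in `common`.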