torch_em.data.datasets.histopathology.catch
The CATCH dataset contains annotations for tissue segmentation in H&E stained histopathology images of seven canine cutaneous tumor types.
The dataset consists of 350 whole-slide images (50 per tumor type) with 12,424 polygon annotations across 13 tissue classes. The original Aperio SVS images are distributed via IBM Aspera (often firewalled), so this loader instead obtains the images from the Imaging Data Commons (IDC) over HTTPS as DICOM whole-slide images.
This dataset is from the publication https://doi.org/10.1038/s41597-022-01692-w. Please cite it if you use this dataset in your research. It is hosted on TCIA at https://doi.org/10.7937/TCIA.2M93-FX66 (CC BY 4.0) and mirrored on IDC.
NOTE: Downloading requires 'idc-index'. Reading the DICOM images requires 'wsidicom'
and rasterizing the polygons requires 'scikit-image'. The data is large (each slide
is around 0.2-2 GB as DICOM), so the slides are downloaded and converted one tumor
type / slide at a time, and the DICOM source is removed after conversion. By default
the full-resolution (base) level is used; this level can be several gigapixels per
slide, so it is read and written to the HDF5 file in tiles. Pass a higher 'level' to
use a downsampled level instead.
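The tiled read/write mentioned in the note can be pictured with a minimal sketch. The helper name `iter_tiles` is hypothetical and not part of torch_em; the default tile size of 4096 matches the `tile` argument of `_convert_slide` in the module source, and the image size is made up for illustration.

```python
from typing import Iterator, Tuple


def iter_tiles(height: int, width: int, tile: int = 4096) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (y, x, tile_height, tile_width) for a row-major tiling of an image.

    Hypothetical helper for illustration; it is not part of torch_em.
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Edge tiles are clipped so that they never extend past the image.
            yield y, x, min(tile, height - y), min(tile, width - x)


# Made-up slide size; real base levels can be far larger.
tiles = list(iter_tiles(height=10000, width=9000, tile=4096))
covered_pixels = sum(th * tw for _, _, th, tw in tiles)
print(len(tiles), covered_pixels)
```

Only one tile needs to be held in memory at a time, which is why the base level can be converted even when the whole slide would not fit in RAM.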
The annotations are coarse region-level polygons (around 35 per slide), not cell or nucleus annotations. They are sparse: regions outside any polygon are left as 0, so 'labels/semantic' is a sparse region-level map and class 0 should typically be treated as background / ignored during training. Each whole-slide image shows a single tumor type, so within one slide you see that tumor class plus the surrounding normal tissue classes.
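One way to respect the sparse labels during evaluation is to skip every pixel whose target is 0. The following is a minimal sketch in plain Python, not something torch_em provides: the function name `masked_pixel_accuracy` and the tiny example arrays are assumptions made for illustration.

```python
import math
from typing import Sequence

IGNORE_LABEL = 0  # class 0 = unannotated background in 'labels/semantic'


def masked_pixel_accuracy(
    prediction: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    ignore_label: int = IGNORE_LABEL,
) -> float:
    """Pixel accuracy computed only where the target is annotated (hypothetical helper)."""
    correct = total = 0
    for pred_row, target_row in zip(prediction, target):
        for pred, true in zip(pred_row, target_row):
            if true == ignore_label:
                continue  # unannotated pixel: carries no information
            total += 1
            correct += int(pred == true)
    # Without any annotated pixel the accuracy is undefined.
    return correct / total if total else math.nan


# Made-up 2 x 4 patch: 3 = Dermis, 7 = Melanoma, 0 = unannotated.
target = [[0, 0, 3, 3], [0, 7, 7, 3]]
prediction = [[5, 3, 3, 3], [7, 7, 3, 3]]
accuracy = masked_pixel_accuracy(prediction, target)
print(accuracy)
```

The same idea carries over to training, for example by giving an ignore index to the loss function, so that predictions on unannotated regions are neither rewarded nor penalized.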
The 13 classes ('labels/semantic') are grouped into a 'Tissue' supercategory (1-6) and a 'Tumor' supercategory (7-13):
    0: background (unannotated)
    1: Bone
    2: Cartilage
    3: Dermis
    4: Epidermis
    5: Subcutis
    6: Inflamm/Necrosis
    7: Melanoma
    8: Plasmacytoma
    9: Mast Cell Tumor
    10: PNST
    11: SCC
    12: Trichoblastoma
    13: Histiocytoma
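For downstream code it can be convenient to have this mapping as a lookup table. The sketch below copies the ids and names from the list above; the names `CATCH_CLASSES` and `supercategory` are hypothetical and are not exported by the module.

```python
# Class ids and names as listed in the module documentation.
CATCH_CLASSES = {
    0: "background",
    1: "Bone",
    2: "Cartilage",
    3: "Dermis",
    4: "Epidermis",
    5: "Subcutis",
    6: "Inflamm/Necrosis",
    7: "Melanoma",
    8: "Plasmacytoma",
    9: "Mast Cell Tumor",
    10: "PNST",
    11: "SCC",
    12: "Trichoblastoma",
    13: "Histiocytoma",
}


def supercategory(class_id: int) -> str:
    """Map a class id to its supercategory (hypothetical helper)."""
    if class_id == 0:
        return "background"
    if 1 <= class_id <= 6:
        return "Tissue"
    if 7 <= class_id <= 13:
        return "Tumor"
    raise ValueError(f"Unknown CATCH class id: {class_id}")


print(CATCH_CLASSES[9], supercategory(9))
```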
"""The CATCH dataset contains annotations for tissue segmentation in
H&E stained histopathology images of seven canine cutaneous tumor types.

The dataset consists of 350 whole-slide images (50 per tumor type) with 12,424
polygon annotations across 13 tissue classes. The original Aperio SVS images are
distributed via IBM Aspera (often firewalled), so this loader instead obtains the
images from the Imaging Data Commons (IDC) over HTTPS as DICOM whole-slide images.

This dataset is from the publication https://doi.org/10.1038/s41597-022-01692-w.
Please cite it if you use this dataset in your research. It is hosted on TCIA at
https://doi.org/10.7937/TCIA.2M93-FX66 (CC BY 4.0) and mirrored on IDC.

NOTE: Downloading requires 'idc-index'. Reading the DICOM images requires 'wsidicom'
and rasterizing the polygons requires 'scikit-image'. The data is large (each slide
is around 0.2-2 GB as DICOM), so the slides are downloaded and converted one tumor
type / slide at a time, and the DICOM source is removed after conversion. By default
the full-resolution (base) level is used; this level can be several gigapixels per
slide, so it is read and written to the HDF5 file in tiles. Pass a higher `level` to
use a downsampled level instead.

The annotations are coarse region-level polygons (around 35 per slide), not cell or
nucleus annotations. They are sparse: regions outside any polygon are left as 0, so
'labels/semantic' is a sparse region-level map and class 0 should typically be treated
as background / ignored during training. Each whole-slide image shows a single tumor
type, so within one slide you see that tumor class plus the surrounding normal tissue
classes.

The 13 classes ('labels/semantic') are grouped into a 'Tissue' supercategory (1-6) and
a 'Tumor' supercategory (7-13):
    0: background (unannotated)
    1: Bone
    2: Cartilage
    3: Dermis
    4: Epidermis
    5: Subcutis
    6: Inflamm/Necrosis
    7: Melanoma
    8: Plasmacytoma
    9: Mast Cell Tumor
    10: PNST
    11: SCC
    12: Trichoblastoma
    13: Histiocytoma
"""

import os
import shutil
import zipfile
from glob import glob
from typing import List, Optional, Tuple, Union

import numpy as np
from tqdm import tqdm

import torch

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


COCO_URL = "https://www.cancerimagingarchive.net/wp-content/uploads/CATCH-json.zip"
IDC_COLLECTION = "catch"
TUMOR_TYPES = ["Histiocytoma", "MCT", "Melanoma", "PNST", "Plasmacytoma", "SCC", "Trichoblastoma"]


def _load_coco(path):
    import json

    coco_path = os.path.join(path, "CATCH.json")
    coco = json.load(open(coco_path))
    annotations = {}
    for ann in coco["annotations"]:
        annotations.setdefault(ann["image_id"], []).append(ann)
    images = {im["file_name"]: (im["id"], im["width"], im["height"]) for im in coco["images"]}
    return images, annotations


def _rasterize_into(label_dataset, annotations, downsample):
    from skimage.draw import polygon as draw_polygon

    height, width = label_dataset.shape
    # Larger regions are drawn first so that smaller annotations stay on top. Each polygon is
    # rasterized within its own bounding box to avoid allocating a full-resolution label in memory.
    for ann in sorted(annotations, key=lambda a: a.get("area", 0), reverse=True):
        segments = ann["segmentation"]
        if segments and isinstance(segments[0], (int, float)):
            segments = [segments]
        for segment in segments:
            xs = np.asarray(segment[0::2], dtype="float64") / downsample
            ys = np.asarray(segment[1::2], dtype="float64") / downsample
            x0, x1 = max(int(np.floor(xs.min())), 0), min(int(np.ceil(xs.max())) + 1, width)
            y0, y1 = max(int(np.floor(ys.min())), 0), min(int(np.ceil(ys.max())) + 1, height)
            if x1 <= x0 or y1 <= y0:
                continue
            rr, cc = draw_polygon(ys - y0, xs - x0, shape=(y1 - y0, x1 - x0))
            block = label_dataset[y0:y1, x0:x1]
            block[rr, cc] = ann["category_id"]
            label_dataset[y0:y1, x0:x1] = block


def _convert_slide(series_uid, file_name, images, annotations, level, output_path, tmp_dir, tile=4096):
    import h5py
    from idc_index import IDCClient
    from wsidicom import WsiDicom

    # Download into a per-series folder, since several slides of the same patient share a PatientID.
    slide_dir = os.path.join(tmp_dir, series_uid)
    if not os.path.exists(slide_dir):
        IDCClient().download_dicom_series(
            seriesInstanceUID=series_uid, downloadDir=tmp_dir, dirTemplate="%SeriesInstanceUID"
        )

    slide = WsiDicom.open(slide_dir)
    try:
        base_width = slide.size.width
        # By default the highest resolution (base) level is used.
        wsi_level = max(slide.levels, key=lambda lv: lv.size.width) if level is None \
            else next(lv for lv in slide.levels if lv.level == level)
        width, height = wsi_level.size.width, wsi_level.size.height
        downsample = base_width / width

        image_id = images[file_name][0]
        tmp_path = output_path + ".tmp"
        with h5py.File(tmp_path, "w") as f:
            raw = f.create_dataset(
                "raw", shape=(3, height, width), dtype="uint8", compression="gzip", chunks=(1, 512, 512)
            )
            label = f.create_dataset(
                "labels/semantic", shape=(height, width), dtype="uint8", compression="gzip", chunks=(512, 512)
            )
            # The base level can be several gigapixels, so the image is read and written in tiles.
            for y in range(0, height, tile):
                for x in range(0, width, tile):
                    th, tw = min(tile, height - y), min(tile, width - x)
                    region = np.array(slide.read_region((x, y), wsi_level.level, (tw, th)))[..., :3]
                    raw[:, y:y + th, x:x + tw] = region.transpose(2, 0, 1)
            _rasterize_into(label, annotations.get(image_id, []), downsample)
    finally:
        slide.close()

    os.replace(tmp_path, output_path)
    shutil.rmtree(slide_dir, ignore_errors=True)


def get_catch_data(
    path: Union[os.PathLike, str],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
) -> str:
    """Download and preprocess the CATCH data.

    Args:
        path: Filepath to a folder where the data will be saved.
        tumor_types: The tumor types to use. By default all seven tumor types are used.
        level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
        download: Whether to download the data if it is not present.

    Returns:
        Filepath to the folder where the preprocessed data is stored.
    """
    if tumor_types is None:
        tumor_types = TUMOR_TYPES
    if isinstance(tumor_types, str):
        tumor_types = [tumor_types]
    for tumor_type in tumor_types:
        if tumor_type not in TUMOR_TYPES:
            raise ValueError(f"'{tumor_type}' is not a valid tumor type. Choose from {TUMOR_TYPES}.")

    preprocessed_dir = os.path.join(path, "preprocessed")
    tmp_dir = os.path.join(path, "dicom")
    os.makedirs(preprocessed_dir, exist_ok=True)

    coco_path = os.path.join(path, "CATCH.json")
    if not os.path.exists(coco_path):
        zip_path = os.path.join(path, "CATCH-json.zip")
        util.download_source(path=zip_path, url=COCO_URL, download=download, checksum=None)
        with zipfile.ZipFile(zip_path, "r") as f:
            f.extractall(path)

    images, annotations = _load_coco(path)

    try:
        from idc_index import IDCClient
    except ImportError:
        raise ImportError("'idc-index' is required to download CATCH. Install it via conda/pip.")

    # The slide microscopy index provides the 'ContainerIdentifier', which matches the COCO file name
    # ('<ContainerIdentifier>.svs') and is unique per slide (unlike PatientID, where one patient may have
    # several slides).
    client = IDCClient()
    client.fetch_index("sm_index")
    catch = client.index[client.index["collection_id"] == IDC_COLLECTION]
    catch = catch.merge(client.sm_index[["SeriesInstanceUID", "ContainerIdentifier"]], on="SeriesInstanceUID")

    to_convert = catch[catch["ContainerIdentifier"].str.startswith(tuple(tumor_types))]
    for _, row in tqdm(list(to_convert.iterrows()), desc="Converting CATCH slides"):
        container_id = row["ContainerIdentifier"]
        output_path = os.path.join(preprocessed_dir, f"{container_id}.h5")
        if os.path.exists(output_path):
            continue
        if not download:
            raise RuntimeError(f"Cannot find the data at {path}, but download was set to False.")
        _convert_slide(
            row["SeriesInstanceUID"], f"{container_id}.svs", images, annotations, level, output_path, tmp_dir
        )

    return preprocessed_dir


def get_catch_paths(
    path: Union[os.PathLike, str],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to the CATCH data.

    Args:
        path: Filepath to a folder where the data will be saved.
        tumor_types: The tumor types to use. By default all seven tumor types are used.
        level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
        download: Whether to download the data if it is not present.

    Returns:
        List of filepaths to the preprocessed HDF5 files.
    """
    preprocessed_dir = get_catch_data(path, tumor_types, level, download)
    volume_paths = sorted(glob(os.path.join(preprocessed_dir, "*.h5")))
    if not volume_paths:
        raise RuntimeError("Could not find any preprocessed CATCH slides for the requested settings.")

    return volume_paths


def get_catch_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> Dataset:
    """Get the CATCH dataset for tissue segmentation in canine cutaneous tumor histopathology images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        tumor_types: The tumor types to use. By default all seven tumor types are used.
        level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    volume_paths = get_catch_paths(path, tumor_types, level, download)

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": True}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    return torch_em.default_segmentation_dataset(
        raw_paths=volume_paths,
        raw_key="raw",
        label_paths=volume_paths,
        label_key="labels/semantic",
        patch_shape=patch_shape,
        label_dtype=label_dtype,
        is_seg_dataset=True,
        with_channels=True,
        ndim=2,
        **kwargs
    )


def get_catch_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    batch_size: int,
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> DataLoader:
    """Get the CATCH dataloader for tissue segmentation in canine cutaneous tumor histopathology images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        batch_size: The batch size for training.
        tumor_types: The tumor types to use. By default all seven tumor types are used.
        level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_catch_dataset(
        path=path, patch_shape=patch_shape, tumor_types=tumor_types, level=level, download=download,
        label_dtype=label_dtype, resize_inputs=resize_inputs, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
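The coordinate handling in `_rasterize_into` can be looked at in isolation. The sketch below reproduces only the scaling and bounding-box clipping step with the standard library; the helper name `scaled_bounding_box`, the polygon and the slide sizes are made up for illustration, and the actual polygon filling (done with scikit-image in the module) is left out.

```python
import math
from typing import Optional, Sequence, Tuple


def scaled_bounding_box(
    segment: Sequence[float], downsample: float, width: int, height: int
) -> Optional[Tuple[int, int, int, int]]:
    """Return the clipped box (x0, y0, x1, y1) of a flat COCO polygon at a pyramid level.

    Hypothetical helper; 'segment' is [x0, y0, x1, y1, ...] in base-level pixels.
    """
    xs = [value / downsample for value in segment[0::2]]
    ys = [value / downsample for value in segment[1::2]]
    x0, x1 = max(math.floor(min(xs)), 0), min(math.ceil(max(xs)) + 1, width)
    y0, y1 = max(math.floor(min(ys)), 0), min(math.ceil(max(ys)) + 1, height)
    if x1 <= x0 or y1 <= y0:
        return None  # the polygon lies entirely outside of the image
    return x0, y0, x1, y1


# Made-up numbers: base level 8000 px wide, chosen level 2000 px wide.
downsample = 8000 / 2000
inside = scaled_bounding_box([100, 200, 500, 200, 500, 900, 100, 900], downsample, width=2000, height=1500)
outside = scaled_bounding_box([9000, 100, 9500, 100, 9500, 400], downsample, width=2000, height=1500)
print(inside, outside)
```

Working on one bounding box at a time keeps the memory footprint proportional to the size of a single annotation rather than to the size of the slide.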
def get_catch_data(
    path: Union[os.PathLike, str],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
) -> str:
Download and preprocess the CATCH data.
Arguments:
- path: Filepath to a folder where the data will be saved.
- tumor_types: The tumor types to use. By default all seven tumor types are used.
- level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
- download: Whether to download the data if it is not present.
Returns:
Filepath to the folder where the preprocessed data is stored.
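The accepted forms of `tumor_types` can be summarized in a small stand-alone sketch. It mirrors the input handling visible in the module source; the function name `normalize_tumor_types` is hypothetical and does not exist in torch_em.

```python
from typing import List, Optional, Union

# Copied from the module source. Note that mast cell tumors are selected with "MCT".
TUMOR_TYPES = ["Histiocytoma", "MCT", "Melanoma", "PNST", "Plasmacytoma", "SCC", "Trichoblastoma"]


def normalize_tumor_types(tumor_types: Optional[Union[str, List[str]]] = None) -> List[str]:
    """Turn None, a single name or a list of names into a validated list (hypothetical helper)."""
    if tumor_types is None:
        return list(TUMOR_TYPES)  # default: all seven tumor types
    if isinstance(tumor_types, str):
        tumor_types = [tumor_types]
    for tumor_type in tumor_types:
        if tumor_type not in TUMOR_TYPES:
            raise ValueError(f"'{tumor_type}' is not a valid tumor type. Choose from {TUMOR_TYPES}.")
    return list(tumor_types)


print(normalize_tumor_types("MCT"))
print(len(normalize_tumor_types()))
```

The names are case-sensitive, and the class label "Mast Cell Tumor" from the class list is not a valid value here.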
def get_catch_paths(
    path: Union[os.PathLike, str],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
) -> List[str]:
Get paths to the CATCH data.
Arguments:
- path: Filepath to a folder where the data will be saved.
- tumor_types: The tumor types to use. By default all seven tumor types are used.
- level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
- download: Whether to download the data if it is not present.
Returns:
List of filepaths to the preprocessed HDF5 files.
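Judging from the module source, the returned list is built from every '*.h5' file in the 'preprocessed' folder, so it may also contain slides that were converted in an earlier call with other tumor types. A minimal sketch for narrowing the list afterwards is shown below. The helper `filter_by_tumor_type` is hypothetical, and the file names are invented; the only assumption taken from the source is that each file is named '<ContainerIdentifier>.h5' and that the identifier starts with the tumor type.

```python
import os
from typing import List, Sequence


def filter_by_tumor_type(paths: Sequence[str], tumor_types: Sequence[str]) -> List[str]:
    """Keep only the files whose name starts with one of the tumor types (hypothetical helper)."""
    prefixes = tuple(tumor_types)
    return [p for p in paths if os.path.basename(p).startswith(prefixes)]


# Invented file names, for illustration only.
all_paths = [
    "data/preprocessed/MCT_01_1.h5",
    "data/preprocessed/Melanoma_07_1.h5",
    "data/preprocessed/SCC_12_1.h5",
]
selected = filter_by_tumor_type(all_paths, ["Melanoma", "SCC"])
print(selected)
```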
def get_catch_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> Dataset:
Get the CATCH dataset for tissue segmentation in canine cutaneous tumor histopathology images.
Arguments:
- path: Filepath to a folder where the data will be saved.
- patch_shape: The patch shape to use for training.
- tumor_types: The tumor types to use. By default all seven tumor types are used.
- level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
- download: Whether to download the data if it is not present.
- label_dtype: The datatype of the labels.
- resize_inputs: Whether to resize the input images.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
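A possible way to call this function is sketched below. The wrapper `build_catch_dataset`, the folder name and the patch shape are assumptions chosen for illustration; only the import path and the keyword arguments come from this page. The import is placed inside the function so that defining the wrapper does not require torch_em, and nothing is downloaded until it is called.

```python
from typing import Optional


def build_catch_dataset(root: str, level: Optional[int] = None):
    """Create a CATCH dataset restricted to melanoma slides (hypothetical wrapper)."""
    # Imported lazily: torch_em, idc-index, wsidicom and scikit-image are only needed at call time.
    from torch_em.data.datasets.histopathology.catch import get_catch_dataset

    return get_catch_dataset(
        path=root,
        patch_shape=(512, 512),   # assumed patch size, adjust to the model
        tumor_types="Melanoma",   # a single name or a list of names
        level=level,              # None selects the full-resolution base level
        download=True,            # fetches the slides from IDC on first use
    )
```

Calling `build_catch_dataset("./catch")` would download and convert the melanoma slides on the first run, which can take a long time and a lot of disk space at the base level. A downsampled `level` reduces both, provided that level exists in the slide pyramid.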
def get_catch_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    batch_size: int,
    tumor_types: Optional[Union[str, List[str]]] = None,
    level: Optional[int] = None,
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> DataLoader:
Get the CATCH dataloader for tissue segmentation in canine cutaneous tumor histopathology images.
Arguments:
- path: Filepath to a folder where the data will be saved.
- patch_shape: The patch shape to use for training.
- batch_size: The batch size for training.
- tumor_types: The tumor types to use. By default all seven tumor types are used.
- level: The DICOM pyramid level to read. By default the highest resolution (base) level is used.
- download: Whether to download the data if it is not present.
- label_dtype: The datatype of the labels.
- resize_inputs: Whether to resize the input images.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
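The shared `kwargs` are divided between the dataset and the DataLoader by `util.split_kwargs`, whose implementation is not shown on this page. The sketch below illustrates one plausible way to do such a split with `inspect.signature`. It is an assumption for illustration, not the torch_em implementation, and `split_by_signature` and `make_dataset` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_by_signature(func: Callable, **kwargs: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into those named in func's signature and the rest (hypothetical helper)."""
    accepted = set(inspect.signature(func).parameters)
    matched = {key: value for key, value in kwargs.items() if key in accepted}
    remaining = {key: value for key, value in kwargs.items() if key not in accepted}
    return matched, remaining


def make_dataset(raw_paths, patch_shape, n_samples=None):
    """Stand-in for a dataset factory; not a torch_em function."""
    return None


ds_kwargs, loader_kwargs = split_by_signature(make_dataset, n_samples=100, shuffle=True, num_workers=4)
print(ds_kwargs, loader_kwargs)
```

With a split of this kind, a call such as `get_catch_loader(..., n_samples=100, shuffle=True)` can route each keyword to the component that understands it.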