torch_em.data.datasets.histopathology.tissueseg
This dataset contains annotations for tissue vs. background segmentation in whole-slide histopathology images of multiple organs and stains.
The dataset is a representative sample from the publication https://doi.org/10.7717/peerj.8242 ("Resolution-agnostic tissue segmentation in whole-slide histopathology images with convolutional neural networks") and is hosted on Zenodo at https://doi.org/10.5281/zenodo.3375528. Please cite it if you use this dataset in your research.
The data consists of ten whole-slide images. The five 'development' samples (breast, breast lymph node and three tongue stains) provide annotations at a pixel spacing of 0.5 micrometer, with both a binary (tissue vs. background) and a six-class label. The five 'dissimilar' samples (brain, cornea, kidney, skin and uterus) provide only the binary annotation at a pixel spacing of 2.0 micrometer.
The label values follow the scheme used by the authors. The unannotated regions (typically the glass background outside the region of interest) are labeled as 0. The binary annotations use 3 for non-tissue and 6 for tissue. The six-class annotations additionally use 1 (edge artifacts), 2 (inner artifacts), 4 (external margin) and 5 (internal margin), with 3 for background and 6 for tissue.
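The stored label values are not contiguous, so most loss functions need them remapped first. The following is a minimal sketch of one way to do this for the binary annotation, assuming NumPy is available. The function name `remap_binary_labels` and the ignore value `-1` are illustrative choices and not part of this module.

```python
import numpy as np

# Hypothetical value for unannotated pixels; use whatever your loss function ignores.
IGNORE_INDEX = -1


def remap_binary_labels(labels, ignore_index=IGNORE_INDEX):
    """Map the stored binary label values to contiguous class ids.

    0 (unannotated) -> ignore_index, 3 (non-tissue) -> 0, 6 (tissue) -> 1.
    Any other value in the range 0..6 is also mapped to ignore_index.
    """
    # Lookup table indexed by the stored label value (0..6).
    lookup = np.full(7, ignore_index, dtype=np.int64)
    lookup[3] = 0
    lookup[6] = 1
    return lookup[np.asarray(labels)]


patch = np.array([[0, 3], [6, 6]], dtype=np.uint8)
remapped = remap_binary_labels(patch)
```

A function like this could presumably be supplied as a label transform through the extra keyword arguments of the dataset functions documented below, but that usage is an assumption and should be checked against the `torch_em` documentation.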
NOTE: The whole-slide images and masks are multi-resolution pyramidal TIFFs of several gigabytes each. On first use, each requested sample is converted into a chunked HDF5 file at the resolution matching its annotation, which takes some time and additional disk space.
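To get a rough feeling for the cost of the conversion, the sketch below estimates the uncompressed size of one converted sample and the number of tiles that are copied. It assumes the layout used by the conversion routine in this module (a `uint8` RGB array of shape `(3, height, width)`, one `uint8` array per mask, and a tile size of 4096). The slide dimensions are hypothetical, and the actual file will be smaller because the datasets are gzip-compressed.

```python
import math


def estimate_conversion(height, width, n_masks, tile=4096):
    """Return (uncompressed_bytes, n_tiles) for one converted sample."""
    raw_bytes = 3 * height * width         # uint8 RGB image, one byte per channel
    mask_bytes = n_masks * height * width  # one uint8 array per annotation
    n_tiles = math.ceil(height / tile) * math.ceil(width / tile)
    return raw_bytes + mask_bytes, n_tiles


# Hypothetical dimensions for a development sample with binary and six-class masks.
total_bytes, n_tiles = estimate_conversion(height=100_000, width=80_000, n_masks=2)
print(f"{total_bytes / 1e9:.1f} GB before compression, copied in {n_tiles} tiles")
```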
import os
from pathlib import Path
from typing import List, Literal, Optional, Tuple, Union

from tqdm import tqdm

import torch

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


BASE_URL = "https://zenodo.org/api/records/3375528/files"

SAMPLES = {
    "breast": "breast_hne_00",
    "breast_lymph_node": "breast_lymph_node_hne_00",
    "tongue_hne": "tongue_hne_00",
    "tongue_ki67": "tongue_ki67_00",
    "tongue_ae1ae3": "tongue_ae1ae3_00",
    "brain": "brain_alcianblue_00",
    "cornea": "cornea_grocott_00",
    "kidney": "kidney_cab_00",
    "skin": "skin_perls_00",
    "uterus": "uterus_vonkossa_00",
}

# The development samples are densely annotated at 0.5 micrometer with binary and six-class labels.
DEVELOPMENT_SAMPLES = ["breast", "breast_lymph_node", "tongue_hne", "tongue_ki67", "tongue_ae1ae3"]

# The dissimilar samples only have a binary annotation at 2.0 micrometer.
DISSIMILAR_SAMPLES = ["brain", "cornea", "kidney", "skin", "uterus"]


def _mask_filename(sample, annotations):
    stem = SAMPLES[sample]
    if sample in DEVELOPMENT_SAMPLES:
        return f"{stem}_mask_cl2_sp0.5.tif" if annotations == "binary" else f"{stem}_mask_cl6_sp0.5.tif"
    else:
        return f"{stem}_mask_cl2_sp2.0.tif"


def _resolve_samples(samples, annotations):
    if annotations not in ("binary", "semantic"):
        raise ValueError(f"'{annotations}' is not a valid annotation choice. Use 'binary' or 'semantic'.")

    if samples is None:
        samples = DEVELOPMENT_SAMPLES if annotations == "semantic" else list(SAMPLES.keys())

    if isinstance(samples, str):
        samples = [samples]

    for sample in samples:
        if sample not in SAMPLES:
            raise ValueError(f"'{sample}' is not a valid sample. Choose from {list(SAMPLES.keys())}.")
        if annotations == "semantic" and sample not in DEVELOPMENT_SAMPLES:
            raise ValueError(f"The sample '{sample}' does not have semantic annotations. Use annotations='binary'.")

    return samples


def _download_file(path, filename, download):
    out_path = os.path.join(path, filename)
    if os.path.exists(out_path):
        return out_path

    # The whole-slide images are several gigabytes each, so we do not verify checksums.
    url = f"{BASE_URL}/{filename}/content"
    util.download_source(path=out_path, url=url, download=download, checksum=None)
    return out_path


def _open_level(series, level_index):
    import zarr

    # The pyramidal TIFFs are natively tiled, so a zarr view reads only the requested tiles lazily.
    # Multi-level series open as a group keyed by the level index; single-level series open as an array.
    array = zarr.open(series.aszarr(), mode="r")
    return array if hasattr(array, "shape") else array[str(level_index)]


def _convert_sample(wsi_path, mask_paths, output_path, tile=4096):
    import h5py
    import tifffile

    # All annotations of a sample share the same resolution, so the raw level is matched to the first mask.
    first_mask = tifffile.TiffFile(list(mask_paths.values())[0])
    height, width = first_mask.series[0].levels[0].shape

    image_series = tifffile.TiffFile(wsi_path).series[0]
    level_index = next((i for i, lv in enumerate(image_series.levels) if lv.shape[:2] == (height, width)), None)
    if level_index is None:
        raise RuntimeError(
            f"Could not find a resolution level in '{wsi_path}' matching the mask shape ({height}, {width})."
        )

    image = _open_level(image_series, level_index)
    masks = {ann: _open_level(tifffile.TiffFile(p).series[0], 0) for ann, p in mask_paths.items()}
    for ann, mask in masks.items():
        if mask.shape != (height, width):
            raise RuntimeError(f"Mask '{ann}' shape {mask.shape} does not match the raw shape ({height}, {width}).")

    tmp_path = output_path + ".tmp"
    with h5py.File(tmp_path, "w") as f:
        raw = f.create_dataset(
            "images/raw", shape=(3, height, width), dtype="uint8", compression="gzip", chunks=(1, 512, 512)
        )
        mask_datasets = {}
        for ann in masks:
            mask_datasets[ann] = f.create_dataset(
                f"labels/{ann}", shape=(height, width), dtype="uint8", compression="gzip", chunks=(512, 512)
            )
        for y in tqdm(range(0, height, tile), desc=f"Converting {Path(wsi_path).stem}"):
            for x in range(0, width, tile):
                th, tw = min(tile, height - y), min(tile, width - x)
                raw[:, y:y + th, x:x + tw] = image[y:y + th, x:x + tw].transpose(2, 0, 1)
                for ann in masks:
                    mask_datasets[ann][y:y + th, x:x + tw] = masks[ann][y:y + th, x:x + tw]

    os.replace(tmp_path, output_path)
def get_tissueseg_data(
    path: Union[os.PathLike, str],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
) -> str:
    """Download and preprocess the tissue segmentation data.

    Args:
        path: Filepath to a folder where the data will be saved.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.

    Returns:
        Filepath to the folder where the preprocessed data is stored.
    """
    samples = _resolve_samples(samples, annotations)

    raw_dir = os.path.join(path, "raw")
    preprocessed_dir = os.path.join(path, "preprocessed")
    os.makedirs(raw_dir, exist_ok=True)
    os.makedirs(preprocessed_dir, exist_ok=True)

    for sample in samples:
        output_path = os.path.join(preprocessed_dir, f"{sample}.h5")
        if os.path.exists(output_path):
            continue

        wsi_path = _download_file(raw_dir, f"{SAMPLES[sample]}.tif", download)

        # Store all available masks for the sample so that 'binary' and 'semantic' share one file.
        mask_choices = ["binary", "semantic"] if sample in DEVELOPMENT_SAMPLES else ["binary"]
        mask_paths = {
            choice: _download_file(raw_dir, _mask_filename(sample, choice), download) for choice in mask_choices
        }

        _convert_sample(wsi_path, mask_paths, output_path)

    return preprocessed_dir
Download and preprocess the tissue segmentation data.
Arguments:
- path: Filepath to a folder where the data will be saved.
- samples: The samples to use. By default all samples valid for the annotation choice are used.
- annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
- download: Whether to download the data if it is not present.
Returns:
Filepath to the folder where the preprocessed data is stored.
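The folder layout produced by this function can be summarized with a small helper. The sketch below mirrors the path and filename conventions from the source above for two of the ten samples. The helper name `expected_files` is hypothetical and it does not touch the file system or the network.

```python
import os

# Subset of the module's SAMPLES mapping (sample name -> file stem on Zenodo).
SAMPLES = {"breast": "breast_hne_00", "brain": "brain_alcianblue_00"}
DEVELOPMENT_SAMPLES = ["breast"]


def expected_files(path, sample):
    """List the files that are downloaded and written for one sample."""
    stem = SAMPLES[sample]
    if sample in DEVELOPMENT_SAMPLES:
        # Development samples have a binary and a six-class mask at 0.5 micrometer.
        masks = [f"{stem}_mask_cl2_sp0.5.tif", f"{stem}_mask_cl6_sp0.5.tif"]
    else:
        # Dissimilar samples only have a binary mask at 2.0 micrometer.
        masks = [f"{stem}_mask_cl2_sp2.0.tif"]
    raw_dir = os.path.join(path, "raw")
    return {
        "downloads": [os.path.join(raw_dir, name) for name in [f"{stem}.tif"] + masks],
        "preprocessed": os.path.join(path, "preprocessed", f"{sample}.h5"),
    }


layout = expected_files("data/tissueseg", "brain")
```

Because a sample is skipped when its HDF5 file already exists, deleting that file from the `preprocessed` folder is what triggers a new conversion.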
def get_tissueseg_paths(
    path: Union[os.PathLike, str],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
) -> List[str]:
    """Get paths to the tissue segmentation data.

    Args:
        path: Filepath to a folder where the data will be saved.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.

    Returns:
        List of filepaths to the preprocessed HDF5 files.
    """
    samples = _resolve_samples(samples, annotations)
    preprocessed_dir = get_tissueseg_data(path, samples, annotations, download)
    volume_paths = [os.path.join(preprocessed_dir, f"{sample}.h5") for sample in samples]

    missing = [p for p in volume_paths if not os.path.exists(p)]
    if missing:
        raise RuntimeError(f"Could not find the preprocessed data at {missing}.")

    return volume_paths
Get paths to the tissue segmentation data.
Arguments:
- path: Filepath to a folder where the data will be saved.
- samples: The samples to use. By default all samples valid for the annotation choice are used.
- annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
- download: Whether to download the data if it is not present.
Returns:
List of filepaths to the preprocessed HDF5 files.
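Each returned file can be inspected directly. The sketch below assumes `h5py` is installed and uses the dataset keys written by the conversion step (`images/raw` and `labels/<annotations>`). The helper names `label_key` and `describe_sample` are illustrative and not part of the module.

```python
def label_key(annotations):
    """Return the HDF5 key under which an annotation type is stored."""
    return f"labels/{annotations}"


def describe_sample(h5_path, annotations="binary"):
    """Return the shapes of the image and of the requested labels in one preprocessed file."""
    # Imported lazily so that label_key can be used without h5py installed.
    import h5py

    with h5py.File(h5_path, "r") as f:
        raw_shape = f["images/raw"].shape          # (3, height, width)
        label_shape = f[label_key(annotations)].shape  # (height, width)
    return raw_shape, label_shape
```

Opening the file this way only reads metadata; pixel data is loaded when a slice of the dataset is requested, so this check stays cheap even for large slides.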
def get_tissueseg_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> Dataset:
    """Get the tissue segmentation dataset for tissue vs. background segmentation in whole-slide images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    volume_paths = get_tissueseg_paths(path, samples, annotations, download)

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": True}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    return torch_em.default_segmentation_dataset(
        raw_paths=volume_paths,
        raw_key="images/raw",
        label_paths=volume_paths,
        label_key=f"labels/{annotations}",
        patch_shape=patch_shape,
        label_dtype=label_dtype,
        is_seg_dataset=True,
        with_channels=True,
        ndim=2,
        **kwargs
    )
Get the tissue segmentation dataset for tissue vs. background segmentation in whole-slide images.
Arguments:
- path: Filepath to a folder where the data will be saved.
- patch_shape: The patch shape to use for training.
- samples: The samples to use. By default all samples valid for the annotation choice are used.
- annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
- download: Whether to download the data if it is not present.
- label_dtype: The datatype of the labels.
- resize_inputs: Whether to resize the input images.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
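A possible way to call this function is sketched below. The data folder, the patch shape of 512 x 512 and the chosen samples are example values, and the helper names `dataset_arguments` and `make_dataset` are hypothetical. The import path follows the module name at the top of this page.

```python
def dataset_arguments(path, annotations="semantic"):
    """Collect keyword arguments for get_tissueseg_dataset."""
    # Only the five development samples carry six-class ('semantic') labels,
    # so dissimilar samples such as 'brain' are only valid with 'binary'.
    samples = ["breast", "tongue_hne"] if annotations == "semantic" else ["breast", "brain"]
    return {
        "path": path,
        "patch_shape": (512, 512),
        "samples": samples,
        "annotations": annotations,
        "download": True,
    }


def make_dataset(path, annotations="semantic"):
    """Create the dataset; this downloads and converts the samples on first use."""
    # Imported lazily so the argument helper above works without torch_em installed.
    from torch_em.data.datasets.histopathology.tissueseg import get_tissueseg_dataset

    return get_tissueseg_dataset(**dataset_arguments(path, annotations))
```

Requesting `annotations="semantic"` together with a dissimilar sample raises a `ValueError`, as shown in the sample validation in the module source.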
def get_tissueseg_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    batch_size: int,
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> DataLoader:
    """Get the tissue segmentation dataloader for tissue vs. background segmentation in whole-slide images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        batch_size: The batch size for training.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_tissueseg_dataset(
        path=path, patch_shape=patch_shape, samples=samples, annotations=annotations, download=download,
        label_dtype=label_dtype, resize_inputs=resize_inputs, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
Get the tissue segmentation dataloader for tissue vs. background segmentation in whole-slide images.
Arguments:
- path: Filepath to a folder where the data will be saved.
- patch_shape: The patch shape to use for training.
- batch_size: The batch size for training.
- samples: The samples to use. By default all samples valid for the annotation choice are used.
- annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
- download: Whether to download the data if it is not present.
- label_dtype: The datatype of the labels.
- resize_inputs: Whether to resize the input images.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
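The extra keyword arguments are divided between the dataset and the DataLoader. The sketch below illustrates the general idea of such a split by inspecting a function signature. It is a simplified stand-in written for this page and not the implementation of `util.split_kwargs`; both `split_kwargs_by_signature` and `fake_dataset_factory` are hypothetical names.

```python
import inspect


def split_kwargs_by_signature(function, **kwargs):
    """Split kwargs into those named in the signature of `function` and the rest."""
    names = set(inspect.signature(function).parameters)
    matched = {key: value for key, value in kwargs.items() if key in names}
    rest = {key: value for key, value in kwargs.items() if key not in names}
    return matched, rest


def fake_dataset_factory(raw_paths, patch_shape, n_samples=None, sampler=None):
    """Hypothetical stand-in for a dataset constructor."""


# 'n_samples' is a parameter of the stand-in factory; the other two are not,
# so they end up in the second dictionary.
ds_kwargs, loader_kwargs = split_kwargs_by_signature(
    fake_dataset_factory, n_samples=100, num_workers=4, shuffle=True
)
```

Under this scheme, any argument that the dataset function does not recognize would be forwarded to the DataLoader, so a misspelled dataset argument would surface as a DataLoader error rather than a dataset error.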