torch_em.data.datasets.histopathology.tissueseg

This dataset contains annotations for tissue vs. background segmentation in whole-slide histopathology images of multiple organs and stains.

The dataset is a representative sample from the publication https://doi.org/10.7717/peerj.8242 ("Resolution-agnostic tissue segmentation in whole-slide histopathology images with convolutional neural networks") and is hosted on Zenodo at https://doi.org/10.5281/zenodo.3375528. Please cite it if you use this dataset in your research.

The data consists of ten whole-slide images. The five 'development' samples (breast, breast lymph node and three tongue stains) provide annotations at a pixel spacing of 0.5 micrometers, with both a binary (tissue vs. background) and a six-class label. The five 'dissimilar' samples (brain, cornea, kidney, skin and uterus) provide only the binary annotation at a pixel spacing of 2.0 micrometers.

The label values follow the scheme used by the authors. Unannotated regions (typically the glass background outside the region of interest) are labeled as 0. The binary annotations use 3 for background (non-tissue) and 6 for tissue. The six-class annotations keep 3 and 6 with the same meaning and additionally use 1 (edge artifacts), 2 (inner artifacts), 4 (external margin) and 5 (internal margin).
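Because the label values are not contiguous (0, 3 and 6 in the binary masks), they usually need to be remapped before they are passed to a classification loss. The following is a minimal sketch of one possible remapping and is not part of this module: `remap_binary_labels` and `IGNORE_LABEL` are hypothetical names, and mapping unannotated pixels to an ignore value of -1 is an assumption that may not suit every training setup.

```python
import numpy as np

# Hypothetical ignore value for unannotated pixels (label 0); an assumption, not part of the dataset.
IGNORE_LABEL = -1


def remap_binary_labels(labels, ignore_label=IGNORE_LABEL):
    """Map the original label values 3 (background) and 6 (tissue) to 0 and 1."""
    # Lookup table indexed by the original label value (0..6).
    # Every value that is not explicitly mapped falls back to the ignore value.
    lut = np.full(7, ignore_label, dtype=np.int64)
    lut[3] = 0  # background (non-tissue)
    lut[6] = 1  # tissue
    return lut[np.asarray(labels)]


example = np.array([[0, 3], [6, 6]], dtype="uint8")
print(remap_binary_labels(example).tolist())  # [[-1, 0], [1, 1]]
```

A function like this could likely be passed as `label_transform` through the keyword arguments of `get_tissueseg_dataset`, assuming it is forwarded to `torch_em.default_segmentation_dataset` unchanged.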

NOTE: The whole-slide images and masks are multi-resolution pyramidal TIFFs of several gigabytes each. On first use, each requested sample is converted into a chunked HDF5 file at the resolution matching its annotation, which takes some time and disk space.
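The two ideas behind the conversion, picking the pyramid level whose shape matches the mask and copying the data tile by tile, can be sketched in plain Python. This is an illustration under stated assumptions, not the module's code: `find_matching_level` and `iter_tiles` are hypothetical helpers, and the pyramid shapes below are invented for the example.

```python
from typing import Iterator, List, Optional, Tuple


def find_matching_level(level_shapes: List[Tuple[int, ...]], mask_shape: Tuple[int, int]) -> Optional[int]:
    """Return the index of the first pyramid level whose (height, width) equals the mask shape."""
    for index, shape in enumerate(level_shapes):
        if tuple(shape[:2]) == tuple(mask_shape):
            return index
    return None


def iter_tiles(height: int, width: int, tile: int = 4096) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (y, x, tile_height, tile_width) in row-major order; border tiles are clipped."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield y, x, min(tile, height - y), min(tile, width - x)


# Invented pyramid in which every level halves the previous one (height, width, channels).
levels = [(40000, 36000, 3), (20000, 18000, 3), (10000, 9000, 3)]
level = find_matching_level(levels, (10000, 9000))
tiles = list(iter_tiles(10000, 9000))
print(level, len(tiles), tiles[-1])  # 2 9 (8192, 8192, 1808, 808)
```

Copying in tiles keeps the peak memory bounded by one tile per array, which matters because a full-resolution slide does not fit comfortably into memory.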

"""This dataset contains annotations for tissue vs. background segmentation in
whole-slide histopathology images of multiple organs and stains.

The dataset is a representative sample from the publication
https://doi.org/10.7717/peerj.8242 ("Resolution-agnostic tissue segmentation
in whole-slide histopathology images with convolutional neural networks") and
is hosted on Zenodo at https://doi.org/10.5281/zenodo.3375528.
Please cite it if you use this dataset in your research.

The data consists of ten whole-slide images. The five 'development' samples
(breast, breast lymph node and three tongue stains) provide annotations at a
pixel spacing of 0.5 micrometers, with both a binary (tissue vs. background) and
a six-class label. The five 'dissimilar' samples (brain, cornea, kidney, skin
and uterus) provide only the binary annotation at a pixel spacing of 2.0
micrometers.

The label values follow the scheme used by the authors. Unannotated regions
(typically the glass background outside the region of interest) are labeled as 0.
The binary annotations use 3 for background (non-tissue) and 6 for tissue. The
six-class annotations keep 3 and 6 with the same meaning and additionally use
1 (edge artifacts), 2 (inner artifacts), 4 (external margin) and 5 (internal margin).

NOTE: The whole-slide images and masks are multi-resolution pyramidal TIFFs of
several gigabytes each. On first use, each requested sample is converted into
a chunked HDF5 file at the resolution matching its annotation, which takes
some time and disk space.
"""

import os
from pathlib import Path
from typing import List, Literal, Optional, Tuple, Union

from tqdm import tqdm

import torch

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


BASE_URL = "https://zenodo.org/api/records/3375528/files"

SAMPLES = {
    "breast": "breast_hne_00",
    "breast_lymph_node": "breast_lymph_node_hne_00",
    "tongue_hne": "tongue_hne_00",
    "tongue_ki67": "tongue_ki67_00",
    "tongue_ae1ae3": "tongue_ae1ae3_00",
    "brain": "brain_alcianblue_00",
    "cornea": "cornea_grocott_00",
    "kidney": "kidney_cab_00",
    "skin": "skin_perls_00",
    "uterus": "uterus_vonkossa_00",
}

# The development samples are densely annotated at 0.5 micrometer with binary and six-class labels.
DEVELOPMENT_SAMPLES = ["breast", "breast_lymph_node", "tongue_hne", "tongue_ki67", "tongue_ae1ae3"]

# The dissimilar samples only have a binary annotation at 2.0 micrometer.
DISSIMILAR_SAMPLES = ["brain", "cornea", "kidney", "skin", "uterus"]


def _mask_filename(sample, annotations):
    stem = SAMPLES[sample]
    if sample in DEVELOPMENT_SAMPLES:
        return f"{stem}_mask_cl2_sp0.5.tif" if annotations == "binary" else f"{stem}_mask_cl6_sp0.5.tif"
    else:
        return f"{stem}_mask_cl2_sp2.0.tif"


def _resolve_samples(samples, annotations):
    if annotations not in ("binary", "semantic"):
        raise ValueError(f"'{annotations}' is not a valid annotation choice. Use 'binary' or 'semantic'.")

    if samples is None:
        samples = DEVELOPMENT_SAMPLES if annotations == "semantic" else list(SAMPLES.keys())

    if isinstance(samples, str):
        samples = [samples]

    for sample in samples:
        if sample not in SAMPLES:
            raise ValueError(f"'{sample}' is not a valid sample. Choose from {list(SAMPLES.keys())}.")
        if annotations == "semantic" and sample not in DEVELOPMENT_SAMPLES:
            raise ValueError(f"The sample '{sample}' does not have semantic annotations. Use annotations='binary'.")

    return samples


def _download_file(path, filename, download):
    out_path = os.path.join(path, filename)
    if os.path.exists(out_path):
        return out_path

    # The whole-slide images are several gigabytes each, so we do not verify checksums.
    url = f"{BASE_URL}/{filename}/content"
    util.download_source(path=out_path, url=url, download=download, checksum=None)
    return out_path


def _open_level(series, level_index):
    import zarr

    # The pyramidal TIFFs are natively tiled, so a zarr view reads only the requested tiles lazily.
    # Multi-level series open as a group keyed by the level index; single-level series open as an array.
    array = zarr.open(series.aszarr(), mode="r")
    return array if hasattr(array, "shape") else array[str(level_index)]


def _convert_sample(wsi_path, mask_paths, output_path, tile=4096):
    import h5py
    import tifffile

    # All annotations of a sample share the same resolution, so the raw level is matched to the first mask.
    first_mask = tifffile.TiffFile(list(mask_paths.values())[0])
    height, width = first_mask.series[0].levels[0].shape

    image_series = tifffile.TiffFile(wsi_path).series[0]
    level_index = next((i for i, lv in enumerate(image_series.levels) if lv.shape[:2] == (height, width)), None)
    if level_index is None:
        raise RuntimeError(
            f"Could not find a resolution level in '{wsi_path}' matching the mask shape ({height}, {width})."
        )

    image = _open_level(image_series, level_index)
    masks = {ann: _open_level(tifffile.TiffFile(p).series[0], 0) for ann, p in mask_paths.items()}
    for ann, mask in masks.items():
        if mask.shape != (height, width):
            raise RuntimeError(f"Mask '{ann}' shape {mask.shape} does not match the raw shape ({height}, {width}).")

    tmp_path = output_path + ".tmp"
    with h5py.File(tmp_path, "w") as f:
        raw = f.create_dataset(
            "images/raw", shape=(3, height, width), dtype="uint8", compression="gzip", chunks=(1, 512, 512)
        )
        mask_datasets = {}
        for ann in masks:
            mask_datasets[ann] = f.create_dataset(
                f"labels/{ann}", shape=(height, width), dtype="uint8", compression="gzip", chunks=(512, 512)
            )
        for y in tqdm(range(0, height, tile), desc=f"Converting {Path(wsi_path).stem}"):
            for x in range(0, width, tile):
                th, tw = min(tile, height - y), min(tile, width - x)
                raw[:, y:y + th, x:x + tw] = image[y:y + th, x:x + tw].transpose(2, 0, 1)
                for ann in masks:
                    mask_datasets[ann][y:y + th, x:x + tw] = masks[ann][y:y + th, x:x + tw]

    os.replace(tmp_path, output_path)


def get_tissueseg_data(
    path: Union[os.PathLike, str],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
) -> str:
    """Download and preprocess the tissue segmentation data.

    Args:
        path: Filepath to a folder where the data will be saved.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.

    Returns:
        Filepath to the folder where the preprocessed data is stored.
    """
    samples = _resolve_samples(samples, annotations)

    raw_dir = os.path.join(path, "raw")
    preprocessed_dir = os.path.join(path, "preprocessed")
    os.makedirs(raw_dir, exist_ok=True)
    os.makedirs(preprocessed_dir, exist_ok=True)

    for sample in samples:
        output_path = os.path.join(preprocessed_dir, f"{sample}.h5")
        if os.path.exists(output_path):
            continue

        wsi_path = _download_file(raw_dir, f"{SAMPLES[sample]}.tif", download)

        # Store all available masks for the sample so that 'binary' and 'semantic' share one file.
        mask_choices = ["binary", "semantic"] if sample in DEVELOPMENT_SAMPLES else ["binary"]
        mask_paths = {
            choice: _download_file(raw_dir, _mask_filename(sample, choice), download) for choice in mask_choices
        }

        _convert_sample(wsi_path, mask_paths, output_path)

    return preprocessed_dir


def get_tissueseg_paths(
    path: Union[os.PathLike, str],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
) -> List[str]:
    """Get paths to the tissue segmentation data.

    Args:
        path: Filepath to a folder where the data will be saved.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.

    Returns:
        List of filepaths to the preprocessed HDF5 files.
    """
    samples = _resolve_samples(samples, annotations)
    preprocessed_dir = get_tissueseg_data(path, samples, annotations, download)
    volume_paths = [os.path.join(preprocessed_dir, f"{sample}.h5") for sample in samples]

    missing = [p for p in volume_paths if not os.path.exists(p)]
    if missing:
        raise RuntimeError(f"Could not find the preprocessed data at {missing}.")

    return volume_paths


def get_tissueseg_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> Dataset:
    """Get the tissue segmentation dataset for tissue vs. background segmentation in whole-slide images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    volume_paths = get_tissueseg_paths(path, samples, annotations, download)

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": True}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    return torch_em.default_segmentation_dataset(
        raw_paths=volume_paths,
        raw_key="images/raw",
        label_paths=volume_paths,
        label_key=f"labels/{annotations}",
        patch_shape=patch_shape,
        label_dtype=label_dtype,
        is_seg_dataset=True,
        with_channels=True,
        ndim=2,
        **kwargs
    )


def get_tissueseg_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    batch_size: int,
    samples: Optional[Union[str, List[str]]] = None,
    annotations: Literal["binary", "semantic"] = "binary",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> DataLoader:
    """Get the tissue segmentation dataloader for tissue vs. background segmentation in whole-slide images.

    Args:
        path: Filepath to a folder where the data will be saved.
        patch_shape: The patch shape to use for training.
        batch_size: The batch size for training.
        samples: The samples to use. By default all samples valid for the annotation choice are used.
        annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_tissueseg_dataset(
        path=path, patch_shape=patch_shape, samples=samples, annotations=annotations, download=download,
        label_dtype=label_dtype, resize_inputs=resize_inputs, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
BASE_URL = 'https://zenodo.org/api/records/3375528/files'
SAMPLES = {'breast': 'breast_hne_00', 'breast_lymph_node': 'breast_lymph_node_hne_00', 'tongue_hne': 'tongue_hne_00', 'tongue_ki67': 'tongue_ki67_00', 'tongue_ae1ae3': 'tongue_ae1ae3_00', 'brain': 'brain_alcianblue_00', 'cornea': 'cornea_grocott_00', 'kidney': 'kidney_cab_00', 'skin': 'skin_perls_00', 'uterus': 'uterus_vonkossa_00'}
DEVELOPMENT_SAMPLES = ['breast', 'breast_lymph_node', 'tongue_hne', 'tongue_ki67', 'tongue_ae1ae3']
DISSIMILAR_SAMPLES = ['brain', 'cornea', 'kidney', 'skin', 'uterus']
def get_tissueseg_data(path: Union[os.PathLike, str], samples: Union[List[str], str, NoneType] = None, annotations: Literal['binary', 'semantic'] = 'binary', download: bool = False) -> str:

Download and preprocess the tissue segmentation data.

Arguments:
  • path: Filepath to a folder where the data will be saved.
  • samples: The samples to use. By default all samples valid for the annotation choice are used.
  • annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
  • download: Whether to download the data if it is not present.
Returns:

Filepath to the folder where the preprocessed data is stored.
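The folder layout that results from a call can be sketched as follows. This is an illustration rather than the module's code: `expected_layout` is a hypothetical helper, `/data/tissueseg` is a placeholder path, and `PurePosixPath` is used only so that the printed paths look the same on every platform.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Union


def expected_layout(root: str, samples: List[str]) -> Dict[str, Union[str, List[str]]]:
    """Describe where the downloaded TIFFs and the converted HDF5 files are placed under `root`."""
    root_path = PurePosixPath(root)
    return {
        # The downloaded whole-slide images and masks are kept in 'raw'.
        "raw_dir": str(root_path / "raw"),
        # One HDF5 file per sample is written to 'preprocessed'; this folder is the return value.
        "preprocessed_dir": str(root_path / "preprocessed"),
        "files": [str(root_path / "preprocessed" / f"{sample}.h5") for sample in samples],
    }


layout = expected_layout("/data/tissueseg", ["breast", "kidney"])
print(layout["files"])
# ['/data/tissueseg/preprocessed/breast.h5', '/data/tissueseg/preprocessed/kidney.h5']
```

Samples whose HDF5 file already exists are skipped, so repeated calls only pay the conversion cost for samples that have not been requested before.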

def get_tissueseg_paths(path: Union[os.PathLike, str], samples: Union[List[str], str, NoneType] = None, annotations: Literal['binary', 'semantic'] = 'binary', download: bool = False) -> List[str]:

Get paths to the tissue segmentation data.

Arguments:
  • path: Filepath to a folder where the data will be saved.
  • samples: The samples to use. By default all samples valid for the annotation choice are used.
  • annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
  • download: Whether to download the data if it is not present.
Returns:

List of filepaths to the preprocessed HDF5 files.
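How the `samples` and `annotations` arguments interact can be summarized in a small standalone sketch. It follows the rules described above but is not the module's implementation: `select_samples` is a hypothetical name and the two lists are copied here only to keep the example self-contained.

```python
from typing import List, Optional, Union

DEVELOPMENT = ["breast", "breast_lymph_node", "tongue_hne", "tongue_ki67", "tongue_ae1ae3"]
DISSIMILAR = ["brain", "cornea", "kidney", "skin", "uterus"]


def select_samples(samples: Optional[Union[str, List[str]]] = None, annotations: str = "binary") -> List[str]:
    """Resolve the sample choice: semantic labels exist only for the development samples."""
    if annotations not in ("binary", "semantic"):
        raise ValueError(f"Invalid annotation choice: {annotations}")
    if samples is None:
        # Default: every sample that offers the requested annotation type.
        return list(DEVELOPMENT) if annotations == "semantic" else DEVELOPMENT + DISSIMILAR
    if isinstance(samples, str):
        samples = [samples]
    for sample in samples:
        if sample not in DEVELOPMENT + DISSIMILAR:
            raise ValueError(f"Unknown sample: {sample}")
        if annotations == "semantic" and sample not in DEVELOPMENT:
            raise ValueError(f"No semantic annotations for sample: {sample}")
    return list(samples)


print(len(select_samples()))                        # 10
print(len(select_samples(annotations="semantic")))  # 5
print(select_samples("kidney"))                     # ['kidney']
```

Requesting `annotations="semantic"` for one of the dissimilar samples raises a `ValueError` instead of silently falling back to the binary mask.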

def get_tissueseg_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int], samples: Union[List[str], str, NoneType] = None, annotations: Literal['binary', 'semantic'] = 'binary', download: bool = False, label_dtype: torch.dtype = torch.int64, resize_inputs: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the tissue segmentation dataset for tissue vs. background segmentation in whole-slide images.

Arguments:
  • path: Filepath to a folder where the data will be saved.
  • patch_shape: The patch shape to use for training.
  • samples: The samples to use. By default all samples valid for the annotation choice are used.
  • annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
  • download: Whether to download the data if it is not present.
  • label_dtype: The datatype of the labels.
  • resize_inputs: Whether to resize the input images.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
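The preprocessed files store the image channel-first as `(3, height, width)` and each label mask as `(height, width)`, so a 2D `patch_shape` selects a spatial window across all three color channels. The sketch below only illustrates the resulting shapes with NumPy arrays standing in for the HDF5 datasets; `crop_patch` is a hypothetical helper and not the sampling logic used by `torch_em`.

```python
import numpy as np


def crop_patch(raw, labels, origin, patch_shape):
    """Cut the same spatial window from a channel-first image and its 2D label mask."""
    y, x = origin
    height, width = patch_shape
    raw_patch = raw[:, y:y + height, x:x + width]    # keep all channels
    label_patch = labels[y:y + height, x:x + width]  # labels have no channel axis
    return raw_patch, label_patch


# Small stand-ins for the 'images/raw' and 'labels/binary' datasets.
raw = np.zeros((3, 2048, 2048), dtype="uint8")
labels = np.zeros((2048, 2048), dtype="uint8")

raw_patch, label_patch = crop_patch(raw, labels, origin=(100, 200), patch_shape=(512, 512))
print(raw_patch.shape, label_patch.shape)  # (3, 512, 512) (512, 512)
```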

def get_tissueseg_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int], batch_size: int, samples: Union[List[str], str, NoneType] = None, annotations: Literal['binary', 'semantic'] = 'binary', download: bool = False, label_dtype: torch.dtype = torch.int64, resize_inputs: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the tissue segmentation dataloader for tissue vs. background segmentation in whole-slide images.

Arguments:
  • path: Filepath to a folder where the data will be saved.
  • patch_shape: The patch shape to use for training.
  • batch_size: The batch size for training.
  • samples: The samples to use. By default all samples valid for the annotation choice are used.
  • annotations: The annotation type. Either 'binary' (tissue vs. background) or 'semantic' (six classes).
  • download: Whether to download the data if it is not present.
  • label_dtype: The datatype of the labels.
  • resize_inputs: Whether to resize the input images.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
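The extra keyword arguments are divided between the dataset factory and the PyTorch DataLoader. One way such a split can work is to inspect the signature of the target function, as in the sketch below. This shows the idea only: `split_kwargs_sketch` and `make_dataset` are hypothetical, and the actual `util.split_kwargs` may be implemented differently.

```python
import inspect


def split_kwargs_sketch(function, **kwargs):
    """Separate kwargs into those named in `function`'s signature and all remaining ones."""
    parameters = inspect.signature(function).parameters
    function_kwargs = {key: value for key, value in kwargs.items() if key in parameters}
    other_kwargs = {key: value for key, value in kwargs.items() if key not in parameters}
    return function_kwargs, other_kwargs


def make_dataset(raw_paths, n_samples=None, sampler=None):
    """Hypothetical stand-in for a dataset factory; only its signature matters here."""
    return raw_paths, n_samples, sampler


ds_kwargs, loader_kwargs = split_kwargs_sketch(make_dataset, n_samples=100, num_workers=4, shuffle=True)
print(ds_kwargs)      # {'n_samples': 100}
print(loader_kwargs)  # {'num_workers': 4, 'shuffle': True}
```

With a split of this kind, dataset options and options such as `num_workers` or `shuffle` can be passed in a single call to `get_tissueseg_loader`.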