torch_em.data.datasets.histopathology.phenocell

The PhenoCell dataset contains annotations for cell phenotyping in H&E stained histopathology images, with instance segmentation and 14 granular cell types derived from co-registered multiplexed (CODEX) imaging.

The dataset is part of the PhenoBench (PathoCellBench) benchmark and is hosted on HuggingFace at https://huggingface.co/datasets/Kainmueller-Lab/phenobench. This dataset is from the publication https://doi.org/10.48550/arXiv.2507.03532. Please cite it if you use this dataset in your research.

The data consists of 109 fields of view of 1440x1920 pixels. On first use, each field of view is converted into a single chunked and compressed HDF5 file with the following layout:
    - 'raw/histopathology/h&e': the (3, H, W) H&E image.
    - 'raw/codex/all': the (58, H, W) stack of co-registered CODEX channels.
    - 'raw/codex/<marker>_<target>': each individual CODEX channel (H, W), e.g. 'raw/codex/CD20_B_cells' (see CODEX_CHANNELS for the full list of 58 channels).
    - 'labels/instances': the instance segmentation.
    - 'labels/semantic_coarse': the coarse 15-class cell type map (the benchmark labels).
    - 'labels/semantic_fine': the fine-grained 30-class cell type map.
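The layout above can be read back with h5py once a field of view has been converted. The following is a minimal sketch and not part of torch_em: `codex_channel_key` and `read_field_of_view` are hypothetical helper names, and `h5_path` is assumed to point to one of the converted HDF5 files.

```python
def codex_channel_key(channel: str) -> str:
    """Build the HDF5 key of a single CODEX channel, e.g. 'raw/codex/CD20_B_cells'."""
    return f"raw/codex/{channel}"


def read_field_of_view(h5_path: str, channel: str = "CD20_B_cells") -> dict:
    """Load the H&E image, one CODEX channel and the coarse labels from a converted file."""
    import h5py  # Imported lazily so that the key helper works without h5py installed.

    with h5py.File(h5_path, "r") as f:
        return {
            "h&e": f["raw/histopathology/h&e"][:],               # (3, H, W)
            "codex_channel": f[codex_channel_key(channel)][:],   # (H, W)
            "semantic_coarse": f["labels/semantic_coarse"][:],   # (H, W)
        }
```

Reading a single channel key avoids loading the full 58-channel stack stored under 'raw/codex/all'.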

The coarse semantic classes ('semantic_coarse' label choice) are:
    0: Background
    1: B cells
    2: Macrophages/Monocytes
    3: Adipocytes
    4: Dendritic cells
    5: T cells
    6: Granulocytes
    7: NK cells
    8: Nerves
    9: Plasma cells
    10: Smooth muscle
    11: Stroma
    12: Tumor cells
    13: Vasculature/Lymphatics
    14: Other cells

The 'semantic_fine' label choice has 30 granular classes that the coarse ones are collapsed from:
    0: background
    1: B cells
    2: CD11b+ monocytes
    3: CD11b+CD68+ macrophages
    4: CD11c+ DCs
    5: CD163+ macrophages
    6: CD3+ T cells
    7: CD4+ T cells
    8: CD4+ T cells CD45RO+
    9: CD4+ T cells GATA3+
    10: CD68+ macrophages
    11: CD68+ macrophages GzmB+
    12: CD68+CD163+ macrophages
    13: CD8+ T cells
    14: NK cells
    15: Tregs
    16: adipocytes
    17: dirt
    18: granulocytes
    19: immune cells
    20: immune cells / vasculature
    21: lymphatics
    22: nerves
    23: plasma cells
    24: smooth muscle
    25: stroma
    26: tumor cells
    27: tumor cells / immune cells
    28: undefined
    29: vasculature
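When inspecting 'semantic_coarse' labels, the class indices listed above can be kept in a lookup table. This is a small sketch using only the standard library: `COARSE_CLASSES` and `count_pixels_per_class` are hypothetical names (the module itself does not export such a table), and `toy_patch` is made-up data.

```python
from collections import Counter

# Coarse class ids and names, copied from the 'semantic_coarse' list above.
COARSE_CLASSES = {
    0: "Background", 1: "B cells", 2: "Macrophages/Monocytes", 3: "Adipocytes",
    4: "Dendritic cells", 5: "T cells", 6: "Granulocytes", 7: "NK cells",
    8: "Nerves", 9: "Plasma cells", 10: "Smooth muscle", 11: "Stroma",
    12: "Tumor cells", 13: "Vasculature/Lymphatics", 14: "Other cells",
}


def count_pixels_per_class(label_map):
    """Count pixels per coarse class name in a 2D label map given as nested lists."""
    counts = Counter(class_id for row in label_map for class_id in row)
    return {COARSE_CLASSES[class_id]: n for class_id, n in counts.items()}


# Hypothetical 2x3 patch of 'semantic_coarse' labels, for illustration only.
toy_patch = [[0, 1, 1], [12, 12, 5]]
print(count_pixels_per_class(toy_patch))
```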

NOTE: Downloading requires 'huggingface_hub'. The dataset is large (each field of view is around 350 MB), so by default only the requested split is downloaded.
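As a rough guide to the disk space involved, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every field of view is close to 350 MB and counts the downloaded source files only, not the converted HDF5 files that are written in addition. `estimated_download_gb` is a hypothetical helper.

```python
N_FIELDS_OF_VIEW = 109  # number of fields of view in the dataset
MB_PER_FIELD = 350      # approximate size of one field of view


def estimated_download_gb(n_fields: int, mb_per_field: float = MB_PER_FIELD) -> float:
    """Rough download size in GB (1 GB = 1000 MB) for a number of fields of view."""
    return n_fields * mb_per_field / 1000


print(f"All fields of view: ~{estimated_download_gb(N_FIELDS_OF_VIEW):.0f} GB")
```

Downloading all 109 fields of view (split=None) therefore needs on the order of 38 GB, which is why restricting the download to one split can be worthwhile.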


"""The PhenoCell dataset contains annotations for cell phenotyping in
H&E stained histopathology images, with instance segmentation and 14 granular
cell types derived from co-registered multiplexed (CODEX) imaging.

The dataset is part of the PhenoBench (PathoCellBench) benchmark and is hosted on
HuggingFace at https://huggingface.co/datasets/Kainmueller-Lab/phenobench.
This dataset is from the publication https://doi.org/10.48550/arXiv.2507.03532.
Please cite it if you use this dataset in your research.

The data consists of 109 fields of view of 1440x1920 pixels. On the first use each
field of view is converted into a single chunked and compressed HDF5 file with the
following layout:
    - 'raw/histopathology/h&e': the (3, H, W) H&E image.
    - 'raw/codex/all': the (58, H, W) stack of co-registered CODEX channels.
    - 'raw/codex/<marker>_<target>': each individual CODEX channel (H, W), e.g.
      'raw/codex/CD20_B_cells' (see `CODEX_CHANNELS` for the full list of 58 channels).
    - 'labels/instances': the instance segmentation.
    - 'labels/semantic_coarse': the coarse 15-class cell type map (the benchmark labels).
    - 'labels/semantic_fine': the fine-grained 30-class cell type map.

The coarse semantic classes ('semantic_coarse' label choice) are:
    0: Background
    1: B cells
    2: Macrophages/Monocytes
    3: Adipocytes
    4: Dendritic cells
    5: T cells
    6: Granulocytes
    7: NK cells
    8: Nerves
    9: Plasma cells
    10: Smooth muscle
    11: Stroma
    12: Tumor cells
    13: Vasculature/Lymphatics
    14: Other cells

The 'semantic_fine' label choice has 30 granular classes that the coarse ones are
collapsed from:
    0: background
    1: B cells
    2: CD11b+ monocytes
    3: CD11b+CD68+ macrophages
    4: CD11c+ DCs
    5: CD163+ macrophages
    6: CD3+ T cells
    7: CD4+ T cells
    8: CD4+ T cells CD45RO+
    9: CD4+ T cells GATA3+
    10: CD68+ macrophages
    11: CD68+ macrophages GzmB+
    12: CD68+CD163+ macrophages
    13: CD8+ T cells
    14: NK cells
    15: Tregs
    16: adipocytes
    17: dirt
    18: granulocytes
    19: immune cells
    20: immune cells / vasculature
    21: lymphatics
    22: nerves
    23: plasma cells
    24: smooth muscle
    25: stroma
    26: tumor cells
    27: tumor cells / immune cells
    28: undefined
    29: vasculature

NOTE: Downloading requires 'huggingface_hub'. The dataset is large (each field of
view is around 350 MB), so by default only the requested split is downloaded.
"""

import os
from pathlib import Path
from typing import List, Literal, Optional, Tuple, Union

from tqdm import tqdm

import torch

from torch.utils.data import Dataset, DataLoader

import torch_em

from .. import util


HF_REPO = "Kainmueller-Lab/phenobench"
SRC_HDF_DIR = "pathocell_hdf"
SPLIT_FILE = "data/phenocell/splits/phenocell_dataset_split.csv"

# Source label key in the downloaded HDF5 -> destination key in the converted HDF5.
SOURCE_LABELS = {
    "gt_inst": "labels/instances",
    "gt_ct_coarse": "labels/semantic_coarse",
    "gt_ct": "labels/semantic_fine",
}

LABEL_KEYS = {
    "instances": "labels/instances",
    "semantic_coarse": "labels/semantic_coarse",
    "semantic_fine": "labels/semantic_fine",
}

# The multi-channel raw inputs. Individual CODEX channels (see CODEX_CHANNELS) can also be chosen.
MODALITY_KEYS = {
    "histopathology": "raw/histopathology/h&e",
    "codex": "raw/codex/all",
}

# The 58 CODEX channels in their stored order, named '<marker>_<target>' (the keys under 'raw/codex/').
CODEX_CHANNELS = [
    "CD44_stroma", "FOXP3_regulatory_T_cells", "CDX2_intestinal_epithelia", "CD8_cytotoxic_T_cells",
    "p53_tumor_suppressor", "GATA3_Th2_helper_T_cells", "CD45_hematopoietic_cells", "T-bet_Th1_cells",
    "beta-catenin_Wnt_signaling", "HLA-DR_MHC-II", "PD-L1_checkpoint", "Ki67_proliferation",
    "CD45RA_naive_T_cells", "CD4_T_helper_cells", "CD21_DCs", "MUC-1_epithelia", "CD30_costimulator",
    "CD2_T_cells", "Vimentin_cytoplasm", "CD20_B_cells", "LAG-3_checkpoint", "Na-K-ATPase_membranes",
    "CD5_T_cells", "IDO-1_metabolism", "Cytokeratin_epithelia", "CD11b_macrophages", "CD56_NK_cells",
    "aSMA_smooth_muscle", "BCL-2_apoptosis", "CD25_IL-2_Ra", "Collagen_IV_bas._memb.", "CD11c_DCs",
    "PD-1_checkpoint", "HOCHST13", "Granzyme_B_cytotoxicity", "EGFR_signaling", "VISTA_costimulator",
    "CD15_granulocytes", "CD194_CCR4_chemokine_R", "ICOS_costimulator", "MMP9_matrix_metalloproteinase",
    "Synaptophysin_neuroendocrine", "CD71_transferrin_R", "GFAP_nerves", "CD7_T_cells", "CD3_T_cells",
    "Chromogranin_A_neuroendocrine", "CD163_macrophages", "CD57_NK_cells", "CD45RO_memory_cells",
    "CD68_macrophages", "CD31_vasculature", "Podoplanin_lymphatics", "CD34_vasculature", "CD38_multifunctional",
    "CD138_plasma_cells", "MMP12_matrix_metalloproteinase", "DRAQ5",
]


def _samples_for_split(split_csv, split):
    import pandas as pd

    df = pd.read_csv(split_csv)
    if split is not None:
        if split not in ("train", "valid", "test"):
            raise ValueError(f"'{split}' is not a valid split choice. Use 'train', 'valid' or 'test'.")
        df = df[df["train_test_val_split"] == split]

    return sorted(df["sample_name"].tolist())


def _convert_sample(src_path, output_path):
    import h5py

    with h5py.File(src_path, "r") as f:
        image = f["img"][:]
        codex = f["ifl"][:]
        labels = {dst: f[src][0] for src, dst in SOURCE_LABELS.items()}

    if codex.shape[0] != len(CODEX_CHANNELS):
        raise RuntimeError(f"Expected {len(CODEX_CHANNELS)} CODEX channels, but found {codex.shape[0]}.")

    tmp_path = output_path + ".tmp"
    with h5py.File(tmp_path, "w") as f:
        f.create_dataset("raw/histopathology/h&e", data=image, compression="gzip", chunks=(1, 512, 512))
        f.create_dataset("raw/codex/all", data=codex, compression="gzip", chunks=(1, 512, 512))
        for i, name in enumerate(CODEX_CHANNELS):
            f.create_dataset(f"raw/codex/{name}", data=codex[i], compression="gzip", chunks=(512, 512))
        for dst, label in labels.items():
            f.create_dataset(dst, data=label, compression="gzip", chunks=(512, 512))

    os.replace(tmp_path, output_path)


def get_phenocell_data(
    path: Union[os.PathLike, str],
    split: Optional[Literal["train", "valid", "test"]] = None,
    download: bool = False,
) -> str:
    """Download and preprocess the PhenoCell data.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
        download: Whether to download the data if it is not present.

    Returns:
        Filepath to the folder where the preprocessed data is stored.
    """
    try:
        from huggingface_hub import hf_hub_download, snapshot_download
    except ImportError:
        raise ImportError("'huggingface_hub' is required to download PhenoCell. Install it via conda/pip.")

    preprocessed_dir = os.path.join(path, "preprocessed")
    os.makedirs(preprocessed_dir, exist_ok=True)

    if not os.path.exists(os.path.join(path, SPLIT_FILE)):
        if not download:
            raise RuntimeError(f"Cannot find the data at {path}, but download was set to False.")
        hf_hub_download(repo_id=HF_REPO, repo_type="dataset", filename=SPLIT_FILE, local_dir=path)

    samples = _samples_for_split(os.path.join(path, SPLIT_FILE), split)
    to_convert = [s for s in samples if not os.path.exists(os.path.join(preprocessed_dir, f"{Path(s).stem}.h5"))]

    if to_convert:
        if not download:
            raise RuntimeError(f"Cannot find the data at {path}, but download was set to False.")
        patterns = [f"{SRC_HDF_DIR}/{s}" for s in to_convert]
        snapshot_download(repo_id=HF_REPO, repo_type="dataset", local_dir=path, allow_patterns=patterns)

        for sample in tqdm(to_convert, desc="Converting PhenoCell fields of view"):
            _convert_sample(
                os.path.join(path, SRC_HDF_DIR, sample),
                os.path.join(preprocessed_dir, f"{Path(sample).stem}.h5"),
            )

    return preprocessed_dir


def get_phenocell_paths(
    path: Union[os.PathLike, str],
    split: Optional[Literal["train", "valid", "test"]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to the PhenoCell data.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
        download: Whether to download the data if it is not present.

    Returns:
        List of filepaths to the preprocessed HDF5 files.
    """
    preprocessed_dir = get_phenocell_data(path, split, download)
    samples = _samples_for_split(os.path.join(path, SPLIT_FILE), split)
    volume_paths = [os.path.join(preprocessed_dir, f"{Path(s).stem}.h5") for s in samples]

    missing = [p for p in volume_paths if not os.path.exists(p)]
    if missing:
        raise RuntimeError(f"Could not find the data at {missing}.")

    return volume_paths


def get_phenocell_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    split: Optional[Literal["train", "valid", "test"]] = None,
    label_choice: Literal["instances", "semantic_coarse", "semantic_fine"] = "instances",
    modality: str = "histopathology",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> Dataset:
    """Get the PhenoCell dataset for cell phenotyping in H&E stained histopathology images.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        patch_shape: The patch shape to use for training.
        split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
        label_choice: The label type. Either 'instances', 'semantic_coarse' (15-class) or 'semantic_fine' (30-class).
        modality: The raw input. Either 'histopathology' (3-channel H&E), 'codex' (58-channel multiplexed stack)
            or the name of a single CODEX channel (see `CODEX_CHANNELS`), e.g. 'CD20_B_cells'.
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    if label_choice not in LABEL_KEYS:
        raise ValueError(f"'{label_choice}' is not a valid label choice. Choose from {list(LABEL_KEYS.keys())}.")

    if modality in MODALITY_KEYS:
        raw_key, with_channels = MODALITY_KEYS[modality], True
    elif modality in CODEX_CHANNELS:
        raw_key, with_channels = f"raw/codex/{modality}", False
    else:
        raise ValueError(f"'{modality}' is not a valid modality. Use 'histopathology', 'codex' or a CODEX channel.")

    volume_paths = get_phenocell_paths(path, split, download)

    if resize_inputs:
        resize_kwargs = {"patch_shape": patch_shape, "is_rgb": modality == "histopathology"}
        kwargs, patch_shape = util.update_kwargs_for_resize_trafo(
            kwargs=kwargs, patch_shape=patch_shape, resize_inputs=resize_inputs, resize_kwargs=resize_kwargs
        )

    return torch_em.default_segmentation_dataset(
        raw_paths=volume_paths,
        raw_key=raw_key,
        label_paths=volume_paths,
        label_key=LABEL_KEYS[label_choice],
        patch_shape=patch_shape,
        label_dtype=label_dtype,
        is_seg_dataset=True,
        with_channels=with_channels,
        ndim=2,
        **kwargs
    )


def get_phenocell_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int],
    batch_size: int,
    split: Optional[Literal["train", "valid", "test"]] = None,
    label_choice: Literal["instances", "semantic_coarse", "semantic_fine"] = "instances",
    modality: str = "histopathology",
    download: bool = False,
    label_dtype: torch.dtype = torch.int64,
    resize_inputs: bool = False,
    **kwargs
) -> DataLoader:
    """Get the PhenoCell dataloader for cell phenotyping in H&E stained histopathology images.

    Args:
        path: Filepath to a folder where the downloaded data will be saved.
        patch_shape: The patch shape to use for training.
        batch_size: The batch size for training.
        split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
        label_choice: The label type. Either 'instances', 'semantic_coarse' (15-class) or 'semantic_fine' (30-class).
        modality: The raw input. Either 'histopathology' (3-channel H&E), 'codex' (58-channel multiplexed stack)
            or the name of a single CODEX channel (see `CODEX_CHANNELS`), e.g. 'CD20_B_cells'.
        download: Whether to download the data if it is not present.
        label_dtype: The datatype of the labels.
        resize_inputs: Whether to resize the input images.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_phenocell_dataset(
        path=path, patch_shape=patch_shape, split=split, label_choice=label_choice, modality=modality,
        download=download, label_dtype=label_dtype, resize_inputs=resize_inputs, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
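
In the listing above, `_convert_sample` writes to a '.tmp' file and then calls `os.replace`. Presumably this keeps an interrupted conversion from leaving a partial '.h5' file that a later run would treat as finished, since `get_phenocell_data` only checks whether the file exists. A minimal sketch of the same pattern with a plain text file instead of HDF5 (`write_atomically` is a hypothetical helper):

```python
import os
import tempfile


def write_atomically(output_path: str, content: str) -> None:
    """Write to '<output_path>.tmp' first, then rename it over the final path."""
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(content)
    # The final path only appears once the file has been written completely.
    os.replace(tmp_path, output_path)


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "sample.txt")
    write_atomically(target, "converted")
    leftovers = sorted(os.listdir(folder))  # only the final file remains
```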

HF_REPO = 'Kainmueller-Lab/phenobench'

SRC_HDF_DIR = 'pathocell_hdf'

SPLIT_FILE = 'data/phenocell/splits/phenocell_dataset_split.csv'

SOURCE_LABELS = {'gt_inst': 'labels/instances', 'gt_ct_coarse': 'labels/semantic_coarse', 'gt_ct': 'labels/semantic_fine'}

LABEL_KEYS = {'instances': 'labels/instances', 'semantic_coarse': 'labels/semantic_coarse', 'semantic_fine': 'labels/semantic_fine'}

MODALITY_KEYS = {'histopathology': 'raw/histopathology/h&e', 'codex': 'raw/codex/all'}

CODEX_CHANNELS = ['CD44_stroma', 'FOXP3_regulatory_T_cells', 'CDX2_intestinal_epithelia', 'CD8_cytotoxic_T_cells', 'p53_tumor_suppressor', 'GATA3_Th2_helper_T_cells', 'CD45_hematopoietic_cells', 'T-bet_Th1_cells', 'beta-catenin_Wnt_signaling', 'HLA-DR_MHC-II', 'PD-L1_checkpoint', 'Ki67_proliferation', 'CD45RA_naive_T_cells', 'CD4_T_helper_cells', 'CD21_DCs', 'MUC-1_epithelia', 'CD30_costimulator', 'CD2_T_cells', 'Vimentin_cytoplasm', 'CD20_B_cells', 'LAG-3_checkpoint', 'Na-K-ATPase_membranes', 'CD5_T_cells', 'IDO-1_metabolism', 'Cytokeratin_epithelia', 'CD11b_macrophages', 'CD56_NK_cells', 'aSMA_smooth_muscle', 'BCL-2_apoptosis', 'CD25_IL-2_Ra', 'Collagen_IV_bas._memb.', 'CD11c_DCs', 'PD-1_checkpoint', 'HOCHST13', 'Granzyme_B_cytotoxicity', 'EGFR_signaling', 'VISTA_costimulator', 'CD15_granulocytes', 'CD194_CCR4_chemokine_R', 'ICOS_costimulator', 'MMP9_matrix_metalloproteinase', 'Synaptophysin_neuroendocrine', 'CD71_transferrin_R', 'GFAP_nerves', 'CD7_T_cells', 'CD3_T_cells', 'Chromogranin_A_neuroendocrine', 'CD163_macrophages', 'CD57_NK_cells', 'CD45RO_memory_cells', 'CD68_macrophages', 'CD31_vasculature', 'Podoplanin_lymphatics', 'CD34_vasculature', 'CD38_multifunctional', 'CD138_plasma_cells', 'MMP12_matrix_metalloproteinase', 'DRAQ5']
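
Because the conversion stores slice `i` of the CODEX stack under the name `CODEX_CHANNELS[i]`, the position of a name in this list is also its index along the first axis of 'raw/codex/all'. The sketch below illustrates the lookup with an excerpt of the list (its first 20 entries); `CODEX_CHANNELS_EXCERPT` and `channel_index` are hypothetical names.

```python
# First 20 entries of CODEX_CHANNELS in their stored order (excerpt, for illustration).
CODEX_CHANNELS_EXCERPT = [
    "CD44_stroma", "FOXP3_regulatory_T_cells", "CDX2_intestinal_epithelia", "CD8_cytotoxic_T_cells",
    "p53_tumor_suppressor", "GATA3_Th2_helper_T_cells", "CD45_hematopoietic_cells", "T-bet_Th1_cells",
    "beta-catenin_Wnt_signaling", "HLA-DR_MHC-II", "PD-L1_checkpoint", "Ki67_proliferation",
    "CD45RA_naive_T_cells", "CD4_T_helper_cells", "CD21_DCs", "MUC-1_epithelia", "CD30_costimulator",
    "CD2_T_cells", "Vimentin_cytoplasm", "CD20_B_cells",
]


def channel_index(name: str, channels=CODEX_CHANNELS_EXCERPT) -> int:
    """Return the index of a CODEX channel along the first axis of the stack."""
    if name not in channels:
        raise ValueError(f"'{name}' is not a known CODEX channel.")
    return channels.index(name)


print(channel_index("CD20_B_cells"))
```

With the full list, `stack[channel_index(name)]` should match the dataset stored under `f"raw/codex/{name}"`.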

def get_phenocell_data(path: Union[os.PathLike, str], split: Optional[Literal['train', 'valid', 'test']] = None, download: bool = False) -> str:

Download and preprocess the PhenoCell data.

Arguments:

path: Filepath to a folder where the downloaded data will be saved.
split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
download: Whether to download the data if it is not present.

Returns:

Filepath to the folder where the preprocessed data is stored.
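
The function only downloads and converts fields of view that do not yet have a converted '.h5' file. A sketch of that bookkeeping, using only the standard library: `plan_conversion` is a hypothetical helper and the sample names are made up (the real names come from the split CSV).

```python
import os
import tempfile
from pathlib import Path

SRC_HDF_DIR = "pathocell_hdf"  # folder of the source files in the HuggingFace repository


def plan_conversion(samples, preprocessed_dir):
    """Return the samples without a converted '.h5' file and their download patterns."""
    to_convert = [
        s for s in samples
        if not os.path.exists(os.path.join(preprocessed_dir, f"{Path(s).stem}.h5"))
    ]
    # These patterns restrict the snapshot download to the missing source files.
    patterns = [f"{SRC_HDF_DIR}/{s}" for s in to_convert]
    return to_convert, patterns


with tempfile.TemporaryDirectory() as preprocessed_dir:
    Path(preprocessed_dir, "fov_a.h5").touch()  # pretend 'fov_a' was converted earlier
    to_convert, patterns = plan_conversion(["fov_a.hdf", "fov_b.hdf"], preprocessed_dir)
    print(to_convert, patterns)
```

Calling the function again after a completed run therefore skips both the download and the conversion.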

def get_phenocell_paths(path: Union[os.PathLike, str], split: Optional[Literal['train', 'valid', 'test']] = None, download: bool = False) -> List[str]:

Get paths to the PhenoCell data.

Arguments:

path: Filepath to a folder where the downloaded data will be saved.
split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
download: Whether to download the data if it is not present.

Returns:

List of filepaths to the preprocessed HDF5 files.
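
The returned paths follow from the split CSV: rows are filtered on the 'train_test_val_split' column, the 'sample_name' values are sorted, and each name is mapped to '<stem>.h5' in the preprocessed folder. The sketch below mirrors this with the standard-library `csv` module instead of pandas (which the module itself uses); `samples_for_split` is a hypothetical stand-in and the CSV content is made up.

```python
import csv
import io
import os
from pathlib import Path


def samples_for_split(csv_file, split=None):
    """Return the sorted sample names of a split, or of all rows if split is None."""
    if split is not None and split not in ("train", "valid", "test"):
        raise ValueError(f"'{split}' is not a valid split choice. Use 'train', 'valid' or 'test'.")
    rows = csv.DictReader(csv_file)
    return sorted(
        row["sample_name"] for row in rows
        if split is None or row["train_test_val_split"] == split
    )


# Hypothetical split file content, for illustration only.
toy_csv = "sample_name,train_test_val_split\nfov_b.hdf,train\nfov_a.hdf,train\nfov_c.hdf,test\n"
train_samples = samples_for_split(io.StringIO(toy_csv), "train")
train_paths = [os.path.join("preprocessed", f"{Path(s).stem}.h5") for s in train_samples]
print(train_paths)
```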

def get_phenocell_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int], split: Optional[Literal['train', 'valid', 'test']] = None, label_choice: Literal['instances', 'semantic_coarse', 'semantic_fine'] = 'instances', modality: str = 'histopathology', download: bool = False, label_dtype: torch.dtype = torch.int64, resize_inputs: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the PhenoCell dataset for cell phenotyping in H&E stained histopathology images.

Arguments:

path: Filepath to a folder where the downloaded data will be saved.
patch_shape: The patch shape to use for training.
split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
label_choice: The label type. Either 'instances', 'semantic_coarse' (15-class) or 'semantic_fine' (30-class).
modality: The raw input. Either 'histopathology' (3-channel H&E), 'codex' (58-channel multiplexed stack) or the name of a single CODEX channel (see CODEX_CHANNELS), e.g. 'CD20_B_cells'.
download: Whether to download the data if it is not present.
label_dtype: The datatype of the labels.
resize_inputs: Whether to resize the input images.
kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.

Returns:

The segmentation dataset.
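
The `modality` argument decides which HDF5 key is read and whether the raw data is treated as multi-channel. The following sketch restates that rule in isolation; `resolve_modality` is a hypothetical helper and `CODEX_CHANNELS_EXCERPT` contains only three of the 58 channel names.

```python
MODALITY_KEYS = {"histopathology": "raw/histopathology/h&e", "codex": "raw/codex/all"}
# Excerpt of CODEX_CHANNELS, for illustration only.
CODEX_CHANNELS_EXCERPT = ["CD20_B_cells", "CD3_T_cells", "CD68_macrophages"]


def resolve_modality(modality: str, channels=CODEX_CHANNELS_EXCERPT):
    """Map a modality name to the pair (raw_key, with_channels)."""
    if modality in MODALITY_KEYS:
        # H&E and the full CODEX stack are stored as (C, H, W).
        return MODALITY_KEYS[modality], True
    if modality in channels:
        # A single CODEX channel is stored as a 2D (H, W) array.
        return f"raw/codex/{modality}", False
    raise ValueError(f"'{modality}' is not a valid modality. Use 'histopathology', 'codex' or a CODEX channel.")


print(resolve_modality("codex"))
print(resolve_modality("CD20_B_cells"))
```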

def get_phenocell_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int], batch_size: int, split: Optional[Literal['train', 'valid', 'test']] = None, label_choice: Literal['instances', 'semantic_coarse', 'semantic_fine'] = 'instances', modality: str = 'histopathology', download: bool = False, label_dtype: torch.dtype = torch.int64, resize_inputs: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the PhenoCell dataloader for cell phenotyping in H&E stained histopathology images.

Arguments:

path: Filepath to a folder where the downloaded data will be saved.
patch_shape: The patch shape to use for training.
batch_size: The batch size for training.
split: The split to use. Either 'train', 'valid', 'test' or None for all fields of view.
label_choice: The label type. Either 'instances', 'semantic_coarse' (15-class) or 'semantic_fine' (30-class).
modality: The raw input. Either 'histopathology' (3-channel H&E), 'codex' (58-channel multiplexed stack) or the name of a single CODEX channel (see CODEX_CHANNELS), e.g. 'CD20_B_cells'.
download: Whether to download the data if it is not present.
label_dtype: The datatype of the labels.
resize_inputs: Whether to resize the input images.
kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.

Returns:

The DataLoader.
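
A typical call might look like the sketch below. The wrapper `build_coarse_train_loader` is a hypothetical name, and the patch shape, batch size and `shuffle` setting are arbitrary illustration values, not recommendations from the dataset authors. The import is done inside the function because it requires torch_em (and huggingface_hub for the download).

```python
def build_coarse_train_loader(data_root: str):
    """Create a loader of H&E patches with coarse cell type labels from the 'train' split."""
    # Imported lazily: requires torch_em to be installed.
    from torch_em.data.datasets.histopathology.phenocell import get_phenocell_loader

    return get_phenocell_loader(
        path=data_root,
        patch_shape=(512, 512),
        batch_size=2,
        split="train",
        label_choice="semantic_coarse",
        modality="histopathology",
        download=True,   # fetch and convert the fields of view of this split if missing
        shuffle=True,    # not a dataset argument, so it is passed on to the PyTorch DataLoader
    )
```

Since `download=True` fetches every missing field of view of the chosen split, the first call can take a while and needs the disk space noted at the top of this page.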