torch_em.data.datasets.light_microscopy.mucic

The MUCIC (Masaryk University Cell Image Collection) datasets contain synthetic and real microscopy images (2D and 3D) for cell and nucleus segmentation benchmarking.

NOTE: Most of the datasets available at MUCIC are synthetic images (see detailed description below).

Available datasets:

  • Colon Tissue: 30 synthetic 3D images of human colon tissue with semantic segmentation labels
  • HL60 Cell Line: Synthetic 3D images of HL60 cells with instance segmentation labels
  • Granulocytes: Synthetic 3D images of granulocytes with instance segmentation labels
  • Vasculogenesis: Time-lapse 2D images of living cells with semantic segmentation labels
  • MDA231: 3D fluorescence images of MDA231 cells with full instance segmentation annotations

The datasets are from CBIA (Centre for Biomedical Image Analysis) at Masaryk University.

The data is located at https://cbia.fi.muni.cz/datasets/. The related publications are:

  • Colon Tissue: https://doi.org/10.1007/978-3-642-21593-3_4
  • HL60 Cell Line: https://doi.org/10.1002/cyto.a.20811
  • Granulocytes: https://doi.org/10.1002/cyto.a.20811
  • Vasculogenesis: https://doi.org/10.1109/ICIP.2016.7532871
  • MDA231: Cell Tracking Challenge (Fluo-C3DL-MDA231) with ISBI 2025 full annotations

Please cite the relevant publication if you use this dataset in your research.
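The public helpers follow the usual torch_em pattern: `get_mucic_data` downloads the zip archives, `get_mucic_paths` converts them into processed HDF5 files, and `get_mucic_dataset` / `get_mucic_loader` wrap those files for training. A minimal quick-start sketch (the local path and the patch shape are illustrative choices, not fixed values):

    from torch_em.data.datasets.light_microscopy.mucic import get_mucic_loader

    loader = get_mucic_loader(
        path="./data/mucic",           # where the data is downloaded / cached (example path)
        batch_size=1,
        patch_shape=(32, 128, 128),    # (Z, Y, X) patch for the 3D datasets
        cell_line="hl60",
        variant="low_c00",
        segmentation_type="instances",
        download=True,
    )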

  1"""The MUCIC (Masaryk University Cell Image Collection) datasets contain synthetic 3D
  2microscopy images for cell and nucleus segmentation benchmarking.
  3
  4NOTE: Most of the datasets available at MUCIC are synthetic images (see detailed description below).
  5
  6Available datasets:
  7- Colon Tissue: 30 synthetic 3D images of human colon tissue with semantic segmentation labels
  8- HL60 Cell Line: Synthetic 3D images of HL60 cells with instance segmentation labels
  9- Granulocytes: Synthetic 3D images of granulocytes with instance segmentation labels
 10- Vasculogenesis: Time-lapse 2D images of living cells with semantic segmentation labels
 11- MDA231: 3D fluorescence images of MDA231 cells with full instance segmentation annotations
 12
 13The datasets are from CBIA (Centre for Biomedical Image Analysis) at Masaryk University.
 14
 15The data is located at https://cbia.fi.muni.cz/datasets/.
 16- Colon Tissue: https://doi.org/10.1007/978-3-642-21593-3_4
 17- HL60 Cell Line: https://doi.org/10.1002/cyto.a.20811
 18- Granulocytes: https://doi.org/10.1002/cyto.a.20811
 19- Vasculogenesis: https://doi.org/10.1109/ICIP.2016.7532871
 20- MDA231: Cell Tracking Challenge (Fluo-C3DL-MDA231) with ISBI 2025 full annotations
 21Please cite the relevant publication if you use this dataset in your research.
 22"""
 23
 24import os
 25from glob import glob
 26from typing import Union, Tuple, List, Optional
 27
 28import numpy as np
 29
 30from torch.utils.data import Dataset, DataLoader
 31
 32import torch_em
 33
 34from .. import util
 35
 36
 37URLS = {
 38    "colon_tissue": {
 39        "low": "https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_LowNoise_3D_HDF5.zip",
 40        "high": "https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_HighNoise_3D_HDF5.zip",
 41    },
 42    "hl60": {
 43        "low_c00": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C00_3D_HDF5.zip",
 44        "low_c25": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C25_3D_HDF5.zip",
 45        "low_c50": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C50_3D_HDF5.zip",
 46        "low_c75": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C75_3D_HDF5.zip",
 47        "high_c00": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C00_3D_HDF5.zip",
 48        "high_c25": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C25_3D_HDF5.zip",
 49        "high_c50": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C50_3D_HDF5.zip",
 50        "high_c75": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C75_3D_HDF5.zip",
 51    },
 52    "granulocytes": {
 53        "low": "https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_LowNoise_3D_HDF5.zip",
 54        "high": "https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_HighNoise_3D_HDF5.zip",
 55    },
 56    "vasculogenesis": {
 57        "default": {
 58            "images": "https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-images.zip",
 59            "labels": "https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-labels.zip",
 60        },
 61    },
 62    "mda231": {
 63        "default": {
 64            "images": "https://data.celltrackingchallenge.net/training-datasets/Fluo-C3DL-MDA231.zip",
 65            "labels": "https://datasets.gryf.fi.muni.cz/isbi2025/Fluo-C3DL-MDA231_Full_Annotations.zip",
 66        },
 67    },
 68}
 69
 70CELL_LINES = list(URLS.keys())
 71
 72
 73def _get_variants(cell_line):
 74    return list(URLS[cell_line].keys())
 75
 76
 77# Cell lines with semantic labels that need connected components for instance segmentation
 78_SEMANTIC_LABEL_CELL_LINES = ["colon_tissue", "vasculogenesis"]
 79
 80# Cell lines with separate image/label zip files
 81_SEPARATE_ZIPS_CELL_LINES = ["vasculogenesis", "mda231"]
 82
 83# Cell lines that are 2D (others are 3D)
 84_2D_CELL_LINES = ["vasculogenesis"]
 85
 86
 87def _create_mucic_h5(path, cell_line, variant):
 88    """Create processed h5 files from raw and label files."""
 89    import h5py
 90    from tqdm import tqdm
 91
 92    data_dir = os.path.join(path, cell_line, variant)
 93    h5_out_dir = os.path.join(path, cell_line, "processed", variant)
 94    os.makedirs(h5_out_dir, exist_ok=True)
 95
 96    # Find all raw files (image-final_*.h5)
 97    raw_files = sorted(glob(os.path.join(data_dir, "**", "image-final_*.h5"), recursive=True))
 98    if not raw_files:
 99        raw_files = sorted(glob(os.path.join(data_dir, "**", "image-final_*.hdf5"), recursive=True))
100
101    needs_connected_components = cell_line in _SEMANTIC_LABEL_CELL_LINES
102
103    for raw_path in tqdm(raw_files, desc=f"Processing {cell_line} {variant} data"):
104        # Find corresponding label file
105        label_path = raw_path.replace("image-final_", "image-labels_")
106        if not os.path.exists(label_path):
107            continue
108
109        # Get output filename
110        fname = os.path.basename(raw_path)
111        out_fname = fname.replace("image-final_", f"{cell_line}_").replace(".hdf5", ".h5")
112        out_path = os.path.join(h5_out_dir, out_fname)
113
114        if os.path.exists(out_path):
115            continue
116
117        with h5py.File(raw_path, "r") as f:
118            raw = f["Image"][:]
119
120        with h5py.File(label_path, "r") as f:
121            labels = f["Image"][:]
122
123        # Convert semantic labels to instance labels if needed
124        if needs_connected_components:
125            from skimage.measure import label
126            instances = label(labels > 0).astype("int64")
127        else:
128            instances = labels.astype("int64")
129
130        with h5py.File(out_path, "w") as f:
131            f.create_dataset("raw", data=raw, compression="gzip")
132            f.create_dataset("labels/instances", data=instances, compression="gzip")
133            f.create_dataset("labels/semantic", data=(labels > 0).astype("uint8"), compression="gzip")
134
135    return h5_out_dir
136
137
138def _semantic_to_instances_watershed(semantic_mask, erosion_iterations=2):
139    """Convert semantic mask to instance labels using erosion + watershed.
140
141    This handles cases where cells are touching by a few pixels:
142    1. Erode the mask to separate touching cells
143    2. Run connected components on eroded mask to get seed labels
144    3. Use watershed to expand seeds back to original mask boundaries
145    """
146    from scipy.ndimage import binary_erosion, distance_transform_edt
147    from skimage.measure import label
148    from skimage.segmentation import watershed
149
150    binary_mask = semantic_mask > 0
151
152    # Erode to separate touching cells
153    eroded = binary_erosion(binary_mask, iterations=erosion_iterations)
154
155    # Get seed labels from eroded mask
156    seeds = label(eroded)
157
158    # Use watershed to expand seeds to fill original mask
159    # Distance transform gives us the "landscape" for watershed
160    distance = distance_transform_edt(binary_mask)
161    instances = watershed(-distance, seeds, mask=binary_mask)
162
163    return instances.astype("int64")
164
165
166def _create_vasculogenesis_h5(path, variant):
167    """Create processed h5 files for vasculogenesis from separate image/label PNG directories."""
168    import h5py
169    import imageio.v2 as imageio
170    from tqdm import tqdm
171
172    data_dir = os.path.join(path, "vasculogenesis", variant)
173    h5_out_dir = os.path.join(path, "vasculogenesis", "processed", variant)
174    os.makedirs(h5_out_dir, exist_ok=True)
175
176    # Find image and label directories
177    images_dir = os.path.join(data_dir, "images")
178    labels_dir = os.path.join(data_dir, "labels")
179
180    # Find all PNG image files (image_XXXX.png)
181    raw_files = sorted(glob(os.path.join(images_dir, "*.png")))
182
183    for raw_path in tqdm(raw_files, desc=f"Processing vasculogenesis {variant} data"):
184        # Find corresponding label file (pairs of image_XXXX.png and mask_XXXX.png)
185        fname = os.path.basename(raw_path)
186        label_fname = fname.replace("image_", "mask_")
187        label_path = os.path.join(labels_dir, label_fname)
188
189        if not os.path.exists(label_path):
190            continue
191
192        # Output filename
193        file_id = fname.replace("image_", "").replace(".png", "")
194        out_fname = f"vasculogenesis_{file_id}.h5"
195        out_path = os.path.join(h5_out_dir, out_fname)
196
197        if os.path.exists(out_path):
198            continue
199
200        raw = imageio.imread(raw_path)
201        labels_data = imageio.imread(label_path)
202
203        # Convert semantic labels to instance labels using erosion + watershed
204        instances = _semantic_to_instances_watershed(labels_data)
205
206        with h5py.File(out_path, "w") as f:
207            f.create_dataset("raw", data=raw, compression="gzip")
208            f.create_dataset("labels/instances", data=instances, compression="gzip")
209            f.create_dataset("labels/semantic", data=(labels_data > 0).astype("uint8"), compression="gzip")
210
211    return h5_out_dir
212
213
214def _create_mda231_h5(path, variant):
215    """Create processed h5 files for MDA231 from CTC data with full annotations."""
216    import h5py
217    import tifffile
218    from tqdm import tqdm
219
220    data_dir = os.path.join(path, "mda231", variant)
221    h5_out_dir = os.path.join(path, "mda231", "processed", variant)
222    os.makedirs(h5_out_dir, exist_ok=True)
223
224    # Directory structure after unzip:
225    # images/ -> Fluo-C3DL-MDA231/01/, Fluo-C3DL-MDA231/02/
226    # labels/ -> Fluo-C3DL-MDA231_Full_Annotations/S01_FA_MV/S01_FA_A1/, S02_FA_A1/
227    images_base = os.path.join(data_dir, "images", "Fluo-C3DL-MDA231")
228    labels_base = os.path.join(data_dir, "labels", "Fluo-C3DL-MDA231_Full_Annotations")
229
230    # Map sequences to their annotation directories
231    seq_to_labels = {
232        "01": os.path.join(labels_base, "S01_FA_MV", "S01_FA_A1"),
233        "02": os.path.join(labels_base, "S02_FA_A1"),
234    }
235
236    for seq_id, labels_dir in seq_to_labels.items():
237        images_dir = os.path.join(images_base, seq_id)
238
239        if not os.path.exists(images_dir) or not os.path.exists(labels_dir):
240            continue
241
242        # Find all raw TIFF files (t000.tif, t001.tif, ...)
243        raw_files = sorted(glob(os.path.join(images_dir, "t*.tif")))
244
245        for raw_path in tqdm(raw_files, desc=f"Processing MDA231 seq {seq_id}"):
246            # Map t000.tif -> man_seg_full000.tif
247            fname = os.path.basename(raw_path)
248            time_id = fname.replace(".tif", "").replace("t", "")
249            label_fname = f"man_seg_full{time_id}.tif"
250            label_path = os.path.join(labels_dir, label_fname)
251
252            if not os.path.exists(label_path):
253                continue
254
255            out_fname = f"mda231_{seq_id}_{time_id}.h5"
256            out_path = os.path.join(h5_out_dir, out_fname)
257
258            if os.path.exists(out_path):
259                continue
260
261            raw = tifffile.imread(raw_path)
262            labels = tifffile.imread(label_path).astype("int64")
263
264            with h5py.File(out_path, "w") as f:
265                f.create_dataset("raw", data=raw, compression="gzip")
266                f.create_dataset("labels/instances", data=labels, compression="gzip")
267                f.create_dataset("labels/semantic", data=(labels > 0).astype("uint8"), compression="gzip")
268
269    return h5_out_dir
270
271
272def get_mucic_data(
273    path: Union[os.PathLike, str],
274    cell_line: str,
275    variant: Optional[Union[str, List[str]]] = None,
276    download: bool = False,
277) -> str:
278    """Download the MUCIC dataset for a specific cell line.
279
280    Args:
281        path: Filepath to a folder where the downloaded data will be saved.
282        cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
283        variant: The dataset variant(s).
284            For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels).
285            For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'),
286            e.g. 'low_c00'.
  287            For 'vasculogenesis' and 'mda231': 'default'.
288            If None, downloads all variants for the selected cell line.
289        download: Whether to download the data if it is not present.
290
291    Returns:
292        The filepath to the dataset directory.
293    """
294    assert cell_line in CELL_LINES, f"'{cell_line}' is not valid. Choose from {CELL_LINES}."
295
296    valid_variants = _get_variants(cell_line)
297    if variant is None:
298        variant = valid_variants
299    elif isinstance(variant, str):
300        variant = [variant]
301
302    for v in variant:
303        assert v in valid_variants, f"'{v}' is not valid for '{cell_line}'. Choose from {valid_variants}."
304
305        data_dir = os.path.join(path, cell_line, v)
306
307        # Check if data already exists - different file types for different datasets
308        if cell_line == "mda231":
309            file_pattern = "*.tif"
310        elif cell_line == "vasculogenesis":
311            file_pattern = "*.png"
312        else:
313            file_pattern = "*.h5"
314
315        if os.path.exists(data_dir) and len(glob(os.path.join(data_dir, "**", file_pattern), recursive=True)) > 0:
316            continue
317
318        os.makedirs(data_dir, exist_ok=True)
319
320        # Handle cell lines with separate image/label zip files
321        if cell_line in _SEPARATE_ZIPS_CELL_LINES:
322            urls = URLS[cell_line][v]
323            # Download and extract images
324            images_zip = os.path.join(path, f"{cell_line}_{v}_images.zip")
325            util.download_source(path=images_zip, url=urls["images"], download=download, checksum=None)
326            util.unzip(zip_path=images_zip, dst=os.path.join(data_dir, "images"), remove=False)
327            # Download and extract labels
328            labels_zip = os.path.join(path, f"{cell_line}_{v}_labels.zip")
329            util.download_source(path=labels_zip, url=urls["labels"], download=download, checksum=None)
330            util.unzip(zip_path=labels_zip, dst=os.path.join(data_dir, "labels"), remove=False)
331        else:
332            zip_path = os.path.join(path, f"{cell_line}_{v}.zip")
333            util.download_source(path=zip_path, url=URLS[cell_line][v], download=download, checksum=None)
334            util.unzip(zip_path=zip_path, dst=data_dir, remove=False)
335
336    return os.path.join(path, cell_line)
337
338
339def get_mucic_paths(
340    path: Union[os.PathLike, str],
341    cell_line: str,
342    variant: Optional[Union[str, List[str]]] = None,
343    download: bool = False,
344) -> List[str]:
345    """Get paths to the MUCIC data for a specific cell line.
346
347    Args:
348        path: Filepath to a folder where the downloaded data will be saved.
349        cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
350        variant: The dataset variant(s). If None, uses all variants.
351        download: Whether to download the data if it is not present.
352
353    Returns:
354        List of filepaths for the processed h5 data.
355    """
356    from natsort import natsorted
357
358    assert cell_line in CELL_LINES, f"'{cell_line}' is not valid. Choose from {CELL_LINES}."
359
360    get_mucic_data(path, cell_line, variant, download)
361
362    valid_variants = _get_variants(cell_line)
363    if variant is None:
364        variant = valid_variants
365    elif isinstance(variant, str):
366        variant = [variant]
367
368    all_h5_paths = []
369    for v in variant:
370        h5_out_dir = os.path.join(path, cell_line, "processed", v)
371
372        # Process data if not already done
373        if not os.path.exists(h5_out_dir) or len(glob(os.path.join(h5_out_dir, "*.h5"))) == 0:
374            if cell_line == "vasculogenesis":
375                _create_vasculogenesis_h5(path, v)
376            elif cell_line == "mda231":
377                _create_mda231_h5(path, v)
378            else:
379                _create_mucic_h5(path, cell_line, v)
380
381        h5_paths = glob(os.path.join(h5_out_dir, "*.h5"))
382        all_h5_paths.extend(h5_paths)
383
384    assert len(all_h5_paths) > 0, f"No data found for cell_line '{cell_line}', variant '{variant}'"
385
386    return natsorted(all_h5_paths)
387
388
389def get_mucic_dataset(
390    path: Union[os.PathLike, str],
391    patch_shape: Tuple[int, int, int],
392    cell_line: str,
393    variant: Optional[Union[str, List[str]]] = None,
394    segmentation_type: str = "instances",
395    download: bool = False,
396    **kwargs
397) -> Dataset:
398    """Get the MUCIC dataset for cell segmentation.
399
400    Args:
401        path: Filepath to a folder where the downloaded data will be saved.
402        patch_shape: The patch shape to use for training.
403        cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
404        variant: The dataset variant(s).
405            For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels).
406            For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'),
407            e.g. 'low_c00'.
408            For 'vasculogenesis' and 'mda231': 'default'.
409            If None, uses all variants for the selected cell line.
410        segmentation_type: The type of segmentation labels to use.
411            One of 'instances' or 'semantic' (binary mask).
412        download: Whether to download the data if it is not present.
413        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.
414
415    Returns:
416        The segmentation dataset.
417    """
418    assert segmentation_type in ("instances", "semantic"), \
419        f"'{segmentation_type}' is not valid. Choose from 'instances' or 'semantic'."
420
421    h5_paths = get_mucic_paths(path, cell_line, variant, download)
422
423    label_key = f"labels/{segmentation_type}"
424
425    kwargs, _ = util.add_instance_label_transform(
426        kwargs, add_binary_target=True, label_dtype=np.int64,
427    )
428
429    # Determine dimensionality based on cell line
430    ndim = 2 if cell_line in _2D_CELL_LINES else 3
431
432    return torch_em.default_segmentation_dataset(
433        raw_paths=h5_paths,
434        raw_key="raw",
435        label_paths=h5_paths,
436        label_key=label_key,
437        patch_shape=patch_shape,
438        ndim=ndim,
439        **kwargs
440    )
441
442
443def get_mucic_loader(
444    path: Union[os.PathLike, str],
445    batch_size: int,
446    patch_shape: Tuple[int, int, int],
447    cell_line: str,
448    variant: Optional[Union[str, List[str]]] = None,
449    segmentation_type: str = "instances",
450    download: bool = False,
451    **kwargs
452) -> DataLoader:
453    """Get the MUCIC dataloader for cell segmentation.
454
455    Args:
456        path: Filepath to a folder where the downloaded data will be saved.
457        batch_size: The batch size for training.
458        patch_shape: The patch shape to use for training.
459        cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
460        variant: The dataset variant(s).
461            For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels).
462            For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'),
463            e.g. 'low_c00'.
464            For 'vasculogenesis' and 'mda231': 'default'.
465            If None, uses all variants for the selected cell line.
466        segmentation_type: The type of segmentation labels to use.
467            One of 'instances' or 'semantic' (binary mask).
468        download: Whether to download the data if it is not present.
469        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.
470
471    Returns:
472        The DataLoader.
473    """
474    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
475    dataset = get_mucic_dataset(
476        path=path,
477        patch_shape=patch_shape,
478        cell_line=cell_line,
479        variant=variant,
480        segmentation_type=segmentation_type,
481        download=download,
482        **ds_kwargs,
483    )
484    return torch_em.get_data_loader(dataset=dataset, batch_size=batch_size, **loader_kwargs)
URLS = {'colon_tissue': {'low': 'https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_LowNoise_3D_HDF5.zip', 'high': 'https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_HighNoise_3D_HDF5.zip'}, 'hl60': {'low_c00': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C00_3D_HDF5.zip', 'low_c25': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C25_3D_HDF5.zip', 'low_c50': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C50_3D_HDF5.zip', 'low_c75': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C75_3D_HDF5.zip', 'high_c00': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C00_3D_HDF5.zip', 'high_c25': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C25_3D_HDF5.zip', 'high_c50': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C50_3D_HDF5.zip', 'high_c75': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C75_3D_HDF5.zip'}, 'granulocytes': {'low': 'https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_LowNoise_3D_HDF5.zip', 'high': 'https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_HighNoise_3D_HDF5.zip'}, 'vasculogenesis': {'default': {'images': 'https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-images.zip', 'labels': 'https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-labels.zip'}}, 'mda231': {'default': {'images': 'https://data.celltrackingchallenge.net/training-datasets/Fluo-C3DL-MDA231.zip', 'labels': 'https://datasets.gryf.fi.muni.cz/isbi2025/Fluo-C3DL-MDA231_Full_Annotations.zip'}}}
CELL_LINES = ['colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', 'mda231']
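The valid `variant` values for each cell line are simply the keys of `URLS`. A short sketch to list them (assuming the module is importable under the path shown in the header):

    from torch_em.data.datasets.light_microscopy.mucic import URLS

    for cell_line, variants in URLS.items():
        print(f"{cell_line}: {sorted(variants)}")
    # e.g. colon_tissue: ['high', 'low'] and hl60: ['high_c00', 'high_c25', ..., 'low_c75']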
def get_mucic_data( path: Union[os.PathLike, str], cell_line: str, variant: Union[List[str], str, NoneType] = None, download: bool = False) -> str:

Download the MUCIC dataset for a specific cell line.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
  • variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, downloads all variants for the selected cell line.
  • download: Whether to download the data if it is not present.
Returns:

The filepath to the dataset directory.
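For example, a sketch of downloading a single HL60 variant (the target folder is an arbitrary example):

    from torch_em.data.datasets.light_microscopy.mucic import get_mucic_data

    data_root = get_mucic_data(path="./data/mucic", cell_line="hl60", variant="low_c00", download=True)
    print(data_root)  # ./data/mucic/hl60 - the folder that holds the downloaded variant(s)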

def get_mucic_paths( path: Union[os.PathLike, str], cell_line: str, variant: Union[List[str], str, NoneType] = None, download: bool = False) -> List[str]:

Get paths to the MUCIC data for a specific cell line.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
  • variant: The dataset variant(s). If None, uses all variants.
  • download: Whether to download the data if it is not present.
Returns:

List of filepaths for the processed h5 data.
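Each returned file stores the raw image under `raw` and the labels under `labels/instances` and `labels/semantic`. A small sketch that fetches the paths and inspects the first file (the folder name is an example):

    import h5py
    from torch_em.data.datasets.light_microscopy.mucic import get_mucic_paths

    paths = get_mucic_paths("./data/mucic", cell_line="colon_tissue", variant="low", download=True)
    print(len(paths), "processed files")

    with h5py.File(paths[0], "r") as f:
        print(f["raw"].shape, f["labels/instances"].shape, f["labels/semantic"].dtype)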

def get_mucic_dataset( path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], cell_line: str, variant: Union[List[str], str, NoneType] = None, segmentation_type: str = 'instances', download: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the MUCIC dataset for cell segmentation.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • patch_shape: The patch shape to use for training.
  • cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
  • variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, uses all variants for the selected cell line.
  • segmentation_type: The type of segmentation labels to use. One of 'instances' or 'semantic' (binary mask).
  • download: Whether to download the data if it is not present.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
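A sketch of building a 3D training dataset (the patch shape and variant are illustrative; torch_em's segmentation datasets yield `(raw, label)` patch pairs when indexed):

    from torch_em.data.datasets.light_microscopy.mucic import get_mucic_dataset

    ds = get_mucic_dataset(
        path="./data/mucic",
        patch_shape=(32, 128, 128),   # 3D patch; a 2D shape would be used for 'vasculogenesis'
        cell_line="granulocytes",
        variant="low",
        segmentation_type="instances",
        download=True,
    )
    raw, labels = ds[0]
    print(raw.shape, labels.shape)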

def get_mucic_loader( path: Union[os.PathLike, str], batch_size: int, patch_shape: Tuple[int, int, int], cell_line: str, variant: Union[List[str], str, NoneType] = None, segmentation_type: str = 'instances', download: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the MUCIC dataloader for cell segmentation.

Arguments:
  • path: Filepath to a folder where the downloaded data will be saved.
  • batch_size: The batch size for training.
  • patch_shape: The patch shape to use for training.
  • cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
  • variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, uses all variants for the selected cell line.
  • segmentation_type: The type of segmentation labels to use. One of 'instances' or 'semantic' (binary mask).
  • download: Whether to download the data if it is not present.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
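A final sketch for the 2D vasculogenesis data with semantic (binary) labels; the 2-tuple patch shape and the loader settings are illustrative assumptions, and extra keyword arguments such as `num_workers` are forwarded to the PyTorch DataLoader via `util.split_kwargs`:

    from torch_em.data.datasets.light_microscopy.mucic import get_mucic_loader

    loader = get_mucic_loader(
        path="./data/mucic",
        batch_size=4,
        patch_shape=(256, 256),        # 2D patch, since the vasculogenesis images are 2D
        cell_line="vasculogenesis",
        segmentation_type="semantic",
        download=True,
        num_workers=2,                 # goes to the DataLoader, not the dataset
    )

    x, y = next(iter(loader))          # inspect one training batch
    print(x.shape, y.shape)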