torch_em.data.datasets.light_microscopy.mucic
The MUCIC (Masaryk University Cell Image Collection) datasets contain 2D and 3D microscopy images for cell and nucleus segmentation benchmarking.
NOTE: Most of the datasets available at MUCIC are synthetic images (see detailed description below).
Available datasets:
- Colon Tissue: 30 synthetic 3D images of human colon tissue with semantic segmentation labels
- HL60 Cell Line: Synthetic 3D images of HL60 cells with instance segmentation labels
- Granulocytes: Synthetic 3D images of granulocytes with instance segmentation labels
- Vasculogenesis: Time-lapse 2D images of living cells with semantic segmentation labels
- MDA231: 3D fluorescence images of MDA231 cells with full instance segmentation annotations
The datasets are from CBIA (Centre for Biomedical Image Analysis) at Masaryk University.
The data is located at https://cbia.fi.muni.cz/datasets/.
- Colon Tissue: https://doi.org/10.1007/978-3-642-21593-3_4
- HL60 Cell Line: https://doi.org/10.1002/cyto.a.20811
- Granulocytes: https://doi.org/10.1002/cyto.a.20811
- Vasculogenesis: https://doi.org/10.1109/ICIP.2016.7532871
- MDA231: Cell Tracking Challenge (Fluo-C3DL-MDA231) with ISBI 2025 full annotations

Please cite the relevant publication if you use this dataset in your research.
1"""The MUCIC (Masaryk University Cell Image Collection) datasets contain synthetic 3D 2microscopy images for cell and nucleus segmentation benchmarking. 3 4NOTE: Most of the datasets available at MUCIC are synthetic images (see detailed description below). 5 6Available datasets: 7- Colon Tissue: 30 synthetic 3D images of human colon tissue with semantic segmentation labels 8- HL60 Cell Line: Synthetic 3D images of HL60 cells with instance segmentation labels 9- Granulocytes: Synthetic 3D images of granulocytes with instance segmentation labels 10- Vasculogenesis: Time-lapse 2D images of living cells with semantic segmentation labels 11- MDA231: 3D fluorescence images of MDA231 cells with full instance segmentation annotations 12 13The datasets are from CBIA (Centre for Biomedical Image Analysis) at Masaryk University. 14 15The data is located at https://cbia.fi.muni.cz/datasets/. 16- Colon Tissue: https://doi.org/10.1007/978-3-642-21593-3_4 17- HL60 Cell Line: https://doi.org/10.1002/cyto.a.20811 18- Granulocytes: https://doi.org/10.1002/cyto.a.20811 19- Vasculogenesis: https://doi.org/10.1109/ICIP.2016.7532871 20- MDA231: Cell Tracking Challenge (Fluo-C3DL-MDA231) with ISBI 2025 full annotations 21Please cite the relevant publication if you use this dataset in your research. 22""" 23 24import os 25from glob import glob 26from typing import Union, Tuple, List, Optional 27 28import numpy as np 29 30from torch.utils.data import Dataset, DataLoader 31 32import torch_em 33 34from .. import util 35 36 37URLS = { 38 "colon_tissue": { 39 "low": "https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_LowNoise_3D_HDF5.zip", 40 "high": "https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_HighNoise_3D_HDF5.zip", 41 }, 42 "hl60": { 43 "low_c00": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C00_3D_HDF5.zip", 44 "low_c25": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C25_3D_HDF5.zip", 45 "low_c50": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C50_3D_HDF5.zip", 46 "low_c75": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C75_3D_HDF5.zip", 47 "high_c00": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C00_3D_HDF5.zip", 48 "high_c25": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C25_3D_HDF5.zip", 49 "high_c50": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C50_3D_HDF5.zip", 50 "high_c75": "https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C75_3D_HDF5.zip", 51 }, 52 "granulocytes": { 53 "low": "https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_LowNoise_3D_HDF5.zip", 54 "high": "https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_HighNoise_3D_HDF5.zip", 55 }, 56 "vasculogenesis": { 57 "default": { 58 "images": "https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-images.zip", 59 "labels": "https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-labels.zip", 60 }, 61 }, 62 "mda231": { 63 "default": { 64 "images": "https://data.celltrackingchallenge.net/training-datasets/Fluo-C3DL-MDA231.zip", 65 "labels": "https://datasets.gryf.fi.muni.cz/isbi2025/Fluo-C3DL-MDA231_Full_Annotations.zip", 66 }, 67 }, 68} 69 70CELL_LINES = list(URLS.keys()) 71 72 73def _get_variants(cell_line): 74 return list(URLS[cell_line].keys()) 75 76 77# Cell lines with semantic labels that need connected components for instance segmentation 78_SEMANTIC_LABEL_CELL_LINES = ["colon_tissue", "vasculogenesis"] 79 80# Cell lines with separate image/label zip files 81_SEPARATE_ZIPS_CELL_LINES = 
["vasculogenesis", "mda231"] 82 83# Cell lines that are 2D (others are 3D) 84_2D_CELL_LINES = ["vasculogenesis"] 85 86 87def _create_mucic_h5(path, cell_line, variant): 88 """Create processed h5 files from raw and label files.""" 89 import h5py 90 from tqdm import tqdm 91 92 data_dir = os.path.join(path, cell_line, variant) 93 h5_out_dir = os.path.join(path, cell_line, "processed", variant) 94 os.makedirs(h5_out_dir, exist_ok=True) 95 96 # Find all raw files (image-final_*.h5) 97 raw_files = sorted(glob(os.path.join(data_dir, "**", "image-final_*.h5"), recursive=True)) 98 if not raw_files: 99 raw_files = sorted(glob(os.path.join(data_dir, "**", "image-final_*.hdf5"), recursive=True)) 100 101 needs_connected_components = cell_line in _SEMANTIC_LABEL_CELL_LINES 102 103 for raw_path in tqdm(raw_files, desc=f"Processing {cell_line} {variant} data"): 104 # Find corresponding label file 105 label_path = raw_path.replace("image-final_", "image-labels_") 106 if not os.path.exists(label_path): 107 continue 108 109 # Get output filename 110 fname = os.path.basename(raw_path) 111 out_fname = fname.replace("image-final_", f"{cell_line}_").replace(".hdf5", ".h5") 112 out_path = os.path.join(h5_out_dir, out_fname) 113 114 if os.path.exists(out_path): 115 continue 116 117 with h5py.File(raw_path, "r") as f: 118 raw = f["Image"][:] 119 120 with h5py.File(label_path, "r") as f: 121 labels = f["Image"][:] 122 123 # Convert semantic labels to instance labels if needed 124 if needs_connected_components: 125 from skimage.measure import label 126 instances = label(labels > 0).astype("int64") 127 else: 128 instances = labels.astype("int64") 129 130 with h5py.File(out_path, "w") as f: 131 f.create_dataset("raw", data=raw, compression="gzip") 132 f.create_dataset("labels/instances", data=instances, compression="gzip") 133 f.create_dataset("labels/semantic", data=(labels > 0).astype("uint8"), compression="gzip") 134 135 return h5_out_dir 136 137 138def _semantic_to_instances_watershed(semantic_mask, erosion_iterations=2): 139 """Convert semantic mask to instance labels using erosion + watershed. 140 141 This handles cases where cells are touching by a few pixels: 142 1. Erode the mask to separate touching cells 143 2. Run connected components on eroded mask to get seed labels 144 3. 
Use watershed to expand seeds back to original mask boundaries 145 """ 146 from scipy.ndimage import binary_erosion, distance_transform_edt 147 from skimage.measure import label 148 from skimage.segmentation import watershed 149 150 binary_mask = semantic_mask > 0 151 152 # Erode to separate touching cells 153 eroded = binary_erosion(binary_mask, iterations=erosion_iterations) 154 155 # Get seed labels from eroded mask 156 seeds = label(eroded) 157 158 # Use watershed to expand seeds to fill original mask 159 # Distance transform gives us the "landscape" for watershed 160 distance = distance_transform_edt(binary_mask) 161 instances = watershed(-distance, seeds, mask=binary_mask) 162 163 return instances.astype("int64") 164 165 166def _create_vasculogenesis_h5(path, variant): 167 """Create processed h5 files for vasculogenesis from separate image/label PNG directories.""" 168 import h5py 169 import imageio.v2 as imageio 170 from tqdm import tqdm 171 172 data_dir = os.path.join(path, "vasculogenesis", variant) 173 h5_out_dir = os.path.join(path, "vasculogenesis", "processed", variant) 174 os.makedirs(h5_out_dir, exist_ok=True) 175 176 # Find image and label directories 177 images_dir = os.path.join(data_dir, "images") 178 labels_dir = os.path.join(data_dir, "labels") 179 180 # Find all PNG image files (image_XXXX.png) 181 raw_files = sorted(glob(os.path.join(images_dir, "*.png"))) 182 183 for raw_path in tqdm(raw_files, desc=f"Processing vasculogenesis {variant} data"): 184 # Find corresponding label file (pairs of image_XXXX.png and mask_XXXX.png) 185 fname = os.path.basename(raw_path) 186 label_fname = fname.replace("image_", "mask_") 187 label_path = os.path.join(labels_dir, label_fname) 188 189 if not os.path.exists(label_path): 190 continue 191 192 # Output filename 193 file_id = fname.replace("image_", "").replace(".png", "") 194 out_fname = f"vasculogenesis_{file_id}.h5" 195 out_path = os.path.join(h5_out_dir, out_fname) 196 197 if os.path.exists(out_path): 198 continue 199 200 raw = imageio.imread(raw_path) 201 labels_data = imageio.imread(label_path) 202 203 # Convert semantic labels to instance labels using erosion + watershed 204 instances = _semantic_to_instances_watershed(labels_data) 205 206 with h5py.File(out_path, "w") as f: 207 f.create_dataset("raw", data=raw, compression="gzip") 208 f.create_dataset("labels/instances", data=instances, compression="gzip") 209 f.create_dataset("labels/semantic", data=(labels_data > 0).astype("uint8"), compression="gzip") 210 211 return h5_out_dir 212 213 214def _create_mda231_h5(path, variant): 215 """Create processed h5 files for MDA231 from CTC data with full annotations.""" 216 import h5py 217 import tifffile 218 from tqdm import tqdm 219 220 data_dir = os.path.join(path, "mda231", variant) 221 h5_out_dir = os.path.join(path, "mda231", "processed", variant) 222 os.makedirs(h5_out_dir, exist_ok=True) 223 224 # Directory structure after unzip: 225 # images/ -> Fluo-C3DL-MDA231/01/, Fluo-C3DL-MDA231/02/ 226 # labels/ -> Fluo-C3DL-MDA231_Full_Annotations/S01_FA_MV/S01_FA_A1/, S02_FA_A1/ 227 images_base = os.path.join(data_dir, "images", "Fluo-C3DL-MDA231") 228 labels_base = os.path.join(data_dir, "labels", "Fluo-C3DL-MDA231_Full_Annotations") 229 230 # Map sequences to their annotation directories 231 seq_to_labels = { 232 "01": os.path.join(labels_base, "S01_FA_MV", "S01_FA_A1"), 233 "02": os.path.join(labels_base, "S02_FA_A1"), 234 } 235 236 for seq_id, labels_dir in seq_to_labels.items(): 237 images_dir = os.path.join(images_base, seq_id) 
238 239 if not os.path.exists(images_dir) or not os.path.exists(labels_dir): 240 continue 241 242 # Find all raw TIFF files (t000.tif, t001.tif, ...) 243 raw_files = sorted(glob(os.path.join(images_dir, "t*.tif"))) 244 245 for raw_path in tqdm(raw_files, desc=f"Processing MDA231 seq {seq_id}"): 246 # Map t000.tif -> man_seg_full000.tif 247 fname = os.path.basename(raw_path) 248 time_id = fname.replace(".tif", "").replace("t", "") 249 label_fname = f"man_seg_full{time_id}.tif" 250 label_path = os.path.join(labels_dir, label_fname) 251 252 if not os.path.exists(label_path): 253 continue 254 255 out_fname = f"mda231_{seq_id}_{time_id}.h5" 256 out_path = os.path.join(h5_out_dir, out_fname) 257 258 if os.path.exists(out_path): 259 continue 260 261 raw = tifffile.imread(raw_path) 262 labels = tifffile.imread(label_path).astype("int64") 263 264 with h5py.File(out_path, "w") as f: 265 f.create_dataset("raw", data=raw, compression="gzip") 266 f.create_dataset("labels/instances", data=labels, compression="gzip") 267 f.create_dataset("labels/semantic", data=(labels > 0).astype("uint8"), compression="gzip") 268 269 return h5_out_dir 270 271 272def get_mucic_data( 273 path: Union[os.PathLike, str], 274 cell_line: str, 275 variant: Optional[Union[str, List[str]]] = None, 276 download: bool = False, 277) -> str: 278 """Download the MUCIC dataset for a specific cell line. 279 280 Args: 281 path: Filepath to a folder where the downloaded data will be saved. 282 cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'. 283 variant: The dataset variant(s). 284 For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). 285 For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), 286 e.g. 'low_c00'. 287 For 'vasculogenesis': 'default'. 288 If None, downloads all variants for the selected cell line. 289 download: Whether to download the data if it is not present. 290 291 Returns: 292 The filepath to the dataset directory. 293 """ 294 assert cell_line in CELL_LINES, f"'{cell_line}' is not valid. Choose from {CELL_LINES}." 295 296 valid_variants = _get_variants(cell_line) 297 if variant is None: 298 variant = valid_variants 299 elif isinstance(variant, str): 300 variant = [variant] 301 302 for v in variant: 303 assert v in valid_variants, f"'{v}' is not valid for '{cell_line}'. Choose from {valid_variants}." 
304 305 data_dir = os.path.join(path, cell_line, v) 306 307 # Check if data already exists - different file types for different datasets 308 if cell_line == "mda231": 309 file_pattern = "*.tif" 310 elif cell_line == "vasculogenesis": 311 file_pattern = "*.png" 312 else: 313 file_pattern = "*.h5" 314 315 if os.path.exists(data_dir) and len(glob(os.path.join(data_dir, "**", file_pattern), recursive=True)) > 0: 316 continue 317 318 os.makedirs(data_dir, exist_ok=True) 319 320 # Handle cell lines with separate image/label zip files 321 if cell_line in _SEPARATE_ZIPS_CELL_LINES: 322 urls = URLS[cell_line][v] 323 # Download and extract images 324 images_zip = os.path.join(path, f"{cell_line}_{v}_images.zip") 325 util.download_source(path=images_zip, url=urls["images"], download=download, checksum=None) 326 util.unzip(zip_path=images_zip, dst=os.path.join(data_dir, "images"), remove=False) 327 # Download and extract labels 328 labels_zip = os.path.join(path, f"{cell_line}_{v}_labels.zip") 329 util.download_source(path=labels_zip, url=urls["labels"], download=download, checksum=None) 330 util.unzip(zip_path=labels_zip, dst=os.path.join(data_dir, "labels"), remove=False) 331 else: 332 zip_path = os.path.join(path, f"{cell_line}_{v}.zip") 333 util.download_source(path=zip_path, url=URLS[cell_line][v], download=download, checksum=None) 334 util.unzip(zip_path=zip_path, dst=data_dir, remove=False) 335 336 return os.path.join(path, cell_line) 337 338 339def get_mucic_paths( 340 path: Union[os.PathLike, str], 341 cell_line: str, 342 variant: Optional[Union[str, List[str]]] = None, 343 download: bool = False, 344) -> List[str]: 345 """Get paths to the MUCIC data for a specific cell line. 346 347 Args: 348 path: Filepath to a folder where the downloaded data will be saved. 349 cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'. 350 variant: The dataset variant(s). If None, uses all variants. 351 download: Whether to download the data if it is not present. 352 353 Returns: 354 List of filepaths for the processed h5 data. 355 """ 356 from natsort import natsorted 357 358 assert cell_line in CELL_LINES, f"'{cell_line}' is not valid. Choose from {CELL_LINES}." 359 360 get_mucic_data(path, cell_line, variant, download) 361 362 valid_variants = _get_variants(cell_line) 363 if variant is None: 364 variant = valid_variants 365 elif isinstance(variant, str): 366 variant = [variant] 367 368 all_h5_paths = [] 369 for v in variant: 370 h5_out_dir = os.path.join(path, cell_line, "processed", v) 371 372 # Process data if not already done 373 if not os.path.exists(h5_out_dir) or len(glob(os.path.join(h5_out_dir, "*.h5"))) == 0: 374 if cell_line == "vasculogenesis": 375 _create_vasculogenesis_h5(path, v) 376 elif cell_line == "mda231": 377 _create_mda231_h5(path, v) 378 else: 379 _create_mucic_h5(path, cell_line, v) 380 381 h5_paths = glob(os.path.join(h5_out_dir, "*.h5")) 382 all_h5_paths.extend(h5_paths) 383 384 assert len(all_h5_paths) > 0, f"No data found for cell_line '{cell_line}', variant '{variant}'" 385 386 return natsorted(all_h5_paths) 387 388 389def get_mucic_dataset( 390 path: Union[os.PathLike, str], 391 patch_shape: Tuple[int, int, int], 392 cell_line: str, 393 variant: Optional[Union[str, List[str]]] = None, 394 segmentation_type: str = "instances", 395 download: bool = False, 396 **kwargs 397) -> Dataset: 398 """Get the MUCIC dataset for cell segmentation. 399 400 Args: 401 path: Filepath to a folder where the downloaded data will be saved. 
402 patch_shape: The patch shape to use for training. 403 cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'. 404 variant: The dataset variant(s). 405 For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). 406 For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), 407 e.g. 'low_c00'. 408 For 'vasculogenesis' and 'mda231': 'default'. 409 If None, uses all variants for the selected cell line. 410 segmentation_type: The type of segmentation labels to use. 411 One of 'instances' or 'semantic' (binary mask). 412 download: Whether to download the data if it is not present. 413 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 414 415 Returns: 416 The segmentation dataset. 417 """ 418 assert segmentation_type in ("instances", "semantic"), \ 419 f"'{segmentation_type}' is not valid. Choose from 'instances' or 'semantic'." 420 421 h5_paths = get_mucic_paths(path, cell_line, variant, download) 422 423 label_key = f"labels/{segmentation_type}" 424 425 kwargs, _ = util.add_instance_label_transform( 426 kwargs, add_binary_target=True, label_dtype=np.int64, 427 ) 428 429 # Determine dimensionality based on cell line 430 ndim = 2 if cell_line in _2D_CELL_LINES else 3 431 432 return torch_em.default_segmentation_dataset( 433 raw_paths=h5_paths, 434 raw_key="raw", 435 label_paths=h5_paths, 436 label_key=label_key, 437 patch_shape=patch_shape, 438 ndim=ndim, 439 **kwargs 440 ) 441 442 443def get_mucic_loader( 444 path: Union[os.PathLike, str], 445 batch_size: int, 446 patch_shape: Tuple[int, int, int], 447 cell_line: str, 448 variant: Optional[Union[str, List[str]]] = None, 449 segmentation_type: str = "instances", 450 download: bool = False, 451 **kwargs 452) -> DataLoader: 453 """Get the MUCIC dataloader for cell segmentation. 454 455 Args: 456 path: Filepath to a folder where the downloaded data will be saved. 457 batch_size: The batch size for training. 458 patch_shape: The patch shape to use for training. 459 cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'. 460 variant: The dataset variant(s). 461 For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). 462 For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), 463 e.g. 'low_c00'. 464 For 'vasculogenesis' and 'mda231': 'default'. 465 If None, uses all variants for the selected cell line. 466 segmentation_type: The type of segmentation labels to use. 467 One of 'instances' or 'semantic' (binary mask). 468 download: Whether to download the data if it is not present. 469 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader. 470 471 Returns: 472 The DataLoader. 473 """ 474 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 475 dataset = get_mucic_dataset( 476 path=path, 477 patch_shape=patch_shape, 478 cell_line=cell_line, 479 variant=variant, 480 segmentation_type=segmentation_type, 481 download=download, 482 **ds_kwargs, 483 ) 484 return torch_em.get_data_loader(dataset=dataset, batch_size=batch_size, **loader_kwargs)
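The erosion + watershed conversion in _semantic_to_instances_watershed is the one non-obvious processing step above. The following toy example is not part of the module; the array size and the two-cell mask are made up purely to show why plain connected components would merge touching cells while the erosion + watershed route keeps them apart.

# Toy sketch (illustrative values only) of the erosion + watershed conversion.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt
from skimage.measure import label
from skimage.segmentation import watershed

# Two 6x6 "cells" joined by a thin bridge: a single connected component.
mask = np.zeros((12, 20), dtype=bool)
mask[3:9, 2:8] = True      # left cell
mask[3:9, 12:18] = True    # right cell
mask[5:7, 8:12] = True     # thin bridge where the cells touch

print(label(mask).max())   # 1 -> connected components alone cannot split them

seeds = label(binary_erosion(mask, iterations=2))   # erosion removes the bridge -> 2 seeds
distance = distance_transform_edt(mask)             # "landscape" for the watershed
instances = watershed(-distance, seeds, mask=mask)  # grow seeds back to the full mask
print(instances.max())     # 2 -> the touching cells become separate instances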
URLS = {
    'colon_tissue': {
        'low': 'https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_LowNoise_3D_HDF5.zip',
        'high': 'https://datasets.gryf.fi.muni.cz/iciar2011/ColonTissue_HighNoise_3D_HDF5.zip',
    },
    'hl60': {
        'low_c00': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C00_3D_HDF5.zip',
        'low_c25': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C25_3D_HDF5.zip',
        'low_c50': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C50_3D_HDF5.zip',
        'low_c75': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_LowNoise_C75_3D_HDF5.zip',
        'high_c00': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C00_3D_HDF5.zip',
        'high_c25': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C25_3D_HDF5.zip',
        'high_c50': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C50_3D_HDF5.zip',
        'high_c75': 'https://datasets.gryf.fi.muni.cz/cytometry2009/HL60_HighNoise_C75_3D_HDF5.zip',
    },
    'granulocytes': {
        'low': 'https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_LowNoise_3D_HDF5.zip',
        'high': 'https://datasets.gryf.fi.muni.cz/cytometry2009/Granulocytes_HighNoise_3D_HDF5.zip',
    },
    'vasculogenesis': {
        'default': {
            'images': 'https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-images.zip',
            'labels': 'https://datasets.gryf.fi.muni.cz/icip2016/vasculogenesis-labels.zip',
        },
    },
    'mda231': {
        'default': {
            'images': 'https://data.celltrackingchallenge.net/training-datasets/Fluo-C3DL-MDA231.zip',
            'labels': 'https://datasets.gryf.fi.muni.cz/isbi2025/Fluo-C3DL-MDA231_Full_Annotations.zip',
        },
    },
}
CELL_LINES = ['colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', 'mda231']
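The nested layout of URLS doubles as the registry of valid cell lines and variants. The snippet below is a small sketch (not part of the module) that prints the variant names accepted by the functions further down; it mirrors what the private helper _get_variants does internally.

# Sketch: enumerate the valid variant names per cell line from URLS.
from torch_em.data.datasets.light_microscopy.mucic import URLS, CELL_LINES

for cell_line in CELL_LINES:
    print(cell_line, "->", list(URLS[cell_line].keys()))
# 'colon_tissue' and 'granulocytes' have 'low'/'high' noise variants,
# 'hl60' combines noise and clustering (e.g. 'low_c00'),
# 'vasculogenesis' and 'mda231' only provide 'default'.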
def get_mucic_data(
    path: Union[os.PathLike, str],
    cell_line: str,
    variant: Optional[Union[str, List[str]]] = None,
    download: bool = False,
) -> str:
Download the MUCIC dataset for a specific cell line.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
- variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, downloads all variants for the selected cell line.
- download: Whether to download the data if it is not present.
Returns:
The filepath to the dataset directory.
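A minimal usage sketch for get_mucic_data; the target folder and the chosen variant are placeholders, not required values.

from torch_em.data.datasets.light_microscopy.mucic import get_mucic_data

# Download and unpack one HL60 variant; returns the per-cell-line directory.
data_root = get_mucic_data(
    path="./data/mucic",     # placeholder download folder
    cell_line="hl60",
    variant="low_c00",       # low noise, no clustering
    download=True,
)
print(data_root)             # ./data/mucic/hl60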
def get_mucic_paths(
    path: Union[os.PathLike, str],
    cell_line: str,
    variant: Optional[Union[str, List[str]]] = None,
    download: bool = False,
) -> List[str]:
Get paths to the MUCIC data for a specific cell line.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
- variant: The dataset variant(s). If None, uses all variants.
- download: Whether to download the data if it is not present.
Returns:
List of filepaths for the processed h5 data.
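A short sketch for get_mucic_paths; the folder is a placeholder. On the first call this also converts the downloaded raw/label files into the processed h5 layout with 'raw', 'labels/instances' and 'labels/semantic' datasets.

from torch_em.data.datasets.light_microscopy.mucic import get_mucic_paths

# Fetch (and, if needed, process) both granulocyte noise levels.
h5_paths = get_mucic_paths(
    path="./data/mucic", cell_line="granulocytes", variant=["low", "high"], download=True,
)
print(len(h5_paths), h5_paths[0])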
def get_mucic_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    cell_line: str,
    variant: Optional[Union[str, List[str]]] = None,
    segmentation_type: str = "instances",
    download: bool = False,
    **kwargs
) -> Dataset:
Get the MUCIC dataset for cell segmentation.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- patch_shape: The patch shape to use for training.
- cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
- variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, uses all variants for the selected cell line.
- segmentation_type: The type of segmentation labels to use. One of 'instances' or 'semantic' (binary mask).
- download: Whether to download the data if it is not present.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
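A usage sketch for get_mucic_dataset; the folder is a placeholder and the patch shape is an illustrative (Z, Y, X) choice for the 3D data, not a prescribed value.

from torch_em.data.datasets.light_microscopy.mucic import get_mucic_dataset

dataset = get_mucic_dataset(
    path="./data/mucic",           # placeholder folder
    patch_shape=(32, 256, 256),    # illustrative 3D patch (Z, Y, X)
    cell_line="colon_tissue",
    variant="low",
    segmentation_type="instances",
    download=True,
)
print(len(dataset))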
def get_mucic_loader(
    path: Union[os.PathLike, str],
    batch_size: int,
    patch_shape: Tuple[int, int, int],
    cell_line: str,
    variant: Optional[Union[str, List[str]]] = None,
    segmentation_type: str = "instances",
    download: bool = False,
    **kwargs
) -> DataLoader:
Get the MUCIC dataloader for cell segmentation.
Arguments:
- path: Filepath to a folder where the downloaded data will be saved.
- batch_size: The batch size for training.
- patch_shape: The patch shape to use for training.
- cell_line: The cell line to use. One of 'colon_tissue', 'hl60', 'granulocytes', 'vasculogenesis', or 'mda231'.
- variant: The dataset variant(s). For 'colon_tissue' and 'granulocytes': 'low' or 'high' (noise levels). For 'hl60': combination of noise ('low', 'high') and clustering ('c00', 'c25', 'c50', 'c75'), e.g. 'low_c00'. For 'vasculogenesis' and 'mda231': 'default'. If None, uses all variants for the selected cell line.
- segmentation_type: The type of segmentation labels to use. One of 'instances' or 'semantic' (binary mask).
- download: Whether to download the data if it is not present.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
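A usage sketch for get_mucic_loader; folder, patch shape, and worker count are placeholders. DataLoader options such as num_workers can be passed directly, since the keyword arguments are split between the dataset and the loader.

from torch_em.data.datasets.light_microscopy.mucic import get_mucic_loader

loader = get_mucic_loader(
    path="./data/mucic",           # placeholder folder
    batch_size=2,
    patch_shape=(32, 128, 128),    # illustrative 3D patch (Z, Y, X)
    cell_line="hl60",
    variant="low_c00",
    download=True,
    num_workers=2,                 # forwarded to the PyTorch DataLoader
)
x, y = next(iter(loader))
print(x.shape, y.shape)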