torch_em.data.datasets.electron_microscopy.human_liver_em
The human liver EM dataset contains multiscale SBF-SEM images of human liver tissue with semantic segmentations of cellular and organelle structures.
The dataset covers a reconstructed human periportal liver volume (597 z-slices, 20000 x 20000 pixels per slice) with 9 binary semantic segmentation classes (0=background, 255=foreground - not instance segmentation).
Currently recommended label choices for training: "er", "mito", "nucleus". These are the classes that are well-represented and have clear biological meaning at the imaging resolution of this dataset.
All 9 available classes:
- "bile_duct": bile duct (sparse, often absent in a given crop)
- "cell_boundary": cell boundary region (coarse, marks interior near cell edge)
- "cholangiocyte": cholangiocyte cells (sparse)
- "endothelial": endothelial cells (sparse)
- "er": endoplasmic reticulum (recommended)
- "hepatocyte": hepatocyte cell interior - tissue-level mask (large filled regions)
- "mito": mitochondria (recommended)
- "nucleus": nucleus (recommended)
- "sinusoid": sinusoidal capillary (sparse)
NOTE (on other organelles): apart from "er", "mito" and "nucleus", the 9 classes above are tissue/cell-level annotations. Organelle-level segmentations for additional structures (lipid droplets, Golgi, etc.) are not available in EMPIAR-13356. The Parlakgul liver dataset (EMPIAR-10791) provides richer organelle annotations (ER sheets/tubules, lipid droplets, nuclear membrane, plasma membrane) at higher FIB-SEM resolution for mouse liver.
NOTE (on resolution): the pixel size is not documented in EMPIAR-13356. Based on the visible tissue scale (~1-2mm tissue spanning 20000 pixels), xy resolution is estimated at ~50-100nm/pixel. The z section thickness is also unconfirmed. Check the paper for the exact values before selecting patch shapes for isotropic training.
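The estimate in the note above is plain arithmetic, sketched here for reference. The 1 mm and 2 mm extents are the rough bounds quoted in the note, not measured values, and `estimate_pixel_size_nm` is a hypothetical helper that is not part of torch_em.

```python
def estimate_pixel_size_nm(tissue_extent_mm: float, n_pixels: int) -> float:
    """Return the approximate pixel size in nanometers per pixel."""
    nm_per_mm = 1_000_000
    return tissue_extent_mm * nm_per_mm / n_pixels


# Assumed bounds: the tissue spans roughly 1-2 mm across 20000 pixels.
low = estimate_pixel_size_nm(1.0, 20000)
high = estimate_pixel_size_nm(2.0, 20000)
print(f"estimated xy pixel size: {low:.0f}-{high:.0f} nm/pixel")
```

This only bounds the xy pixel size. The z section thickness cannot be derived this way and still has to come from the paper.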
Data is streamed lazily from EMPIAR-13356 via HTTP: raw 16-bit TIFFs and binary PNG masks are fetched per z-slice and cached in a single zarr v3 store per bounding box.
The raw data and the label arrays are stored together in that store. The current implementation fetches only the three recommended classes, so the store holds the arrays raw, er, mito and nucleus.
The label_choice parameter in the loader selects which array to use as labels.
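Reading a crop from a remote TIFF without downloading the whole file comes down to requesting one byte range that covers the needed rows. The sketch below shows that calculation on a toy strip table. It assumes one strip per image row and strips stored contiguously in row order. The helper names `strip_byte_range` and `range_header` are hypothetical, and the offsets are invented for illustration.

```python
from typing import Dict, Sequence, Tuple


def strip_byte_range(
    offsets: Sequence[int], bytecounts: Sequence[int], y_min: int, y_max: int
) -> Tuple[int, int]:
    """Return the inclusive byte range covering rows y_min..y_max-1.

    Assumes one strip per row and strips laid out in increasing row order.
    """
    if not 0 <= y_min < y_max <= len(offsets):
        raise ValueError(f"row range [{y_min}, {y_max}) is out of bounds")
    start = offsets[y_min]
    end = offsets[y_max - 1] + bytecounts[y_max - 1] - 1
    return start, end


def range_header(start: int, end: int) -> Dict[str, str]:
    """Build the HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}


# Toy strip table with four rows (made-up numbers).
offsets = [100, 150, 230, 300]
bytecounts = [50, 80, 70, 40]

start, end = strip_byte_range(offsets, bytecounts, y_min=1, y_max=3)
header = range_header(start, end)
print(header)
```

Each strip inside the returned range can then be located at `offsets[i] - start` and decoded on its own.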
Bounding boxes are specified as (x_min, x_max, y_min, y_max, z_min, z_max) in voxels, with exclusive upper bounds. The full volume is (597, 20000, 20000) voxels (z, y, x). Tissue spans roughly x=[1195, 18890], y=[469, 19570]; the volume edges are empty.
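Because bounding boxes are given in (x, y, z) order while arrays are indexed as (z, y, x), a small conversion step helps to avoid mixing up the axes. The following is a minimal sketch. `bbox_to_shape` is a hypothetical helper, and the example crop coordinates are arbitrary values inside the tissue region.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int, int, int]

# Full volume shape in (z, y, x) order, as stated in the text.
VOLUME_SHAPE = (597, 20000, 20000)


def bbox_to_shape(bbox: BBox) -> Tuple[int, int, int]:
    """Validate a bounding box and return the crop shape in (z, y, x) order."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    bounds = (
        ("z", z_min, z_max, VOLUME_SHAPE[0]),
        ("y", y_min, y_max, VOLUME_SHAPE[1]),
        ("x", x_min, x_max, VOLUME_SHAPE[2]),
    )
    for axis, lo, hi, size in bounds:
        # Upper bounds are exclusive, so hi may equal the axis size.
        if not 0 <= lo < hi <= size:
            raise ValueError(f"invalid {axis} range [{lo}, {hi}) for axis size {size}")
    return (z_max - z_min, y_max - y_min, x_max - x_min)


# Arbitrary example crop: 512 x 512 pixels over 32 slices.
bbox = (8000, 8512, 9000, 9512, 100, 132)
shape = bbox_to_shape(bbox)
print(shape)
```

A tuple validated this way could then be passed as `bounding_boxes=[bbox]` to the dataset or loader functions documented below, with a `patch_shape` no larger than the returned shape.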
This dataset is from the publication https://www.biorxiv.org/content/10.64898/2026.04.22.719970v1. Please cite it if you use this dataset in your research.
The data is publicly available at https://www.ebi.ac.uk/empiar/EMPIAR-13356/.
"""The human liver EM dataset contains multiscale SBF-SEM images of human liver tissue
with semantic segmentations of cellular and organelle structures.

The dataset covers a reconstructed human periportal liver volume (597 z-slices,
20000 x 20000 pixels per slice) with 9 binary semantic segmentation classes
(0=background, 255=foreground - not instance segmentation).

Currently recommended label choices for training: "er", "mito", "nucleus".
These are the classes that are well-represented and have clear biological meaning
at the imaging resolution of this dataset.

All 9 available classes:
- "bile_duct": bile duct (sparse, often absent in a given crop)
- "cell_boundary": cell boundary region (coarse, marks interior near cell edge)
- "cholangiocyte": cholangiocyte cells (sparse)
- "endothelial": endothelial cells (sparse)
- "er": endoplasmic reticulum (recommended)
- "hepatocyte": hepatocyte cell interior - tissue-level mask (large filled regions)
- "mito": mitochondria (recommended)
- "nucleus": nucleus (recommended)
- "sinusoid": sinusoidal capillary (sparse)

NOTE (on other organelles): apart from "er", "mito" and "nucleus", the 9 classes above
are tissue/cell-level annotations. Organelle-level segmentations for additional
structures (lipid droplets, Golgi, etc.) are not available in EMPIAR-13356.
The Parlakgul liver dataset (EMPIAR-10791) provides richer organelle annotations
(ER sheets/tubules, lipid droplets, nuclear membrane, plasma membrane) at higher
FIB-SEM resolution for mouse liver.

NOTE (on resolution): the pixel size is not documented in EMPIAR-13356. Based on the
visible tissue scale (~1-2mm tissue spanning 20000 pixels), xy resolution is estimated
at ~50-100nm/pixel. The z section thickness is also unconfirmed. Check the paper for
the exact values before selecting patch shapes for isotropic training.

Data is streamed lazily from EMPIAR-13356 via HTTP: raw 16-bit TIFFs and binary PNG
masks are fetched per z-slice and cached in a single zarr v3 store per bounding box.
The raw data and the label arrays are stored together (raw, er, mito, nucleus).
The `label_choice` parameter in the loader selects which array to use as labels.

Bounding boxes are specified as (x_min, x_max, y_min, y_max, z_min, z_max) in voxels,
with exclusive upper bounds. The full volume is (597, 20000, 20000) voxels (z, y, x).
Tissue spans roughly x=[1195, 18890], y=[469, 19570] - the volume edges are empty.

This dataset is from the publication https://www.biorxiv.org/content/10.64898/2026.04.22.719970v1.
Please cite it if you use this dataset in your research.

The data is publicly available at https://www.ebi.ac.uk/empiar/EMPIAR-13356/.
"""

import hashlib
import io
import os
from typing import List, Literal, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EMPIAR_BASE = "https://ftp.ebi.ac.uk/empiar/world_availability/13356/data"

# Only the recommended label classes are fetched and cached.
HUMAN_LIVER_EM_LABEL_DIRS = {
    "er": "humanliver_er_mask",
    "mito": "humanliver_mito_mask",
    "nucleus": "humanliver_nucleus_mask",
}

HUMAN_LIVER_EM_SHAPE = (597, 20000, 20000)
# Tissue spans x=[1195,18890], y=[469,19570] - edges are empty background.
HUMAN_LIVER_EM_TISSUE_BBOX = (1195, 18890, 469, 19570, 0, 597)

# Zarr layout for bbox crops.
HUMAN_LIVER_EM_CHUNK_SHAPE = (64, 256, 256)
# Zarr layout for full-volume sharded store.
# Shards: (64, 4096, 4096) outer; chunks: (8, 256, 256) inner.
HUMAN_LIVER_EM_SHARD_SHAPE = (64, 4096, 4096)
HUMAN_LIVER_EM_INNER_CHUNK = (8, 256, 256)

LabelChoice = Literal["er", "mito", "nucleus"]


def _bbox_to_str(bbox):
    # Short, stable cache key derived from the bounding box coordinates.
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


class _HttpFile:
    """Seekable file-like object backed by HTTP range requests for efficient partial TIFF reading."""

    def __init__(self, url):
        import requests
        self.url = url
        self._pos = 0
        r = requests.head(url, timeout=30)
        r.raise_for_status()
        self._size = int(r.headers["Content-Length"])

    def read(self, n=-1):
        import requests
        end = (self._size - 1) if n == -1 else min(self._pos + n - 1, self._size - 1)
        if self._pos > end:
            return b""
        r = requests.get(self.url, headers={"Range": f"bytes={self._pos}-{end}"}, timeout=120)
        # Fail loudly instead of handing an HTTP error page to the TIFF parser.
        r.raise_for_status()
        data = r.content
        self._pos += len(data)
        return data

    def seek(self, pos, whence=0):
        if whence == 0:
            self._pos = pos
        elif whence == 1:
            self._pos += pos
        elif whence == 2:
            self._pos = self._size + pos
        self._pos = max(0, min(self._pos, self._size))
        return self._pos

    def tell(self):
        return self._pos

    def seekable(self):
        return True

    def readable(self):
        return True

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass


def _read_raw_slice(z, x_min, x_max, y_min, y_max):
    """Read a cropped region from a remote TIFF using a single HTTP range request covering
    only the strips needed for the y range, avoiding downloading the full ~880 MB file."""
    import requests
    import tifffile
    from imagecodecs import delta_decode, lzw_decode

    url = f"{EMPIAR_BASE}/humanliver_raw_images/humanliver_raw_{z:03d}.tif"

    # Read TIFF metadata to get strip offsets and byte counts.
    with tifffile.TiffFile(_HttpFile(url)) as tif:
        page = tif.pages[0]
        offsets = page.dataoffsets
        bytecounts = page.databytecounts
        width = page.imagewidth
        dtype = np.dtype(page.dtype)

    # One strip per row - request only strips y_min..y_max in one range request.
    start_byte = offsets[y_min]
    end_byte = offsets[y_max - 1] + bytecounts[y_max - 1] - 1
    r = requests.get(url, headers={"Range": f"bytes={start_byte}-{end_byte}"}, timeout=120)
    r.raise_for_status()
    raw_bytes = r.content

    # Predictor=2 (horizontal differencing) is used - need cumsum after LZW decode.
    rows = []
    for i in range(y_min, y_max):
        strip_start = offsets[i] - start_byte
        strip_data = raw_bytes[strip_start:strip_start + bytecounts[i]]
        decoded = np.frombuffer(lzw_decode(strip_data), dtype=dtype)
        # np.frombuffer on bytes yields a read-only array, so let delta_decode
        # allocate its own output instead of writing in place.
        row = delta_decode(decoded, axis=-1, dist=1).reshape(width)
        rows.append(row[x_min:x_max])

    return np.stack(rows, axis=0)


def _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max):
    import PIL.Image
    import requests
    # The masks are 20000 x 20000 pixels, which exceeds PIL's default size limit.
    PIL.Image.MAX_IMAGE_PIXELS = None
    label_dir = HUMAN_LIVER_EM_LABEL_DIRS[label_key]
    url = f"{EMPIAR_BASE}/{label_dir}/{label_dir}_{z:03d}.png"
    r = requests.get(url, timeout=120)
    r.raise_for_status()
    img = np.array(PIL.Image.open(io.BytesIO(r.content)))
    if img.ndim == 3:
        img = img[..., 0]
    return img[y_min:y_max, x_min:x_max]


def get_human_liver_em_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int],
    download: bool = False,
) -> str:
    """Stream a subvolume from the human liver EM dataset and cache it as a zarr v3 store.

    The raw EM and the label arrays ("er", "mito", "nucleus") are stored in a single zarr,
    so each bounding box is only downloaded once regardless of which label_choice is used.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in voxel coordinates. Tissue spans roughly x=[1195,18890], y=[469,19570].
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr
    from zarr.codecs import BloscCodec

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.zarr")

    root = zarr.open_group(zarr_path, mode="a")
    all_keys = ["raw"] + list(HUMAN_LIVER_EM_LABEL_DIRS.keys())
    if all(k in root for k in all_keys):
        return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from EMPIAR."
        )

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    shape = (z_max - z_min, y_max - y_min, x_max - x_min)

    print(f"Streaming Human Liver EM + all labels for bbox {bounding_box} ...")
    raw_vol = np.zeros(shape, dtype=np.uint16)
    label_vols = {k: np.zeros(shape, dtype=np.uint8) for k in HUMAN_LIVER_EM_LABEL_DIRS}

    for i, z in enumerate(range(z_min, z_max)):
        raw_vol[i] = _read_raw_slice(z, x_min, x_max, y_min, y_max)
        for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
            label_vols[label_key][i] = _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max)
        if (i + 1) % 5 == 0:
            print(f"  {i + 1}/{z_max - z_min} slices done")

    def _make_array(name, data, is_label):
        shuffle = "bitshuffle" if is_label else "shuffle"
        arr = root.create_array(
            name, shape=data.shape, chunks=HUMAN_LIVER_EM_CHUNK_SHAPE, dtype=data.dtype,
            compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
        )
        arr[:] = data

    root.attrs["bounding_box"] = list(bounding_box)

    if "raw" not in root:
        _make_array("raw", raw_vol, is_label=False)
    for label_key, vol in label_vols.items():
        if label_key not in root:
            _make_array(label_key, vol, is_label=True)

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_human_liver_em_full_volume(
    path: Union[os.PathLike, str],
    download: bool = False,
) -> str:
    """Download the full human liver EM tissue volume into a sharded zarr v3 store.

    Downloads all 597 z-slices for the tissue region x=[1195,18890], y=[469,19570]
    with raw EM + er/mito/nucleus labels. Data is written slice by slice to avoid
    memory issues. Estimated storage: ~100-150 GB compressed. Estimated download
    time: ~12 hours (one-time cost).

    The sharded zarr uses shard shape (64, 4096, 4096) with inner chunks (8, 256, 256),
    enabling efficient random crop access during training without loading the full volume.

    Args:
        path: Filepath to a folder where the zarr store will be saved.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the full-volume zarr store.
    """
    import zarr
    from zarr.codecs import BloscCodec, ShardingCodec

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), "full_volume.zarr")

    x_min, x_max, y_min, y_max, z_min, z_max = HUMAN_LIVER_EM_TISSUE_BBOX
    shape = (z_max - z_min, y_max - y_min, x_max - x_min)

    root = zarr.open_group(zarr_path, mode="a")
    all_keys = ["raw"] + list(HUMAN_LIVER_EM_LABEL_DIRS.keys())
    if all(k in root for k in all_keys):
        return zarr_path

    if not download:
        raise RuntimeError(
            f"Full-volume zarr not found at '{zarr_path}'. Set download=True to stream from EMPIAR."
            " Note: download takes ~12 hours and requires ~100-150 GB disk space."
        )

    def _make_sharded(name, dtype, is_label):
        shuffle = "bitshuffle" if is_label else "shuffle"
        return root.create_array(
            name, shape=shape, chunks=HUMAN_LIVER_EM_SHARD_SHAPE, dtype=dtype,
            compressors=ShardingCodec(
                chunk_shape=HUMAN_LIVER_EM_INNER_CHUNK,
                codecs=[BloscCodec(cname="zstd", clevel=6, shuffle=shuffle)],
            ),
        )

    if "raw" not in root:
        _make_sharded("raw", np.dtype("uint16"), is_label=False)
    for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
        if label_key not in root:
            _make_sharded(label_key, np.dtype("uint8"), is_label=True)

    root.attrs["tissue_bbox"] = list(HUMAN_LIVER_EM_TISSUE_BBOX)
    n_slices = z_max - z_min
    print(f"Streaming full Human Liver EM volume ({shape}) - this will take several hours ...")

    # Write one z-slice at a time so the full volume never has to fit in memory.
    for i, z in enumerate(range(z_min, z_max)):
        raw_slice = _read_raw_slice(z, x_min, x_max, y_min, y_max)
        root["raw"][i] = raw_slice
        for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
            mask_slice = _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max)
            root[label_key][i] = mask_slice
        if (i + 1) % 10 == 0:
            print(f"  {i + 1}/{n_slices} slices done")

    print(f"Full volume cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_human_liver_em_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    full_volume: bool = False,
) -> List[str]:
    """Get paths to human liver EM zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        download: Whether to stream and cache the data if it is not present.
        full_volume: If True, download/use the full tissue volume as a single sharded
            zarr v3 store (~12h download, ~100-150 GB). Supersedes bounding_boxes.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    if full_volume:
        return [get_human_liver_em_full_volume(path, download)]
    if bounding_boxes is None:
        raise ValueError("Provide bounding_boxes or set full_volume=True.")
    return [get_human_liver_em_data(path, bbox, download) for bbox in bounding_boxes]


def get_human_liver_em_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> Dataset:
    """Get the human liver EM dataset for semantic segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training. The pixel
            resolution is unconfirmed (estimated ~50-100 nm/px xy). Check the
            paper for the exact values when selecting isotropic patch shapes.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        label_choice: Which structure to segment.
        download: Whether to stream and cache data if not already present.
        full_volume: If True, use the full sharded tissue volume (~12h one-time download).
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_human_liver_em_paths(path, bounding_boxes, download, full_volume)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key=label_choice,
        patch_shape=patch_shape,
        **kwargs,
    )


def get_human_liver_em_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for semantic segmentation in the human liver EM dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        label_choice: Which structure to segment.
        download: Whether to stream and cache data if not already present.
        full_volume: If True, use the full sharded tissue volume (~12h one-time download).
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_human_liver_em_dataset(
        path, patch_shape, bounding_boxes, label_choice=label_choice,
        download=download, full_volume=full_volume, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
def get_human_liver_em_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int],
    download: bool = False,
) -> str:
Stream a subvolume from the human liver EM dataset and cache it as a zarr v3 store.
The raw EM and the label arrays ("er", "mito" and "nucleus" in the current implementation) are stored in a single zarr, so each bounding box is only downloaded once regardless of which label_choice is used.
Arguments:
- path: Filepath to a folder where the cached zarr store will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Tissue spans roughly x=[1195,18890], y=[469,19570].
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached zarr store.
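The "downloaded once per bounding box" behaviour rests on deriving the store name deterministically from the bounding box. A minimal sketch of such a cache key follows. It mirrors the truncated MD5 hash used in the module source, but `bbox_cache_name` is a hypothetical name and the example coordinates are arbitrary.

```python
import hashlib
from typing import Sequence


def bbox_cache_name(bbox: Sequence[int]) -> str:
    """Derive a short, stable zarr store name from bounding box coordinates."""
    key = "_".join(str(v) for v in bbox)
    # MD5 serves only as a fingerprint here, not for security.
    return hashlib.md5(key.encode()).hexdigest()[:12] + ".zarr"


first = bbox_cache_name((8000, 8512, 9000, 9512, 100, 132))
again = bbox_cache_name((8000, 8512, 9000, 9512, 100, 132))
other = bbox_cache_name((8000, 8512, 9000, 9512, 100, 164))
print(first == again, first == other)
```

Identical bounding boxes map to the same store, so a second request finds the cache, while any change to a coordinate produces a new store and a fresh download.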
def get_human_liver_em_full_volume(
    path: Union[os.PathLike, str],
    download: bool = False,
) -> str:
Download the full human liver EM tissue volume into a sharded zarr v3 store.
Downloads all 597 z-slices for the tissue region x=[1195,18890], y=[469,19570] with raw EM + er/mito/nucleus labels. Data is written slice by slice to avoid memory issues. Estimated storage: ~100-150 GB compressed. Estimated download time: ~12 hours (one-time cost).
The sharded zarr uses shard shape (64, 4096, 4096) with inner chunks (8, 256, 256), enabling efficient random crop access during training without loading the full volume.
Arguments:
- path: Filepath to a folder where the zarr store will be saved.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the full-volume zarr store.
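To get a feel for the sharded layout, the number of shards and inner chunks can be worked out from the shapes quoted above. This is a small arithmetic sketch, not part of the module. The constants are copied from the text, while the variable names are chosen here for illustration.

```python
import math

# Values taken from the description above.
TISSUE_BBOX = (1195, 18890, 469, 19570, 0, 597)  # (x_min, x_max, y_min, y_max, z_min, z_max)
SHARD_SHAPE = (64, 4096, 4096)  # (z, y, x)
INNER_CHUNK = (8, 256, 256)  # (z, y, x)

x_min, x_max, y_min, y_max, z_min, z_max = TISSUE_BBOX
volume_shape = (z_max - z_min, y_max - y_min, x_max - x_min)

# Shards needed along each axis; partial shards at the border count as one.
shards_per_axis = tuple(math.ceil(s / sh) for s, sh in zip(volume_shape, SHARD_SHAPE))
n_shards = math.prod(shards_per_axis)

# Inner chunks packed into a single full shard.
chunks_per_shard = math.prod(sh // ch for sh, ch in zip(SHARD_SHAPE, INNER_CHUNK))

print(volume_shape, shards_per_axis, n_shards, chunks_per_shard)
```

Under these numbers each array is split into at most a few hundred shards, while a random training crop only needs to decompress the small inner chunks it overlaps.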
def get_human_liver_em_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    full_volume: bool = False,
) -> List[str]:
Get paths to human liver EM zarr stores.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
- download: Whether to stream and cache the data if it is not present.
- full_volume: If True, download/use the full tissue volume as a single sharded zarr v3 store (~12h download, ~100-150 GB). Supersedes bounding_boxes.
Returns:
List of filepaths to the cached zarr stores.
def get_human_liver_em_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> Dataset:
Get the human liver EM dataset for semantic segmentation.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- patch_shape: The patch shape (z, y, x) to use for training. The pixel resolution is unconfirmed (estimated ~50-100 nm/px xy). Check the paper for the exact values when selecting isotropic patch shapes.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
- label_choice: Which structure to segment.
- download: Whether to stream and cache data if not already present.
- full_volume: If True, use the full sharded tissue volume (~12h one-time download).
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
def get_human_liver_em_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> DataLoader:
Get the DataLoader for semantic segmentation in the human liver EM dataset.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
- label_choice: Which structure to segment.
- download: Whether to stream and cache data if not already present.
- full_volume: If True, use the full sharded tissue volume (~12h one-time download).
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
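The loader accepts one pool of keyword arguments and routes each either to the dataset function or to the DataLoader. One plausible way to do such routing is by inspecting the target function's signature, as sketched below. This is an illustration under that assumption, not torch_em's actual implementation: `split_kwargs_sketch` and `make_dataset` are hypothetical stand-ins.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_kwargs_sketch(
    function: Callable, **kwargs: Any
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into those accepted by `function` and the remainder."""
    params = inspect.signature(function).parameters
    matched = {k: v for k, v in kwargs.items() if k in params}
    rest = {k: v for k, v in kwargs.items() if k not in params}
    return matched, rest


def make_dataset(raw_key=None, label_key=None, n_samples=None, sampler=None):
    """Hypothetical stand-in for a dataset factory."""
    return None


ds_kwargs, loader_kwargs = split_kwargs_sketch(
    make_dataset, n_samples=100, num_workers=4, shuffle=True
)
print(ds_kwargs, loader_kwargs)
```

With this kind of routing, a call can mix dataset options such as a sample count with loader options such as `num_workers` in one argument list.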