torch_em.data.datasets.electron_microscopy.human_liver_em

The human liver EM dataset contains multiscale SBF-SEM images of human liver tissue with semantic segmentations of cellular and organelle structures.

The dataset covers a reconstructed human periportal liver volume (597 z-slices, 20000 x 20000 pixels per slice) with 9 binary semantic segmentation classes (0=background, 255=foreground - not instance segmentation).
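Because foreground is encoded as 255 rather than 1, training code will usually want to binarize a mask before using it as a target. The following is a minimal sketch on a synthetic array; `binarize_mask` is a hypothetical helper and not part of torch_em.

```python
import numpy as np


def binarize_mask(mask: np.ndarray) -> np.ndarray:
    """Map a 0/255 semantic mask to 0/1 (hypothetical helper, not part of torch_em)."""
    return (mask > 0).astype(np.uint8)


# Synthetic stand-in for a small crop of one mask slice.
mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)
binary = binarize_mask(mask)
```

Thresholding with `> 0` instead of dividing by 255 keeps the result binary even if a mask contains intermediate values.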

Currently recommended label choices for training: "er", "mito", "nucleus". These are the classes that are well-represented and have clear biological meaning at the imaging resolution of this dataset.

All 9 available classes:

  • "bile_duct": bile duct (sparse, often absent in a given crop)
  • "cell_boundary": cell boundary region (coarse, marks interior near cell edge)
  • "cholangiocyte": cholangiocyte cells (sparse)
  • "endothelial": endothelial cells (sparse)
  • "er": endoplasmic reticulum (recommended)
  • "hepatocyte": hepatocyte cell interior - tissue-level mask (large filled regions)
  • "mito": mitochondria (recommended)
  • "nucleus": nucleus (recommended)
  • "sinusoid": sinusoidal capillary (sparse)

NOTE (on other organelles): apart from "er", "mito" and "nucleus", the classes above are tissue/cell-level annotations. Organelle-level segmentations for additional structures (lipid droplets, Golgi, etc.) are not available in EMPIAR-13356. The Parlakgul liver dataset (EMPIAR-10791) provides richer organelle annotations (ER sheets/tubules, lipid droplets, nuclear membrane, plasma membrane) at higher FIB-SEM resolution for mouse liver.

NOTE (on resolution): the pixel size is not documented in EMPIAR-13356. Based on the visible tissue scale (~1-2mm tissue spanning 20000 pixels), xy resolution is estimated at ~50-100nm/pixel. The z section thickness is also unconfirmed. Check the paper for the exact values before selecting patch shapes for isotropic training.
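The estimate above is plain arithmetic: the assumed tissue extent divided by the number of pixels it spans. A small sketch that reproduces it follows; the 1-2 mm extent is the assumption stated in the note, not a measured value, and `nm_per_pixel` is a hypothetical helper.

```python
def nm_per_pixel(extent_mm: float, n_pixels: int) -> float:
    """Pixel size in nanometers for a field of view of `extent_mm` millimeters."""
    return extent_mm * 1e6 / n_pixels  # 1 mm = 1e6 nm


# Lower and upper bound for the assumed 1-2 mm tissue extent over 20000 pixels.
low = nm_per_pixel(1.0, 20000)
high = nm_per_pixel(2.0, 20000)
print(f"estimated xy pixel size: {low:.0f}-{high:.0f} nm/pixel")
```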

Data is streamed lazily from EMPIAR-13356 via HTTP: raw 16-bit TIFFs and binary PNG masks are fetched per z-slice and cached in a single zarr v3 store per bounding box. Only the label classes exposed by the loader ("er", "mito", "nucleus") are cached, and they are stored together with the raw data as arrays named raw, er, mito and nucleus. The label_choice parameter in the loader selects which array to use as labels.
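For the raw TIFFs, partial fetching amounts to requesting one contiguous byte range that covers the strips of the wanted rows. The sketch below shows that range computation on toy values. It assumes one strip per image row and strips stored in increasing file order, as the module source does; `strip_byte_range` and `range_header` are hypothetical helper names.

```python
from typing import Dict, Sequence, Tuple


def strip_byte_range(
    offsets: Sequence[int], bytecounts: Sequence[int], y_min: int, y_max: int
) -> Tuple[int, int]:
    """Inclusive (start, end) byte positions covering strips y_min .. y_max - 1.

    Assumes one strip per row and strips laid out in increasing file order.
    """
    start = offsets[y_min]
    end = offsets[y_max - 1] + bytecounts[y_max - 1] - 1
    return start, end


def range_header(start: int, end: int) -> Dict[str, str]:
    """HTTP header that asks the server for bytes start..end (both inclusive)."""
    return {"Range": f"bytes={start}-{end}"}


# Toy layout: four strips of 10 bytes each, starting at byte 100.
offsets = [100, 110, 120, 130]
bytecounts = [10, 10, 10, 10]
start, end = strip_byte_range(offsets, bytecounts, y_min=1, y_max=3)
header = range_header(start, end)
```

The PNG masks do not support this kind of partial read in the module source, so each mask slice is downloaded in full and cropped afterwards.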

Bounding boxes are specified as (x_min, x_max, y_min, y_max, z_min, z_max) in voxels. The full volume is (597, 20000, 20000) voxels (z, y, x). Tissue spans roughly x=[1195, 18890], y=[469, 19570] - the volume edges are empty.
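Note that bounding boxes are ordered x, y, z while array shapes are ordered z, y, x. A minimal sketch of the conversion, with a bounds check against the full volume, is given below; `bbox_to_slices` and `bbox_shape` are hypothetical helpers and the example box is arbitrary.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int, int, int]
VOLUME_SHAPE = (597, 20000, 20000)  # (z, y, x), as stated above


def bbox_to_slices(bbox: BBox, volume_shape=VOLUME_SHAPE) -> Tuple[slice, slice, slice]:
    """Convert (x_min, x_max, y_min, y_max, z_min, z_max) to (z, y, x) slices."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    bounds = ((z_min, z_max), (y_min, y_max), (x_min, x_max))
    for (lo, hi), size in zip(bounds, volume_shape):
        if not 0 <= lo < hi <= size:
            raise ValueError(f"Range [{lo}, {hi}) does not fit into axis of size {size}.")
    return tuple(slice(lo, hi) for lo, hi in bounds)


def bbox_shape(bbox: BBox) -> Tuple[int, int, int]:
    """Shape (z, y, x) of the subvolume described by the bounding box."""
    return tuple(s.stop - s.start for s in bbox_to_slices(bbox))


# Arbitrary example box inside the tissue region.
bbox = (4000, 4512, 6000, 6512, 100, 164)
shape = bbox_shape(bbox)
```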

This dataset is from the publication https://www.biorxiv.org/content/10.64898/2026.04.22.719970v1. Please cite it if you use this dataset in your research.

The data is publicly available at https://www.ebi.ac.uk/empiar/EMPIAR-13356/.

"""The human liver EM dataset contains multiscale SBF-SEM images of human liver tissue
with semantic segmentations of cellular and organelle structures.

The dataset covers a reconstructed human periportal liver volume (597 z-slices,
20000 x 20000 pixels per slice) with 9 binary semantic segmentation classes
(0=background, 255=foreground - not instance segmentation).

Currently recommended label choices for training: "er", "mito", "nucleus".
These are the classes that are well-represented and have clear biological meaning
at the imaging resolution of this dataset.

All 9 available classes:
- "bile_duct": bile duct (sparse, often absent in a given crop)
- "cell_boundary": cell boundary region (coarse, marks interior near cell edge)
- "cholangiocyte": cholangiocyte cells (sparse)
- "endothelial": endothelial cells (sparse)
- "er": endoplasmic reticulum (recommended)
- "hepatocyte": hepatocyte cell interior - tissue-level mask (large filled regions)
- "mito": mitochondria (recommended)
- "nucleus": nucleus (recommended)
- "sinusoid": sinusoidal capillary (sparse)

NOTE (on other organelles): apart from "er", "mito" and "nucleus", the classes above
are tissue/cell-level annotations. Organelle-level segmentations for additional
structures (lipid droplets, Golgi, etc.) are not available in EMPIAR-13356. The
Parlakgul liver dataset (EMPIAR-10791) provides richer organelle annotations
(ER sheets/tubules, lipid droplets, nuclear membrane, plasma membrane) at higher
FIB-SEM resolution for mouse liver.

NOTE (on resolution): the pixel size is not documented in EMPIAR-13356. Based on the
visible tissue scale (~1-2mm tissue spanning 20000 pixels), xy resolution is estimated
at ~50-100nm/pixel. The z section thickness is also unconfirmed. Check the paper for
the exact values before selecting patch shapes for isotropic training.

Data is streamed lazily from EMPIAR-13356 via HTTP: raw 16-bit TIFFs and binary PNG
masks are fetched per z-slice and cached in a single zarr v3 store per bounding box.
Only the label classes exposed by the loader ("er", "mito", "nucleus") are cached, and
they are stored together with the raw data as arrays named raw, er, mito and nucleus.
The `label_choice` parameter in the loader selects which array to use as labels.

Bounding boxes are specified as (x_min, x_max, y_min, y_max, z_min, z_max) in voxels.
The full volume is (597, 20000, 20000) voxels (z, y, x). Tissue spans roughly
x=[1195, 18890], y=[469, 19570] - the volume edges are empty.

This dataset is from the publication https://www.biorxiv.org/content/10.64898/2026.04.22.719970v1.
Please cite it if you use this dataset in your research.

The data is publicly available at https://www.ebi.ac.uk/empiar/EMPIAR-13356/.
"""

import hashlib
import io
import os
from typing import List, Literal, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EMPIAR_BASE = "https://ftp.ebi.ac.uk/empiar/world_availability/13356/data"

HUMAN_LIVER_EM_LABEL_DIRS = {
    "er": "humanliver_er_mask",
    "mito": "humanliver_mito_mask",
    "nucleus": "humanliver_nucleus_mask",
}

HUMAN_LIVER_EM_SHAPE = (597, 20000, 20000)
# Tissue spans x=[1195,18890], y=[469,19570] - edges are empty background.
HUMAN_LIVER_EM_TISSUE_BBOX = (1195, 18890, 469, 19570, 0, 597)

# Zarr layout for bbox crops.
HUMAN_LIVER_EM_CHUNK_SHAPE = (64, 256, 256)
# Zarr layout for full-volume sharded store.
# Shards: (64, 4096, 4096) outer; chunks: (8, 256, 256) inner.
HUMAN_LIVER_EM_SHARD_SHAPE = (64, 4096, 4096)
HUMAN_LIVER_EM_INNER_CHUNK = (8, 256, 256)

LabelChoice = Literal["er", "mito", "nucleus"]


def _bbox_to_str(bbox):
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


class _HttpFile:
    """Seekable file-like object backed by HTTP range requests for efficient partial TIFF reading."""

    def __init__(self, url):
        import requests
        self.url = url
        self._pos = 0
        r = requests.head(url, timeout=30)
        r.raise_for_status()
        self._size = int(r.headers["Content-Length"])

    def read(self, n=-1):
        import requests
        # Treat None and negative sizes as "read to the end of the file".
        if n is None or n < 0:
            end = self._size - 1
        else:
            end = min(self._pos + n - 1, self._size - 1)
        if self._pos > end:
            return b""
        r = requests.get(self.url, headers={"Range": f"bytes={self._pos}-{end}"}, timeout=120)
        r.raise_for_status()
        data = r.content
        self._pos += len(data)
        return data

    def seek(self, pos, whence=0):
        if whence == 0:
            self._pos = pos
        elif whence == 1:
            self._pos += pos
        elif whence == 2:
            self._pos = self._size + pos
        # Clamp the position to the valid byte range of the remote file.
        self._pos = max(0, min(self._pos, self._size))
        return self._pos

    def tell(self):
        return self._pos

    def seekable(self):
        return True

    def readable(self):
        return True

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass


def _read_raw_slice(z, x_min, x_max, y_min, y_max):
    """Read a cropped region from a remote TIFF using a single HTTP range request covering
    only the strips needed for the y range, avoiding downloading the full ~880 MB file."""
    import requests
    import tifffile
    from imagecodecs import delta_decode, lzw_decode

    url = f"{EMPIAR_BASE}/humanliver_raw_images/humanliver_raw_{z:03d}.tif"

    # Read TIFF metadata to get strip offsets and byte counts.
    with tifffile.TiffFile(_HttpFile(url)) as tif:
        page = tif.pages[0]
        offsets = page.dataoffsets
        bytecounts = page.databytecounts
        width = page.imagewidth
        dtype = np.dtype(page.dtype)

    # One strip per row - request only strips y_min..y_max in one range request.
    # This assumes that the strips are stored contiguously and in row order.
    start_byte = offsets[y_min]
    end_byte = offsets[y_max - 1] + bytecounts[y_max - 1] - 1
    r = requests.get(url, headers={"Range": f"bytes={start_byte}-{end_byte}"}, timeout=120)
    r.raise_for_status()
    raw_bytes = r.content

    # Predictor=2 (horizontal differencing) is used - need cumsum after LZW decode.
    rows = []
    for i in range(y_min, y_max):
        strip_start = offsets[i] - start_byte
        strip_data = raw_bytes[strip_start:strip_start + bytecounts[i]]
        decoded = np.frombuffer(lzw_decode(strip_data), dtype=dtype)
        # np.frombuffer can return a read-only view, so let delta_decode allocate its output.
        row = delta_decode(decoded, axis=-1, dist=1).reshape(width)
        rows.append(row[x_min:x_max])

    return np.stack(rows, axis=0)


def _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max):
    import PIL.Image
    import requests
    # Disable Pillow's size limit - a full mask slice has 20000 x 20000 pixels.
    PIL.Image.MAX_IMAGE_PIXELS = None
    label_dir = HUMAN_LIVER_EM_LABEL_DIRS[label_key]
    url = f"{EMPIAR_BASE}/{label_dir}/{label_dir}_{z:03d}.png"
    # PNGs cannot be read partially, so the full slice is downloaded and then cropped.
    r = requests.get(url, timeout=120)
    r.raise_for_status()
    img = np.array(PIL.Image.open(io.BytesIO(r.content)))
    if img.ndim == 3:
        img = img[..., 0]
    return img[y_min:y_max, x_min:x_max]


def get_human_liver_em_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int],
    download: bool = False,
) -> str:
    """Stream a subvolume from the human liver EM dataset and cache it as a zarr v3 store.

    The raw EM and all label classes in HUMAN_LIVER_EM_LABEL_DIRS ("er", "mito", "nucleus")
    are stored in a single zarr, so each bounding box is only downloaded once regardless
    of which label_choice is used.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in voxel coordinates. Tissue spans roughly x=[1195,18890], y=[469,19570].
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr
    from zarr.codecs import BloscCodec

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.zarr")

    root = zarr.open_group(zarr_path, mode="a")
    all_keys = ["raw"] + list(HUMAN_LIVER_EM_LABEL_DIRS.keys())
    if all(k in root for k in all_keys):
        return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from EMPIAR."
        )

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    shape = (z_max - z_min, y_max - y_min, x_max - x_min)

    print(f"Streaming Human Liver EM + all labels for bbox {bounding_box} ...")
    # The whole crop is assembled in memory before it is written to the zarr store.
    raw_vol = np.zeros(shape, dtype=np.uint16)
    label_vols = {k: np.zeros(shape, dtype=np.uint8) for k in HUMAN_LIVER_EM_LABEL_DIRS}

    for i, z in enumerate(range(z_min, z_max)):
        raw_vol[i] = _read_raw_slice(z, x_min, x_max, y_min, y_max)
        for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
            label_vols[label_key][i] = _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max)
        if (i + 1) % 5 == 0:
            print(f"  {i + 1}/{z_max - z_min} slices done")

    def _make_array(name, data, is_label):
        shuffle = "bitshuffle" if is_label else "shuffle"
        arr = root.create_array(
            name, shape=data.shape, chunks=HUMAN_LIVER_EM_CHUNK_SHAPE, dtype=data.dtype,
            compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
        )
        arr[:] = data

    root.attrs["bounding_box"] = list(bounding_box)

    if "raw" not in root:
        _make_array("raw", raw_vol, is_label=False)
    for label_key, vol in label_vols.items():
        if label_key not in root:
            _make_array(label_key, vol, is_label=True)

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_human_liver_em_full_volume(
    path: Union[os.PathLike, str],
    download: bool = False,
) -> str:
    """Download the full human liver EM tissue volume into a sharded zarr v3 store.

    Downloads all 597 z-slices for the tissue region x=[1195,18890], y=[469,19570]
    with raw EM + er/mito/nucleus labels. Data is written slice by slice to avoid
    memory issues. Estimated storage: ~100-150 GB compressed. Estimated download
    time: ~12 hours (one-time cost).

    The sharded zarr uses shard shape (64, 4096, 4096) with inner chunks (8, 256, 256),
    enabling efficient random crop access during training without loading the full volume.

    Args:
        path: Filepath to a folder where the zarr store will be saved.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the full-volume zarr store.
    """
    import zarr
    from zarr.codecs import BloscCodec

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), "full_volume.zarr")

    x_min, x_max, y_min, y_max, z_min, z_max = HUMAN_LIVER_EM_TISSUE_BBOX
    shape = (z_max - z_min, y_max - y_min, x_max - x_min)

    root = zarr.open_group(zarr_path, mode="a")
    all_keys = ["raw"] + list(HUMAN_LIVER_EM_LABEL_DIRS.keys())
    # NOTE: the arrays are created before streaming starts, so a store left behind by an
    # interrupted download also passes this check.
    if all(k in root for k in all_keys):
        return zarr_path

    if not download:
        raise RuntimeError(
            f"Full-volume zarr not found at '{zarr_path}'. Set download=True to stream from EMPIAR."
            " Note: download takes ~12 hours and requires ~100-150 GB disk space."
        )

    def _make_sharded(name, dtype, is_label):
        shuffle = "bitshuffle" if is_label else "shuffle"
        # zarr-python 3 takes the outer shard shape via `shards` and the inner chunk shape
        # via `chunks`; `compressors` is meant for bytes-to-bytes codecs such as Blosc.
        return root.create_array(
            name, shape=shape, shards=HUMAN_LIVER_EM_SHARD_SHAPE,
            chunks=HUMAN_LIVER_EM_INNER_CHUNK, dtype=dtype,
            compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
        )

    if "raw" not in root:
        _make_sharded("raw", np.dtype("uint16"), is_label=False)
    for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
        if label_key not in root:
            _make_sharded(label_key, np.dtype("uint8"), is_label=True)

    root.attrs["tissue_bbox"] = list(HUMAN_LIVER_EM_TISSUE_BBOX)
    n_slices = z_max - z_min
    print(f"Streaming full Human Liver EM volume ({shape}) - this will take several hours ...")

    for i, z in enumerate(range(z_min, z_max)):
        raw_slice = _read_raw_slice(z, x_min, x_max, y_min, y_max)
        root["raw"][i] = raw_slice
        for label_key in HUMAN_LIVER_EM_LABEL_DIRS:
            mask_slice = _read_mask_slice(label_key, z, x_min, x_max, y_min, y_max)
            root[label_key][i] = mask_slice
        if (i + 1) % 10 == 0:
            print(f"  {i + 1}/{n_slices} slices done")

    print(f"Full volume cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_human_liver_em_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    full_volume: bool = False,
) -> List[str]:
    """Get paths to human liver EM zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        download: Whether to stream and cache the data if it is not present.
        full_volume: If True, download/use the full tissue volume as a single sharded
            zarr v3 store (~12h download, ~100-150 GB). Supersedes bounding_boxes.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    if full_volume:
        return [get_human_liver_em_full_volume(path, download)]
    if bounding_boxes is None:
        raise ValueError("Provide bounding_boxes or set full_volume=True.")
    return [get_human_liver_em_data(path, bbox, download) for bbox in bounding_boxes]


def get_human_liver_em_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> Dataset:
    """Get the human liver EM dataset for semantic segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training. The pixel
            resolution is unconfirmed (estimated ~50-100 nm/px xy). Check the
            paper for the exact values when selecting isotropic patch shapes.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        label_choice: Which structure to segment.
        download: Whether to stream and cache data if not already present.
        full_volume: If True, use the full sharded tissue volume (~12h one-time download).
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_human_liver_em_paths(path, bounding_boxes, download, full_volume)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key=label_choice,
        patch_shape=patch_shape,
        **kwargs,
    )


def get_human_liver_em_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for semantic segmentation in the human liver EM dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        label_choice: Which structure to segment.
        download: Whether to stream and cache data if not already present.
        full_volume: If True, use the full sharded tissue volume (~12h one-time download).
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_human_liver_em_dataset(
        path, patch_shape, bounding_boxes, label_choice=label_choice,
        download=download, full_volume=full_volume, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
EMPIAR_BASE = 'https://ftp.ebi.ac.uk/empiar/world_availability/13356/data'
HUMAN_LIVER_EM_LABEL_DIRS = {'er': 'humanliver_er_mask', 'mito': 'humanliver_mito_mask', 'nucleus': 'humanliver_nucleus_mask'}
HUMAN_LIVER_EM_SHAPE = (597, 20000, 20000)
HUMAN_LIVER_EM_TISSUE_BBOX = (1195, 18890, 469, 19570, 0, 597)
HUMAN_LIVER_EM_CHUNK_SHAPE = (64, 256, 256)
HUMAN_LIVER_EM_SHARD_SHAPE = (64, 4096, 4096)
HUMAN_LIVER_EM_INNER_CHUNK = (8, 256, 256)
LabelChoice = typing.Literal['er', 'mito', 'nucleus']
def get_human_liver_em_data(path: Union[os.PathLike, str], bounding_box: Tuple[int, int, int, int, int, int], download: bool = False) -> str:

Stream a subvolume from the human liver EM dataset and cache it as a zarr v3 store.

The raw EM and all label classes exposed by the loader ("er", "mito", "nucleus") are stored in a single zarr, so each bounding box is only downloaded once regardless of which label_choice is used.

Arguments:
  • path: Filepath to a folder where the cached zarr store will be saved.
  • bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Tissue spans roughly x=[1195,18890], y=[469,19570].
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the cached zarr store.
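The store name is derived from the bounding box, so requesting the same box again resolves to the same cache entry. The sketch below mirrors the naming scheme of the module's private `_bbox_to_str` helper; `cache_path` itself is a hypothetical function and the boxes are arbitrary.

```python
import hashlib
import os


def cache_path(folder: str, bbox) -> str:
    """Cache location for a bounding box: first 12 hex digits of the MD5 of its values."""
    key = hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]
    return os.path.join(folder, f"{key}.zarr")


# The same box always maps to the same store; changing any value gives a new one.
first = cache_path("data", (4000, 4512, 6000, 6512, 100, 164))
again = cache_path("data", (4000, 4512, 6000, 6512, 100, 164))
other = cache_path("data", (4000, 4512, 6000, 6512, 100, 165))
```

A consequence of this scheme is that overlapping but non-identical boxes are cached as separate stores and downloaded independently.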

def get_human_liver_em_full_volume(path: Union[os.PathLike, str], download: bool = False) -> str:

Download the full human liver EM tissue volume into a sharded zarr v3 store.

Downloads all 597 z-slices for the tissue region x=[1195,18890], y=[469,19570] with raw EM + er/mito/nucleus labels. Data is written slice by slice to avoid memory issues. Estimated storage: ~100-150 GB compressed. Estimated download time: ~12 hours (one-time cost).

The sharded zarr uses shard shape (64, 4096, 4096) with inner chunks (8, 256, 256), enabling efficient random crop access during training without loading the full volume.

Arguments:
  • path: Filepath to a folder where the zarr store will be saved.
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the full-volume zarr store.
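To get a feeling for the layout, the number of shards and the uncompressed size can be worked out from the constants listed for this module. The sketch below only does that arithmetic. It copies the constant values from above and makes no claim about the compressed size on disk.

```python
import math

# Values copied from the module constants listed above.
TISSUE_BBOX = (1195, 18890, 469, 19570, 0, 597)  # (x_min, x_max, y_min, y_max, z_min, z_max)
SHARD_SHAPE = (64, 4096, 4096)
INNER_CHUNK = (8, 256, 256)

x_min, x_max, y_min, y_max, z_min, z_max = TISSUE_BBOX
shape = (z_max - z_min, y_max - y_min, x_max - x_min)  # (z, y, x)

# Shards per axis (partial shards at the border count as full shards).
shard_grid = tuple(math.ceil(s / sh) for s, sh in zip(shape, SHARD_SHAPE))
n_shards = math.prod(shard_grid)

# Inner chunks stored in each full shard.
chunks_per_shard = math.prod(sh // c for sh, c in zip(SHARD_SHAPE, INNER_CHUNK))

# Uncompressed size: uint16 raw data plus three uint8 label arrays.
n_voxels = math.prod(shape)
raw_gb = n_voxels * 2 / 1e9
labels_gb = n_voxels * 3 / 1e9

print(f"shape={shape}, shards per array={n_shards}, chunks per shard={chunks_per_shard}")
print(f"uncompressed: raw {raw_gb:.0f} GB, labels {labels_gb:.0f} GB")
```

With these values each array is split into 250 shards of up to 2048 inner chunks. A random training patch therefore touches a few small chunks rather than a whole shard.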

def get_human_liver_em_paths(path: Union[os.PathLike, str], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, full_volume: bool = False) -> List[str]:

Get paths to human liver EM zarr stores.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
  • download: Whether to stream and cache the data if it is not present.
  • full_volume: If True, download/use the full tissue volume as a single sharded zarr v3 store (~12h download, ~100-150 GB). Supersedes bounding_boxes.
Returns:

List of filepaths to the cached zarr stores.

def get_human_liver_em_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, label_choice: Literal['er', 'mito', 'nucleus'] = 'mito', download: bool = False, full_volume: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the human liver EM dataset for semantic segmentation.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training. The pixel resolution is unconfirmed (estimated ~50-100 nm/px xy). Check the paper for the exact values when selecting isotropic patch shapes.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
  • label_choice: Which structure to segment. The type annotation accepts "er", "mito", or "nucleus"; the default is "mito".
  • download: Whether to stream and cache data if not already present.
  • full_volume: If True, use the full sharded tissue volume (~12h one-time download).
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
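Note that patch_shape is given in (z, y, x) order while each bounding box is given in (x, y, z) order, which makes it easy to request a patch that does not fit. A minimal sketch of a pre-flight check follows. `bbox_shape_zyx`, `patch_fits`, and `build_dataset` are hypothetical helpers, the import path is taken from the module name at the top of this page, and computing the extent as max minus min assumes exclusive upper bounds.

```python
from typing import Sequence, Tuple

# (x_min, x_max, y_min, y_max, z_min, z_max), as documented for bounding_boxes.
BBox = Tuple[int, int, int, int, int, int]


def bbox_shape_zyx(bbox: BBox) -> Tuple[int, int, int]:
    """Reorder a bounding box into a (z, y, x) shape, matching patch_shape."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    return (z_max - z_min, y_max - y_min, x_max - x_min)


def patch_fits(patch_shape: Tuple[int, int, int], bboxes: Sequence[BBox]) -> bool:
    """Return True if the (z, y, x) patch fits inside every bounding box."""
    return all(
        all(p <= s for p, s in zip(patch_shape, bbox_shape_zyx(bbox)))
        for bbox in bboxes
    )


def build_dataset(path, patch_shape, bboxes, label_choice="mito"):
    """Check the patch shape, then defer to get_human_liver_em_dataset."""
    # Imported lazily so the helpers above can be used without torch_em installed.
    from torch_em.data.datasets.electron_microscopy.human_liver_em import (
        get_human_liver_em_dataset,
    )

    if not patch_fits(patch_shape, bboxes):
        raise ValueError(f"patch_shape {patch_shape} exceeds a bounding box.")
    return get_human_liver_em_dataset(
        path,
        patch_shape,
        bounding_boxes=list(bboxes),
        label_choice=label_choice,
        download=True,  # streams and caches the data on first use
    )


boxes = [(0, 1024, 0, 512, 0, 64)]
print(bbox_shape_zyx(boxes[0]), patch_fits((32, 256, 256), boxes))
```

Calling `build_dataset` triggers the download for the requested boxes, so the check runs before any data is fetched.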

def get_human_liver_em_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, label_choice: Literal['er', 'mito', 'nucleus'] = 'mito', download: bool = False, full_volume: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:
def get_human_liver_em_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    label_choice: LabelChoice = "mito",
    download: bool = False,
    full_volume: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for semantic segmentation in the human liver EM dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
            Ignored when full_volume=True.
        label_choice: Which structure to segment.
        download: Whether to stream and cache data if not already present.
        full_volume: If True, use the full sharded tissue volume (~12h one-time download).
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_human_liver_em_dataset(
        path, patch_shape, bounding_boxes, label_choice=label_choice,
        download=download, full_volume=full_volume, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)

Get the DataLoader for semantic segmentation in the human liver EM dataset.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates. Ignored when full_volume=True.
  • label_choice: Which structure to segment. The type annotation accepts "er", "mito", or "nucleus"; the default is "mito".
  • download: Whether to stream and cache data if not already present.
  • full_volume: If True, use the full sharded tissue volume (~12h one-time download).
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
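The loader accepts a single kwargs dictionary and routes each entry either to the dataset constructor or to the DataLoader. The sketch below illustrates one plausible way such a split can work, by matching keyword names against a function signature. It is a stand-in for illustration, not the implementation of `util.split_kwargs`; `split_kwargs_sketch` and `fake_dataset_factory` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_kwargs_sketch(
    function: Callable, **kwargs: Any
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into those named in the function's signature and the rest."""
    params = inspect.signature(function).parameters
    matched = {k: v for k, v in kwargs.items() if k in params}
    rest = {k: v for k, v in kwargs.items() if k not in params}
    return matched, rest


def fake_dataset_factory(raw_paths, raw_key, n_samples=None, sampler=None):
    """Hypothetical stand-in for a dataset constructor; only its signature matters."""
    return None


ds_kwargs, loader_kwargs = split_kwargs_sketch(
    fake_dataset_factory, n_samples=100, num_workers=4, shuffle=True
)
print(ds_kwargs)      # arguments the dataset factory names
print(loader_kwargs)  # everything else, left for the DataLoader
```

Under this scheme a keyword that the dataset constructor does not name is passed to the DataLoader side, so a misspelled dataset argument would surface as a DataLoader error rather than a dataset error.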