torch_em.data.datasets.electron_microscopy.parlakgul_liver

The Parlakgul liver dataset contains FIB-SEM volumes of mouse liver with dense semantic segmentation of 7 organelle classes. Each label volume is a binary semantic mask (0 = background, 1 = foreground), not an instance segmentation.

Four FIB-SEM volumes are available across lean and obese conditions (sizes listed as x × y × z, the reverse of the (z, y, x) shape tuples in the module source):

  • 6461 (lean): 12000 x 8000 x 5638 voxels, 8 nm isotropic
  • 6464 (obese 1): 9112 x 10200 x 7896 voxels, 8 nm isotropic
  • 9430 (obese 2): 8000 x 8050 x 8501 voxels, 8 nm isotropic
  • 1857 (obese Climp63): 9700 x 9650 x 3629 voxels, 8 nm isotropic
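To give a sense of scale, the voxel counts above can be converted to physical extents and uncompressed sizes. The sketch below is plain arithmetic on the listed numbers. The one-byte-per-voxel figure is an assumption based on the uint8 arrays the module allocates, and `describe_volume` is a hypothetical helper, not part of torch_em.

```python
# Voxel counts as listed above, with 8 nm isotropic voxels.
VOLUMES = {
    "6461": (12000, 8000, 5638),
    "6464": (9112, 10200, 7896),
    "9430": (8000, 8050, 8501),
    "1857": (9700, 9650, 3629),
}
VOXEL_SIZE_NM = 8


def describe_volume(dims, voxel_size_nm=VOXEL_SIZE_NM, bytes_per_voxel=1):
    """Return (extent per axis in micrometres, uncompressed size in gigabytes)."""
    extent_um = tuple(d * voxel_size_nm / 1000 for d in dims)
    n_voxels = dims[0] * dims[1] * dims[2]
    size_gb = n_voxels * bytes_per_voxel / 1e9  # assumes 1 byte per voxel by default
    return extent_um, size_gb


for name, dims in VOLUMES.items():
    extent_um, size_gb = describe_volume(dims)
    print(name, extent_um, f"{size_gb:.1f} GB")
```

Under that assumption the lean sample 6461 spans 96 x 64 x 45.1 micrometres and about 541 GB of raw data, which is why the module fetches sub-volumes rather than whole samples.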

Seven semantic segmentation classes are available via the label_choice parameter:

  • "er": endoplasmic reticulum
  • "er_sheets": ER sheets
  • "er_tubules": ER tubules
  • "mito": mitochondria
  • "lipid_droplet": lipid droplets
  • "nuclear_membrane": nuclear membrane
  • "plasma_membrane": plasma membrane (not available for 1857)

Data is streamed lazily from EMPIAR-10791 via HTTP. Raw TIFFs are fetched one z-slice at a time, and the matching segmentation slices are extracted from remote ZIP archives using HTTP range requests, so the archives are never downloaded in full. Each fetched slice is cropped to the requested bounding box, and only the cropped region is cached as zarr v3.
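Reading one slice out of a remote ZIP works because the ZIP format keeps an index (the central directory) at the end of the file, so a reader only needs that index plus the bytes of the requested member. The sketch below simulates this locally: `CountingFile` and `build_demo_zip` are hypothetical stand-ins, and each `read()` call corresponds to what would be one HTTP range request against the real archive.

```python
import io
import zipfile


class CountingFile(io.BytesIO):
    """In-memory file that counts the bytes handed out by read().

    It stands in for a remote file: every read() here would be one
    HTTP range request in a real streaming setup.
    """

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        data = super().read(size)
        self.bytes_read += len(data)
        return data


def build_demo_zip(n_slices=8, slice_bytes=50_000):
    """Create an uncompressed ZIP with one fake 'slice' per member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
        for z in range(n_slices):
            zf.writestr(f"slice_{z:04d}.tif", bytes([z]) * slice_bytes)
    return buffer.getvalue()


archive = build_demo_zip()
remote = CountingFile(archive)

with zipfile.ZipFile(remote) as zf:
    names = sorted(zf.namelist())
    payload = zf.read(names[3])  # fetch a single member

print(f"archive size: {len(archive)} bytes, bytes read: {remote.bytes_read}")
```

Only the index and one member are read, so the bytes transferred stay close to the size of a single slice regardless of how many slices the archive holds.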

Bounding boxes are specified as (x_min, x_max, y_min, y_max, z_min, z_max) in voxels.
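Because bounding boxes are given in x, y, z order while arrays are stored in (z, y, x) order, it helps to make the conversion explicit. The following is a minimal sketch with hypothetical helper names; it treats the upper bounds as exclusive, which matches the module's use of `range(z_min, z_max)` and array slicing.

```python
def bbox_to_slices(bbox):
    """Convert (x_min, x_max, y_min, y_max, z_min, z_max) to (z, y, x) slices."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    if not (0 <= x_min < x_max and 0 <= y_min < y_max and 0 <= z_min < z_max):
        raise ValueError(f"Invalid bounding box: {bbox}")
    return slice(z_min, z_max), slice(y_min, y_max), slice(x_min, x_max)


def bbox_shape(bbox):
    """Shape (z, y, x) of the array that a bounding box selects."""
    return tuple(s.stop - s.start for s in bbox_to_slices(bbox))


bbox = (2000, 2512, 3000, 3256, 100, 164)
print(bbox_shape(bbox))  # (64, 256, 512)
```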

This dataset is from the publication https://doi.org/10.1038/s41586-022-04518-2. Please cite it if you use this dataset in your research.

The data is publicly available at https://www.ebi.ac.uk/empiar/EMPIAR-10791/.

import hashlib
import io
import os
import zipfile
from typing import Dict, List, Literal, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EMPIAR_BASE = "https://ftp.ebi.ac.uk/empiar/world_availability/10791/data"
PARLAKGUL_PAPER_BASE = (
    "Parlakgul%20et%20al%20-%20Regulation%20of%20liver%20subcellular%20architecture%20"
    "controls%20metabolic%20homeostasis/FIB-SEM%20Raw%20and%20Segmentation%20Data"
)

PARLAKGUL_SAMPLES: Dict[str, dict] = {
    "6461": {
        "condition": "lean",
        "raw_dir": "6461%20-%20Lean%20Liver/6461%20Lean%20Liver%20-%20Raw",
        "seg_dir": "6461%20-%20Lean%20Liver/6461%20Lean%20Liver%20-%20Segmentation",
        "raw_pattern": "Gunes_WT1_8x8x8nm_3MHz.{z:04d}.tif",
        "shape": (5638, 8000, 12000),
        "seg_zips": {
            "er": "6461%20Lean%20ER.zip",
            "er_sheets": "6461%20Lean%20ER%20Sheets.zip",
            "er_tubules": "6461%20Lean%20ER%20Tubules.zip",
            "mito": "6461%20Lean%20Mitochondria.zip",
            "lipid_droplet": "6461%20Lean%20Lipid%20Droplet.zip",
            "nuclear_membrane": "6461%20Lean%20Nuclear%20membrane.zip",
            "plasma_membrane": "6461%20Lean%20Plasma%20Membrane.zip",
        },
    },
    "6464": {
        "condition": "obese1",
        "raw_dir": "6464%20-%20Obese1%20Liver/6464%20Obese1%20Liver%20-%20Raw",
        "seg_dir": "6464%20-%20Obese1%20Liver/6464%20Obese1%20Liver%20-%20Segmentation",
        "raw_pattern": "Gunes_HFD1_8x8x8nm_3MHz.{z:04d}.tif",
        "shape": (7896, 10200, 9112),
        "seg_zips": {
            "er": "6464%20Obese1%20ER.zip",
            "er_sheets": "6464%20Obese1%20ER%20Sheets.zip",
            "er_tubules": "6464%20Obese1%20ER%20Tubules.zip",
            "mito": "6464%20Obese1%20Mitochondria.zip",
            "lipid_droplet": "6464%20Obese1%20Lipid%20Droplet.zip",
            "nuclear_membrane": "6464%20Obese1%20Nuclear%20membrane.zip",
            "plasma_membrane": "6464%20Obese1%20Plasma%20Membrane.zip",
        },
    },
    "9430": {
        "condition": "obese2",
        "raw_dir": "9430%20-%20Obese2%20Liver/9430%20Obese2%20Liver%20-%20Raw",
        "seg_dir": "9430%20-%20Obese2%20Liver/9430%20Obese2%20Liver%20-%20Segmentation",
        "raw_pattern": "Gunes_HFD2_8x8x8nm_3MHz.{z:04d}.tif",
        "shape": (8501, 8050, 8000),
        "seg_zips": {
            "er": "9430%20Obese2%20ER.zip",
            "er_sheets": "9430%20Obese2%20ER%20Sheets.zip",
            "er_tubules": "9430%20Obese2%20ER%20Tubules.zip",
            "mito": "9430%20Obese2%20Mitochondria.zip",
            "lipid_droplet": "9430%20Obese2%20Lipid%20Droplet.zip",
            "nuclear_membrane": "9430%20Obese2%20Nuclear%20membrane.zip",
            "plasma_membrane": "9430%20Obese2%20Plasma%20Membrane.zip",
        },
    },
    "1857": {
        "condition": "obese_climp63",
        "raw_dir": "1857%20-%20Obese%20Climp-63%20Liver/1857%20Obese%20Climp63%20Liver%20-%20Raw",
        "seg_dir": "1857%20-%20Obese%20Climp-63%20Liver/1857%20Obese%20Climp63%20Liver%20-%20Segmentation",
        "raw_pattern": "Gunes_CLIMP63_8x8x8nm_3MHz.{z:04d}.tif",
        "shape": (3629, 9650, 9700),
        "seg_zips": {
            "er": "1857%20Obese%20Climp63%20ER.zip",
            "er_sheets": "1857%20Obese%20Climp63%20ER%20Sheets.zip",
            "er_tubules": "1857%20Obese%20Climp63%20ER%20Tubules.zip",
            "mito": "1857%20Obese%20Climp63%20Mitochondria.zip",
            "lipid_droplet": "1857%20Obese%20Climp63%20Lipid%20Droplet.zip",
            "nuclear_membrane": "1857%20Obese%20Climp63%20Nuclear%20membrane.zip",
        },
    },
}

PARLAKGUL_CHUNK_SHAPE = (64, 256, 256)
LabelChoice = Literal[
    "er", "er_sheets", "er_tubules", "mito", "lipid_droplet", "nuclear_membrane", "plasma_membrane"
]


def _bbox_to_str(bbox):
    """Return a short, deterministic key for a bounding box (used in the cache file name)."""
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


class _HttpFile:
    """Seekable, read-only file-like object backed by HTTP range requests."""

    def __init__(self, url):
        import requests
        self.url = url
        self._pos = 0
        # A HEAD request provides the total size, which is needed to seek from the end.
        r = requests.head(url, timeout=30)
        r.raise_for_status()
        self._size = int(r.headers["Content-Length"])

    def read(self, n=-1):
        import time
        import requests
        # Treat None and negative sizes as "read to the end", like regular file objects.
        read_all = n is None or n < 0
        end = (self._size - 1) if read_all else min(self._pos + n - 1, self._size - 1)
        if self._pos > end:
            return b""
        # Retry failed requests with exponential backoff (1, 2, 4, 8 s between attempts).
        for attempt in range(5):
            try:
                r = requests.get(self.url, headers={"Range": f"bytes={self._pos}-{end}"}, timeout=120)
                # Fail on HTTP errors instead of treating an error page as file content.
                r.raise_for_status()
                data = r.content
                self._pos += len(data)
                return data
            except Exception:
                if attempt == 4:
                    raise
                time.sleep(2 ** attempt)

    def seek(self, pos, whence=0):
        if whence == 0:
            self._pos = pos
        elif whence == 1:
            self._pos += pos
        elif whence == 2:
            self._pos = self._size + pos
        # Clamp the position to the valid range [0, size].
        self._pos = max(0, min(self._pos, self._size))
        return self._pos

    def tell(self):
        return self._pos

    def seekable(self):
        return True

    def readable(self):
        return True

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass


def _read_zip_slice(zip_url, slice_idx, x_min, x_max, y_min, y_max):
    """Extract one segmentation TIFF from a remote ZIP using HTTP range requests."""
    import tifffile

    # Opening the archive reads only its central directory; zf.read then fetches one member.
    with zipfile.ZipFile(_HttpFile(zip_url)) as zf:
        # Members are indexed by sorted file name, which is assumed to follow z-order.
        names = sorted(n for n in zf.namelist() if n.endswith((".tiff", ".tif")))
        if not 0 <= slice_idx < len(names):
            raise IndexError(f"Slice {slice_idx} out of range (zip has {len(names)} TIFFs)")
        data = zf.read(names[slice_idx])
    img = tifffile.imread(io.BytesIO(data))
    return img[y_min:y_max, x_min:x_max]


def _read_raw_slice(raw_url, x_min, x_max, y_min, y_max):
    """Download one raw TIFF slice and crop to the requested region."""
    import time
    import requests
    import tifffile

    # The whole slice is downloaded; cropping happens locally after decoding.
    # Failed attempts are retried with exponential backoff (1, 2, 4, 8 s).
    for attempt in range(5):
        try:
            r = requests.get(raw_url, timeout=180)
            r.raise_for_status()
            img = tifffile.imread(io.BytesIO(r.content))
            return img[y_min:y_max, x_min:x_max]
        except Exception:
            if attempt == 4:
                raise
            time.sleep(2 ** attempt)


def get_parlakgul_liver_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int],
    sample: Literal["6461", "6464", "9430", "1857"] = "6461",
    label_choice: LabelChoice = "mito",
    download: bool = False,
) -> str:
    """Stream a subvolume from the Parlakgul liver dataset and cache it as a zarr v3 store.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in voxel coordinates at 8 nm isotropic resolution.
        sample: Which liver sample to use. One of "6461" (lean), "6464" (obese 1),
            "9430" (obese 2), "1857" (obese Climp63).
        label_choice: Which organelle segmentation to use as labels.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr
    from zarr.codecs import BloscCodec

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{sample}_{label_choice}_{_bbox_to_str(bounding_box)}.zarr")

    # Reuse a complete cached store. Only an existing store is opened here, so a cache
    # miss with download=False does not leave an empty zarr group on disk.
    if os.path.exists(zarr_path):
        root = zarr.open_group(zarr_path, mode="a")
        if "raw" in root and "labels" in root:
            return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from EMPIAR."
        )

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    if not (0 <= x_min < x_max and 0 <= y_min < y_max and 0 <= z_min < z_max):
        raise ValueError(f"Invalid bounding_box={bounding_box}; expected 0 <= min < max on every axis.")
    sample_info = PARLAKGUL_SAMPLES[sample]

    if label_choice not in sample_info["seg_zips"]:
        raise ValueError(f"label_choice='{label_choice}' not available for sample='{sample}'")

    # Arrays are stored in (z, y, x) order.
    shape = (z_max - z_min, y_max - y_min, x_max - x_min)
    raw_arr = np.zeros(shape, dtype=np.uint8)
    lbl_arr = np.zeros(shape, dtype=np.uint8)

    raw_base = f"{EMPIAR_BASE}/{PARLAKGUL_PAPER_BASE}/{sample_info['raw_dir']}"
    zip_name = sample_info["seg_zips"][label_choice]
    seg_zip_url = f"{EMPIAR_BASE}/{PARLAKGUL_PAPER_BASE}/{sample_info['seg_dir']}/{zip_name}"

    print(f"Streaming Parlakgul {sample} ({sample_info['condition']}) EM + {label_choice} ...")
    for i, z in enumerate(range(z_min, z_max)):
        fname = sample_info["raw_pattern"].format(z=z)
        raw_url = f"{raw_base}/{fname}"
        raw_arr[i] = _read_raw_slice(raw_url, x_min, x_max, y_min, y_max)
        lbl_arr[i] = _read_zip_slice(seg_zip_url, z, x_min, x_max, y_min, y_max)
        if (i + 1) % 10 == 0:
            print(f"  {i + 1}/{z_max - z_min} slices done")

    # Create (or reopen) the store only after streaming has succeeded.
    root = zarr.open_group(zarr_path, mode="a")

    def _make_array(name, data, is_label):
        # Labels use bit-level shuffling, raw data byte-level shuffling, before zstd compression.
        shuffle = "bitshuffle" if is_label else "shuffle"
        arr = root.create_array(
            name, shape=data.shape, chunks=PARLAKGUL_CHUNK_SHAPE, dtype=data.dtype,
            compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
        )
        arr[:] = data

    root.attrs["bounding_box"] = list(bounding_box)
    root.attrs["sample"] = sample
    root.attrs["label_choice"] = label_choice
    root.attrs["resolution_nm"] = [8, 8, 8]

    if "raw" not in root:
        _make_array("raw", raw_arr, is_label=False)
    if "labels" not in root:
        _make_array("labels", lbl_arr, is_label=True)

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_parlakgul_liver_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: List[Tuple[int, int, int, int, int, int]],
    sample: Literal["6461", "6464", "9430", "1857"] = "6461",
    label_choice: LabelChoice = "mito",
    download: bool = False,
) -> List[str]:
    """Get paths to Parlakgul liver zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
        sample: Which liver sample to use.
        label_choice: Which organelle segmentation to use as labels.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    return [get_parlakgul_liver_data(path, bbox, sample, label_choice, download) for bbox in bounding_boxes]


def get_parlakgul_liver_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: List[Tuple[int, int, int, int, int, int]],
    sample: Literal["6461", "6464", "9430", "1857"] = "6461",
    label_choice: LabelChoice = "mito",
    download: bool = False,
    **kwargs,
) -> Dataset:
    """Get the Parlakgul liver dataset for organelle segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
        sample: Which liver sample to use. One of "6461", "6464", "9430", "1857".
        label_choice: Which organelle to segment.
        download: Whether to stream and cache data if not already present.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_parlakgul_liver_paths(path, bounding_boxes, sample, label_choice, download)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )


def get_parlakgul_liver_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: List[Tuple[int, int, int, int, int, int]],
    sample: Literal["6461", "6464", "9430", "1857"] = "6461",
    label_choice: LabelChoice = "mito",
    download: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for organelle segmentation in the Parlakgul liver dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
        sample: Which liver sample to use. One of "6461", "6464", "9430", "1857".
        label_choice: Which organelle to segment.
        download: Whether to stream and cache data if not already present.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_parlakgul_liver_dataset(
        path, patch_shape, bounding_boxes, sample=sample, label_choice=label_choice,
        download=download, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
EMPIAR_BASE = 'https://ftp.ebi.ac.uk/empiar/world_availability/10791/data'
PARLAKGUL_PAPER_BASE = 'Parlakgul%20et%20al%20-%20Regulation%20of%20liver%20subcellular%20architecture%20controls%20metabolic%20homeostasis/FIB-SEM%20Raw%20and%20Segmentation%20Data'
PARLAKGUL_SAMPLES: Dict[str, dict]
    Per-sample metadata keyed by sample ID ('6461', '6464', '9430', '1857'). Each entry holds 'condition', 'raw_dir', 'seg_dir', 'raw_pattern', 'shape' and 'seg_zips'; the full value is listed in the module source above.
PARLAKGUL_CHUNK_SHAPE = (64, 256, 256)
LabelChoice = typing.Literal['er', 'er_sheets', 'er_tubules', 'mito', 'lipid_droplet', 'nuclear_membrane', 'plasma_membrane']
def get_parlakgul_liver_data( path: Union[os.PathLike, str], bounding_box: Tuple[int, int, int, int, int, int], sample: Literal['6461', '6464', '9430', '1857'] = '6461', label_choice: Literal['er', 'er_sheets', 'er_tubules', 'mito', 'lipid_droplet', 'nuclear_membrane', 'plasma_membrane'] = 'mito', download: bool = False) -> str:

Stream a subvolume from the Parlakgul liver dataset and cache it as a zarr v3 store.

Arguments:
  • path: Filepath to a folder where the cached zarr store will be saved.
  • bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates at 8 nm isotropic resolution.
  • sample: Which liver sample to use. One of "6461" (lean), "6464" (obese 1), "9430" (obese 2), "1857" (obese Climp63).
  • label_choice: Which organelle segmentation to use as labels.
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the cached zarr store.
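The cache file name combines the sample, the label choice and a short hash of the bounding box. The sketch below mirrors that naming scheme from the module source with hypothetical public helper names (`bbox_key`, `cache_path`), so the resulting path can be predicted without touching the network.

```python
import hashlib
import os


def bbox_key(bbox):
    """Short, deterministic key for a bounding box (first 12 hex digits of its MD5)."""
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def cache_path(root, sample, label_choice, bbox):
    """Path of the zarr store that would hold this sub-volume."""
    return os.path.join(root, f"{sample}_{label_choice}_{bbox_key(bbox)}.zarr")


bbox = (2000, 2512, 3000, 3256, 100, 164)
print(cache_path("data", "6461", "mito", bbox))
```

A consequence of this scheme is that changing any single coordinate produces a different store: overlapping bounding boxes are cached, and therefore streamed, independently.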

def get_parlakgul_liver_paths( path: Union[os.PathLike, str], bounding_boxes: List[Tuple[int, int, int, int, int, int]], sample: Literal['6461', '6464', '9430', '1857'] = '6461', label_choice: Literal['er', 'er_sheets', 'er_tubules', 'mito', 'lipid_droplet', 'nuclear_membrane', 'plasma_membrane'] = 'mito', download: bool = False) -> List[str]:

Get paths to Parlakgul liver zarr stores.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in voxel coordinates.
  • sample: Which liver sample to use.
  • label_choice: Which organelle segmentation to use as labels.
  • download: Whether to stream and cache the data if it is not present.
Returns:

List of filepaths to the cached zarr stores.
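Since each bounding box becomes its own cached store, a larger region can be split into several boxes before it is passed as `bounding_boxes`. The helper below is a hypothetical sketch (it is not part of torch_em) that tiles a region into non-overlapping boxes in the same (x_min, x_max, y_min, y_max, z_min, z_max) convention.

```python
def tile_bounding_boxes(region, tile_shape):
    """Split a region into non-overlapping boxes of at most tile_shape (x, y, z) voxels.

    Both the region and the returned boxes use
    (x_min, x_max, y_min, y_max, z_min, z_max) with exclusive upper bounds.
    """
    x_min, x_max, y_min, y_max, z_min, z_max = region
    tx, ty, tz = tile_shape
    boxes = []
    for z0 in range(z_min, z_max, tz):
        for y0 in range(y_min, y_max, ty):
            for x0 in range(x_min, x_max, tx):
                # Tiles at the far edge are clipped to the region.
                boxes.append((
                    x0, min(x0 + tx, x_max),
                    y0, min(y0 + ty, y_max),
                    z0, min(z0 + tz, z_max),
                ))
    return boxes


boxes = tile_bounding_boxes((0, 1024, 0, 512, 0, 128), tile_shape=(512, 512, 64))
print(len(boxes))  # 4
```

Smaller tiles mean that an interrupted download loses less work, at the cost of more stores on disk.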

def get_parlakgul_liver_dataset( path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], bounding_boxes: List[Tuple[int, int, int, int, int, int]], sample: Literal['6461', '6464', '9430', '1857'] = '6461', label_choice: Literal['er', 'er_sheets', 'er_tubules', 'mito', 'lipid_droplet', 'nuclear_membrane', 'plasma_membrane'] = 'mito', download: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the Parlakgul liver dataset for organelle segmentation.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
  • sample: Which liver sample to use. One of "6461", "6464", "9430", "1857".
  • label_choice: Which organelle to segment.
  • download: Whether to stream and cache data if not already present.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
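Note the differing axis orders: `patch_shape` is (z, y, x), while each bounding box is given in x, y, z order. A quick check along the lines below can catch a patch that is larger than a cached sub-volume. `patch_fits` is a hypothetical helper, and the sketch makes no claim about how torch_em itself handles oversized patches.

```python
def patch_fits(patch_shape, bbox):
    """Check that a (z, y, x) patch fits inside an
    (x_min, x_max, y_min, y_max, z_min, z_max) bounding box."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    # Reorder the box extents to (z, y, x) before comparing.
    box_shape = (z_max - z_min, y_max - y_min, x_max - x_min)
    return all(p <= b for p, b in zip(patch_shape, box_shape))


bbox = (2000, 2512, 3000, 3256, 100, 164)  # 512 x 256 x 64 voxels in x, y, z
print(patch_fits((32, 256, 256), bbox))    # True
print(patch_fits((32, 512, 256), bbox))    # False: the y extent is only 256
```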

def get_parlakgul_liver_loader( path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, bounding_boxes: List[Tuple[int, int, int, int, int, int]], sample: Literal['6461', '6464', '9430', '1857'] = '6461', label_choice: Literal['er', 'er_sheets', 'er_tubules', 'mito', 'lipid_droplet', 'nuclear_membrane', 'plasma_membrane'] = 'mito', download: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the DataLoader for organelle segmentation in the Parlakgul liver dataset.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
  • sample: Which liver sample to use. One of "6461", "6464", "9430", "1857".
  • label_choice: Which organelle to segment.
  • download: Whether to stream and cache data if not already present.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
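Putting the pieces together, a call might look like the sketch below. The import path follows the module name at the top of this page and the arguments follow the signature above. The wrapper name `make_mito_loader`, the cache directory and the bounding box are illustrative assumptions, and the chosen region is not known to contain mitochondria.

```python
def make_mito_loader(cache_dir="./parlakgul_cache"):
    """Build a loader for one small mitochondria sub-volume of the lean sample."""
    # Imported lazily so that defining this function does not require torch_em.
    from torch_em.data.datasets.electron_microscopy.parlakgul_liver import (
        get_parlakgul_liver_loader,
    )

    # Illustrative region only: (x_min, x_max, y_min, y_max, z_min, z_max).
    bounding_boxes = [(2000, 2512, 3000, 3512, 100, 164)]
    return get_parlakgul_liver_loader(
        path=cache_dir,
        patch_shape=(32, 256, 256),  # (z, y, x)
        batch_size=1,
        bounding_boxes=bounding_boxes,
        sample="6461",
        label_choice="mito",
        download=True,  # streams 64 z-slices from EMPIAR on first use
    )
```

The first call has to stream every slice in the z-range, so it can take a while; later calls with the same arguments reuse the cached zarr store.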