torch_em.data.datasets.light_microscopy.e11bio

The E11bio PRISM dataset contains multi-channel expansion microscopy images of mouse hippocampal CA3 tissue with dense neuron instance segmentation.

The data was generated using PRISM technology: viral barcoding combined with expansion microscopy and iterative immunolabeling. The tissue is physically expanded ~5× and imaged across 10 - 18 fluorescent channels (varying per crop) encoding combinatorial protein barcodes for single-neuron reconstruction.

Voxel resolution (after expansion): ~35 × 35 × 80 nm (x, y, z).

Pre-packaged training crops are available on S3 in two flavours:

  • 'instance': 14 crops with dense neuron instance segmentation labels.
  • 'semantic': 17 crops with semantic segmentation labels.

Each channel is stored as a separate (Z, Y, X) dataset under 'raw/ch_00', 'raw/ch_01', ... Labels are stored as (Z, Y, X) uint32. When the raw spatial dimensions exceed those of the labels, the raw volume is offset-aligned (center-cropped) to match.
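
The offset alignment described above can be sketched with numpy: the physical offset difference between labels and raw is converted to voxel indices using the resolution, and the raw volume is cropped to the labels' extent. A minimal sketch with made-up metadata values (the attribute names and (z, y, x) axis order follow the loader source further down; the concrete numbers are illustrative only):

```python
import numpy as np

# Hypothetical metadata, mimicking the zarr 'offset'/'resolution' attributes (nm, z/y/x order).
raw_offset = [0, 0, 0]
lbl_offset = [160, 70, 70]   # labels start further into the volume
resolution = [80, 35, 35]    # nm per voxel

raw = np.zeros((12, 64, 64, 64), dtype="uint8")   # (C, Z, Y, X)
labels = np.zeros((60, 60, 60), dtype="uint32")   # (Z, Y, X)

# Convert the physical offset difference to voxel indices, then crop raw to the label extent.
z0, y0, x0 = (round((lbl - r) / res) for lbl, r, res in zip(lbl_offset, raw_offset, resolution))
lz, ly, lx = labels.shape
raw_aligned = raw[:, z0:z0 + lz, y0:y0 + ly, x0:x0 + lx]

assert (z0, y0, x0) == (2, 2, 2)
assert raw_aligned.shape[1:] == labels.shape
```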

Channel counts per crop:
  • crops 0 - 4: 18 channels
  • crops 5 - 11: 12 channels
  • crop 12: 10 channels
  • crop 13: 11 channels

When mixing crops from different groups, choose a channel index that exists in every selected crop (0 - 9 is valid for all crops).
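
The per-crop channel counts listed above can be expressed as a small lookup table. A hypothetical helper (not part of the module) for checking that a channel index is available in every selected crop:

```python
# Channel count per crop index, as listed above for the 14 instance-split crops.
NUM_CHANNELS = {**{i: 18 for i in range(5)}, **{i: 12 for i in range(5, 12)}, 12: 10, 13: 11}

def channel_available(channel, crop_ids):
    """Return True if `channel` exists in every selected crop."""
    return all(channel < NUM_CHANNELS[c] for c in crop_ids)

assert channel_available(9, range(14))       # channels 0-9 exist in all 14 crops
assert not channel_available(11, range(14))  # crop 12 has only 10 channels
```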

The data is hosted at s3://e11bio-prism (anonymous access, no credentials required). It is described in the E11bio open-data repository (https://github.com/e11bio/e11-open-data) and accompanies the publication https://www.biorxiv.org/content/10.1101/2025.09.26.678648v1. Please cite this resource if you use the dataset in your research.

NOTE: accessing this dataset requires the s3fs package (pip install s3fs).

  1"""The E11bio PRISM dataset contains multi-channel expansion microscopy images of mouse
  2hippocampal CA3 tissue with dense neuron instance segmentation.
  3
  4The data was generated using PRISM technology: viral barcoding combined with expansion
  5microscopy and iterative immunolabeling. The tissue is physically expanded ~5× and imaged
  6across 10 - 18 fluorescent channels (varying per crop) encoding combinatorial protein barcodes
  7for single-neuron reconstruction.
  8
  9Voxel resolution (after expansion): ~35 x 35 x 80 nm (xy / z).
 10
 11Pre-packaged training crops are available on S3 in two flavours:
 12  - 'instance': 14 crops with dense neuron instance segmentation labels.
 13  - 'semantic': 17 crops with semantic segmentation labels.
 14
 15Each channel is stored as a separate (Z, Y, X) dataset under 'raw/ch_00', 'raw/ch_01', ...
 16Labels are stored as (Z, Y, X) uint32. When raw spatial dimensions exceed labels, the raw
 17is offset-aligned (center-crop) to match.
 18
 19Channel counts per crop:
 20  - crops 0 - 4: 18 channels
 21  - crops 5 - 11: 12 channels
 22  - crop 12: 10 channels
 23  - crop 13: 11 channels
 24Specify a consistent channel when mixing crops from different groups.
 25
 26The data is hosted at s3://e11bio-prism (anonymous access, no credentials required).
 27The dataset is described in the E11bio open-data repository: https://github.com/e11bio/e11-open-data
 28The dataset is from the publication https://www.biorxiv.org/content/10.1101/2025.09.26.678648v1.
 29Please cite this resource if you use the dataset in your research.
 30
 31NOTE: accessing this dataset requires the `s3fs` package (pip install s3fs).
 32"""
 33
 34import os
 35from typing import List, Literal, Optional, Tuple, Union
 36
 37from torch.utils.data import DataLoader, Dataset
 38
 39import torch_em
 40
 41from .. import util
 42
 43
 44S3_BASE = "e11bio-prism/ls/models/training_data"
 45
 46SPLIT_NUM_CROPS = {
 47    "instance": 14,
 48    "semantic": 17,
 49}
 50
 51
 52def _get_store(split, crop_id):
 53    import s3fs
 54    fs = s3fs.S3FileSystem(anon=True)
 55    return s3fs.S3Map(f"{S3_BASE}/{split}/crop_{crop_id}.zarr", s3=fs)
 56
 57
 58def get_e11bio_data(
 59    path: Union[os.PathLike, str],
 60    split: Literal["instance", "semantic"] = "instance",
 61    crop_ids: Optional[List[int]] = None,
 62    download: bool = False,
 63) -> List[str]:
 64    """Download and cache E11bio PRISM training crops as HDF5 files.
 65
 66    Each HDF5 file contains:
 67      - raw/ch_00, raw/ch_01, ...: one (Z, Y, X) uint8 dataset per channel.
 68      - labels: (Z, Y, X) uint32 instance or semantic segmentation.
 69
 70    Args:
 71        path: Filepath to a folder where the cached HDF5 files will be saved.
 72        split: Which training split to use. Either 'instance' (14 crops, neuron instance
 73            segmentation) or 'semantic' (17 crops, semantic segmentation).
 74        crop_ids: Which crop indices to use. Defaults to all crops for the given split.
 75        download: Whether to download the data if not already present.
 76
 77    Returns:
 78        List of filepaths to the cached HDF5 files.
 79    """
 80    import h5py
 81    import zarr
 82    from skimage.segmentation import relabel_sequential
 83
 84    if split not in SPLIT_NUM_CROPS:
 85        raise ValueError(f"split must be one of {list(SPLIT_NUM_CROPS)}, got {split!r}")
 86
 87    if crop_ids is None:
 88        crop_ids = list(range(SPLIT_NUM_CROPS[split]))
 89
 90    split_dir = os.path.join(path, split)
 91    os.makedirs(split_dir, exist_ok=True)
 92
 93    h5_paths = []
 94    for crop_id in crop_ids:
 95        h5_path = os.path.join(split_dir, f"crop_{crop_id}.h5")
 96        h5_paths.append(h5_path)
 97
 98        if os.path.exists(h5_path):
 99            continue
100
101        if not download:
102            raise RuntimeError(
103                f"No cached data found at '{h5_path}'. Set download=True to stream it from S3."
104            )
105
106        try:
107            import s3fs  # noqa
108        except ImportError:
109            raise ImportError(
110                "The 's3fs' package is required to access the E11bio dataset. "
111                "Install it with: 'pip install s3fs'."
112            )
113
114        print(f"Streaming E11bio PRISM {split} crop_{crop_id} from S3 ...")
115        store = _get_store(split, crop_id)
116        f = zarr.open(store, mode="r")
117
118        raw_arr = f["raw"][:]  # (C, Z, Y, X)
119        labels_arr = f["labels"][:]  # (Z, Y, X)
120
121        # Align raw spatially to labels using the stored offsets.
122        raw_offset = f["raw"].attrs.get("offset", [0, 0, 0])
123        lbl_offset = f["labels"].attrs.get("offset", [0, 0, 0])
124        resolution = f["raw"].attrs.get("resolution", [1, 1, 1])
125
126        z0 = round((lbl_offset[0] - raw_offset[0]) / resolution[0])
127        y0 = round((lbl_offset[1] - raw_offset[1]) / resolution[1])
128        x0 = round((lbl_offset[2] - raw_offset[2]) / resolution[2])
129
130        lz, ly, lx = labels_arr.shape
131        raw_arr = raw_arr[:, z0:z0 + lz, y0:y0 + ly, x0:x0 + lx]
132
133        # Relabel to consecutive integers.
134        labels_arr, _, _ = relabel_sequential(labels_arr)
135
136        with h5py.File(h5_path, "w", locking=False) as out:
137            out.attrs["crop_id"] = crop_id
138            out.attrs["split"] = split
139            out.attrs["num_channels"] = raw_arr.shape[0]
140            raw_grp = out.create_group("raw")
141            for ch_idx, ch_data in enumerate(raw_arr):
142                raw_grp.create_dataset(
143                    f"ch_{ch_idx:02d}", data=ch_data.astype("uint8"), compression="gzip", chunks=True
144                )
145            out.create_dataset("labels", data=labels_arr.astype("uint32"), compression="gzip", chunks=True)
146
147        print(f"Cached to {h5_path}  ({raw_arr.shape[0]} channels, spatial {labels_arr.shape})")
148
149    return h5_paths
150
151
152def get_e11bio_paths(
153    path: Union[os.PathLike, str],
154    split: Literal["instance", "semantic"] = "instance",
155    crop_ids: Optional[List[int]] = None,
156    download: bool = False,
157) -> List[str]:
158    """Get paths to the E11bio PRISM HDF5 cache files.
159
160    Args:
161        path: Filepath to a folder where the cached HDF5 files will be saved.
162        split: Which training split to use. Either 'instance' or 'semantic'.
163        crop_ids: Which crop indices to use. Defaults to all crops for the given split.
164        download: Whether to download the data if not already present.
165
166    Returns:
167        List of filepaths to the cached HDF5 files.
168    """
169    return get_e11bio_data(path, split, crop_ids, download)
170
171
172def get_e11bio_dataset(
173    path: Union[os.PathLike, str],
174    patch_shape: Tuple[int, int, int],
175    split: Literal["instance", "semantic"] = "instance",
176    crop_ids: Optional[List[int]] = None,
177    channel: int = 0,
178    download: bool = False,
179    offsets: Optional[List[List[int]]] = None,
180    boundaries: bool = False,
181    **kwargs,
182) -> Dataset:
183    """Get the E11bio PRISM dataset for neuron instance or semantic segmentation.
184
185    Args:
186        path: Filepath to a folder where the cached HDF5 files will be saved.
187        patch_shape: The patch shape (z, y, x) to use for training.
188        split: Which training split to use. Either 'instance' (14 crops) or
189            'semantic' (17 crops).
190        crop_ids: Which crop indices to use. Defaults to all crops for the given split.
191        channel: Which fluorescence channel to use as raw input (default 0).
192            Channel counts vary per crop (10 - 18); use a channel index present in all
193            selected crops (0 - 9 is safe for all crops).
194        download: Whether to download the data if not already present.
195        offsets: Offset values for affinity computation used as target.
196        boundaries: Whether to compute boundaries as the target.
197        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.
198
199    Returns:
200        The segmentation dataset.
201    """
202    assert len(patch_shape) == 3
203
204    paths = get_e11bio_paths(path, split, crop_ids, download)
205
206    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
207    kwargs, _ = util.add_instance_label_transform(
208        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
209    )
210
211    return torch_em.default_segmentation_dataset(
212        raw_paths=paths,
213        raw_key=f"raw/ch_{channel:02d}",
214        label_paths=paths,
215        label_key="labels",
216        patch_shape=patch_shape,
217        ndim=3,
218        **kwargs,
219    )
220
221
222def get_e11bio_loader(
223    path: Union[os.PathLike, str],
224    patch_shape: Tuple[int, int, int],
225    batch_size: int,
226    split: Literal["instance", "semantic"] = "instance",
227    crop_ids: Optional[List[int]] = None,
228    channel: int = 0,
229    download: bool = False,
230    offsets: Optional[List[List[int]]] = None,
231    boundaries: bool = False,
232    **kwargs,
233) -> DataLoader:
234    """Get the DataLoader for neuron instance or semantic segmentation in the E11bio PRISM dataset.
235
236    Args:
237        path: Filepath to a folder where the cached HDF5 files will be saved.
238        patch_shape: The patch shape (z, y, x) to use for training.
239        batch_size: The batch size for training.
240        split: Which training split to use. Either 'instance' (14 crops) or
241            'semantic' (17 crops).
242        crop_ids: Which crop indices to use. Defaults to all crops for the given split.
243        channel: Which fluorescence channel to use as raw input (default 0).
244            Channel counts vary per crop (10 - 18); use a channel index present in all
245            selected crops (0 - 9 is safe for all crops).
246        download: Whether to download the data if not already present.
247        offsets: Offset values for affinity computation used as target.
248        boundaries: Whether to compute boundaries as the target.
249        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
250            or for the PyTorch DataLoader.
251
252    Returns:
253        The DataLoader.
254    """
255    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
256    dataset = get_e11bio_dataset(
257        path, patch_shape, split, crop_ids, channel, download, offsets, boundaries, **ds_kwargs
258    )
259    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
S3_BASE = 'e11bio-prism/ls/models/training_data'
SPLIT_NUM_CROPS = {'instance': 14, 'semantic': 17}
def get_e11bio_data(path: Union[os.PathLike, str], split: Literal['instance', 'semantic'] = 'instance', crop_ids: Optional[List[int]] = None, download: bool = False) -> List[str]:

Download and cache E11bio PRISM training crops as HDF5 files.

Each HDF5 file contains:

  • raw/ch_00, raw/ch_01, ...: one (Z, Y, X) uint8 dataset per channel.
  • labels: (Z, Y, X) uint32 instance or semantic segmentation.
Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • split: Which training split to use. Either 'instance' (14 crops, neuron instance segmentation) or 'semantic' (17 crops, semantic segmentation).
  • crop_ids: Which crop indices to use. Defaults to all crops for the given split.
  • download: Whether to download the data if not already present.
Returns:

List of filepaths to the cached HDF5 files.
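
During caching, instance IDs are relabeled to consecutive integers with skimage's relabel_sequential. Its effect on a label volume can be sketched with numpy alone (an illustrative equivalent for arrays containing the background label 0, not the library call itself):

```python
import numpy as np

labels = np.array([[0, 7, 7], [42, 0, 3]], dtype="uint32")

# Map the sorted unique IDs (0, 3, 7, 42) onto consecutive integers (0, 1, 2, 3).
unique, inverse = np.unique(labels, return_inverse=True)
relabeled = inverse.reshape(labels.shape).astype("uint32")

assert relabeled.tolist() == [[0, 2, 2], [3, 0, 1]]
```

With background 0 present this matches relabel_sequential, which keeps 0 fixed and assigns 1..n to the remaining IDs in sorted order.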

def get_e11bio_paths(path: Union[os.PathLike, str], split: Literal['instance', 'semantic'] = 'instance', crop_ids: Optional[List[int]] = None, download: bool = False) -> List[str]:

Get paths to the E11bio PRISM HDF5 cache files.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • split: Which training split to use. Either 'instance' or 'semantic'.
  • crop_ids: Which crop indices to use. Defaults to all crops for the given split.
  • download: Whether to download the data if not already present.
Returns:

List of filepaths to the cached HDF5 files.

def get_e11bio_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], split: Literal['instance', 'semantic'] = 'instance', crop_ids: Optional[List[int]] = None, channel: int = 0, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the E11bio PRISM dataset for neuron instance or semantic segmentation.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • split: Which training split to use. Either 'instance' (14 crops) or 'semantic' (17 crops).
  • crop_ids: Which crop indices to use. Defaults to all crops for the given split.
  • channel: Which fluorescence channel to use as raw input (default 0). Channel counts vary per crop (10 - 18); use a channel index present in all selected crops (0 - 9 is safe for all crops).
  • download: Whether to download the data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
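
The channel argument selects which HDF5 dataset is used as raw input; the internal key uses zero-padded two-digit formatting, mirroring the f-string in the source listing above:

```python
channel = 3
raw_key = f"raw/ch_{channel:02d}"  # zero-padded two-digit channel index
assert raw_key == "raw/ch_03"
```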

def get_e11bio_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, split: Literal['instance', 'semantic'] = 'instance', crop_ids: Optional[List[int]] = None, channel: int = 0, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the DataLoader for neuron instance or semantic segmentation in the E11bio PRISM dataset.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • split: Which training split to use. Either 'instance' (14 crops) or 'semantic' (17 crops).
  • crop_ids: Which crop indices to use. Defaults to all crops for the given split.
  • channel: Which fluorescence channel to use as raw input (default 0). Channel counts vary per crop (10 - 18); use a channel index present in all selected crops (0 - 9 is safe for all crops).
  • download: Whether to download the data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
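
get_e11bio_loader takes a single mixed **kwargs and routes each argument either to the dataset constructor or to the PyTorch DataLoader. The routing idea can be sketched with inspect (a hypothetical stand-in for the internal util.split_kwargs, not its actual implementation):

```python
import inspect

def split_kwargs(func, **kwargs):
    # Keyword arguments accepted by `func` go to the dataset; the rest to the DataLoader.
    accepted = set(inspect.signature(func).parameters)
    ds_kwargs = {k: v for k, v in kwargs.items() if k in accepted}
    loader_kwargs = {k: v for k, v in kwargs.items() if k not in accepted}
    return ds_kwargs, loader_kwargs

def dataset_factory(raw_paths, label_paths, patch_shape, sampler=None):
    ...  # stand-in for torch_em.default_segmentation_dataset

ds_kwargs, loader_kwargs = split_kwargs(dataset_factory, sampler="min-fg", num_workers=4, shuffle=True)
assert ds_kwargs == {"sampler": "min-fg"}
assert loader_kwargs == {"num_workers": 4, "shuffle": True}
```

So, for example, num_workers and shuffle reach the DataLoader, while arguments understood by the dataset constructor stay with it.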