torch_em.data.datasets.light_microscopy.e11bio
The E11bio PRISM dataset contains multi-channel expansion microscopy images of mouse hippocampal CA3 tissue with dense neuron instance segmentation.
The data was generated using PRISM technology: viral barcoding combined with expansion microscopy and iterative immunolabeling. The tissue is physically expanded ~5× and imaged across 10 - 18 fluorescent channels (varying per crop) encoding combinatorial protein barcodes for single-neuron reconstruction.
Voxel resolution (after expansion): ~35 x 35 x 80 nm (xy / z).
Pre-packaged training crops are available on S3 in two flavours:
- 'instance': 14 crops with dense neuron instance segmentation labels.
- 'semantic': 17 crops with semantic segmentation labels.
Each channel is stored as a separate (Z, Y, X) dataset under 'raw/ch_00', 'raw/ch_01', ... Labels are stored as (Z, Y, X) uint32. When raw spatial dimensions exceed labels, the raw is offset-aligned (center-crop) to match.
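The offset arithmetic behind this center-crop alignment is simple: the per-axis start index is the difference between the label and raw offsets (in world units) divided by the voxel size along that axis. A minimal stdlib sketch (`align_start` is an illustrative helper name, not part of the loader API):

```python
def align_start(lbl_offset, raw_offset, resolution):
    """Per-axis (z, y, x) start indices for center-cropping the raw
    volume to the label volume, from world-space offsets and voxel size."""
    return tuple(
        round((lo - ro) / res)
        for lo, ro, res in zip(lbl_offset, raw_offset, resolution)
    )

# Labels start 400 nm deeper in z (80 nm voxels) and 700 nm in y/x (35 nm voxels),
# so 5 raw z-slices and 20 raw rows/columns are skipped:
print(align_start([400, 700, 700], [0, 0, 0], [80, 35, 35]))  # (5, 20, 20)
```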
Channel counts per crop:
- crops 0 - 4: 18 channels
- crops 5 - 11: 12 channels
- crop 12: 10 channels
- crop 13: 11 channels
Specify a consistent channel when mixing crops from different groups.
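A channel index is valid for a mixed selection of crops only if it is below the smallest channel count among them. A small illustration (the `CHANNELS_PER_CROP` table and `common_channels` helper are hypothetical names that mirror the counts listed above):

```python
# Channel counts per crop, as listed above (hypothetical lookup table).
CHANNELS_PER_CROP = {**{c: 18 for c in range(0, 5)},
                     **{c: 12 for c in range(5, 12)},
                     12: 10, 13: 11}

def common_channels(crop_ids):
    """Channel indices available in every selected crop."""
    return list(range(min(CHANNELS_PER_CROP[c] for c in crop_ids)))

print(common_channels([3, 12]))    # crop 12 has only 10 channels -> channels 0-9
print(common_channels(range(14)))  # across all crops, channels 0-9 are always safe
```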
The data is hosted at s3://e11bio-prism (anonymous access, no credentials required).
The dataset is described in the E11bio open-data repository: https://github.com/e11bio/e11-open-data
The dataset is from the publication https://www.biorxiv.org/content/10.1101/2025.09.26.678648v1.
Please cite this resource if you use the dataset in your research.
NOTE: accessing this dataset requires the s3fs package (pip install s3fs).
1"""The E11bio PRISM dataset contains multi-channel expansion microscopy images of mouse 2hippocampal CA3 tissue with dense neuron instance segmentation. 3 4The data was generated using PRISM technology: viral barcoding combined with expansion 5microscopy and iterative immunolabeling. The tissue is physically expanded ~5× and imaged 6across 10 - 18 fluorescent channels (varying per crop) encoding combinatorial protein barcodes 7for single-neuron reconstruction. 8 9Voxel resolution (after expansion): ~35 x 35 x 80 nm (xy / z). 10 11Pre-packaged training crops are available on S3 in two flavours: 12 - 'instance': 14 crops with dense neuron instance segmentation labels. 13 - 'semantic': 17 crops with semantic segmentation labels. 14 15Each channel is stored as a separate (Z, Y, X) dataset under 'raw/ch_00', 'raw/ch_01', ... 16Labels are stored as (Z, Y, X) uint32. When raw spatial dimensions exceed labels, the raw 17is offset-aligned (center-crop) to match. 18 19Channel counts per crop: 20 - crops 0 - 4: 18 channels 21 - crops 5 - 11: 12 channels 22 - crop 12: 10 channels 23 - crop 13: 11 channels 24Specify a consistent channel when mixing crops from different groups. 25 26The data is hosted at s3://e11bio-prism (anonymous access, no credentials required). 27The dataset is described in the E11bio open-data repository: https://github.com/e11bio/e11-open-data 28The dataset is from the publication https://www.biorxiv.org/content/10.1101/2025.09.26.678648v1. 29Please cite this resource if you use the dataset in your research. 30 31NOTE: accessing this dataset requires the `s3fs` package (pip install s3fs). 32""" 33 34import os 35from typing import List, Literal, Optional, Tuple, Union 36 37from torch.utils.data import DataLoader, Dataset 38 39import torch_em 40 41from .. 
import util 42 43 44S3_BASE = "e11bio-prism/ls/models/training_data" 45 46SPLIT_NUM_CROPS = { 47 "instance": 14, 48 "semantic": 17, 49} 50 51 52def _get_store(split, crop_id): 53 import s3fs 54 fs = s3fs.S3FileSystem(anon=True) 55 return s3fs.S3Map(f"{S3_BASE}/{split}/crop_{crop_id}.zarr", s3=fs) 56 57 58def get_e11bio_data( 59 path: Union[os.PathLike, str], 60 split: Literal["instance", "semantic"] = "instance", 61 crop_ids: Optional[List[int]] = None, 62 download: bool = False, 63) -> List[str]: 64 """Download and cache E11bio PRISM training crops as HDF5 files. 65 66 Each HDF5 file contains: 67 - raw/ch_00, raw/ch_01, ...: one (Z, Y, X) uint8 dataset per channel. 68 - labels: (Z, Y, X) uint32 instance or semantic segmentation. 69 70 Args: 71 path: Filepath to a folder where the cached HDF5 files will be saved. 72 split: Which training split to use. Either 'instance' (14 crops, neuron instance 73 segmentation) or 'semantic' (17 crops, semantic segmentation). 74 crop_ids: Which crop indices to use. Defaults to all crops for the given split. 75 download: Whether to download the data if not already present. 76 77 Returns: 78 List of filepaths to the cached HDF5 files. 79 """ 80 import h5py 81 import zarr 82 from skimage.segmentation import relabel_sequential 83 84 if split not in SPLIT_NUM_CROPS: 85 raise ValueError(f"split must be one of {list(SPLIT_NUM_CROPS)}, got {split!r}") 86 87 if crop_ids is None: 88 crop_ids = list(range(SPLIT_NUM_CROPS[split])) 89 90 split_dir = os.path.join(path, split) 91 os.makedirs(split_dir, exist_ok=True) 92 93 h5_paths = [] 94 for crop_id in crop_ids: 95 h5_path = os.path.join(split_dir, f"crop_{crop_id}.h5") 96 h5_paths.append(h5_path) 97 98 if os.path.exists(h5_path): 99 continue 100 101 if not download: 102 raise RuntimeError( 103 f"No cached data found at '{h5_path}'. Set download=True to stream it from S3." 
104 ) 105 106 try: 107 import s3fs # noqa 108 except ImportError: 109 raise ImportError( 110 "The 's3fs' package is required to access the E11bio dataset. " 111 "Install it with: 'pip install s3fs'." 112 ) 113 114 print(f"Streaming E11bio PRISM {split} crop_{crop_id} from S3 ...") 115 store = _get_store(split, crop_id) 116 f = zarr.open(store, mode="r") 117 118 raw_arr = f["raw"][:] # (C, Z, Y, X) 119 labels_arr = f["labels"][:] # (Z, Y, X) 120 121 # Align raw spatially to labels using the stored offsets. 122 raw_offset = f["raw"].attrs.get("offset", [0, 0, 0]) 123 lbl_offset = f["labels"].attrs.get("offset", [0, 0, 0]) 124 resolution = f["raw"].attrs.get("resolution", [1, 1, 1]) 125 126 z0 = round((lbl_offset[0] - raw_offset[0]) / resolution[0]) 127 y0 = round((lbl_offset[1] - raw_offset[1]) / resolution[1]) 128 x0 = round((lbl_offset[2] - raw_offset[2]) / resolution[2]) 129 130 lz, ly, lx = labels_arr.shape 131 raw_arr = raw_arr[:, z0:z0 + lz, y0:y0 + ly, x0:x0 + lx] 132 133 # Relabel to consecutive integers. 134 labels_arr, _, _ = relabel_sequential(labels_arr) 135 136 with h5py.File(h5_path, "w", locking=False) as out: 137 out.attrs["crop_id"] = crop_id 138 out.attrs["split"] = split 139 out.attrs["num_channels"] = raw_arr.shape[0] 140 raw_grp = out.create_group("raw") 141 for ch_idx, ch_data in enumerate(raw_arr): 142 raw_grp.create_dataset( 143 f"ch_{ch_idx:02d}", data=ch_data.astype("uint8"), compression="gzip", chunks=True 144 ) 145 out.create_dataset("labels", data=labels_arr.astype("uint32"), compression="gzip", chunks=True) 146 147 print(f"Cached to {h5_path} ({raw_arr.shape[0]} channels, spatial {labels_arr.shape})") 148 149 return h5_paths 150 151 152def get_e11bio_paths( 153 path: Union[os.PathLike, str], 154 split: Literal["instance", "semantic"] = "instance", 155 crop_ids: Optional[List[int]] = None, 156 download: bool = False, 157) -> List[str]: 158 """Get paths to the E11bio PRISM HDF5 cache files. 
159 160 Args: 161 path: Filepath to a folder where the cached HDF5 files will be saved. 162 split: Which training split to use. Either 'instance' or 'semantic'. 163 crop_ids: Which crop indices to use. Defaults to all crops for the given split. 164 download: Whether to download the data if not already present. 165 166 Returns: 167 List of filepaths to the cached HDF5 files. 168 """ 169 return get_e11bio_data(path, split, crop_ids, download) 170 171 172def get_e11bio_dataset( 173 path: Union[os.PathLike, str], 174 patch_shape: Tuple[int, int, int], 175 split: Literal["instance", "semantic"] = "instance", 176 crop_ids: Optional[List[int]] = None, 177 channel: int = 0, 178 download: bool = False, 179 offsets: Optional[List[List[int]]] = None, 180 boundaries: bool = False, 181 **kwargs, 182) -> Dataset: 183 """Get the E11bio PRISM dataset for neuron instance or semantic segmentation. 184 185 Args: 186 path: Filepath to a folder where the cached HDF5 files will be saved. 187 patch_shape: The patch shape (z, y, x) to use for training. 188 split: Which training split to use. Either 'instance' (14 crops) or 189 'semantic' (17 crops). 190 crop_ids: Which crop indices to use. Defaults to all crops for the given split. 191 channel: Which fluorescence channel to use as raw input (default 0). 192 Channel counts vary per crop (10 - 18); use a channel index present in all 193 selected crops (0 - 9 is safe for all crops). 194 download: Whether to download the data if not already present. 195 offsets: Offset values for affinity computation used as target. 196 boundaries: Whether to compute boundaries as the target. 197 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 198 199 Returns: 200 The segmentation dataset. 
201 """ 202 assert len(patch_shape) == 3 203 204 paths = get_e11bio_paths(path, split, crop_ids, download) 205 206 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 207 kwargs, _ = util.add_instance_label_transform( 208 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 209 ) 210 211 return torch_em.default_segmentation_dataset( 212 raw_paths=paths, 213 raw_key=f"raw/ch_{channel:02d}", 214 label_paths=paths, 215 label_key="labels", 216 patch_shape=patch_shape, 217 ndim=3, 218 **kwargs, 219 ) 220 221 222def get_e11bio_loader( 223 path: Union[os.PathLike, str], 224 patch_shape: Tuple[int, int, int], 225 batch_size: int, 226 split: Literal["instance", "semantic"] = "instance", 227 crop_ids: Optional[List[int]] = None, 228 channel: int = 0, 229 download: bool = False, 230 offsets: Optional[List[List[int]]] = None, 231 boundaries: bool = False, 232 **kwargs, 233) -> DataLoader: 234 """Get the DataLoader for neuron instance or semantic segmentation in the E11bio PRISM dataset. 235 236 Args: 237 path: Filepath to a folder where the cached HDF5 files will be saved. 238 patch_shape: The patch shape (z, y, x) to use for training. 239 batch_size: The batch size for training. 240 split: Which training split to use. Either 'instance' (14 crops) or 241 'semantic' (17 crops). 242 crop_ids: Which crop indices to use. Defaults to all crops for the given split. 243 channel: Which fluorescence channel to use as raw input (default 0). 244 Channel counts vary per crop (10 - 18); use a channel index present in all 245 selected crops (0 - 9 is safe for all crops). 246 download: Whether to download the data if not already present. 247 offsets: Offset values for affinity computation used as target. 248 boundaries: Whether to compute boundaries as the target. 249 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` 250 or for the PyTorch DataLoader. 251 252 Returns: 253 The DataLoader. 
254 """ 255 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 256 dataset = get_e11bio_dataset( 257 path, patch_shape, split, crop_ids, channel, download, offsets, boundaries, **ds_kwargs 258 ) 259 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
def get_e11bio_data(path: Union[os.PathLike, str], split: Literal["instance", "semantic"] = "instance", crop_ids: Optional[List[int]] = None, download: bool = False) -> List[str]
Download and cache E11bio PRISM training crops as HDF5 files.
Each HDF5 file contains:
- raw/ch_00, raw/ch_01, ...: one (Z, Y, X) uint8 dataset per channel.
- labels: (Z, Y, X) uint32 instance or semantic segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- split: Which training split to use. Either 'instance' (14 crops, neuron instance segmentation) or 'semantic' (17 crops, semantic segmentation).
- crop_ids: Which crop indices to use. Defaults to all crops for the given split.
- download: Whether to download the data if not already present.
Returns:
List of filepaths to the cached HDF5 files.
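The cache layout implied above is one HDF5 file per crop under <path>/<split>/. A sketch of the path construction (`cache_path` is an illustrative helper name, not part of the public API):

```python
import os

def cache_path(root, split, crop_id):
    # Mirrors the cache layout: <root>/<split>/crop_<id>.h5
    return os.path.join(root, split, f"crop_{crop_id}.h5")

print(cache_path("/data/e11bio", "instance", 3))
```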
def get_e11bio_paths(path: Union[os.PathLike, str], split: Literal["instance", "semantic"] = "instance", crop_ids: Optional[List[int]] = None, download: bool = False) -> List[str]
Get paths to the E11bio PRISM HDF5 cache files.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- split: Which training split to use. Either 'instance' or 'semantic'.
- crop_ids: Which crop indices to use. Defaults to all crops for the given split.
- download: Whether to download the data if not already present.
Returns:
List of filepaths to the cached HDF5 files.
def get_e11bio_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], split: Literal["instance", "semantic"] = "instance", crop_ids: Optional[List[int]] = None, channel: int = 0, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> Dataset
Get the E11bio PRISM dataset for neuron instance or semantic segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- split: Which training split to use. Either 'instance' (14 crops) or 'semantic' (17 crops).
- crop_ids: Which crop indices to use. Defaults to all crops for the given split.
- channel: Which fluorescence channel to use as raw input (default 0). Channel counts vary per crop (10 - 18); use a channel index present in all selected crops (0 - 9 is safe for all crops).
- download: Whether to download the data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
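The channel index is translated into the HDF5 dataset key with two-digit zero padding, matching the 'raw/ch_00' layout described in the module docstring. A one-line sketch (`raw_key` is an illustrative name):

```python
def raw_key(channel):
    # Zero-padded HDF5 key for one fluorescence channel, e.g. 0 -> "raw/ch_00"
    return f"raw/ch_{channel:02d}"

print(raw_key(0), raw_key(15))  # raw/ch_00 raw/ch_15
```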
def get_e11bio_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, split: Literal["instance", "semantic"] = "instance", crop_ids: Optional[List[int]] = None, channel: int = 0, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> DataLoader
Get the DataLoader for neuron instance or semantic segmentation in the E11bio PRISM dataset.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
- split: Which training split to use. Either 'instance' (14 crops) or 'semantic' (17 crops).
- crop_ids: Which crop indices to use. Defaults to all crops for the given split.
- channel: Which fluorescence channel to use as raw input (default 0). Channel counts vary per crop (10 - 18); use a channel index present in all selected crops (0 - 9 is safe for all crops).
- download: Whether to download the data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
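The loader routes each extra keyword either to the dataset constructor or to the PyTorch DataLoader. This can be approximated with a signature check; the sketch below is an illustrative stand-in (with a hypothetical `make_dataset` constructor), not the actual torch_em.util.split_kwargs implementation:

```python
import inspect

def split_kwargs_sketch(fn, **kwargs):
    """Keywords named in fn's signature go to the dataset; the rest
    (e.g. num_workers, shuffle) fall through to the DataLoader."""
    names = set(inspect.signature(fn).parameters)
    ds = {k: v for k, v in kwargs.items() if k in names}
    rest = {k: v for k, v in kwargs.items() if k not in names}
    return ds, rest

# Hypothetical constructor standing in for torch_em.default_segmentation_dataset:
def make_dataset(raw_paths, patch_shape, label_transform=None, **_):
    pass

ds_kwargs, loader_kwargs = split_kwargs_sketch(
    make_dataset, label_transform=None, num_workers=4, shuffle=True
)
print(sorted(ds_kwargs), sorted(loader_kwargs))  # ['label_transform'] ['num_workers', 'shuffle']
```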