torch_em.data.datasets.electron_microscopy.hemibrain
The Hemibrain dataset contains a FIB-SEM volume of the Drosophila central brain with dense neuron instance segmentation and additional volumetric annotations.
It covers approximately half of the central brain at 8 nm isotropic resolution, with ~25,000 neurons reconstructed and proofread across regions including the mushroom body, central complex, lateral horn, and portions of the optic lobe.
Three label types are available via the label_choice parameter:
"neurons": dense neuron instance segmentation (uint64, 8 nm). Nearly fully dense (~99% of voxels labeled), each neuron/glial cell gets a unique ID. Background (0) is only extracellular space.
"mito": mitochondria detection (uint64, 16 nm, upsampled to 8 nm for storage). Four classes - in practice ID=1 ("special") is the mitochondria class (~10% of voxels), ID=4 ("unlabeled") is background (~88%), ID=3 ("background") ~2%, ID=2 ("interior") ~0.1%. For binary training use ID=1 as foreground.
"tissue": coarse tissue type semantic map (uint64, 16 nm, upsampled to 8 nm). Seven classes covering the full volume: 1=broken white tissue (<1%), 2=trachea (<1%), 3=cell bodies (~2%), 4=dark matter (~1%), 5=large dendrites (~14%), 6=neuropil (~83%), 7=out of bounds. Useful as a region prior or multi-class semantic target.
Bounding boxes are always specified in 8 nm voxel coordinates. For mito and tissue (native 16 nm), coordinates are halved during download and labels are upsampled back to 8 nm using nearest-neighbour rescaling (skimage.transform.rescale, order=0).
This dataset is from the publication https://doi.org/10.7554/eLife.57443. Please cite it if you use this dataset in your research.
The dataset is publicly available at https://www.janelia.org/project-team/flyem/hemibrain. Requires cloud-volume: pip install cloud-volume.
NOTE (on data size): the full EM volume is (34427, 39725, 41394) voxels at 8 nm isotropic resolution. Downloading the entire volume is not feasible. Data is instead accessed by specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates, streamed from GCS and cached locally as HDF5 files.
1"""The Hemibrain dataset contains a FIB-SEM volume of the Drosophila central brain 2with dense neuron instance segmentation and additional volumetric annotations. 3 4It covers approximately half of the central brain at 8 nm isotropic resolution, 5with ~25,000 neurons reconstructed and proofread across regions including the 6mushroom body, central complex, lateral horn, and portions of the optic lobe. 7 8Three label types are available via the `label_choice` parameter: 9 10- "neurons": dense neuron instance segmentation (uint64, 8 nm). Nearly fully dense 11 (~99% of voxels labeled), each neuron/glial cell gets a unique ID. Background (0) 12 is only extracellular space. 13 14- "mito": mitochondria detection (uint64, 16 nm, upsampled to 8 nm for storage). 15 Four classes - in practice ID=1 ("special") is the mitochondria class (~10% of 16 voxels), ID=4 ("unlabeled") is background (~88%), ID=3 ("background") ~2%, 17 ID=2 ("interior") ~0.1%. For binary training use ID=1 as foreground. 18 19- "tissue": coarse tissue type semantic map (uint64, 16 nm, upsampled to 8 nm). 20 Seven classes covering the full volume: 21 1=broken white tissue (<1%), 2=trachea (<1%), 3=cell bodies (~2%), 22 4=dark matter (~1%), 5=large dendrites (~14%), 6=neuropil (~83%), 23 7=out of bounds. Useful as a region prior or multi-class semantic target. 24 25Bounding boxes are always specified in 8 nm voxel coordinates. For mito and tissue 26(native 16 nm), coordinates are halved during download and labels are upsampled back 27to 8 nm using nearest-neighbour rescaling (skimage.transform.rescale, order=0). 28 29This dataset is from the publication https://doi.org/10.7554/eLife.57443. 30Please cite it if you use this dataset in your research. 31 32The dataset is publicly available at https://www.janelia.org/project-team/flyem/hemibrain. 33Requires cloud-volume: pip install cloud-volume. 34 35NOTE (on data size): the full EM volume is (34427, 39725, 41394) voxels at 8 nm isotropic 36resolution. Downloading the entire volume is not feasible. Data is instead accessed by 37specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel 38coordinates, streamed from GCS and cached locally as HDF5 files. 39""" 40 41import hashlib 42import os 43from typing import List, Literal, Optional, Tuple, Union 44 45import numpy as np 46from torch.utils.data import DataLoader, Dataset 47 48import torch_em 49from .. import util 50 51 52EM_URL = "gs://neuroglancer-janelia-flyem-hemibrain/emdata/clahe_yz/jpeg" 53SEG_URL = "gs://neuroglancer-janelia-flyem-hemibrain/v1.2/segmentation" 54MITO_URL = "gs://neuroglancer-janelia-flyem-hemibrain/mito_20190717.27250582" 55TISSUE_URL = "gs://neuroglancer-janelia-flyem-hemibrain/mask_normalized_round6" 56 57LABEL_URLS = { 58 "neurons": SEG_URL, 59 "mito": MITO_URL, 60 "tissue": TISSUE_URL, 61} 62# Mito and tissue are at 16 nm (factor 2 coarser than the 8 nm EM). 63LABEL_RESOLUTION_FACTOR = {"neurons": 1, "mito": 2, "tissue": 2} 64 65# A representative 1024³-voxel subvolume near the centre of the reconstructed region. 66# Units are 8 nm voxels in (x, y, z) order, matching the CloudVolume coordinate space. 67DEFAULT_BOUNDING_BOX = (15000, 16024, 18000, 19024, 18000, 19024) 68 69 70def _bbox_to_str(bbox): 71 return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12] 72 73 74def get_hemibrain_data( 75 path: Union[os.PathLike, str], 76 bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX, 77 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 78 download: bool = False, 79) -> str: 80 """Stream a subvolume from the Hemibrain dataset and cache it as an HDF5 file. 81 82 Args: 83 path: Filepath to a folder where the cached HDF5 file will be saved. 84 bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) 85 in 8 nm voxel coordinates. Defaults to a central 1024³ training region. 86 label_choice: Which labels to cache alongside the EM. One of "neurons", "mito", 87 or "tissue". Mito and tissue are at 16 nm and are resampled to match the EM. 88 download: Whether to stream and cache the data if it is not present. 89 90 Returns: 91 The filepath to the cached HDF5 file. 92 """ 93 import h5py 94 95 os.makedirs(str(path), exist_ok=True) 96 h5_path = os.path.join(str(path), f"{label_choice}_{_bbox_to_str(bounding_box)}.h5") 97 if os.path.exists(h5_path): 98 return h5_path 99 100 if not download: 101 raise RuntimeError( 102 f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS." 103 ) 104 105 try: 106 import cloudvolume 107 except ImportError: 108 raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume") 109 110 x_min, x_max, y_min, y_max, z_min, z_max = bounding_box 111 print(f"Streaming Hemibrain EM + {label_choice} for bbox {bounding_box} ...") 112 113 em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True) 114 raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 115 116 factor = LABEL_RESOLUTION_FACTOR[label_choice] 117 lx0, lx1 = x_min // factor, x_max // factor 118 ly0, ly1 = y_min // factor, y_max // factor 119 lz0, lz1 = z_min // factor, z_max // factor 120 lbl_vol = cloudvolume.CloudVolume(LABEL_URLS[label_choice], use_https=True, mip=0, progress=True) 121 labels = np.array(lbl_vol[lx0:lx1, ly0:ly1, lz0:lz1])[..., 0].transpose(2, 1, 0) 122 if factor > 1: 123 from skimage.transform import rescale 124 labels = rescale(labels, factor, order=0, anti_aliasing=False, preserve_range=True).astype(labels.dtype) 125 labels = labels[:raw.shape[0], :raw.shape[1], :raw.shape[2]] 126 127 with h5py.File(h5_path, "w", locking=False) as f: 128 f.attrs["bounding_box"] = bounding_box 129 f.attrs["label_choice"] = label_choice 130 f.attrs["resolution_nm"] = em_vol.resolution.tolist() 131 f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True) 132 f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True) 133 134 print(f"Cached to {h5_path} (raw {raw.shape}, labels {labels.shape})") 135 return h5_path 136 137 138def get_hemibrain_paths( 139 path: Union[os.PathLike, str], 140 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 141 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 142 download: bool = False, 143) -> List[str]: 144 """Get paths to Hemibrain HDF5 cache files. 145 146 Args: 147 path: Filepath to a folder where the cached HDF5 files will be saved. 148 bounding_boxes: List of regions to fetch, each as 149 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 150 Defaults to [DEFAULT_BOUNDING_BOX]. 151 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 152 download: Whether to stream and cache the data if it is not present. 153 154 Returns: 155 List of filepaths to the cached HDF5 files. 156 """ 157 if bounding_boxes is None: 158 bounding_boxes = [DEFAULT_BOUNDING_BOX] 159 return [get_hemibrain_data(path, bbox, label_choice, download) for bbox in bounding_boxes] 160 161 162def get_hemibrain_dataset( 163 path: Union[os.PathLike, str], 164 patch_shape: Tuple[int, int, int], 165 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 166 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 167 download: bool = False, 168 offsets: Optional[List[List[int]]] = None, 169 boundaries: bool = False, 170 **kwargs, 171) -> Dataset: 172 """Get the Hemibrain dataset for neuron or organelle segmentation. 173 174 Args: 175 path: Filepath to a folder where the cached HDF5 files will be saved. 176 patch_shape: The patch shape (z, y, x) to use for training. 177 bounding_boxes: List of subvolumes to use, each as 178 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 179 Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region. 180 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 181 download: Whether to stream and cache data if not already present. 182 offsets: Offset values for affinity computation used as target. 183 boundaries: Whether to compute boundaries as the target. 184 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 185 186 Returns: 187 The segmentation dataset. 188 """ 189 assert len(patch_shape) == 3 190 191 paths = get_hemibrain_paths(path, bounding_boxes, label_choice, download) 192 193 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 194 kwargs, _ = util.add_instance_label_transform( 195 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 196 ) 197 198 return torch_em.default_segmentation_dataset( 199 raw_paths=paths, 200 raw_key="raw", 201 label_paths=paths, 202 label_key="labels", 203 patch_shape=patch_shape, 204 **kwargs, 205 ) 206 207 208def get_hemibrain_loader( 209 path: Union[os.PathLike, str], 210 patch_shape: Tuple[int, int, int], 211 batch_size: int, 212 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 213 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 214 download: bool = False, 215 offsets: Optional[List[List[int]]] = None, 216 boundaries: bool = False, 217 **kwargs, 218) -> DataLoader: 219 """Get the DataLoader for neuron or organelle segmentation in the Hemibrain dataset. 220 221 Args: 222 path: Filepath to a folder where the cached HDF5 files will be saved. 223 patch_shape: The patch shape (z, y, x) to use for training. 224 batch_size: The batch size for training. 225 bounding_boxes: List of subvolumes to use, each as 226 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 227 Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region. 228 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 229 download: Whether to stream and cache data if not already present. 230 offsets: Offset values for affinity computation used as target. 231 boundaries: Whether to compute boundaries as the target. 232 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` 233 or for the PyTorch DataLoader. 234 235 Returns: 236 The DataLoader. 237 """ 238 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 239 dataset = get_hemibrain_dataset( 240 path, patch_shape, bounding_boxes=bounding_boxes, label_choice=label_choice, 241 download=download, offsets=offsets, boundaries=boundaries, **ds_kwargs 242 ) 243 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
75def get_hemibrain_data( 76 path: Union[os.PathLike, str], 77 bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX, 78 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 79 download: bool = False, 80) -> str: 81 """Stream a subvolume from the Hemibrain dataset and cache it as an HDF5 file. 82 83 Args: 84 path: Filepath to a folder where the cached HDF5 file will be saved. 85 bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) 86 in 8 nm voxel coordinates. Defaults to a central 1024³ training region. 87 label_choice: Which labels to cache alongside the EM. One of "neurons", "mito", 88 or "tissue". Mito and tissue are at 16 nm and are resampled to match the EM. 89 download: Whether to stream and cache the data if it is not present. 90 91 Returns: 92 The filepath to the cached HDF5 file. 93 """ 94 import h5py 95 96 os.makedirs(str(path), exist_ok=True) 97 h5_path = os.path.join(str(path), f"{label_choice}_{_bbox_to_str(bounding_box)}.h5") 98 if os.path.exists(h5_path): 99 return h5_path 100 101 if not download: 102 raise RuntimeError( 103 f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS." 104 ) 105 106 try: 107 import cloudvolume 108 except ImportError: 109 raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume") 110 111 x_min, x_max, y_min, y_max, z_min, z_max = bounding_box 112 print(f"Streaming Hemibrain EM + {label_choice} for bbox {bounding_box} ...") 113 114 em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True) 115 raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 116 117 factor = LABEL_RESOLUTION_FACTOR[label_choice] 118 lx0, lx1 = x_min // factor, x_max // factor 119 ly0, ly1 = y_min // factor, y_max // factor 120 lz0, lz1 = z_min // factor, z_max // factor 121 lbl_vol = cloudvolume.CloudVolume(LABEL_URLS[label_choice], use_https=True, mip=0, progress=True) 122 labels = np.array(lbl_vol[lx0:lx1, ly0:ly1, lz0:lz1])[..., 0].transpose(2, 1, 0) 123 if factor > 1: 124 from skimage.transform import rescale 125 labels = rescale(labels, factor, order=0, anti_aliasing=False, preserve_range=True).astype(labels.dtype) 126 labels = labels[:raw.shape[0], :raw.shape[1], :raw.shape[2]] 127 128 with h5py.File(h5_path, "w", locking=False) as f: 129 f.attrs["bounding_box"] = bounding_box 130 f.attrs["label_choice"] = label_choice 131 f.attrs["resolution_nm"] = em_vol.resolution.tolist() 132 f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True) 133 f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True) 134 135 print(f"Cached to {h5_path} (raw {raw.shape}, labels {labels.shape})") 136 return h5_path
Stream a subvolume from the Hemibrain dataset and cache it as an HDF5 file.
Arguments:
- path: Filepath to a folder where the cached HDF5 file will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
- label_choice: Which labels to cache alongside the EM. One of "neurons", "mito", or "tissue". Mito and tissue are at 16 nm and are resampled to match the EM.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached HDF5 file.
139def get_hemibrain_paths( 140 path: Union[os.PathLike, str], 141 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 142 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 143 download: bool = False, 144) -> List[str]: 145 """Get paths to Hemibrain HDF5 cache files. 146 147 Args: 148 path: Filepath to a folder where the cached HDF5 files will be saved. 149 bounding_boxes: List of regions to fetch, each as 150 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 151 Defaults to [DEFAULT_BOUNDING_BOX]. 152 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 153 download: Whether to stream and cache the data if it is not present. 154 155 Returns: 156 List of filepaths to the cached HDF5 files. 157 """ 158 if bounding_boxes is None: 159 bounding_boxes = [DEFAULT_BOUNDING_BOX] 160 return [get_hemibrain_data(path, bbox, label_choice, download) for bbox in bounding_boxes]
Get paths to Hemibrain HDF5 cache files.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
- label_choice: Which labels to use. One of "neurons", "mito", or "tissue".
- download: Whether to stream and cache the data if it is not present.
Returns:
List of filepaths to the cached HDF5 files.
163def get_hemibrain_dataset( 164 path: Union[os.PathLike, str], 165 patch_shape: Tuple[int, int, int], 166 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 167 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 168 download: bool = False, 169 offsets: Optional[List[List[int]]] = None, 170 boundaries: bool = False, 171 **kwargs, 172) -> Dataset: 173 """Get the Hemibrain dataset for neuron or organelle segmentation. 174 175 Args: 176 path: Filepath to a folder where the cached HDF5 files will be saved. 177 patch_shape: The patch shape (z, y, x) to use for training. 178 bounding_boxes: List of subvolumes to use, each as 179 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 180 Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region. 181 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 182 download: Whether to stream and cache data if not already present. 183 offsets: Offset values for affinity computation used as target. 184 boundaries: Whether to compute boundaries as the target. 185 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 186 187 Returns: 188 The segmentation dataset. 189 """ 190 assert len(patch_shape) == 3 191 192 paths = get_hemibrain_paths(path, bounding_boxes, label_choice, download) 193 194 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 195 kwargs, _ = util.add_instance_label_transform( 196 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 197 ) 198 199 return torch_em.default_segmentation_dataset( 200 raw_paths=paths, 201 raw_key="raw", 202 label_paths=paths, 203 label_key="labels", 204 patch_shape=patch_shape, 205 **kwargs, 206 )
Get the Hemibrain dataset for neuron or organelle segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
- label_choice: Which labels to use. One of "neurons", "mito", or "tissue".
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for
torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
209def get_hemibrain_loader( 210 path: Union[os.PathLike, str], 211 patch_shape: Tuple[int, int, int], 212 batch_size: int, 213 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 214 label_choice: Literal["neurons", "mito", "tissue"] = "neurons", 215 download: bool = False, 216 offsets: Optional[List[List[int]]] = None, 217 boundaries: bool = False, 218 **kwargs, 219) -> DataLoader: 220 """Get the DataLoader for neuron or organelle segmentation in the Hemibrain dataset. 221 222 Args: 223 path: Filepath to a folder where the cached HDF5 files will be saved. 224 patch_shape: The patch shape (z, y, x) to use for training. 225 batch_size: The batch size for training. 226 bounding_boxes: List of subvolumes to use, each as 227 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 228 Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region. 229 label_choice: Which labels to use. One of "neurons", "mito", or "tissue". 230 download: Whether to stream and cache data if not already present. 231 offsets: Offset values for affinity computation used as target. 232 boundaries: Whether to compute boundaries as the target. 233 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` 234 or for the PyTorch DataLoader. 235 236 Returns: 237 The DataLoader. 238 """ 239 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 240 dataset = get_hemibrain_dataset( 241 path, patch_shape, bounding_boxes=bounding_boxes, label_choice=label_choice, 242 download=download, offsets=offsets, boundaries=boundaries, **ds_kwargs 243 ) 244 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
Get the DataLoader for neuron or organelle segmentation in the Hemibrain dataset.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
- label_choice: Which labels to use. One of "neurons", "mito", or "tissue".
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for
torch_em.default_segmentation_datasetor for the PyTorch DataLoader.
Returns:
The DataLoader.