torch_em.data.datasets.electron_microscopy.malecns
The Male CNS Connectome dataset contains a full FIB-SEM volume of the Drosophila male central nervous system with dense neuron instance segmentation.
It covers the central brain, optic lobes, and ventral nerve cord at 8nm isotropic resolution, with ~135k neurons reconstructed and proofread.
The data is hosted at https://male-cns.janelia.org and accessible via Google Cloud Storage. The EM volume is at gs://flyem-male-cns/em/em-clahe-jpeg and the segmentation is at gs://flyem-male-cns/v0.9/segmentation.
The dataset is described at https://www.biorxiv.org/content/10.1101/2025.10.09.680999v2. Please cite this publication if you use the dataset in your research.
NOTE: accessing this dataset requires the cloud-volume package (pip install cloud-volume).
NOTE (on data size): the full volume is (94088, 78317, 134576) voxels at 8nm isotropic resolution (~978 TB raw uncompressed, ~8 PB labels uncompressed). Downloading the entire volume is not feasible. Data is instead accessed by specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8nm voxel coordinates, which are streamed from GCS and cached locally as HDF5 files (~1.2 GB per 1024³ subvolume). To target specific regions of the CNS (central brain, optic lobes, VNC), the neuropil ROI volume at gs://flyem-male-cns/rois/fullbrain-roi-v4 can be used to identify relevant coordinate ranges.
1"""The Male CNS Connectome dataset contains a full FIB-SEM volume of the Drosophila 2male central nervous system with dense neuron instance segmentation. 3 4It covers the central brain, optic lobes, and ventral nerve cord at 8nm isotropic 5resolution, with ~135k neurons reconstructed and proofread. 6 7The data is hosted at https://male-cns.janelia.org and accessible via Google Cloud Storage. 8The EM volume is at gs://flyem-male-cns/em/em-clahe-jpeg and the segmentation is at 9gs://flyem-male-cns/v0.9/segmentation. 10 11The dataset is described at https://www.biorxiv.org/content/10.1101/2025.10.09.680999v2. 12Please cite this publication if you use the dataset in your research. 13 14NOTE: accessing this dataset requires the `cloud-volume` package (pip install cloud-volume). 15 16NOTE (on data size): the full volume is (94088, 78317, 134576) voxels at 8nm isotropic 17resolution (~978 TB raw uncompressed, ~8 PB labels uncompressed). Downloading the entire 18volume is not feasible. Data is instead accessed by specifying bounding boxes 19(x_min, x_max, y_min, y_max, z_min, z_max) in 8nm voxel coordinates, which are streamed 20from GCS and cached locally as HDF5 files (~1.2 GB per 1024³ subvolume). To target 21specific regions of the CNS (central brain, optic lobes, VNC), the neuropil ROI volume 22at gs://flyem-male-cns/rois/fullbrain-roi-v4 can be used to identify relevant coordinate 23ranges. 24""" 25 26import os 27import hashlib 28from typing import List, Optional, Tuple, Union 29 30import numpy as np 31 32from torch.utils.data import DataLoader, Dataset 33 34import torch_em 35 36from .. import util 37 38 39EM_URL = "gs://flyem-male-cns/em/em-clahe-jpeg" 40SEG_URL = "gs://flyem-male-cns/v0.9/segmentation" 41 42# A representative 1024³-voxel subvolume near the centre of the well-reconstructed region. 43# Units are 8 nm voxels in (x, y, z) order, matching the CloudVolume coordinate space. 44DEFAULT_BOUNDING_BOX = (40000, 41024, 40000, 41024, 20000, 21024) 45 46 47def _bbox_to_str(bbox): 48 """Create a short unique filename stem from a bounding box tuple.""" 49 key = "_".join(str(v) for v in bbox) 50 return hashlib.md5(key.encode()).hexdigest()[:12] 51 52 53def get_malecns_data( 54 path: Union[os.PathLike, str], 55 bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX, 56 download: bool = False, 57) -> str: 58 """Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file. 59 60 Args: 61 path: Filepath to a folder where the cached HDF5 file will be saved. 62 bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) 63 in 8 nm voxel coordinates. Defaults to a central 1024³ training region. 64 download: Whether to stream and cache the data if it is not present. 65 66 Returns: 67 The filepath to the cached HDF5 file. 68 """ 69 import h5py 70 71 os.makedirs(path, exist_ok=True) 72 73 stem = _bbox_to_str(bounding_box) 74 h5_path = os.path.join(path, f"{stem}.h5") 75 if os.path.exists(h5_path): 76 return h5_path 77 78 if not download: 79 raise RuntimeError( 80 f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS." 81 ) 82 83 try: 84 import cloudvolume 85 except ImportError: 86 raise ImportError( 87 "The 'cloud-volume' package is required to access the Male CNS dataset. " 88 "Install it with: 'pip install cloud-volume'." 89 ) 90 91 x_min, x_max, y_min, y_max, z_min, z_max = bounding_box 92 93 print(f"Streaming Male CNS EM + segmentation for bbox {bounding_box} ...") 94 95 em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True) 96 seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True) 97 98 # cloud-volume returns (x, y, z, 1). Let's squeeze channel and transpose to (z, y, x). 99 raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 100 labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 101 102 resolution_nm = em_vol.resolution.tolist() # [x, y, z] in nm 103 104 with h5py.File(h5_path, "w", locking=False) as f: 105 f.attrs["bounding_box"] = bounding_box 106 f.attrs["crop_size"] = raw.shape # (z, y, x) after transpose 107 f.attrs["resolution_nm"] = resolution_nm # [x, y, z] in nm 108 f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True) 109 f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True) 110 111 print(f"Cached to {h5_path} (raw {raw.shape}, labels {labels.shape})") 112 return h5_path 113 114 115def get_malecns_paths( 116 path: Union[os.PathLike, str], 117 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 118 download: bool = False, 119) -> List[str]: 120 """Get paths to the Male CNS HDF5 cache files. 121 122 Args: 123 path: Filepath to a folder where the cached HDF5 files will be saved. 124 bounding_boxes: List of regions to fetch, each as 125 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 126 Defaults to [DEFAULT_BOUNDING_BOX]. 127 download: Whether to stream and cache the data if it is not present. 128 129 Returns: 130 List of filepaths to the cached HDF5 files. 131 """ 132 if bounding_boxes is None: 133 bounding_boxes = [DEFAULT_BOUNDING_BOX] 134 135 return [get_malecns_data(path, bbox, download) for bbox in bounding_boxes] 136 137 138def get_malecns_dataset( 139 path: Union[os.PathLike, str], 140 patch_shape: Tuple[int, int, int], 141 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 142 download: bool = False, 143 offsets: Optional[List[List[int]]] = None, 144 boundaries: bool = False, 145 **kwargs, 146) -> Dataset: 147 """Get the Male CNS Connectome dataset for neuron instance segmentation. 148 149 Args: 150 path: Filepath to a folder where the cached HDF5 files will be saved. 151 patch_shape: The patch shape (z, y, x) to use for training. 152 bounding_boxes: List of subvolumes to use, each as 153 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 154 Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region. 155 download: Whether to stream and cache data if not already present. 156 offsets: Offset values for affinity computation used as target. 157 boundaries: Whether to compute boundaries as the target. 158 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 159 160 Returns: 161 The segmentation dataset. 162 """ 163 assert len(patch_shape) == 3 164 165 paths = get_malecns_paths(path, bounding_boxes, download) 166 167 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 168 kwargs, _ = util.add_instance_label_transform( 169 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 170 ) 171 172 return torch_em.default_segmentation_dataset( 173 raw_paths=paths, 174 raw_key="raw", 175 label_paths=paths, 176 label_key="labels", 177 patch_shape=patch_shape, 178 **kwargs, 179 ) 180 181 182def get_malecns_loader( 183 path: Union[os.PathLike, str], 184 patch_shape: Tuple[int, int, int], 185 batch_size: int, 186 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 187 download: bool = False, 188 offsets: Optional[List[List[int]]] = None, 189 boundaries: bool = False, 190 **kwargs, 191) -> DataLoader: 192 """Get the DataLoader for neuron instance segmentation in the Male CNS Connectome. 193 194 Args: 195 path: Filepath to a folder where the cached HDF5 files will be saved. 196 patch_shape: The patch shape (z, y, x) to use for training. 197 batch_size: The batch size for training. 198 bounding_boxes: List of subvolumes to use, each as 199 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 200 Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region. 201 download: Whether to stream and cache data if not already present. 202 offsets: Offset values for affinity computation used as target. 203 boundaries: Whether to compute boundaries as the target. 204 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader. 205 206 Returns: 207 The DataLoader. 208 """ 209 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 210 dataset = get_malecns_dataset( 211 path, patch_shape, bounding_boxes, download, offsets, boundaries, **ds_kwargs 212 ) 213 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
54def get_malecns_data( 55 path: Union[os.PathLike, str], 56 bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX, 57 download: bool = False, 58) -> str: 59 """Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file. 60 61 Args: 62 path: Filepath to a folder where the cached HDF5 file will be saved. 63 bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) 64 in 8 nm voxel coordinates. Defaults to a central 1024³ training region. 65 download: Whether to stream and cache the data if it is not present. 66 67 Returns: 68 The filepath to the cached HDF5 file. 69 """ 70 import h5py 71 72 os.makedirs(path, exist_ok=True) 73 74 stem = _bbox_to_str(bounding_box) 75 h5_path = os.path.join(path, f"{stem}.h5") 76 if os.path.exists(h5_path): 77 return h5_path 78 79 if not download: 80 raise RuntimeError( 81 f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS." 82 ) 83 84 try: 85 import cloudvolume 86 except ImportError: 87 raise ImportError( 88 "The 'cloud-volume' package is required to access the Male CNS dataset. " 89 "Install it with: 'pip install cloud-volume'." 90 ) 91 92 x_min, x_max, y_min, y_max, z_min, z_max = bounding_box 93 94 print(f"Streaming Male CNS EM + segmentation for bbox {bounding_box} ...") 95 96 em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True) 97 seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True) 98 99 # cloud-volume returns (x, y, z, 1). Let's squeeze channel and transpose to (z, y, x). 100 raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 101 labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0) 102 103 resolution_nm = em_vol.resolution.tolist() # [x, y, z] in nm 104 105 with h5py.File(h5_path, "w", locking=False) as f: 106 f.attrs["bounding_box"] = bounding_box 107 f.attrs["crop_size"] = raw.shape # (z, y, x) after transpose 108 f.attrs["resolution_nm"] = resolution_nm # [x, y, z] in nm 109 f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True) 110 f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True) 111 112 print(f"Cached to {h5_path} (raw {raw.shape}, labels {labels.shape})") 113 return h5_path
Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file.
Arguments:
- path: Filepath to a folder where the cached HDF5 file will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached HDF5 file.
116def get_malecns_paths( 117 path: Union[os.PathLike, str], 118 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 119 download: bool = False, 120) -> List[str]: 121 """Get paths to the Male CNS HDF5 cache files. 122 123 Args: 124 path: Filepath to a folder where the cached HDF5 files will be saved. 125 bounding_boxes: List of regions to fetch, each as 126 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 127 Defaults to [DEFAULT_BOUNDING_BOX]. 128 download: Whether to stream and cache the data if it is not present. 129 130 Returns: 131 List of filepaths to the cached HDF5 files. 132 """ 133 if bounding_boxes is None: 134 bounding_boxes = [DEFAULT_BOUNDING_BOX] 135 136 return [get_malecns_data(path, bbox, download) for bbox in bounding_boxes]
Get paths to the Male CNS HDF5 cache files.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
- download: Whether to stream and cache the data if it is not present.
Returns:
List of filepaths to the cached HDF5 files.
139def get_malecns_dataset( 140 path: Union[os.PathLike, str], 141 patch_shape: Tuple[int, int, int], 142 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 143 download: bool = False, 144 offsets: Optional[List[List[int]]] = None, 145 boundaries: bool = False, 146 **kwargs, 147) -> Dataset: 148 """Get the Male CNS Connectome dataset for neuron instance segmentation. 149 150 Args: 151 path: Filepath to a folder where the cached HDF5 files will be saved. 152 patch_shape: The patch shape (z, y, x) to use for training. 153 bounding_boxes: List of subvolumes to use, each as 154 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 155 Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region. 156 download: Whether to stream and cache data if not already present. 157 offsets: Offset values for affinity computation used as target. 158 boundaries: Whether to compute boundaries as the target. 159 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 160 161 Returns: 162 The segmentation dataset. 163 """ 164 assert len(patch_shape) == 3 165 166 paths = get_malecns_paths(path, bounding_boxes, download) 167 168 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 169 kwargs, _ = util.add_instance_label_transform( 170 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 171 ) 172 173 return torch_em.default_segmentation_dataset( 174 raw_paths=paths, 175 raw_key="raw", 176 label_paths=paths, 177 label_key="labels", 178 patch_shape=patch_shape, 179 **kwargs, 180 )
Get the Male CNS Connectome dataset for neuron instance segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for
torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
183def get_malecns_loader( 184 path: Union[os.PathLike, str], 185 patch_shape: Tuple[int, int, int], 186 batch_size: int, 187 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 188 download: bool = False, 189 offsets: Optional[List[List[int]]] = None, 190 boundaries: bool = False, 191 **kwargs, 192) -> DataLoader: 193 """Get the DataLoader for neuron instance segmentation in the Male CNS Connectome. 194 195 Args: 196 path: Filepath to a folder where the cached HDF5 files will be saved. 197 patch_shape: The patch shape (z, y, x) to use for training. 198 batch_size: The batch size for training. 199 bounding_boxes: List of subvolumes to use, each as 200 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 201 Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region. 202 download: Whether to stream and cache data if not already present. 203 offsets: Offset values for affinity computation used as target. 204 boundaries: Whether to compute boundaries as the target. 205 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader. 206 207 Returns: 208 The DataLoader. 209 """ 210 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 211 dataset = get_malecns_dataset( 212 path, patch_shape, bounding_boxes, download, offsets, boundaries, **ds_kwargs 213 ) 214 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
Get the DataLoader for neuron instance segmentation in the Male CNS Connectome.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for
torch_em.default_segmentation_datasetor for the PyTorch DataLoader.
Returns:
The DataLoader.