torch_em.data.datasets.electron_microscopy.malecns

The Male CNS Connectome dataset contains a full FIB-SEM volume of the Drosophila male central nervous system with dense neuron instance segmentation.

It covers the central brain, optic lobes, and ventral nerve cord at 8nm isotropic resolution, with ~135k neurons reconstructed and proofread.

The data is hosted at https://male-cns.janelia.org and accessible via Google Cloud Storage. The EM volume is at gs://flyem-male-cns/em/em-clahe-jpeg and the segmentation is at gs://flyem-male-cns/v0.9/segmentation.

The dataset is described at https://www.biorxiv.org/content/10.1101/2025.10.09.680999v2. Please cite this publication if you use the dataset in your research.

NOTE: accessing this dataset requires the cloud-volume package (pip install cloud-volume).

NOTE (on data size): the full volume is (94088, 78317, 134576) voxels at 8nm isotropic resolution (~978 TB raw uncompressed, ~8 PB labels uncompressed). Downloading the entire volume is not feasible. Data is instead accessed by specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8nm voxel coordinates, which are streamed from GCS and cached locally as HDF5 files (~1.2 GB per 1024³ subvolume). To target specific regions of the CNS (central brain, optic lobes, VNC), the neuropil ROI volume at gs://flyem-male-cns/rois/fullbrain-roi-v4 can be used to identify relevant coordinate ranges.

  1"""The Male CNS Connectome dataset contains a full FIB-SEM volume of the Drosophila
  2male central nervous system with dense neuron instance segmentation.
  3
  4It covers the central brain, optic lobes, and ventral nerve cord at 8nm isotropic
  5resolution, with ~135k neurons reconstructed and proofread.
  6
  7The data is hosted at https://male-cns.janelia.org and accessible via Google Cloud Storage.
  8The EM volume is at gs://flyem-male-cns/em/em-clahe-jpeg and the segmentation is at
  9gs://flyem-male-cns/v0.9/segmentation.
 10
 11The dataset is described at https://www.biorxiv.org/content/10.1101/2025.10.09.680999v2.
 12Please cite this publication if you use the dataset in your research.
 13
 14NOTE: accessing this dataset requires the `cloud-volume` package (pip install cloud-volume).
 15
 16NOTE (on data size): the full volume is (94088, 78317, 134576) voxels at 8nm isotropic
 17resolution (~978 TB raw uncompressed, ~8 PB labels uncompressed). Downloading the entire
 18volume is not feasible. Data is instead accessed by specifying bounding boxes
 19(x_min, x_max, y_min, y_max, z_min, z_max) in 8nm voxel coordinates, which are streamed
 20from GCS and cached locally as HDF5 files (~1.2 GB per 1024³ subvolume). To target
 21specific regions of the CNS (central brain, optic lobes, VNC), the neuropil ROI volume
 22at gs://flyem-male-cns/rois/fullbrain-roi-v4 can be used to identify relevant coordinate
 23ranges.
 24"""
 25
 26import os
 27import hashlib
 28from typing import List, Optional, Tuple, Union
 29
 30import numpy as np
 31
 32from torch.utils.data import DataLoader, Dataset
 33
 34import torch_em
 35
 36from .. import util
 37
 38
 39EM_URL = "gs://flyem-male-cns/em/em-clahe-jpeg"
 40SEG_URL = "gs://flyem-male-cns/v0.9/segmentation"
 41
 42# A representative 1024³-voxel subvolume near the centre of the well-reconstructed region.
 43# Units are 8 nm voxels in (x, y, z) order, matching the CloudVolume coordinate space.
 44DEFAULT_BOUNDING_BOX = (40000, 41024, 40000, 41024, 20000, 21024)
 45
 46
 47def _bbox_to_str(bbox):
 48    """Create a short unique filename stem from a bounding box tuple."""
 49    key = "_".join(str(v) for v in bbox)
 50    return hashlib.md5(key.encode()).hexdigest()[:12]
 51
 52
 53def get_malecns_data(
 54    path: Union[os.PathLike, str],
 55    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
 56    download: bool = False,
 57) -> str:
 58    """Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file.
 59
 60    Args:
 61        path: Filepath to a folder where the cached HDF5 file will be saved.
 62        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
 63            in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
 64        download: Whether to stream and cache the data if it is not present.
 65
 66    Returns:
 67        The filepath to the cached HDF5 file.
 68    """
 69    import h5py
 70
 71    os.makedirs(path, exist_ok=True)
 72
 73    stem = _bbox_to_str(bounding_box)
 74    h5_path = os.path.join(path, f"{stem}.h5")
 75    if os.path.exists(h5_path):
 76        return h5_path
 77
 78    if not download:
 79        raise RuntimeError(
 80            f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS."
 81        )
 82
 83    try:
 84        import cloudvolume
 85    except ImportError:
 86        raise ImportError(
 87            "The 'cloud-volume' package is required to access the Male CNS dataset. "
 88            "Install it with: 'pip install cloud-volume'."
 89        )
 90
 91    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
 92
 93    print(f"Streaming Male CNS EM + segmentation for bbox {bounding_box} ...")
 94
 95    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True)
 96    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)
 97
 98    # cloud-volume returns (x, y, z, 1). Let's squeeze channel and transpose to (z, y, x).
 99    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
100    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
101
102    resolution_nm = em_vol.resolution.tolist()  # [x, y, z] in nm
103
104    with h5py.File(h5_path, "w", locking=False) as f:
105        f.attrs["bounding_box"] = bounding_box
106        f.attrs["crop_size"] = raw.shape  # (z, y, x) after transpose
107        f.attrs["resolution_nm"] = resolution_nm  # [x, y, z] in nm
108        f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True)
109        f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True)
110
111    print(f"Cached to {h5_path}  (raw {raw.shape}, labels {labels.shape})")
112    return h5_path
113
114
115def get_malecns_paths(
116    path: Union[os.PathLike, str],
117    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
118    download: bool = False,
119) -> List[str]:
120    """Get paths to the Male CNS HDF5 cache files.
121
122    Args:
123        path: Filepath to a folder where the cached HDF5 files will be saved.
124        bounding_boxes: List of regions to fetch, each as
125            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
126            Defaults to [DEFAULT_BOUNDING_BOX].
127        download: Whether to stream and cache the data if it is not present.
128
129    Returns:
130        List of filepaths to the cached HDF5 files.
131    """
132    if bounding_boxes is None:
133        bounding_boxes = [DEFAULT_BOUNDING_BOX]
134
135    return [get_malecns_data(path, bbox, download) for bbox in bounding_boxes]
136
137
138def get_malecns_dataset(
139    path: Union[os.PathLike, str],
140    patch_shape: Tuple[int, int, int],
141    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
142    download: bool = False,
143    offsets: Optional[List[List[int]]] = None,
144    boundaries: bool = False,
145    **kwargs,
146) -> Dataset:
147    """Get the Male CNS Connectome dataset for neuron instance segmentation.
148
149    Args:
150        path: Filepath to a folder where the cached HDF5 files will be saved.
151        patch_shape: The patch shape (z, y, x) to use for training.
152        bounding_boxes: List of subvolumes to use, each as
153            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
154            Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
155        download: Whether to stream and cache data if not already present.
156        offsets: Offset values for affinity computation used as target.
157        boundaries: Whether to compute boundaries as the target.
158        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.
159
160    Returns:
161        The segmentation dataset.
162    """
163    assert len(patch_shape) == 3
164
165    paths = get_malecns_paths(path, bounding_boxes, download)
166
167    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
168    kwargs, _ = util.add_instance_label_transform(
169        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
170    )
171
172    return torch_em.default_segmentation_dataset(
173        raw_paths=paths,
174        raw_key="raw",
175        label_paths=paths,
176        label_key="labels",
177        patch_shape=patch_shape,
178        **kwargs,
179    )
180
181
182def get_malecns_loader(
183    path: Union[os.PathLike, str],
184    patch_shape: Tuple[int, int, int],
185    batch_size: int,
186    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
187    download: bool = False,
188    offsets: Optional[List[List[int]]] = None,
189    boundaries: bool = False,
190    **kwargs,
191) -> DataLoader:
192    """Get the DataLoader for neuron instance segmentation in the Male CNS Connectome.
193
194    Args:
195        path: Filepath to a folder where the cached HDF5 files will be saved.
196        patch_shape: The patch shape (z, y, x) to use for training.
197        batch_size: The batch size for training.
198        bounding_boxes: List of subvolumes to use, each as
199            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
200            Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
201        download: Whether to stream and cache data if not already present.
202        offsets: Offset values for affinity computation used as target.
203        boundaries: Whether to compute boundaries as the target.
204        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.
205
206    Returns:
207        The DataLoader.
208    """
209    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
210    dataset = get_malecns_dataset(
211        path, patch_shape, bounding_boxes, download, offsets, boundaries, **ds_kwargs
212    )
213    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
EM_URL = 'gs://flyem-male-cns/em/em-clahe-jpeg'
SEG_URL = 'gs://flyem-male-cns/v0.9/segmentation'
DEFAULT_BOUNDING_BOX = (40000, 41024, 40000, 41024, 20000, 21024)
def get_malecns_data( path: Union[os.PathLike, str], bounding_box: Tuple[int, int, int, int, int, int] = (40000, 41024, 40000, 41024, 20000, 21024), download: bool = False) -> str:
 54def get_malecns_data(
 55    path: Union[os.PathLike, str],
 56    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
 57    download: bool = False,
 58) -> str:
 59    """Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file.
 60
 61    Args:
 62        path: Filepath to a folder where the cached HDF5 file will be saved.
 63        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
 64            in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
 65        download: Whether to stream and cache the data if it is not present.
 66
 67    Returns:
 68        The filepath to the cached HDF5 file.
 69    """
 70    import h5py
 71
 72    os.makedirs(path, exist_ok=True)
 73
 74    stem = _bbox_to_str(bounding_box)
 75    h5_path = os.path.join(path, f"{stem}.h5")
 76    if os.path.exists(h5_path):
 77        return h5_path
 78
 79    if not download:
 80        raise RuntimeError(
 81            f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS."
 82        )
 83
 84    try:
 85        import cloudvolume
 86    except ImportError:
 87        raise ImportError(
 88            "The 'cloud-volume' package is required to access the Male CNS dataset. "
 89            "Install it with: 'pip install cloud-volume'."
 90        )
 91
 92    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
 93
 94    print(f"Streaming Male CNS EM + segmentation for bbox {bounding_box} ...")
 95
 96    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True)
 97    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)
 98
 99    # cloud-volume returns (x, y, z, 1). Let's squeeze channel and transpose to (z, y, x).
100    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
101    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
102
103    resolution_nm = em_vol.resolution.tolist()  # [x, y, z] in nm
104
105    with h5py.File(h5_path, "w", locking=False) as f:
106        f.attrs["bounding_box"] = bounding_box
107        f.attrs["crop_size"] = raw.shape  # (z, y, x) after transpose
108        f.attrs["resolution_nm"] = resolution_nm  # [x, y, z] in nm
109        f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True)
110        f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True)
111
112    print(f"Cached to {h5_path}  (raw {raw.shape}, labels {labels.shape})")
113    return h5_path

Stream a subvolume from the Male CNS Connectome and cache it as an HDF5 file.

Arguments:
  • path: Filepath to a folder where the cached HDF5 file will be saved.
  • bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the cached HDF5 file.

def get_malecns_paths( path: Union[os.PathLike, str], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False) -> List[str]:
116def get_malecns_paths(
117    path: Union[os.PathLike, str],
118    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
119    download: bool = False,
120) -> List[str]:
121    """Get paths to the Male CNS HDF5 cache files.
122
123    Args:
124        path: Filepath to a folder where the cached HDF5 files will be saved.
125        bounding_boxes: List of regions to fetch, each as
126            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
127            Defaults to [DEFAULT_BOUNDING_BOX].
128        download: Whether to stream and cache the data if it is not present.
129
130    Returns:
131        List of filepaths to the cached HDF5 files.
132    """
133    if bounding_boxes is None:
134        bounding_boxes = [DEFAULT_BOUNDING_BOX]
135
136    return [get_malecns_data(path, bbox, download) for bbox in bounding_boxes]

Get paths to the Male CNS HDF5 cache files.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
  • download: Whether to stream and cache the data if it is not present.
Returns:

List of filepaths to the cached HDF5 files.

def get_malecns_dataset( path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:
139def get_malecns_dataset(
140    path: Union[os.PathLike, str],
141    patch_shape: Tuple[int, int, int],
142    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
143    download: bool = False,
144    offsets: Optional[List[List[int]]] = None,
145    boundaries: bool = False,
146    **kwargs,
147) -> Dataset:
148    """Get the Male CNS Connectome dataset for neuron instance segmentation.
149
150    Args:
151        path: Filepath to a folder where the cached HDF5 files will be saved.
152        patch_shape: The patch shape (z, y, x) to use for training.
153        bounding_boxes: List of subvolumes to use, each as
154            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
155            Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
156        download: Whether to stream and cache data if not already present.
157        offsets: Offset values for affinity computation used as target.
158        boundaries: Whether to compute boundaries as the target.
159        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.
160
161    Returns:
162        The segmentation dataset.
163    """
164    assert len(patch_shape) == 3
165
166    paths = get_malecns_paths(path, bounding_boxes, download)
167
168    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
169    kwargs, _ = util.add_instance_label_transform(
170        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
171    )
172
173    return torch_em.default_segmentation_dataset(
174        raw_paths=paths,
175        raw_key="raw",
176        label_paths=paths,
177        label_key="labels",
178        patch_shape=patch_shape,
179        **kwargs,
180    )

Get the Male CNS Connectome dataset for neuron instance segmentation.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.

def get_malecns_loader( path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:
183def get_malecns_loader(
184    path: Union[os.PathLike, str],
185    patch_shape: Tuple[int, int, int],
186    batch_size: int,
187    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
188    download: bool = False,
189    offsets: Optional[List[List[int]]] = None,
190    boundaries: bool = False,
191    **kwargs,
192) -> DataLoader:
193    """Get the DataLoader for neuron instance segmentation in the Male CNS Connectome.
194
195    Args:
196        path: Filepath to a folder where the cached HDF5 files will be saved.
197        patch_shape: The patch shape (z, y, x) to use for training.
198        batch_size: The batch size for training.
199        bounding_boxes: List of subvolumes to use, each as
200            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
201            Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
202        download: Whether to stream and cache data if not already present.
203        offsets: Offset values for affinity computation used as target.
204        boundaries: Whether to compute boundaries as the target.
205        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` or for the PyTorch DataLoader.
206
207    Returns:
208        The DataLoader.
209    """
210    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
211    dataset = get_malecns_dataset(
212        path, patch_shape, bounding_boxes, download, offsets, boundaries, **ds_kwargs
213    )
214    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)

Get the DataLoader for neuron instance segmentation in the Male CNS Connectome.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a central 1024³ region.
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.