torch_em.data.datasets.electron_microscopy.manc

The MANC (Male Adult Nerve Cord) dataset contains a FIB-SEM volume of the Drosophila male ventral nerve cord with dense neuron instance segmentation.

It covers the full adult male nerve cord at 8 nm isotropic resolution with ~23,000 neurons reconstructed and proofread, including 10 million pre-synaptic sites and 74 million post-synaptic densities.

The EM volume is at gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg and the segmentation is at gs://manc-seg-v1p2/manc-seg-v1.2.

This dataset is from the publication https://doi.org/10.7554/eLife.89346. Please cite it if you use this dataset in your research.

The dataset is publicly available at https://www.janelia.org/project-team/flyem/manc-connectome. Streaming the data requires the cloud-volume package: pip install cloud-volume.

NOTE (on data size): the full volume is (46113, 59467, 82276) voxels at 8 nm isotropic resolution, so downloading it in full is not feasible. Instead, data is requested per bounding box (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates; each box is streamed from Google Cloud Storage (GCS) and cached locally as an HDF5 file.
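To make the size note concrete, the following back-of-the-envelope sketch estimates the uncompressed storage for the full volume and for one default-sized bounding box. It assumes one byte per raw voxel (uint8) and eight bytes per label voxel (uint64), matching the dtypes used when the cache is written; the helper names are illustrative and not part of torch_em.

```python
from math import prod

# Full volume shape in voxels, as stated in the note on data size.
FULL_VOLUME_SHAPE = (46113, 59467, 82276)
# Default bounding box as (x_min, x_max, y_min, y_max, z_min, z_max).
DEFAULT_BOUNDING_BOX = (20000, 21024, 25000, 26024, 40000, 41024)


def bbox_extent(bbox):
    """Return the (x, y, z) extent in voxels of a bounding box."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    return (x_max - x_min, y_max - y_min, z_max - z_min)


def uncompressed_bytes(shape, bytes_per_voxel):
    """Return the uncompressed size in bytes of a volume with the given shape."""
    return prod(shape) * bytes_per_voxel


# Raw data only, assuming uint8 (1 byte per voxel).
full_raw_tb = uncompressed_bytes(FULL_VOLUME_SHAPE, 1) / 1e12
# One default box: uint8 raw and uint64 labels (assumed dtypes).
box_raw_gib = uncompressed_bytes(bbox_extent(DEFAULT_BOUNDING_BOX), 1) / 2**30
box_label_gib = uncompressed_bytes(bbox_extent(DEFAULT_BOUNDING_BOX), 8) / 2**30

print(f"full raw volume: ~{full_raw_tb:.0f} TB uncompressed")
print(f"default box: {box_raw_gib:.0f} GiB raw + {box_label_gib:.0f} GiB labels")
```

Under these assumptions a single 1024³ box already occupies about 9 GiB in memory before gzip compression, which is worth keeping in mind when choosing larger bounding boxes.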

"""The MANC (Male Adult Nerve Cord) dataset contains a FIB-SEM volume of the
Drosophila male ventral nerve cord with dense neuron instance segmentation.

It covers the full adult male nerve cord at 8 nm isotropic resolution with
~23,000 neurons reconstructed and proofread, including 10 million pre-synaptic
sites and 74 million post-synaptic densities.

The EM volume is at gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg
and the segmentation is at gs://manc-seg-v1p2/manc-seg-v1.2.

This dataset is from the publication https://doi.org/10.7554/eLife.89346.
Please cite it if you use this dataset in your research.

The dataset is publicly available at https://www.janelia.org/project-team/flyem/manc-connectome.
Streaming the data requires the cloud-volume package: pip install cloud-volume.

NOTE (on data size): the full volume is (46113, 59467, 82276) voxels at 8 nm isotropic
resolution, so downloading it in full is not feasible. Instead, data is requested per
bounding box (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates;
each box is streamed from GCS and cached locally as an HDF5 file.
"""

import hashlib
import os
from typing import List, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EM_URL = "gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg"
SEG_URL = "gs://manc-seg-v1p2/manc-seg-v1.2"

# A representative 1024³-voxel subvolume near the centre of the reconstructed region.
# Units are 8 nm voxels in (x, y, z) order, matching the CloudVolume coordinate space.
DEFAULT_BOUNDING_BOX = (20000, 21024, 25000, 26024, 40000, 41024)


def _bbox_to_str(bbox):
    """Derive a short, deterministic cache file stem from the bounding box coordinates."""
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def get_manc_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
    """Stream a subvolume from the MANC dataset and cache it as an HDF5 file.

    Args:
        path: Filepath to a folder where the cached HDF5 file will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached HDF5 file.
    """
    import h5py

    os.makedirs(str(path), exist_ok=True)
    h5_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.h5")
    # Reuse the cached subvolume if this bounding box was fetched before.
    if os.path.exists(h5_path):
        return h5_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS."
        )

    # cloud-volume is only needed when data is actually streamed.
    try:
        import cloudvolume
    except ImportError as err:
        raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume") from err

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    print(f"Streaming MANC EM + segmentation for bbox {bounding_box} ...")

    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True)
    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)

    # CloudVolume returns arrays in (x, y, z, channel) order:
    # drop the channel axis and transpose to (z, y, x).
    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)

    with h5py.File(h5_path, "w", locking=False) as f:
        f.attrs["bounding_box"] = bounding_box
        f.attrs["resolution_nm"] = em_vol.resolution.tolist()
        f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True)
        f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True)

    print(f"Cached to {h5_path} (shape {raw.shape})")
    return h5_path


def get_manc_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to MANC HDF5 cache files.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX].
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached HDF5 files.
    """
    if bounding_boxes is None:
        bounding_boxes = [DEFAULT_BOUNDING_BOX]
    # Each bounding box is fetched (or looked up) as its own cache file.
    return [get_manc_data(path, bbox, download) for bbox in bounding_boxes]


def get_manc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
    """Get the MANC dataset for neuron instance segmentation.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_manc_paths(path, bounding_boxes, download)

    # Configure the label transform that turns instance labels into the training target.
    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
    kwargs, _ = util.add_instance_label_transform(
        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
    )

    # Raw data and labels live in the same HDF5 file under different keys.
    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )


def get_manc_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for neuron instance segmentation in the MANC dataset.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    # Split kwargs into those for the dataset and those for the DataLoader.
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_manc_dataset(
        path, patch_shape, bounding_boxes=bounding_boxes, download=download,
        offsets=offsets, boundaries=boundaries, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)

EM_URL = 'gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg'

SEG_URL = 'gs://manc-seg-v1p2/manc-seg-v1.2'

DEFAULT_BOUNDING_BOX = (20000, 21024, 25000, 26024, 40000, 41024)

def get_manc_data(path: Union[os.PathLike, str], bounding_box: Tuple[int, int, int, int, int, int] = (20000, 21024, 25000, 26024, 40000, 41024), download: bool = False) -> str:

Stream a subvolume from the MANC dataset and cache it as an HDF5 file.

Arguments:
  • path: Filepath to a folder where the cached HDF5 file will be saved.
  • bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the cached HDF5 file.
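The relationship between a bounding box, its cache file name, and the shape of the cached arrays can be sketched as follows. `bbox_cache_name` mirrors the module's private `_bbox_to_str` helper, and `cached_shape_zyx` reflects the transpose from (x, y, z) to (z, y, x) in the source; `validate_bbox` is a hypothetical pre-flight check that torch_em does not perform, and the cache folder name is an assumption.

```python
import hashlib
import os


def bbox_cache_name(bbox):
    """First 12 hex digits of the MD5 of the joined coordinates (mirrors `_bbox_to_str`)."""
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def cached_shape_zyx(bbox):
    """Shape of the cached arrays: the (x, y, z) request is stored as (z, y, x)."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    return (z_max - z_min, y_max - y_min, x_max - x_min)


def validate_bbox(bbox):
    """Hypothetical check (not in torch_em): six values with 0 <= min < max per axis."""
    if len(bbox) != 6:
        raise ValueError(f"Expected 6 values, got {len(bbox)}")
    for lo, hi in zip(bbox[0::2], bbox[1::2]):
        if not 0 <= lo < hi:
            raise ValueError(f"Invalid interval [{lo}, {hi}) in bounding box {bbox}")
    return tuple(int(v) for v in bbox)


# A box that is 1024 voxels wide in x and y and 1280 voxels deep in z.
bbox = validate_bbox((20000, 21024, 25000, 26024, 40000, 41280))
h5_path = os.path.join("./manc_cache", f"{bbox_cache_name(bbox)}.h5")
print(h5_path, cached_shape_zyx(bbox))
```

Because the file name depends only on the coordinates, requesting the same box again resolves to the same cache file, while any change to a coordinate produces a new file.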

def get_manc_paths(path: Union[os.PathLike, str], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False) -> List[str]:

Get paths to MANC HDF5 cache files.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
  • download: Whether to stream and cache the data if it is not present.
Returns:

List of filepaths to the cached HDF5 files.
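Since each bounding box is cached as its own file, a larger region can be covered by passing several boxes. The sketch below splits a region into non-overlapping tiles; `tile_bounding_boxes` and the example region are hypothetical and not part of torch_em.

```python
from itertools import product


def tile_bounding_boxes(region, tile_size=1024):
    """Split a region into non-overlapping tiles of at most `tile_size` voxels per axis.

    `region` and the returned tiles use the (x_min, x_max, y_min, y_max, z_min, z_max)
    convention. Tiles at the upper border are clipped to the region.
    """
    x_min, x_max, y_min, y_max, z_min, z_max = region

    def axis_intervals(lo, hi):
        return [(start, min(start + tile_size, hi)) for start in range(lo, hi, tile_size)]

    return [
        (xs, xe, ys, ye, zs, ze)
        for (xs, xe), (ys, ye), (zs, ze) in product(
            axis_intervals(x_min, x_max),
            axis_intervals(y_min, y_max),
            axis_intervals(z_min, z_max),
        )
    ]


# Example region: 2048 voxels in x, 1024 in y, 1536 in z.
region = (20000, 22048, 25000, 26024, 40000, 41536)
boxes = tile_bounding_boxes(region, tile_size=1024)
print(len(boxes))  # 2 tiles in x * 1 in y * 2 in z
```

The resulting list can be passed as `bounding_boxes`, for example `get_manc_paths(path, boxes, download=True)`, so that an interrupted run only needs to fetch the tiles that are not cached yet.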

def get_manc_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the MANC dataset for neuron instance segmentation.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
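As an illustration of the `offsets` and `boundaries` arguments, the sketch below builds a list of affinity offsets in (z, y, x) order and uses either that list or a boundary target. The offset ranges, the patch shape, and both helper names are assumptions chosen for illustration; torch_em is imported lazily so the offset helper can be used on its own.

```python
def nearest_neighbor_offsets(ranges=(1, 3, 9)):
    """Build affinity offsets with one entry per axis and range, in (z, y, x) order."""
    offsets = []
    for r in ranges:
        offsets.extend([[-r, 0, 0], [0, -r, 0], [0, 0, -r]])
    return offsets


def build_manc_dataset(path, use_affinities=True):
    """Create the dataset with either affinity or boundary targets (illustrative settings)."""
    # Imported here so the sketch can be read and tested without torch_em installed.
    from torch_em.data.datasets.electron_microscopy.manc import get_manc_dataset

    if use_affinities:
        target_kwargs = {"offsets": nearest_neighbor_offsets()}
    else:
        target_kwargs = {"boundaries": True}
    # patch_shape is given as (z, y, x); download=True streams the default bounding box.
    return get_manc_dataset(path, patch_shape=(64, 256, 256), download=True, **target_kwargs)


print(nearest_neighbor_offsets(ranges=(1, 3)))
```

Because the data is isotropic, using the same ranges along all three axes is a natural starting point, though the best choice depends on the model and task.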

def get_manc_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the DataLoader for neuron instance segmentation in the MANC dataset.

Arguments:
  • path: Filepath to a folder where the cached HDF5 files will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
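The loader accepts keyword arguments for both the dataset and the PyTorch DataLoader and separates them internally. The sketch below shows one way such a split can work, based on the signature of the dataset factory. It is an illustration of the idea, not torch_em's implementation of `util.split_kwargs`; `split_kwargs_sketch`, `make_dataset`, and `build_manc_loader` are hypothetical names, and the loader settings are example values.

```python
import inspect


def split_kwargs_sketch(function, **kwargs):
    """Separate the kwargs accepted by `function` from all remaining ones."""
    accepted = inspect.signature(function).parameters
    matched = {k: v for k, v in kwargs.items() if k in accepted}
    remaining = {k: v for k, v in kwargs.items() if k not in accepted}
    return matched, remaining


def make_dataset(raw_paths, patch_shape, n_samples=None, sampler=None):
    """Stand-in for a dataset factory; only its signature matters here."""


ds_kwargs, loader_kwargs = split_kwargs_sketch(
    make_dataset, n_samples=500, num_workers=4, shuffle=True
)
print(ds_kwargs)      # {'n_samples': 500}
print(loader_kwargs)  # {'num_workers': 4, 'shuffle': True}


def build_manc_loader(path):
    """Create a loader with boundary targets (illustrative settings)."""
    # Imported here so the sketch can be read and tested without torch_em installed.
    from torch_em.data.datasets.electron_microscopy.manc import get_manc_loader

    return get_manc_loader(
        path, patch_shape=(64, 256, 256), batch_size=2,
        download=True, boundaries=True,
        num_workers=4, shuffle=True,  # forwarded to the PyTorch DataLoader
    )
```

With this pattern, callers can configure sampling and data loading in a single call without having to construct the dataset separately.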