torch_em.data.datasets.electron_microscopy.manc
The MANC (Male Adult Nerve Cord) dataset contains a FIB-SEM volume of the Drosophila male ventral nerve cord with dense neuron instance segmentation.
It covers the full adult male nerve cord at 8 nm isotropic resolution with ~23,000 neurons reconstructed and proofread, including 10 million pre-synaptic sites and 74 million post-synaptic densities.
The EM volume is at gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg and the segmentation is at gs://manc-seg-v1p2/manc-seg-v1.2.
This dataset is from the publication https://doi.org/10.7554/eLife.89346. Please cite it if you use this dataset in your research.
The dataset is publicly available at https://www.janelia.org/project-team/flyem/manc-connectome. Streaming the data requires the cloud-volume package: pip install cloud-volume.
NOTE (on data size): the full volume is (46113, 59467, 82276) voxels at 8 nm isotropic resolution. Downloading the entire volume is not feasible. Data is instead accessed by specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates, streamed from GCS and cached locally as HDF5 files.
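To make the size note concrete, the arithmetic can be sketched as follows. The voxel widths are taken from the module source, which stores the raw data as uint8 and the labels as uint64; the helper names are illustrative and not part of torch_em. Under these assumptions the uncompressed EM volume alone comes to roughly 200 TiB, while a 1024³ box needs about 1 GiB for the raw data and 8 GiB for the labels before compression.

```python
# Full MANC volume extent in 8 nm voxels, as quoted in the note on data size.
FULL_SHAPE = (46113, 59467, 82276)


def n_voxels(shape):
    """Number of voxels in a volume of the given shape."""
    total = 1
    for extent in shape:
        total *= extent
    return total


def size_in_gib(shape, bytes_per_voxel):
    """Uncompressed size in GiB for the given number of bytes per voxel."""
    return n_voxels(shape) * bytes_per_voxel / 1024**3


full_raw_tib = size_in_gib(FULL_SHAPE, 1) / 1024  # uint8 EM data, whole volume
box_raw_gib = size_in_gib((1024, 1024, 1024), 1)  # uint8 EM data, one 1024^3 box
box_seg_gib = size_in_gib((1024, 1024, 1024), 8)  # uint64 labels, one 1024^3 box

print(f"full EM volume: ~{full_raw_tib:.0f} TiB uncompressed")
print(f"1024^3 box: {box_raw_gib:.0f} GiB raw + {box_seg_gib:.0f} GiB labels")
```

These figures are lower bounds on peak memory use, since the arrays are converted and transposed before they are written to disk.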
"""The MANC (Male Adult Nerve Cord) dataset contains a FIB-SEM volume of the
Drosophila male ventral nerve cord with dense neuron instance segmentation.

It covers the full adult male nerve cord at 8 nm isotropic resolution with
~23,000 neurons reconstructed and proofread, including 10 million pre-synaptic
sites and 74 million post-synaptic densities.

The EM volume is at gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg
and the segmentation is at gs://manc-seg-v1p2/manc-seg-v1.2.

This dataset is from the publication https://doi.org/10.7554/eLife.89346.
Please cite it if you use this dataset in your research.

The dataset is publicly available at https://www.janelia.org/project-team/flyem/manc-connectome.
Requires cloud-volume: pip install cloud-volume.

NOTE (on data size): the full volume is (46113, 59467, 82276) voxels at 8 nm isotropic
resolution. Downloading the entire volume is not feasible. Data is instead accessed by
specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel
coordinates, streamed from GCS and cached locally as HDF5 files.
"""

import hashlib
import os
from typing import List, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EM_URL = "gs://flyem-vnc-2-26-213dba213ef26e094c16c860ae7f4be0/v3_emdata_clahe_xy/jpeg"
SEG_URL = "gs://manc-seg-v1p2/manc-seg-v1.2"

# A representative 1024³-voxel subvolume near the centre of the reconstructed region.
# Units are 8 nm voxels in (x, y, z) order, matching the CloudVolume coordinate space.
DEFAULT_BOUNDING_BOX = (20000, 21024, 25000, 26024, 40000, 41024)


def _bbox_to_str(bbox):
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def get_manc_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
    """Stream a subvolume from the MANC dataset and cache it as an HDF5 file.

    Args:
        path: Filepath to a folder where the cached HDF5 file will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached HDF5 file.
    """
    import h5py

    os.makedirs(str(path), exist_ok=True)
    h5_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.h5")
    if os.path.exists(h5_path):
        return h5_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS."
        )

    try:
        import cloudvolume
    except ImportError:
        raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume")

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    print(f"Streaming MANC EM + segmentation for bbox {bounding_box} ...")

    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=0, progress=True)
    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)

    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)

    with h5py.File(h5_path, "w", locking=False) as f:
        f.attrs["bounding_box"] = bounding_box
        f.attrs["resolution_nm"] = em_vol.resolution.tolist()
        f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True)
        f.create_dataset("labels", data=labels.astype("uint64"), compression="gzip", chunks=True)

    print(f"Cached to {h5_path} (shape {raw.shape})")
    return h5_path


def get_manc_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to MANC HDF5 cache files.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX].
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached HDF5 files.
    """
    if bounding_boxes is None:
        bounding_boxes = [DEFAULT_BOUNDING_BOX]
    return [get_manc_data(path, bbox, download) for bbox in bounding_boxes]


def get_manc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
    """Get the MANC dataset for neuron instance segmentation.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_manc_paths(path, bounding_boxes, download)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
    kwargs, _ = util.add_instance_label_transform(
        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
    )

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )


def get_manc_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for neuron instance segmentation in the MANC dataset.

    Args:
        path: Filepath to a folder where the cached HDF5 files will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
            Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_manc_dataset(
        path, patch_shape, bounding_boxes=bounding_boxes, download=download,
        offsets=offsets, boundaries=boundaries, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
def get_manc_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
Stream a subvolume from the MANC dataset and cache it as an HDF5 file.
Arguments:
- path: Filepath to a folder where the cached HDF5 file will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a central 1024³ training region.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached HDF5 file.
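The cache file name is derived from the bounding box alone, so requesting the same box twice reuses the same HDF5 file. The sketch below mirrors the module's private `_bbox_to_str` with the standard library to show this; `bbox_cache_name`, `bbox_from_center`, `fetch_small_box`, and the `manc_cache` folder are illustrative names that are not part of torch_em. The actual fetch is wrapped in a function with a lazy import, so nothing is streamed unless it is called.

```python
import hashlib
import os


def bbox_cache_name(bbox):
    """Mirror of the module's `_bbox_to_str`: a 12-character MD5 prefix of the bbox values."""
    joined = "_".join(str(v) for v in bbox)
    return hashlib.md5(joined.encode()).hexdigest()[:12]


def bbox_from_center(center_xyz, size_xyz):
    """Hypothetical helper: build (x_min, x_max, y_min, y_max, z_min, z_max) around a center."""
    bbox = []
    for center, size in zip(center_xyz, size_xyz):
        start = center - size // 2
        bbox.extend([start, start + size])
    return tuple(bbox)


# A 256^3 box at the center of the default region: far cheaper to stream than 1024^3.
small_box = bbox_from_center((20512, 25512, 40512), (256, 256, 256))
cache_file = os.path.join("manc_cache", f"{bbox_cache_name(small_box)}.h5")
print(small_box, "->", cache_file)


def fetch_small_box(cache_dir="manc_cache"):
    """Stream the small box once; later calls return the cached file path."""
    # Imported lazily: torch_em, h5py and cloud-volume are only needed for the fetch.
    from torch_em.data.datasets.electron_microscopy.manc import get_manc_data

    return get_manc_data(cache_dir, bounding_box=small_box, download=True)
```

Because the function only checks whether the file exists, a file left behind by an interrupted write would be returned as if it were complete; deleting it by hand and fetching again is the likely remedy in that case.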
def get_manc_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
Get paths to MANC HDF5 cache files.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
- download: Whether to stream and cache the data if it is not present.
Returns:
List of filepaths to the cached HDF5 files.
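Since each bounding box is cached as its own HDF5 file, a larger region can be split into tiles that are fetched one at a time. The following is a minimal sketch of such a split; `tile_bounding_boxes` is a hypothetical helper, not a torch_em function.

```python
from itertools import product


def tile_bounding_boxes(region, tile_size):
    """Hypothetical helper: split a region into axis-aligned tiles.

    region: (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxels.
    tile_size: (x, y, z) edge lengths of a tile in voxels.
    Tiles at the far edge are clipped to the region.
    """
    axis_ranges = []
    for axis in range(3):
        lo, hi = region[2 * axis], region[2 * axis + 1]
        step = tile_size[axis]
        axis_ranges.append([(start, min(start + step, hi)) for start in range(lo, hi, step)])
    return [
        (x0, x1, y0, y1, z0, z1)
        for (x0, x1), (y0, y1), (z0, z1) in product(*axis_ranges)
    ]


region = (20000, 21024, 25000, 26024, 40000, 41024)  # DEFAULT_BOUNDING_BOX from the source
tiles = tile_bounding_boxes(region, (512, 512, 512))
print(len(tiles), tiles[0], tiles[-1])
```

The resulting list can be passed as `bounding_boxes` to `get_manc_paths`, which returns one cache file path per tile.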
def get_manc_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
Get the MANC dataset for neuron instance segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
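Bounding boxes are given in (x, y, z) order, whereas the cached arrays and `patch_shape` use (z, y, x), because the source transposes the streamed data before writing it. The sketch below makes the conversion explicit and checks that a patch fits into a box; `cached_shape_zyx`, `patch_fits`, `build_dataset`, and the `manc_cache` folder are illustrative names, and the dataset call is wrapped in a function so that nothing is streamed unless it is called.

```python
def cached_shape_zyx(bbox):
    """Shape (z, y, x) of the cached arrays for a bbox in (x, y, z) voxel order."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    return (z_max - z_min, y_max - y_min, x_max - x_min)


def patch_fits(patch_shape_zyx, bbox):
    """True if a (z, y, x) patch fits inside the cached volume for this bbox."""
    return all(p <= s for p, s in zip(patch_shape_zyx, cached_shape_zyx(bbox)))


# An anisotropic box: 512 x 256 x 128 voxels along x, y and z.
bbox = (20000, 20512, 25000, 25256, 40000, 40128)
print(cached_shape_zyx(bbox))
print(patch_fits((64, 256, 256), bbox))  # 64 voxels along z
print(patch_fits((256, 256, 64), bbox))  # 256 voxels along z, more than the box holds


def build_dataset(cache_dir="manc_cache"):
    """Build a boundary-segmentation dataset on the box above."""
    # Imported lazily: torch_em is only needed when the dataset is actually built.
    from torch_em.data.datasets.electron_microscopy.manc import get_manc_dataset

    return get_manc_dataset(
        cache_dir, patch_shape=(64, 256, 256), bounding_boxes=[bbox],
        download=True, boundaries=True,
    )
```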
def get_manc_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
Get the DataLoader for neuron instance segmentation in the MANC dataset.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] - a central 1024³ region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
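The extra keyword arguments are divided between the dataset and the DataLoader by `util.split_kwargs`. Its implementation is not shown here, so the sketch below is only an assumption about how such a split can work: arguments named in a function's signature go to that function, and everything else is passed on. `split_by_signature`, `make_dataset`, and `build_loader` are illustrative names, and `make_dataset` is a stand-in whose signature is made up for the example.

```python
import inspect


def split_by_signature(function, **kwargs):
    """Split kwargs into those named in `function`'s signature and the rest."""
    names = set(inspect.signature(function).parameters)
    accepted = {k: v for k, v in kwargs.items() if k in names}
    remaining = {k: v for k, v in kwargs.items() if k not in names}
    return accepted, remaining


def make_dataset(raw_paths, patch_shape, n_samples=None, sampler=None):
    """Hypothetical stand-in for a dataset factory; only its signature matters here."""
    return None


ds_kwargs, loader_kwargs = split_by_signature(
    make_dataset, n_samples=500, num_workers=4, shuffle=True
)
print(ds_kwargs)
print(loader_kwargs)


def build_loader(cache_dir="manc_cache"):
    """Build a loader; num_workers and shuffle are meant for the PyTorch DataLoader."""
    # Imported lazily: torch_em is only needed when the loader is actually built.
    from torch_em.data.datasets.electron_microscopy.manc import get_manc_loader

    return get_manc_loader(
        cache_dir, patch_shape=(64, 256, 256), batch_size=2,
        download=True, boundaries=True, num_workers=4, shuffle=True,
    )
```

With the default bounding box, the first call to `build_loader` streams a full 1024³ region, so passing a smaller box through `bounding_boxes` is advisable for a first test.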