torch_em.data.datasets.electron_microscopy.humanneurons
The Human Neurons (H01) dataset contains a petascale serial-section EM volume of human cerebral cortex with dense automated neuron instance segmentation (C3 release).
The volume covers ~1 mm³ of human temporal cortex at 4 x 4 x 33 nm resolution (~1.4 PB raw uncompressed). The C3 automated segmentation is provided at 8 x 8 x 33 nm resolution, covering the same physical region.
The data is hosted on Google Cloud Storage and described in: Shapson-Coe et al. (2021), https://www.biorxiv.org/content/10.1101/2021.05.29.446289v4. Please cite this publication if you use the dataset in your research.
NOTE: Accessing this dataset requires the cloud-volume package (pip install cloud-volume).
NOTE (on data size): the full volume is 515,892 x 356,400 x 5,293 voxels at 8 x 8 x 33 nm (~350 TB raw, ~1.4 PB at 4 nm). Downloading the entire volume is not feasible. Data is instead streamed and cached locally as HDF5 files by specifying bounding boxes (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates.
The volume is highly anisotropic: 8 nm in-plane (xy) and 33 nm in z. Patch shapes should account for this — e.g. patch_shape=(8, 512, 512) corresponds to a ~264 nm x 4 µm x 4 µm volume. The full z-extent is only 5,293 slices (~175 µm), so bounding boxes spanning the complete z range are feasible.
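The anisotropy arithmetic above can be checked directly. This sketch converts a (z, y, x) patch shape into its physical extent, assuming the 8 x 8 x 33 nm voxel size stated above:

```python
# Physical size of a (z, y, x) patch at 8 nm in-plane / 33 nm section thickness.
VOXEL_NM = (33, 8, 8)  # (z, y, x) voxel size in nm

def patch_extent_nm(patch_shape):
    """Return the physical extent of a (z, y, x) patch in nm."""
    return tuple(n * s for n, s in zip(patch_shape, VOXEL_NM))

print(patch_extent_nm((8, 512, 512)))    # (264, 4096, 4096): ~264 nm x 4 um x 4 um
print(patch_extent_nm((5293, 1, 1))[0])  # full z extent: 174669 nm, i.e. ~175 um
```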
1"""The Human Neurons (H01) dataset contains a petascale FIB-SEM volume of human cerebral 2cortex with dense automated neuron instance segmentation (C3 release). 3 4The volume covers ~1 mm³ of human temporal cortex at 4 x 4 x 33 nm resolution 5(~1.4 PB raw uncompressed). The C3 automated segmentation is provided at 8 x 8 x 33 nm 6resolution, covering the same physical region. 7 8The data is hosted on Google Cloud Storage and described in: 9Shapson-Coe et al. (2021), https://www.biorxiv.org/content/10.1101/2021.05.29.446289v4. 10Please cite this publication if you use the dataset in your research. 11 12NOTE: Accessing this dataset requires the `cloud-volume` package (pip install cloud-volume). 13 14NOTE (on data size): the full volume is 515,892 x 356,400 x 5,293 voxels at 8 x 8 x 33 nm 15(~350 TB raw, ~1.4 PB at 4 nm). Downloading the entire volume is not feasible. 16Data is instead streamed and cached locally as HDF5 files by specifying bounding boxes 17(x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 18 19The volume is highly anisotropic: 8 nm in-plane (xy) and 33 nm in z. Patch shapes should 20account for this — e.g. patch_shape=(8, 512, 512) corresponds to a ~264 nm x 4 µm x 4 µm 21volume. The full z-extent is only 5,293 slices (~175 µm), so bounding boxes spanning the 22complete z range are feasible. 23""" 24 25import hashlib 26import os 27from typing import List, Optional, Tuple, Union 28 29import numpy as np 30 31from torch.utils.data import DataLoader, Dataset 32 33import torch_em 34 35from .. import util 36 37 38EM_URL = "gs://h01-release/data/20210601/4nm_raw" 39SEG_URL = "gs://h01-release/data/20210601/c3" 40 41# A 2048 × 2048 × 64 subvolume (8 nm xy, 33 nm z) in a neuron-dense region of the cortex. 42# Physical size: ~16 µm × 16 µm × 2.1 µm. Units: 8 nm voxels in (x, y, z) order. 
43DEFAULT_BOUNDING_BOX = (271360, 273408, 201728, 203776, 2614, 2678) 44 45 46def _bbox_to_str(bbox): 47 """Create a short unique filename stem from a bounding box tuple.""" 48 key = "_".join(str(v) for v in bbox) 49 return hashlib.md5(key.encode()).hexdigest()[:12] 50 51 52def _fetch(cv, x_min, x_max, y_min, y_max, z_min, z_max): 53 """Fetch a subvolume and return it as a (z, y, x) array.""" 54 arr = np.array(cv[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0] 55 return arr.transpose(2, 1, 0) 56 57 58def get_humanneurons_data( 59 path: Union[os.PathLike, str], 60 bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX, 61 download: bool = False, 62) -> str: 63 """Stream a subvolume from the H01 Human Neurons dataset and cache it as an HDF5 file. 64 65 The HDF5 file contains: 66 - raw: EM grayscale (uint8, 8 nm xy / 33 nm z, z/y/x) 67 - labels: neuron instance segmentation (uint64, 8 nm xy / 33 nm z, z/y/x) 68 69 Both layers are stored at the same 8 x 8 x 33 nm resolution. The raw image is 70 fetched from the 4 nm source at mip=1 (native 8 nm downsampled scale). 71 72 Args: 73 path: Filepath to a folder where the cached HDF5 file will be saved. 74 bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) 75 in 8 nm voxel coordinates. Defaults to a 2048 x 2048 x 64 training region. 76 download: Whether to stream and cache the data if it is not present. 77 78 Returns: 79 The filepath to the cached HDF5 file. 80 """ 81 import h5py 82 83 os.makedirs(path, exist_ok=True) 84 85 stem = _bbox_to_str(bounding_box) 86 h5_path = os.path.join(path, f"{stem}.h5") 87 88 if os.path.exists(h5_path): 89 return h5_path 90 91 if not download: 92 raise RuntimeError( 93 f"No cached data found at '{h5_path}'. Set download=True to stream it from GCS." 94 ) 95 96 try: 97 import cloudvolume 98 except ImportError: 99 raise ImportError( 100 "The 'cloud-volume' package is required to access the Human Neurons dataset. 
" 101 "Install it with: 'pip install cloud-volume'." 102 ) 103 104 x_min, x_max, y_min, y_max, z_min, z_max = bounding_box 105 106 print(f"Streaming H01 Human Neurons EM + segmentation for bbox {bounding_box} ...") 107 108 # EM at mip=1 gives 8×8×33 nm — same resolution as the C3 segmentation at mip=0. 109 em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=1, progress=True) 110 seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True, fill_missing=True) 111 112 raw = _fetch(em_vol, x_min, x_max, y_min, y_max, z_min, z_max) 113 labels = _fetch(seg_vol, x_min, x_max, y_min, y_max, z_min, z_max) 114 115 # Relabel to consecutive integers so IDs fit in uint32 (required for napari and float32 training). 116 from skimage.segmentation import relabel_sequential 117 labels, _, _ = relabel_sequential(labels) 118 119 resolution_nm = em_vol.mip_resolution(1).tolist() # [8, 8, 33] nm 120 121 with h5py.File(h5_path, "w", locking=False) as f: 122 f.attrs["bounding_box"] = bounding_box 123 f.attrs["crop_size"] = raw.shape # (z, y, x) 124 f.attrs["resolution_nm"] = resolution_nm # [x, y, z] in nm 125 f.create_dataset("raw", data=raw.astype("uint8"), compression="gzip", chunks=True) 126 f.create_dataset("labels", data=labels.astype("uint32"), compression="gzip", chunks=True) 127 128 print(f"Cached to {h5_path} (raw {raw.shape}, labels {labels.shape})") 129 return h5_path 130 131 132def get_humanneurons_paths( 133 path: Union[os.PathLike, str], 134 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 135 download: bool = False, 136) -> List[str]: 137 """Get paths to the Human Neurons HDF5 cache files. 138 139 Args: 140 path: Filepath to a folder where the cached HDF5 files will be saved. 141 bounding_boxes: List of regions to fetch, each as 142 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 143 Defaults to [DEFAULT_BOUNDING_BOX]. 144 download: Whether to stream and cache the data if it is not present. 
145 146 Returns: 147 List of filepaths to the cached HDF5 files. 148 """ 149 if bounding_boxes is None: 150 bounding_boxes = [DEFAULT_BOUNDING_BOX] 151 152 return [get_humanneurons_data(path, bbox, download) for bbox in bounding_boxes] 153 154 155def get_humanneurons_dataset( 156 path: Union[os.PathLike, str], 157 patch_shape: Tuple[int, int, int], 158 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 159 download: bool = False, 160 offsets: Optional[List[List[int]]] = None, 161 boundaries: bool = False, 162 **kwargs, 163) -> Dataset: 164 """Get the Human Neurons (H01) dataset for neuron instance segmentation. 165 166 Args: 167 path: Filepath to a folder where the cached HDF5 files will be saved. 168 patch_shape: The patch shape (z, y, x) to use for training. 169 The volume is anisotropic (8 nm xy, 33 nm z), so small z values are typical, 170 e.g. patch_shape=(8, 512, 512). 171 bounding_boxes: List of subvolumes to use, each as 172 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 173 Defaults to [DEFAULT_BOUNDING_BOX] — a 2048 x 2048 x 64 cortex region. 174 download: Whether to stream and cache data if not already present. 175 offsets: Offset values for affinity computation used as target. 176 boundaries: Whether to compute boundaries as the target. 177 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`. 178 179 Returns: 180 The segmentation dataset. 
181 """ 182 assert len(patch_shape) == 3 183 184 paths = get_humanneurons_paths(path, bounding_boxes, download) 185 186 kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True) 187 kwargs, _ = util.add_instance_label_transform( 188 kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets 189 ) 190 191 return torch_em.default_segmentation_dataset( 192 raw_paths=paths, 193 raw_key="raw", 194 label_paths=paths, 195 label_key="labels", 196 patch_shape=patch_shape, 197 **kwargs, 198 ) 199 200 201def get_humanneurons_loader( 202 path: Union[os.PathLike, str], 203 patch_shape: Tuple[int, int, int], 204 batch_size: int, 205 bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, 206 download: bool = False, 207 offsets: Optional[List[List[int]]] = None, 208 boundaries: bool = False, 209 **kwargs, 210) -> DataLoader: 211 """Get the DataLoader for neuron instance segmentation in the H01 Human Neurons dataset. 212 213 Args: 214 path: Filepath to a folder where the cached HDF5 files will be saved. 215 patch_shape: The patch shape (z, y, x) to use for training. 216 The volume is anisotropic (8 nm xy, 33 nm z), so small z values are typical, 217 e.g. patch_shape=(8, 512, 512). 218 batch_size: The batch size for training. 219 bounding_boxes: List of subvolumes to use, each as 220 (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. 221 Defaults to [DEFAULT_BOUNDING_BOX] — a 2048 x 2048 x 64 cortex region. 222 download: Whether to stream and cache data if not already present. 223 offsets: Offset values for affinity computation used as target. 224 boundaries: Whether to compute boundaries as the target. 225 kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset` 226 or for the PyTorch DataLoader. 227 228 Returns: 229 The DataLoader. 
230 """ 231 ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs) 232 dataset = get_humanneurons_dataset( 233 path, patch_shape, bounding_boxes, download, offsets, boundaries, **ds_kwargs 234 ) 235 return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
Stream a subvolume from the H01 Human Neurons dataset and cache it as an HDF5 file.
The HDF5 file contains:
- raw: EM grayscale (uint8, 8 nm xy / 33 nm z, z/y/x)
- labels: neuron instance segmentation (uint32, 8 nm xy / 33 nm z, z/y/x)
Both layers are stored at the same 8 x 8 x 33 nm resolution. The raw image is fetched from the 4 nm source at mip=1 (native 8 nm downsampled scale).
Arguments:
- path: Filepath to a folder where the cached HDF5 file will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to a 2048 x 2048 x 64 training region.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached HDF5 file.
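The cache filename is derived deterministically from the bounding box: the coordinates are joined with underscores, hashed with md5, and the hash is truncated to 12 hex characters (this mirrors the module's `_bbox_to_str` helper). Repeated calls with the same box therefore reuse the same file. A minimal sketch of that naming scheme:

```python
import hashlib

def cache_stem(bbox):
    """Short, stable filename stem for a bounding-box tuple."""
    key = "_".join(str(v) for v in bbox)
    return hashlib.md5(key.encode()).hexdigest()[:12]

bbox = (271360, 273408, 201728, 203776, 2614, 2678)
stem = cache_stem(bbox)
print(stem, len(stem))  # 12-character hex stem, identical on every run
```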
Get paths to the Human Neurons HDF5 cache files.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX].
- download: Whether to stream and cache the data if it is not present.
Returns:
List of filepaths to the cached HDF5 files.
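To train on more than the default region, pass several bounding boxes; each is cached as its own HDF5 file. One way to build such a list, sketched here with illustrative coordinates, is to tile a larger area into non-overlapping boxes of fixed size:

```python
def tile_bboxes(x0, y0, z0, nx, ny, size_xy=2048, size_z=64):
    """Generate (x_min, x_max, y_min, y_max, z_min, z_max) tuples tiling an nx-by-ny grid."""
    boxes = []
    for i in range(nx):
        for j in range(ny):
            boxes.append((
                x0 + i * size_xy, x0 + (i + 1) * size_xy,
                y0 + j * size_xy, y0 + (j + 1) * size_xy,
                z0, z0 + size_z,
            ))
    return boxes

# Four adjacent 2048 x 2048 x 64 boxes starting at the default region's corner.
bboxes = tile_bboxes(271360, 201728, 2614, nx=2, ny=2)
# paths = get_humanneurons_paths("./h01_cache", bounding_boxes=bboxes, download=True)
```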
Get the Human Neurons (H01) dataset for neuron instance segmentation.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training. The volume is anisotropic (8 nm xy, 33 nm z), so small z values are typical, e.g. patch_shape=(8, 512, 512).
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a 2048 x 2048 x 64 cortex region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
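The offsets argument switches the target from instance labels to affinities. For anisotropic data like this, offsets are usually kept shorter along z than in-plane; the list below is an illustrative choice in (z, y, x) order, not a default prescribed by the library:

```python
# Example anisotropic affinity offsets in (z, y, x) order: direct neighbors,
# then longer-range offsets restricted mostly to the xy plane (33 nm sections).
offsets = [
    [-1, 0, 0], [0, -1, 0], [0, 0, -1],   # direct neighbors
    [-2, 0, 0], [0, -3, 0], [0, 0, -3],   # short- to mid-range
    [0, -9, 0], [0, 0, -9],               # long range, in-plane only
]
assert all(len(o) == 3 for o in offsets)
# dataset = get_humanneurons_dataset("./h01_cache", patch_shape=(8, 512, 512),
#                                    offsets=offsets, download=True)
```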
Get the DataLoader for neuron instance segmentation in the H01 Human Neurons dataset.
Arguments:
- path: Filepath to a folder where the cached HDF5 files will be saved.
- patch_shape: The patch shape (z, y, x) to use for training. The volume is anisotropic (8 nm xy, 33 nm z), so small z values are typical, e.g. patch_shape=(8, 512, 512).
- batch_size: The batch size for training.
- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 8 nm voxel coordinates. Defaults to [DEFAULT_BOUNDING_BOX] — a 2048 x 2048 x 64 cortex region.
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
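Internally, keyword arguments are routed by util.split_kwargs: anything matching a parameter of torch_em.default_segmentation_dataset goes to the dataset, and the rest is forwarded to the PyTorch DataLoader. A rough sketch of that routing idea using only the standard library (the helper and stand-in function here are illustrative, not torch_em internals):

```python
import inspect

def split_kwargs(function, **kwargs):
    """Split kwargs into (those accepted by `function`, the rest)."""
    accepted = set(inspect.signature(function).parameters)
    fn_kwargs = {k: v for k, v in kwargs.items() if k in accepted}
    rest = {k: v for k, v in kwargs.items() if k not in accepted}
    return fn_kwargs, rest

def make_dataset(patch_shape, ndim=3):  # stand-in for default_segmentation_dataset
    return (patch_shape, ndim)

ds_kwargs, loader_kwargs = split_kwargs(make_dataset, ndim=3, num_workers=4, shuffle=True)
print(ds_kwargs)      # {'ndim': 3} goes to the dataset
print(loader_kwargs)  # {'num_workers': 4, 'shuffle': True} goes to the DataLoader
```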