torch_em.data.datasets.electron_microscopy.fafb
The FAFB (Full Adult Fly Brain) dataset contains a serial-section TEM volume of the full adult female Drosophila brain with dense neuron instance segmentation from FlyWire.
The EM volume (FAFB v14) was acquired with serial-section TEM (ssTEM). The native 4 x 4 x 40 nm mip level is a placeholder with no data, so the finest available EM is at mip=2 (16 x 16 x 40 nm). This matches the resolution of the FlyWire neuron segmentation (materialization v783, Nature 2024 paper) exactly.
Bounding boxes are specified in 16 x 16 x 40 nm voxel coordinates (x_min, x_max, y_min, y_max, z_min, z_max). Valid coordinate overlap between EM (mip=2) and seg: x=[5100,59200], y=[1440,29600], z=[16,7062].
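A bounding box can be checked against the valid overlap before any data is streamed. The sketch below is a minimal illustration: `check_bounding_box` and `VALID_RANGE` are hypothetical names that are not part of torch_em, and it assumes the stated ranges are limits that both ends of each axis interval must stay within. It returns the crop size in voxels.

```python
# Valid overlap between EM (mip=2) and segmentation, copied from the text above.
VALID_RANGE = {"x": (5100, 59200), "y": (1440, 29600), "z": (16, 7062)}


def check_bounding_box(bbox):
    """Validate (x_min, x_max, y_min, y_max, z_min, z_max) and return its (x, y, z) size."""
    if len(bbox) != 6:
        raise ValueError(f"Expected 6 values, got {len(bbox)}")
    sizes = []
    # Even positions hold the lower bounds, odd positions the upper bounds.
    for axis, lo, hi in zip("xyz", bbox[0::2], bbox[1::2]):
        valid_lo, valid_hi = VALID_RANGE[axis]
        if not lo < hi:
            raise ValueError(f"{axis}: min ({lo}) must be smaller than max ({hi})")
        if lo < valid_lo or hi > valid_hi:
            raise ValueError(f"{axis}: [{lo}, {hi}] is outside [{valid_lo}, {valid_hi}]")
        sizes.append(hi - lo)
    return tuple(sizes)


# The central brain crop used as the module default.
print(check_bounding_box((31000, 33048, 14500, 16548, 3200, 4019)))
```

Running a check like this first avoids starting a long download for a region that has no EM or no segmentation.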
The EM is at gs://microns-seunglab/drosophila_v0/alignment/image_rechunked (mip=2) and the neuron segmentation (v783) is at gs://flywire_v141_m783.
This dataset is from the publication https://doi.org/10.1038/s41586-024-07558-y. Please cite it if you use this dataset in your research.
The dataset is publicly available at https://flywire.ai. Requires cloud-volume: pip install cloud-volume.
NOTE (on data size): the full seg volume is (54100, 28160, 7046) voxels at 16 x 16 x 40 nm. Downloading the entire volume is not feasible. Data is streamed from GCS and cached locally as zarr v3 stores by specifying bounding boxes.
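To make the size argument concrete, the following sketch estimates the uncompressed footprint of a region. It assumes one byte per raw voxel (uint8) and eight bytes per label voxel (uint64), which are the dtypes the module uses for its cached arrays; `uncompressed_bytes` is a hypothetical helper.

```python
from math import prod

SEG_SHAPE_XYZ = (54100, 28160, 7046)  # full segmentation extent, from the note above
DEFAULT_CROP_XYZ = (2048, 2048, 819)  # size of one default crop


def uncompressed_bytes(shape, raw_itemsize=1, label_itemsize=8):
    """Bytes needed to hold raw plus labels for a region of the given voxel shape."""
    return prod(shape) * (raw_itemsize + label_itemsize)


for name, shape in [("full volume", SEG_SHAPE_XYZ), ("default crop", DEFAULT_CROP_XYZ)]:
    gigabytes = uncompressed_bytes(shape) / 1e9
    print(f"{name}: {prod(shape):.3e} voxels, {gigabytes:.1f} GB uncompressed")
```

The full volume comes to roughly 1e13 voxels, on the order of 100 TB uncompressed, whereas a single default crop is a few tens of GB before compression. Even one default crop therefore needs a machine with plenty of memory.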
NOTE (AA): The annotations are excellent, but I personally think the segmentation resolution is too low. If we want to use this dataset, we should go one resolution level higher (we are currently at s2).
"""The FAFB (Full Adult Fly Brain) dataset contains a serial-section TEM volume of the
full adult female Drosophila brain with dense neuron instance segmentation from FlyWire.

The EM (FAFB v14) is a ssTEM dataset. The native 4 x 4 x 40 nm mip level is a
placeholder with no data - the finest available EM is at mip=2 (16 x 16 x 40 nm),
which matches the FlyWire neuron segmentation (materialization v783, Nature 2024 paper)
resolution exactly. Both are stored at 16 x 16 x 40 nm.

Bounding boxes are specified in 16 x 16 x 40 nm voxel coordinates
(x_min, x_max, y_min, y_max, z_min, z_max).
Valid coordinate overlap between EM (mip=2) and seg: x=[5100,59200], y=[1440,29600], z=[16,7062].

The EM is at gs://microns-seunglab/drosophila_v0/alignment/image_rechunked (mip=2) and
the neuron segmentation (v783) is at gs://flywire_v141_m783.

This dataset is from the publication https://doi.org/10.1038/s41586-024-07558-y.
Please cite it if you use this dataset in your research.

The dataset is publicly available at https://flywire.ai.
Requires cloud-volume: pip install cloud-volume.

NOTE (on data size): the full seg volume is (54100, 28160, 7046) voxels at 16 x 16 x 40 nm.
Downloading the entire volume is not feasible. Data is streamed from GCS and cached
locally as zarr v3 stores by specifying bounding boxes.

NOTE (AA): The data annotations are amazing, I personally think that the segmentation
resolution is too low. If we wanna use it, we should go one resolution higher
(we are at s2 atm).
"""

import hashlib
import os
from typing import List, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EM_URL = "gs://microns-seunglab/drosophila_v0/alignment/image_rechunked"
SEG_URL = "gs://flywire_v141_m783"
# mip=2 gives 16x16x40nm, matching the seg resolution; mip=0 is a placeholder with no data.
EM_MIP = 2

# Four 2048x2048x819-voxel crops sampling different brain regions.
# At 16x16x40 nm this gives ~32x32x32 um physically isotropic subvolumes.
DEFAULT_BOUNDING_BOXES = [
    (6000, 8048, 2000, 4048, 500, 1319),  # anterior-left, low z
    (31000, 33048, 14500, 16548, 3200, 4019),  # central brain
    (56000, 58048, 26500, 28548, 5800, 6619),  # posterior-right, high z
    (15000, 17048, 8000, 10048, 6100, 6919),  # mid-left, high z
]
DEFAULT_BOUNDING_BOX = DEFAULT_BOUNDING_BOXES[1]

FAFB_CHUNK_SHAPE = (64, 256, 256)


def _bbox_to_str(bbox):
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def _create_array(root, name, shape, dtype, is_label):
    from zarr.codecs import BloscCodec
    shuffle = "bitshuffle" if (np.issubdtype(dtype, np.integer) and is_label) else "shuffle"
    return root.create_array(
        name,
        shape=shape,
        chunks=FAFB_CHUNK_SHAPE,
        dtype=dtype,
        compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
    )


def get_fafb_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
    """Stream a subvolume from the FAFB dataset and cache it as a zarr v3 store.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in 16 nm voxel coordinates. Defaults to a 2048x2048x819 central brain crop.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.zarr")

    root = zarr.open_group(zarr_path, mode="a")
    if "raw" in root and "labels" in root:
        return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from GCS."
        )

    try:
        import cloudvolume
    except ImportError:
        raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume")

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    print(f"Streaming FAFB EM + FlyWire segmentation for bbox {bounding_box} ...")

    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=EM_MIP, progress=True)
    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)

    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)

    # FlyWire IDs are large uint64 values - relabel to consecutive integers.
    _, labels = np.unique(labels, return_inverse=True)
    labels = labels.reshape(raw.shape).astype("uint64")

    shape = tuple(min(r, l) for r, l in zip(raw.shape, labels.shape))
    raw = raw[:shape[0], :shape[1], :shape[2]]
    labels = labels[:shape[0], :shape[1], :shape[2]]

    root.attrs["bounding_box"] = list(bounding_box)
    root.attrs["resolution_nm"] = [16, 16, 40]

    if "raw" not in root:
        ds_raw = _create_array(root, "raw", shape, np.dtype("uint8"), is_label=False)
        ds_raw[:] = raw
    if "labels" not in root:
        ds_lbl = _create_array(root, "labels", shape, np.dtype("uint64"), is_label=True)
        ds_lbl[:] = labels

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_fafb_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to FAFB zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES (4 crops).
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    if bounding_boxes is None:
        bounding_boxes = DEFAULT_BOUNDING_BOXES
    return [get_fafb_data(path, bbox, download) for bbox in bounding_boxes]


def get_fafb_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
    """Get the FAFB dataset for neuron instance segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 isotropic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_fafb_paths(path, bounding_boxes, download)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
    kwargs, _ = util.add_instance_label_transform(
        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
    )

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )


def get_fafb_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for neuron instance segmentation in the FAFB dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 isotropic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_fafb_dataset(
        path, patch_shape, bounding_boxes=bounding_boxes,
        download=download, offsets=offsets, boundaries=boundaries, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
def get_fafb_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
    """Stream a subvolume from the FAFB dataset and cache it as a zarr v3 store.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in 16 nm voxel coordinates. Defaults to a 2048x2048x819 central brain crop.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.zarr")

    root = zarr.open_group(zarr_path, mode="a")
    if "raw" in root and "labels" in root:
        return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from GCS."
        )

    try:
        import cloudvolume
    except ImportError:
        raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume")

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    print(f"Streaming FAFB EM + FlyWire segmentation for bbox {bounding_box} ...")

    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=EM_MIP, progress=True)
    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)

    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)

    # FlyWire IDs are large uint64 values - relabel to consecutive integers.
    _, labels = np.unique(labels, return_inverse=True)
    labels = labels.reshape(raw.shape).astype("uint64")

    shape = tuple(min(r, l) for r, l in zip(raw.shape, labels.shape))
    raw = raw[:shape[0], :shape[1], :shape[2]]
    labels = labels[:shape[0], :shape[1], :shape[2]]

    root.attrs["bounding_box"] = list(bounding_box)
    root.attrs["resolution_nm"] = [16, 16, 40]

    if "raw" not in root:
        ds_raw = _create_array(root, "raw", shape, np.dtype("uint8"), is_label=False)
        ds_raw[:] = raw
    if "labels" not in root:
        ds_lbl = _create_array(root, "labels", shape, np.dtype("uint64"), is_label=True)
        ds_lbl[:] = labels

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path
Stream a subvolume from the FAFB dataset and cache it as a zarr v3 store.
Arguments:
- path: Filepath to a folder where the cached zarr store will be saved.
- bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to a 2048x2048x819 central brain crop.
- download: Whether to stream and cache the data if it is not present.
Returns:
The filepath to the cached zarr store.
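The subvolume is sliced in (x, y, z) order and then transposed, so the cached raw and labels arrays are indexed as (z, y, x) relative to the bounding-box origin. A minimal sketch of that mapping follows; `global_to_cache_index` is a hypothetical helper for illustration and not part of the module.

```python
def global_to_cache_index(point_xyz, bbox):
    """Map a global (x, y, z) voxel coordinate to a (z, y, x) index into a cached crop."""
    x, y, z = point_xyz
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    inside = x_min <= x < x_max and y_min <= y < y_max and z_min <= z < z_max
    if not inside:
        raise ValueError(f"Point {point_xyz} lies outside bounding box {bbox}")
    # Subtract the crop origin, then reverse the axis order to match the stored layout.
    return (z - z_min, y - y_min, x - x_min)


central_crop = (31000, 33048, 14500, 16548, 3200, 4019)
print(global_to_cache_index((31010, 14520, 3230), central_crop))  # (30, 20, 10)
```

This is useful when looking up a location known from a FlyWire viewer at 16 x 16 x 40 nm in the locally cached arrays.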
def get_fafb_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to FAFB zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES (4 crops).
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    if bounding_boxes is None:
        bounding_boxes = DEFAULT_BOUNDING_BOXES
    return [get_fafb_data(path, bbox, download) for bbox in bounding_boxes]
Get paths to FAFB zarr stores.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES (4 crops).
- download: Whether to stream and cache the data if it is not present.
Returns:
List of filepaths to the cached zarr stores.
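Each bounding box maps to its own store, named after a short MD5 digest of the box coordinates. The sketch below mirrors the module's private `_bbox_to_str` helper so that a cache folder can be inspected without streaming anything; `expected_cache_path` is a hypothetical name and the folder used here is only an example.

```python
import hashlib
import os


def expected_cache_path(path, bbox):
    """Return the zarr store path that corresponds to one bounding box."""
    # Join the six coordinates with underscores and keep the first 12 hex digits.
    digest = hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]
    return os.path.join(str(path), f"{digest}.zarr")


boxes = [
    (31000, 33048, 14500, 16548, 3200, 4019),
    (6000, 8048, 2000, 4048, 500, 1319),
]
for bbox in boxes:
    store = expected_cache_path("fafb_cache", bbox)
    print(bbox, "->", store, "(cached)" if os.path.exists(store) else "(missing)")
```

Because the name depends only on the coordinates, requesting the same box again reuses the existing store, while changing any coordinate creates a new one.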
def get_fafb_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
    """Get the FAFB dataset for neuron instance segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 isotropic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_fafb_paths(path, bounding_boxes, download)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
    kwargs, _ = util.add_instance_label_transform(
        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
    )

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )
Get the FAFB dataset for neuron instance segmentation.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
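- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 crops, each spanning roughly 32 x 32 x 32 um, so the crops are physically cubic although the voxels are anisotropic.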
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:
The segmentation dataset.
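Since the voxels are anisotropic (40 nm in z, 16 nm in y and x), it can help to check the physical extent of a candidate patch_shape before training. The sketch below is only an illustration; `patch_extent_um` is a hypothetical helper and the example shape is an arbitrary choice, not a recommendation from the module.

```python
VOXEL_SIZE_NM_ZYX = (40, 16, 16)  # cached arrays are stored in (z, y, x) order


def patch_extent_um(patch_shape):
    """Return the physical size of a (z, y, x) patch in micrometers."""
    if len(patch_shape) != 3:
        raise ValueError("patch_shape must have three entries (z, y, x)")
    return tuple(n * size / 1000 for n, size in zip(patch_shape, VOXEL_SIZE_NM_ZYX))


print(patch_extent_um((32, 256, 256)))  # (1.28, 4.096, 4.096)
```

A patch with fewer planes in z than pixels in y and x, as in this example, covers a more balanced physical region than a cubic voxel shape would.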
def get_fafb_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for neuron instance segmentation in the FAFB dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 isotropic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_fafb_dataset(
        path, patch_shape, bounding_boxes=bounding_boxes,
        download=download, offsets=offsets, boundaries=boundaries, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)
Get the DataLoader for neuron instance segmentation in the FAFB dataset.
Arguments:
- path: Filepath to a folder where the cached zarr stores will be saved.
- patch_shape: The patch shape (z, y, x) to use for training.
- batch_size: The batch size for training.
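- bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 crops, each spanning roughly 32 x 32 x 32 um, so the crops are physically cubic although the voxels are anisotropic.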
- download: Whether to stream and cache data if not already present.
- offsets: Offset values for affinity computation used as target.
- boundaries: Whether to compute boundaries as the target.
- kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:
The DataLoader.
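A typical call might look like the sketch below. It is a usage illustration under assumptions: `build_fafb_loader` is a hypothetical wrapper, the small 512x512x64 bounding box and the patch shape are arbitrary example values chosen to keep the download modest, and `num_workers` and `shuffle` are assumed to be forwarded to the PyTorch DataLoader as the kwargs description states.

```python
def build_fafb_loader(cache_dir, download=False):
    """Build a boundary-segmentation loader on one small FAFB crop."""
    # Imported lazily so this sketch can be defined without torch_em installed.
    from torch_em.data.datasets.electron_microscopy.fafb import get_fafb_loader

    return get_fafb_loader(
        path=cache_dir,
        patch_shape=(32, 256, 256),  # (z, y, x)
        batch_size=2,
        # One 512x512x64-voxel region inside the central brain crop (example values).
        bounding_boxes=[(31000, 31512, 14500, 15012, 3200, 3264)],
        download=download,
        boundaries=True,
        num_workers=4,  # forwarded to the DataLoader
        shuffle=True,  # forwarded to the DataLoader
    )
```

Calling `build_fafb_loader("./fafb_cache", download=True)` would stream the region on first use, which requires cloud-volume and network access; later calls with the same bounding box read from the local cache.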