torch_em.data.datasets.electron_microscopy.fafb

The FAFB (Full Adult Fly Brain) dataset contains a serial-section TEM volume of the full adult female Drosophila brain with dense neuron instance segmentation from FlyWire.

The EM volume (FAFB v14) is an ssTEM dataset. Its native 4 x 4 x 40 nm mip level is a placeholder with no data; the finest available EM is at mip=2 (16 x 16 x 40 nm), which exactly matches the resolution of the FlyWire neuron segmentation (materialization v783, Nature 2024 paper). Both are stored at 16 x 16 x 40 nm.

Bounding boxes are specified in 16 x 16 x 40 nm voxel coordinates (x_min, x_max, y_min, y_max, z_min, z_max). Valid coordinate overlap between EM (mip=2) and seg: x=[5100,59200], y=[1440,29600], z=[16,7062].
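
The coordinate convention above can be checked with a small helper. This is a minimal sketch, not part of the module: `check_bounding_box`, `VALID_RANGE` and `VOXEL_SIZE_NM` are hypothetical names, and treating both ends of the valid range as inclusive limits is an assumption.

```python
# Valid overlap between EM (mip=2) and segmentation, in 16 x 16 x 40 nm voxels.
# Treating both ends as inclusive limits is an assumption of this sketch.
VALID_RANGE = {"x": (5100, 59200), "y": (1440, 29600), "z": (16, 7062)}
VOXEL_SIZE_NM = {"x": 16, "y": 16, "z": 40}


def check_bounding_box(bbox):
    """Return the physical size (x, y, z) in micrometers, or raise ValueError."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    extents = {"x": (x_min, x_max), "y": (y_min, y_max), "z": (z_min, z_max)}
    size_um = []
    for axis, (lo, hi) in extents.items():
        valid_lo, valid_hi = VALID_RANGE[axis]
        if not (valid_lo <= lo < hi <= valid_hi):
            raise ValueError(
                f"{axis} range [{lo}, {hi}) is outside [{valid_lo}, {valid_hi}]"
            )
        # Voxel extent times voxel size, converted from nm to um.
        size_um.append((hi - lo) * VOXEL_SIZE_NM[axis] / 1000.0)
    return tuple(size_um)


# The default central brain crop is roughly 32.8 x 32.8 x 32.8 um.
print(check_bounding_box((31000, 33048, 14500, 16548, 3200, 4019)))
```

Running such a check before streaming avoids requesting a region where only one of the two volumes has data.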

The EM is at gs://microns-seunglab/drosophila_v0/alignment/image_rechunked (mip=2) and the neuron segmentation (v783) is at gs://flywire_v141_m783.

This dataset is from the publication https://doi.org/10.1038/s41586-024-07558-y. Please cite it if you use this dataset in your research.

The dataset is publicly available at https://flywire.ai. Requires cloud-volume: pip install cloud-volume.

NOTE (on data size): the full seg volume is (54100, 28160, 7046) voxels at 16 x 16 x 40 nm. Downloading the entire volume is not feasible. Data is streamed from GCS and cached locally as zarr v3 stores by specifying bounding boxes.
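
To make the size argument concrete, the following sketch estimates uncompressed sizes from the shapes quoted above. It assumes one byte per EM voxel (uint8) and eight bytes per label voxel (uint64), which are the dtypes used for the cached arrays; `uncompressed_gb` is a hypothetical helper name.

```python
FULL_SHAPE_XYZ = (54100, 28160, 7046)   # full segmentation volume
CROP_SHAPE_XYZ = (2048, 2048, 819)      # one default crop


def n_voxels(shape):
    """Total number of voxels for a 3D shape."""
    x, y, z = shape
    return x * y * z


def uncompressed_gb(shape, bytes_per_voxel):
    """Uncompressed size in gigabytes (10**9 bytes)."""
    return n_voxels(shape) * bytes_per_voxel / 1e9


full_voxels = n_voxels(FULL_SHAPE_XYZ)
crop_voxels = n_voxels(CROP_SHAPE_XYZ)

print(f"full volume: {full_voxels:.3e} voxels")
print(f"  EM (uint8):      {uncompressed_gb(FULL_SHAPE_XYZ, 1):,.0f} GB")
print(f"  labels (uint64): {uncompressed_gb(FULL_SHAPE_XYZ, 8):,.0f} GB")
print(f"default crop: {crop_voxels:.3e} voxels")
print(f"  EM (uint8):      {uncompressed_gb(CROP_SHAPE_XYZ, 1):.1f} GB")
print(f"  labels (uint64): {uncompressed_gb(CROP_SHAPE_XYZ, 8):.1f} GB")
```

The full volume comes to roughly 10^13 voxels, so even a single default crop is several gigabytes before compression.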

NOTE (AA): The annotations are excellent, but I personally think that the segmentation resolution is too low. If we want to use this dataset, we should go one resolution level higher (we are currently at s2).

import hashlib
import os
from typing import List, Optional, Tuple, Union

import numpy as np
from torch.utils.data import DataLoader, Dataset

import torch_em
from .. import util


EM_URL = "gs://microns-seunglab/drosophila_v0/alignment/image_rechunked"
SEG_URL = "gs://flywire_v141_m783"
# mip=2 gives 16x16x40nm, matching the seg resolution; mip=0 is a placeholder with no data.
EM_MIP = 2

# Four 2048x2048x819-voxel crops sampling different brain regions.
# At 16x16x40 nm this gives ~32x32x32 um physically isotropic subvolumes.
DEFAULT_BOUNDING_BOXES = [
    (6000, 8048, 2000, 4048, 500, 1319),  # anterior-left, low z
    (31000, 33048, 14500, 16548, 3200, 4019),  # central brain
    (56000, 58048, 26500, 28548, 5800, 6619),  # posterior-right, high z
    (15000, 17048, 8000, 10048, 6100, 6919),  # mid-left, high z
]
DEFAULT_BOUNDING_BOX = DEFAULT_BOUNDING_BOXES[1]

FAFB_CHUNK_SHAPE = (64, 256, 256)


def _bbox_to_str(bbox):
    return hashlib.md5("_".join(str(v) for v in bbox).encode()).hexdigest()[:12]


def _create_array(root, name, shape, dtype, is_label):
    from zarr.codecs import BloscCodec
    shuffle = "bitshuffle" if (np.issubdtype(dtype, np.integer) and is_label) else "shuffle"
    return root.create_array(
        name,
        shape=shape,
        chunks=FAFB_CHUNK_SHAPE,
        dtype=dtype,
        compressors=BloscCodec(cname="zstd", clevel=6, shuffle=shuffle),
    )


def get_fafb_data(
    path: Union[os.PathLike, str],
    bounding_box: Tuple[int, int, int, int, int, int] = DEFAULT_BOUNDING_BOX,
    download: bool = False,
) -> str:
    """Stream a subvolume from the FAFB dataset and cache it as a zarr v3 store.

    Args:
        path: Filepath to a folder where the cached zarr store will be saved.
        bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max)
            in 16 x 16 x 40 nm voxel coordinates. Defaults to a 2048x2048x819 central brain crop.
        download: Whether to stream and cache the data if it is not present.

    Returns:
        The filepath to the cached zarr store.
    """
    import zarr

    os.makedirs(str(path), exist_ok=True)
    zarr_path = os.path.join(str(path), f"{_bbox_to_str(bounding_box)}.zarr")

    root = zarr.open_group(zarr_path, mode="a")
    if "raw" in root and "labels" in root:
        return zarr_path

    if not download:
        raise RuntimeError(
            f"No cached data found at '{zarr_path}'. Set download=True to stream it from GCS."
        )

    try:
        import cloudvolume
    except ImportError as e:
        raise ImportError("The 'cloud-volume' package is required: pip install cloud-volume") from e

    x_min, x_max, y_min, y_max, z_min, z_max = bounding_box
    print(f"Streaming FAFB EM + FlyWire segmentation for bbox {bounding_box} ...")

    em_vol = cloudvolume.CloudVolume(EM_URL, use_https=True, mip=EM_MIP, progress=True)
    seg_vol = cloudvolume.CloudVolume(SEG_URL, use_https=True, mip=0, progress=True)

    # CloudVolume returns (x, y, z, channel); drop the channel axis and reorder to (z, y, x).
    raw = np.array(em_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)
    labels = np.array(seg_vol[x_min:x_max, y_min:y_max, z_min:z_max])[..., 0].transpose(2, 1, 0)

    # FlyWire IDs are large uint64 values - relabel to consecutive integers.
    # Reshape to the label volume's own shape, which may differ from the raw shape.
    label_shape = labels.shape
    _, labels = np.unique(labels, return_inverse=True)
    labels = labels.reshape(label_shape).astype("uint64")

    # Crop both volumes to their common shape in case the fetched extents differ.
    shape = tuple(min(r, l) for r, l in zip(raw.shape, labels.shape))
    raw = raw[:shape[0], :shape[1], :shape[2]]
    labels = labels[:shape[0], :shape[1], :shape[2]]

    root.attrs["bounding_box"] = list(bounding_box)
    root.attrs["resolution_nm"] = [16, 16, 40]

    if "raw" not in root:
        ds_raw = _create_array(root, "raw", shape, np.dtype("uint8"), is_label=False)
        ds_raw[:] = raw
    if "labels" not in root:
        ds_lbl = _create_array(root, "labels", shape, np.dtype("uint64"), is_label=True)
        ds_lbl[:] = labels

    print(f"Cached to {zarr_path} (shape {shape})")
    return zarr_path


def get_fafb_paths(
    path: Union[os.PathLike, str],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
) -> List[str]:
    """Get paths to FAFB zarr stores.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        bounding_boxes: List of regions to fetch, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES (4 crops).
        download: Whether to stream and cache the data if it is not present.

    Returns:
        List of filepaths to the cached zarr stores.
    """
    if bounding_boxes is None:
        bounding_boxes = DEFAULT_BOUNDING_BOXES
    return [get_fafb_data(path, bbox, download) for bbox in bounding_boxes]


def get_fafb_dataset(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> Dataset:
    """Get the FAFB dataset for neuron instance segmentation.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 physically cubic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`.

    Returns:
        The segmentation dataset.
    """
    assert len(patch_shape) == 3

    paths = get_fafb_paths(path, bounding_boxes, download)

    kwargs = util.update_kwargs(kwargs, "is_seg_dataset", True)
    kwargs, _ = util.add_instance_label_transform(
        kwargs, add_binary_target=False, boundaries=boundaries, offsets=offsets
    )

    return torch_em.default_segmentation_dataset(
        raw_paths=paths,
        raw_key="raw",
        label_paths=paths,
        label_key="labels",
        patch_shape=patch_shape,
        **kwargs,
    )


def get_fafb_loader(
    path: Union[os.PathLike, str],
    patch_shape: Tuple[int, int, int],
    batch_size: int,
    bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None,
    download: bool = False,
    offsets: Optional[List[List[int]]] = None,
    boundaries: bool = False,
    **kwargs,
) -> DataLoader:
    """Get the DataLoader for neuron instance segmentation in the FAFB dataset.

    Args:
        path: Filepath to a folder where the cached zarr stores will be saved.
        patch_shape: The patch shape (z, y, x) to use for training.
        batch_size: The batch size for training.
        bounding_boxes: List of subvolumes to use, each as
            (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates.
            Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 physically cubic crops.
        download: Whether to stream and cache data if not already present.
        offsets: Offset values for affinity computation used as target.
        boundaries: Whether to compute boundaries as the target.
        kwargs: Additional keyword arguments for `torch_em.default_segmentation_dataset`
            or for the PyTorch DataLoader.

    Returns:
        The DataLoader.
    """
    ds_kwargs, loader_kwargs = util.split_kwargs(torch_em.default_segmentation_dataset, **kwargs)
    dataset = get_fafb_dataset(
        path, patch_shape, bounding_boxes=bounding_boxes,
        download=download, offsets=offsets, boundaries=boundaries, **ds_kwargs
    )
    return torch_em.get_data_loader(dataset, batch_size, **loader_kwargs)

EM_URL = 'gs://microns-seunglab/drosophila_v0/alignment/image_rechunked'
SEG_URL = 'gs://flywire_v141_m783'
EM_MIP = 2
DEFAULT_BOUNDING_BOXES = [(6000, 8048, 2000, 4048, 500, 1319), (31000, 33048, 14500, 16548, 3200, 4019), (56000, 58048, 26500, 28548, 5800, 6619), (15000, 17048, 8000, 10048, 6100, 6919)]
DEFAULT_BOUNDING_BOX = (31000, 33048, 14500, 16548, 3200, 4019)
FAFB_CHUNK_SHAPE = (64, 256, 256)
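
As a rough illustration of what the chunk shape means for a cached crop, the sketch below counts the zarr chunks per array. It assumes the chunk shape is given in (z, y, x) order, matching the transposed arrays written by the module; `chunk_grid` is a hypothetical helper.

```python
import math

FAFB_CHUNK_SHAPE = (64, 256, 256)  # assumed (z, y, x) order


def chunk_grid(shape_zyx, chunks=FAFB_CHUNK_SHAPE):
    """Number of chunks along each axis; partial chunks at the border count as one."""
    return tuple(math.ceil(s / c) for s, c in zip(shape_zyx, chunks))


# A default crop of 2048 x 2048 x 819 voxels (x, y, z) is stored as (819, 2048, 2048).
grid = chunk_grid((819, 2048, 2048))
n_chunks = math.prod(grid)
print(grid, n_chunks)
```

With these numbers each default crop is split into 13 x 8 x 8 = 832 chunks per array, and the last chunk along z is only partially filled.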
def get_fafb_data(path: Union[os.PathLike, str], bounding_box: Tuple[int, int, int, int, int, int] = (31000, 33048, 14500, 16548, 3200, 4019), download: bool = False) -> str:

Stream a subvolume from the FAFB dataset and cache it as a zarr v3 store.

Arguments:
  • path: Filepath to a folder where the cached zarr store will be saved.
  • bounding_box: The region to fetch as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to a 2048x2048x819 central brain crop.
  • download: Whether to stream and cache the data if it is not present.
Returns:

The filepath to the cached zarr store.
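
The cached store is named after a hash of the bounding box, so each distinct box maps to its own store. The sketch below mirrors the private helper `_bbox_to_str` from the module source; `bbox_to_cache_name` is a hypothetical wrapper added for illustration.

```python
import hashlib


def bbox_to_cache_name(bbox):
    """Derive the zarr store name from a bounding box (first 12 hex digits of its MD5)."""
    key = "_".join(str(v) for v in bbox)
    return hashlib.md5(key.encode()).hexdigest()[:12] + ".zarr"


name_a = bbox_to_cache_name((31000, 33048, 14500, 16548, 3200, 4019))
name_b = bbox_to_cache_name((6000, 8048, 2000, 4048, 500, 1319))
print(name_a, name_b)
```

Because the name depends only on the six coordinates, calling the function again with the same box reuses the existing store, while any change to the box triggers a new download.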

def get_fafb_paths(path: Union[os.PathLike, str], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False) -> List[str]:

Get paths to FAFB zarr stores.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • bounding_boxes: List of regions to fetch, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES (4 crops).
  • download: Whether to stream and cache the data if it is not present.
Returns:

List of filepaths to the cached zarr stores.
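
Custom bounding boxes can be built from crop centers. This is a minimal sketch under the assumption that you want crops of the same size as the defaults; `crop_around` and `CROP_SIZE_XYZ` are hypothetical names, not part of the module.

```python
CROP_SIZE_XYZ = (2048, 2048, 819)  # same extent as the default crops


def crop_around(center_xyz, size_xyz=CROP_SIZE_XYZ):
    """Build (x_min, x_max, y_min, y_max, z_min, z_max) around a center voxel."""
    bbox = []
    for center, size in zip(center_xyz, size_xyz):
        lo = center - size // 2
        bbox.extend([lo, lo + size])
    return tuple(bbox)


# This center reproduces the default central brain crop.
central = crop_around((32024, 15524, 3609))
print(central)

# A list of such boxes can then be passed as `bounding_boxes`.
custom_boxes = [central, crop_around((16024, 9024, 6509))]
```

The resulting boxes should still be checked against the valid EM/segmentation overlap before downloading.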

def get_fafb_dataset(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataset.Dataset:

Get the FAFB dataset for neuron instance segmentation.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 crops that are physically cubic (about 32 um per side).
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset.
Returns:

The segmentation dataset.
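
Note that bounding boxes are given in (x, y, z) order while `patch_shape` uses (z, y, x). The sketch below converts between the two and checks that a patch fits into a crop; both helpers are hypothetical and only illustrate the axis convention.

```python
def bbox_to_zyx_shape(bbox):
    """Shape of the cached array, in (z, y, x) order, for a bounding box in x/y/z order."""
    x_min, x_max, y_min, y_max, z_min, z_max = bbox
    return (z_max - z_min, y_max - y_min, x_max - x_min)


def patch_fits(patch_shape_zyx, bbox):
    """True if a (z, y, x) patch is no larger than the crop along every axis."""
    return all(p <= s for p, s in zip(patch_shape_zyx, bbox_to_zyx_shape(bbox)))


bbox = (31000, 33048, 14500, 16548, 3200, 4019)
shape = bbox_to_zyx_shape(bbox)
print(shape)
print(patch_fits((32, 256, 256), bbox))
print(patch_fits((1024, 256, 256), bbox))
```

A patch with more than 819 planes along z does not fit into a default crop, even though the crop is 2048 voxels wide in x and y.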

def get_fafb_loader(path: Union[os.PathLike, str], patch_shape: Tuple[int, int, int], batch_size: int, bounding_boxes: Optional[List[Tuple[int, int, int, int, int, int]]] = None, download: bool = False, offsets: Optional[List[List[int]]] = None, boundaries: bool = False, **kwargs) -> torch.utils.data.dataloader.DataLoader:

Get the DataLoader for neuron instance segmentation in the FAFB dataset.

Arguments:
  • path: Filepath to a folder where the cached zarr stores will be saved.
  • patch_shape: The patch shape (z, y, x) to use for training.
  • batch_size: The batch size for training.
  • bounding_boxes: List of subvolumes to use, each as (x_min, x_max, y_min, y_max, z_min, z_max) in 16 x 16 x 40 nm voxel coordinates. Defaults to DEFAULT_BOUNDING_BOXES - four 2048x2048x819 crops that are physically cubic (about 32 um per side).
  • download: Whether to stream and cache data if not already present.
  • offsets: Offset values for affinity computation used as target.
  • boundaries: Whether to compute boundaries as the target.
  • kwargs: Additional keyword arguments for torch_em.default_segmentation_dataset or for the PyTorch DataLoader.
Returns:

The DataLoader.
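
A possible way to call the loader with affinity targets is sketched below. The offset values, the patch shape and the batch size are illustrative assumptions rather than values prescribed by this module, and `make_offsets` and `make_loader` are hypothetical helpers. The import of `torch_em` is kept inside the function so that the offsets can be inspected without it.

```python
def make_offsets(long_range=None):
    """Direct-neighbor offsets in (z, y, x) order, optionally with one long-range step per axis."""
    offsets = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]
    if long_range is not None:
        dz, dy, dx = long_range
        offsets += [[-dz, 0, 0], [0, -dy, 0], [0, 0, -dx]]
    return offsets


def make_loader(path, download=False):
    """Create a FAFB loader with affinity targets (requires torch_em and cached data)."""
    from torch_em.data.datasets.electron_microscopy.fafb import get_fafb_loader

    return get_fafb_loader(
        path,
        patch_shape=(32, 256, 256),  # (z, y, x); an illustrative choice
        batch_size=1,
        offsets=make_offsets(long_range=(2, 4, 4)),
        download=download,
    )


print(make_offsets(long_range=(2, 4, 4)))
```

Passing `boundaries=True` instead of `offsets` would request boundary targets; setting `download=True` on the first call streams the default crops, which can take a long time.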