Large-Scale Connectome Analytics in the Cloud

Author: Stephan Gerhard

Published: October 9, 2023

Goals

Learn how to access and analyze large-scale connectome datasets hosted on BrainCircuits.io remotely in Python using DuckDB.

Requirements

First, we need to install DuckDB in our Python environment:

pip install duckdb
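To verify that the installation worked, you can print the installed version:

import duckdb
print(duckdb.__version__)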

Datasets

Every dataset on BrainCircuits.io has a unique string identifier. You can find the identifier in the Dataset Description overview or by inspecting the folders of the source data directly.

For the following tutorial we are going to use the public fruitfly FlyWire dataset with identifier fruitfly_fafb_flywire_public. Please make sure to cite the appropriate publications when using this data. You can find relevant links in the About widget of the dataset.

Data Model Overview

The static connectome dataset consists of a number of Parquet files. You can see all the files for the FlyWire dataset here: https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/

In each dataset folder, you will find a DATASET.txt file that lists, for every file, the contained columns and their data types, together with the total number of records.
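Alternatively, the same schema information can be obtained with DuckDB itself; a minimal, self-contained sketch using DESCRIBE on one of the remote Parquet files:

import duckdb

con = duckdb.connect()
# Print column names and types of a remote Parquet file without downloading it
print(con.query(
    "DESCRIBE SELECT * FROM "
    "'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/neurons.parquet'"
).df())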

The following table briefly describes the content of each type of file:

Filename Content
neurons.parquet Each neuron (or segment) in the dataset, with a flexible set of columns providing more information about individual neurons
segment_link.parquet An aggregate edge table that lists the synaptic counts between source and target segments, with additional columns containing statistics about the connection (e.g. avg_scores is the average of the individual synaptic link scores)
segment_size.parquet Size information about a segment, such as the number of contained supervoxels or the total number of voxels
segment_nucleus.parquet For each segment, the number of associated nuclei. Should be exactly one in an ideal world with perfect segmentations.
segment_neurotransmitter.parquet Summary of the most likely neurotransmitter per segment, based on filtering of presynaptic locations and their individual neurotransmitter predictions
segment_link_pre.parquet Number of presynaptic locations and downstream partner segments for each segment
segment_link_post.parquet Number of postsynaptic locations and upstream partner segments for each segment
link.parquet/*.parquet Same information as in the synapse_link.parquet file, but split into individual Parquet files for faster access
neurotransmitter.parquet/*.parquet The neurotransmitter predictions for each individual synaptic link
nucleus.parquet/*.parquet The location and size of each individual predicted nucleus
synapse_link.parquet Each individual predicted synaptic link between a presynaptic and a postsynaptic location with associated scores.
skeleton/l2skeletons/skeletons.parquet Summary statistics about each exported l2skeleton in the dataset.
skeleton/l2skeletons/skeleton_nodes.parquet Individual skeleton nodes for each segment.

Query the Data

Using DuckDB, these files can be queried remotely as tables in an SQL query. If you want to go into more depth, you can read the DuckDB documentation. Here, we demonstrate a few common queries to get started. To improve the readability of the queries, we store the common part of the URL in a baseurl variable:

import navis
import duckdb
baseurl = 'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public'

First, we need a DuckDB connection object:

con = duckdb.connect()
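Reading Parquet files over HTTPS relies on DuckDB's httpfs extension. Recent DuckDB versions load it automatically on demand; if yours does not, it can be installed and loaded explicitly:

# Install and load the extension for reading files over HTTP(S)
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")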

We can easily query neurons by a search string. To retrieve all neurons whose label column contains the string BPN (case-insensitive), we use the following query:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/neurons.parquet' 
               WHERE label ILIKE '%BPN%'
               LIMIT 5""").df()
df
segment_id x y z soma_x soma_y soma_z nucleus_id sv_id flow ... morphology_group top_nt_conf cell_type label size_um3 nr_pre nr_downstream_partner nr_post nr_upstream_partner nucleus_nr
0 720575940624402173 101863 31209 4323 128144 23056 3021 4367706 77618374632589286 efferent ... 0.958113 DNpe053 DNp_R; Part of comprehensive neck connective t... 2581.600098 67526 29564 16139 2792 1
1 720575940628278654 122736 35863 4010 127752 24256 3265 4373018 79026093046706888 intrinsic ... SMPpm1_2 0.844157 orphan-BPN-output 195.360001 1812 958 544 205 1
2 720575940630006267 122528 36529 4022 122384 28984 3898 3574415 79026161766195332 intrinsic ... SMPpm1_4 0.870921 new_clone_4; right; new_clone_4; right; BPN-li... 298.239990 2913 1261 1272 385 1
3 720575940627098249 140000 23144 3126 140000 23144 3126 5577486 80221468276838352 intrinsic ... SMPp&v1_posterior_1 0.795091 SMPp&v1A_H01 orphan-BPN-output 1241.989990 10747 4063 6845 1085 1
4 720575940618558365 162592 77144 1981 162592 77144 1981 6912408 81773222580344887 intrinsic ... 0.845792 orphan-BPN-output 1819.329956 20429 7895 13329 2115 1

5 rows × 30 columns

Let’s retrieve the top 5 neurons with more than 1000 downstream partners, sorted by the number of downstream partners in descending order. We can use the following query and retrieve the results directly as a Pandas dataframe:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/neurons.parquet' 
               WHERE nr_downstream_partner > 1000 
               ORDER BY nr_downstream_partner DESC
               LIMIT 5""").df()
df
segment_id x y z soma_x soma_y soma_z nucleus_id sv_id flow ... morphology_group top_nt_conf cell_type label size_um3 nr_pre nr_downstream_partner nr_post nr_upstream_partner nucleus_nr
0 720575940621280688 156482 42275 3896 179416 51040 3014 7396453 81348673921237569 intrinsic ... 0.771267 APL; APL-LHS; APL-LHS; Mushroom Body 11388.240234 146255 51620 113071 7052 1
1 720575940615970783 158906 64213 4347 127904 49928 1289 4492836 81561292049294153 intrinsic ... 0.655521 CT1; CT1; gabaergic 16315.629883 247006 49134 144276 17505 1
2 720575940629174889 107560 34947 4823 83352 47240 3701 1813248 78040862042737270 intrinsic ... 0.795262 APL; Mushroom Body; APL-RHS; APL-RHS 11757.730469 163407 48184 125189 6953 1
3 720575940621675174 133999 56333 1847 134280 51872 1179 5058804 79801523353597754 intrinsic ... 0.621499 CT1; CT1 15729.919922 246711 47962 144967 18119 1
4 720575940626637002 121547 58007 2243 144464 50760 762 5741632 78957235929387973 intrinsic ... 0.859750 None 6673.060059 139542 42196 67281 10190 1

5 rows × 30 columns

We can inspect all the columns in the data frame with:

df.columns
Index(['segment_id', 'x', 'y', 'z', 'soma_x', 'soma_y', 'soma_z', 'nucleus_id',
       'sv_id', 'flow', 'side', 'super_class', 'cell_class', 'cell_sub_class',
       'ito_lee_hemilineage', 'hartenstein_hemilineage', 'nerve', 'fbbt_id',
       'top_nt', 'status', 'morphology_group', 'top_nt_conf', 'cell_type',
       'label', 'size_um3', 'nr_pre', 'nr_downstream_partner', 'nr_post',
       'nr_upstream_partner', 'nucleus_nr'],
      dtype='object')

Let’s display only the number of downstream partners and presynaptic locations:

df[['segment_id', 'nr_downstream_partner', 'nr_pre']]
segment_id nr_downstream_partner nr_pre
0 720575940621280688 51620 146255
1 720575940615970783 49134 247006
2 720575940629174889 48184 163407
3 720575940621675174 47962 246711
4 720575940626637002 42196 139542

We see that segment 720575940621280688 has 51620 downstream partners and 146255 presynaptic locations.

Let’s store this segment ID in a variable:

segment_id = 720575940621280688

Let’s get all the downstream partners of this segment:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/segment_link.parquet' 
               WHERE src = {segment_id}
               ORDER BY dst DESC""").df()
df
src dst count min_scores max_scores avg_scores min_cleft_scores max_cleft_scores avg_cleft_scores
0 720575940621280688 720575940660053121 2 109.382713 190.875397 150 141 152 146
1 720575940621280688 720575940660047489 43 7.637001 742.377136 192 1 178 114
2 720575940621280688 720575940659940225 41 5.718618 443.282257 172 0 168 120
3 720575940621280688 720575940659850113 23 9.563767 700.225708 174 0 168 104
4 720575940621280688 720575940659765121 31 9.567139 1110.103516 161 9 173 129
... ... ... ... ... ... ... ... ... ...
51615 720575940621280688 720575940379705726 2 20.213730 29.862347 25 144 166 155
51616 720575940621280688 720575940379590257 1 100.210541 100.210541 100 28 28 28
51617 720575940621280688 720575940379312979 1 130.753403 130.753403 131 150 150 150
51618 720575940621280688 720575940379312467 1 337.480469 337.480469 337 144 144 144
51619 720575940621280688 0 5 23.942980 425.107513 148 0 168 92

51620 rows × 9 columns

We can extract the list of all downstream segment IDs from the dataframe:

downstream_segments = df['dst'].tolist()
print(downstream_segments[:5])
[720575940660053121, 720575940660047489, 720575940659940225, 720575940659850113, 720575940659765121]
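Because DuckDB can join across multiple remote Parquet files, we can also look up the labels of the strongest partners in a single query. A sketch, assuming the columns shown above (src, dst and count in segment_link.parquet; segment_id and label in neurons.parquet):

df = con.query(f"""SELECT l.dst, l."count" AS nr_synapses, n.label
               FROM '{baseurl}/segment_link.parquet' AS l
               LEFT JOIN '{baseurl}/neurons.parquet' AS n
                      ON n.segment_id = l.dst
               WHERE l.src = {segment_id}
               ORDER BY nr_synapses DESC
               LIMIT 10""").df()
df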

Let’s get the L2 skeleton of this segment:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/skeleton/l2skeletons/skeleton_nodes.parquet' 
               WHERE segment_id = {segment_id}""").df()
df
segment_id x y z node_id parent
0 720575940621280688 506832 182256 49640 0 6
1 720575940621280688 507232 181232 52400 1 5
2 720575940621280688 511488 155344 87240 2 8
3 720575940621280688 510192 173536 60120 3 9
4 720575940621280688 508864 175056 59200 4 3
... ... ... ... ... ... ...
11343 720575940621280688 720592 199120 123600 11343 11342
11344 720575940621280688 721520 201552 120360 11344 11346
11345 720575940621280688 724288 202160 123520 11345 11344
11346 720575940621280688 724352 205840 120680 11346 11347
11347 720575940621280688 720944 205328 125480 11347 11340

11348 rows × 6 columns

We notice that this query is very slow. One option to speed things up is to download the entire skeleton_nodes.parquet file (180 MB) and run the query locally:

!wget -O /tmp/skeleton_nodes.parquet \
    https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/skeleton/l2skeletons/skeleton_nodes.parquet 
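If wget is not available, the same file can be downloaded from within Python:

from urllib.request import urlretrieve

# Download the Parquet file once and reuse it for local queries
urlretrieve(
    'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/'
    'skeleton/l2skeletons/skeleton_nodes.parquet',
    '/tmp/skeleton_nodes.parquet')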

The same query now takes only a fraction of a second.

df = con.query(f"""SELECT * 
               FROM '/tmp/skeleton_nodes.parquet' 
               WHERE segment_id = {segment_id}""").df()

If we are interested in advanced neuroanatomical analyses of this neuron based on its skeletonized representation, we can use the NAVis package for neuron analysis and visualization.

In order to do that, we convert the DataFrame into a format that can be parsed by the SWC reader of NAVis.

# NAVis' SWC reader expects node_id, parent_id, x, y, z plus label and radius columns
df2 = df.rename(columns={'parent': 'parent_id'})
df2['label'] = 0
df2['radius'] = 0

Then we convert the DataFrame to a navis.TreeNeuron to get some basic statistics.

swc = navis.read_swc(df2)
print(swc)
type                      navis.TreeNeuron
name                                   SWC
n_nodes                              11348
n_connectors                          None
n_branches                            1158
n_leafs                               1761
cable_length                    42474824.0
soma                                  None
units                      1 dimensionless
created_at      2023-10-19 18:01:31.521072
origin                           DataFrame
dtype: object

See the Tutorials section on the NAVis website for an overview of further options for processing the skeleton.
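For instance, a quick 2D rendering of the skeleton (a minimal sketch; navis.plot2d returns a matplotlib figure and axes):

import matplotlib.pyplot as plt

# Render the TreeNeuron as a 2D projection
fig, ax = navis.plot2d(swc)
plt.show()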

Going back to synaptic data, we can now also retrieve presynaptic locations for this segment with the following query:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE pre_segment_id = {segment_id}
               LIMIT 5""").df()
df
id pre_x pre_y pre_z post_x post_y post_z scores cleft_scores cleft_id pre_segment_id post_segment_id
0 21210 135698 41219 1850 135665 41212 1849 170.457611 141 9915144 720575940621280688 720575940507977460
1 21211 135709 41240 1849 135680 41243 1849 326.895081 147 9915144 720575940621280688 720575940507981812
2 21212 135719 41259 1848 135723 41279 1849 298.207092 100 0 720575940621280688 720575940507984372
3 21219 135709 41226 1852 135683 41234 1852 127.513809 68 0 720575940621280688 720575940605817522
4 22149 137649 41189 1863 137652 41207 1862 487.758301 148 10150849 720575940621280688 720575940472242925

Or the postsynaptic locations:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE post_segment_id = {segment_id}
               LIMIT 5""").df()
df
id pre_x pre_y pre_z post_x post_y post_z scores cleft_scores cleft_id pre_segment_id post_segment_id
0 21253 135822 41325 1862 135843 41297 1862 10.482904 0 0 720575940507992820 720575940621280688
1 21277 135924 41320 1873 135938 41299 1873 910.730591 142 9915241 720575940636560730 720575940621280688
2 21284 135971 41583 1872 135968 41552 1872 286.582794 149 9915238 720575940649691513 720575940621280688
3 21297 135999 41502 1873 135986 41526 1873 117.003525 83 0 720575940636560730 720575940621280688
4 21344 136810 41444 1849 136790 41461 1849 45.242020 92 0 720575940629397079 720575940621280688

It is also possible to query synaptic locations across a set of segments. Here, we count more than 800,000 locations with a single query:

df = con.query(f"""SELECT count(*) 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE pre_segment_id in ('720575940621280688','720575940615970783','720575940629174889','720575940621675174')""").df()
df
count_star()
0 803379
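We can also aggregate per segment in the same query, for example to count the number of synaptic links originating from each of these four segments:

df = con.query(f"""SELECT pre_segment_id, count(*) AS nr_links
               FROM '{baseurl}/synapse_link.parquet'
               WHERE pre_segment_id IN (720575940621280688, 720575940615970783,
                                        720575940629174889, 720575940621675174)
               GROUP BY pre_segment_id
               ORDER BY nr_links DESC""").df()
df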