```python
import navis
import duckdb

baseurl = 'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public'
```
Goals
Learn how to access and analyze large-scale connectome datasets hosted on BrainCircuits.io remotely in Python using DuckDB.
Requirements
First, we need to install DuckDB into our Python environment:

```bash
pip install duckdb
```
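A quick sanity check that the installation worked (optional; any recent DuckDB release should work for this tutorial):

```python
import duckdb

# Print the installed DuckDB version.
print(duckdb.__version__)
```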
Datasets
Every dataset on BrainCircuits.io has a unique string identifier. You can find the identifier in the Dataset Description overview or by inspecting the folders of the source data directly.
For the following tutorial we are going to use the public fruit fly FlyWire dataset with the identifier fruitfly_fafb_flywire_public. Please make sure to cite the appropriate publications when using this data; you can find the relevant links in the About widget of the dataset.
Data Model Overview
The static connectome dataset consists of a number of Parquet files. You can see all the files for the FlyWire dataset here: https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/
In each dataset folder you will find a DATASET.txt file, which lists for every file the contained columns and their data types, together with the total number of records.
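As an alternative to reading DATASET.txt, the schema of any of these files can also be inspected directly with DuckDB's DESCRIBE statement. A minimal sketch, using the neurons.parquet file as an example:

```python
# List column names and types of a remote Parquet file without downloading it.
duckdb.sql(f"DESCRIBE SELECT * FROM '{baseurl}/neurons.parquet'").show()
```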
The following table briefly describes the content of each type of file:

| Filename | Content |
|---|---|
| neurons.parquet | Each neuron (or segment) in the dataset, with an arbitrary set of columns providing more information about individual neurons |
| segment_link.parquet | An aggregated edge table listing the synaptic counts between source and target segments, with additional columns containing statistics about the connection (e.g. avg_scores is the average score value of the individual synaptic link scores) |
| segment_size.parquet | Size information about a segment, such as the number of contained supervoxels or the total number of voxels |
| segment_nucleus.parquet | For each segment, a count of the associated nuclei. In an ideal world with perfect segmentations, this should be exactly one. |
| segment_neurotransmitter.parquet | Summary of the most likely neurotransmitter per segment, based on filtering presynaptic locations and their individual neurotransmitter predictions |
| segment_link_pre.parquet | Number of presynaptic locations and downstream partner segments for each segment |
| segment_link_post.parquet | Number of postsynaptic locations and upstream partner segments for each segment |
| link.parquet/*.parquet | Same information as in the synapse_link.parquet file, but split into individual Parquet files for faster access |
| neurotransmitter.parquet/*.parquet | The neurotransmitter predictions for each individual synaptic link |
| nucleus.parquet/*.parquet | The location and size of each individual predicted nucleus |
| synapse_link.parquet | Each individual predicted synaptic link between a presynaptic and a postsynaptic location, with associated scores |
| skeleton/l2skeletons/skeletons.parquet | Summary statistics about each exported L2 skeleton in the dataset |
| skeleton/l2skeletons/skeleton_nodes.parquet | Individual skeleton nodes for each segment |
Query the Data
Using DuckDB, these files can be used remotely as tables in an SQL query. If you want to go into more depth, you can read the DuckDB documentation. Here, we demonstrate a few common queries to get started. To improve the readability of the queries, we stored part of the URL in the baseurl variable defined at the top of this tutorial.
First, we need a DuckDB connection object:
```python
con = duckdb.connect()
```
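Note that querying remote files over HTTPS relies on DuckDB's httpfs extension. Recent DuckDB versions load it automatically; on older versions it may need to be installed and loaded once explicitly:

```python
# Only needed if your DuckDB version does not autoload the httpfs extension.
con.execute("INSTALL httpfs; LOAD httpfs;")
```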
We can easily query neurons by a search string. If we want to retrieve all neurons that contain the string BPN (case-insensitive) in their label column, we use the following query:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/neurons.parquet'
                   WHERE label ILIKE '%BPN%'
                   LIMIT 5""").df()
df
```
segment_id | x | y | z | soma_x | soma_y | soma_z | nucleus_id | sv_id | flow | ... | morphology_group | top_nt_conf | cell_type | label | size_um3 | nr_pre | nr_downstream_partner | nr_post | nr_upstream_partner | nucleus_nr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 720575940624402173 | 101863 | 31209 | 4323 | 128144 | 23056 | 3021 | 4367706 | 77618374632589286 | efferent | ... | 0.958113 | DNpe053 | DNp_R; Part of comprehensive neck connective t... | 2581.600098 | 67526 | 29564 | 16139 | 2792 | 1 | |
1 | 720575940628278654 | 122736 | 35863 | 4010 | 127752 | 24256 | 3265 | 4373018 | 79026093046706888 | intrinsic | ... | SMPpm1_2 | 0.844157 | orphan-BPN-output | 195.360001 | 1812 | 958 | 544 | 205 | 1 | |
2 | 720575940630006267 | 122528 | 36529 | 4022 | 122384 | 28984 | 3898 | 3574415 | 79026161766195332 | intrinsic | ... | SMPpm1_4 | 0.870921 | new_clone_4; right; new_clone_4; right; BPN-li... | 298.239990 | 2913 | 1261 | 1272 | 385 | 1 | |
3 | 720575940627098249 | 140000 | 23144 | 3126 | 140000 | 23144 | 3126 | 5577486 | 80221468276838352 | intrinsic | ... | SMPp&v1_posterior_1 | 0.795091 | SMPp&v1A_H01 | orphan-BPN-output | 1241.989990 | 10747 | 4063 | 6845 | 1085 | 1 |
4 | 720575940618558365 | 162592 | 77144 | 1981 | 162592 | 77144 | 1981 | 6912408 | 81773222580344887 | intrinsic | ... | 0.845792 | orphan-BPN-output | 1819.329956 | 20429 | 7895 | 13329 | 2115 | 1 |
5 rows × 30 columns
Let’s retrieve the first 5 neurons with more than 1000 downstream partners, sorted by their number of downstream partners. We can use the following query and retrieve the results directly as a Pandas DataFrame:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/neurons.parquet'
                   WHERE nr_downstream_partner > 1000
                   ORDER BY nr_downstream_partner DESC
                   LIMIT 5""").df()
df
```
segment_id | x | y | z | soma_x | soma_y | soma_z | nucleus_id | sv_id | flow | ... | morphology_group | top_nt_conf | cell_type | label | size_um3 | nr_pre | nr_downstream_partner | nr_post | nr_upstream_partner | nucleus_nr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 720575940621280688 | 156482 | 42275 | 3896 | 179416 | 51040 | 3014 | 7396453 | 81348673921237569 | intrinsic | ... | 0.771267 | APL; APL-LHS; APL-LHS; Mushroom Body | 11388.240234 | 146255 | 51620 | 113071 | 7052 | 1 | ||
1 | 720575940615970783 | 158906 | 64213 | 4347 | 127904 | 49928 | 1289 | 4492836 | 81561292049294153 | intrinsic | ... | 0.655521 | CT1; CT1; gabaergic | 16315.629883 | 247006 | 49134 | 144276 | 17505 | 1 | ||
2 | 720575940629174889 | 107560 | 34947 | 4823 | 83352 | 47240 | 3701 | 1813248 | 78040862042737270 | intrinsic | ... | 0.795262 | APL; Mushroom Body; APL-RHS; APL-RHS | 11757.730469 | 163407 | 48184 | 125189 | 6953 | 1 | ||
3 | 720575940621675174 | 133999 | 56333 | 1847 | 134280 | 51872 | 1179 | 5058804 | 79801523353597754 | intrinsic | ... | 0.621499 | CT1; CT1 | 15729.919922 | 246711 | 47962 | 144967 | 18119 | 1 | ||
4 | 720575940626637002 | 121547 | 58007 | 2243 | 144464 | 50760 | 762 | 5741632 | 78957235929387973 | intrinsic | ... | 0.859750 | None | 6673.060059 | 139542 | 42196 | 67281 | 10190 | 1 |
5 rows × 30 columns
We can inspect all the columns in the data frame with:
```python
df.columns
```
Index(['segment_id', 'x', 'y', 'z', 'soma_x', 'soma_y', 'soma_z', 'nucleus_id',
'sv_id', 'flow', 'side', 'super_class', 'cell_class', 'cell_sub_class',
'ito_lee_hemilineage', 'hartenstein_hemilineage', 'nerve', 'fbbt_id',
'top_nt', 'status', 'morphology_group', 'top_nt_conf', 'cell_type',
'label', 'size_um3', 'nr_pre', 'nr_downstream_partner', 'nr_post',
'nr_upstream_partner', 'nucleus_nr'],
dtype='object')
Let’s display only the number of downstream partners and presynaptic locations:
```python
df[['segment_id', 'nr_downstream_partner', 'nr_pre']]
```
segment_id | nr_downstream_partner | nr_pre | |
---|---|---|---|
0 | 720575940621280688 | 51620 | 146255 |
1 | 720575940615970783 | 49134 | 247006 |
2 | 720575940629174889 | 48184 | 163407 |
3 | 720575940621675174 | 47962 | 246711 |
4 | 720575940626637002 | 42196 | 139542 |
We see that segment 720575940621280688 has 51620 downstream partners and 146255 presynaptic locations. Let’s store this segment ID in a variable:

```python
segment_id = 720575940621280688
```
Let’s get all the downstream partners of this segment:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/segment_link.parquet'
                   WHERE src = {segment_id}
                   ORDER BY dst DESC""").df()
df
```
src | dst | count | min_scores | max_scores | avg_scores | min_cleft_scores | max_cleft_scores | avg_cleft_scores | |
---|---|---|---|---|---|---|---|---|---|
0 | 720575940621280688 | 720575940660053121 | 2 | 109.382713 | 190.875397 | 150 | 141 | 152 | 146 |
1 | 720575940621280688 | 720575940660047489 | 43 | 7.637001 | 742.377136 | 192 | 1 | 178 | 114 |
2 | 720575940621280688 | 720575940659940225 | 41 | 5.718618 | 443.282257 | 172 | 0 | 168 | 120 |
3 | 720575940621280688 | 720575940659850113 | 23 | 9.563767 | 700.225708 | 174 | 0 | 168 | 104 |
4 | 720575940621280688 | 720575940659765121 | 31 | 9.567139 | 1110.103516 | 161 | 9 | 173 | 129 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
51615 | 720575940621280688 | 720575940379705726 | 2 | 20.213730 | 29.862347 | 25 | 144 | 166 | 155 |
51616 | 720575940621280688 | 720575940379590257 | 1 | 100.210541 | 100.210541 | 100 | 28 | 28 | 28 |
51617 | 720575940621280688 | 720575940379312979 | 1 | 130.753403 | 130.753403 | 131 | 150 | 150 | 150 |
51618 | 720575940621280688 | 720575940379312467 | 1 | 337.480469 | 337.480469 | 337 | 144 | 144 | 144 |
51619 | 720575940621280688 | 0 | 5 | 23.942980 | 425.107513 | 148 | 0 | 168 | 92 |
51620 rows × 9 columns
And we get the list of all downstream segments from the dataframe:
```python
downstream_segments = df['dst'].tolist()
print(downstream_segments[:5])
```
[720575940660053121, 720575940660047489, 720575940659940225, 720575940659850113, 720575940659765121]
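This list can be fed back into further queries. For example, a hypothetical follow-up that looks up the annotations of the first five downstream partners, using the neurons.parquet columns shown above:

```python
# Build a comma-separated ID list for the SQL IN clause.
ids = ', '.join(str(s) for s in downstream_segments[:5])

# Look up annotation columns for these partner segments.
partners = con.query(f"""SELECT segment_id, label, cell_type
                         FROM '{baseurl}/neurons.parquet'
                         WHERE segment_id IN ({ids})""").df()
partners
```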
Let’s get the L2 skeleton of this segment:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/skeleton/l2skeletons/skeleton_nodes.parquet'
                   WHERE segment_id = {segment_id}""").df()
df
```
segment_id | x | y | z | node_id | parent | |
---|---|---|---|---|---|---|
0 | 720575940621280688 | 506832 | 182256 | 49640 | 0 | 6 |
1 | 720575940621280688 | 507232 | 181232 | 52400 | 1 | 5 |
2 | 720575940621280688 | 511488 | 155344 | 87240 | 2 | 8 |
3 | 720575940621280688 | 510192 | 173536 | 60120 | 3 | 9 |
4 | 720575940621280688 | 508864 | 175056 | 59200 | 4 | 3 |
... | ... | ... | ... | ... | ... | ... |
11343 | 720575940621280688 | 720592 | 199120 | 123600 | 11343 | 11342 |
11344 | 720575940621280688 | 721520 | 201552 | 120360 | 11344 | 11346 |
11345 | 720575940621280688 | 724288 | 202160 | 123520 | 11345 | 11344 |
11346 | 720575940621280688 | 724352 | 205840 | 120680 | 11346 | 11347 |
11347 | 720575940621280688 | 720944 | 205328 | 125480 | 11347 | 11340 |
11348 rows × 6 columns
We notice that this query is very slow. One option to speed things up is to download the entire skeleton_nodes.parquet file (180 MB) and run the query locally:
```python
!wget -O /tmp/skeleton_nodes.parquet \
    https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/skeleton/l2skeletons/skeleton_nodes.parquet
```
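If wget is not available, the same download can be done from Python; a minimal sketch using only the standard library (the target path /tmp/skeleton_nodes.parquet matches the query below):

```python
import urllib.request

# Download the Parquet file once so that subsequent queries hit the local copy.
urllib.request.urlretrieve(
    f'{baseurl}/skeleton/l2skeletons/skeleton_nodes.parquet',
    '/tmp/skeleton_nodes.parquet',
)
```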
The same query now takes only a fraction of a second.
```python
df = con.query(f"""SELECT *
                   FROM '/tmp/skeleton_nodes.parquet'
                   WHERE segment_id = {segment_id}""").df()
```
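To check the speedup yourself, you can time the local query, for example:

```python
import time

# Measure wall-clock time of the query against the local copy.
t0 = time.time()
con.query(f"SELECT * FROM '/tmp/skeleton_nodes.parquet' WHERE segment_id = {segment_id}").df()
print(f'Local query took {time.time() - t0:.2f} s')
```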
If we are interested in advanced neuroanatomical analyses of this neuron based on its skeletonized representation, we can use the NAVis package for neuron analysis and visualization. In order to do that, we convert the DataFrame into a format that can be parsed by the SWC reader of NAVis:
```python
df2 = df.rename(columns={'parent': 'parent_id'})
df2['label'] = 0
df2['radius'] = 0
```
Then we convert the DataFrame to a navis.TreeNeuron to get some basic statistics:
```python
swc = navis.read_swc(df2)
print(swc)
```
type navis.TreeNeuron
name SWC
n_nodes 11348
n_connectors None
n_branches 1158
n_leafs 1761
cable_length 42474824.0
soma None
units 1 dimensionless
created_at 2023-10-19 18:01:31.521072
origin DataFrame
dtype: object
See the Tutorials section on the NAVis website for an overview of further processing options for the skeleton.
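For instance, a minimal sketch of two common next steps, downsampling and a quick 2D plot (assuming a recent NAVis version with its matplotlib backend):

```python
# Reduce the node count for faster plotting; branch points and leafs are preserved.
swc_ds = navis.downsample_neuron(swc, downsampling_factor=10)

# Render a static 2D view of the skeleton via matplotlib.
fig, ax = navis.plot2d(swc_ds)
```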
Going back to synaptic data, we can now also retrieve presynaptic locations for this segment with the following query:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/synapse_link.parquet'
                   WHERE pre_segment_id = {segment_id}
                   LIMIT 5""").df()
df
```
id | pre_x | pre_y | pre_z | post_x | post_y | post_z | scores | cleft_scores | cleft_id | pre_segment_id | post_segment_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21210 | 135698 | 41219 | 1850 | 135665 | 41212 | 1849 | 170.457611 | 141 | 9915144 | 720575940621280688 | 720575940507977460 |
1 | 21211 | 135709 | 41240 | 1849 | 135680 | 41243 | 1849 | 326.895081 | 147 | 9915144 | 720575940621280688 | 720575940507981812 |
2 | 21212 | 135719 | 41259 | 1848 | 135723 | 41279 | 1849 | 298.207092 | 100 | 0 | 720575940621280688 | 720575940507984372 |
3 | 21219 | 135709 | 41226 | 1852 | 135683 | 41234 | 1852 | 127.513809 | 68 | 0 | 720575940621280688 | 720575940605817522 |
4 | 22149 | 137649 | 41189 | 1863 | 137652 | 41207 | 1862 | 487.758301 | 148 | 10150849 | 720575940621280688 | 720575940472242925 |
Or the postsynaptic locations:
```python
df = con.query(f"""SELECT *
                   FROM '{baseurl}/synapse_link.parquet'
                   WHERE post_segment_id = {segment_id}
                   LIMIT 5""").df()
df
```
id | pre_x | pre_y | pre_z | post_x | post_y | post_z | scores | cleft_scores | cleft_id | pre_segment_id | post_segment_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21253 | 135822 | 41325 | 1862 | 135843 | 41297 | 1862 | 10.482904 | 0 | 0 | 720575940507992820 | 720575940621280688 |
1 | 21277 | 135924 | 41320 | 1873 | 135938 | 41299 | 1873 | 910.730591 | 142 | 9915241 | 720575940636560730 | 720575940621280688 |
2 | 21284 | 135971 | 41583 | 1872 | 135968 | 41552 | 1872 | 286.582794 | 149 | 9915238 | 720575940649691513 | 720575940621280688 |
3 | 21297 | 135999 | 41502 | 1873 | 135986 | 41526 | 1873 | 117.003525 | 83 | 0 | 720575940636560730 | 720575940621280688 |
4 | 21344 | 136810 | 41444 | 1849 | 136790 | 41461 | 1849 | 45.242020 | 92 | 0 | 720575940629397079 | 720575940621280688 |
It is also possible to fetch synaptic locations across a set of segments. Here, we count more than 800,000 locations with a single query:
```python
df = con.query(f"""SELECT count(*)
                   FROM '{baseurl}/synapse_link.parquet'
                   WHERE pre_segment_id IN ('720575940621280688', '720575940615970783',
                                            '720575940629174889', '720575940621675174')""").df()
df
```
count_star() | |
---|---|
0 | 803379 |
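The same filter also composes with aggregation. For example, to break the total down per presynaptic segment:

```python
# Count synaptic links separately for each of the four segments.
df = con.query(f"""SELECT pre_segment_id, count(*) AS n_links
                   FROM '{baseurl}/synapse_link.parquet'
                   WHERE pre_segment_id IN ('720575940621280688', '720575940615970783',
                                            '720575940629174889', '720575940621675174')
                   GROUP BY pre_segment_id
                   ORDER BY n_links DESC""").df()
df
```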