Large-Scale Connectome Analytics in the Cloud

Author: Stephan Gerhard

Published: October 9, 2023

Goals

Learn how to access and analyze large-scale connectome datasets hosted on BrainCircuits.io remotely in Python using DuckDB.

Requirements

First, we need to install DuckDB in our Python environment:

pip install duckdb
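To verify that the installation worked, you can print the installed version:

import duckdb
print(duckdb.__version__)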

Datasets

Every dataset on BrainCircuits.io has a unique string identifier. You can find the identifier in the Dataset Description overview or by inspecting the folders of the source data directly.

For the following tutorial we are going to use the public fruitfly FlyWire dataset with identifier fruitfly_fafb_flywire_public. Please make sure to cite the appropriate publications when using this data. You can find relevant links in the About widget of the dataset.

Data Model Overview

The static connectome dataset consists of a number of Parquet files. You can see all the files for the FlyWire dataset here: https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/

In each dataset folder, you will find a DATASET.txt file that lists, for every file, the contained columns and their data types, together with the total number of records.
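Alternatively, the same schema information can be obtained with DuckDB itself; a minimal, self-contained sketch using DESCRIBE on one of the remote Parquet files:

import duckdb

con = duckdb.connect()
# Print column names and types of a remote Parquet file without downloading it
print(con.query(
    "DESCRIBE SELECT * FROM "
    "'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/neurons.parquet'"
).df())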

The following table briefly describes the content of each type of file:

Filename Content
neurons.parquet Each neuron (or segment) in the dataset, with a flexible set of columns providing more information about individual neurons
segment_link.parquet An aggregate edge table that lists the synaptic counts between source and target segments, with additional columns containing statistics about the connection (e.g. avg_scores is the average of the individual synaptic link scores)
segment_size.parquet Size information about a segment, such as the number of contained supervoxels or the total number of voxels
segment_nucleus.parquet For each segment, the number of associated nuclei. Should be exactly one in an ideal world with perfect segmentations.
segment_neurotransmitter.parquet Summary of the most likely neurotransmitter per segment, based on filtering of presynaptic locations and their individual neurotransmitter predictions
segment_link_pre.parquet Number of presynaptic locations and downstream partner segments for each segment
segment_link_post.parquet Number of postsynaptic locations and upstream partner segments for each segment
link.parquet/*.parquet Same information as in the synapse_link.parquet file, but split into individual Parquet files for faster access
neurotransmitter.parquet/*.parquet The neurotransmitter predictions for each individual synaptic link
nucleus.parquet/*.parquet The location and size of each individual predicted nucleus
synapse_link.parquet Each individual predicted synaptic link between a presynaptic and a postsynaptic location with associated scores.
skeleton/l2skeletons/skeletons.parquet Summary statistics about each exported l2skeleton in the dataset.
skeleton/l2skeletons/skeleton_nodes.parquet Individual skeleton nodes for each segment.

Query the Data

Using DuckDB, these files can be queried remotely as tables in an SQL query. If you want to go into more depth, you can read the DuckDB documentation. Here, we demonstrate a few common queries to get started. To improve the readability of the queries, we store the common part of the URL in a baseurl variable:

import navis
import duckdb
baseurl = 'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public'

First, we need a DuckDB connection object:

con = duckdb.connect()
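Reading Parquet files over HTTPS relies on DuckDB's httpfs extension. Recent DuckDB versions load it automatically on demand; if yours does not, it can be installed and loaded explicitly:

# Install and load the extension for reading files over HTTP(S)
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")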

We can easily query neurons by a search string. To retrieve all neurons whose label column contains the string BPN (case-insensitive), we use the following query:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/neurons.parquet' 
               WHERE label ILIKE '%BPN%'
               LIMIT 5""").df()
df
segment_id x y z soma_x soma_y soma_z nucleus_id sv_id flow ... morphology_group top_nt_conf cell_type label size_um3 nr_pre nr_downstream_partner nr_post nr_upstream_partner nucleus_nr
0 720575940624402173 101863 31209 4323 128144 23056 3021 4367706 77618374632589286 efferent ... 0.958113 DNpe053 DNp_R; Part of comprehensive neck connective t... 2581.600098 67526 29564 16139 2792 1
1 720575940628278654 122736 35863 4010 127752 24256 3265 4373018 79026093046706888 intrinsic ... SMPpm1_2 0.844157 orphan-BPN-output 195.360001 1812 958 544 205 1
2 720575940630006267 122528 36529 4022 122384 28984 3898 3574415 79026161766195332 intrinsic ... SMPpm1_4 0.870921 new_clone_4; right; new_clone_4; right; BPN-li... 298.239990 2913 1261 1272 385 1
3 720575940627098249 140000 23144 3126 140000 23144 3126 5577486 80221468276838352 intrinsic ... SMPp&v1_posterior_1 0.795091 SMPp&v1A_H01 orphan-BPN-output 1241.989990 10747 4063 6845 1085 1
4 720575940618558365 162592 77144 1981 162592 77144 1981 6912408 81773222580344887 intrinsic ... 0.845792 orphan-BPN-output 1819.329956 20429 7895 13329 2115 1

5 rows × 30 columns

Let’s retrieve the top 5 neurons with more than 1000 downstream partners, sorted by the number of downstream partners in descending order. We can use the following query and retrieve the results directly as a Pandas dataframe:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/neurons.parquet' 
               WHERE nr_downstream_partner > 1000 
               ORDER BY nr_downstream_partner DESC
               LIMIT 5""").df()
df
segment_id x y z soma_x soma_y soma_z nucleus_id sv_id flow ... morphology_group top_nt_conf cell_type label size_um3 nr_pre nr_downstream_partner nr_post nr_upstream_partner nucleus_nr
0 720575940621280688 156482 42275 3896 179416 51040 3014 7396453 81348673921237569 intrinsic ... 0.771267 APL; APL-LHS; APL-LHS; Mushroom Body 11388.240234 146255 51620 113071 7052 1
1 720575940615970783 158906 64213 4347 127904 49928 1289 4492836 81561292049294153 intrinsic ... 0.655521 CT1; CT1; gabaergic 16315.629883 247006 49134 144276 17505 1
2 720575940629174889 107560 34947 4823 83352 47240 3701 1813248 78040862042737270 intrinsic ... 0.795262 APL; Mushroom Body; APL-RHS; APL-RHS 11757.730469 163407 48184 125189 6953 1
3 720575940621675174 133999 56333 1847 134280 51872 1179 5058804 79801523353597754 intrinsic ... 0.621499 CT1; CT1 15729.919922 246711 47962 144967 18119 1
4 720575940626637002 121547 58007 2243 144464 50760 762 5741632 78957235929387973 intrinsic ... 0.859750 None 6673.060059 139542 42196 67281 10190 1

5 rows × 30 columns

We can inspect all the columns in the data frame with:

df.columns
Index(['segment_id', 'x', 'y', 'z', 'soma_x', 'soma_y', 'soma_z', 'nucleus_id',
       'sv_id', 'flow', 'side', 'super_class', 'cell_class', 'cell_sub_class',
       'ito_lee_hemilineage', 'hartenstein_hemilineage', 'nerve', 'fbbt_id',
       'top_nt', 'status', 'morphology_group', 'top_nt_conf', 'cell_type',
       'label', 'size_um3', 'nr_pre', 'nr_downstream_partner', 'nr_post',
       'nr_upstream_partner', 'nucleus_nr'],
      dtype='object')

Let’s display only the number of downstream partners and presynaptic locations:

df[['segment_id', 'nr_downstream_partner', 'nr_pre']]
segment_id nr_downstream_partner nr_pre
0 720575940621280688 51620 146255
1 720575940615970783 49134 247006
2 720575940629174889 48184 163407
3 720575940621675174 47962 246711
4 720575940626637002 42196 139542

We see that segment 720575940621280688 has 51620 downstream partners and 146255 presynaptic locations.

Let’s store this segment ID in a variable:

segment_id = 720575940621280688

Let’s get all the downstream partners of this segment:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/segment_link.parquet' 
               WHERE src = {segment_id}
               ORDER BY dst DESC""").df()
df
src dst count min_scores max_scores avg_scores min_cleft_scores max_cleft_scores avg_cleft_scores
0 720575940621280688 720575940660053121 2 109.382713 190.875397 150 141 152 146
1 720575940621280688 720575940660047489 43 7.637001 742.377136 192 1 178 114
2 720575940621280688 720575940659940225 41 5.718618 443.282257 172 0 168 120
3 720575940621280688 720575940659850113 23 9.563767 700.225708 174 0 168 104
4 720575940621280688 720575940659765121 31 9.567139 1110.103516 161 9 173 129
... ... ... ... ... ... ... ... ... ...
51615 720575940621280688 720575940379705726 2 20.213730 29.862347 25 144 166 155
51616 720575940621280688 720575940379590257 1 100.210541 100.210541 100 28 28 28
51617 720575940621280688 720575940379312979 1 130.753403 130.753403 131 150 150 150
51618 720575940621280688 720575940379312467 1 337.480469 337.480469 337 144 144 144
51619 720575940621280688 0 5 23.942980 425.107513 148 0 168 92

51620 rows × 9 columns

We can extract the list of all downstream segment IDs from the dataframe:

downstream_segments = df['dst'].tolist()
print(downstream_segments[:5])
[720575940660053121, 720575940660047489, 720575940659940225, 720575940659850113, 720575940659765121]
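Because DuckDB can join across multiple remote Parquet files, we can also look up the labels of the strongest partners in a single query. A sketch, assuming the columns shown above (src, dst and count in segment_link.parquet; segment_id and label in neurons.parquet):

df = con.query(f"""SELECT l.dst, l."count" AS nr_synapses, n.label
               FROM '{baseurl}/segment_link.parquet' AS l
               LEFT JOIN '{baseurl}/neurons.parquet' AS n
                      ON n.segment_id = l.dst
               WHERE l.src = {segment_id}
               ORDER BY nr_synapses DESC
               LIMIT 10""").df()
df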

Let’s get the L2 skeleton of this segment:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/skeleton/l2skeletons/skeleton_nodes.parquet' 
               WHERE segment_id = {segment_id}""").df()
df
segment_id x y z node_id parent
0 720575940621280688 506832 182256 49640 0 6
1 720575940621280688 507232 181232 52400 1 5
2 720575940621280688 511488 155344 87240 2 8
3 720575940621280688 510192 173536 60120 3 9
4 720575940621280688 508864 175056 59200 4 3
... ... ... ... ... ... ...
11343 720575940621280688 720592 199120 123600 11343 11342
11344 720575940621280688 721520 201552 120360 11344 11346
11345 720575940621280688 724288 202160 123520 11345 11344
11346 720575940621280688 724352 205840 120680 11346 11347
11347 720575940621280688 720944 205328 125480 11347 11340

11348 rows × 6 columns

We notice that this query is very slow. One option to speed things up is to download the entire skeleton_nodes.parquet file (180 MB) and run the query locally:

!wget -O /tmp/skeleton_nodes.parquet \
    https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/skeleton/l2skeletons/skeleton_nodes.parquet 
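If wget is not available, the same file can be downloaded from within Python:

from urllib.request import urlretrieve

# Download the Parquet file once and reuse it for local queries
urlretrieve(
    'https://api.braincircuits.io/data/fruitfly_fafb_flywire_public/'
    'skeleton/l2skeletons/skeleton_nodes.parquet',
    '/tmp/skeleton_nodes.parquet')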

The same query now takes only a fraction of a second.

df = con.query(f"""SELECT * 
               FROM '/tmp/skeleton_nodes.parquet' 
               WHERE segment_id = {segment_id}""").df()

If we are interested in advanced neuroanatomical analyses of this neuron based on its skeletonized representation, we can use the NAVis package for neuron analysis and visualization.

In order to do that, we convert the DataFrame into a format that can be parsed by the SWC reader of NAVis.

# NAVis' SWC reader expects node_id, parent_id, x, y, z plus label and radius columns
df2 = df.rename(columns={'parent': 'parent_id'})
df2['label'] = 0
df2['radius'] = 0

Then we convert the DataFrame to a navis.TreeNeuron to get some basic statistics.

swc = navis.read_swc(df2)
print(swc)
type                      navis.TreeNeuron
name                                   SWC
n_nodes                              11348
n_connectors                          None
n_branches                            1158
n_leafs                               1761
cable_length                    42474824.0
soma                                  None
units                      1 dimensionless
created_at      2023-10-19 18:01:31.521072
origin                           DataFrame
dtype: object

See the Tutorials section on the NAVis website for an overview of further options for processing the skeleton.
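For instance, a quick 2D rendering of the skeleton (a minimal sketch; navis.plot2d returns a matplotlib figure and axes):

import matplotlib.pyplot as plt

# Render the TreeNeuron as a 2D projection
fig, ax = navis.plot2d(swc)
plt.show()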

Going back to synaptic data, we can now also retrieve presynaptic locations for this segment with the following query:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE pre_segment_id = {segment_id}
               LIMIT 5""").df()
df
id pre_x pre_y pre_z post_x post_y post_z scores cleft_scores cleft_id pre_segment_id post_segment_id
0 21210 135698 41219 1850 135665 41212 1849 170.457611 141 9915144 720575940621280688 720575940507977460
1 21211 135709 41240 1849 135680 41243 1849 326.895081 147 9915144 720575940621280688 720575940507981812
2 21212 135719 41259 1848 135723 41279 1849 298.207092 100 0 720575940621280688 720575940507984372
3 21219 135709 41226 1852 135683 41234 1852 127.513809 68 0 720575940621280688 720575940605817522
4 22149 137649 41189 1863 137652 41207 1862 487.758301 148 10150849 720575940621280688 720575940472242925

Or the postsynaptic locations:

df = con.query(f"""SELECT * 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE post_segment_id = {segment_id}
               LIMIT 5""").df()
df
id pre_x pre_y pre_z post_x post_y post_z scores cleft_scores cleft_id pre_segment_id post_segment_id
0 21253 135822 41325 1862 135843 41297 1862 10.482904 0 0 720575940507992820 720575940621280688
1 21277 135924 41320 1873 135938 41299 1873 910.730591 142 9915241 720575940636560730 720575940621280688
2 21284 135971 41583 1872 135968 41552 1872 286.582794 149 9915238 720575940649691513 720575940621280688
3 21297 135999 41502 1873 135986 41526 1873 117.003525 83 0 720575940636560730 720575940621280688
4 21344 136810 41444 1849 136790 41461 1849 45.242020 92 0 720575940629397079 720575940621280688

It is also possible to query synaptic locations across a set of segments. Here, we count more than 800,000 locations with a single query:

df = con.query(f"""SELECT count(*) 
               FROM '{baseurl}/synapse_link.parquet' 
               WHERE pre_segment_id in ('720575940621280688','720575940615970783','720575940629174889','720575940621675174')""").df()
df
count_star()
0 803379
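We can also aggregate per segment in the same query, for example to count the number of synaptic links originating from each of these four segments:

df = con.query(f"""SELECT pre_segment_id, count(*) AS nr_links
               FROM '{baseurl}/synapse_link.parquet'
               WHERE pre_segment_id IN (720575940621280688, 720575940615970783,
                                        720575940629174889, 720575940621675174)
               GROUP BY pre_segment_id
               ORDER BY nr_links DESC""").df()
df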