Data Handling

This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

Load PXL files in Python and access the multi-modal data contents including protein counts, metadata and spatial scores.
Understand and work with spatial scores, i.e. protein clustering and colocalization (co-clustering) patterns.
Directly access the spatial graph structure through the edge list.
Aggregate multiple PXL files into an integrated data object with sample identities.

Setup

To start with, we need to load some packages and functions that we will need.

from pathlib import Path

from pixelator import read_pna as read

DATA_DIR = Path("<path to the directory to save datasets to>")

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"

from pathlib import Path

pg_data = read(DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")

Component metadata

The object contains antibody counts and metadata for each connected component in the graph. Each graph component corresponds to a cell. The metadata table contains information that is useful for quality control, such as how many protein molecules were detected (n_umi), the sequencing depth (reads_in_component), and the graph connectivity of each component (average_k_core).

pg_data.adata().obs

	n_umi1	n_umi2	n_edges	reads_in_component	n_antibodies	n_umi	isotype_fraction	intracellular_fraction	tau_type	tau	...	k_core_1	k_core_2	k_core_3	k_core_4	k_core_5	k_core_6	k_core_7	svd_var_expl_s1	svd_var_expl_s2	svd_var_expl_s3
component
f1b52a4758932fc7	13355	18311	93749	246004	133	31666	0.000316	0.0	normal	0.974145	...	5773.0	3009.0	5429.0	8402.0	9053.0	0.0	0.0	0.284417	0.156809	0.089341
eca4983191f3647f	17311	23376	105495	273174	139	40687	0.000221	0.0	normal	0.975355	...	7919.0	5180.0	9614.0	17974.0	0.0	0.0	0.0	0.253557	0.192981	0.146433
f98240c66b2e49fc	11827	14934	78611	225033	158	26761	0.001794	0.0	normal	0.946812	...	4706.0	3152.0	3553.0	8152.0	7198.0	0.0	0.0	0.239587	0.187886	0.120709
b5f8b192a0cc2603	21450	28111	97857	296160	158	49561	0.000404	0.0	normal	0.949583	...	12780.0	12920.0	19498.0	4363.0	0.0	0.0	0.0	0.349326	0.237203	0.116332
7994e60e303c6dbb	15767	20905	126994	357256	144	36672	0.000218	0.0	normal	0.952086	...	5937.0	2389.0	3245.0	7657.0	14342.0	3102.0	0.0	0.292264	0.173078	0.140349
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
d3fd6df73f0c660a	8178	10070	48412	132220	126	18248	0.000219	0.0	normal	0.966714	...	3556.0	2407.0	3715.0	7959.0	611.0	0.0	0.0	0.265177	0.210068	0.106871
38663b1fa7dd363a	6331	7801	34239	95228	100	14132	0.000212	0.0	normal	0.987860	...	2910.0	2406.0	3488.0	5328.0	0.0	0.0	0.0	0.302713	0.264891	0.151157
9791359e16ac85ae	4146	4909	24284	65303	85	9055	0.000331	0.0	normal	0.991820	...	1751.0	1253.0	1412.0	4349.0	290.0	0.0	0.0	0.287943	0.201834	0.128967
072963fbd47582c3	12823	16209	71683	204981	137	29032	0.000103	0.0	normal	0.963505	...	5543.0	4191.0	7566.0	11732.0	0.0	0.0	0.0	0.334528	0.212675	0.144832
b5e265cb02443e7b	4645	5854	39094	97485	89	10499	0.000095	0.0	normal	0.992041	...	1603.0	690.0	879.0	1371.0	2480.0	3476.0	0.0	0.326871	0.115863	0.095980

1083 rows × 25 columns
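
As a minimal sketch of how these columns can support quality control, the snippet below copies three rows from the table above into plain Python dictionaries, checks that n_umi equals n_umi1 + n_umi2 in those rows, and keeps the components above a minimum n_umi. The threshold MIN_UMI and the helper filter_components are hypothetical and chosen only for illustration; they are not pixelator defaults.

```python
# Three rows copied from the component table above (subset of columns).
rows = [
    {"component": "f1b52a4758932fc7", "n_umi1": 13355, "n_umi2": 18311, "n_umi": 31666},
    {"component": "9791359e16ac85ae", "n_umi1": 4146, "n_umi2": 4909, "n_umi": 9055},
    {"component": "b5e265cb02443e7b", "n_umi1": 4645, "n_umi2": 5854, "n_umi": 10499},
]

# Hypothetical cutoff for illustration only; not a pixelator default.
MIN_UMI = 10_000


def filter_components(rows, min_umi):
    """Return the component IDs whose n_umi is at least min_umi."""
    return [r["component"] for r in rows if r["n_umi"] >= min_umi]


# In the rows shown, n_umi is the sum of the two UMI counts.
umi_sums_match = all(r["n_umi1"] + r["n_umi2"] == r["n_umi"] for r in rows)

kept = filter_components(rows, MIN_UMI)
print(umi_sums_match, kept)
```

On the real object, the same filter could be written against the DataFrame returned by pg_data.adata().obs, for example obs[obs["n_umi"] >= MIN_UMI].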

The antibody count table is also present in the data object.

pg_data.adata().to_df()

	HLA-ABC	B2M	CD11b	CD11c	CD18	CD82	CD8	TCRab	HLA-DR	CD45	...	CX3CR1	CD326	CD209	CD34	CD369	CD54	CD71	CD47	CD117	CD314
component
f1b52a4758932fc7	1643	3402	1	3	118	376	8	161	1	2141	...	0	0	0	0	3	16	2	164	4	1
eca4983191f3647f	1719	3587	0	1	281	781	4	322	2	3939	...	4	0	0	0	5	25	3	221	3	4
f98240c66b2e49fc	1029	1922	13	7	124	45	2862	91	10	1575	...	3	15	7	10	9	8	7	130	13	34
b5f8b192a0cc2603	1984	4153	74	287	547	307	3564	236	70	2454	...	27	6	3	12	29	245	14	133	61	92
7994e60e303c6dbb	2065	4303	2	6	97	40	4240	163	13	2096	...	0	0	1	2	4	6	2	119	4	41
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
d3fd6df73f0c660a	1526	2931	2	27	145	1	141	1	7	1094	...	3	0	0	0	3	74	4	111	4	44
38663b1fa7dd363a	1260	2054	4	3	2	349	0	0	1	26	...	0	0	0	6	0	3	3	48	1	0
9791359e16ac85ae	798	957	1	0	3	82	2	0	1	14	...	1	1	1	6	0	0	0	75	1	0
072963fbd47582c3	1861	4314	60	31	313	18	345	5	3	1641	...	8	0	0	0	1	99	1	122	2	31
b5e265cb02443e7b	385	460	0	0	2	120	1	2	3	6	...	1	0	0	14	1	1	1	70	0	0

1083 rows × 158 columns

Spatial data

In addition to count data corresponding to protein abundance, the Proximity Network Assay also generates spatial data, which is used to find spatial structures of proteins across cells and to visualize individual cells. The PXL file comes with proximity scores that describe the degree of spatial clustering of each protein and the colocalization (co-clustering) of each protein pair on each cell. To read more about spatial metrics in Proximity Network Assay data, see LINK_TO_PNA_PROXIMITY_EXPLANATION.

The proximity scores can be retrieved through the proximity() function.

pg_data.proximity().to_df()

	marker_1	marker_2	join_count	join_count_expected_mean	join_count_expected_sd	join_count_z	join_count_p	component	sample	marker_1_count	marker_1_freq	marker_2_count	marker_2_freq	min_count	log2_ratio
0	CD33	CD33	0	0.25	0.519810	-0.250000	0.401294	91984005700471b2	PNA062_unstim_PBMCs_1000cells_S02_S2	34	0.003074	34	0.003074	34	0.000000
1	CD33	CD59	5	3.02	1.959076	1.010680	0.156085	91984005700471b2	PNA062_unstim_PBMCs_1000cells_S02_S2	34	0.003074	252	0.022787	34	0.727380
2	CD33	CD45RB	0	0.08	0.272660	-0.080000	0.468119	91984005700471b2	PNA062_unstim_PBMCs_1000cells_S02_S2	34	0.003074	10	0.000904	10	0.000000
3	CD33	KLRG1	1	0.14	0.348735	0.860000	0.194895	91984005700471b2	PNA062_unstim_PBMCs_1000cells_S02_S2	34	0.003074	12	0.001085	12	0.000000
4	CD33	VISTA	0	0.20	0.512471	-0.200000	0.420740	91984005700471b2	PNA062_unstim_PBMCs_1000cells_S02_S2	34	0.003074	18	0.001628	18	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2679835	CD55	CD81	5	0.84	0.961060	4.160000	0.000016	0dd3799c51773670	PNA062_unstim_PBMCs_1000cells_S02_S2	487	0.006413	32	0.000421	32	2.321928
2679836	CD55	IgM	10	4.23	1.916726	3.010342	0.001305	0dd3799c51773670	PNA062_unstim_PBMCs_1000cells_S02_S2	487	0.006413	116	0.001528	116	1.241270
2679837	CD55	VISTA	0	0.41	0.637150	-0.410000	0.340903	0dd3799c51773670	PNA062_unstim_PBMCs_1000cells_S02_S2	487	0.006413	11	0.000145	11	0.000000
2679838	CD55	HLA-DR-DP-DQ	0	1.67	1.385750	-1.205124	0.114078	0dd3799c51773670	PNA062_unstim_PBMCs_1000cells_S02_S2	487	0.006413	46	0.000606	46	-0.739848
2679839	CD55	CD55	17	7.28	2.278933	4.265154	0.000010	0dd3799c51773670	PNA062_unstim_PBMCs_1000cells_S02_S2	487	0.006413	487	0.006413	487	1.223524

2679840 rows × 15 columns
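
The rows displayed above are consistent with join_count_z being the difference between join_count and join_count_expected_mean divided by the larger of join_count_expected_sd and 1, and with log2_ratio being the base-2 logarithm of max(join_count, 1) over max(join_count_expected_mean, 1). This relationship is inferred from the printed values only and is an assumption, not a documented definition. The helper names below are hypothetical, and the sketch uses only the standard library to compare the formulas against five rows copied from the table.

```python
import math


def join_count_z(join_count, expected_mean, expected_sd):
    """z-score with the standard deviation floored at 1 (inferred, see text)."""
    return (join_count - expected_mean) / max(expected_sd, 1.0)


def log2_ratio(join_count, expected_mean):
    """log2 of observed over expected, with both values floored at 1 (inferred)."""
    return math.log2(max(join_count, 1.0) / max(expected_mean, 1.0))


# (join_count, expected_mean, expected_sd, join_count_z, log2_ratio)
# copied from rows of the table above.
table_rows = [
    (5, 3.02, 1.959076, 1.010680, 0.727380),    # CD33 / CD59
    (1, 0.14, 0.348735, 0.860000, 0.000000),    # CD33 / KLRG1
    (5, 0.84, 0.961060, 4.160000, 2.321928),    # CD55 / CD81
    (0, 1.67, 1.385750, -1.205124, -0.739848),  # CD55 / HLA-DR-DP-DQ
    (17, 7.28, 2.278933, 4.265154, 1.223524),   # CD55 / CD55
]

for jc, mean, sd, z_shown, ratio_shown in table_rows:
    z = join_count_z(jc, mean, sd)
    ratio = log2_ratio(jc, mean)
    print(f"z={z:.6f} (table {z_shown:.6f})  log2_ratio={ratio:.6f} (table {ratio_shown:.6f})")
```

Flooring the denominator at 1 keeps pairs with very small expected counts from producing extreme scores, which would explain why the CD33 / KLRG1 row has a log2_ratio of exactly 0.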

Edgelist

The edge list is a list of detected links between two uniquely identified antibodies. Together, these edges constitute the graph that makes up a Proximity Network Assay experiment. Each row is one edge, identified by umi1 and umi2, the unique identifiers of the two antibodies, with marker_1 and marker_2 giving the proteins that the respective antibodies target. The edge list is used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist().to_polars()

shape: (99_143_483, 9)

marker_1	marker_2	umi1	umi2	read_count	uei_count	corrected_read_count	component	sample
str	str	u64	u64	u32	u16	u32	str	str
"GPR56"	"GPR56"	224470133783022	25464258655860574	3	3	0	"f4221aafaa616afe"	"PNA062_unstim_PBMCs_1000cells_…
"CD38"	"CD158b"	1185904180182398	36214581647272587	7	2	1	"f4221aafaa616afe"	"PNA062_unstim_PBMCs_1000cells_…
"CD38"	"CD45"	1788860347977752	53246002513730209	3	1	0	"f4221aafaa616afe"	"PNA062_unstim_PBMCs_1000cells_…
"CD102"	"CD352"	2405410456204710	69671697962098444	1	1	0	"f4221aafaa616afe"	"PNA062_unstim_PBMCs_1000cells_…
"CD337"	"CD156c"	2946334228576720	24455141328440508	1	1	0	"f4221aafaa616afe"	"PNA062_unstim_PBMCs_1000cells_…
…	…	…	…	…	…	…	…	…
"B2M"	"CD44"	71379091948550637	33198496021597180	1	1	0	"b2fa2041c541acb5"	"PNA062_unstim_PBMCs_1000cells_…
"CD27"	"CD44"	71691683015659365	18683185979839095	1	1	0	"b2fa2041c541acb5"	"PNA062_unstim_PBMCs_1000cells_…
"HLA-ABC"	"CD2"	71798392508456934	31329073087501939	1	1	0	"b2fa2041c541acb5"	"PNA062_unstim_PBMCs_1000cells_…
"CD44"	"CD52"	71810819798315409	30072559572985173	5	4	0	"b2fa2041c541acb5"	"PNA062_unstim_PBMCs_1000cells_…
"CD3e"	"CD3e"	72013234700269253	22523768290847513	4	1	0	"b2fa2041c541acb5"	"PNA062_unstim_PBMCs_1000cells_…
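
To illustrate how an edge list defines a graph, the sketch below builds an adjacency map from a few toy edges and groups the nodes into connected components with a breadth-first search. The UMI values are made-up small integers, and treating umi1 and umi2 as separate node namespaces is an assumption of this sketch. It uses only the standard library and does not call pixelator.

```python
from collections import defaultdict, deque

# Toy edges: (marker_1, marker_2, umi1, umi2). UMI values are made up.
edges = [
    ("CD38", "CD45", 1, 101),
    ("CD38", "CD158b", 1, 102),
    ("B2M", "CD44", 2, 101),
    ("CD3e", "CD3e", 3, 103),
]

# Build an undirected adjacency map. Nodes are tagged with their column so
# that a umi1 and a umi2 with the same value stay distinct (an assumption).
adjacency = defaultdict(set)
for _, _, umi1, umi2 in edges:
    a, b = ("umi1", umi1), ("umi2", umi2)
    adjacency[a].add(b)
    adjacency[b].add(a)


def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        components.append(component)
    return components


components = connected_components(adjacency)
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(len(components), degrees[("umi1", 1)])
```

In the real data, the component column already records which connected component each edge belongs to, so this grouping does not need to be recomputed for routine analysis.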

Aggregating data

When we want to process and analyze several samples together, it is convenient to merge them into a single object. To merge PXL files, we pass a list of file paths to the read() function.

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"

pg_data_combined = read([
  DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl",
  DATA_DIR / "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl",
])

Calling any of the object's methods now returns the aggregated data from both samples:

pg_data_combined.adata().obs

	n_umi1	n_umi2	n_edges	reads_in_component	n_antibodies	n_umi	isotype_fraction	intracellular_fraction	tau_type	tau	...	sample	antibodies	average_k_core	k_core_1	k_core_2	k_core_3	k_core_4	svd_var_expl_s1	svd_var_expl_s2	svd_var_expl_s3
component
f1b52a4758932fc7	13355	18311	93749	246004	133	31666	0.000316	0.0	normal	0.974145	...	PNA062_unstim_PBMCs_1000cells_S02_S2	133	3.377471	5773.0	3009.0	5429.0	8402.0	0.284417	0.156809	0.089341
eca4983191f3647f	17311	23376	105495	273174	139	40687	0.000221	0.0	normal	0.975355	...	PNA062_unstim_PBMCs_1000cells_S02_S2	139	2.925185	7919.0	5180.0	9614.0	17974.0	0.253557	0.192981	0.146433
f98240c66b2e49fc	11827	14934	78611	225033	158	26761	0.001794	0.0	normal	0.946812	...	PNA062_unstim_PBMCs_1000cells_S02_S2	158	3.373080	4706.0	3152.0	3553.0	8152.0	0.239587	0.187886	0.120709
b5f8b192a0cc2603	21450	28111	97857	296160	158	49561	0.000404	0.0	normal	0.949583	...	PNA062_unstim_PBMCs_1000cells_S02_S2	158	2.311616	12780.0	12920.0	19498.0	4363.0	0.349326	0.237203	0.116332
7994e60e303c6dbb	15767	20905	126994	357256	144	36672	0.000218	0.0	normal	0.952086	...	PNA062_unstim_PBMCs_1000cells_S02_S2	144	3.855803	5937.0	2389.0	3245.0	7657.0	0.292264	0.173078	0.140349
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
fe9e06d663b632d1	20195	28358	82112	178421	156	48553	0.000700	0.0	normal	0.963772	...	PNA062_PHA_PBMCs_1000cells_S04_S4	156	2.021296	13791.0	19937.0	14825.0	0.0	0.337305	0.211262	0.172595
b526da61c6941223	30537	39637	136362	310787	153	70174	0.000214	0.0	normal	0.970609	...	PNA062_PHA_PBMCs_1000cells_S04_S4	153	2.261065	16654.0	18583.0	34900.0	37.0	0.376707	0.282748	0.156067
8c26b116f942b613	30450	42223	140216	307077	150	72673	0.000138	0.0	normal	0.942681	...	PNA062_PHA_PBMCs_1000cells_S04_S4	150	2.256175	17579.0	18898.0	36196.0	0.0	0.405094	0.207156	0.142091
9567c4b0b99c07ab	10783	14004	44906	97541	141	24787	0.000121	0.0	normal	0.972555	...	PNA062_PHA_PBMCs_1000cells_S04_S4	141	2.091782	7136.0	8240.0	9411.0	0.0	0.478056	0.163209	0.079409
1a14cfd48348799d	79266	107755	363022	789026	158	187021	0.000283	0.0	normal	0.975541	...	PNA062_PHA_PBMCs_1000cells_S04_S4	158	2.275621	42751.0	49972.0	94298.0	0.0	0.278279	0.191667	0.181490

2137 rows × 22 columns
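
A quick sanity check after merging is to count the components contributed by each sample. The sketch below does this on five (component, sample) pairs copied from the table above, using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# (component, sample) pairs copied from the combined table above.
obs_rows = [
    ("f1b52a4758932fc7", "PNA062_unstim_PBMCs_1000cells_S02_S2"),
    ("eca4983191f3647f", "PNA062_unstim_PBMCs_1000cells_S02_S2"),
    ("f98240c66b2e49fc", "PNA062_unstim_PBMCs_1000cells_S02_S2"),
    ("fe9e06d663b632d1", "PNA062_PHA_PBMCs_1000cells_S04_S4"),
    ("b526da61c6941223", "PNA062_PHA_PBMCs_1000cells_S04_S4"),
]

# Count how many components each sample contributes.
components_per_sample = Counter(sample for _, sample in obs_rows)

for sample, n_components in components_per_sample.items():
    print(sample, n_components)
```

On the merged object, the sample column of pg_data_combined.adata().obs carries the same labels, so a pandas call such as obs["sample"].value_counts() would give the per-sample totals. If aggregation keeps every component, the table sizes above imply 1083 components from the unstimulated sample and 2137 - 1083 = 1054 from the PHA sample.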

We have now seen how to load Proximity Network Assay data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial, we will apply these skills and demonstrate how to perform quality control on PNA data.
