
Data Handling

This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in Python and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial scores, i.e. protein clustering and colocalization (co-clustering) patterns.
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.

Setup

To start with, we need to load some packages and functions that we will need.

from pathlib import Path  # needed for DATA_DIR below
from pixelator import read_pna as read
DATA_DIR = Path("<path to the directory to save datasets to>")

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
from pathlib import Path

pg_data = read(DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")
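
The `!curl` line above only works inside a notebook. As a minimal sketch of the same download in plain Python, the following uses only the standard library. The helper names `dataset_url` and `download_if_missing` are hypothetical and are not part of pixelator.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASEURL = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"


def dataset_url(filename: str, baseurl: str = BASEURL) -> str:
    """Build the download URL for one dataset file (hypothetical helper)."""
    return f"{baseurl.rstrip('/')}/{filename}"


def download_if_missing(filename: str, data_dir: Path) -> Path:
    """Download `filename` into `data_dir` unless a file with that name exists."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    if not target.exists():
        # Network access happens only here.
        urlretrieve(dataset_url(filename), target)
    return target
```

A call such as `download_if_missing("PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl", DATA_DIR)` returns the local path, which can then be passed to `read()`. Unlike `curl -C -`, this sketch does not resume a partial download: an incomplete file left on disk is treated as present.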

Component metadata

The object contains antibody counts and metadata for each connected component in the graph. Each graph component corresponds to a cell. The metadata table contains information that can be useful in quality control, such as how many protein molecules were detected (n_umi), the sequencing depth (reads_in_component), and the graph connectivity of each component (average_k_core).

pg_data.adata().obs
component         n_umi1  n_umi2  n_edges  reads_in_component  n_antibodies  n_umi  isotype_fraction  intracellular_fraction  tau_type       tau  ...  k_core_1  k_core_2  k_core_3  k_core_4  k_core_5  k_core_6  k_core_7  svd_var_expl_s1  svd_var_expl_s2  svd_var_expl_s3
f1b52a4758932fc7   13355   18311    93749              246004           133  31666          0.000316                     0.0    normal  0.974145  ...    5773.0    3009.0    5429.0    8402.0    9053.0       0.0       0.0         0.284417         0.156809         0.089341
eca4983191f3647f   17311   23376   105495              273174           139  40687          0.000221                     0.0    normal  0.975355  ...    7919.0    5180.0    9614.0   17974.0       0.0       0.0       0.0         0.253557         0.192981         0.146433
f98240c66b2e49fc   11827   14934    78611              225033           158  26761          0.001794                     0.0    normal  0.946812  ...    4706.0    3152.0    3553.0    8152.0    7198.0       0.0       0.0         0.239587         0.187886         0.120709
b5f8b192a0cc2603   21450   28111    97857              296160           158  49561          0.000404                     0.0    normal  0.949583  ...   12780.0   12920.0   19498.0    4363.0       0.0       0.0       0.0         0.349326         0.237203         0.116332
7994e60e303c6dbb   15767   20905   126994              357256           144  36672          0.000218                     0.0    normal  0.952086  ...    5937.0    2389.0    3245.0    7657.0   14342.0    3102.0       0.0         0.292264         0.173078         0.140349
...                  ...     ...      ...                 ...           ...    ...               ...                     ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...              ...              ...              ...
d3fd6df73f0c660a    8178   10070    48412              132220           126  18248          0.000219                     0.0    normal  0.966714  ...    3556.0    2407.0    3715.0    7959.0     611.0       0.0       0.0         0.265177         0.210068         0.106871
38663b1fa7dd363a    6331    7801    34239               95228           100  14132          0.000212                     0.0    normal  0.987860  ...    2910.0    2406.0    3488.0    5328.0       0.0       0.0       0.0         0.302713         0.264891         0.151157
9791359e16ac85ae    4146    4909    24284               65303            85   9055          0.000331                     0.0    normal  0.991820  ...    1751.0    1253.0    1412.0    4349.0     290.0       0.0       0.0         0.287943         0.201834         0.128967
072963fbd47582c3   12823   16209    71683              204981           137  29032          0.000103                     0.0    normal  0.963505  ...    5543.0    4191.0    7566.0   11732.0       0.0       0.0       0.0         0.334528         0.212675         0.144832
b5e265cb02443e7b    4645    5854    39094               97485            89  10499          0.000095                     0.0    normal  0.992041  ...    1603.0     690.0     879.0    1371.0    2480.0    3476.0       0.0         0.326871         0.115863         0.095980

1083 rows × 25 columns
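
To illustrate how these columns can feed a quality-control filter, here is a minimal sketch on four rows copied from the table above. The thresholds `MIN_N_UMI` and `MAX_ISOTYPE_FRACTION` are hypothetical values chosen only so that the example filters something; they are not recommended cutoffs.

```python
# Four rows copied from the table above (two of the 25 columns).
components = {
    "f1b52a4758932fc7": {"n_umi": 31666, "isotype_fraction": 0.000316},
    "f98240c66b2e49fc": {"n_umi": 26761, "isotype_fraction": 0.001794},
    "9791359e16ac85ae": {"n_umi": 9055, "isotype_fraction": 0.000331},
    "b5e265cb02443e7b": {"n_umi": 10499, "isotype_fraction": 0.000095},
}

# Hypothetical thresholds, for illustration only.
MIN_N_UMI = 10_000
MAX_ISOTYPE_FRACTION = 0.001


def passes_qc(row: dict) -> bool:
    """Keep components with enough molecules and a low isotype-control fraction."""
    return (
        row["n_umi"] >= MIN_N_UMI
        and row["isotype_fraction"] <= MAX_ISOTYPE_FRACTION
    )


kept = [component for component, row in components.items() if passes_qc(row)]
print(kept)
```

On the real table, the same filter can be written as a pandas boolean mask: with `obs = pg_data.adata().obs`, use `obs[(obs["n_umi"] >= MIN_N_UMI) & (obs["isotype_fraction"] <= MAX_ISOTYPE_FRACTION)]`.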

The antibody count table is also present in the data object.

pg_data.adata().to_df()
component         HLA-ABC   B2M  CD11b  CD11c  CD18  CD82   CD8  TCRab  HLA-DR  CD45  ...  CX3CR1  CD326  CD209  CD34  CD369  CD54  CD71  CD47  CD117  CD314
f1b52a4758932fc7     1643  3402      1      3   118   376     8    161       1  2141  ...       0      0      0     0      3    16     2   164      4      1
eca4983191f3647f     1719  3587      0      1   281   781     4    322       2  3939  ...       4      0      0     0      5    25     3   221      3      4
f98240c66b2e49fc     1029  1922     13      7   124    45  2862     91      10  1575  ...       3     15      7    10      9     8     7   130     13     34
b5f8b192a0cc2603     1984  4153     74    287   547   307  3564    236      70  2454  ...      27      6      3    12     29   245    14   133     61     92
7994e60e303c6dbb     2065  4303      2      6    97    40  4240    163      13  2096  ...       0      0      1     2      4     6     2   119      4     41
...                   ...   ...    ...    ...   ...   ...   ...    ...     ...   ...  ...     ...    ...    ...   ...    ...   ...   ...   ...    ...    ...
d3fd6df73f0c660a     1526  2931      2     27   145     1   141      1       7  1094  ...       3      0      0     0      3    74     4   111      4     44
38663b1fa7dd363a     1260  2054      4      3     2   349     0      0       1    26  ...       0      0      0     6      0     3     3    48      1      0
9791359e16ac85ae      798   957      1      0     3    82     2      0       1    14  ...       1      1      1     6      0     0     0    75      1      0
072963fbd47582c3     1861  4314     60     31   313    18   345      5       3  1641  ...       8      0      0     0      1    99     1   122      2     31
b5e265cb02443e7b      385   460      0      0     2   120     1      2       3     6  ...       1      0      0    14      1     1     1    70      0      0

1083 rows × 158 columns

Spatial data

In addition to count data corresponding to protein abundance, the Proximity Network Assay also generates spatial data, which is used to find spatial structures of proteins across cells and to visualize individual cells. The PXL file comes with proximity scores that describe the degree of spatial clustering of each protein and the colocalization (co-clustering) of each protein pair on each cell. To read more about spatial metrics in Proximity Network Assay data, see the PNA proximity explanation.

The proximity scores can be retrieved through the proximity() function.

pg_data.proximity().to_df()
         marker_1      marker_2  join_count  join_count_expected_mean  join_count_expected_sd  join_count_z  join_count_p         component                                sample  marker_1_count  marker_1_freq  marker_2_count  marker_2_freq  min_count  log2_ratio
0            CD33          CD33           0                      0.25                0.519810     -0.250000      0.401294  91984005700471b2  PNA062_unstim_PBMCs_1000cells_S02_S2              34       0.003074              34       0.003074         34    0.000000
1            CD33          CD59           5                      3.02                1.959076      1.010680      0.156085  91984005700471b2  PNA062_unstim_PBMCs_1000cells_S02_S2              34       0.003074             252       0.022787         34    0.727380
2            CD33        CD45RB           0                      0.08                0.272660     -0.080000      0.468119  91984005700471b2  PNA062_unstim_PBMCs_1000cells_S02_S2              34       0.003074              10       0.000904         10    0.000000
3            CD33         KLRG1           1                      0.14                0.348735      0.860000      0.194895  91984005700471b2  PNA062_unstim_PBMCs_1000cells_S02_S2              34       0.003074              12       0.001085         12    0.000000
4            CD33         VISTA           0                      0.20                0.512471     -0.200000      0.420740  91984005700471b2  PNA062_unstim_PBMCs_1000cells_S02_S2              34       0.003074              18       0.001628         18    0.000000
...           ...           ...         ...                       ...                     ...           ...           ...               ...                                   ...             ...            ...             ...            ...        ...         ...
2679835      CD55          CD81           5                      0.84                0.961060      4.160000      0.000016  0dd3799c51773670  PNA062_unstim_PBMCs_1000cells_S02_S2             487       0.006413              32       0.000421         32    2.321928
2679836      CD55           IgM          10                      4.23                1.916726      3.010342      0.001305  0dd3799c51773670  PNA062_unstim_PBMCs_1000cells_S02_S2             487       0.006413             116       0.001528        116    1.241270
2679837      CD55         VISTA           0                      0.41                0.637150     -0.410000      0.340903  0dd3799c51773670  PNA062_unstim_PBMCs_1000cells_S02_S2             487       0.006413              11       0.000145         11    0.000000
2679838      CD55  HLA-DR-DP-DQ           0                      1.67                1.385750     -1.205124      0.114078  0dd3799c51773670  PNA062_unstim_PBMCs_1000cells_S02_S2             487       0.006413              46       0.000606         46   -0.739848
2679839      CD55          CD55          17                      7.28                2.278933      4.265154      0.000010  0dd3799c51773670  PNA062_unstim_PBMCs_1000cells_S02_S2             487       0.006413             487       0.006413        487    1.223524

2679840 rows × 15 columns
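
To make the relationship between the columns concrete, the sketch below recomputes join_count_z, join_count_p and log2_ratio from join_count, join_count_expected_mean and join_count_expected_sd. The formulas are an assumption inferred from the rows displayed above, not taken from the pixelator documentation: they floor the standard deviation and both sides of the ratio at 1, and use a one-sided normal tail probability. The function name `derived_scores` is hypothetical.

```python
from math import log2
from statistics import NormalDist


def derived_scores(join_count, expected_mean, expected_sd):
    """Recompute join_count_z, join_count_p and log2_ratio for one row.

    Assumption (inferred from the displayed rows only): the standard
    deviation and both sides of the ratio are floored at 1.
    """
    z = (join_count - expected_mean) / max(expected_sd, 1.0)
    # One-sided tail probability of a standard normal at |z|.
    p = NormalDist().cdf(-abs(z))
    ratio = log2(max(join_count, 1.0) / max(expected_mean, 1.0))
    return z, p, ratio


# (join_count, expected_mean, expected_sd) copied from three rows of the table.
rows = {
    "CD33/CD59": (5, 3.02, 1.959076),
    "CD55/CD81": (5, 0.84, 0.961060),
    "CD55/HLA-DR-DP-DQ": (0, 1.67, 1.385750),
}
for pair, values in rows.items():
    z, p, ratio = derived_scores(*values)
    print(f"{pair}: z={z:.6f} p={p:.6f} log2_ratio={ratio:.6f}")
```

The printed values can be compared with the join_count_z, join_count_p and log2_ratio columns of the corresponding rows. If they diverge on other rows of the full table, the assumed formulas do not hold in general.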

Edge list

The edge list is a list of detected links between two uniquely identified antibodies. It constitutes the graph that makes up a Proximity Network Assay experiment. Each row contains one edge, identified by umi1 and umi2, the unique identifiers of the two antibodies, together with marker_1 and marker_2, the proteins that correspond to the respective antibodies. The edge list is used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist().to_polars()
shape: (99_143_483, 9)
marker_1   marker_2  umi1               umi2               read_count  uei_count  corrected_read_count  component           sample
str        str       u64                u64                u32         u16        u32                   str                 str
"GPR56"    "GPR56"   224470133783022    25464258655860574  3           3          0                     "f4221aafaa616afe"  "PNA062_unstim_PBMCs_1000cells_…
"CD38"     "CD158b"  1185904180182398   36214581647272587  7           2          1                     "f4221aafaa616afe"  "PNA062_unstim_PBMCs_1000cells_…
"CD38"     "CD45"    1788860347977752   53246002513730209  3           1          0                     "f4221aafaa616afe"  "PNA062_unstim_PBMCs_1000cells_…
"CD102"    "CD352"   2405410456204710   69671697962098444  1           1          0                     "f4221aafaa616afe"  "PNA062_unstim_PBMCs_1000cells_…
"CD337"    "CD156c"  2946334228576720   24455141328440508  1           1          0                     "f4221aafaa616afe"  "PNA062_unstim_PBMCs_1000cells_…
"B2M"      "CD44"    71379091948550637  33198496021597180  1           1          0                     "b2fa2041c541acb5"  "PNA062_unstim_PBMCs_1000cells_…
"CD27"     "CD44"    71691683015659365  18683185979839095  1           1          0                     "b2fa2041c541acb5"  "PNA062_unstim_PBMCs_1000cells_…
"HLA-ABC"  "CD2"     71798392508456934  31329073087501939  1           1          0                     "b2fa2041c541acb5"  "PNA062_unstim_PBMCs_1000cells_…
"CD44"     "CD52"    71810819798315409  30072559572985173  5           4          0                     "b2fa2041c541acb5"  "PNA062_unstim_PBMCs_1000cells_…
"CD3e"     "CD3e"    72013234700269253  22523768290847513  4           1          0                     "b2fa2041c541acb5"  "PNA062_unstim_PBMCs_1000cells_…

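As a sketch of how an edge list defines a graph, the example below counts the distinct umi1 nodes, distinct umi2 nodes and edges for each component. The edge list is a toy one: the UMIs and component names are made up, and the marker pairs are arbitrary. The output keys borrow the names n_umi1, n_umi2 and n_edges from the component metadata on the assumption that those columns are derived this way, which the source does not state.

```python
from collections import defaultdict

# Toy edge list with made-up UMIs and component IDs (real UMIs are 64-bit integers).
edges = [
    {"marker_1": "CD38", "marker_2": "CD45", "umi1": 1, "umi2": 101, "component": "cell_a"},
    {"marker_1": "CD38", "marker_2": "CD158b", "umi1": 1, "umi2": 102, "component": "cell_a"},
    {"marker_1": "B2M", "marker_2": "CD44", "umi1": 2, "umi2": 101, "component": "cell_a"},
    {"marker_1": "CD3e", "marker_2": "CD3e", "umi1": 3, "umi2": 103, "component": "cell_b"},
]


def summarize(edge_rows):
    """Count distinct umi1 nodes, distinct umi2 nodes and edges per component."""
    umi1_nodes = defaultdict(set)
    umi2_nodes = defaultdict(set)
    n_edges = defaultdict(int)
    for edge in edge_rows:
        component = edge["component"]
        umi1_nodes[component].add(edge["umi1"])
        umi2_nodes[component].add(edge["umi2"])
        n_edges[component] += 1
    return {
        component: {
            "n_umi1": len(umi1_nodes[component]),
            "n_umi2": len(umi2_nodes[component]),
            "n_edges": n_edges[component],
        }
        for component in n_edges
    }


summary = summarize(edges)
print(summary)
```

In the component metadata shown earlier, n_umi equals n_umi1 + n_umi2 for the displayed rows (for example 13355 + 18311 = 31666), which is consistent with each edge joining one umi1 node to one umi2 node.
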
Aggregating data

When we want to process and analyze several samples together, it is convenient to merge them into a single object. To merge PXL files, we pass a list of file paths to the read() function. Here, we read two files and aggregate them.

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"

pg_data_combined = read([
    DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl",
    DATA_DIR / "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl",
])

Calling any of the object's methods now returns the aggregated data from both samples:

pg_data_combined.adata().obs
component         n_umi1  n_umi2  n_edges  reads_in_component  n_antibodies   n_umi  isotype_fraction  intracellular_fraction  tau_type       tau  ...                                sample  antibodies  average_k_core  k_core_1  k_core_2  k_core_3  k_core_4  svd_var_expl_s1  svd_var_expl_s2  svd_var_expl_s3
f1b52a4758932fc7   13355   18311    93749              246004           133   31666          0.000316                     0.0    normal  0.974145  ...  PNA062_unstim_PBMCs_1000cells_S02_S2         133        3.377471    5773.0    3009.0    5429.0    8402.0         0.284417         0.156809         0.089341
eca4983191f3647f   17311   23376   105495              273174           139   40687          0.000221                     0.0    normal  0.975355  ...  PNA062_unstim_PBMCs_1000cells_S02_S2         139        2.925185    7919.0    5180.0    9614.0   17974.0         0.253557         0.192981         0.146433
f98240c66b2e49fc   11827   14934    78611              225033           158   26761          0.001794                     0.0    normal  0.946812  ...  PNA062_unstim_PBMCs_1000cells_S02_S2         158        3.373080    4706.0    3152.0    3553.0    8152.0         0.239587         0.187886         0.120709
b5f8b192a0cc2603   21450   28111    97857              296160           158   49561          0.000404                     0.0    normal  0.949583  ...  PNA062_unstim_PBMCs_1000cells_S02_S2         158        2.311616   12780.0   12920.0   19498.0    4363.0         0.349326         0.237203         0.116332
7994e60e303c6dbb   15767   20905   126994              357256           144   36672          0.000218                     0.0    normal  0.952086  ...  PNA062_unstim_PBMCs_1000cells_S02_S2         144        3.855803    5937.0    2389.0    3245.0    7657.0         0.292264         0.173078         0.140349
...                  ...     ...      ...                 ...           ...     ...               ...                     ...       ...       ...  ...                                   ...         ...             ...       ...       ...       ...       ...              ...              ...              ...
fe9e06d663b632d1   20195   28358    82112              178421           156   48553          0.000700                     0.0    normal  0.963772  ...     PNA062_PHA_PBMCs_1000cells_S04_S4         156        2.021296   13791.0   19937.0   14825.0       0.0         0.337305         0.211262         0.172595
b526da61c6941223   30537   39637   136362              310787           153   70174          0.000214                     0.0    normal  0.970609  ...     PNA062_PHA_PBMCs_1000cells_S04_S4         153        2.261065   16654.0   18583.0   34900.0      37.0         0.376707         0.282748         0.156067
8c26b116f942b613   30450   42223   140216              307077           150   72673          0.000138                     0.0    normal  0.942681  ...     PNA062_PHA_PBMCs_1000cells_S04_S4         150        2.256175   17579.0   18898.0   36196.0       0.0         0.405094         0.207156         0.142091
9567c4b0b99c07ab   10783   14004    44906               97541           141   24787          0.000121                     0.0    normal  0.972555  ...     PNA062_PHA_PBMCs_1000cells_S04_S4         141        2.091782    7136.0    8240.0    9411.0       0.0         0.478056         0.163209         0.079409
1a14cfd48348799d   79266  107755   363022              789026           158  187021          0.000283                     0.0    normal  0.975541  ...     PNA062_PHA_PBMCs_1000cells_S04_S4         158        2.275621   42751.0   49972.0   94298.0       0.0         0.278279         0.191667         0.181490

2137 rows × 22 columns
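
The idea behind the sample column can be sketched on plain dictionaries: each per-sample table is tagged with its sample name before the rows are combined. This is an illustration only, not how pixelator implements aggregation. The function `aggregate` is hypothetical, and raising an error on a duplicated component ID is a choice made for this sketch. The n_umi values are copied from the table above.

```python
from collections import Counter

# Two rows per sample, copied from the aggregated table above.
tables = {
    "PNA062_unstim_PBMCs_1000cells_S02_S2": {
        "f1b52a4758932fc7": {"n_umi": 31666},
        "eca4983191f3647f": {"n_umi": 40687},
    },
    "PNA062_PHA_PBMCs_1000cells_S04_S4": {
        "fe9e06d663b632d1": {"n_umi": 48553},
        "1a14cfd48348799d": {"n_umi": 187021},
    },
}


def aggregate(per_sample_tables):
    """Combine per-sample rows into one table, tagging each row with its sample."""
    combined = {}
    for sample, rows in per_sample_tables.items():
        for component, row in rows.items():
            if component in combined:
                # Sketch-only policy: refuse ambiguous component IDs.
                raise ValueError(f"duplicate component ID: {component}")
            combined[component] = {**row, "sample": sample}
    return combined


combined = aggregate(tables)
per_sample = Counter(row["sample"] for row in combined.values())
print(per_sample)
```

On the real object, `pg_data_combined.adata().obs["sample"].value_counts()` gives the number of components per sample. The combined table above has 2137 rows against 1083 for the unstimulated sample alone, which suggests that 1054 components come from the PHA sample, assuming none are dropped or merged during aggregation.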

We have now seen how to load Proximity Network Assay data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial, we will apply these skills and demonstrate how to quality control PNA data.