Data Handling

This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in Python and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial scores, i.e. protein clustering and colocalization (co-clustering) patterns.
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.

Setup

First, we load the packages and functions used throughout this tutorial.

from pathlib import Path

from pixelator import read_pna as read
DATA_DIR = Path("<path to the directory to save datasets to>")

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.

baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
from pathlib import Path

pg_data = read(DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")

Component metadata

The object contains antibody counts and metadata for each connected component in the graph. Each graph component corresponds to a cell. The metadata table contains information that is useful for quality control, such as how many protein molecules were detected (n_umi), the sequencing depth (reads_in_component), and the graph connectivity of each component (average_k_core).

pg_data.adata().obs
n_umi1 n_umi2 n_edges n_antibodies reads_in_component n_umi isotype_fraction intracellular_fraction tau_type tau ... k_core_3 k_core_4 k_core_5 k_core_6 k_core_7 svd_var_expl_s1 svd_var_expl_s2 svd_var_expl_s3 B_nodes_mean_degree A_nodes_mean_degree
component
00259f58c70ad3a1 12682 16831 76755 143 207161 29513 0.000474 0.0 normal 0.970740 ... 7184.0 12699.0 0.0 0.0 0.0 0.305689 0.144510 0.122012 4.560335 6.052279
002d152256da292c 18491 24404 109344 158 293426 42895 0.002121 0.0 normal 0.971252 ... 7079.0 18045.0 840.0 0.0 0.0 0.258788 0.184087 0.164557 4.480577 5.913363
016247a40001f2d2 3635 4557 27302 115 66749 8192 0.000122 0.0 normal 0.975658 ... 965.0 1448.0 1572.0 1739.0 0.0 0.214890 0.137352 0.088650 5.991222 7.510867
0189d9c1054ffa9d 5915 6029 32331 130 85856 11944 0.000251 0.0 normal 0.968044 ... 2234.0 4133.0 1684.0 0.0 0.0 0.218786 0.154800 0.093147 5.362581 5.465934
019f4673ca347f70 15761 18184 93848 141 259873 33945 0.000206 0.0 normal 0.971870 ... 5429.0 16501.0 2024.0 0.0 0.0 0.222410 0.196024 0.148939 5.161021 5.954445
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
fef58843c24dfc71 10323 13974 58222 147 153559 24297 0.000412 0.0 normal 0.977312 ... 5858.0 8580.0 0.0 0.0 0.0 0.281905 0.207809 0.146402 4.166452 5.640027
ff34828a9391db95 16742 23875 102427 155 277026 40617 0.000271 0.0 normal 0.984399 ... 8922.0 17030.0 0.0 0.0 0.0 0.248770 0.219871 0.103427 4.290136 6.117967
ff9a29e516f0349a 9911 12773 67373 131 185195 22684 0.000132 0.0 normal 0.989573 ... 2507.0 7681.0 5995.0 0.0 0.0 0.307491 0.154692 0.126961 5.274642 6.797800
ffa01c7eebadf5f9 5708 7090 40348 108 113742 12798 0.000234 0.0 normal 0.986570 ... 1101.0 3061.0 4766.0 366.0 0.0 0.332866 0.168603 0.136671 5.690832 7.068676
fff8ca8e02383784 14157 18161 100723 149 261196 32318 0.000186 0.0 normal 0.962244 ... 4119.0 4856.0 14180.0 0.0 0.0 0.317734 0.123158 0.091305 5.546115 7.114714

1084 rows × 25 columns
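
Several columns in this table appear to be simple functions of one another, which is useful for sanity checks during quality control. The sketch below uses plain Python and the first printed row only; it makes no pixelator calls. The relationships it checks (n_umi as the sum of n_umi1 and n_umi2, and the mean degrees as n_edges divided by the node counts) are inferred from the printed values, not taken from the pixelator documentation, and reads_per_edge is a hypothetical helper metric introduced here for illustration.

```python
# Values copied from the first printed row (component 00259f58c70ad3a1).
row = {
    "n_umi1": 12682,
    "n_umi2": 16831,
    "n_edges": 76755,
    "reads_in_component": 207161,
    "n_umi": 29513,
    "A_nodes_mean_degree": 6.052279,
    "B_nodes_mean_degree": 4.560335,
}


def derived_metrics(r):
    """Recompute a few per-component quantities from the raw counts.

    Assumption (inferred from the printed table, not from pixelator docs):
    A nodes correspond to umi1 and B nodes correspond to umi2.
    """
    return {
        "n_umi": r["n_umi1"] + r["n_umi2"],
        "A_nodes_mean_degree": r["n_edges"] / r["n_umi1"],
        "B_nodes_mean_degree": r["n_edges"] / r["n_umi2"],
        # Hypothetical helper: average number of reads supporting each edge.
        "reads_per_edge": r["reads_in_component"] / r["n_edges"],
    }


metrics = derived_metrics(row)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

If the recomputed values agree with the stored columns, as they do for this row, the table was read consistently; a mismatch would suggest that the assumed relationships do not hold for that pixelator version.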

The antibody count table is also present in the data object.

pg_data.adata().to_df()
HLA-ABC B2M CD11b CD11c CD18 CD82 CD8 TCRab HLA-DR CD45 ... CX3CR1 CD326 CD209 CD34 CD369 CD54 CD71 CD47 CD117 CD314
component
00259f58c70ad3a1 1118 2438 1 2 100 91 39 83 12 2319 ... 5 0 1 2 4 17 0 134 4 2
002d152256da292c 1391 3411 22 37 297 621 95 563 52 2908 ... 13 23 28 9 17 54 17 245 37 37
016247a40001f2d2 344 787 40 92 98 104 7 2 40 271 ... 2 1 0 1 15 44 2 23 0 1
0189d9c1054ffa9d 832 1985 1 7 27 53 15 3 299 757 ... 1 0 1 2 4 20 0 38 2 2
019f4673ca347f70 4467 6267 2 222 151 712 52 9 1236 704 ... 0 3 0 0 4 20 11 116 4 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
fef58843c24dfc71 1411 3024 0 11 120 66 36 130 16 1134 ... 4 2 1 3 3 12 0 107 4 4
ff34828a9391db95 1873 3217 7 588 579 74 46 11 259 2253 ... 25 2 4 2 75 168 9 152 9 2
ff9a29e516f0349a 1743 2371 2 6 12 295 29 5 6 86 ... 0 0 1 37 2 4 19 114 3 4
ffa01c7eebadf5f9 2281 3208 0 3 7 59 19 5 2 44 ... 0 0 0 2 2 2 8 31 4 0
fff8ca8e02383784 2778 4665 14 20 297 44 42 5 13 2038 ... 6 0 1 5 2 25 4 171 0 140

1084 rows × 158 columns

Spatial data

In addition to count data reflecting protein abundance, the Proximity Network Assay generates spatial data, which is used to find spatial structures of proteins across cells and to visualize individual cells. The PXL file comes with proximity scores that describe, for each cell, the degree of spatial clustering of each protein and the colocalization (co-clustering) of each protein pair.

The proximity scores can be retrieved through the proximity() function.

pg_data.proximity().to_df()
marker_1 marker_2 join_count join_count_expected_mean join_count_expected_sd join_count_z join_count_p component sample marker_1_count marker_2_count min_count log2_ratio
0 HLA-DR-DP-DQ HLA-DR-DP-DQ 1 0.08 0.272660 0.920000 0.178786 7f0909a1212cd653 PNA062_unstim_PBMCs_1000cells_S02_S2 17 17 17 0.000000
1 HLA-DR-DP-DQ IgE 0 0.00 0.000000 0.000000 0.500000 7f0909a1212cd653 PNA062_unstim_PBMCs_1000cells_S02_S2 17 1 1 0.000000
2 HLA-DR-DP-DQ TIGIT 0 0.00 0.000000 0.000000 0.500000 7f0909a1212cd653 PNA062_unstim_PBMCs_1000cells_S02_S2 17 1 1 0.000000
3 HLA-DR-DP-DQ KLRG1 0 0.02 0.140705 -0.020000 0.492022 7f0909a1212cd653 PNA062_unstim_PBMCs_1000cells_S02_S2 17 3 3 0.000000
4 HLA-DR-DP-DQ Siglec-9 0 0.03 0.171447 -0.030000 0.488034 7f0909a1212cd653 PNA062_unstim_PBMCs_1000cells_S02_S2 17 4 4 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
11542966 CD49e CD7 0 0.28 0.494004 -0.280000 0.389739 459051ae274b9701 PNA062_unstim_PBMCs_1000cells_S02_S2 68 21 21 0.000000
11542967 CD49e KLRG1 0 0.25 0.519810 -0.250000 0.401294 459051ae274b9701 PNA062_unstim_PBMCs_1000cells_S02_S2 68 19 19 0.000000
11542968 CD49e CD82 6 6.55 2.893305 -0.190094 0.424618 459051ae274b9701 PNA062_unstim_PBMCs_1000cells_S02_S2 68 588 68 -0.126532
11542969 CD49e CD5 0 0.50 0.674200 -0.500000 0.308538 459051ae274b9701 PNA062_unstim_PBMCs_1000cells_S02_S2 68 45 45 0.000000
11542970 CD49e CD49e 0 0.42 0.606030 -0.420000 0.337243 459051ae274b9701 PNA062_unstim_PBMCs_1000cells_S02_S2 68 68 68 0.000000

11542971 rows × 13 columns
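
To make the score columns easier to interpret, the following sketch recomputes join_count_z, join_count_p and log2_ratio for single rows of the table. The formulas are a reading of the printed rows, not pixelator's documented definitions: the rows shown are consistent with a z-score whose standard deviation is floored at 1, a normal tail probability beyond |z|, and a log2 ratio whose operands are both floored at 1. The function name proximity_stats is hypothetical.

```python
import math


def proximity_stats(join_count, expected_mean, expected_sd):
    """Recompute z, p and log2 ratio for one marker pair.

    Assumptions inferred from the printed rows (not from pixelator docs):
    the standard deviation and both log2 operands are floored at 1.
    """
    z = (join_count - expected_mean) / max(expected_sd, 1.0)
    # Tail probability of a standard normal distribution beyond |z|.
    p = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    log2_ratio = math.log2(max(join_count, 1) / max(expected_mean, 1.0))
    return z, p, log2_ratio


# Inputs copied from the CD49e / CD82 row printed above.
z, p, log2_ratio = proximity_stats(6, 6.55, 2.893305)
print(f"z={z:.6f} p={p:.6f} log2_ratio={log2_ratio:.6f}")
```

Under these assumptions, a positive join_count_z or log2_ratio means the pair is linked more often than expected by chance, and a negative value means less often.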

Edge list

The edge list is a list of detected links between two uniquely identified antibodies. The edge list constitutes the graph which makes up a Proximity Network Assay experiment. Each row is one edge, identified by umi1 and umi2, the unique identifiers of the two linked antibody molecules; marker_1 and marker_2 give the proteins targeted by those antibodies. The edge list is used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

pg_data.edgelist().to_polars()
shape: (101_114_218, 9)
marker_1 marker_2 umi1 umi2 read_count uei_count corrected_read_count component sample
str str u64 u64 u32 u16 u32 str str
"CD41" "CD36" 1536508673753445 17160439996589472 1 1 0 "d79d0373b9011594" "PNA062_unstim_PBMCs_1000cells_…
"CD41" "CD84" 3756592463452849 21225509255780229 2 2 0 "d79d0373b9011594" "PNA062_unstim_PBMCs_1000cells_…
"CD41" "CD41" 6212127383916900 17447157367127077 2 2 0 "d79d0373b9011594" "PNA062_unstim_PBMCs_1000cells_…
"B2M" "CD36" 6602866048587088 21565231029558269 3 2 0 "d79d0373b9011594" "PNA062_unstim_PBMCs_1000cells_…
"CD36" "CD36" 7466584021219948 11613682042882452 1 1 0 "d79d0373b9011594" "PNA062_unstim_PBMCs_1000cells_…
"HLA-ABC" "CD102" 71441025645011987 68669839214899147 4 2 1 "9a80d6923ba5691d" "PNA062_unstim_PBMCs_1000cells_…
"CD43" "CD43" 71590642841442142 41814121581079748 5 3 0 "9a80d6923ba5691d" "PNA062_unstim_PBMCs_1000cells_…
"CD53" "HLA-ABC" 71663951675489002 42621753378576454 1 1 0 "9a80d6923ba5691d" "PNA062_unstim_PBMCs_1000cells_…
"CD352" "CD328" 71863271622104604 19673970718705989 5 2 0 "9a80d6923ba5691d" "PNA062_unstim_PBMCs_1000cells_…
"B2M" "CD43" 72032326171452283 64807303907228061 5 1 0 "9a80d6923ba5691d" "PNA062_unstim_PBMCs_1000cells_…
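
As a rough illustration of how the edge list relates to per-component summaries such as n_edges, n_umi1 and n_umi2, the sketch below groups a handful of the printed rows by component using only the Python standard library. It assumes that each row is a distinct edge and that umi1 and umi2 identify the two node sets of the graph; the summarize function is hypothetical and not part of pixelator.

```python
from collections import defaultdict

# (umi1, umi2, component) triples copied from rows printed above.
edges = [
    (1536508673753445, 17160439996589472, "d79d0373b9011594"),
    (3756592463452849, 21225509255780229, "d79d0373b9011594"),
    (6212127383916900, 17447157367127077, "d79d0373b9011594"),
    (71441025645011987, 68669839214899147, "9a80d6923ba5691d"),
    (71590642841442142, 41814121581079748, "9a80d6923ba5691d"),
]


def summarize(edge_rows):
    """Count edges and distinct umi1/umi2 nodes per component."""
    umi1_nodes = defaultdict(set)
    umi2_nodes = defaultdict(set)
    n_edges = defaultdict(int)
    for umi1, umi2, component in edge_rows:
        umi1_nodes[component].add(umi1)
        umi2_nodes[component].add(umi2)
        n_edges[component] += 1
    return {
        component: {
            "n_edges": n_edges[component],
            "n_umi1": len(umi1_nodes[component]),
            "n_umi2": len(umi2_nodes[component]),
            "n_umi": len(umi1_nodes[component]) + len(umi2_nodes[component]),
        }
        for component in n_edges
    }


summary = summarize(edges)
for component, stats in summary.items():
    print(component, stats)
```

With more than 100 million rows in the full edge list, the same grouping is better done with the data frame library's own group-by operations than with a Python loop.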

Aggregating data

When we want to process and analyze several samples together, it is convenient to merge them into a single object. To merge PXL files, we simply pass a list of file paths to the read() function.

!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"

pg_data_combined = read([
    DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl",
    DATA_DIR / "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl",
])

Calling any of the accessor functions on the combined object now returns data aggregated across both samples:

pg_data_combined.adata().obs
n_umi1 n_umi2 n_edges n_antibodies reads_in_component n_umi isotype_fraction intracellular_fraction tau_type tau ... average_k_core k_core_1 k_core_2 k_core_3 k_core_4 svd_var_expl_s1 svd_var_expl_s2 svd_var_expl_s3 B_nodes_mean_degree A_nodes_mean_degree
component
00259f58c70ad3a1 12682 16831 76755 143 207161 29513 0.000474 0.0 normal 0.970740 ... 2.886830 6409.0 3221.0 7184.0 12699.0 0.305689 0.144510 0.122012 4.560335 6.052279
002d152256da292c 18491 24404 109344 158 293426 42895 0.002121 0.0 normal 0.971252 ... 2.817065 10641.0 6290.0 7079.0 18045.0 0.258788 0.184087 0.164557 4.480577 5.913363
016247a40001f2d2 3635 4557 27302 115 66749 8192 0.000122 0.0 normal 0.975658 ... 3.687500 1709.0 759.0 965.0 1448.0 0.214890 0.137352 0.088650 5.991222 7.510867
0189d9c1054ffa9d 5915 6029 32331 130 85856 11944 0.000251 0.0 normal 0.968044 ... 3.083724 2608.0 1285.0 2234.0 4133.0 0.218786 0.154800 0.093147 5.362581 5.465934
019f4673ca347f70 15761 18184 93848 141 259873 33945 0.000206 0.0 normal 0.971870 ... 3.109029 6857.0 3134.0 5429.0 16501.0 0.222410 0.196024 0.148939 5.161021 5.954445
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ff3a0ce7e072da41 15618 22415 71715 158 163157 38033 0.000868 0.0 normal 0.970760 ... 2.184866 11059.0 9105.0 17648.0 221.0 0.268962 0.215883 0.156088 3.199420 4.591817
ff8043e7db6cec46 9961 13254 38293 158 87143 23215 0.001292 0.0 normal 0.938416 ... 1.989231 7296.0 8873.0 7046.0 0.0 0.277913 0.186823 0.164854 2.889166 3.844293
ff863128cf2cd4ad 9515 12783 36104 155 78148 22298 0.000852 0.0 normal 0.954077 ... 1.913849 7652.0 8915.0 5731.0 0.0 0.363647 0.193408 0.081201 2.824376 3.794430
ffbf15dbca4b1580 18862 25426 73312 158 163523 44288 0.001490 0.0 normal 0.952298 ... 1.976269 14435.0 16469.0 13384.0 0.0 0.406313 0.266356 0.155110 2.883348 3.886756
ffd0db962011b500 27491 37335 119640 158 254588 64826 0.000432 0.0 normal 0.965988 ... 2.136134 18735.0 18531.0 27560.0 0.0 0.283183 0.207696 0.158911 3.204500 4.351970

2137 rows × 22 columns
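
Once samples are merged, it is often useful to check how many components each sample contributes. The proximity and edge list tables shown earlier carry a sample column alongside component, so a per-sample tally can be sketched as below. The rows are hypothetical stand-ins (the component names are placeholders, not real identifiers), and components_per_sample is an illustrative helper, not a pixelator function.

```python
from collections import defaultdict

# Hypothetical (sample, component) pairs standing in for rows of an
# aggregated table; a component can appear on many rows.
rows = [
    ("PNA062_unstim_PBMCs_1000cells_S02_S2", "component_a"),
    ("PNA062_unstim_PBMCs_1000cells_S02_S2", "component_a"),
    ("PNA062_unstim_PBMCs_1000cells_S02_S2", "component_b"),
    ("PNA062_PHA_PBMCs_1000cells_S04_S4", "component_c"),
]


def components_per_sample(pairs):
    """Count the distinct components observed for each sample."""
    seen = defaultdict(set)
    for sample, component in pairs:
        seen[sample].add(component)
    return {sample: len(components) for sample, components in seen.items()}


counts = components_per_sample(rows)
for sample, n_components in counts.items():
    print(sample, n_components)
```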

We have now seen how to load Proximity Network Assay data, inspect its key properties and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to perform quality control on PNA data.