Data Handling
This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.
After completing this tutorial, you will be able to:
- Load PXL files in Python and access the multi-modal data contents including protein counts, metadata and spatial scores.
- Understand and work with spatial scores, i.e. protein clustering and colocalization (co-clustering) patterns.
- Directly access the spatial graph structure through the edge list.
- Aggregate multiple PXL files into an integrated data object with sample identities.
Setup
First, we load the packages and functions used in this tutorial.
from pathlib import Path
from pixelator import read_pna as read
# Directory where the example datasets will be downloaded
DATA_DIR = Path("<path to the directory to save datasets to>")
Loading data
We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.
baseurl = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
from pathlib import Path
pg_data = read(DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")
Component metadata
The object contains antibody counts and metadata for each connected component in the graph. Each graph component corresponds to a cell. The metadata table contains information that is useful in quality control, such as how many protein molecules were detected (n_umi), the sequencing depth (reads_in_component), and the graph connectivity of each component (average_k_core).
pg_data.adata().obs
component | n_umi1 | n_umi2 | n_edges | n_antibodies | reads_in_component | n_umi | isotype_fraction | intracellular_fraction | tau_type | tau | ... | k_core_3 | k_core_4 | k_core_5 | k_core_6 | k_core_7 | svd_var_expl_s1 | svd_var_expl_s2 | svd_var_expl_s3 | B_nodes_mean_degree | A_nodes_mean_degree |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00259f58c70ad3a1 | 12682 | 16831 | 76755 | 143 | 207161 | 29513 | 0.000474 | 0.0 | normal | 0.970740 | ... | 7184.0 | 12699.0 | 0.0 | 0.0 | 0.0 | 0.305689 | 0.144510 | 0.122012 | 4.560335 | 6.052279 |
002d152256da292c | 18491 | 24404 | 109344 | 158 | 293426 | 42895 | 0.002121 | 0.0 | normal | 0.971252 | ... | 7079.0 | 18045.0 | 840.0 | 0.0 | 0.0 | 0.258788 | 0.184087 | 0.164557 | 4.480577 | 5.913363 |
016247a40001f2d2 | 3635 | 4557 | 27302 | 115 | 66749 | 8192 | 0.000122 | 0.0 | normal | 0.975658 | ... | 965.0 | 1448.0 | 1572.0 | 1739.0 | 0.0 | 0.214890 | 0.137352 | 0.088650 | 5.991222 | 7.510867 |
0189d9c1054ffa9d | 5915 | 6029 | 32331 | 130 | 85856 | 11944 | 0.000251 | 0.0 | normal | 0.968044 | ... | 2234.0 | 4133.0 | 1684.0 | 0.0 | 0.0 | 0.218786 | 0.154800 | 0.093147 | 5.362581 | 5.465934 |
019f4673ca347f70 | 15761 | 18184 | 93848 | 141 | 259873 | 33945 | 0.000206 | 0.0 | normal | 0.971870 | ... | 5429.0 | 16501.0 | 2024.0 | 0.0 | 0.0 | 0.222410 | 0.196024 | 0.148939 | 5.161021 | 5.954445 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
fef58843c24dfc71 | 10323 | 13974 | 58222 | 147 | 153559 | 24297 | 0.000412 | 0.0 | normal | 0.977312 | ... | 5858.0 | 8580.0 | 0.0 | 0.0 | 0.0 | 0.281905 | 0.207809 | 0.146402 | 4.166452 | 5.640027 |
ff34828a9391db95 | 16742 | 23875 | 102427 | 155 | 277026 | 40617 | 0.000271 | 0.0 | normal | 0.984399 | ... | 8922.0 | 17030.0 | 0.0 | 0.0 | 0.0 | 0.248770 | 0.219871 | 0.103427 | 4.290136 | 6.117967 |
ff9a29e516f0349a | 9911 | 12773 | 67373 | 131 | 185195 | 22684 | 0.000132 | 0.0 | normal | 0.989573 | ... | 2507.0 | 7681.0 | 5995.0 | 0.0 | 0.0 | 0.307491 | 0.154692 | 0.126961 | 5.274642 | 6.797800 |
ffa01c7eebadf5f9 | 5708 | 7090 | 40348 | 108 | 113742 | 12798 | 0.000234 | 0.0 | normal | 0.986570 | ... | 1101.0 | 3061.0 | 4766.0 | 366.0 | 0.0 | 0.332866 | 0.168603 | 0.136671 | 5.690832 | 7.068676 |
fff8ca8e02383784 | 14157 | 18161 | 100723 | 149 | 261196 | 32318 | 0.000186 | 0.0 | normal | 0.962244 | ... | 4119.0 | 4856.0 | 14180.0 | 0.0 | 0.0 | 0.317734 | 0.123158 | 0.091305 | 5.546115 | 7.114714 |
1084 rows × 25 columns
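As an illustration of how this table can feed into quality control, here is a minimal sketch that filters components on n_umi and isotype_fraction. It runs on a small toy DataFrame built from a few rows of the table above rather than on the real object, and the thresholds (MIN_UMI, MAX_ISOTYPE_FRACTION) and the derived reads_per_umi column are assumptions made for the example, not recommended settings.

```python
import pandas as pd

# Toy stand-in for pg_data.adata().obs, using a few values from the table above.
obs = pd.DataFrame(
    {
        "n_umi": [29513, 42895, 8192, 11944],
        "reads_in_component": [207161, 293426, 66749, 85856],
        "isotype_fraction": [0.000474, 0.002121, 0.000122, 0.000251],
    },
    index=pd.Index(
        [
            "00259f58c70ad3a1",
            "002d152256da292c",
            "016247a40001f2d2",
            "0189d9c1054ffa9d",
        ],
        name="component",
    ),
)

# Hypothetical thresholds, chosen only to illustrate the filtering pattern.
MIN_UMI = 10_000
MAX_ISOTYPE_FRACTION = 0.001

# Rough ratio of sequencing reads to detected molecules per component.
obs["reads_per_umi"] = obs["reads_in_component"] / obs["n_umi"]

# Keep components that pass both (assumed) criteria.
keep = (obs["n_umi"] >= MIN_UMI) & (obs["isotype_fraction"] <= MAX_ISOTYPE_FRACTION)
filtered = obs[keep]

print(filtered[["n_umi", "reads_per_umi"]])
```

With the real data, the same boolean mask could be applied to the obs table returned by pg_data.adata(); suitable cutoffs depend on the experiment.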
The antibody count table is also present in the data object.
pg_data.adata().to_df()
component | HLA-ABC | B2M | CD11b | CD11c | CD18 | CD82 | CD8 | TCRab | HLA-DR | CD45 | ... | CX3CR1 | CD326 | CD209 | CD34 | CD369 | CD54 | CD71 | CD47 | CD117 | CD314 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00259f58c70ad3a1 | 1118 | 2438 | 1 | 2 | 100 | 91 | 39 | 83 | 12 | 2319 | ... | 5 | 0 | 1 | 2 | 4 | 17 | 0 | 134 | 4 | 2 |
002d152256da292c | 1391 | 3411 | 22 | 37 | 297 | 621 | 95 | 563 | 52 | 2908 | ... | 13 | 23 | 28 | 9 | 17 | 54 | 17 | 245 | 37 | 37 |
016247a40001f2d2 | 344 | 787 | 40 | 92 | 98 | 104 | 7 | 2 | 40 | 271 | ... | 2 | 1 | 0 | 1 | 15 | 44 | 2 | 23 | 0 | 1 |
0189d9c1054ffa9d | 832 | 1985 | 1 | 7 | 27 | 53 | 15 | 3 | 299 | 757 | ... | 1 | 0 | 1 | 2 | 4 | 20 | 0 | 38 | 2 | 2 |
019f4673ca347f70 | 4467 | 6267 | 2 | 222 | 151 | 712 | 52 | 9 | 1236 | 704 | ... | 0 | 3 | 0 | 0 | 4 | 20 | 11 | 116 | 4 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
fef58843c24dfc71 | 1411 | 3024 | 0 | 11 | 120 | 66 | 36 | 130 | 16 | 1134 | ... | 4 | 2 | 1 | 3 | 3 | 12 | 0 | 107 | 4 | 4 |
ff34828a9391db95 | 1873 | 3217 | 7 | 588 | 579 | 74 | 46 | 11 | 259 | 2253 | ... | 25 | 2 | 4 | 2 | 75 | 168 | 9 | 152 | 9 | 2 |
ff9a29e516f0349a | 1743 | 2371 | 2 | 6 | 12 | 295 | 29 | 5 | 6 | 86 | ... | 0 | 0 | 1 | 37 | 2 | 4 | 19 | 114 | 3 | 4 |
ffa01c7eebadf5f9 | 2281 | 3208 | 0 | 3 | 7 | 59 | 19 | 5 | 2 | 44 | ... | 0 | 0 | 0 | 2 | 2 | 2 | 8 | 31 | 4 | 0 |
fff8ca8e02383784 | 2778 | 4665 | 14 | 20 | 297 | 44 | 42 | 5 | 13 | 2038 | ... | 6 | 0 | 1 | 5 | 2 | 25 | 4 | 171 | 0 | 140 |
1084 rows × 158 columns
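Since the count table is a regular pandas DataFrame, standard operations apply to it directly. The sketch below works on a toy subset (three components and four markers copied from the table above) and computes each marker's share of the counts within that subset, plus the most abundant marker per component. The fractions are relative to the four markers shown here only, so they are not meaningful as true proportions.

```python
import pandas as pd

# Toy subset of pg_data.adata().to_df(): three components, four markers.
counts = pd.DataFrame(
    {
        "HLA-ABC": [1118, 1391, 344],
        "B2M": [2438, 3411, 787],
        "CD8": [39, 95, 7],
        "CD45": [2319, 2908, 271],
    },
    index=pd.Index(
        ["00259f58c70ad3a1", "002d152256da292c", "016247a40001f2d2"],
        name="component",
    ),
)

# Divide each row by its row total to get per-component fractions.
fractions = counts.div(counts.sum(axis=1), axis=0)

# Name of the marker with the highest count in each component.
top_marker = counts.idxmax(axis=1)

print(fractions.round(3))
print(top_marker)
```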
Spatial data
In addition to count data corresponding to protein abundance, the Proximity Network Assay also generates spatial data, which is used to find spatial structures of proteins across cells and to visualize individual cells. The PXL file comes with proximity scores that describe the degree of spatial clustering of each protein and the colocalization (co-clustering) of each protein pair on each cell. To read more about spatial metrics in Proximity Network Assay data, see the PNA proximity explanation.
The proximity scores can be retrieved through the proximity() method.
pg_data.proximity().to_df()
| | marker_1 | marker_2 | join_count | join_count_expected_mean | join_count_expected_sd | join_count_z | join_count_p | component | sample | marker_1_count | marker_2_count | min_count | log2_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HLA-DR-DP-DQ | HLA-DR-DP-DQ | 1 | 0.08 | 0.272660 | 0.920000 | 0.178786 | 7f0909a1212cd653 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 17 | 17 | 17 | 0.000000 |
1 | HLA-DR-DP-DQ | IgE | 0 | 0.00 | 0.000000 | 0.000000 | 0.500000 | 7f0909a1212cd653 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 17 | 1 | 1 | 0.000000 |
2 | HLA-DR-DP-DQ | TIGIT | 0 | 0.00 | 0.000000 | 0.000000 | 0.500000 | 7f0909a1212cd653 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 17 | 1 | 1 | 0.000000 |
3 | HLA-DR-DP-DQ | KLRG1 | 0 | 0.02 | 0.140705 | -0.020000 | 0.492022 | 7f0909a1212cd653 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 17 | 3 | 3 | 0.000000 |
4 | HLA-DR-DP-DQ | Siglec-9 | 0 | 0.03 | 0.171447 | -0.030000 | 0.488034 | 7f0909a1212cd653 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 17 | 4 | 4 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11542966 | CD49e | CD7 | 0 | 0.28 | 0.494004 | -0.280000 | 0.389739 | 459051ae274b9701 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 68 | 21 | 21 | 0.000000 |
11542967 | CD49e | KLRG1 | 0 | 0.25 | 0.519810 | -0.250000 | 0.401294 | 459051ae274b9701 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 68 | 19 | 19 | 0.000000 |
11542968 | CD49e | CD82 | 6 | 6.55 | 2.893305 | -0.190094 | 0.424618 | 459051ae274b9701 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 68 | 588 | 68 | -0.126532 |
11542969 | CD49e | CD5 | 0 | 0.50 | 0.674200 | -0.500000 | 0.308538 | 459051ae274b9701 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 68 | 45 | 45 | 0.000000 |
11542970 | CD49e | CD49e | 0 | 0.42 | 0.606030 | -0.420000 | 0.337243 | 459051ae274b9701 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 68 | 68 | 68 | 0.000000 |
11542971 rows × 13 columns
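With more than eleven million rows, the proximity table usually needs to be narrowed down before it is inspected. The sketch below uses a toy DataFrame with a few rows and columns from the table above and separates self-pairs from cross-pairs after dropping low-abundance rows. Two assumptions are made here: that rows where marker_1 equals marker_2 describe the clustering of a single protein, and that a min_count cutoff (the hypothetical MIN_COUNT) is a sensible way to discard pairs with too few molecules.

```python
import pandas as pd

# Toy stand-in for pg_data.proximity().to_df(), with rows from the table above.
prox = pd.DataFrame(
    {
        "marker_1": ["HLA-DR-DP-DQ", "HLA-DR-DP-DQ", "HLA-DR-DP-DQ", "CD49e", "CD49e", "CD49e"],
        "marker_2": ["HLA-DR-DP-DQ", "IgE", "KLRG1", "CD7", "CD82", "CD49e"],
        "join_count_z": [0.92, 0.0, -0.02, -0.28, -0.190094, -0.42],
        "min_count": [17, 1, 3, 21, 68, 68],
        "component": ["7f0909a1212cd653"] * 3 + ["459051ae274b9701"] * 3,
    }
)

# Hypothetical abundance cutoff, used only for illustration.
MIN_COUNT = 10

is_self = prox["marker_1"] == prox["marker_2"]
reliable = prox["min_count"] >= MIN_COUNT

# Self-pairs: assumed to reflect clustering of one protein.
clustering = prox[is_self & reliable]
# Cross-pairs: colocalization between two different proteins.
colocalization = prox[~is_self & reliable]

print(clustering[["marker_1", "component", "join_count_z"]])
print(colocalization[["marker_1", "marker_2", "join_count_z"]])
```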
Edgelist
The edge list is a list of detected links between two uniquely identified antibodies. Together, these edges constitute the graph that makes up a Proximity Network Assay experiment. Each row describes one edge, identified by umi1 and umi2, the unique identifiers of the two antibodies, along with marker_1 and marker_2, the proteins that correspond to the respective antibodies. The edge list is used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.
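To make the graph structure concrete, here is a minimal standard-library sketch that turns edge list rows into one adjacency map per component and computes node degrees. The rows are made-up toy values (small integer UMIs and the component names c1 and c2), and treating umi1 and umi2 as two separate node sets is an assumption of this example.

```python
from collections import defaultdict

# Toy rows: (marker_1, marker_2, umi1, umi2, component). Values are made up.
edges = [
    ("CD41", "CD36", 1, 101, "c1"),
    ("CD41", "CD84", 1, 102, "c1"),
    ("B2M", "CD36", 2, 101, "c1"),
    ("B2M", "CD43", 3, 201, "c2"),
]

# component -> node -> set of neighbouring nodes
adjacency = defaultdict(lambda: defaultdict(set))
# node -> protein marker carried by that antibody
node_marker = {}

for marker_1, marker_2, umi1, umi2, component in edges:
    # Tag each node with its side so umi1 and umi2 values never collide.
    a = ("umi1", umi1)
    b = ("umi2", umi2)
    adjacency[component][a].add(b)
    adjacency[component][b].add(a)
    node_marker[a] = marker_1
    node_marker[b] = marker_2

# Degree of every node, per component.
degrees = {
    component: {node: len(neighbours) for node, neighbours in nodes.items()}
    for component, nodes in adjacency.items()
}

for component, node_degrees in degrees.items():
    print(component, node_degrees)
```

For real data, a dedicated graph library is likely a better fit than nested dictionaries; the point here is only to show how rows map to nodes and edges.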
pg_data.edgelist().to_polars()
marker_1 | marker_2 | umi1 | umi2 | read_count | uei_count | corrected_read_count | component | sample |
---|---|---|---|---|---|---|---|---|
str | str | u64 | u64 | u32 | u16 | u32 | str | str |
"CD41" | "CD36" | 1536508673753445 | 17160439996589472 | 1 | 1 | 0 | "d79d0373b9011594" | "PNA062_unstim_PBMCs_1000cells_… |
"CD41" | "CD84" | 3756592463452849 | 21225509255780229 | 2 | 2 | 0 | "d79d0373b9011594" | "PNA062_unstim_PBMCs_1000cells_… |
"CD41" | "CD41" | 6212127383916900 | 17447157367127077 | 2 | 2 | 0 | "d79d0373b9011594" | "PNA062_unstim_PBMCs_1000cells_… |
"B2M" | "CD36" | 6602866048587088 | 21565231029558269 | 3 | 2 | 0 | "d79d0373b9011594" | "PNA062_unstim_PBMCs_1000cells_… |
"CD36" | "CD36" | 7466584021219948 | 11613682042882452 | 1 | 1 | 0 | "d79d0373b9011594" | "PNA062_unstim_PBMCs_1000cells_… |
… | … | … | … | … | … | … | … | … |
"HLA-ABC" | "CD102" | 71441025645011987 | 68669839214899147 | 4 | 2 | 1 | "9a80d6923ba5691d" | "PNA062_unstim_PBMCs_1000cells_… |
"CD43" | "CD43" | 71590642841442142 | 41814121581079748 | 5 | 3 | 0 | "9a80d6923ba5691d" | "PNA062_unstim_PBMCs_1000cells_… |
"CD53" | "HLA-ABC" | 71663951675489002 | 42621753378576454 | 1 | 1 | 0 | "9a80d6923ba5691d" | "PNA062_unstim_PBMCs_1000cells_… |
"CD352" | "CD328" | 71863271622104604 | 19673970718705989 | 5 | 2 | 0 | "9a80d6923ba5691d" | "PNA062_unstim_PBMCs_1000cells_… |
"B2M" | "CD43" | 72032326171452283 | 64807303907228061 | 5 | 1 | 0 | "9a80d6923ba5691d" | "PNA062_unstim_PBMCs_1000cells_… |
Aggregating data
For cases when we want to process and analyze samples in parallel, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into a single object. To merge PXL files, we simply pass a list of file paths to the read() function.
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
!curl -L -O -C - --create-dirs --output-dir {DATA_DIR} "{baseurl}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"
pg_data_combined = read([
DATA_DIR / "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl",
DATA_DIR / "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl",
])
Calling any of the methods on this object now returns the aggregated data from both samples:
pg_data_combined.adata().obs
component | n_umi1 | n_umi2 | n_edges | n_antibodies | reads_in_component | n_umi | isotype_fraction | intracellular_fraction | tau_type | tau | ... | average_k_core | k_core_1 | k_core_2 | k_core_3 | k_core_4 | svd_var_expl_s1 | svd_var_expl_s2 | svd_var_expl_s3 | B_nodes_mean_degree | A_nodes_mean_degree |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00259f58c70ad3a1 | 12682 | 16831 | 76755 | 143 | 207161 | 29513 | 0.000474 | 0.0 | normal | 0.970740 | ... | 2.886830 | 6409.0 | 3221.0 | 7184.0 | 12699.0 | 0.305689 | 0.144510 | 0.122012 | 4.560335 | 6.052279 |
002d152256da292c | 18491 | 24404 | 109344 | 158 | 293426 | 42895 | 0.002121 | 0.0 | normal | 0.971252 | ... | 2.817065 | 10641.0 | 6290.0 | 7079.0 | 18045.0 | 0.258788 | 0.184087 | 0.164557 | 4.480577 | 5.913363 |
016247a40001f2d2 | 3635 | 4557 | 27302 | 115 | 66749 | 8192 | 0.000122 | 0.0 | normal | 0.975658 | ... | 3.687500 | 1709.0 | 759.0 | 965.0 | 1448.0 | 0.214890 | 0.137352 | 0.088650 | 5.991222 | 7.510867 |
0189d9c1054ffa9d | 5915 | 6029 | 32331 | 130 | 85856 | 11944 | 0.000251 | 0.0 | normal | 0.968044 | ... | 3.083724 | 2608.0 | 1285.0 | 2234.0 | 4133.0 | 0.218786 | 0.154800 | 0.093147 | 5.362581 | 5.465934 |
019f4673ca347f70 | 15761 | 18184 | 93848 | 141 | 259873 | 33945 | 0.000206 | 0.0 | normal | 0.971870 | ... | 3.109029 | 6857.0 | 3134.0 | 5429.0 | 16501.0 | 0.222410 | 0.196024 | 0.148939 | 5.161021 | 5.954445 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ff3a0ce7e072da41 | 15618 | 22415 | 71715 | 158 | 163157 | 38033 | 0.000868 | 0.0 | normal | 0.970760 | ... | 2.184866 | 11059.0 | 9105.0 | 17648.0 | 221.0 | 0.268962 | 0.215883 | 0.156088 | 3.199420 | 4.591817 |
ff8043e7db6cec46 | 9961 | 13254 | 38293 | 158 | 87143 | 23215 | 0.001292 | 0.0 | normal | 0.938416 | ... | 1.989231 | 7296.0 | 8873.0 | 7046.0 | 0.0 | 0.277913 | 0.186823 | 0.164854 | 2.889166 | 3.844293 |
ff863128cf2cd4ad | 9515 | 12783 | 36104 | 155 | 78148 | 22298 | 0.000852 | 0.0 | normal | 0.954077 | ... | 1.913849 | 7652.0 | 8915.0 | 5731.0 | 0.0 | 0.363647 | 0.193408 | 0.081201 | 2.824376 | 3.794430 |
ffbf15dbca4b1580 | 18862 | 25426 | 73312 | 158 | 163523 | 44288 | 0.001490 | 0.0 | normal | 0.952298 | ... | 1.976269 | 14435.0 | 16469.0 | 13384.0 | 0.0 | 0.406313 | 0.266356 | 0.155110 | 2.883348 | 3.886756 |
ffd0db962011b500 | 27491 | 37335 | 119640 | 158 | 254588 | 64826 | 0.000432 | 0.0 | normal | 0.965988 | ... | 2.136134 | 18735.0 | 18531.0 | 27560.0 | 0.0 | 0.283183 | 0.207696 | 0.158911 | 3.204500 | 4.351970 |
2137 rows × 22 columns
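Once samples are aggregated, results often need to be split by sample again. The proximity and edge list tables shown earlier carry a sample column; assuming the aggregated tables keep that column, a per-sample summary could be sketched as below. The table is a toy stand-in with made-up component names and scores, and the sample names are assumed to match the file names without the .layout.pxl suffix.

```python
import pandas as pd

UNSTIM = "PNA062_unstim_PBMCs_1000cells_S02_S2"
PHA = "PNA062_PHA_PBMCs_1000cells_S04_S4"

# Toy stand-in for an aggregated table; components and scores are made up.
table = pd.DataFrame(
    {
        "component": ["comp_a", "comp_a", "comp_b", "comp_c"],
        "sample": [UNSTIM, UNSTIM, UNSTIM, PHA],
        "join_count_z": [0.9, -0.1, -0.3, 0.5],
    }
)

# Number of distinct components (cells) contributed by each sample.
components_per_sample = table.groupby("sample")["component"].nunique()

# Mean score per sample, as an example of a per-sample statistic.
mean_z_per_sample = table.groupby("sample")["join_count_z"].mean()

print(components_per_sample)
print(mean_z_per_sample)
```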
We have now seen how to load Proximity Network Assay data, inspect its key properties, and prepare the data for integrated analysis across experimental samples. In the next tutorial, we will apply these skills to perform quality control of PNA data.