Data Handling
This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.
After completing this tutorial, you will be able to:
- Build Seurat objects from PXL files and access the multi-modal PNA data including protein abundance data, spatial data and additional metadata.
- Access spatial scores which quantify protein clustering and colocalization (co-clustering) patterns.
- Access the edge list representing component (cell) graphs.
- Aggregate multiple PNA datasets into a single Seurat object.
Setup
First, we will load some packages needed for the tutorial.
library(pixelatorR)
library(SeuratObject)
library(dplyr)
library(stringr)
library(here)
library(tibble)
Loading data
We begin by locating the output from the pixelator
pipeline. The input for
downstream analysis is the PXL file contained in the layout
directory.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
Run the following code in the terminal to download the dataset to your local machine.
And then proceed in R to load the PNA data into Seurat object.
# Collect the path to the PXL file
filename <- file.path(DATA_DIR, "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")
# Load the PXL file using the read function in the pixelatorR package
pg_data <- ReadPNA_Seurat(filename)
If you like to know more about the functions provided in the pixelatorR
package, you can access the function documentation by running for
example ?ReadPNA_Seurat
in the R console.
Component meta data
In PNA data, the term “component” generally corresponds to an individual cell. However, it’s important to note that in some instances, a component might represent other entities such as cell doublets, fragments of cells, or even antibody aggregates. For the sake of simplicity in these tutorials, we will primarily use “component” and “cell” as synonymous terms.
In addition to counts and spatial information, the Seurat object includes a component-level metadata table. This table contains key quality control metrics such as antibody counts (n_umi), sequencing depth (reads_in_component), and the number of detected proteins (n_antibodies).
as_tibble(pg_data[[]], rownames = "component") %>%
head() %>%
knitr::kable()
component | n_umi1 | n_umi2 | n_edges | n_antibodies | reads_in_component | n_umi | isotype_fraction | intracellular_fraction | tau_type | tau | sample | antibodies | average_k_core | k_core_1 | k_core_2 | k_core_3 | k_core_4 | k_core_5 | k_core_6 | k_core_7 | svd_var_expl_s1 | svd_var_expl_s2 | svd_var_expl_s3 | B_nodes_mean_degree | A_nodes_mean_degree |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00259f58c70ad3a1 | 12682 | 16831 | 76755 | 143 | 207161 | 29513 | 0.0004744 | 0 | normal | 0.9707400 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 143 | 2.886830 | 6409 | 3221 | 7184 | 12699 | 0 | 0 | 0 | 0.3056892 | 0.1445095 | 0.1220118 | 4.560335 | 6.052279 |
002d152256da292c | 18491 | 24404 | 109344 | 158 | 293426 | 42895 | 0.0021215 | 0 | normal | 0.9712516 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 158 | 2.817065 | 10641 | 6290 | 7079 | 18045 | 840 | 0 | 0 | 0.2587881 | 0.1840865 | 0.1645575 | 4.480577 | 5.913363 |
016247a40001f2d2 | 3635 | 4557 | 27302 | 115 | 66749 | 8192 | 0.0001221 | 0 | normal | 0.9756582 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 115 | 3.687500 | 1709 | 759 | 965 | 1448 | 1572 | 1739 | 0 | 0.2148896 | 0.1373523 | 0.0886505 | 5.991222 | 7.510867 |
0189d9c1054ffa9d | 5915 | 6029 | 32331 | 130 | 85856 | 11944 | 0.0002512 | 0 | normal | 0.9680438 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 130 | 3.083724 | 2608 | 1285 | 2234 | 4133 | 1684 | 0 | 0 | 0.2187860 | 0.1547995 | 0.0931473 | 5.362581 | 5.465934 |
019f4673ca347f70 | 15761 | 18184 | 93848 | 141 | 259873 | 33945 | 0.0002062 | 0 | normal | 0.9718696 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 141 | 3.109029 | 6857 | 3134 | 5429 | 16501 | 2024 | 0 | 0 | 0.2224099 | 0.1960237 | 0.1489391 | 5.161021 | 5.954445 |
01cf357f3786c50f | 15818 | 21149 | 91233 | 149 | 246791 | 36967 | 0.0001353 | 0 | normal | 0.9719708 | PNA062_unstim_PBMCs_1000cells_S02_S2 | 149 | 2.845078 | 8440 | 4378 | 8618 | 15531 | 0 | 0 | 0 | 0.3095300 | 0.1964658 | 0.1241150 | 4.313821 | 5.767670 |
Count data
The antibody count matrix can be accessed from the Seurat object using
the LayerData
method.
# Display the 10 first rows and 10 first columns of the antibody counts
LayerData(pg_data, layer = "counts")[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
HLA-ABC 1118 1391 344 832 4467 2199 1511 2683 7494 3170
B2M 2438 3411 787 1985 6267 4175 2520 4714 10978 6034
CD11b 1 22 40 1 2 1 48 4 6 56
CD11c 2 37 92 7 222 4 208 8 24 537
CD18 100 297 98 27 151 159 513 407 74 403
CD82 91 621 104 53 712 578 363 166 924 403
CD8 39 95 7 15 52 60 40 1584 131 104
TCRab 83 563 2 3 9 485 7 109 26 12
HLA-DR 12 52 40 299 1236 31 149 21 45 2129
CD45 2319 2908 271 757 704 2145 892 4442 338 1241
Spatial data
Beyond protein abundance, the Proximity Network Assay (PNA) generates spatial data that enables the analysis of protein proximity patterns both within and between cells. Moreover, by leveraging graph drawing methods, it is even possible to create 3D layouts of the cells for visualization.
The PXL file includes a table of proximity scores, quantifying the spatial clustering of individual proteins and the co-localization of protein pairs on each cell.
pixelatorR Assay classes
PixelatorR introduces two important Seurat extensions to store PNA data:
PNAAssay
(based on Assay
from Seurat v3) and PNAAssay5
(based on
Assay5
from Seurat v5). These two classes are used to store abundance
data, just like an Assay
or Assay5
, but also spatial metrics and
CellGraph
objects. When a Seurat object is created from a PXL file
with ReadPNA_Seurat
, the data is automatically converted into a
PNAAssay5
object named “PNA” which can be accessed as follows:
pg_data[["PNA"]]
PNAAssay (v5) data with 158 features for 1084 cells
Top 10 variable features:
HLA-ABC, B2M, CD11b, CD11c, CD18, CD82, CD8, TCRab, HLA-DR, CD45
Layers:
counts
Loaded CellGraph objects:
0
In the output above, we see that there are 158 proteins and 1084 cells
available and that the object currently contains 0 loaded CellGraph
objects. More about that later.
PNA Proximity scores
The spatial metrics can be accessed from the PNAAssay5
object with:
ProximityScores(pg_data) %>%
head() %>%
knitr::kable()
marker_1 | marker_2 | join_count | join_count_expected_mean | join_count_expected_sd | join_count_z | join_count_p | component | log2_ratio |
---|---|---|---|---|---|---|---|---|
HLA-DR-DP-DQ | HLA-DR-DP-DQ | 1 | 0.08 | 0.2726599 | 0.92 | 0.1787864 | 7f0909a1212cd653 | 0 |
HLA-DR-DP-DQ | IgE | 0 | 0.00 | 0.0000000 | 0.00 | 0.5000000 | 7f0909a1212cd653 | 0 |
HLA-DR-DP-DQ | TIGIT | 0 | 0.00 | 0.0000000 | 0.00 | 0.5000000 | 7f0909a1212cd653 | 0 |
HLA-DR-DP-DQ | KLRG1 | 0 | 0.02 | 0.1407053 | -0.02 | 0.4920217 | 7f0909a1212cd653 | 0 |
HLA-DR-DP-DQ | Siglec-9 | 0 | 0.03 | 0.1714466 | -0.03 | 0.4880335 | 7f0909a1212cd653 | 0 |
HLA-DR-DP-DQ | TCRVd2 | 0 | 0.04 | 0.1969464 | -0.04 | 0.4840466 | 7f0909a1212cd653 | 0 |
Edgelist
The edgelist table lists linked protein pairs. Each row represents an edge, identified by two integer-encoded UMI barcodes (UMI1 and UMI2). Each UMI uniquely maps to a single protein (UMI1 to marker_1, UMI2 to marker_2).
This edgelist is a form of graph representation where UMI1 and UMI2 represent distinct sets of nodes (vertices), and a pair of UMI1/UMI2 denote edges (links) between them. This bipartite structure, inherent to how relationships are formed in the Proximity Network Assay, means a UMI1 barcode can only connect to a UMI2 barcode, and vice versa. In more technical terms, a PNA component graph is a simple, undirected bipartite graph. Given that edges typically form between spatially proximal proteins, the PNA graph is also a spatial graph.
Importantly, the edgelist is organized by component (individual cell), with no edges spanning across components. While a PXL file’s edgelist can be large (e.g., over 100 million rows in our sample), making direct manipulation cumbersome, pixelatorR offers tools to convert these component edgelists into more manageable graph representations for analysis.
Using the ReadPNA_edgelist
function, we can for example lazy load the
edgelist from the PXL file without loading it into memory. The resulting
object is a tbl_lazy
object, which can be manipulated with most
dplyr
verbs (thanks to the dbplyr
R package) and materialized in
memory when needed with collect
.
edge_list <- ReadPNA_edgelist(filename)
edge_list %>%
head() %>%
collect() %>%
knitr::kable()
marker_1 | marker_2 | umi1 | umi2 | read_count | uei_count | corrected_read_count | component |
---|---|---|---|---|---|---|---|
CD52 | B2M | 23408400476442 | 28306897172026797 | 1 | 1 | 0 | 00259f58c70ad3a1 |
CD52 | CD52 | 443712860184452 | 55804500072166761 | 1 | 1 | 0 | 00259f58c70ad3a1 |
CD44 | CD44 | 722767188280986 | 44328367009958117 | 1 | 1 | 0 | 00259f58c70ad3a1 |
CD45 | CD59 | 1449711542454126 | 1569654851068915 | 2 | 2 | 0 | 00259f58c70ad3a1 |
CD3e | CD48 | 1538057282869715 | 44931945321675157 | 14 | 10 | 1 | 00259f58c70ad3a1 |
CD52 | CD52 | 1632028877538373 | 22620331969975541 | 1 | 1 | 0 | 00259f58c70ad3a1 |
Seurat objects and PXL files
For some analyses, we will need to access information from the edgelist.
For instance, if we want to load component graphs for visualization, we
fetch the relevant component edges. It is important to note that the
edgelist is not stored in the Seurat object, it’s only available in the
PXL file. The PNAAssay5
, mentioned earlier in this tutorial, stores
the path(s) to the PXL file(s) from which the Seurat object was built
from so that we can access the edgelist when needed. This has the
advantage that we can keep the size of the Seurat object relatively
small until we decide to load the graphs. On the down side, it also
means that the path(s) to the PXL file(s) need to remain unchanged. If
the PXL file(s) are moved or renamed, we can no longer access them.
This has an important implication when saving and sharing Seurat objects
containing PNA data. If you want to export a Seurat object as an RDS
file and share it with someone else, you should also share associated
the PXL file(s). In addition, the path(s) to the PXL file(s) has to be
updated in the Seurat object. If you want to see the PXL file path(s) or
update them, you can use the FSMap
method (see ?FSMap
for more
details):
# Get the path to the PXL file
FSMap(pg_data)
# A tibble: 1 × 3
id_map sample pxl_file
<list> <int> <chr>
1 <tibble [1,084 × 2]> 1 data/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.…
If this is unclear at the moment, don’t worry. We will come back to this topic in the tutorial about cell visualization.
Aggregating data
For cases when we want to process and analyze multiple PNA datasets, we
can load the corresponding PXL files as Seurat objects and merge
them.
The cell IDs should be unique across different samples, but you can
provide a unique prefix to each sample by passing a character vector to
the add.cell.ids
parameter in merge
.
In the code chunk below, we will create a merged object from two PXL files. These PXL files represent a resting and PHA stimulated PBMC sample. We will create a merged object Seurat and update the object’s metadata to keep information abiut the condition. The resulting object will be explored further in the next tutorial on Quality Control.
BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"
# The folders the data is located in:
data_files <- c(
# Resting
resting = file.path(DATA_DIR, "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"),
# PHA
stimulated = file.path(DATA_DIR, "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl")
)
# Read PXL files as a list of Seurat objects
obj_list <- lapply(data_files, ReadPNA_Seurat)
# Combine files into a single Seurat object
pg_data_combined <- merge(obj_list[[1]], y = obj_list[-1], add.cell.ids = names(obj_list))
# Add meta data describing the sample and donor origin
pg_data_combined <- AddMetaData(
pg_data_combined,
metadata = data.frame(id = colnames(pg_data_combined)) %>%
mutate(condition = str_remove(id, "_.*")) %>%
column_to_rownames("id")
)
pg_data_combined
An object of class Seurat
158 features across 2137 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
2 layers present: counts.1, counts.2
As you can see in the object description above, the count data in the
merged Seurat object is separated in multiple matrices (“counts.1”,
“counts.2”, …). Our PNA data is stored in a PNAAssay5
object which
extends the Assay5
object introduced in Seurat v5 which keeps the
sample count matrices separated after merging. For many downstream
applications, one needs a single merged count matrix.
One option is to set options(Seurat.object.assay.version = "v3")
and
rerun this tutorial in which case the PNA data will be stored in a
PNAAssay
(Seurat v3) instead. Alternatively, you can join the data
using the JoinLayers
method. After joining, you should see that
there’s now only 1 layer present.
pg_data_combined <- JoinLayers(pg_data_combined)
pg_data_combined
An object of class Seurat
158 features across 2137 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts
Save data
After combining the samples, we can save the merged data to an RDS file to use for later. Remember that a Seurat object containing PNA data is associated with one or several a PXL file(s), so you need to share the PXL file(s) along with the RDS file to ensure that all pixelatorR features can be used by the recipient. If the component graphs are not needed for subsequent analysis, the PXL file(s) are not needed.
# This is an optional pause step; if you prefer to continue to the next tutorial without saving the object, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data.rds"))
We have now learned how to load multiple PNA datasets, inspect its key properties and prepare the data for downstream analysis. In the next tutorial we will apply these skills and demonstrate how to quality control PNA data.