Skip to main content

Data Handling

This tutorial describes how to start working with the Proximity Network Assay (PNA) output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Build Seurat objects from PXL files and access the multi-modal PNA data including protein abundance data, spatial data and additional metadata.
  • Access spatial scores which quantify protein clustering and colocalization (co-clustering) patterns.
  • Access the edge list representing component (cell) graphs.
  • Aggregate multiple PNA datasets into a single Seurat object.

Setup

First, we will load some packages needed for the tutorial.

library(pixelatorR)
library(SeuratObject)
library(dplyr)
library(stringr)
library(here)
library(tibble)

Loading data

We begin by locating the output from the pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.

DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)

Run the following code in the terminal to download the dataset to your local machine.

And then proceed in R to load the PNA data into Seurat object.

# Collect the path to the PXL file
filename <- file.path(DATA_DIR, "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl")

# Load the PXL file using the read function in the pixelatorR package
pg_data <- ReadPNA_Seurat(filename)
Accessing the function documentation

If you like to know more about the functions provided in the pixelatorR package, you can access the function documentation by running for example ?ReadPNA_Seurat in the R console.

Component meta data

In PNA data, the term “component” generally corresponds to an individual cell. However, it’s important to note that in some instances, a component might represent other entities such as cell doublets, fragments of cells, or even antibody aggregates. For the sake of simplicity in these tutorials, we will primarily use “component” and “cell” as synonymous terms.

In addition to counts and spatial information, the Seurat object includes a component-level metadata table. This table contains key quality control metrics such as antibody counts (n_umi), sequencing depth (reads_in_component), and the number of detected proteins (n_antibodies).

as_tibble(pg_data[[]], rownames = "component") %>% 
head() %>%
knitr::kable()
component n_umi1 n_umi2 n_edges n_antibodies reads_in_component n_umi isotype_fraction intracellular_fraction tau_type tau sample antibodies average_k_core k_core_1 k_core_2 k_core_3 k_core_4 k_core_5 k_core_6 k_core_7 svd_var_expl_s1 svd_var_expl_s2 svd_var_expl_s3 B_nodes_mean_degree A_nodes_mean_degree
00259f58c70ad3a1 12682 16831 76755 143 207161 29513 0.0004744 0 normal 0.9707400 PNA062_unstim_PBMCs_1000cells_S02_S2 143 2.886830 6409 3221 7184 12699 0 0 0 0.3056892 0.1445095 0.1220118 4.560335 6.052279
002d152256da292c 18491 24404 109344 158 293426 42895 0.0021215 0 normal 0.9712516 PNA062_unstim_PBMCs_1000cells_S02_S2 158 2.817065 10641 6290 7079 18045 840 0 0 0.2587881 0.1840865 0.1645575 4.480577 5.913363
016247a40001f2d2 3635 4557 27302 115 66749 8192 0.0001221 0 normal 0.9756582 PNA062_unstim_PBMCs_1000cells_S02_S2 115 3.687500 1709 759 965 1448 1572 1739 0 0.2148896 0.1373523 0.0886505 5.991222 7.510867
0189d9c1054ffa9d 5915 6029 32331 130 85856 11944 0.0002512 0 normal 0.9680438 PNA062_unstim_PBMCs_1000cells_S02_S2 130 3.083724 2608 1285 2234 4133 1684 0 0 0.2187860 0.1547995 0.0931473 5.362581 5.465934
019f4673ca347f70 15761 18184 93848 141 259873 33945 0.0002062 0 normal 0.9718696 PNA062_unstim_PBMCs_1000cells_S02_S2 141 3.109029 6857 3134 5429 16501 2024 0 0 0.2224099 0.1960237 0.1489391 5.161021 5.954445
01cf357f3786c50f 15818 21149 91233 149 246791 36967 0.0001353 0 normal 0.9719708 PNA062_unstim_PBMCs_1000cells_S02_S2 149 2.845078 8440 4378 8618 15531 0 0 0 0.3095300 0.1964658 0.1241150 4.313821 5.767670

Count data

The antibody count matrix can be accessed from the Seurat object using the LayerData method.

# Display the 10 first rows and 10 first columns of the antibody counts
LayerData(pg_data, layer = "counts")[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
                                                         
HLA-ABC 1118 1391 344 832 4467 2199 1511 2683 7494 3170
B2M 2438 3411 787 1985 6267 4175 2520 4714 10978 6034
CD11b 1 22 40 1 2 1 48 4 6 56
CD11c 2 37 92 7 222 4 208 8 24 537
CD18 100 297 98 27 151 159 513 407 74 403
CD82 91 621 104 53 712 578 363 166 924 403
CD8 39 95 7 15 52 60 40 1584 131 104
TCRab 83 563 2 3 9 485 7 109 26 12
HLA-DR 12 52 40 299 1236 31 149 21 45 2129
CD45 2319 2908 271 757 704 2145 892 4442 338 1241

Spatial data

Beyond protein abundance, the Proximity Network Assay (PNA) generates spatial data that enables the analysis of protein proximity patterns both within and between cells. Moreover, by leveraging graph drawing methods, it is even possible to create 3D layouts of the cells for visualization.

The PXL file includes a table of proximity scores, quantifying the spatial clustering of individual proteins and the co-localization of protein pairs on each cell.

pixelatorR Assay classes

PixelatorR introduces two important Seurat extensions to store PNA data: PNAAssay (based on Assay from Seurat v3) and PNAAssay5 (based on Assay5 from Seurat v5). These two classes are used to store abundance data, just like an Assay or Assay5, but also spatial metrics and CellGraph objects. When a Seurat object is created from a PXL file with ReadPNA_Seurat, the data is automatically converted into a PNAAssay5 object named “PNA” which can be accessed as follows:

pg_data[["PNA"]]
PNAAssay (v5) data with 158 features for 1084 cells
Top 10 variable features:
HLA-ABC, B2M, CD11b, CD11c, CD18, CD82, CD8, TCRab, HLA-DR, CD45
Layers:
counts
Loaded CellGraph objects:
0

In the output above, we see that there are 158 proteins and 1084 cells available and that the object currently contains 0 loaded CellGraph objects. More about that later.

PNA Proximity scores

The spatial metrics can be accessed from the PNAAssay5 object with:

ProximityScores(pg_data) %>% 
head() %>%
knitr::kable()
marker_1 marker_2 join_count join_count_expected_mean join_count_expected_sd join_count_z join_count_p component log2_ratio
HLA-DR-DP-DQ HLA-DR-DP-DQ 1 0.08 0.2726599 0.92 0.1787864 7f0909a1212cd653 0
HLA-DR-DP-DQ IgE 0 0.00 0.0000000 0.00 0.5000000 7f0909a1212cd653 0
HLA-DR-DP-DQ TIGIT 0 0.00 0.0000000 0.00 0.5000000 7f0909a1212cd653 0
HLA-DR-DP-DQ KLRG1 0 0.02 0.1407053 -0.02 0.4920217 7f0909a1212cd653 0
HLA-DR-DP-DQ Siglec-9 0 0.03 0.1714466 -0.03 0.4880335 7f0909a1212cd653 0
HLA-DR-DP-DQ TCRVd2 0 0.04 0.1969464 -0.04 0.4840466 7f0909a1212cd653 0

Edgelist

The edgelist table lists linked protein pairs. Each row represents an edge, identified by two integer-encoded UMI barcodes (UMI1 and UMI2). Each UMI uniquely maps to a single protein (UMI1 to marker_1, UMI2 to marker_2).

This edgelist is a form of graph representation where UMI1 and UMI2 represent distinct sets of nodes (vertices), and a pair of UMI1/UMI2 denote edges (links) between them. This bipartite structure, inherent to how relationships are formed in the Proximity Network Assay, means a UMI1 barcode can only connect to a UMI2 barcode, and vice versa. In more technical terms, a PNA component graph is a simple, undirected bipartite graph. Given that edges typically form between spatially proximal proteins, the PNA graph is also a spatial graph.

Importantly, the edgelist is organized by component (individual cell), with no edges spanning across components. While a PXL file’s edgelist can be large (e.g., over 100 million rows in our sample), making direct manipulation cumbersome, pixelatorR offers tools to convert these component edgelists into more manageable graph representations for analysis.

Using the ReadPNA_edgelist function, we can for example lazy load the edgelist from the PXL file without loading it into memory. The resulting object is a tbl_lazy object, which can be manipulated with most dplyr verbs (thanks to the dbplyr R package) and materialized in memory when needed with collect.

edge_list <- ReadPNA_edgelist(filename)
edge_list %>%
head() %>%
collect() %>%
knitr::kable()
marker_1 marker_2 umi1 umi2 read_count uei_count corrected_read_count component
CD52 B2M 23408400476442 28306897172026797 1 1 0 00259f58c70ad3a1
CD52 CD52 443712860184452 55804500072166761 1 1 0 00259f58c70ad3a1
CD44 CD44 722767188280986 44328367009958117 1 1 0 00259f58c70ad3a1
CD45 CD59 1449711542454126 1569654851068915 2 2 0 00259f58c70ad3a1
CD3e CD48 1538057282869715 44931945321675157 14 10 1 00259f58c70ad3a1
CD52 CD52 1632028877538373 22620331969975541 1 1 0 00259f58c70ad3a1

Seurat objects and PXL files

For some analyses, we will need to access information from the edgelist. For instance, if we want to load component graphs for visualization, we fetch the relevant component edges. It is important to note that the edgelist is not stored in the Seurat object, it’s only available in the PXL file. The PNAAssay5, mentioned earlier in this tutorial, stores the path(s) to the PXL file(s) from which the Seurat object was built from so that we can access the edgelist when needed. This has the advantage that we can keep the size of the Seurat object relatively small until we decide to load the graphs. On the down side, it also means that the path(s) to the PXL file(s) need to remain unchanged. If the PXL file(s) are moved or renamed, we can no longer access them.

This has an important implication when saving and sharing Seurat objects containing PNA data. If you want to export a Seurat object as an RDS file and share it with someone else, you should also share associated the PXL file(s). In addition, the path(s) to the PXL file(s) has to be updated in the Seurat object. If you want to see the PXL file path(s) or update them, you can use the FSMap method (see ?FSMap for more details):

# Get the path to the PXL file
FSMap(pg_data)
# A tibble: 1 × 3
id_map sample pxl_file
<list> <int> <chr>
1 <tibble [1,084 × 2]> 1 data/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.…

If this is unclear at the moment, don’t worry. We will come back to this topic in the tutorial about cell visualization.

Aggregating data

For cases when we want to process and analyze multiple PNA datasets, we can load the corresponding PXL files as Seurat objects and merge them. The cell IDs should be unique across different samples, but you can provide a unique prefix to each sample by passing a character vector to the add.cell.ids parameter in merge.

In the code chunk below, we will create a merged object from two PXL files. These PXL files represent a resting and PHA stimulated PBMC sample. We will create a merged object Seurat and update the object’s metadata to keep information abiut the condition. The resulting object will be explored further in the next tutorial on Quality Control.

BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/pna-datasets/v1"

curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl"
# The folders the data is located in:
data_files <- c(
# Resting
resting = file.path(DATA_DIR, "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl"),
# PHA
stimulated = file.path(DATA_DIR, "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl")
)

# Read PXL files as a list of Seurat objects
obj_list <- lapply(data_files, ReadPNA_Seurat)

# Combine files into a single Seurat object
pg_data_combined <- merge(obj_list[[1]], y = obj_list[-1], add.cell.ids = names(obj_list))

# Add meta data describing the sample and donor origin
pg_data_combined <- AddMetaData(
pg_data_combined,
metadata = data.frame(id = colnames(pg_data_combined)) %>%
mutate(condition = str_remove(id, "_.*")) %>%
column_to_rownames("id")
)
pg_data_combined
An object of class Seurat 
158 features across 2137 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
2 layers present: counts.1, counts.2

As you can see in the object description above, the count data in the merged Seurat object is separated in multiple matrices (“counts.1”, “counts.2”, …). Our PNA data is stored in a PNAAssay5 object which extends the Assay5 object introduced in Seurat v5 which keeps the sample count matrices separated after merging. For many downstream applications, one needs a single merged count matrix.

One option is to set options(Seurat.object.assay.version = "v3") and rerun this tutorial in which case the PNA data will be stored in a PNAAssay (Seurat v3) instead. Alternatively, you can join the data using the JoinLayers method. After joining, you should see that there’s now only 1 layer present.

pg_data_combined <- JoinLayers(pg_data_combined)
pg_data_combined
An object of class Seurat 
158 features across 2137 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts

Save data

After combining the samples, we can save the merged data to an RDS file to use for later. Remember that a Seurat object containing PNA data is associated with one or several a PXL file(s), so you need to share the PXL file(s) along with the RDS file to ensure that all pixelatorR features can be used by the recipient. If the component graphs are not needed for subsequent analysis, the PXL file(s) are not needed.

# This is an optional pause step; if you prefer to continue to the next tutorial without saving the object, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data.rds"))

We have now learned how to load multiple PNA datasets, inspect its key properties and prepare the data for downstream analysis. In the next tutorial we will apply these skills and demonstrate how to quality control PNA data.