Skip to main content
Version: 0.18.x

Data Handling

This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.

After completing this tutorial, you will be able to:

  • Load PXL files in R and access the multi-modal data contents including protein counts, metadata and spatial scores.
  • Understand and work with spatial polarity (protein clustering patterns).
  • Interpret and work with colocalization scores (protein co-clustering).
  • Directly access the spatial graph structure through the edge list.
  • Aggregate multiple PXL files into an integrated data object with sample identities.
  • Save aggregated data objects for later reuse in analyses.

Setup

To start with, we need to load some packages and functions that we will need.

library(pixelatorR)
library(SeuratObject)
library(dplyr)
library(stringr)
library(here)
library(tibble)

Loading data

We begin by locating the output from the nf-core-pixelator pipeline. The input for downstream analysis is the PXL file contained in the layout directory.

DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)

Run the following code in the terminal to download the dataset.

BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"

curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"

And than proceed in R.

# The folders the data is located in:
filename <- file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl")
pg_data <- ReadMPX_Seurat(filename)

Component meta data

In addition to the counts data, the object contains meta data of each MPX graph component, each corresponding to a cell. This table contains information that can be useful in quality control, with information such as how many protein molecules were detected (molecules), the sequencing depth (reads), and the graph connectivity of each component (mean_b_pixels_per_a_pixel).

as_tibble(pg_data[[]], rownames = "component")
# A tibble: 1,125 × 19
component a_pixel_b_pixel_ratio a_pixels antibodies b_pixels leiden
<chr> <dbl> <int> <int> <int> <fct>
1 RCVCMP0000000 1.80 3749 79 2084 0
2 RCVCMP0000001 2.33 3338 79 1431 1
3 RCVCMP0000002 1.81 2724 78 1503 0
4 RCVCMP0000004 1.92 2159 79 1126 2
5 RCVCMP0000005 1.81 6413 79 3545 3
6 RCVCMP0000007 2.34 2953 78 1264 4
7 RCVCMP0000009 3.42 1382 76 404 9
8 RCVCMP0000011 2.65 6791 79 2564 3
9 RCVCMP0000014 2.11 5128 79 2436 6
10 RCVCMP0000016 2.09 2432 77 1161 13
# ℹ 1,115 more rows
# ℹ 13 more variables: mean_a_pixels_per_b_pixel <dbl>,
# mean_b_pixels_per_a_pixel <dbl>, mean_molecules_per_a_pixel <dbl>,
# mean_reads_per_molecule <dbl>, median_a_pixels_per_b_pixel <dbl>,
# median_b_pixels_per_a_pixel <dbl>, median_molecules_per_a_pixel <dbl>,
# median_reads_per_molecule <dbl>, molecules <int>, pixels <int>,
# reads <int>, tau <dbl>, tau_type <chr>

The antibody count matrix can be accessed from the Seurat object using the LayerData method.

# Display the 10 first rows and 10 first columns of the antibody counts
LayerData(pg_data, layer = "counts")[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"

B2M 4690 3379 3070 2014 6731 5090 106 3926 5993 3225
CD102 349 94 39 26 227 155 32 110 154 58
CD11a 509 262 202 63 240 366 65 262 495 140
CD11b 81 156 58 92 185 102 88 416 314 20
CD11c 679 50 35 18 112 28 51 168 757 30
CD123 2 15 5 1 5 4 . 15 19 4
CD14 50 78 42 19 79 26 40 178 613 16
CD150 23 105 30 27 134 14 53 155 22 48
CD152 13 21 17 14 66 12 35 108 25 11
CD154 31 34 36 16 60 22 59 137 34 23

Spatial data

In addition to count data corresponding to protein abundance, MPX also generates spatial data which be used to visualize individual cells, and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, describing the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.

pixelatorR Assay classes

SeuratObject provides the Assay class and more recently the Assay5 class to store single-cell abundance data in. PixelatorR provides additional classes CellGraphAssay and CellGraphAssay5 to store both abundance data and spatial metrics in. These classes inherit the corresponding SeuratObject classes but also contain additional slots for the spatial metrics. Here for instance, pg_data contains a CellGraphAssay5 object called “mpxCells” which can be accessed as follows:

pg_data[["mpxCells"]]
CellGraphAssay (v5) data with 84 features for 1125 cells
Top 10 variable features:
B2M, CD102, CD11a, CD11b, CD11c, CD123, CD14, CD150, CD152, CD154
Layers:
counts
Loaded CellGraph objects:
0

In the output above, we can confirm that the object is of class CellGraphAssay5 and that it currently contains 0 loaded CellGraph objects. More about that later.

MPX Polarity scores

The spatial metrics can be accessed from the CellGraphAssay5 object with:

PolarizationScores(pg_data)
# A tibble: 84,022 × 6
marker morans_i morans_z morans_p_value morans_p_adjusted component
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 ACTB -0.00827 -1.42 0.0774 0.300 RCVCMP0000000
2 B2M -0.00589 -1.77 0.0380 0.206 RCVCMP0000000
3 CD102 -0.00151 -0.0780 0.469 0.494 RCVCMP0000000
4 CD11a 0.00856 1.46 0.0721 0.290 RCVCMP0000000
5 CD11b 0.00410 1.01 0.156 0.393 RCVCMP0000000
6 CD11c 0.00319 0.851 0.197 0.417 RCVCMP0000000
7 CD14 0.0131 2.89 0.00190 0.0221 RCVCMP0000000
8 CD150 -0.00459 -0.712 0.238 0.435 RCVCMP0000000
9 CD152 0.0100 0.998 0.159 0.395 RCVCMP0000000
10 CD154 0.000512 0.328 0.372 0.473 RCVCMP0000000
# ℹ 84,012 more rows

MPX colocalization scores

Similarly, pixelatorR provides a method to access the colocalization scores:

ColocalizationScores(pg_data)
# A tibble: 3,111,941 × 15
marker_1 marker_2 pearson pearson_mean pearson_stdev pearson_z
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ACTB B2M -0.00456 -0.0127 0.0259 0.316
2 ACTB CD102 -0.139 -0.00133 0.0222 -6.21
3 B2M CD102 -0.207 -0.0403 0.0273 -6.11
4 ACTB CD11a -0.0803 -0.00523 0.0247 -3.04
5 B2M CD11a 0.0335 -0.0453 0.0194 4.06
6 CD102 CD11a -0.147 -0.0179 0.0230 -5.64
7 ACTB CD11b -0.00205 -0.000778 0.0240 -0.0528
8 B2M CD11b -0.0842 -0.0187 0.0217 -3.03
9 CD102 CD11b -0.00990 -0.00598 0.0266 -0.147
10 CD11a CD11b 0.0690 -0.00677 0.0209 3.63
# ℹ 3,111,931 more rows
# ℹ 9 more variables: pearson_p_value <dbl>, pearson_p_value_adjusted <dbl>,
# jaccard <dbl>, jaccard_mean <dbl>, jaccard_stdev <dbl>, jaccard_z <dbl>,
# jaccard_p_value <dbl>, jaccard_p_value_adjusted <dbl>, component <chr>

Edgelist

The edgelist is a list of uniquely identified antibodies per each cell, and simultaneously constitutes the bipartite graph which makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody) which is identified by a UMI, UPIA, and a UPIB. The UPIA is the unique identifier for the specific A pixel, while the UPIB is the unique identifier for the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph which can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.

edge_list <- 
ReadMPX_edgelist(filename)
edge_list
# A tibble: 26,003,773 × 8
upia upib umi marker sequence count unique_molecules_count component
<fct> <fct> <fct> <fct> <fct> <int> <int> <fct>
1 CGGGTTAGT… ATTG… CGTC… CD19 ACCAACTT 15 1 RCVCMP00…
2 CGTGATCAG… GCAG… GAAG… CD19 ACCAACTT 15 1 RCVCMP00…
3 CTTTTAGTG… CGTT… TGCG… CD19 ACCAACTT 15 1 RCVCMP00…
4 ACCCAGTTC… GATG… GAGT… CD19 ACCAACTT 14 1 RCVCMP00…
5 TTGGTGTAC… AGCG… TAGT… CD19 ACCAACTT 14 1 RCVCMP00…
6 GTGAAAAAT… ATTA… GTGC… CD19 ACCAACTT 13 1 RCVCMP00…
7 TATCCCAGC… AGTC… TGTC… CD19 ACCAACTT 13 1 RCVCMP00…
8 TTGATTGTT… CTAG… CCCG… CD19 ACCAACTT 13 1 RCVCMP00…
9 AATCGATAG… TGGC… ATGG… CD19 ACCAACTT 12 1 RCVCMP00…
10 TTACTTGTT… CATA… CATC… CD19 ACCAACTT 12 1 RCVCMP00…
# ℹ 26,003,763 more rows

Seurat objects and PXL files

For some analytical tasks, we will need access to the edgelist. For instance, if we want to load component graphs for visualization, we need access to the edgelist to build these graphs. However, the edgelist is not stored in the Seurat object, it’s only available in the PXL file. The CellGraphAssay5 that was mentioned earlier in this tutorial keeps the path to the PXL file so that we can access it when needed. This has the advantage that we can keep the size of the Seurat object relatively small until we decide to load the graphs. On the down side, it also means that the path to the PXL file need to remain unchanged. If the PXL file is moved or renamed, we can no longer access it.

This has an important implication when saving and sharing Seurat objects containing MPX data. If you want to export a Seurat object as an RDS file and share it with someone else, you should also share associated the PXL file. In addition, the path to the PXL file has to be updated in the Seurat object.

If this is unclear at the moment, don’t worry. We will come back to this topic in the tutorials about 2D and 3D cell visualization.

Aggregating data

For cases when we want to process and analyze samples in parallel, it is convenient to merge them into a single object. Here, we read multiple files and aggregate them into a single object. When doing so, the cell ID and sample ID are merged to form the new cell ID in the merged object (e.g. “S1_RCVCMP0000830”). When merging Seurat objects, you can provide add.cell.ids to control the sample prefixes.

In the code chunk below, we will create a merged object from four PXL files. These PXL files represent a resting and PHA stimulated PBMC sample, both in duplicate. We will create a merged object and update the object’s metadata to indicate the condition and replicate number. The resulting object will be explored further in the next tutorial on Quality Control.

BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.18.x/technote-v1-vs-v2-immunology-II"

curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
# The folders the data is located in:
data_files <-
c(
# Resting
resting_r1 = file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl"),
resting_r2 = file.path(DATA_DIR, "Sample06_V2_PBMC_r2.layout.dataset.pxl"),

# PHA
stimulated_r1 = file.path(DATA_DIR, "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"),
stimulated_r2 = file.path(DATA_DIR, "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl")
)

# Read PXL files as a list of Seurat objects
obj_list <- lapply(data_files, ReadMPX_Seurat)

# Combine files into a single Seurat object
pg_data_combined <-
merge(obj_list[[1]],
y = obj_list[-1],
add.cell.ids = names(obj_list))

# Add meta data describing the sample and donor origin
pg_data_combined <-
AddMetaData(pg_data_combined,
metadata = data.frame(id = colnames(pg_data_combined)) %>%
mutate(
condition = str_remove(id, "_.*"),
sample = str_remove(id, "_RCVC.*"),
replicate = str_extract(id, "_r[0-9]_") %>% str_remove_all("_")
) %>%
column_to_rownames("id"))
pg_data_combined
An object of class Seurat 
84 features across 4454 samples within 1 assay
Active assay: mpxCells (84 features, 84 variable features)
4 layers present: counts.1, counts.2, counts.3, counts.4

As you can see in the object description above, the count layers of a merged object will be kept separate as individual layers (“counts.1”, “counts.2”, …). For many downstream applications, one needs a singled merged counts matrix. This layer can be created with the following command.

pg_data_combined <- JoinLayers(pg_data_combined)
pg_data_combined
An object of class Seurat 
84 features across 4454 samples within 1 assay
Active assay: mpxCells (84 features, 84 variable features)
1 layer present: counts

Save data

After combining the samples, we can save the merged data to an RDS file to use later. Remember that a Seurat object containing MPX data is “connected” to a PXL file, so you need to share the PXL file along with the RDS file to ensure that component graphs can be loaded. If component graphs are not needed for subsequent analysis, the PXL file is no longer needed.

# This is an optional pause step; if you prefer to continue to the next tutorial without saving the object, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data.rds"))

We have now seen how to load MPX data, inspect its key properties and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to quality control MPX data.