Data Handling
This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.
After completing this tutorial, you will be able to:
- Load PXL files in R and access the multi-modal data contents including protein counts, metadata and spatial scores.
- Understand and work with spatial polarity (protein clustering patterns).
- Interpret and work with colocalization scores (protein co-clustering).
- Directly access the spatial graph structure through the edge list.
- Aggregate multiple PXL files into an integrated data object with sample identities.
- Save aggregated data objects for later reuse in analyses.
Setup
To start with, we need to load some packages and functions that we will need.
library(pixelatorR)
library(SeuratObject)
library(dplyr)
library(stringr)
library(here)
library(tibble)
Loading data
We begin by locating the output from the nf-core-pixelator
pipeline. The input
for downstream analysis is the PXL file contained in the layout
directory.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
Run the following code in the terminal to download the dataset.
BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.19.x/technote-v1-vs-v2-immunology-II"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
And than proceed in R.
# The folders the data is located in:
filename <- file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl")
pg_data <- ReadMPX_Seurat(filename)
Component meta data
In addition to the counts data, the object contains meta data of each
MPX graph component, each corresponding to a cell. This table contains
information that can be useful in quality control, with information such
as how many protein molecules were detected (molecules
), the
sequencing depth (reads
), and the graph connectivity of each component
(mean_b_pixels_per_a_pixel
).
as_tibble(pg_data[[]], rownames = "component")
# A tibble: 1,101 × 21
component a_pixel_b_pixel_ratio a_pixels antibodies b_pixels
<chr> <dbl> <int> <int> <int>
1 002f13bdbf78b499 1.87 1931 77 1033
2 00387aa4175efe8d 2.29 3436 78 1500
3 005dd44dcce1635a 1.83 3326 77 1819
4 00ccd2f98f49bd73 2.07 1905 77 921
5 0105b317c16664a6 1.81 1481 77 817
6 0124a5d05c741552 1.82 2340 78 1288
7 012534131719430d 1.96 2208 78 1125
8 0198ad2f43621f7c 1.81 1777 78 981
9 019a35591a30986d 2.19 3569 77 1629
10 01b69faf47d17e73 1.81 2760 77 1528
# ℹ 1,091 more rows
# ℹ 16 more variables: is_potential_doublet <lgl>, leiden <fct>,
# mean_a_pixels_per_b_pixel <dbl>, mean_b_pixels_per_a_pixel <dbl>,
# mean_molecules_per_a_pixel <dbl>, mean_reads_per_molecule <dbl>,
# median_a_pixels_per_b_pixel <dbl>, median_b_pixels_per_a_pixel <dbl>,
# median_molecules_per_a_pixel <dbl>, median_reads_per_molecule <dbl>,
# molecules <int>, n_edges_to_split_doublet <dbl>, pixels <int>, …
The antibody count matrix can be accessed from the Seurat object using
the LayerData
method.
# Display the 10 first rows and 10 first columns of the antibody counts
LayerData(pg_data, layer = "counts")[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
B2M 1733 5128 1970 5071 1587 4129 2311 1816 5942 1270
CD102 30 77 113 39 13 38 82 34 227 13
CD11a 55 402 358 116 68 229 94 76 275 129
CD11b 33 194 270 47 16 66 28 32 107 230
CD11c 15 159 527 21 15 42 19 13 421 128
CD123 5 4 17 2 2 6 5 7 10 10
CD14 13 29 418 15 20 27 22 14 26 322
CD150 36 25 27 75 27 114 34 31 11 25
CD152 16 25 24 29 5 16 10 14 15 26
CD154 12 35 28 15 15 21 21 19 25 15
Spatial data
In addition to count data corresponding to protein abundance, MPX also generates spatial data which be used to visualize individual cells, and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, describing the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.
pixelatorR Assay classes
SeuratObject provides the Assay
class and more recently the Assay5
class to store single-cell abundance data in. PixelatorR provides
additional classes CellGraphAssay
and CellGraphAssay5
to store both
abundance data and spatial metrics in. These classes inherit the
corresponding SeuratObject classes but also contain additional slots for
the spatial metrics. Here for instance, pg_data
contains a
CellGraphAssay5
object called “mpxCells” which can be accessed as
follows:
pg_data[["mpxCells"]]
CellGraphAssay (v5) data with 84 features for 1101 cells
Top 10 variable features:
B2M, CD102, CD11a, CD11b, CD11c, CD123, CD14, CD150, CD152, CD154
Layers:
counts
Loaded CellGraph objects:
0
In the output above, we can confirm that the object is of class
CellGraphAssay5
and that it currently contains 0 loaded CellGraph
objects. More about that later.
MPX Polarity scores
The spatial metrics can be accessed from the CellGraphAssay5
object
with:
PolarizationScores(pg_data)
# A tibble: 82,415 × 6
marker morans_i morans_z morans_p_value morans_p_adjusted component
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 ACTB -0.00827 -2.03 0.0211 0.152 6b437c2d776b156c
2 B2M -0.00575 -1.31 0.0954 0.342 6b437c2d776b156c
3 CD102 -0.00148 -0.163 0.435 0.488 6b437c2d776b156c
4 CD11a 0.00861 1.57 0.0586 0.276 6b437c2d776b156c
5 CD11b 0.00401 0.987 0.162 0.406 6b437c2d776b156c
6 CD11c 0.00323 0.772 0.220 0.434 6b437c2d776b156c
7 CD14 0.0131 2.33 0.00985 0.0895 6b437c2d776b156c
8 CD150 -0.00459 -0.511 0.305 0.460 6b437c2d776b156c
9 CD152 0.0100 2.50 0.00623 0.0635 6b437c2d776b156c
10 CD154 0.000511 0.218 0.414 0.484 6b437c2d776b156c
# ℹ 82,405 more rows
MPX colocalization scores
Similarly, pixelatorR provides a method to access the colocalization scores:
ColocalizationScores(pg_data)
# A tibble: 3,054,479 × 15
marker_1 marker_2 pearson pearson_mean pearson_stdev pearson_z
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ACTB B2M -0.00456 -0.0158 0.0212 0.529
2 ACTB CD102 -0.139 -0.00130 0.0226 -6.10
3 B2M CD102 -0.208 -0.0390 0.0240 -7.02
4 ACTB CD11a -0.0803 -0.00986 0.0263 -2.68
5 B2M CD11a 0.0336 -0.0449 0.0230 3.42
6 CD102 CD11a -0.147 -0.00994 0.0249 -5.51
7 ACTB CD11b -0.00209 -0.00675 0.0261 0.179
8 B2M CD11b -0.0842 -0.0204 0.0260 -2.45
9 CD102 CD11b -0.00979 -0.00230 0.0261 -0.287
10 CD11a CD11b 0.0689 -0.0138 0.0253 3.28
# ℹ 3,054,469 more rows
# ℹ 9 more variables: pearson_p_value <dbl>, pearson_p_value_adjusted <dbl>,
# jaccard <dbl>, jaccard_mean <dbl>, jaccard_stdev <dbl>, jaccard_z <dbl>,
# jaccard_p_value <dbl>, jaccard_p_value_adjusted <dbl>, component <chr>
Edgelist
The edgelist is a list of uniquely identified antibodies per each cell, and simultaneously constitutes the bipartite graph which makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody) which is identified by a UMI, UPIA, and a UPIB. The UPIA is the unique identifier for the specific A pixel, while the UPIB is the unique identifier for the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph which can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.
edge_list <-
ReadMPX_edgelist(filename)
edge_list
# A tibble: 25,842,392 × 8
upia upib umi marker sequence count unique_molecules_count component
<fct> <fct> <fct> <fct> <fct> <int> <int> <fct>
1 CGGGTTAGT… ATTG… CGTC… CD19 ACCAACTT 15 1 6b437c2d…
2 CGTGATCAG… GCAG… GAAG… CD19 ACCAACTT 15 1 8914534a…
3 CTTTTAGTG… CGTT… TGCG… CD19 ACCAACTT 15 1 b033c579…
4 TTGGTGTAC… AGCG… TAGT… CD19 ACCAACTT 14 1 d45e83b2…
5 GTGAAAAAT… ATTA… GTGC… CD19 ACCAACTT 13 1 59722510…
6 TATCCCAGC… AGTC… TGTC… CD19 ACCAACTT 13 1 66b140d8…
7 TTGATTGTT… CTAG… CCCG… CD19 ACCAACTT 13 1 9e0e8227…
8 AATCGATAG… TGGC… ATGG… CD19 ACCAACTT 12 1 2d8042f0…
9 TTACTTGTT… CATA… CATC… CD19 ACCAACTT 12 1 66b140d8…
10 AAGAGGGGT… GGAA… GACG… CD19 ACCAACTT 12 1 4f3ab404…
# ℹ 25,842,382 more rows
Seurat objects and PXL files
For some analytical tasks, we will need access to the edgelist. For
instance, if we want to load component graphs for visualization, we need
access to the edgelist to build these graphs. However, the edgelist is
not stored in the Seurat object, it’s only available in the PXL file.
The CellGraphAssay5
that was mentioned earlier in this tutorial keeps
the path to the PXL file so that we can access it when needed. This has
the advantage that we can keep the size of the Seurat object relatively
small until we decide to load the graphs. On the down side, it also
means that the path to the PXL file need to remain unchanged. If the PXL
file is moved or renamed, we can no longer access it.
This has an important implication when saving and sharing Seurat objects containing MPX data. If you want to export a Seurat object as an RDS file and share it with someone else, you should also share associated the PXL file. In addition, the path to the PXL file has to be updated in the Seurat object.
If this is unclear at the moment, don’t worry. We will come back to this topic in the tutorials about 2D and 3D cell visualization.
Aggregating data
For cases when we want to process and analyze samples in parallel, it is
convenient to merge them into a single object. Here, we read multiple
files and aggregate them into a single object. When doing so, the cell
ID and sample ID are merged to form the new cell ID in the merged object
(e.g. “S1_RCVCMP0000830”). When merging Seurat objects, you can provide
add.cell.ids
to control the sample prefixes.
In the code chunk below, we will create a merged object from four PXL files. These PXL files represent a resting and PHA stimulated PBMC sample, both in duplicate. We will create a merged object and update the object’s metadata to indicate the condition and replicate number. The resulting object will be explored further in the next tutorial on Quality Control.
BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/0.19.x/technote-v1-vs-v2-immunology-II"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
curl -L -O -C - --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
# The folders the data is located in:
data_files <-
c(
# Resting
resting_r1 = file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl"),
resting_r2 = file.path(DATA_DIR, "Sample06_V2_PBMC_r2.layout.dataset.pxl"),
# PHA
stimulated_r1 = file.path(DATA_DIR, "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"),
stimulated_r2 = file.path(DATA_DIR, "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl")
)
# Read PXL files as a list of Seurat objects
obj_list <- lapply(data_files, ReadMPX_Seurat)
# Combine files into a single Seurat object
pg_data_combined <-
merge(obj_list[[1]],
y = obj_list[-1],
add.cell.ids = names(obj_list))
# Add meta data describing the sample and donor origin
pg_data_combined <-
AddMetaData(pg_data_combined,
metadata = data.frame(id = colnames(pg_data_combined)) %>%
mutate(
condition = str_remove(id, "_.*"),
sample = str_remove(id, ".{17}$"),
replicate = str_extract(id, "_r[0-9]_") %>% str_remove_all("_")
) %>%
column_to_rownames("id"))
pg_data_combined
An object of class Seurat
84 features across 4273 samples within 1 assay
Active assay: mpxCells (84 features, 84 variable features)
4 layers present: counts.1, counts.2, counts.3, counts.4
As you can see in the object description above, the count layers of a merged object will be kept separate as individual layers (“counts.1”, “counts.2”, …). For many downstream applications, one needs a singled merged counts matrix. This layer can be created with the following command.
pg_data_combined <- JoinLayers(pg_data_combined)
pg_data_combined
An object of class Seurat
84 features across 4273 samples within 1 assay
Active assay: mpxCells (84 features, 84 variable features)
1 layer present: counts
Save data
After combining the samples, we can save the merged data to an RDS file to use later. Remember that a Seurat object containing MPX data is “connected” to a PXL file, so you need to share the PXL file along with the RDS file to ensure that component graphs can be loaded. If component graphs are not needed for subsequent analysis, the PXL file is no longer needed.
# This is an optional pause step; if you prefer to continue to the next tutorial without saving the object, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data.rds"))
We have now seen how to load MPX data, inspect its key properties and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to quality control MPX data.