Cell Annotation
This tutorial demonstrates the annotation of cells from MPX data. We will normalize and scale the count data, perform dimensionality reduction, clustering, and perform a differential abundance analysis to find the cell identities of each cluster.
After completing this tutorial, you should be able to:
- Preprocess MPX data by removing control markers, normalizing with CLR, finding variable features, and scaling.
- Perform dimensionality reduction with PCA and UMAP and visualize protein markers projected onto the reduced space.
- Cluster cells with a graph-based Louvain algorithm to identify groups of cells by state.
- Identify cluster marker proteins through differential abundance analysis.
- Visually explore marker differences across clusters using volcano plots and heatmaps.
- Manually annotate cell type identities based on known marker expression patterns.
- Summarize cell type frequencies across experimental groups.
Setup
library(pixelatorR)
library(dplyr)
library(tibble)
library(stringr)
library(tidyr)
library(SeuratObject)
library(Seurat)
library(ggplot2)
library(here)
library(pheatmap)
Load data
Here, we load the same PBMC dataset used in previous two tutorials Data handling and Quality Control. We also apply the filtering steps we performed in Quality Control.
# The folders the data is located in:
data_files <-
c(S1 = here("data/Sample01_human_pbmcs_unstimulated.dataset.pxl"),
S2 = here("data/Sample02_human_pbmcs_unstimulated.dataset.pxl"))
# Read .pxl files as a list of Seurat objects
pg_data <- lapply(data_files, ReadMPX_Seurat)
# Combine data to Seurat object
pg_data_combined <-
merge(pg_data[[1]],
y = pg_data[-1],
add.cell.ids = names(pg_data))
# Add sample column to meta data
pg_data_combined <-
AddMetaData(pg_data_combined,
metadata = str_remove(colnames(pg_data_combined), "_.*"),
col.name = "sample")
# Filter cells to have at least 5000 edges
pg_data_combined <-
pg_data_combined %>%
subset(edges >= 5000 & tau_type == "normal")
Data Processing
To begin with, we will remove control antibodies from our dataset, as we don’t want these to inform downstream steps such as dimensionality reduction and clustering.
pg_data_combined_processed <-
pg_data_combined %>%
# Remove control antibodies
subset(features = setdiff(rownames(pg_data_combined), c("ACTB", "mIgG1", "mIgG2a", "mIgG2b")))
Next, we normalize our count data using the centered log ratio (CLR) transformation per each cell. We can visualize this transformation by examining a histogram showing the distribution of antibody counts and CLR transformed counts for CD3E.
pg_data_combined_processed <-
pg_data_combined_processed %>%
JoinLayers() %>%
# clr transform data per cell
NormalizeData(normalization.method = "CLR",
margin = 2)
LayerData(pg_data_combined_processed, layer = "counts")["CD3E",] %>%
enframe() %>%
ggplot(aes(value)) +
geom_histogram(bins = 40) +
theme_minimal() +
labs(x = "Counts")
LayerData(pg_data_combined_processed, layer = "data")["CD3E",] %>%
enframe() %>%
ggplot(aes(value)) +
geom_histogram(bins = 40) +
theme_minimal() +
labs(x = "CLR transformed counts")
Next, we identify the most variable markers in the dataset. This is done to select a set of markers that are most informative to distinguishing cell identities in our dataset. This marker set is used for downstream analysis such as dimensionality reduction and clustering. We then visualize the result in a variable feature plot, which the displays the variance of each marker in relation to its average abundance.
pg_data_combined_processed <-
pg_data_combined_processed %>%
# Find variable features
FindVariableFeatures(nfeatures = 50)
VariableFeaturePlot(pg_data_combined_processed, selection.method = "vst") %>%
LabelPoints(points = VariableFeatures(object = pg_data_combined_processed),
repel = F, size = 2, xnudge = 0, ynudge = 0.25)
Finally, we scale and center the data.
pg_data_combined_processed <-
pg_data_combined_processed %>%
# Scale and center data
ScaleData(do.scale = TRUE,
do.center = TRUE,
verbose = FALSE)
Dimensionality reduction
Now when we have a dataset that is normalized and scaled, we can perform dimensionality reduction. Here we perform both a Principal Component Analysis (PCA) and a Uniform Manifold Approximation and Projection (UMAP).
pg_data_combined_processed <-
pg_data_combined_processed %>%
RunPCA(npcs = 30) %>%
RunUMAP(dims = 1:10)
DimPlot(pg_data_combined_processed, reduction = "pca")
DimPlot(pg_data_combined_processed, reduction = "umap")
We can now use these reductions and project abundance of markers or meta data upon the cells.
FeaturePlot(pg_data_combined_processed,
features = c("CD3E", "CD4", "CD8",
"CD19", "CD20",
"CD11c"),
reduction = "pca", ncol = 3,
coord.fixed = TRUE)
FeaturePlot(pg_data_combined_processed,
features = c("CD3E", "CD4", "CD8",
"CD19", "CD20",
"CD11c"),
reduction = "umap", ncol = 3,
coord.fixed = TRUE)
FeaturePlot(pg_data_combined_processed,
features = c("edges", "reads", "mean_umi_per_upia"),
reduction = "umap",
coord.fixed = TRUE)
Clustering
To distinguish cell identities we perform a modularity based clustering using the Louvain algorithm. First, we create a Shared Nearest Neighbor graph, which we pass to Louvain.
pg_data_combined_processed <-
pg_data_combined_processed %>%
# Create SNN graph
FindNeighbors(dims = 1:10) %>%
# Find communities (clusters) of cells in SNN graph
FindClusters(random.seed = 1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 863
Number of edges: 28094
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8125
Number of communities: 9
Elapsed time: 0 seconds
DimPlot(pg_data_combined_processed, label = T)
Differential protein abundance
To annotate the resulting clusters, we use a Wilcoxon Rank sum test to gauge which markers are differentially abundant between the clusters. This function calculates the average log2 fold change of each cluster in comparison with all other cells.
DE_cluster <-
pg_data_combined_processed %>%
FindAllMarkers(test.use = "wilcox",
logfc.threshold = 0,
return.thresh = 1) %>%
as_tibble()
DE_cluster
# A tibble: 684 × 7
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 5.26e-79 0.873 1 0.982 4.00e-77 0 CD11a
2 1.17e-71 0.751 1 1 8.86e-70 0 CD27
3 1.22e-60 0.926 1 0.93 9.29e-59 0 CD127
4 8.59e-60 1.09 1 0.994 6.53e-58 0 CD4
5 2.79e-53 -0.758 1 1 2.12e-51 0 HLA-DR
6 7.25e-51 0.655 1 0.997 5.51e-49 0 CD7
7 5.35e-49 0.771 1 0.987 4.07e-47 0 CD3E
8 9.30e-44 0.302 1 1 7.06e-42 0 CD197
9 1.99e-43 -0.228 1 1 1.51e-41 0 CD54
10 7.36e-43 0.528 1 0.997 5.60e-41 0 CD52
# ℹ 674 more rows
The results can be visualized using a volcano plot to get an overview of the test results.
DE_cluster %>%
ggplot(aes(avg_log2FC, -log10(p_val_adj),
color = p_val_adj < 0.01)) +
geom_point() +
facet_wrap(~cluster) +
theme_bw()
To help us compare the abundance levels of markers across clusters, we will visualize the average fold change of each cluster in a heatmap.
DE_cluster %>%
group_by(gene) %>%
# Filter proteins that are statistically significant for at least one cluster
filter(any(p_val_adj < 0.01 &
abs(avg_log2FC) > 2)) %>%
select(avg_log2FC, cluster, gene) %>%
pivot_wider(names_from = "cluster", values_from = "avg_log2FC") %>%
column_to_rownames("gene") %>%
pheatmap(cellwidth = 10,
cellheight = 10,
angle_col = 0)
Now we can use the relative abundance levels of cells to manually assign cell identities to clusters.
# Cluster annotation as a named vector
cell_annotation <-
c("0" = "CD4 T cells",
"1" = "CD4 T cells",
"2" = "B cells",
"3" = "CD8 T cells",
"4" = "Monocytes",
"5" = "CD8 T cells",
"6" = "NK cells",
"7" = "MAIT cells",
"8" = "Platelets")
# Add the cluster annotation to the data
pg_data_combined_processed <-
pg_data_combined_processed %>%
AddMetaData(setNames(cell_annotation[pg_data_combined_processed$seurat_clusters],
nm = colnames(pg_data_combined_processed)), "cell_type")
# Plot UMAP colored by cell annotation
pg_data_combined_processed %>%
DimPlot(group.by = "cell_type")
Now, let’s take a look at the frequency of each of the cell identities within our data.
CellCountPlot(pg_data_combined_processed, group_by = "cell_type", color_by = "sample")
In this tutorial we have leraned how to normalize MPX data, annotate different cell populations and use protein abundance levels to compare across populations and between experimental samples. However, abundance is only one facet of the rich multidimensional data generated by MPX experiments. We are now ready to transition into the spatial realm, exploring the arrangement and interactions of proteins on the cell surface. In the upcoming tutorials, we will leverage the graph nature and spatial metrics within MPX data to unravel the spatial biology of cells, surveying protein polarization and colocalization patterns and how these contribute to cell identity, signaling, and function. Abundance has illuminated cell types; spatiality will illuminate the molecular geography that defines them.