Cell Annotation
This tutorial demonstrates the annotation of cells from MPX data. We will normalize and scale the count data, perform dimensionality reduction, clustering, and perform a differential abundance analysis to find the cell identities of each cluster.
After completing this tutorial, you should be able to:
- Cluster cells with a graph-based Louvain algorithm to identify groups of cells by state.
- Identify cluster marker proteins through differential abundance analysis.
- Visually explore marker differences across clusters using volcano plots and heatmaps.
- Manually annotate cell type identities based on known marker expression patterns.
- Summarize cell type frequencies across experimental groups.
Setup
library(pixelatorR)
library(dplyr)
library(tibble)
library(stringr)
library(tidyr)
library(SeuratObject)
library(Seurat)
library(ggplot2)
library(here)
library(pheatmap)
Load data
To illustrate the annotation workflow we continue with the integrated object we created in the previous tutorial. As described before, this merged and harmonized object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data_integrated.rds"))
pg_data_combined
An object of class Seurat
168 features across 3844 samples within 2 assays
Active assay: mpxCells (84 features, 84 variable features)
3 layers present: counts, data, scale.data
1 other assay present: log_assay
4 dimensional reductions calculated: pca, umap, harmony, harmony_umap
Clustering
To distinguish cell identities we perform a modularity based clustering using the Louvain algorithm on the harmonized dimensional reduction. First, we create a Shared Nearest Neighbor graph, which we pass to Louvain.
pg_data_combined <-
pg_data_combined %>%
# Create SNN graph
FindNeighbors(dims = 1:10, reduction = "harmony") %>%
# Find communities (clusters) of cells in SNN graph
FindClusters(random.seed = 1,
resolution = 0.1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 3844
Number of edges: 139715
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9676
Number of communities: 7
Elapsed time: 0 seconds
# Visualize the clusters
DimPlot(pg_data_combined, reduction = "harmony_umap", label = TRUE)
Cell type marker detection
To annotate the resulting clusters, we use a Wilcoxon Rank sum test to gauge which markers are specific for each cluster. This function calculates the difference in dsb value of each cluster with all other cells.
DE_cluster <-
pg_data_combined %>%
FindAllMarkers(test.use = "wilcox",
assay = "mpxCells",
slot = "data",
logfc.threshold = 0,
return.thresh = 1,
mean.fxn = rowMeans,
fc.name = "difference") %>%
as_tibble()
DE_cluster
# A tibble: 501 × 7
p_val difference pct.1 pct.2 p_val_adj cluster gene
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 0 2.40 0.929 0.136 0 0 CD4
2 0 2.02 0.996 0.577 0 0 CD5
3 2.13e-214 1.70 0.999 0.725 1.79e-212 0 CD3E
4 2.16e-176 1.01 0.781 0.293 1.81e-174 0 CD26
5 6.89e-160 -1.22 0.111 0.577 5.79e-158 0 CD32
6 5.76e-145 1.14 0.934 0.438 4.84e-143 0 CD27
7 4.77e-134 -1.32 0.38 0.671 4.01e-132 0 HLA-DR
8 1.06e-132 0.694 0.99 0.815 8.88e-131 0 CD59
9 1.61e-115 0.796 0.941 0.678 1.35e-113 0 CD7
10 3.50e-107 -0.359 1 0.994 2.94e-105 0 CD18
# ℹ 491 more rows
The results can be visualized using a volcano plot to get an overview of the test results.
DE_cluster %>%
ggplot(aes(difference, -log10(p_val_adj),
color = p_val_adj < 0.01)) +
geom_point() +
facet_wrap(~cluster) +
theme_bw()
To help us compare the abundance levels of markers across clusters, we will visualize the average fold change of each cluster in a heatmap.
DE_cluster %>%
group_by(gene) %>%
# Filter proteins that are statistically significant for at least one cluster
filter(any(p_val_adj < 0.001 &
abs(difference) > 1.2)) %>%
select(difference, cluster, gene) %>%
pivot_wider(names_from = "cluster", values_from = "difference") %>%
column_to_rownames("gene") %>%
pheatmap(cellwidth = 10,
cellheight = 10,
angle_col = 0)
Now we can use the relative abundance levels of cells to manually assign cell identities to clusters.
# Cluster annotation as a named vector
cell_annotation <-
c("0" = "CD4 T cells",
"1" = "CD8 T cells",
"2" = "CD8 T cells",
"3" = "Monocytes",
"4" = "Other",
"5" = "B cells",
"6" = "Other")
# Add the cluster annotation to the data
pg_data_combined <-
pg_data_combined %>%
AddMetaData(setNames(cell_annotation[pg_data_combined$seurat_clusters],
nm = colnames(pg_data_combined)), "cell_type")
# Plot UMAP colored by cell annotation
pg_data_combined %>%
DimPlot(group.by = "cell_type", reduction = "harmony_umap")
Now, let’s take a look at the frequency of each of the cell identities within each condition.
CellCountPlot(pg_data_combined, group_by = "cell_type", color_by = "condition")
Compact annotation alternative - Marker gating
Besides differential abundance approaches as shown above, we can use the dsb normalized abundance with fixed thresholds for gating based annotation. In this tutorial we choose a dsb value of 1.5 as gate threshold since the non-expressing population are centered at 0 with a standard deviation close to 1 with dsb normalization. A cell can be annotated as a cell type if passes through the corresponding positive and negative marker gates. In the code below, we annotate a cell to the cell type with the highest ratio of passing gates. If for none of the cell types, the cell passes more than 75% of the gates, it will be annotated as “other”.
# Define a list of cell-type specific markers, a minus sign indicates the marker should not be expressed in this cell-type
celltype_marker_list <- list(
"CD8T" = c("CD3E", "CD8", "-CD4", "-CD26", "-CD328"),
"CD4T" = c("CD3E", "CD4", "-CD8", "-CD26", "-CD328"),
"B" = c("CD19", "CD20", "CD22", "-CD3E"),
"Monocyte" = c("CD1d", "CD36", "-CD3E", "-CD20")
)
# Define a function to calculate per cell the ratio of markers passing the gating strategy
calc_gates_passed <- function(expression_data, marker_sets, threshold){
expression_data <- t(expression_data)
gate_threshold <- threshold
celltypes <- names(marker_sets)
celltype_score <- matrix(0, nrow = nrow(expression_data), ncol = length(marker_sets))
rownames(celltype_score) <- rownames(expression_data)
colnames(celltype_score) <- celltypes
for(ct in celltypes){
for(marker in marker_sets[[ct]]){
if(str_sub(marker, 1, 1) == "-"){
celltype_score[,ct] <- celltype_score[,ct] + (expression_data[,str_sub(marker, 2)] < gate_threshold)
} else {
celltype_score[,ct] <- celltype_score[,ct] + (expression_data[,marker] > gate_threshold)
}
}
celltype_score[,ct] <- celltype_score[,ct]/length(marker_sets[[ct]])
}
return(celltype_score)
}
cell_type_gate <- calc_gates_passed(GetAssayData(pg_data_combined), celltype_marker_list, threshold = 1.5)
# keep maximum celltype_score and set cell type to "other" if this value is < than the required_gate_pass_ratio
required_gate_pass_ratio <- 0.75
gated_cell_type <- cell_type_gate %>%
as.data.frame() %>%
rownames_to_column("component") %>%
pivot_longer(-component, names_to = "gate_celltype", values_to = "ratio") %>%
group_by(component) %>%
arrange(-ratio) %>%
slice(1) %>%
mutate(gate_celltype = ifelse(ratio >= required_gate_pass_ratio, gate_celltype, "other"))
# Add the resulting
pg_data_combined <- AddMetaData(pg_data_combined, gated_cell_type, col.name = "gate_celltype")
Annotation sanity check
We now plot the abundance data for a few markers to check if the annotations conform with expected patterns. The density_scatter_plot function enables us to see marker distributions across single cells. Let us first look at an example of marker distribution against its isotype control, CD3E vs mIgG1.
DensityScatterPlot(pg_data_combined, layer = "data", marker1="mIgG1", marker2="CD3E", margin_density=TRUE)
As expected, we see that the isotype control appears to form a single distribution centered around 0 corresponding to negative expression for all cells. CD3E however is divided into two distributions where one corresponds to the negative population centered around 0, i.e. PBMCs other than T cells that do not express CD3E and another distribution corresponding to T cells with CD3E values centered around 2.5.
Now we can check whether each cell type expresses the expected markers. For example looking at CD4 and CD8 proteins across different cell types:
DensityScatterPlot(pg_data_combined, layer = "data", marker1="CD4", marker2="CD8", facet_vars="gate_celltype")
As expected, CD4 and CD8 are expressed in CD4 T cells and CD8 T cells respectively, while some NK cells also express CD8 as expected. Similarly, let’s look at a few more markers, namely, CD14, CD19, CD161 and CD337 that are expected to be expressed in Monocytes, B cells, Innate T cells and NK cells respectively.
p1 <- DensityScatterPlot(pg_data_combined, layer = "data", marker1="CD14", marker2="CD19", facet_vars="gate_celltype")
p2 <- DensityScatterPlot(pg_data_combined, layer = "data", marker1="CD161", marker2="CD337", facet_vars="gate_celltype")
p1 | p2
Compare annotations
Let us now check to what extent the clustering based annotation and the gating based annotation agree:
table(pg_data_combined[[]]$cell_type, pg_data_combined[[]]$gate_celltype) %>%
pheatmap(.,
display_numbers = .,
cluster_rows = TRUE,
cluster_columns = TRUE,
treeheight_row = 0,
treeheight_col = 0,
color = colorRampPalette(c("white", "navy"))(50), )
We observe that other than a majority of the NK cells that are classified as Monocytes or T cells, the annotations mostly agree. Note that on its own, annotation based on gating can be quite prone to errors as it only considers a few markers. However, it can serve as a powerful complement to clustering based annotation by cleaning up wrong annotations. For example, B cells should not contain any CD3+/CD14+, T cells should not contain any CD19+/CD14+ etc.
In this tutorial we have learned how to annotate different cell populations and use protein abundance levels to compare across populations and between experimental samples.
We will now save this annotated object to proceed to the next tutorial where we will show how to perform a differential abundance test between experimental conditions within a specific cell type.
# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_annotated.rds"))