Cell Annotation

This tutorial demonstrates key steps in analyzing cell populations within PNA data. We will cover normalization, scaling, dimensionality reduction, clustering, and differential abundance testing to identify cell annotations for each cluster.

After completing this tutorial, you should be able to:

Cluster cells with a graph-based community-detection algorithm (Louvain) to identify groups of cells with similar abundance profiles.
Identify cluster marker proteins through differential abundance analysis.
Visually explore marker differences across clusters using heatmaps.
Manually annotate cell type identities based on known marker expression patterns.
Summarize cell type frequencies across experimental groups.

Setup

library(pixelatorR)
library(dplyr)
library(tibble)
library(stringr)
library(tidyr)
library(SeuratObject)
library(Seurat)
library(ggplot2)
library(here)
library(pheatmap)

Load data

To illustrate the annotation workflow we continue with the integrated object we created in the previous tutorial. As described before, this merged and harmonized object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.

DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)

# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data_integrated.rds"))
pg_data_combined

An object of class Seurat 
474 features across 2066 samples within 3 assays 
Active assay: PNA (158 features, 158 variable features)
 3 layers present: counts, data, scale.data
 2 other assays present: clr_assay, log_assay
 4 dimensional reductions calculated: pca, umap, harmony, harmony_umap

Clustering

To identify cell types in the PNA data we perform data-driven clustering. First, we build a Shared Nearest Neighbor (SNN) graph from the harmony embedding which we then pass to the clustering algorithm (Louvain).

pg_data_combined <- pg_data_combined %>%
  # Create SNN graph
  FindNeighbors(dims = 1:10, reduction = "harmony") %>%
  # Find communities (clusters) of cells in SNN graph
  FindClusters(random.seed = 1,
                resolution = 0.1)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2066
Number of edges: 67385

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9609
Number of communities: 6
Elapsed time: 0 seconds

# Visualize the clusters
DimPlot(pg_data_combined, reduction = "harmony_umap", label = TRUE)

Cell type marker detection

To annotate the resulting clusters, we use a Wilcoxon Rank sum test to gauge which markers are specific for each cluster using the Seurat method FindAllMarkers. This function also calculates the difference in dsb value of each cluster with all other cells.

DE_cluster <- FindAllMarkers(
  pg_data_combined,
  test.use = "wilcox",
  assay = "PNA",
  slot = "data",
  logfc.threshold = 0, 
  return.thresh = 1,
  mean.fxn = rowMeans,
  fc.name = "difference"
) %>% 
  as_tibble()

DE_cluster %>% 
  group_by(cluster) %>% 
  arrange(-difference) %>% 
  slice_head(n = 5) %>% 
  knitr::kable()

difference	pct.1	pct.2	cluster	gene
2.483939	0.941	0.530	0	CD4
1.931092	0.999	0.666	0	CD5
1.708068	1.000	0.773	0	CD3e
1.653419	0.989	0.549	0	CD6
1.614072	0.874	0.541	0	CD45RO
3.119808	0.968	0.478	1	CD8
1.767911	0.998	0.783	1	CD314
1.354981	1.000	0.934	1	CD27
1.337733	0.938	0.573	1	CD45RB
1.220212	0.995	0.841	1	CD3e
3.261600	0.974	0.549	2	Siglec-9
3.199898	0.971	0.539	2	CD36
3.074027	0.978	0.481	2	CD11c
2.843326	0.978	0.507	2	CD371
2.723601	0.982	0.712	2	CD86
2.960758	0.990	0.566	3	GPR56
2.935354	0.995	0.508	3	CD335
2.866978	0.990	0.515	3	CD94
2.800124	0.980	0.514	3	CD337
2.727929	0.975	0.549	3	CD328
3.378595	1.000	0.499	4	CD22
3.350789	1.000	0.962	4	CD20
2.844543	1.000	0.535	4	HLA-DR
2.786625	0.984	0.470	4	CD72
2.731457	1.000	0.556	4	HLA-DR-DP-DQ
3.986756	0.982	0.614	5	CD41
3.168936	0.974	0.586	5	CD9
2.901747	0.982	0.574	5	CD36
2.560670	0.939	0.552	5	CD62P
2.062905	0.982	0.666	5	CD69

To help us compare the abundance levels of markers across clusters, we will visualize the difference in average dsb values for each cluster relative to all other cells in a heatmap.

heatmap_colors <- colorRampPalette(c("#1F395F", "#8DA2C1", "#FFFFFF", "#DB8CA6", "#781534"))(50)

DE_cluster %>% 
  group_by(gene) %>% 
  # Filter proteins that are statistically significant for at least one cluster
  filter(any(p_val_adj < 0.001 & 
               abs(difference) > 2.2)) %>% 
  select(difference, cluster, gene) %>% 
  pivot_wider(names_from = "cluster", values_from = "difference") %>% 
  column_to_rownames("gene") %>% 
  pheatmap(cellwidth = 10, 
           cellheight = 10,
           angle_col = 0, color = heatmap_colors)

Now we can use the relative abundance levels of cells to manually assign cell identities to clusters.

# Cluster annotation as a named vector
cell_annotation <-
  c("0" = "CD4 T cells",
    "1" = "CD8 T cells",
    "2" = "Monocytes",
    "3" = "NK cells",
    "4" = "B cells",
    "5" = "Platelets")

# Add the cluster annotation to the data
pg_data_combined <- pg_data_combined %>%
  AddMetaData(setNames(cell_annotation[pg_data_combined$seurat_clusters], 
                       nm = colnames(pg_data_combined)), "cell_type")

# Plot UMAP colored by cell annotation
pg_data_combined %>% 
  DimPlot(group.by = "cell_type", reduction = "harmony_umap", label = TRUE)

Now, let’s take a look at the frequency of each of the cell identities within each condition.

CellCountPlot(pg_data_combined, group_by = "cell_type", color_by = "condition") &
  ggtitle("Cell counts per cell type and condition")

Compact annotation alternative - Marker gating

Another cell annotation strategy involves “gating” cells based on dsb-normalized abundance. In this tutorial, we use a dsb value of 1.5 as the gate threshold, given that non-expressing populations are typically centered around 0 with a standard deviation near 1. Here we will be quite strict and only annotate a cell if it passes all defined positive and negative protein marker gates, but we can of course relax this condition if needed to allow for small deviations.

The code below assigns a cell to the cell type for which it passes the highest proportion of gates and this highest proportion needs to be at least required_gate_pass_ratio (set to 1 in the example below). Cells that do not pass any gating strategy are annotated as NA.

# Define a list of cell-type specific markers, a minus sign indicates the marker should not be expressed in this cell-type
celltype_marker_list <- list(
    "CD8 T cells"  = c("CD3e", "CD8", "-CD4", "-CD64", "-CD337"),
    "CD4 T cells"  = c("CD3e", "CD4", "-CD8", "-CD64", "-CD337"),
    "NK cells"  = c("CD337", "NKp80", "-CD22", "-CD3e", "-CD64"),
    "B cells"     = c("CD22", "CD72", "-CD3e", "-CD64", "-CD337"),
    "Monocytes"  = c("Siglec-9", "CD64", "-CD22", "-CD3e", "-CD337"),
    "Platelets"  = c("CD41", "-CD3e", "-CD22", "-CD64", "-CD337")
)

# Define a function to calculate per cell the ratio of markers passing the gating strategy
calc_gates_passed <- function(expression_data, marker_sets, threshold){

  expression_data <- t(expression_data)
  gate_threshold <- threshold
  celltypes <- names(marker_sets)

  celltype_score <- matrix(0, nrow = nrow(expression_data), ncol = length(marker_sets))
  rownames(celltype_score) <- rownames(expression_data)
  colnames(celltype_score) <- celltypes

  for (ct in celltypes){
    for (marker in marker_sets[[ct]]){
      if (str_sub(marker, 1, 1) == "-"){
        celltype_score[,ct] <- celltype_score[,ct] + (expression_data[,str_sub(marker, 2)] < gate_threshold)
      } else {
        celltype_score[,ct] <- celltype_score[,ct] + (expression_data[,marker] > gate_threshold)
      }
    }
  celltype_score[,ct] <- celltype_score[,ct]/length(marker_sets[[ct]])
  }
  return(celltype_score)
}
cell_type_gate <- calc_gates_passed(LayerData(pg_data_combined), celltype_marker_list, threshold = 1.5)

# keep maximum celltype_score and set cell type to "other" if this value is < than the required_gate_pass_ratio 
required_gate_pass_ratio <- 1
gated_cell_type <- cell_type_gate %>%
  as.data.frame() %>%
  rownames_to_column("component") %>%
    pivot_longer(-component, names_to = "gate_celltype", values_to = "ratio") %>%
    group_by(component) %>%
    arrange(-ratio) %>%
    slice(1) %>%
    mutate(gate_celltype = ifelse(ratio >= required_gate_pass_ratio, gate_celltype, NA_character_))

# Add the resulting cell type labels to the data
pg_data_combined <- AddMetaData(pg_data_combined, gated_cell_type, col.name = "gate_celltype")

# Plot UMAP colored by cell annotation
pg_data_combined %>% 
  DimPlot(group.by = "gate_celltype", reduction = "harmony_umap")

With this strict annotation strategy, only 221 cells did not pass our criteria.

sum(is.na(pg_data_combined$gate_celltype))

[1] 220

Annotation sanity check

To validate our annotations against expected marker expression, we now visualize the abundance data for selected marker proteins. The DensityScatterPlot function allows us to compare the abundance distributions of two selected marker proteins. Let’s begin by looking at the distribution of CD3e against its isotype control, mIgG1.

DensityScatterPlot(
  pg_data_combined, 
  layer = "data", 
  marker1 = "mIgG1", 
  marker2 = "CD3e", 
  margin_density = TRUE
)

As anticipated, the isotype control (mIgG1) forms a single distribution, indicating background levels across all cells. However, since mIgG1 is negative in all cell types, dsb has not centered the values at 0. In contrast, CD3E shows two distinct distributions: one centered at 0, representing CD3E-negative cells (likely non-T cell PBMCs), and another centered around 3, representing CD3E-positive T cells.

Now, let’s examine whether our annotated cell types express the expected markers. For instance, we can look at the abundance of CD4 and CD8 proteins across the different identified cell types:

DensityScatterPlot(pg_data_combined, layer = "data", marker1 = "CD4", marker2 = "CD8", facet_vars = "gate_celltype")

As predicted, we observe CD4 expression in CD4 T cells and CD8 expression in CD8 T cells. The presence of CD8 on some NK cells is also consistent with known biology. To further verify our cell type annotations, let’s examine the expression patterns of additional key markers: CD14 (expected in Monocytes), CD19 (B cells), CD161 (Innate T cells), and CD337 (NK cells).

p1 <- DensityScatterPlot(pg_data_combined, layer = "data", marker1 = "CD14", marker2 = "CD22", facet_vars = "gate_celltype")
p2 <- DensityScatterPlot(pg_data_combined, layer = "data", marker1 = "CD161", marker2 = "CD337", facet_vars = "gate_celltype")

p1 | p2

Compare annotations

Now, let’s assess the agreement between our clustering-based and gating-based cell type annotations:

table(pg_data_combined[[]]$cell_type, pg_data_combined[[]]$gate_celltype) %>%
  pheatmap(., 
  display_numbers = .,
  cluster_rows = TRUE, 
  cluster_columns = TRUE, 
  treeheight_row = 0,
  treeheight_col = 0,
  color = colorRampPalette(c("white", "red", "darkred"))(50), )

The annotations show substantial agreement. It’s important to note that gating, by its nature of focusing on a select few markers, can be susceptible to misclassifications. A data-driven approach can leverage abundance information from all proteins to find populations that are easily missed with a more rudimentary gating strategy. However, it offers a valuable complement to clustering-based annotation by refining inaccurate assignments. For instance, a B cell cluster should ideally be devoid of CD3E+/CD14+ cells, and a T cell cluster should not contain CD19+/CD14+ cells. Depending on the analytical goals, a strict gating strategy might be more suitable, for instance, when defining cell populations based on well-established protein markers from the literature.

In this tutorial we have learned how to annotate cell populations and to compare protein abundance levels between annotated cell types.

We will now save this annotated object to proceed to the next tutorial, where we will demonstrate how to perform differential abundance testing between experimental conditions within a specific cell type.

# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_annotated.rds"))

Cell Annotation

Setup​

Load data​

Clustering​

Cell type marker detection​

Compact annotation alternative - Marker gating​

Annotation sanity check​

Compare annotations​