Abundance Normalization

Normalization of protein abundance levels is a critical step in the analysis of single-cell PNA data. It ensures that abundance measurements are comparable across cells, which is essential for accurate biological interpretation. This tutorial introduces and discusses dsb, our preferred method for normalizing and denoising PNA data, particularly for mixed cell populations. We will also cover Centered-Log-Ratio (CLR) (CLR normalization, a more general approach.

This tutorial will first provide a concise introduction to dsb and its underlying algorithm. You will then learn how to apply this method to a PNA dataset. Finally, we will discuss important considerations and potential pitfalls when applying these or any normalization strategies.

Upon completing this tutorial, you will be able to:

Understand the necessity for and the algorithm of dsb normalization.
Apply dsb normalization to a PNA dataset.
Evaluate the impact of normalization on protein abundance.
Recognize the potential pitfalls of normalization.

Why denoise and normalize?

Antibody-based assays, including PNA, can be affected by variable non-specific antibody binding. In leukocytes, for example, this can occur due to antibodies binding to Fc-receptors on the cell surface. This non-specific binding introduces background signal, or “noise,” which can vary significantly between samples and even individual cells. Factors such as sample type and quality, preprocessing conditions, and antibody isotype can influence noise levels. Ultimately, this noise can obscure the identification of cell subpopulations that are truly positive or negative for a specific protein, thereby complicating cell type annotation and differential abundance analysis.

While PNA sample processing includes blocking steps to minimize non-specific binding, computational methods like dsb can further reduce noise.

Normalization also plays a crucial role in transforming abundance data into a more interpretable scale. Many normalization strategies employ a log-transformation to achieve this. The dynamic range of protein abundance levels can be very wide, and without such a transformation, highly abundant proteins can disproportionately influence downstream analyses.

The dsb algorithm

The dsb method was initially developed for denoising and normalizing antibody data from droplet-based single-cell platforms like CITE-seq. It leverages the detection of background protein levels in “empty” droplets, which ideally contain reagents but no cell. While PNA does not involve cell compartmentalization and thus does not generate empty droplets in the same way, it has been demonstrated that the background abundance of a protein can be inferred from its distribution across all cells within a mixed population sample (see dsb paper).

dsb adapts this concept by first estimating the background signal for each protein based on the overall cellular abundance distribution. The mean abundance of this inferred background population for each protein is then subtracted from the protein’s abundance in each cell, effectively centering the background at zero. Subsequently, the signals from isotype control proteins and a cell-specific background mean are integrated into a “noise component.” The first principal component of this noise matrix is then regressed out of the background-corrected data to further reduce noise.

In summary, the dsb algorithm involves the following steps:

Apply a log1p transformation to the raw count data.
For each protein, fit a two-component finite Gaussian mixture model to its abundance across all cells. As a background correction step, subtract the mean of the first (typically lower abundance) component from the logged data. Note: If a two-component fit is not possible, this background correction is skipped for that protein.
For each cell, fit a two-component finite Gaussian mixture model across all proteins. The mean of the first component represents the cell-wise background mean. This value is combined with the background-corrected counts of the isotype controls to create a “noise matrix.” Regress out the first principal component of this noise matrix.
The resulting residuals are returned as the normalized and denoised protein abundance values.

CLR

When dealing with pure cell populations such as cell lines or FACS-sorted populations, dsb can sometimes result in unexpected values as it cannot reliably detect positive and negative populations for certain proteins. Therefore, we also include the CLR method as an alternative normalization approach. This method was originally developed for CITE-seq data (Stoeckius et al., 2017) but quickly gained traction in other antibody-based assays. It relies on the assumption that the geometric mean of protein counts per cell is a good reference for normalizing the counts. In pixelatorR, CLR is implemented as the log1p of the protein counts minus the geometric mean of the log1p counts per cell. The CLR normalization has been wrapped into the Normalize function and can be activated by setting method = "clr".

Setup

To follow this tutorial, you will need to load the following R packages:

library(pixelatorR)
library(SeuratObject)
library(Seurat)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(tibble)
library(here)
library(patchwork)
library(ggridges)

Load Data

To illustrate the normalization procedure we continue with the quality checked and filtered object we created in the previous tutorial. As described there, this merged object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.

DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)

# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data_filtered.rds"))
pg_data_combined

An object of class Seurat 
158 features across 1970 samples within 1 assay 
Active assay: PNA (158 features, 158 variable features)
 1 layer present: counts

Normalizations

DSB and CLR procedures have been wrapped into single function in pixelatorR called Normalize. Since dsb uses the antibody isotype controls to perform the normalization, we need to specify these with the isotype_controls parameter in Normalize. To facilitate later comparisons, we also construct a layer with log1p-transformed data. A call to Normalize as below will perform the normalization and fills the data layer of the assay (here “PNA”, the default assay if not specified) with the normalized values.

# specify the isotype controls included in the antibody panel.
isotype_controls = c("mIgG1", "mIgG2a", "mIgG2b")

pg_data_combined <- pg_data_combined %>% 
  # Join counts layers of merged object
  JoinLayers() %>% 
  # dsb transform data per cell
  Normalize(method = "dsb", isotype_controls = isotype_controls, assay = "PNA")

# Create a new assay to store the CLR-normalized counts
pg_data_combined[["clr_assay"]] <- pg_data_combined[["PNA"]]
pg_data_combined <- pg_data_combined %>% 
  # Join counts layers of merged object
  JoinLayers() %>% 
  # dsb transform data per cell
  Normalize(method = "clr", assay = "clr_assay")

# create a new assay to store the log1p-transformed counts
logged_cts <- log1p(LayerData(pg_data_combined, assay = "PNA", layer = "counts"))
log_assay <- CreateAssay5Object(data = logged_cts)
pg_data_combined[["log_assay"]] <- log_assay

pg_data_combined

An object of class Seurat 
474 features across 1970 samples within 3 assays 
Active assay: PNA (158 features, 158 variable features)
 2 layers present: counts, data
 2 other assays present: clr_assay, log_assay

Evaluate normalization

After having run the normalization step it can be helpful to visualize the overall distribution of some proteins of interest with different normalization strategies. This will allow you to assess in more detail how this procedure has modified the abundance levels of proteins. Let’s have a look at CD19 as an example.

# Visualize the normalized CD19 abundance levels
cd8_norm <- FetchData(
  pg_data_combined, 
  vars = c("condition", "logassay_CD19", "clrassay_CD19", "PNA_CD19")
) %>% 
  rename(log1p = logassay_CD19, clr = clrassay_CD19, dsb = PNA_CD19) %>%
  pivot_longer(all_of(c("log1p", "clr", "dsb")))

cd8_norm %>% 
  ggplot(aes(x = value, y = condition, fill = condition)) +  
  geom_density_ridges(scale = 4, alpha = 0.8) + 
  scale_y_discrete(expand = c(0, 0), name = NULL) + 
  scale_x_continuous(expand = c(0, 0)) + 
  coord_cartesian(clip = "off") + 
  theme_ridges() +  
  geom_vline(xintercept = 0) +  
  facet_grid(name ~ .) +
  labs(x = "Normalized value", title = "CD19")

CD19 is a B cell-specific protein marker. The population of B cells in these samples should be quite small and the majority of cells should be negative. With CLR, the negative population is negative with a mean around -1. dsb centers the negative population at 0, and the positive population has values above 2, making it straightforward to separate the B cells from the negative population. This property of the dsb-normalized values simplifies other analytical tasks, such as cell annotation and cell gating, which will be covered in upcoming tutorials. With log1p, both the negative and positive populations have high values for CD8 and could be falsely interpreted as two positive populations.

Zero-centering background population simplifies downstream analysis

One of the benefits of this dsb-normalization is that it facilitates detecting and annotating specific cell populations in one or more mixed population samples. As an example, we can create a UMAP of the four samples and visualize some well-known cell-type specific protein markers.

DefaultAssay(pg_data_combined) <- "PNA"
pg_data_combined <- pg_data_combined %>%
  ScaleData() %>%
  RunPCA() %>%
  RunUMAP(dims = 1:10)

markers <- c("CD4", "CD8", "CD19", "CD14")
DefaultAssay(pg_data_combined) <- "PNA"
df1 <- FetchData(pg_data_combined, vars = c("umap_1", "umap_2", markers)) %>% mutate(type = "dsb")
DefaultAssay(pg_data_combined) <- "log_assay"
df2 <- FetchData(pg_data_combined, vars = c("umap_1", "umap_2", markers)) %>% mutate(type = "log1p")
df <- bind_rows(df1, df2) %>% pivot_longer(all_of(markers), names_to = "marker")

umap_plot <- lapply(c("dsb", "log1p"), function(x) {
  df <- df %>% filter(type == x)
  lapply(markers, function(m) {
    df_m <- df %>% 
      filter(marker == m)
    max_abs_val <- max(abs(df_m$value))
    df_m %>% 
      ggplot(aes(umap_1, umap_2, color = value)) +
      geom_point(size = 0.5) +
      facet_grid(~ paste0(marker, " (", type, ")"), scales = "free", space = "free") +
      theme_bw() +
      scale_color_gradientn(colours = c("#1F395F", "#8DA2C1", "#FFFFFF", "#DB8CA6", "#781534"), 
                            limits = c(-max_abs_val, max_abs_val))
  }) %>% wrap_plots(ncol = 1)
}) %>% wrap_plots(ncol = 2) + 
  plot_annotation(title = "dsb and log1p transformed data") & 
  theme(axis.ticks = element_blank(), axis.text = element_blank(), axis.title = element_blank())

umap_plot

The color bar is centered at 0, where 0 represent the negative populations and positive values represent positive populations. Here, we assess the distribution of abundance of a few cell specific protein markers: CD4 is positive for T helper cells, CD8 for cytotoxic T cells, CD19 for B cells, and CD14 for Monocytes. We can see that the protein abundance is mostly restricted to single clusters for the dsb-normalized counts on the left, while the log1p-transformed counts on the right show “ambient” levels across most clusters. From these plots we see that the dsb-normalized values are better at distinguishing between positive and negative populations than the log1p-transformed counts.

Besides these benefits for visualization and annotation, gating for specific cell populations becomes more straightforward using dsb-corrected values.

DefaultAssay(pg_data_combined) <- "log_assay"
scatter_log <- DensityScatterPlot(pg_data_combined,
                  marker1 = "CD45",
                  marker2 = "CD14",
                  layer = "data",
                  margin_density = FALSE) +
  ggtitle("log1p")

DefaultAssay(pg_data_combined) <- "clr_assay"
scatter_clr <- DensityScatterPlot(pg_data_combined,
                  marker1 = "CD45",
                  marker2 = "CD14",
                  layer = "data",
                  margin_density = FALSE) +
  ggtitle("CLR")


DefaultAssay(pg_data_combined) <- "PNA"
scatter_dsb <- DensityScatterPlot(pg_data_combined,
                  marker1 = "CD45",
                  marker2 = "CD14",
                  layer = "data",
                  margin_density = FALSE) +
  ggtitle("dsb")

scatter_log | scatter_dsb | scatter_clr

The negative population is not zero-centered with the log1p-transformed counts and thus make it challenging to separate positive and negative populations. Upon inspection of the dsb-normalized counts, we can clearly see three main populations: CD45-/CD14- cells (e.g. Platelets), CD45+/CD14+ Monocytes and CD45+/CD14- cells (B and T cells). The CLR-normalized counts also show three distinct populations, but the values for the negative populations are not necessarily zero-centered.

Note that for later tutorials, we will continue with dsb as the default normalization method.

Remarks

As any other normalization method, dsb has limitations. However, most limitations can typically be circumvented with a good experimental design and analysis plan. Below is a list of a few recommendations, remarks and good practices:

The dsb normalization method works optimally for proteins that have a positive and a negative population in the experimental setup; i.e. in datasets consisting of different cell populations (such as PBMCs). This increases the chance that the GMM fitting will reliably detect the negative background population. Consequently, proteins such as HLA-ABC that have a relatively consistent high expression across all cell types might display some unexpected behaviour (i.e. these can show a low dsb value despite being highly expressed in all cells). If your dataset mainly consists of very similar cells, such as for example pure cell lines, it might be prudent to compare the dsb output to log- or CLR-normalization.
Even in mixed cell populations, the dsb normalization method might not work optimally for all proteins and can be sensitive to batch effects. There’s no guarantee that it will find the correct negative population for all proteins. It is therefore good practice to inspect the distributions for selected protein markers.
Do not apply normalization blindly and always be aware of the effect of a transformation when using normalized protein counts for downstream analysis. A good practice is to visualize the abundance values for proteins you deem important in your analysis before and after normalization with, for example, violin plots. Does this correspond to what you expect and if not, ask yourself why could this be the case?

Differential Abundance testing on normalized values

Make sure the normalized values produced by dsb and/or CLR are suitable for the test you would like to perform! Some downstream differential abundance tests demand strictly positive values as input. For example, the FindAllMarkers and FindMarkers functions from Seurat require that the data provided to them is strictly non-negative and do not provide Log Fold Changes for proteins that are on average negative in one or more groups. If you would like to use these functions as part of your downstream analysis, we suggest to set the mean.fxn parameter to “rowMeans” and the fc.name parameter to “difference”. By doing this you will output the difference in normalized values between the two groups to be compared (as opposed to the Log Fold Change). This output is robust to negative values in the testing procedure.

Finally, let’s save the object with the dsb normalized counts for the next tutorial on Data Integration.

# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_normalized.rds"))

Abundance Normalization

Why denoise and normalize?​

The dsb algorithm​

CLR​

Setup​

Load Data​

Normalizations​

Evaluate normalization​

Zero-centering background population simplifies downstream analysis​

Remarks​