Skip to main content
Version: 0.18.x

Abundance Normalization

As with any high-throughput techniques, normalization and denoising of MPX abundance data can be a useful tool to make samples and datasets more comparable to each other as well as help with the biological interpretation of your data. Therefore in this tutorial we will introduce and discuss dsb, our method of choice for the normalization and denoising of mixed cell population data as well as a more general approach in the Centered-Log-Ratio normalization CLR.

We will begin with a short introduction to dsb and the algorithm behind it. Then, you will learn how to apply this technique to an MPX dataset. We will finish with some words of caution when applying these -or any- type of normalization.

After completing this tutorial, you should be able to:

  • Understand the need for and the algorithm behind dsb normalization
  • How to apply this normalization method to an MPX dataset
  • Evaluate the effect of normalization on marker abundance
  • Understand the potential pitfalls of normalization

Why denoise and normalize?

By their very nature antibody-based assays can be affected by variable amounts of nonspecific binding. In leukocytes, this nonspecific binding is partly a result of nonspecific binding of antibodies to Fc-receptors present on the surface of many white blood cell types. These background levels of antibody binding (or “noise”) can fluctuate between samples and cells as a result of sample type and quality, preprocessing conditions, antibody isotype and other issues. Ultimately, this can lead to difficulties distinguishing negative and positive subpopulations of cells expressing a certain marker, complicating cell type annotation and differential abundance analysis.

Therefore, during the sample preprocessing of MPX samples, several blocking steps are included to reduce the possibility of nonspecific binding. In addition, computational methods such as the one presented here can, to some extent, be utilized to reduce the bias of nonspecific binding.

The dsb algorithm

The dsb method was originally developed to denoise and normalize antibody-based assays on droplet-based platforms such as CITE-seq. It is based on the detection of background marker levels in so-called “empty” droplets (droplets encapsulating biological material but not a corresponding cell). MPX does not need cell compartmentalization and therefore does not produce empty droplets as such. However, it has been shown that the background abundance levels of a marker can be inferred from its abundance distribution across all cells in a mixed cell population sample (see dsb paper). By calculating the mean abundance in this background population per marker and subtracting these values from the marker’s abundance, each marker’s background population becomes effectively zero centered. In a subsequent step, the isotype control markers and a cell-specific background mean are combined in a “noise component” that is regressed out of the zero centered data corrected for background nonspecific levels.

To summarize these different steps:

  1. Transform the count data using log1plog1p transformation.
  2. Fit a finite mixed gaussian model with two components (i.e. two normal distributions) to each marker across all cells. And as a background correction step, subtract the mean of the first component on the logged data. Note: if no two component fit was possible, background correction is skipped.
  3. Next, fit a finite mixed gaussian model with two components to each cell across all markers. The mean of the first component is the cell-wise background mean. This value is combined with the corrected counts of the isotype controls to form a “noise matrix”. The first principal component of this noise matrix is regressed out.
  4. The resulting residuals are returned as normalized and denoised values.

Setup

To run this tutorial, several R packages need to be installed and loaded.

library(pixelatorR)
library(SeuratObject)
library(Seurat)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(tibble)
library(here)
library(patchwork)
library(ggridges)

Load Data

To illustrate the normalization procedure we continue with the quality checked and filtered merged object we created in the previous tutorial. As described there, this merged object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.

DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data_filtered.rds"))
pg_data_combined
An object of class Seurat 
84 features across 3844 samples within 1 assay
Active assay: mpxCells (84 features, 84 variable features)
1 layer present: counts

NormalizeMPX

the dsb procedure has been wrapped into a single function in pixelatorR. A call to NormalizeMPX as below will perform the normalization and fills the data layer of the assay (here mpxCells, the default assay if not specified) with the normalized counts. Since dsb uses the antibody isotype controls to perform the normalization, we need to specify these to NormalizeMPX.

# specify the isotype controls included in the antibody panel.
isotype_controls = c("mIgG1", "mIgG2a", "mIgG2b")

pg_data_combined <-
pg_data_combined %>%
# Join counts layers of merged object
JoinLayers() %>%
# dsb transform data per cell
NormalizeMPX(method = "dsb",
isotype_controls = isotype_controls,
assay = "mpxCells")

Evaluate normalization

After having run the normalization step it is a good idea to visualize the overall distribution of some markers of interest before and after normalization. This will allow you to assess in more detail how this procedure has modified the abundance levels of markers. Let’s have a look at CD19 as an example.

# Visualize the log1p transformed counts to inspect the distribution befored dsb normalization
p_logged_cts <- FetchData(pg_data_combined,
vars = c("sample", "condition", "CD19"),
layer = "counts") %>%
mutate(log1p_CD19 = log1p(CD19)) %>%
ggplot(aes(x = log1p_CD19, y = sample, fill = sample)) +
geom_density_ridges(scale = 4, alpha = 0.8) +
scale_y_discrete(expand = c(0, 0), name = NULL) +
scale_x_continuous(expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_ridges() +
geom_vline(xintercept = 0) +
labs(title = "CD19", x = "log1p transformed")

# After runnning NormalizeMPX the data slot of the mpxCells assay now contains the normalized values
p_dsb_norm <- FetchData(pg_data_combined,
vars = c("sample", "condition", "CD19"),
layer = "data") %>%
ggplot(aes(x = CD19, y = sample, fill = sample)) +
geom_density_ridges(scale = 4, alpha = 0.8) +
scale_y_discrete(expand = c(0, 0), name = NULL) +
scale_x_continuous(expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_ridges() +
geom_vline(xintercept = 0) +
labs(title = "CD19", x = "dsb transformed")

p_logged_cts / p_dsb_norm + patchwork::plot_layout(guides = "collect")

CD19 is a cell type marker that is specific for B cells. The population of B cells in these samples should be relatively small and the majority of cells should therefore not express CD19 at a meaningful level. However, as you can see in the upper ridge plot prior to normalization the majority of cells do express low levels of CD19 as a result of nonspecific binding. In addition, there is clearly some variability between the levels of this nonspecific background noise. As you can see in the lower ridge plot, the dsb normalization centers this negative background population around 0, while retaining a small CD19+ population of B cells. This normalization and denoising procedure will thus make it easier to annotate cell clusters and gate for cell population as we will show in more detail in the next section.

Zero-centering background population simplifies downstream analysis

One of the benefits of this type of normalization is that it facilitates detecting and annotating specific cell populations in one or more mixed population samples. As an example we can create a UMAP of the four samples and visualize some well-known cell-type specific markers.

# create a new assay to store the logged count information
logged_cts <- log1p(GetAssayData(pg_data_combined, assay = "mpxCells", slot = "counts"))
log_assay <- CreateAssay5Object(data = logged_cts)
pg_data_combined[["log_assay"]] <- log_assay

DefaultAssay(pg_data_combined) <- "mpxCells"
pg_data_combined <-
pg_data_combined %>%
FindVariableFeatures(selection.method = "vst", layer = "data") %>%
ScaleData() %>%
RunPCA(npcs = 30) %>%
RunUMAP(dims = 1:10)

# Overlay the UMAP with dsb normalized abundance levels
DefaultAssay(pg_data_combined) <- "mpxCells"
umap_dsb <- FeaturePlot(pg_data_combined, feature = c("CD4", "CD8", "CD19", "CD14"), reduction = "umap", ncol = 1, min.cutoff = 0, coord.fixed = TRUE)

# Overlay the UMAP with log1p transformed values
DefaultAssay(pg_data_combined) <- "log_assay"
umap_log <- FeaturePlot(pg_data_combined, feature = c("CD4", "CD8", "CD19", "CD14"), reduction = "umap", ncol = 1, min.cutoff = 0, coord.fixed = TRUE)

umap_dsb | umap_log

Here, we assess the distribution of abundance of a few cell specific markers: CD4 is positive for T helper cells and monocytes, CD8 for cytotoxic T cells, CD19 for B cells, and CD14 for monocytes. We can see that the protein abundance is restricted to single clusters for the dsb transformed counts on the left, while the log transformed counts on the right show an “ambient” level across most clusters. As you can appreciate from these plots the dsb normalized values present a much clearer distinction between positive and negative populations than the log1p transformed values.

Besides these benefits for visualization and annotation, gating for specific cell populations becomes more straightforward using dsb-corrected values.

DefaultAssay(pg_data_combined) <- "log_assay"
scatter_log <- DensityScatterPlot(pg_data_combined,
marker1 = "CD3E",
marker2 = "CD4",
layer = "data",
margin_density = TRUE
)

DefaultAssay(pg_data_combined) <- "mpxCells"
scatter_dsb <- DensityScatterPlot(pg_data_combined,
marker1 = "CD3E",
marker2 = "CD4",
layer = "data",
margin_density = TRUE)

scatter_log | scatter_dsb

The log1plog1p transformed counts on the left are not zero centered and thus make it not straightforward to separate positive from negative cells in a uniform way. The dsb values on the left do show negative populations centered around zero for both CD3E and CD4.

CLR

When dealing with pure cell populations such as cell lines or FACS-sorted populations, dsb can sometimes result in unexpected values as it can not reliably detect positive and negative populations for certain markers in these uniform samples. Therefore, we also include the CLR method as an alternative normalization approach. This method was originally developed for CITE-seq data (Stoeckius et al., 2017) but quickly gained traction in other antibody-based assays. It relies on the assumption that the geometric mean of marker counts per cell is a good reference for normalizing the counts. Here we implemented it as the log1plog1p of the marker counts minus the geometric mean of the log1plog1p counts per cell. The CLR normalization has been wrapped into the NormalizeMPX function and can be used by setting the method parameter to clr.

# Example of CLR normalization
pg_data_combined <- NormalizeMPX(pg_data_combined,
method = "clr")

Remarks

As any normalization method, dsb has some potential pitfalls that can be avoided by a thorough understanding of your experimental design and analysis plan. Here we list a few recommendations and good practices:

  • The dsb normalization method works optimally for markers that have a positive and a negative population in the experimental setup; i.e. in datasets consisting of different cell populations (such as PBMCs). This maximizes the chances that the GMM fitting will reliably detect the negative background population. Consequently, markers such as HLA-ABC that have a relatively consistent high expression across all cell types might display some unexpected behaviour (i.e. these can show a low dsb value despite being highly expressed in all cells). If your dataset mainly consists of very similar cells, such as for example pure cell lines, it might be prudent to compare the dsb output to log- or CLR-normalization.
  • Do not apply normalization blindly and always be aware of the effect of a transformation when using normalized marker counts for downstream analysis. A good practice is to visualize the distribution of markers you deem important in your analysis before and after normalization with, for example, violin plots. Does this correspond to what you expect and if not, ask yourself why could this be the case?
Differential Abundance testing on normalized values

Make sure the normalized values produced by dsb and/or CLR are fitting for the test you would like to perform! Some downstream differential abundance tests demand strictly positive values as input. For example, the FindAllMarkers and FindMarkers functions from Seurat require that the data provided to them is strictly non-negative and do not provide Log Fold Changes for markers that are on average negative in one or more groups. If you would like to use these functions as part of your downstream analysis, we suggest to set the mean.fxn parameter to “rowMeans” and the fc.name parameter to “difference”. By doing this you will output the difference in normalized values between the two groups to be compared (as opposed to the Log Fold Change). This output is robust to negative values in the testing procedure.

Finally, let’s save the object with the dsb normalized counts for the next tutorial on Data Integration.

# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_normalized.rds"))