Abundance Normalization
As with any high-throughput techniques, normalization and denoising of MPX abundance data can be a useful tool to make samples and datasets more comparable to each other as well as help with the biological interpretation of your data. Therefore in this tutorial we will introduce and discuss dsb, our method of choice for the normalization and denoising of mixed cell population data as well as a more general approach in the Centered-Log-Ratio normalization(CLR).
We will begin with a short introduction to dsb and the algorithm behind it. Then, you will learn how to apply this technique to an MPX dataset. We will finish with some words of caution when applying these -or any- type of normalization.
After completing this tutorial, you should be able to:
- Understand the need for and the algorithm behind dsb normalization
- How to apply this normalization method to an MPX dataset
- Evaluate the effect of normalization on marker abundance
- Understand the potential pitfalls of normalization
Why denoise and normalize?
By their very nature antibody-based assays can be affected by variable amounts of nonspecific binding. In leukocytes, this nonspecific binding is partly a result of nonspecific binding of antibodies to Fc-receptors present on the surface of many white blood cell types. These background levels of antibody binding (or “noise”) can fluctuate between samples and cells as a result of sample type and quality, preprocessing conditions, antibody isotype and other issues. Ultimately, this can lead to difficulties distinguishing negative and positive subpopulations of cells expressing a certain marker, complicating cell type annotation and differential abundance analysis.
Therefore, during the sample preprocessing of MPX samples, several blocking steps are included to reduce the possibility of nonspecific binding. In addition, computational methods such as the one presented here can, to some extent, be utilized to reduce the bias of nonspecific binding.
The dsb algorithm
The dsb method was originally developed to denoise and normalize antibody-based assays on droplet-based platforms such as CITE-seq. It is based on the detection of background marker levels in so-called “empty” droplets (droplets encapsulating biological material but not a corresponding cell). MPX does not need cell compartmentalization and therefore does not produce empty droplets as such. However, it has been shown that the background abundance levels of a marker can be inferred from its abundance distribution across all cells in a mixed cell population sample (see dsb paper). By calculating the mean abundance in this background population per marker and subtracting these values from the marker’s abundance, each marker’s background population becomes effectively zero centered. In a subsequent step, the isotype control markers and a cell-specific background mean are combined in a “noise component” that is regressed out of the zero centered data corrected for background nonspecific levels.
To summarize these different steps:
- Transform the count data using transformation.
- Fit a finite mixed gaussian model with two components (i.e. two normal distributions) to each marker across all cells. And as a background correction step, subtract the mean of the first component on the logged data. Note: if no two component fit was possible, background correction is skipped.
- Next, fit a finite mixed gaussian model with two components to each cell across all markers. The mean of the first component is the cell-wise background mean. This value is combined with the corrected counts of the isotype controls to form a “noise matrix”. The first principal component of this noise matrix is regressed out.
- The resulting residuals are returned as normalized and denoised values.
Setup
To run this tutorial, several python libraries and functions need to be imported.
import numpy as np
import pandas as pd
import scanpy as sc
from pathlib import Path
from pixelator import read, simple_aggregate
from pixelator.plot import density_scatter_plot
from pixelator.analysis.normalization import dsb_normalize
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
DATA_DIR = Path("<path to the directory to save datasets to>")
Load Data
To illustrate the normalization procedure we continue with the quality checked and filtered merged object we created in the previous tutorial. As described there, this merged object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.
pg_data_combined_pxl_object = read(DATA_DIR/ "combined_data_filtered.pxl")
# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined = pg_data_combined_pxl_object.adata
dsb_normalize
The dsb procedure has been wrapped into a single function in
pixelator called dsb_normalize
. Since dsb uses the antibody
isotype controls to perform the normalization, we need to specify these
to dsb_normalize
. To facilitate later comparisons, we also construct a
layer with transformed data.
# Add dsb values as a layer to the object
pg_data_combined.layers["dsb"] = dsb_normalize(
pg_data_combined.to_df(),
isotype_controls = ["mIgG1", "mIgG2a", "mIgG2b"]
)
# Create log1p layer for later comparisons
pg_data_combined.layers['log1p'] = np.log1p(pg_data_combined.to_df())
Evaluate normalization
After having run the normalization step it is a good idea to visualize the overall distribution of some markers of interest before and after normalization. This will allow you to assess in more detail how this procedure has modified the abundance levels of markers. Let’s have a look at CD19 as an example.
# log1p ridgeplot data
layer_data_log = pg_data_combined.to_df("log1p")
data_log = layer_data_log.loc[:, "CD19"]
samples = pg_data_combined.obs.loc[:, 'sample']
plot_data_log = pd.DataFrame({'CD19' : data_log, 'Samples' : samples})
# Initialize the FacetGrid object
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0), 'axes.linewidth':2})
pal = sns.color_palette("Set2", 12)
g = sns.FacetGrid(plot_data_log, row="Samples", hue="Samples", aspect=9, height=1.2, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "CD19", bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, fontweight="bold", color=color,
ha="left", va="center", transform=ax.transAxes)
g.map(label, "CD19")
# Set the subplots to overlap
g.figure.subplots_adjust(hspace=-.5)
# Clean up plot
g.set_titles("")
g.set(yticks=[], ylabel="")
g.despine(bottom=True, left=True)
g.fig.suptitle('Log1p Transformation')
# Dsb ridgeplot
layer_data_dsb = pg_data_combined.to_df("dsb")
data_dsb = layer_data_dsb.loc[:, "CD19"]
plot_data_dsb = pd.DataFrame({'CD19' : data_dsb, 'Samples' : samples})
# Initialize the FacetGrid object
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0), 'axes.linewidth':2})
pal = sns.color_palette("Set2", 12)
g = sns.FacetGrid(plot_data_dsb, row="Samples", hue="Samples", aspect=9, height=1.2, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "CD19", bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5)
# Add label
g.map(label, "CD19")
# Set the subplots to overlap
g.figure.subplots_adjust(hspace=-.5)
# Clean up plot
g.set_titles("")
g.set(yticks=[], ylabel="")
g.despine(bottom=True, left=True)
g.fig.suptitle('dsb Transformation')
dsb_plot = g.fig.show()
CD19 is a cell type marker that is specific for B cells. The population of B cells in these samples should be relatively small and the majority of cells should therefore not express CD19 at a meaningful level. However, as you can see in the upper ridge plot prior to normalization the majority of cells do express low levels of CD19 as a result of nonspecific binding. In addition, there is clearly some variability between the levels of this nonspecific background noise. As you can see in the lower ridge plot, the dsb normalization centers this negative background population around 0, while retaining a small CD19+ population of B cells. This normalization and denoising procedure will thus make it easier to annotate cell clusters and gate for cell population as we will show in more detail in the next section.
Zero-centering background population simplifies downstream analysis
One of the benefits of this type of normalization is that it facilitates detecting and annotating specific cell populations in heterogeneous cell type samples. As an example we can create a UMAP of the four samples and visualize some well-known cell-type specific markers.
sc.pp.highly_variable_genes(
pg_data_combined, flavor="seurat_v3", n_top_genes=50
)
pg_data_combined.layers["scaled_dsb"] = sc.pp.scale(
pg_data_combined, zero_center=True, layer="dsb", copy=True
).layers["dsb"]
pg_data_combined.obsm["pca"] = sc.tl.pca(
pg_data_combined.layers["scaled_dsb"], random_state=42
)
sc.pp.neighbors(
pg_data_combined,
n_neighbors=30,
n_pcs=10,
use_rep="pca",
metric="cosine",
)
sc.tl.umap(
pg_data_combined,
min_dist=0.3,
)
markers_of_interest = ["CD4", "CD8", "CD19", "CD14"]
sc.pl.umap(pg_data_combined, color=markers_of_interest, ncols=4, layer = "log1p", vmin=0)
sc.pl.umap(pg_data_combined, color=markers_of_interest, ncols=4, layer = "dsb", vmin=0)
Here, we assess the distribution of abundance of a few cell specific markers: CD4 is positive for T helper cells and monocytes, CD8 for cytotoxic T cells, CD19 for B cells, and CD14 for monocytes. We can see that the protein abundance is restricted to single clusters for the dsb transformed counts in the top row, while the log transformed counts in the bottom row show an “ambient” level across most clusters. As you can appreciate from these plots the dsb normalized values present a much clearer distinction between positive and negative populations than the transformed values. Besides these benefits for visualization and annotation, dsb-corrected values allows for applying more uniform gating strategies across markers.
density_scatter_plot(pg_data_combined, "CD3E", "CD4", layer="log1p")
density_scatter_plot(pg_data_combined, "CD3E", "CD4", layer="dsb")
CLR
When dealing with pure cell populations such as cell lines or FACS-sorted populations, dsb can sometimes result in unexpected values as it can not reliably detect positive and negative populations for certain markers in these uniform samples. Therefore, we also include the CLR method as an alternative normalization approach. This method was originally developed for CITE-seq data (Stoeckius et al., 2017) but quickly gained traction in other antibody-based assays. It relies on the assumption that the geometric mean of marker counts per cell is a good reference for normalizing the counts. Here we implemented it as the of the marker counts minus the geometric mean of the counts per cell.
from pixelator.statistics import clr_transformation
pg_data_combined.layers["clr"] = clr_transformation(
pg_data_combined.to_df(), axis=1
)
Remarks
As any normalization method, dsb has some potential pitfalls that can be avoided by a thorough understanding of your experimental design and analysis plan. Here we list a few recommendations and good practices:
- The dsb normalization method works optimally for markers that have a positive and a negative population in the experimental setup; i.e. in datasets consisting of different cell populations (such as PBMCs). This maximizes the chances that the GMM fitting will reliably detect the negative background population. Consequently, markers such as HLA-ABC that have a relatively consistent high expression across all cell types might display some unexpected behaviour (i.e. these can show a low dsb value despite being highly expressed in all cells). If your dataset mainly consists of very similar cells, such as for example pure cell lines, it might be prudent to compare the dsb output to log- or CLR-normalization.
- Do not apply normalization blindly and always be aware of the effect of a transformation when using normalized marker counts for downstream analysis. A good practice is to visualize the distribution of markers you deem important in your analysis before and after normalization with, for example, violin plots. Does this correspond to what you expect and if not, ask yourself why could this be the case?
Make sure the normalized values produced by dsb and/or CLR are
fitting for the test you would like to perform! Some downstream
differential abundance tests demand strictly positive values as input.
For example, the rank_genes_group
function from Scanpy requires that
the data provided to them is strictly non-negative and do not provide
Log Fold Changes for markers that are on average negative in one or more
groups. If you would like to use these function as part of your
downstream analysis, we suggest to shift the abundance of markers with
negative values to strictly positive. This would leave the relative
differences between groups unaffected and allow to proceed with the
testing procedure. This procedure has been described in the following
Scanpy issue.
# Example of code to add a layer of shifted normalized values to the adata object
# Calculate minimum value per marker
mins = pg_data_combined.to_df("dsb").min()
# If minimum value is positive, set to zero
mins[mins > 0] = 0
# Add layer to use in Differential Abundance test
pg_data_combined.layers["shifted_dsb"] = pg_data_combined.to_df("dsb").sub(mins)
Finally, let’s save the object with the dsb normalized counts for the next tutorial on Data Integration. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
# Add the normalized AnnData to the PXL object
pg_data_combined_pxl_object.adata = pg_data_combined
pg_data_combined_pxl_object.save(DATA_DIR/ "combined_data_normalized.pxl", force_overwrite=True)