Abundance Normalization
As with any high-throughput techniques, normalization and denoising of Proximity Network abundance data can be a useful tool to make samples and datasets more comparable to each other as well as help with the biological interpretation of your data. Therefore in this tutorial we will introduce and discuss DSB, our method of choice for the normalization and denoising of mixed cell population data as well as a more general approach in the Centered-Log-Ratio normalization(CLR).
We will begin with a short introduction to DSB and CLR and the algorithms behind them. Then, you will learn how to apply these techniques to a Proximity Network Assay dataset. We will finish with some words of caution when applying these -or any- type of normalization.
After completing this tutorial, you should be able to:
- Understand the motivation for and the algorithm behind DSB and CLR normalizations
- Apply these normalization methods to a Proximity Network Assay dataset
- Evaluate the effect of normalization on marker abundance
- Understand the potential pitfalls of normalization
Why denoise and normalize?
By their very nature antibody-based assays can be affected by variable amounts of nonspecific binding. In leukocytes, this nonspecific binding is partly a result of nonspecific binding of antibodies to Fc-receptors present on the surface of many white blood cell types. These background levels of antibody binding (or “noise”) can fluctuate between samples and cells as a result of sample type and quality, preprocessing conditions, antibody isotype and other issues. Ultimately, this can lead to difficulties distinguishing negative and positive subpopulations of cells expressing a certain marker, complicating cell type annotation and differential abundance analysis.
Therefore, during the sample preprocessing of Proximity Network Assay samples, several blocking steps are included to reduce the possibility of nonspecific binding. In addition, computational methods such as the one presented here can, to some extent, be utilized to reduce the bias of nonspecific binding.
The DSB algorithm
The DSB method was originally developed to denoise and normalize antibody-based assays on droplet-based platforms such as CITE-seq. It is based on the detection of background marker levels in so-called “empty” droplets (droplets encapsulating biological material but not a corresponding cell). Proximity Network Assay does not need cell compartmentalization and therefore does not produce empty droplets as such. However, it has been shown that the background abundance levels of a marker can be inferred from its abundance distribution across all cells in a mixed cell population sample (see DSB paper). By calculating the mean abundance in this background population per marker and subtracting these values from the marker’s abundance, each marker’s background population becomes effectively zero centered. In a subsequent step, the isotype control markers and a cell-specific background mean are combined in a “noise component” that is regressed out of the zero centered data corrected for background nonspecific levels.
To summarize these different steps:
- Transform the count data using transformation.
- Fit a finite mixed gaussian model with two components (i.e. two normal distributions) to each marker across all cells. And as a background correction step, subtract the mean of the first component on the logged data. Note: if no two component fit was possible, background correction is skipped.
- Next, fit a finite mixed gaussian model with two components to each cell across all markers. The mean of the first component is the cell-wise background mean. This value is combined with the corrected counts of the isotype controls to form a “noise matrix”. The first principal component of this noise matrix is regressed out.
- The resulting residuals are returned as normalized and denoised values.
The CLR algorithm
When dealing with pure cell populations such as cell lines or FACS-sorted populations, DSB can sometimes result in unexpected values as it can not reliably detect positive and negative populations for certain markers in these uniform samples. Therefore, we also include the CLR method as an alternative normalization approach. This method was originally developed for CITE-seq data (Stoeckius et al., 2017) but quickly gained traction in other antibody-based assays. It relies on the assumption that the geometric mean of marker counts per cell is a good reference for normalizing the counts. Here we implemented it as the of the marker counts minus the geometric mean of the counts per cell.
Setup
To run this tutorial, several python libraries and functions need to be imported.
import numpy as np
import pandas as pd
import scanpy as sc
from pathlib import Path
from pixelator import read_pna as read
from pixelator.mpx.plot import density_scatter_plot
from pixelator.common.statistics import clr_transformation, dsb_normalize
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
DATA_DIR = Path("<path to the directory to save datasets to>")
Load Data
To illustrate the normalization procedure we continue with the quality checked and filtered cells we identified in the previous tutorial. We recall that the filtered cells come from combination of a resting PBMC sample and a PHA stimulated PBMC sample.
pg_data_combined_pxl_object = read([
DATA_DIR/ "PNA062_unstim_PBMCs_1000cells_S02_S2.layout.pxl",
DATA_DIR/ "PNA062_PHA_PBMCs_1000cells_S04_S4.layout.pxl",
])
filtered_cells = list(pd.read_csv(DATA_DIR / "qc_filtered_cells.csv")["component"])
pg_data_combined_pxl_object = pg_data_combined_pxl_object.filter(components=filtered_cells)
# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined = pg_data_combined_pxl_object.adata()
Normalizations
DSB and CLR procedures have been wrapped into single functions in
pixelator, respectively called dsb_normalize
and
clr_transformation
. Since DSB uses the antibody isotype controls to
perform the normalization, we need to specify these to dsb_normalize
.
To facilitate later comparisons, we also construct a layer with
transformed data.
# Add dsb values as a layer to the object
pg_data_combined.layers["dsb"] = dsb_normalize(
pg_data_combined.to_df(),
isotype_controls = ["mIgG1", "mIgG2a", "mIgG2b"]
)
# Add clr values as a layer to the object
pg_data_combined.layers["clr"] = clr_transformation(
pg_data_combined.to_df(), axis=1, non_negative=False
)
# Create log1p layer for later comparisons
pg_data_combined.layers['log1p'] = np.log1p(pg_data_combined.to_df())
Evaluate normalization
After having run the normalization step it is a good idea to visualize the overall distribution of some markers of interest with different normalizations. This will allow you to assess in more detail how each procedure has modified the abundance levels of markers. Let’s have a look at CD19 as an example.
def marker_ridgeplot(adata, layer, marker, title):
marker_data = adata.to_df(layer)["CD19"]
conditions = adata.obs['condition']
plot_data_log = pd.DataFrame({'CD19' : marker_data, 'Conditions' : conditions})
# Initialize the FacetGrid object
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0), 'axes.linewidth':2})
pal = sns.color_palette("Set2", 12)
g = sns.FacetGrid(plot_data_log, row="Conditions", hue="Conditions", aspect=9, height=1.2, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "CD19", bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, fontweight="bold", color=color,
ha="left", va="center", transform=ax.transAxes)
g.map(label, "CD19")
# Set the subplots to overlap
g.figure.subplots_adjust(hspace=-.5)
# Clean up plot
g.set_titles("")
g.set(yticks=[], ylabel="")
g.despine(bottom=True, left=True)
g.fig.suptitle(title)
plt.tight_layout()
pg_data_combined.obs["condition"] = "Resting"
pg_data_combined.obs.loc[
pg_data_combined.obs["sample"].str.contains("PHA"),
"condition"
] = "PHA stimulated"
# log1p ridgeplot data
g = marker_ridgeplot(pg_data_combined, layer="log1p", marker="CD19", title='Log1p Transformation')
# DSB ridgeplot
marker_ridgeplot(pg_data_combined, layer="dsb", marker="CD19", title='DSB Transformation')
# CLR ridgeplot
marker_ridgeplot(pg_data_combined, layer="clr", marker="CD19", title='CLR Transformation')
CD19 is a cell type marker that is specific for B cells. The population of B cells in these samples should be relatively small and the majority of cells should therefore not express CD19 at a meaningful level. However, as you can see in the upper ridge plot prior to normalization, the majority of cells do express low levels of CD19 as a result of nonspecific binding. In addition, there is clearly some variability between the levels of this nonspecific background noise. As you can see in the middle ridge plot, the DSB normalization centers this negative background population around 0, while retaining a small CD19+ population of B cells. This normalization and denoising procedure will thus make it easier to annotate cell clusters and gate for cell population as we will show in more detail in the next section. Similarly in the lower ridge plot, with CLR transformation, the background population has mostly moved over to the negative region.
Zero-centering background population simplifies downstream analysis
One of the benefits of this type of normalization is that it facilitates detecting and annotating specific cell populations in heterogeneous cell type samples. As an example we can create a UMAP of the two samples and visualize some well-known cell-type specific markers.
sc.pp.highly_variable_genes(
pg_data_combined, flavor="seurat_v3", n_top_genes=50
)
pg_data_combined.obsm["pca"] = sc.tl.pca(
pg_data_combined.layers["dsb"], random_state=42
)
sc.pp.neighbors(
pg_data_combined,
n_neighbors=30,
n_pcs=10,
use_rep="pca",
metric="cosine",
)
sc.tl.umap(
pg_data_combined,
min_dist=0.3,
)
markers_of_interest = ["CD4", "CD8", "CD19", "CD16"]
DSB normalized abundance across clusters
sc.pl.umap(pg_data_combined, color=markers_of_interest, ncols=4, layer = "dsb")
Log1p abundance across clusters
sc.pl.umap(pg_data_combined, color=markers_of_interest, ncols=4, layer = "log1p")
Here, we assess the distribution of abundance of a few cell specific markers: CD4 is positive for T helper cells and monocytes, CD8 for cytotoxic T cells, CD19 for B cells, and CD16 mainly for monocytes. We can see that the protein abundance is restricted to one or few clusters for the DSB transformed counts in the top row, while the log transformed counts in the bottom row show an “ambient” level across most clusters. As you can appreciate from these plots the DSB normalized values present a much clearer distinction between positive and negative populations than the transformed values. Besides these benefits for visualization and annotation, DSB and CLR-transformed values allow for applying more uniform gating strategies across markers.
fig, _ = density_scatter_plot(pg_data_combined, "CD3e", "CD4", layer="log1p")
fig.axes[0].set_title("Log1p transformation")
fig, _ = density_scatter_plot(pg_data_combined, "CD3e", "CD4", layer="dsb")
fig.axes[0].set_title("DSB transformation")
fig, _ = density_scatter_plot(pg_data_combined, "CD3e", "CD4", layer="clr")
fig.axes[0].set_title("CLR transformation")
Text(0.5, 1.0, 'CLR transformation')
Remarks
As any normalization method, DSB has some potential pitfalls that can be avoided by a thorough understanding of your experimental design and analysis plan. Here we list a few recommendations and good practices:
- The DSB normalization method works optimally for markers that have a positive and a negative population in the experimental setup; i.e. in datasets consisting of different cell populations (such as PBMCs). This maximizes the chances that the GMM fitting will reliably detect the negative background population. Consequently, markers such as HLA-ABC that have a relatively consistent high expression across all cell types might display some unexpected behaviour (i.e. these can show a low DSB value despite being highly expressed in all cells). If your dataset mainly consists of very similar cells, such as for example pure cell lines, it might be prudent to compare the DSB output to log- or CLR-normalization.
- Do not apply normalization blindly and always be aware of the effect of a transformation when using normalized marker counts for downstream analysis. A good practice is to visualize the distribution of markers you deem important in your analysis before and after normalization with, for example, violin plots. Does this correspond to what you expect and if not, ask yourself why could this be the case?
Make sure the normalized values produced by DSB and/or CLR are
fitting for the test you would like to perform! Some downstream
differential abundance tests demand strictly positive values as input.
For example, the rank_genes_group
function from Scanpy requires that
the data provided to them is strictly non-negative and do not provide
Log Fold Changes for markers that are on average negative in one or more
groups. If you would like to use these function as part of your
downstream analysis, we suggest to shift the abundance of markers with
negative values to strictly positive. This would leave the relative
differences between groups unaffected and allow to proceed with the
testing procedure. This procedure has been described in the following
Scanpy issue.
# Example of code to add a layer of shifted normalized values to the adata object
# Calculate minimum value per marker
mins = pg_data_combined.to_df("dsb").min()
# If minimum value is positive, set to zero
mins[mins > 0] = 0
# Add layer to use in Differential Abundance test
pg_data_combined.layers["shifted_dsb"] = pg_data_combined.to_df("dsb").sub(mins)