PNA Quality Control
This tutorial details the first steps of data analysis to quality control and clean up the PNA data output from * *Pixelator**.
After completing this tutorial, you should be able to:
- Use the molecule rank plot to manually set cell calling thresholds and filter out low-quality cells.
- Visualize sample-level QC metrics such as the number of cells.
- Check distributions of quality metrics, protein counts (molecules) and read depth.
- Identify and remove cell outliers using the antibody count distribution metric Tau.
Setup
First, we will load packages necessary for downstream processing.
library(pixelatorR)
library(dplyr)
library(stringr)
library(SeuratObject)
library(here)
library(ggplot2)
library(tidyr)
library(tibble)
Load data
We will begin by loading the merged dataset we created the end of the previous tutorial Data handling. As described previously, this merged object contains two samples: one with resting PBMCs and one with PHA-stimulated PBMCs. Note that if you continue directly from the previous tutorial, you don’t need to load the data again.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data.rds"))
pg_data_combined
An object of class Seurat
158 features across 2137 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts
Cell calling: Molecule rank plot
Here, we use the molecule rank plot to perform a more fine-tuned quality control of the called cells and make a manual adjustment to the number of cells that were called by * *Pixelator** if needed. Pixelator automatically determines a threshold to exclude small components and if this threshold is lower than 8000 molecules, Pixelator will adjust the threshold to 8000. If you expect find small cells in you sample, such as platelets, might want to adjust this threshold Pixelator to a lower value.
Here, the goal is to remove small components which might not represent whole cells.
# Plot the molecule rank plot
molecule_rank_plot <- MoleculeRankPlot(pg_data_combined, group_by = "condition")
molecule_rank_plot
It looks like components are declining rapidly in size at around 10000 UMIs, so we will set a manual cutoff at that point, represented by a dashed line in the plot below.
cutoff <- 10000
molecule_rank_plot +
geom_hline(yintercept = cutoff, linetype = "dashed")
# Filter cells to have at least 10000 molecules
pg_data_combined <- pg_data_combined %>%
subset(n_umi >= cutoff)
pg_data_combined
An object of class Seurat
158 features across 2085 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts
Now we can inspect the number of called cells per condition after filtering out small cells. The observed numbers of cells are close to the expectation of 1000 cells per condition.
CellCountPlot(pg_data_combined, color_by = "condition")
Other QC metric can also be useful to explore. In the plot below we
check the distribution of unique proteins ( n_antibodies
), number of
protein molecules (n_umi
) and number of reads per component
(reads_in_component
) across the two conditions.
pg_data_combined[[]] %>%
select(condition, n_umi, reads_in_component, n_antibodies) %>%
pivot_longer(cols = c("reads_in_component", "n_umi", "n_antibodies"), names_to = "metric", values_to = "value") %>%
ggplot(aes(condition, value, fill = condition)) +
geom_violin(draw_quantiles = 0.5) +
facet_wrap(~ metric, scales = "free_y") +
scale_y_log10() +
theme_bw() +
labs(y = "")
Antibody count distribution outlier removal
Below, we have plotted the number of molecules against Tau, and colored each cell by Pixelator’s classification of Tau (low, normal, or high). In short, Tau measures the heterogeneity of protein composition which can be used to detect various types of outliers. Here we see some cells with low a Tau score that might be antibody aggregates. Cells with high Tau have low specificity and could for instance include a more heterogeneous composition of proteins than we would expect from a normal cell. As these outliers likely represent technical artefacts, we will remove them from the dataset.
# Custom plot to visualize the Tau distribution
TauPlot(pg_data_combined, group_by = "condition")
# Only keep the components where tau_type is normal
pg_data_combined <- pg_data_combined %>%
subset(tau_type == "normal")
pg_data_combined
An object of class Seurat
158 features across 2082 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts
Isotype control levels
PNA data includes three isotype control antibodies which are used to measure the background signal (non-specific binding). The abundance levels for these isotype controls should be low in comparison to the other antibodies. Elevated levels can therefore be used to identify low-quality cells that should be removed.
Based on the distribution of the isotype control levels plotted below, a cutoff at 0.3% seems to get rid of a handful of outliers.
isotype_controls = c("mIgG1", "mIgG2a", "mIgG2b")
isotype_control_levels <- LayerData(pg_data_combined, layer = "counts")[isotype_controls, ] %>% colSums()
umi_levels <- LayerData(pg_data_combined, layer = "counts") %>% colSums()
pg_data_combined$isotype_control_fraction <- isotype_control_levels / umi_levels
isotype_control_fraction_cutoff <- 0.003
pg_data_combined[[]] %>%
select(condition, isotype_control_fraction) %>%
ggplot(aes(condition, isotype_control_fraction, fill = condition)) +
geom_violin(draw_quantiles = 0.5) +
geom_jitter(size = 0.4, alpha = 0.5) +
theme_bw() +
labs(y = "Fraction of isotype control counts") +
scale_y_continuous(labels = scales::percent) +
geom_hline(yintercept = isotype_control_fraction_cutoff, linetype = "dashed")
Now let’s apply this cutoff to remove cells with elevated isotype control levels from our dataset:
pg_data_combined <- pg_data_combined %>%
subset(isotype_control_fraction < isotype_control_fraction_cutoff)
pg_data_combined
An object of class Seurat
158 features across 2066 samples within 1 assay
Active assay: PNA (158 features, 158 variable features)
1 layer present: counts
Let’s check again how many cells we have left after filtering:
CellCountPlot(pg_data_combined, color_by = "condition")
Going through the above steps we have performed a basic quality control to filter out small cells, Tau outliers and cells with elevated isotype control levels. These filters take us quite far in the QC process, and we encourage users to adjust metric filtering thresholds based on the metric distributions. Depending on the sample type, it can be a good idea to explore other metrics as well. With a clean, high-quality PNA dataset in hand, we are ready to proceed to the next step, in which we will be normalizing and denoising protein abundance data to facilitate cell type annotation.
If you want to pause here and save the filtered data for the next tutorial, you can do so with the following code.
# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_filtered.rds"))