MPX Quality Control

This tutorial covers the first steps of data analysis: quality control and cleanup of the MPX data output from Pixelator.

After completing this tutorial, you should be able to:

  • Use the edge rank plot to manually set cell calling thresholds and filter low-quality cells.
  • Aggregate data across samples and visualize sample-level QC metrics like number of cells.
  • Check distributions of quality metrics like molecule counts and graph connectivity.
  • Identify and remove cell outliers using the antibody count distribution metric Tau.

Setup

First, we will load packages necessary for downstream processing.

from pathlib import Path
from pixelator import read

import seaborn as sns

sns.set_style("whitegrid")

Load data

In this tutorial we continue where we left off in the previous tutorial, Data handling, where we combined MPX data from two samples. Here we recreate that combined object by reading both samples and aggregating them.

from pixelator import simple_aggregate

DATA_DIR = Path.cwd().parents[3] / "data"

paths = [
    DATA_DIR / "Sample01_human_pbmcs_unstimulated.dataset.pxl",
    DATA_DIR / "Sample02_human_pbmcs_unstimulated.dataset.pxl",
]

pg_data_combined_pxl_object = simple_aggregate(
    ["sample1", "sample2"], [read(path) for path in paths]
)

# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined = pg_data_combined_pxl_object.adata

Cell calling: Edge rank plot

Here, we use the edge rank plot as an additional quality control of the called cells and manually adjust the number of cells called by Pixelator. This removes components that deviate from the component size distribution and might not represent whole cells.

import matplotlib.pyplot as plt

edge_rank_df = pg_data_combined.obs[["sample", "edges"]].copy()
edge_rank_df["rank"] = edge_rank_df.groupby(["sample"])["edges"].rank(
    ascending=False, method="first"
)

edgerank_plot = (
    sns.relplot(data=edge_rank_df, x="rank", y="edges", hue="sample", aspect=1.6)
    .set(xscale="log", yscale="log")
    .set_xlabels("Component rank (by number of edges)")
    .set_ylabels("Number of edges")
)
edgerank_plot

It looks like components are declining rapidly in size at around 5000 edges, and we will thus set a manual cutoff at that point, represented by a dashed line.

edgerank_plot = (
    sns.relplot(data=edge_rank_df, x="rank", y="edges", hue="sample", aspect=1.6)
    .set(xscale="log", yscale="log")
    .set_xlabels("Component rank (by number of edges)")
    .set_ylabels("Number of edges")
)
edgerank_plot.fig.axes[0].axhline(5000, linestyle="--")

# Filter cells to have at least 5000 edges
pg_data_combined = pg_data_combined[pg_data_combined.obs["edges"] >= 5000]

edgerank_plot
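
Before committing to a cutoff, it can help to tabulate how many components each sample would retain at a few candidate thresholds. The following is a minimal sketch on a small synthetic table: the column names mirror those in pg_data_combined.obs, but the values are invented for illustration, and cells_retained is a hypothetical helper, not part of Pixelator.

```python
import pandas as pd

# Synthetic stand-in for pg_data_combined.obs; the values are made up.
obs = pd.DataFrame(
    {
        "sample": ["sample1"] * 4 + ["sample2"] * 4,
        "edges": [12000, 8000, 5000, 300, 20000, 4999, 6000, 150],
    }
)


def cells_retained(obs_df, thresholds):
    """Count components per sample that have at least `threshold` edges."""
    counts = {
        threshold: obs_df.loc[obs_df["edges"] >= threshold].groupby("sample").size()
        for threshold in thresholds
    }
    # Rows are samples, columns are candidate thresholds;
    # a sample with no passing components is reported as 0.
    return pd.DataFrame(counts).fillna(0).astype(int)


retained = cells_retained(obs, thresholds=[1000, 5000, 10000])
print(retained)
```

With the real data, the same function could be called on pg_data_combined.obs before the filtering step to see how sensitive the cell count is to the choice of cutoff.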

Here, we plot the number of called cells per sample.

cells_per_sample_df = (
    pg_data_combined.obs.groupby("sample").size().to_frame(name="size").reset_index()
)
number_of_cells_plot = (
    sns.catplot(data=cells_per_sample_df, x="sample", y="size", kind="bar")
    .set_xlabels("Sample")
    .set_ylabels("Number of cells")
)
number_of_cells_plot.fig.subplots_adjust(top=0.9)
number_of_cells_plot.fig.suptitle("Number of called cells")
# Annotate each bar with its cell count
for container in number_of_cells_plot.ax.containers:
    number_of_cells_plot.ax.bar_label(container)

Here, we visualize the distribution of some metrics among components.

metrics_per_sample_df = pg_data_combined.obs[
    ["sample", "edges", "umi_per_upia", "mean_reads"]
].melt(id_vars=["sample"])
metrics_plot = sns.catplot(
    data=metrics_per_sample_df,
    x="sample",
    col="variable",
    y="value",
    kind="violin",
    sharex=True,
    sharey=False,
    margin_titles=True,
).set_titles(col_template="{col_name}")

metrics_plot
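
Alongside the violin plots, a numeric summary makes it easier to compare samples. The sketch below computes the median, minimum and maximum of the same three metrics per sample. It runs on a small synthetic table whose values are invented; only the column names are taken from the tutorial.

```python
import pandas as pd

# Synthetic stand-in for pg_data_combined.obs; the values are made up.
obs = pd.DataFrame(
    {
        "sample": ["sample1"] * 3 + ["sample2"] * 3,
        "edges": [6000, 8000, 10000, 5000, 7000, 30000],
        "umi_per_upia": [2.0, 3.0, 4.0, 2.5, 3.5, 4.5],
        "mean_reads": [4.0, 5.0, 6.0, 3.0, 4.0, 5.0],
    }
)

metrics = ["edges", "umi_per_upia", "mean_reads"]

# One row per sample, one (metric, statistic) column pair per summary value.
summary = obs.groupby("sample")[metrics].agg(["median", "min", "max"])
print(summary)
```

The median is used rather than the mean because metrics such as edges tend to have long upper tails, so a few very large components would otherwise dominate the summary.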

Antibody count distribution outlier removal

Here, we plot umi_per_upia against Tau and color each component by Pixelator’s classification of Tau (low, normal, or high). It looks like Pixelator has accurately picked out a single outlier, which might be an antibody aggregate or a component with low specificity that binds many more types of antibodies than we would expect from a normal cell. As this outlier is likely a technical artefact, we will remove it from the analysis.

tau_metrics_df = pg_data_combined.obs[["sample", "tau", "umi_per_upia", "tau_type"]]

tau_plot = (
    sns.relplot(
        data=tau_metrics_df,
        x="tau",
        y="umi_per_upia",
        hue="tau_type",
        col="sample",
    )
    .set_titles(col_template="{col_name}")
    .set_xlabels("Marker specificity (Tau)")
    .set_ylabels("Pixel content (UMI/UPIA)")
)
tau_plot

# Only keep the components where tau_type is normal
pg_data_combined = pg_data_combined[pg_data_combined.obs["tau_type"] == "normal"]
pg_data_combined
View of AnnData object with n_obs × n_vars = 863 × 80
obs: 'vertices', 'edges', 'antibodies', 'upia', 'upib', 'umi', 'reads', 'mean_reads', 'median_reads', 'mean_upia_degree', 'median_upia_degree', 'mean_umi_per_upia', 'median_umi_per_upia', 'umi_per_upia', 'upia_per_upib', 'leiden', 'tau_type', 'tau', 'sample'
var: 'antibody_count', 'components', 'antibody_pct', 'nuclear', 'control'
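
For intuition about what Tau measures, the sketch below computes the Tau specificity index in its commonly used form: the sum of 1 - x_i / max(x) over all markers, divided by n - 1. It is an assumption that the tau column produced by Pixelator follows this definition, and any preprocessing Pixelator applies to the counts is not reproduced here. The function tau_specificity is a hypothetical helper, not part of the Pixelator API.

```python
import numpy as np


def tau_specificity(counts):
    """Tau index for one component's antibody counts (assumed standard form).

    Returns a value near 0 when counts are spread evenly across markers
    and near 1 when a single marker dominates.
    """
    counts = np.asarray(counts, dtype=float)
    if counts.size < 2:
        raise ValueError("Tau needs at least two markers")
    max_count = counts.max()
    if max_count == 0:
        # Undefined when no antibody was detected at all.
        return float("nan")
    return float(np.sum(1.0 - counts / max_count) / (counts.size - 1))


# Evenly spread counts: no specificity.
print(tau_specificity([10, 10, 10, 10]))
# A single dominant marker: maximal specificity.
print(tau_specificity([100, 0, 0, 0]))
# An intermediate case.
print(tau_specificity([4, 2, 2]))
```

Under this definition, a component that binds many different antibodies at similar levels, such as the outlier described above, would score a low Tau.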

In the steps above, we performed essential quality control by filtering out low-quality cells and removing outliers. With a clean, high-quality MPX dataset in hand, we are ready to proceed to the next step, in which we will use protein abundance to annotate different cell populations.