Skip to main content

Cell Annotation

This tutorial demonstrates the annotation of cells from MPX data. We will normalize and scale the count data, perform dimensionality reduction, clustering, and perform a differential abundance analysis to find the cell identities of each cluster.

After completing this tutorial, you should be able to:

  • Preprocess MPX data by removing control markers, normalizing with CLR, finding variable features, and scaling.
  • Perform dimensionality reduction with PCA and UMAP and visualize protein markers projected onto the reduced space.
  • Cluster cells with a graph-based Louvain algorithm to identify groups of cells by state.
  • Identify cluster marker proteins through differential abundance analysis.
  • Visually explore marker differences across clusters using volcano plots and heatmaps.
  • Manually annotate cell type identities based on known marker expression patterns.
  • Summarize cell type frequencies across experimental groups.

Setup

from matplotlib.pyplot import rc_context
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pandas as pd
from pixelator import read
from pixelator.statistics import clr_transformation
import scanpy as sc
import seaborn as sns

sns.set_style("whitegrid")

Load data

Here, we load the same PBMC dataset used in previous two tutorials Data handling and Quality Control. We also apply the filtering steps we performed in Quality Control.

from pixelator import simple_aggregate

DATA_DIR = Path.cwd().parents[3] / "data"

paths = [
DATA_DIR / "Sample01_human_pbmcs_unstimulated.dataset.pxl",
DATA_DIR / "Sample02_human_pbmcs_unstimulated.dataset.pxl",
]

pg_data_combined_pxl_object = simple_aggregate(
["sample1", "sample2"], [read(path) for path in paths]
)

# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined = pg_data_combined_pxl_object.adata

# Filter cells to have at least 5000 edges
pg_data_combined = pg_data_combined[pg_data_combined.obs["edges"] >= 5000]
# Only keep the components where tau_type is normal
pg_data_combined = pg_data_combined[pg_data_combined.obs["tau_type"] == "normal"]

Data Processing

To begin with, we will remove control antibodies from our dataset, as we don’t want these to inform downstream steps such as dimensionality reduction and clustering.

control_markers = ["ACTB", "mIgG1", "mIgG2a", "mIgG2b"]
pg_data_combined_processed = pg_data_combined[
:, ~pg_data_combined.var.index.isin(control_markers)
]

Next, we normalize our count data using the centered log ratio (CLR) transformation per each cell. We can visualize this transformation by examining a histogram showing the distribution of antibody counts and CLR transformed counts for CD3E.

pg_data_combined_processed.layers["clr"] = clr_transformation(
pg_data_combined_processed.to_df(), axis=1
)

cd3_counts_df = pg_data_combined_processed[:, ["CD3E"]].to_df().melt()
cd3_clr_df = pg_data_combined_processed[:, ["CD3E"]].to_df("clr").melt()

fig, ax = plt.subplots(1, 2)
pre_clr_plot = sns.histplot(
data=cd3_counts_df, x="value", bins=40, ax=ax[0]
).set_xlabel("Counts")
post_clr_plot = sns.histplot(data=cd3_clr_df, x="value", bins=40, ax=ax[1]).set_xlabel(
"CLR transformed counts"
)
fig

Next, we identify the most variable markers in the dataset. This is done to select a set of markers that are most informative to distinguishing cell identities in our dataset. This marker set is used for downstream analysis such as dimensionality reduction and clustering. We then visualize the result in a variable feature plot, which the displays the variance of each marker in relation to its average abundance.

sc.pp.highly_variable_genes(
pg_data_combined_processed, flavor="seurat_v3", n_top_genes=50
)

markers_df = pg_data_combined_processed.var
highly_variable_markers_plot = (
sns.relplot(
markers_df,
x="means",
y="variances_norm",
hue="highly_variable",
)
.set(xscale="log")
.set_xlabels("Average Expression")
.set_ylabels("Standardized variance")
)
highly_variable_markers_plot._legend.set_title("Higly variable marker")

for text, row in markers_df[markers_df["highly_variable"]].iterrows():
x, y = row[["means", "variances_norm"]]
label = highly_variable_markers_plot.ax.text(
x + 0.05, y, text, horizontalalignment="left", fontsize="xx-small"
)

highly_variable_markers_plot

Finally, we scale and center the data.

pg_data_combined_processed.layers["scaled_clr"] = sc.pp.scale(
pg_data_combined_processed, zero_center=True, layer="clr", copy=True
).layers["clr"]

Dimensionality reduction

Now when we have a dataset that is normalized and scaled, we can perform dimensionality reduction. Here we perform both a Principal Component Analysis (PCA) and a Uniform Manifold Approximation and Projection (UMAP).

pg_data_combined_processed.obsm["pca"] = sc.tl.pca(
pg_data_combined_processed.layers["scaled_clr"], random_state=42
)
sc.pl.pca(pg_data_combined_processed)

sc.pp.neighbors(
pg_data_combined_processed,
n_neighbors=30,
n_pcs=10,
use_rep="pca",
metric="cosine",
)
sc.tl.umap(
pg_data_combined_processed,
min_dist=0.3,
)
sc.pl.umap(pg_data_combined_processed)

We can now use these reductions and project abundance of markers or meta data upon the cells.

markers_of_interest = ["CD3E", "CD4", "CD8", "CD19", "CD20", "CD11c"]
metrics_of_interest = ["edges", "reads", "mean_umi_per_upia"]

sc.pl.pca(pg_data_combined_processed, color=markers_of_interest, ncols=2, layer = "clr")

sc.pl.umap(pg_data_combined_processed, color=markers_of_interest, ncols=2, layer = "clr")

sc.pl.umap(pg_data_combined_processed, color=metrics_of_interest, ncols=2)

Clustering

To distinguish cell identities we perform a modularity based clustering using the Louvain algorithm. First, we create a Shared Nearest Neighbor graph, which we pass to Louvain.

sc.tl.louvain(pg_data_combined_processed, resolution=1)
sc.pl.umap(pg_data_combined_processed, color=["louvain"])

Differential protein abundance

To annotate the resulting clusters, we use a Wilcoxon Rank sum test to gauge which markers are differentially abundant between the clusters. This function calculates the average log2 fold change of each cluster in comparison with all other cells.

sc.tl.rank_genes_groups(
pg_data_combined_processed, "louvain", method="wilcoxon", layer="clr"
)
diff_exp_df = sc.get.rank_genes_groups_df(pg_data_combined_processed, group=None)
diff_exp_df["-log10(adjusted p-value)"] = -np.log10(diff_exp_df["pvals_adj"])
diff_exp_df["Significant"] = diff_exp_df["pvals_adj"] < 0.01
diff_exp_df
    group   names  ...  -log10(adjusted p-value)  Significant
0 0 CD11a ... 51.974477 True
1 0 CD27 ... 50.325078 True
2 0 CD127 ... 38.883804 True
3 0 CD197 ... 38.179463 True
4 0 CD4 ... 34.331176 True
.. ... ... ... ... ...
679 8 CD102 ... 2.872029 True
680 8 CD40 ... 2.899830 True
681 8 CD45RA ... 2.978146 True
682 8 CD37 ... 3.518682 True
683 8 CD35 ... 4.768197 True

[684 rows x 8 columns]

The results can be visualized using a volcano plot to get an overview of the test results.

sns.relplot(
diff_exp_df,
x="logfoldchanges",
y="-log10(adjusted p-value)",
hue="Significant",
col="group",
col_wrap=3,
height=4,
s=8,
).set_xlabels("log2 fold change");

To help us compare the abundance levels of markers across clusters, we will visualize the average fold change of each cluster in a heatmap.

diff_exp_df
    group   names  ...  -log10(adjusted p-value)  Significant
0 0 CD11a ... 51.974477 True
1 0 CD27 ... 50.325078 True
2 0 CD127 ... 38.883804 True
3 0 CD197 ... 38.179463 True
4 0 CD4 ... 34.331176 True
.. ... ... ... ... ...
679 8 CD102 ... 2.872029 True
680 8 CD40 ... 2.899830 True
681 8 CD45RA ... 2.978146 True
682 8 CD37 ... 3.518682 True
683 8 CD35 ... 4.768197 True

[684 rows x 8 columns]
df = diff_exp_df.pivot(index=["names"], columns=["group"], values=["logfoldchanges"])
# We only want to plot markers that have a log2 fold change of at least two, and that
# are significant in at least one cluster comparison
markers_for_heatmap = set(
diff_exp_df[
(np.abs(diff_exp_df["logfoldchanges"]) > 2) & diff_exp_df["Significant"]
]["names"]
)

df = df[df.index.isin(markers_for_heatmap)]

df.columns = [cluster for _, cluster in df.columns]
sns.clustermap(df, yticklabels=True, linewidths=0.1, cmap="vlag");

Now we can use the relative abundance levels of cells to manually assign cell identities to clusters.

cell_annotations = {
"0": "CD4 T cells",
"1": "B cells",
"2": "CD4 T cells",
"3": "NK cells",
"4": "Monocytes",
"5": "CD4 T cells",
"6": "CD8 T cells",
"7": "MAIT cells",
"8": "CD4 T cells"
}
pg_data_combined_processed.obs["cell_type"] = pg_data_combined_processed.obs[
"louvain"
].map(cell_annotations)

with rc_context({"figure.autolayout": True}):
sc.pl.umap(pg_data_combined_processed, color=["cell_type"])

Now, let’s take a look at the frequency of each of the cell identities within our data.

cell_counts_df = (
pg_data_combined_processed.obs.groupby(["sample", "cell_type"])
.size()
.to_frame(name="count")
.reset_index()
)
sns.catplot(
cell_counts_df, kind="bar", x="cell_type", y="count", hue="sample", aspect=1.6
).set_xlabels("Cell type").set_ylabels("Count");

In this tutorial we have leraned how to normalize MPX data, annotate different cell populations and use protein abundance levels to compare across populations and between experimental samples. However, abundance is only one facet of the rich multidimensional data generated by MPX experiments. We are now ready to transition into the spatial realm, exploring the arrangement and interactions of proteins on the cell surface. In the upcoming tutorials, we will leverage the graph nature and spatial metrics within MPX data to unravel the spatial biology of cells, surveying protein polarization and colocalization patterns and how these contribute to cell identity, signaling, and function. Abundance has illuminated cell types; spatiality will illuminate the molecular geography that defines them.