Cell Annotation
This tutorial demonstrates the annotation of cells from MPX data. We will normalize and scale the count data, perform dimensionality reduction, clustering, and perform a differential abundance analysis to find the cell identities of each cluster.
After completing this tutorial, you should be able to:
- Cluster cells with a graph-based Louvain algorithm to identify groups of cells by state.
- Identify cluster marker proteins through differential abundance analysis.
- Visually explore marker differences across clusters using volcano plots and heatmaps.
- Manually annotate cell type identities based on known marker expression patterns.
- Summarize cell type frequencies across experimental groups.
Setup
from matplotlib.pyplot import rc_context
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pandas as pd
from pixelator import read
from pixelator.statistics import clr_transformation
import scanpy as sc
import seaborn as sns
sns.set_style("whitegrid")
DATA_DIR = Path("<path to the directory to save datasets to>")
Load data
To illustrate the annotation workflow we continue with the integrated object we created in the previous tutorial. As described before, this merged and harmonized object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.
pg_data_combined_pxl_object = read(DATA_DIR/ "combined_data_integrated.pxl")
# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined_processed = pg_data_combined_pxl_object.adata
Clustering
To distinguish cell identities we perform a modularity based clustering using the Louvain algorithm. First, we create a Shared Nearest Neighbor graph, which we pass to Louvain.
sc.tl.louvain(pg_data_combined_processed, resolution=0.2, neighbors_key="harmony_neighbors")
sc.pl.umap(pg_data_combined_processed, color=["louvain"])
Differential protein abundance
To annotate the resulting clusters, we use a Wilcoxon Rank sum test to gauge which markers are differentially abundant between the clusters. This function calculates the average log2 fold change of each cluster in comparison with all other cells.
# Shift the dsb values as remarked in the normalization tutorial
# Calculate minimum value per marker
mins = pg_data_combined_processed.to_df("dsb").min()
# If minimum value is positive, set to zero
mins[mins > 0] = 0
# Add layer to use in Differential Abundance test
pg_data_combined_processed.layers["shifted_dsb"] = pg_data_combined_processed.to_df("dsb").sub(mins)
sc.tl.rank_genes_groups(
pg_data_combined_processed, "louvain", method="wilcoxon", layer="shifted_dsb"
)
diff_exp_df = sc.get.rank_genes_groups_df(pg_data_combined_processed, group=None)
diff_exp_df["-log10(adjusted p-value)"] = -np.log10(diff_exp_df["pvals_adj"])
diff_exp_df["Significant"] = diff_exp_df["pvals_adj"] < 0.01
diff_exp_df
group | names | scores | logfoldchanges | pvals | pvals_adj | -log10(adjusted p-value) | Significant | |
---|---|---|---|---|---|---|---|---|
0 | 0 | CD4 | 45.287788 | 3.542424 | 0.000000e+00 | 0.000000e+00 | inf | True |
1 | 0 | CD5 | 37.983711 | 2.914594 | 0.000000e+00 | 0.000000e+00 | inf | True |
2 | 0 | CD3E | 30.985083 | 2.477191 | 8.563183e-211 | 2.397691e-209 | 208.620207 | True |
3 | 0 | CD26 | 27.401339 | 1.520066 | 2.643611e-165 | 4.441267e-164 | 163.352493 | True |
4 | 0 | CD27 | 24.893673 | 1.715873 | 8.711723e-137 | 1.045407e-135 | 134.980715 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... |
499 | 5 | CD44 | -20.942900 | -1.580496 | 2.177890e-97 | 1.407252e-96 | 95.851628 | True |
500 | 5 | CD3E | -21.133327 | -3.324961 | 3.928582e-99 | 2.750007e-98 | 97.560666 | True |
501 | 5 | CD2 | -21.300957 | -3.234140 | 1.112242e-100 | 8.493488e-100 | 99.070914 | True |
502 | 5 | CD43 | -25.652502 | -4.065320 | 3.964816e-145 | 4.757779e-144 | 143.322596 | True |
503 | 5 | CD162 | -25.756119 | -3.120937 | 2.752778e-146 | 3.853889e-145 | 144.414101 | True |
504 rows × 8 columns
The results can be visualized using a volcano plot to get an overview of the test results.
sns.relplot(
diff_exp_df,
x="logfoldchanges",
y="-log10(adjusted p-value)",
hue="Significant",
col="group",
col_wrap=3,
height=4,
s=8,
).set_xlabels("log2 fold change");
To help us compare the abundance levels of markers across clusters, we will visualize the average fold change of each cluster in a heatmap.
diff_exp_df
df = diff_exp_df.pivot(index=["names"], columns=["group"], values=["logfoldchanges"])
# We only want to plot markers that have a log2 fold change of at least two, and that
# are significant in at least one cluster comparison
markers_for_heatmap = set(
diff_exp_df[
(np.abs(diff_exp_df["logfoldchanges"]) > 1) & diff_exp_df["Significant"]
]["names"]
)
df = df[df.index.isin(markers_for_heatmap)]
df.columns = [cluster for _, cluster in df.columns]
sns.clustermap(df, yticklabels=True, linewidths=0.1, cmap="vlag");
Now we can use the relative abundance levels of cells to manually assign cell identities to clusters.
cell_annotations = {
"0": "CD4 T",
"1": "CD8 T",
"2": "CD8 T",
"3": "Monocyte",
"4": "Other",
"5": "B",
}
pg_data_combined_processed.obs["cell_type"] = pg_data_combined_processed.obs[
"louvain"
].map(cell_annotations)
with rc_context({"figure.autolayout": True}):
sc.pl.umap(pg_data_combined_processed, color=["cell_type"])
Now, let’s take a look at the frequency of each of the cell identities within our data.
cell_counts_df = (
pg_data_combined_processed.obs.groupby(["condition", "cell_type"])
.size()
.to_frame(name="count")
.reset_index()
)
sns.catplot(
cell_counts_df, kind="bar", x="cell_type", y="count", hue="condition", aspect=1.6
).set_xlabels("Cell type").set_ylabels("Count");
Compact annotation alternative - Marker gating
In the normalization tutorial, we got introduced to dsb normalization as an approach to make abundance values comparable across both cells and proteins. We can therefore use the dsb normalized abundance with fixed thresholds for gating based annotation. In this tutorial we choose 1.5 as gate threshold since the non-expressing population are centered at 0 with a standard deviation close to 1 with dsb normalization. A cell can be annotated as a cell type if passes through the corresponding positive and negative marker gates. In the code below, we annotate a cell with the cell type with highest ratio of passing gates. If for none of the cell types, the cell passes more than 75% of the gates, it will be annotated as “other”.
celltype_marker_sets = {
"CD8T": ["CD3E", "CD8", "-CD4", "-CD26", "-CD328"],
"CD4T": ["CD3E", "CD4", "-CD8", "-CD26", "-CD328"],
"B": ["CD19", "CD20", "CD22", "-CD3E"],
"Monocyte": ["CD1d", "CD36", "-CD3E", "-CD20"],
}
def calc_gates_passed(expression_data: pd.DataFrame, marker_sets):
gate_threshold = 1.5
celltypes = list(marker_sets.keys())
celltype_score = pd.DataFrame(index=expression_data.index, columns = celltypes)
for ct in celltypes:
celltype_score.loc[:, ct] = 0
for marker in marker_sets[ct]:
if marker[0] == '-': # Marker should be missing from celltype ct
celltype_score.loc[:, ct] += (expression_data.loc[:, marker[1:]] < gate_threshold)
else:
celltype_score.loc[:, ct] += (expression_data.loc[:, marker] > gate_threshold)
celltype_score.loc[:, ct] = celltype_score.loc[:, ct] / len(marker_sets[ct])
return celltype_score
required_gate_pass_ratio = 0.75
celltype_scores = calc_gates_passed(pg_data_combined_processed.to_df("dsb"), celltype_marker_sets)
pg_data_combined_processed.obs["cell_type_gate"] = celltype_scores.idxmax(axis=1).values
pg_data_combined_processed.obs.loc[celltype_scores.max(axis=1) < required_gate_pass_ratio, "cell_type_gate"] = "other"
Annotation sanity check
We now plot the abundance data for a few markers to check if the annotations conform with expected patterns. The density_scatter_plot function enables us to see marker distributions across single cells. Let us first look at an example of marker distribution against its isotype control, CD3E vs mIgG1.
from pixelator.plot import density_scatter_plot
density_scatter_plot(pg_data_combined_processed, layer="dsb", marker1="mIgG1", marker2="CD3E", show_marginal=True)
plt.show()
As expected, we see that the isotype control appears to form a single distribution centered around 0 corresponding to negative expression for all cells. CD3E however is divided into two distributions where one corresponds to the negative population centered around 0, i.e. PBMCs other than T cells that do not express CD3E and another distribution corresponding to T cells with CD3E values centered around 2.5.
Now we can check whether each cell type expresses the expected markers. For example looking at CD4 and CD8 proteins across different cell types:
density_scatter_plot(pg_data_combined_processed, layer="dsb", marker1="CD4", marker2="CD8", facet_column="cell_type_gate")
plt.show()
As expected, CD4 and CD8 are expressed in CD4 T cells and CD8 T cells respectively, while some NK cells also express CD8 as expected. Similarly, let’s look at a few more markers, namely, CD14, CD19, CD161 and CD337 that are expected to be expressed in Monocytes, B cells, Innate T cells and NK cells respectively.
density_scatter_plot(pg_data_combined_processed, layer="dsb", marker1="CD14", marker2="CD19", facet_column="cell_type_gate")
density_scatter_plot(pg_data_combined_processed, layer="dsb", marker1="CD161", marker2="CD337", facet_column="cell_type_gate")
plt.show()
Compare annotations
Let us now check to what extent the clustering based annotation and the gating based annotation agree:
confusion_matrix = pd.crosstab(pg_data_combined_processed.obs["cell_type_gate"], pg_data_combined_processed.obs["cell_type"])
sns.heatmap(confusion_matrix, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Clustering Annotation")
plt.ylabel("Gating Annotation")
plt.title("Confusion Matrix")
plt.show()
We observe that other than a majority of the NK cells that are classified as CD8 T or other cells, the annotations mostly agree. We note that on its own, annotation based on gating can be quite prone to errors as it only considers few markers. However, it can serve as a powerful complement to clustering based annotation by cleaning up wrong annotations. For example, B cells can not contain any CD3+/CD14+, T cells can not contain any CD19+/CD14+ etc.
In this tutorial we have learned how to annotate different cell populations and use protein abundance levels to compare across populations and between experimental samples.
We will now save this annotated object to proceed to the next tutorial where we will show how to perform a differential abundance test between experimental conditions within a specific cell type.
# Add the normalized AnnData to the pxl object
pg_data_combined_pxl_object.adata = pg_data_combined_processed
pg_data_combined_pxl_object.save(DATA_DIR/ "combined_data_annotated.pxl", force_overwrite=True)