Data Integration
This tutorial demonstrates how to harmonize multiple datasets using the Harmony method in Python. Harmony is one of several data integration techniques for single-cell omics data. For more information, you can refer to this benchmark paper that reviews a variety of methods. We will preprocess the data, perform dimensionality reduction, harmonize the datasets, and visualize the results.
After completing this tutorial, you should be able to:
- Preprocess the data by normalizing, scaling, and finding variable features.
- Perform dimensionality reduction with PCA and UMAP and visualize the results.
- Integrate multiple datasets using the Harmony method.
- Visualize the harmonized data to ensure effective integration.
Setup
from matplotlib.pyplot import rc_context
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pandas as pd
from pixelator import read, simple_aggregate
from pixelator.statistics import clr_transformation
import scanpy as sc
import scanpy.external as sce
import seaborn as sns
sns.set_style("whitegrid")
DATA_DIR = Path("<path to the directory to save datasets to>")
Load data
To illustrate the integration procedure we continue with the quality checked and filtered merged object we created in the previous tutorial.
pg_data_combined_pxl_object = read(DATA_DIR/ "combined_data_normalized.pxl")
# In this tutorial we will mostly work with the AnnData object
# so we will start by selecting that from the pixel data object
pg_data_combined_processed = pg_data_combined_pxl_object.adata
Harmonizing data
Data integration is often performed to address biases resulting from experimental or biological factors such as sample collection site, sample batch, different experimental conditions or time points. We use the integration to find cells that are similar across the samples, with the presumption that they originate from the same cell population, allowing for comparison of these cells across conditions.
Data integration should always be performed to address a specific bias that affects analysis. In our case, we are comparing samples that have undergone very different treatments, i.e. PHA stimulated or resting. Therefore, integration may be necessary to adjust for these differences to allow for meaningful comparisons and cell type annotations. In essence, we want to identify shared cell populations across the samples and conditions, such that we can be certain we are comparing the same cell types across conditions.
To start with the data integration, we must first identify whether this bias is present in our data. Let’s start with investigating the UMAP we just generated to see if there is a clear difference between these samples.
sc.pl.umap(pg_data_combined_processed, color = "sample")
Indeed, we see that the samples from different conditions have separated in the UMAP space, indicating that there are differences in protein abundance between the cells of the different conditions. Let’s investigate the abundance of some markers to ensure ourselves that the main immune cell types are represented in all samples and conditions.
markers_of_interest = ["CD3E", "CD4", "CD8", "CD19", "CD20", "CD14"]
sc.pl.umap(pg_data_combined_processed, color=markers_of_interest, ncols=3, layer = "dsb")
Using these markers, we can see that all conditions have a presence of T helper cells (CD3+, CD4+), cytotoxic T cells (CD3+, CD8+), B cells (CD19+, CD20+), and monocytes (CD14+). However, it is clear that the protein abundance is somewhat different across the conditions, since these cell types are not fully overlapping across conditions in the UMAP space. If we would perform a clustering on the non-integrated data, we risk generating individual clusters per each cell type and condition, complicating the comparison of cell type protein abundance and spatial scores across the samples. To alleviate this, we will use the Harmony package to integrate the data, and correct for the treatment specific biases.
We are interested in the abundance differences, so we want to perform a correction that doesn’t eliminate these, but allows us to compare the same corresponding cell types across the samples in a meaningful way. Harmony uses the PCA embedding we previously generated and corrects them for specified covariates present as a column in the object. In this case, we will correct for the condition and sample variable. Harmony will create a new set of embeddings that we can use instead of the PCA. We can then use these Harmony embedding for downstream analysis, such as UMAP and clustering. To do this, we first add the condition variables to the meta data of the AnnData object.
# Add condition column based on sample names
pg_data_combined_processed.obs['condition'] = pg_data_combined_processed.obs['sample'].map(
lambda x: 'stimulated' if 'stimulated' in x else 'resting'
)
# Run Harmony to correct for sample and condition effects
sce.pp.harmony_integrate(pg_data_combined_processed, key=['sample', 'condition'], basis="pca", adjusted_basis="X_harmony")
# Run UMAP on the Harmony embedding
sc.pp.neighbors(
pg_data_combined_processed,
n_neighbors=30,
n_pcs=10,
use_rep="X_harmony",
metric="cosine",
key_added="harmony_neighbors",
)
sc.tl.umap(
pg_data_combined_processed,
min_dist=0.3,
neighbors_key="harmony_neighbors"
)
sc.pl.umap(pg_data_combined_processed, color = "sample")
# Create separate UMAPs for resting and stimulated samples
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot resting samples
sc.pl.umap(
pg_data_combined_processed[pg_data_combined_processed.obs['sample'].str.contains('resting')],
color='sample',
ax=ax1,
show=False,
title='Resting'
)
# Plot stimulated samples
sc.pl.umap(
pg_data_combined_processed[pg_data_combined_processed.obs['sample'].str.contains('stimulated')],
color='sample',
ax=ax2,
show=False,
title='Stimulated'
)
plt.tight_layout()
plt.show()
We can now see in the UMAP generated from the Harmony embedding that the samples are overlapping more, indicating that the treatment bias has been reduced. There seem to be some clusters shared across conditions as well as some condition-specific clusters. Let’s also compare the PCA and Harmony embedding.
sc.pl.pca(pg_data_combined_processed, color = "sample", title = "PCA")
sc.pl.scatter(pg_data_combined_processed, basis='harmony', color='sample', title = "Harmony")
Save filtered dataset
Let’s save this integrated object now for the next tutorial on Cell Annotation. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
# Add the normalized AnnData to the pxl object
pg_data_combined_pxl_object.adata = pg_data_combined_processed
pg_data_combined_pxl_object.save(DATA_DIR/ "combined_data_integrated.pxl", force_overwrite=True)
In this tutorial, we have learned how to integrate several datasets using the Harmony package, and apply some downstream analysis on the integrated data. This is useful when comparing samples from different donors, conditions, or time points, where there are differences, biological or technical, in protein abundance between samples that may affect visualization, clustering, and cell annotation. Using Harmony, we can correct for such effects or confounding variables, allowing for meaningful comparisons across samples.