Data Integration
This tutorial demonstrates how to harmonize multiple datasets using the Harmony method in R. Harmony is one of several data integration techniques for single-cell omics data. For more information, you can refer to this benchmark paper that reviews a variety of methods. We will preprocess the data, perform dimensionality reduction, harmonize the datasets, and visualize the results.
After completing this tutorial, you should be able to:
- Preprocess the data by normalizing, scaling, and finding variable features.
- Perform dimensionality reduction with PCA and UMAP and visualize the results.
- Integrate multiple datasets using the Harmony method.
- Visualize the harmonized data to ensure effective integration.
Setup
library(pixelatorR)
library(dplyr)
library(tibble)
library(stringr)
library(tidyr)
library(Seurat)
library(ggplot2)
library(here)
library(harmony)
library(pheatmap)
Load data
To illustrate the integration procedure we continue with the normalized object we created in the previous tutorial. As described before, this merged object contains four samples: a resting and PHA stimulated PBMC sample, both in duplicate.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
# Load the object that you saved in the previous tutorial. This is not needed if it is still in your workspace.
pg_data_combined <- readRDS(file.path(DATA_DIR, "combined_data_normalized.rds"))
pg_data_combined
An object of class Seurat
168 features across 3844 samples within 2 assays
Active assay: mpxCells (84 features, 84 variable features)
3 layers present: counts, data, scale.data
1 other assay present: log_assay
2 dimensional reductions calculated: pca, umap
Harmonizing data
Data integration is often performed to address biases resulting from experimental or biological factors such as sample collection site, sample batch, different experimental conditions or time points. We use the integration to find cells that are similar across the samples, with the presumption that they originate from the same cell population, allowing for comparison of these cells across conditions.
Data integration should always be performed to address a specific bias that affects analysis. In our case, we are comparing samples that have undergone very different treatments, i.e. PHA stimulated or resting. Therefore, integration may be necessary to adjust for these differences to allow for meaningful comparisons and cell type annotations. In essence, we want to identify shared cell populations across the samples and conditions, such that we can be certain we are comparing the same cell types across conditions.
To start with the data integration, we must first identify whether this bias is present in our data. Let’s start with investigating the UMAP we just generated to see if there is a clear difference between these samples.
DimPlot(pg_data_combined, reduction = "umap", group.by = "sample") +
coord_fixed()
Indeed, we see that the samples from different conditions have separated in the UMAP space, indicating that there are differences in protein abundance between the cells of the different conditions. Let’s investigate the abundance of some markers to ensure ourselves that the main immune cell types are represented in all samples and conditions.
FeaturePlot(pg_data_combined,
features = c("CD3E", "CD4", "CD8",
"CD19", "CD20",
"CD14"),
reduction = "umap", ncol = 3,
min.cutoff = 0,
coord.fixed = TRUE)
Using these markers, we can see that all conditions have a presence of T helper cells (CD3+, CD4+), cytotoxic T cells (CD3+, CD8+), B cells (CD19+, CD20+), and monocytes (CD14+). However, it is clear that the protein abundance is somewhat different across the conditions, since these cell types are not fully overlapping across conditions in the UMAP space. If we would perform a clustering on the non-integrated data, we risk generating individual clusters per each cell type and condition, complicating the comparison of cell type protein abundance and spatial scores across the samples. To alleviate this, we will use the Harmony package to integrate the data, and correct for the sample and condition specific biases.
We are interested in the abundance differences, so we want to perform a
correction that doesn’t eliminate these, but allows us to compare the
same corresponding cell types across the samples in a meaningful way.
Harmony uses the PCA embedding we previously generated and corrects them
for specified covariates present as a column in the object. In this
case, we will correct for the condition
and sample
variable. Harmony
will create a new set of embeddings that we can use instead of the PCA.
We can then use these Harmony embedding for downstream analysis, such as
UMAP and clustering. In a previous step, we added condition
to the
Seurat object using AddMetaData
, so that we can use this variable to
correct for the donor effect.
# Run Harmony to correct for donor effects
pg_data_combined <-
pg_data_combined %>%
RunHarmony(group.by.vars = c("condition", "sample"),
reduction = "pca",
reduction.save = "harmony")
# Run UMAP on the Harmony embedding
pg_data_combined <-
pg_data_combined %>%
RunUMAP(reduction = "harmony", dims = 1:10,
reduction.name = "harmony_umap")
DimPlot(pg_data_combined, reduction = "harmony_umap", group.by = "condition", shuffle = TRUE) +
coord_fixed()
DimPlot(pg_data_combined, reduction = "harmony_umap", group.by = "sample", split.by = "condition")
We can now see in the UMAP generated from the Harmony embedding that the samples are overlapping more, indicating that the condition bias has been reduced. There seem to be some clusters shared across conditions as well as some condition-specific clusters. Let’s also compare the PCA and Harmony embedding.
DimPlot(pg_data_combined, reduction = "pca", group.by = "condition", shuffle = TRUE) +
ggtitle("PCA") +
coord_fixed()
DimPlot(pg_data_combined, reduction = "harmony", group.by = "condition", shuffle = TRUE) +
ggtitle("Harmony") +
coord_fixed()
Let’s save this integrated object now for the next tutorial on Cell Annotation.
# Save filtered dataset for next step. This is an optional pause step; if you prefer to continue to the next tutorial without saving the object explicitly, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data_integrated.rds"))
In this tutorial, we have learned how to integrate several datasets using the Harmony package, and apply some downstream analysis on the integrated data. This is useful when comparing samples from different donors, conditions, or time points, where there are differences, biological or technical, in protein abundance between samples that may affect visualization, clustering, and cell annotation. Using Harmony, we can correct for such effects or confounding variables, allowing for meaningful comparisons across samples.