---
title: Correcting batch effects in single-cell RNA-seq data
author: Aaron Lun
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
fit_caption: false
---
```{r style, echo=FALSE, results='hide', message=FALSE}
library(BiocStyle)
library(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
#options(bitmapType="cairo", width=100) # if transparencies don't work on your machine.
```
# Introduction
Large scRNA-seq datasets are often generated in multiple batches due to logistical constraints.
This results in batch effects due to uncontrollable differences in the processing of different batches, e.g., changes in operator, differences in reagent quality.
Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.
Here, we will explore a few methods for removing these batch effects.
We will be using a haematopoiesis dataset, generated by two groups with different technologies and involving different parts of the haematopoietic lineage.
The [Nestorowa](https://doi.org/10.1182/blood-2016-05-716480) dataset uses Smart-seq2 to profile haematopoietic stem cell and precursor populations.
In contrast, the [Paul](https://doi.org/10.1016/j.cell.2015.11.013) dataset uses MARS-seq to profile the myeloid progenitors.
For the sake of speed, we've already processed the two datasets to call highly variable genes - see `prepare.html` for details.
We load in the various R objects using the `load()` function.
```{r}
library(SingleCellExperiment)
load("objects.Rda")
ls()
```
**Exercise:**
We inspect some of these objects:
```{r}
sceN
```
```{r}
sceP
```
```{r}
decN
```
```{r}
decP
```
# Merging a set of highly variable genes
First, we need to obtain a common set of genes based on the Ensembl IDs.
This is why Ensembl IDs are so useful - they are, by definition, constant across experiments.
```{r}
common <- intersect(rownames(decN), rownames(decP))
length(common)
```
We combine the statistics using `combineVar` from `r Biocpkg("scran")`.
This uses Fisher's method to combine p-values across batches.
We take all genes that are significant at a FDR of 5%.
```{r}
library(scran)
combined <- combineVar(decP[common,], decN[common,])
to.use <- common[combined$FDR <= 0.05]
length(to.use)
```
**Exercise:**
```{r}
# What happens if we forget to use [common,]?
#H# combined2 <- combineVar(decP, decN)
#A# try(combined2 <- combineVar(decP, decN))
```
# Removing the batch effect
## With a linear model
The simplest approach is to remove batch effects with a linear model, e.g., using `removeBatchEffect`.
In this case, we're basically shifting the expression values so that the mean is the same between batches for every gene.
Let's try doing that:
```{r}
library(limma)
batch <- rep(c("N", "P"), c(ncol(sceN), ncol(sceP)))
together <- cbind(logcounts(sceN)[to.use,], logcounts(sceP)[to.use,])
corrected <- removeBatchEffect(together, batch=batch)
```
We set up a `SingleCellExperiment` object again:
```{r}
sceT <- SingleCellExperiment(list(exprs=corrected))
sceT$Batch <- batch
```
...and we make a PCA plot to look at, using `r CRANpkg("irlba")` to get the first 50 PCs.
We specify `ntop=Inf` to tell `runPCA` to use all available genes in `to.use`.
```{r}
library(scater)
sceT <- runPCA(sceT, exprs_values="exprs", method="irlba",
ncomponents=50, ntop=Inf)
plotPCA(sceT, colour_by="Batch")
```
Trying out a _t_-SNE plot:
```{r}
sceT <- runTSNE(sceT, use_dimred="PCA", perplexity=80)
plotTSNE(sceT, colour_by="Batch")
```
Even after `removeBatchEffect`, a strong batch effect still remains, for two reasons:
- differences in variance between UMI and read counts
- differences in cell type composition between data sets
**Exercise:**
```{r}
# How do I check what reduced dimension results are available?
#H# reducedDimNames()
#A# reducedDimNames(sceT)
```
We can get rid of the first effect to some extent using cosine normalization, but the second effect can't be resolved by _a priori_-defined linear models:
```{r}
together2 <- scran:::cosine.norm(together)
rownames(together2) <- to.use
#H# corrected2 <- removeBatchEffect()
#A# corrected2 <- removeBatchEffect(together2, batch=batch)
# Same as before...
sceT2 <- SingleCellExperiment(list(exprs=corrected2))
sceT2$Batch <- batch
sceT2 <- runPCA(sceT2, exprs_values="exprs", method="irlba",
ncomponents=50, ntop=Inf)
plotPCA(sceT2, colour_by="Batch")
```
## With MNN correction
A more sophisticated approach involves using the "mutual nearest neighbors" correction method (see https://www.nature.com/articles/nbt.4091/figures/1).
This involves finding pairs of cells - one per batch - where one cell in one batch is the nearest neighbour of a cell in the other batch, **and vice versa**.
Such pairs are likely to represent cells of the same type or state; we can use this to compute correction vectors between the two batches.
```{r}
corrected <- mnnCorrect(counts(sceN)[to.use,], counts(sceP)[to.use,], sigma=0.05)
```
We can combine the corrected matrices into a single object:
```{r}
out <- do.call(cbind, corrected$corrected)
colnames(out) <- NULL
sceM <- SingleCellExperiment(list(exprs=out),
colData=DataFrame(Batch=batch))
```
... and we run a PCA using `r CRANpkg("irlba")` to get the first 50 PCs.
```{r}
sceM <- runPCA(sceM, exprs_values="exprs", method="irlba",
ncomponents=50, ntop=Inf)
```
Plotting it indicates that the batch effect has been removed from the first two PCs.
In particular, the Paul dataset has been integrated into the Nestorowa dataset:
```{r}
plotPCA(sceM, colour_by="Batch")
```
The corrected values can be used for clustering, but it not for DE as the values are not interpretable as log-counts.
We suggest using the learnt structure (clusters) with the batch of origin as a blocking factor in the design matrix.
**Exercise:**
```{r}
# The _t_-SNE is a bit messier, but it shows the point:
#H# sceM <- runTSNE(, perplexity=80)
#H# plotTSNE()
#A# sceM <- runTSNE(sceM, use_dimred="PCA", perplexity=80)
#A# plotTSNE(sceM, colour_by="Batch")
```
# Session information
```{r}
sessionInfo()
```