---
title: Studying heterogeneity in differentiating LUHMES cells
author: Aaron Lun
date: 27 January 2016
---

```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
options(width=100)
```

# Introduction

This describes the analysis of scRNA-seq data from Justyna Cholewa-Waclaw to identify population heterogeneity in differentiating LUHMES cells.
CRAM files were obtained from the Sanger FTP servers on the 18th January 2016.
Each file contained read pairs aligned to the hg19 build of the human genome.
Read pairs were counted into exonic regions of the ENSEMBL annotation using the `featureCounts` function in the `Rsubread` package.

First we load the counts into memory.
We put things into a `DGEList` for easier handling later on.

```{r}
library(edgeR)
all.counts <- read.table("genic_counts.tsv", header=TRUE, row.names=1)
y <- DGEList(all.counts[,-1])
dim(y)
```

We also add some annotation to relate each gene ID to its symbol.

```{r}
library(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keytype="ENSEMBL", keys=rownames(y$counts), column="SYMBOL")
y$genes <- anno[match(rownames(y$counts), anno$ENSEMBL),]
y$genes$Length <- all.counts[,1]
```

Finally, we load the target file.
We have counts for 384 samples, split up into four differentiation time-points -- undifferentiated, day 6, day 9 and day 13.
There are also an additional 4 files for PhiX controls.
These aren't in the target file and must be removed from the count matrix.

```{r}
targets <- read.csv("targets.csv", header=TRUE)
targets$CRAM.file <- paste0("X", sub("#", ".", targets$CRAM.file))
m <- match(colnames(y$counts), targets$CRAM.file)
keep <- !is.na(m)
sum(keep)
y <- y[,keep]
y$samples$group <- targets$day.of.differentiation[m[keep]]
table(y$samples$group)
```

# Quality control on cells and genes

We remove low-quality libraries with fewer than 10^5^ reads.
This removes about a quarter of our cells.

```{r}
summary(y$samples$lib.size)
failed <- y$samples$lib.size < 1e5
table(y$samples$group[failed])
y <- y[,!failed]
```

We check the mitochondrial proportions as well.
High proportions of mitochondrial reads are indicative of aberrant cellular behaviour.
There's no strong outliers, but we'll remove those cells with proportions above 0.2 just in case.

```{r, message=FALSE}
library(GenomicFeatures)
#txdb <- makeTxDbFromUCSC(genome='hg19',tablename='ensGene') 
#saveDb(txdb, file="hg19.ensGene")
txdb <- loadDb(file="hg19.ensGene")
location <- select(txdb, keys=rownames(y$counts), column="TXCHROM", keytype="GENEID")
is.mito <- (location$TXCHROM=="chrM" & !is.na(location$TXCHROM))[match(rownames(y$counts), location$GENEID)]
mito.prop <- colSums(y$counts[is.mito,])/y$samples$lib.size
hist(mito.prop)
ok.mito <- mito.prop < 0.2
y <- y[,ok.mito]
sum(ok.mito)
```

We remove low-abundance genes with an average count below 1.
These genes are unlikely to be interesting as they won't provide much information.
Discreteness also interferes with downstream statistical methods.

```{r}
keep <- rowMeans(y$counts) > 1 
y <- y[keep,]
summary(keep)
```

We use our deconvolution method for normalizing low counts. 
It seems to perform similarly to TMM and library size normalization for this data set, which makes sense given that the library sizes are fairly large.

```{r, fig.width=10, fig.height=5}
library(scran)
sf.deconv <- normalizeBySums(y$counts)
sf.tmm <- calcNormFactors(y$counts) * y$samples$lib.size
par(mfrow=c(1,2))
plot(sf.deconv, sf.tmm, log="xy", pch=16)
plot(sf.deconv, y$samples$lib.size, log="xy", pch=16)
y$samples$norm.factors <- sf.deconv/y$samples$lib.size
```

# Examining the overall behaviour of the data set

We first examine the overall behaviour of the data by making a PCA plot with all samples included.
Some set-up is necessary, though; we log-transform the data to stabilize the variances, and we assign colors to each group.

```{r}
adjc <- cpm(y, log=TRUE, prior.count=1)
colors <- c(undiff="blue", D6="red", D9="darkgreen", D13="orange")
pointtype <- c(undiff=16, D6=17, D9=18, D13=15)
```

We proceed to construct a PCA plot, with some extra aesthetics.

```{r}
PCAplot <- function(out, pc=c(1,2), group, col=NULL, pch=NULL) {
    if (!missing(group)) { group <- as.character(group) }
    if (is.null(col)) { col <- colors[group] }
    if (is.null(pch)) { pch <- pointtype[group] }
    all.var <- out$sdev^2
    plot(out$x[,pc[1]], out$x[,pc[2]], col=col, pch=pch,
        xlab=paste0("PC ", pc[1], " (", round(all.var[pc[1]]/sum(all.var)*100, 2), "%)"), 
        ylab=paste0("PC ", pc[2], " (", round(all.var[pc[2]]/sum(all.var)*100, 2), "%)"))
    return(invisible(out))
}
```

The differences between differentiation time are mostly on the second PC, where you see a transition from undifferentiated cells to the day 13 state.
However, heterogeneity within populations seems to dominate on the first PC.
This is true despite the fact that each time point is sampled from a separate culture, such that one would expect to see batch effects separating the differentiation time points.

```{r, fig.width=10, fig.height=6}
out <- prcomp(t(adjc), scale.=TRUE, center=TRUE, retx=TRUE)
par(mfrow=c(1,2))
PCAplot(out, pc=c(1,2), group=y$samples$group)
PCAplot(out, pc=c(2,3), group=y$samples$group)
```

There doesn't seem to any major change in heterogeneity between time points, as the spread of cells in the PCA plot is fairly comparable between time points.
We can quantify this by computing the variance of all genes and comparing the distribution of variances between populations.
There's no change in the overall distribution of variances, but it looks like there's an increase in the variance of the most highly variable genes after differentiation.
Be aware, however, that this does not account for the mean variance trend -- see below for a more rigorous analysis.

```{r, fig.width=10, fig.height=6}
is.undiff <- y$samples$group=="undiff"
undiff.var <- sort(apply(adjc[,is.undiff], 1, var), decreasing=TRUE)
is.d6 <- y$samples$group=="D6"
d6.var <-  sort(apply(adjc[,is.d6], 1, var), decreasing=TRUE)
is.d9 <- y$samples$group=="D9"
d9.var <-  sort(apply(adjc[,is.d9], 1, var), decreasing=TRUE)
is.d13 <- y$samples$group=="D13"
d13.var <- sort(apply(adjc[,is.d13], 1, var), decreasing=TRUE)
par(mfrow=c(1,2))
boxplot(list(Undiff=undiff.var, D6=d6.var, D9=d9.var, D13=d13.var), ylab="Gene-wise variance")
boxplot(list(Undiff=undiff.var[1:500], D6=d6.var[1:500], D9=d9.var[1:500], D13=d13.var[1:500]), 
    ylab="Gene-wise variance (top 500)")
```

We'll have to check cell cycle activity for the undifferentiated cells.
Upon differentiation, neuronal cells become post-mitotic so it shouldn't be a major concern from then on.
It seems the the undifferentiated cells have the most G2M assignments, whereas the differentiated cells get bogged down in S phase.
This is rather odd -- perhaps it's because the training set does not consider complete exit from the cell cycle.

```{r}
hs.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))
assignments <- cyclone(y$counts, hs.pairs)
g.char <- as.character(y$samples$group)
plot(assignments$score$G1, assignments$score$G2M, col=colors[g.char], pch=pointtype[g.char])
phase <- c("S", "G1", "G2M", "unknown")[1+(assignments$score$G1 > 0.5)+(assignments$score$G2M > 0.5)*2]
table(phase[is.undiff])/sum(is.undiff)
table(phase[is.d6])/sum(is.d6)
table(phase[is.d9])/sum(is.d9)
table(phase[is.d13])/sum(is.d13)
```

In any case, we'll keep an eye on the G2M assignments and make sure they're not responsible for separate clustering.

```{r}
phase.point <- c(S=1, G1=16, G2M=7, unknown=4)[phase]
```

# Examining the behaviour at each time point

## Looking at undifferentiated cells

We focus on undifferentiated cells to see if we can extract detailed substructure.
We remove lowly-expressed genes and construct a PCA plot as before.
There's no distinct G2M group, so we're probably safe from artificial clusters due to cell cycle phase.

```{r, fig.height=5, fig.width=10}
y.undiff <- y[,is.undiff]
y.undiff <- y.undiff[rowMeans(y.undiff$counts) >= 1,]
undiff.only <- cpm(y.undiff, log=TRUE, prior.count=1)
undiff.out <- prcomp(t(undiff.only), scale.=TRUE, center=TRUE, retx=TRUE)
par(mfrow=c(1,2))
PCAplot(undiff.out, c(1,2), "undiff")
PCAplot(undiff.out, c(1,2), pch=phase.point[is.undiff], col="black")
```

We then look at the highly variable genes within this population.
This is done after fitting a loess curve to the (log-)variances, and subtracting the fitted value from each variance.
The idea is to account for the mean-variance relationship during the modelling process.

```{r}
undiff.trend <- fitTechTrend(y.undiff$counts, exp(getOffset(y.undiff)), trend="loess")
undiff.mean <- undiff.trend$mean
undiff.var <- undiff.trend$var
plot(undiff.mean, log2(undiff.var))
undiff.fit <- undiff.trend$trend(undiff.mean)
points(undiff.mean, log2(undiff.fit), col="red")
```

We can now examine the most highly variable genes.
It looks like all of them have non-zero expression, rather than being driven by outliers, which is reassuring.

```{r}
undiff.hvg.stat <- undiff.var - undiff.fit
o <- order(undiff.hvg.stat, decreasing=TRUE)
to.look <- undiff.only[o[1:10],]
rownames(to.look) <- ifelse(is.na(y$genes$SYMBOL), y$genes$ENSEMBL, y$genes$SYMBOL)[o[1:10]]
out <- split(to.look, matrix(rownames(to.look), nrow=10, ncol=ncol(undiff.only)))
boxplot(out, las=2)
```

This can be examined more thoroughly with a heatmap.
The idea is to check whether there are distinct subpopulations or whether this is a result of general heterogeneity.
Strong subsets don't seem to be present.

```{r}
heat.undiff <- as.matrix(undiff.only - rowMeans(undiff.only))
library(gplots)
out <- heatmap.2(heat.undiff[o[1:500],], col=bluered, symbreaks=TRUE, trace='none')
```

There might be a general trend of increasing expression towards the right.
However, this also shows up in the heatmap constructed using all genes.
This suggests that this pattern is more likely to be an artifact of imperfect normalization.
Under/over-estimates of the scaling factor result in global changes to gene expression.

```{r}
heatmap.2(heat.undiff[,out$colInd], col=bluered, symbreaks=TRUE, trace='none', Rowv=FALSE, Colv=FALSE)
```

## Looking at day 6 cells

We repeat this procedure for day 6 cells.

```{r, fig.height=5, fig.width=10}
y.d6 <- y[,is.d6]
y.d6 <- y.d6[rowMeans(y.d6$counts) >= 1,]
d6.only <- cpm(y.d6, log=TRUE, prior.count=1)
d6.out <- prcomp(t(d6.only), scale.=TRUE, center=TRUE, retx=TRUE)
par(mfrow=c(1,2))
PCAplot(d6.out, c(1,2), "D6")
PCAplot(d6.out, c(1,2), pch=phase.point[is.d6], col="black")
```

We then look at the highly variable genes within this population.
First, we fit a trend like we did before.

```{r}
d6.trend <- fitTechTrend(y.d6$counts, exp(getOffset(y.d6)), trend="loess")
d6.mean <- d6.trend$mean
d6.var <- d6.trend$var
plot(d6.mean, log2(d6.var))
d6.fit <- d6.trend$trend(d6.mean)
points(d6.mean, log2(d6.fit), col="red")
```

We can now examine the most highly variable genes.

```{r}
d6.hvg.stat <- d6.var - d6.fit
o <- order(d6.hvg.stat, decreasing=TRUE)
to.look <- d6.only[o[1:10],]
rownames(to.look) <- ifelse(is.na(y$genes$SYMBOL), y$genes$ENSEMBL, y$genes$SYMBOL)[o[1:10]]
out <- split(to.look, matrix(rownames(to.look), nrow=10, ncol=ncol(d6.only)))
boxplot(out, las=2)
```

Examination of the heatmap doesn't reveal any clear subpopulations.
Again, there's some graduations but they're also present in the global heatmap, which suggests that they're more likely to be normalization artifacts.

```{r}
heat.d6 <- as.matrix(d6.only - rowMeans(d6.only))
library(gplots)
out <- heatmap.2(heat.d6[o[1:500],], col=bluered, symbreaks=TRUE, trace='none')
heatmap.2(heat.d6[,out$colInd], col=bluered, symbreaks=TRUE, trace='none', Rowv=FALSE, Colv=FALSE)
```

## Looking at day 9 cells

We repeat this procedure for day 9 cells.

```{r, fig.height=5, fig.width=10}
y.d9 <- y[,is.d9]
y.d9 <- y.d9[rowMeans(y.d9$counts) >= 1,]
d9.only <- cpm(y.d9, log=TRUE, prior.count=1)
d9.out <- prcomp(t(d9.only), scale.=TRUE, center=TRUE, retx=TRUE)
par(mfrow=c(1,2))
PCAplot(d9.out, c(1,2), "D9")
PCAplot(d9.out, c(1,2), pch=phase.point[is.d9], col="black")
```

We then look at the highly variable genes within this population.
First, we fit a trend like we did before.

```{r}
d9.trend <- fitTechTrend(y.d9$counts, exp(getOffset(y.d9)), trend="loess")
d9.mean <- d9.trend$mean
d9.var <- d9.trend$var
plot(d9.mean, log2(d9.var))
d9.fit <- d9.trend$trend(d9.mean)
points(d9.mean, log2(d9.fit), col="red")
```

We can now examine the most highly variable genes.

```{r}
d9.hvg.stat <- d9.var - d9.fit
o <- order(d9.hvg.stat, decreasing=TRUE)
to.look <- d9.only[o[1:10],]
rownames(to.look) <- ifelse(is.na(y$genes$SYMBOL), y$genes$ENSEMBL, y$genes$SYMBOL)[o[1:10]]
out <- split(to.look, matrix(rownames(to.look), nrow=10, ncol=ncol(d9.only)))
boxplot(out, las=2)
```

It looks like there's some decent substructure here. 

```{r}
heat.d9 <- as.matrix(d9.only - rowMeans(d9.only))
library(gplots)
out <- heatmap.2(heat.d9[o[1:500],], col=bluered, symbreaks=TRUE, trace='none')
sub.A <- unlist(out$colDendrogram[[1]])
sub.B <- unlist(out$colDendrogram[[2]][[1]])
sub.C <- unlist(out$colDendrogram[[2]][[2]][[2]][[1]][[1]][[2]][[1]][[1]][[2]])
sub.D <- setdiff(unlist(out$colDendrogram[[2]][[2]]), sub.C)
```

We can have a look at our manual allocations on the heatmap and on the PCA plot.

```{r}
group.d9 <- character(ncol(heat.d9))
group.d9[sub.A] <- "A"
group.d9[sub.B] <- "B"
group.d9[sub.C] <- "C"
group.d9[sub.D] <- "D"
side.cols <- c(A="black", B="green", C="purple", D="orange")[group.d9]
heatmap.2(heat.d9[o[1:500],], col=bluered, symbreaks=TRUE, trace='none', ColSideColors=side.cols)
PCAplot(d9.out, c(1,2), col=side.cols, pch=16)
```

We examine the DE genes between subpopulations using `voom`.
This uses log-transformed counts, so should give the greatest consistency with our PCA and clustering results.

```{r}
g <- factor(group.d9)
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)
v.d9 <- voom(y.d9, design, plot=TRUE)
fit <- lmFit(v.d9, design)
fit2 <- contrasts.fit(fit, makeContrasts(A-B, A-C, A-D,
           B-A, B-C, B-D,
           C-A, C-B, C-D,
           D-A, D-B, D-C, levels=design))
fit2 <- eBayes(fit2, robust=TRUE)
```

We rank genes on their $p$-value but focus on genes that have the same sign across all contrasts for each subpopulation.
Keep in mind, though, that the $p$-value and FDR have no meaning here, as the groups were empirically identified.

```{r}
saveComparison <- function(fit, coefs, fname) {
    out <- topTable(fit, coef=coefs, n=Inf, sort.by="none")
    all.signs <- rowSums(fit$coefficients[,coefs]>0)
    keep <- all.signs==length(coefs) | all.signs==0L
    out <- out[keep,]
    out <- out[order(out$P.Value),]
    write.table(out, file=fname, sep="\t", quote=FALSE, row.names=FALSE)
    discard <- match(c("ENSEMBL", "P.Value", "adj.P.Val"), colnames(out))
    return(head(out[,-discard]))
}   
dir.create("de_results", showWarning=FALSE)
saveComparison(fit2, 1:3, "de_results/d9_A.tsv")
saveComparison(fit2, 4:6, "de_results/d9_B.tsv")
saveComparison(fit2, 7:9, "de_results/d9_C.tsv")
saveComparison(fit2, 10:12, "de_results/d9_D.tsv")
```

## Looking at day 13 cells

We repeat this procedure for day 13 cells.

```{r, fig.height=5, fig.width=10}
y.d13 <- y[,is.d13]
y.d13 <- y.d13[rowMeans(y.d13$counts) >= 1,]
d13.only <- cpm(y.d13, log=TRUE, prior.count=1)
d13.out <- prcomp(t(d13.only), scale.=TRUE, center=TRUE, retx=TRUE)
par(mfrow=c(1,2))
PCAplot(d13.out, c(1,2), "D13")
PCAplot(d13.out, c(1,2), pch=phase.point[is.d13], col="black")
```

We then look at the highly variable genes within this population.
First, we fit a trend like we did before.

```{r}
d13.trend <- fitTechTrend(y.d13$counts, exp(getOffset(y.d13)), trend="loess")
d13.mean <- d13.trend$mean
d13.var <- d13.trend$var
plot(d13.mean, log2(d13.var))
d13.fit <- d13.trend$trend(d13.mean)
points(d13.mean, log2(d13.fit), col="red")
```

We can now examine the most highly variable genes.

```{r}
d13.hvg.stat <- d13.var - d13.fit
o <- order(d13.hvg.stat, decreasing=TRUE)
to.look <- d13.only[o[1:10],]
rownames(to.look) <- ifelse(is.na(y$genes$SYMBOL), y$genes$ENSEMBL, y$genes$SYMBOL)[o[1:10]]
out <- split(to.look, matrix(rownames(to.look), nrow=10, ncol=ncol(d13.only)))
boxplot(out, las=2)
```

Some substructure is present.
I would discount the red stripe, that's probably a normalization artifact as it appears in the global heatmap as well.

```{r}
heat.d13 <- as.matrix(d13.only - rowMeans(d13.only))
library(gplots)
out <- heatmap.2(heat.d13[o[1:500],], col=bluered, symbreaks=TRUE, trace='none')
heatmap.2(heat.d13[,out$colInd], col=bluered, symbreaks=TRUE, trace='none', Rowv=FALSE, Colv=FALSE)
```

Instead, let's focus on the group with missing expression.
To make it a bit easier for myself, I'll use the following heatmap instead.

```{r}
sub.heat.d13 <- heat.d13[o[1:200],]
out <- heatmap.2(sub.heat.d13, col=bluered, symbreaks=TRUE, trace='none')
sub.A <- unlist(out$colDendrogram[[1]])
sub.B <- unlist(out$colDendrogram[[2]][[1]])
sub.C <- unlist(out$colDendrogram[[2]][[2]])
```

We can check out our manual allocations, below.

```{r}
group.d13 <- character(ncol(heat.d13))
group.d13[sub.A] <- "A"
group.d13[sub.B] <- "B"
group.d13[sub.C] <- "C"
side.cols <- c(A="black", B="green", C="purple")[group.d13]
heatmap.2(sub.heat.d13, col=bluered, symbreaks=TRUE, trace='none', ColSideColors=side.cols)
PCAplot(d13.out, c(1,2), col=side.cols, pch=16)
```

We now look at the genes that distinguish the proposed subpopulations.

```{r}
g <- factor(group.d13)
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)
v.d13 <- voom(y.d13, design, plot=TRUE)
fit <- lmFit(v.d13, design)
fit2 <- contrasts.fit(fit, makeContrasts(A-B, A-C,
           B-A, B-C, 
           C-A, C-B, levels=design))
fit2 <- eBayes(fit2, robust=TRUE)
saveComparison(fit2, 1:2, "de_results/d13_A.tsv")
saveComparison(fit2, 3:4, "de_results/d13_B.tsv")
saveComparison(fit2, 5:6, "de_results/d13_C.tsv")
```

# Assessing the variance at each time point

As previously mentioned, the boxplots of the variance estimates don't account for the mean-variance trend.
We now compute the equivalent boxplots of the residual variance for all timepoints, where the mean-variance trend has been removed.
We see the same sort of effect as we did with the raw variances, which is reassuring.

```{r, fig.width=10, fig.height=5}
par(mfrow=c(1,2))
boxplot(list(Undiff=undiff.hvg.stat, D6=d6.hvg.stat, D9=d9.hvg.stat, D13=d13.hvg.stat), 
    ylab="Gene-wise variance")
boxplot(list(Undiff=sort(undiff.hvg.stat, dec=TRUE)[1:200], D6=sort(d6.hvg.stat, dec=TRUE)[1:200], 
    D9=sort(d9.hvg.stat, dec=TRUE)[1:200], D13=sort(d13.hvg.stat, dec=TRUE)[1:200]), 
    ylab="Gene-wise variance (top 200)")
```

Note however, that while this will remove the mean-variance trend, it will _also_ remove genuine systematic differences in variance (e.g., if the trend goes all-up or all-down).
The only way to have your cake and eat it too would be to add spike-ins.

# Session information

```{r}
sessionInfo()
```