---
author: Aaron Lun
title: Analysis of single-cell RNA-seq data for spermatogonial cells in homeostatic and regenerating testes
date: 30 October 2015
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE)
options(width=115)
```

# Loading in the data

First, we read in the gene counts for each cell.

```{r}
counts <- read.table("genic_counts.tsv", header=TRUE)
gene.lengths <- counts$Length
gene.ids <- counts$Gene
counts <- counts[,-(1:2)]
rownames(counts) <- gene.ids
dim(counts)
```

We then determine the testis condition (homeostatic or regenerating) for each sample.

```{r}
tcond <- sub("_.*", "", colnames(counts))
table(tcond)
```

# Quality control

## Library size considerations

We start by performing some quality control based on library size.
Very small libraries are removed, as these correspond to wells in which RNA capture failed.
For this particular data set, all of the libraries seem to have decent numbers of reads, so I don't think any removal is necessary in this respect.

```{r}
totals <- colSums(counts)
summary(totals)
```

## Mitochondrial read proportions

We also remove libraries with a high proportion of mitochondrial reads, as such proportions are symptomatic of apoptosis.
I guess the logic is that it's easier to sequence mitochondrial transcripts if the mitochondria themselves have already been lysed by the apoptotic machinery.
Alternatively, stress may trigger a transcriptional response from the mitochondria, e.g., to increase energy output.

```{r, message=FALSE}
require(TxDb.Mmusculus.UCSC.mm10.ensGene)
anno <- select(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rownames(counts),
    keytype="GENEID", columns="TXCHROM")
is.mito <- rownames(counts) %in% anno$GENEID[anno$TXCHROM=="chrM"]
mito.prop <- colSums(counts[is.mito,])/totals
summary(mito.prop)
```

For some reason, a lot of the cells have high mitochondrial proportions.
Note that the regenerating cells were treated with a toxin to induce damage; as such, some of these cells may be apoptotic.
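As a toy illustration of the proportion calculation above (the matrix, gene names and cell names are all made up; `toy.mito` stands in for the chrM annotation lookup):

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
toy <- matrix(c(100, 50, 30, 20,
                200, 10,  5,  5,
                 80, 40, 60, 20),
    nrow=3, byrow=TRUE,
    dimnames=list(c("GeneA", "GeneB", "mt-Gene"),
                  c("cell1", "cell2", "cell3", "cell4")))
toy.mito <- rownames(toy) == "mt-Gene" # stand-in for the chrM lookup
toy.prop <- colSums(toy[toy.mito, , drop=FALSE])/colSums(toy)
round(toy.prop, 3) # cell3 has the highest proportion
```

The same column-sum logic scales directly to the full count matrix.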
We can have a look at the distribution of the mitochondrial proportions between groups.
It looks like there is indeed an increase in the mitochondrial proportions for some cells in the regenerating condition.

```{r}
boxplot(split(mito.prop, tcond), ylab="Mitochondrial proportions")
```

I'm disinclined to remove all cells with high mitochondrial proportions, as the whole point is to identify the transcriptional response to damage.
Removing near-apoptotic cells would probably hinder our ability to detect upregulation of non-apoptotic pathways.
As such, I'll only remove obvious outliers with very high mitochondrial proportions.

```{r}
discard <- mito.prop > 0.5
counts <- counts[,!discard]
totals <- totals[!discard]
tcond <- tcond[!discard]
summary(discard)
```

## Testing positive and negative control genes

We check the behaviour of some marker genes: all cells should express *Cd9* and *Miwi2*, and should not express *Gfra1*.
The homeostatic population should also exhibit strong expression of *Plzf*, *Neurog3* and *Lin28a*.
Unfortunately, the expression of these positive controls seems to be rather weak.
We don't have any spike-ins, so it's hard to know whether this is technical noise from the scRNA-seq protocol or poor sorting.
Actin is fine, but then, you would expect strong coverage from something that is so highly expressed.
```{r, message=FALSE}
all.genes <- c(Cd9="ENSMUSG00000030342", Miwi2="ENSMUSG00000036912",
    Plzf="ENSMUSG00000066687", Neurog3="ENSMUSG00000044312",
    Lin28a="ENSMUSG00000050966", Oct4="ENSMUSG00000024406",
    Actin="ENSMUSG00000029580", Gfra1="ENSMUSG00000025089") # as controls
control.counts <- as.matrix(counts[match(all.genes, rownames(counts)),])
rownames(control.counts) <- names(all.genes)
summary(t(control.counts))
require(gplots)
heatmap.2(log2(control.counts+0.01), dendrogram="none", trace="none",
    col=gray.colors, Rowv=FALSE, Colv=FALSE,
    colsep=sum(tcond=="Hom"), sepcolor="red")
```

## Checking CD9 intensities against expression

We have intensities for CD9 protein levels from FACS.
These can be compared to the *Cd9* expression levels in terms of the observed read counts.

```{r}
cd9.int <- read.csv("../cd9_intensities.csv", header=FALSE)
unnamed <- sub("_15s[0-9]+_lane[0-9]", "", colnames(counts))
refnamed <- paste0(cd9.int[,1], cd9.int[,2])
cd9.match <- match(unnamed, refnamed) # converse will have missing values, due to library filtering.
sum(is.na(cd9.match) | duplicated(cd9.match)) # should be zero.
log.int <- log(cd9.int[cd9.match,3])
plot(log((as.integer(counts[rownames(counts)==all.genes["Cd9"],])+0.5)/totals*1e6),
    log.int, pch=16, xlab="log-CPM", ylab="Fluorescence")
```

As you can see, there is no relationship between the protein intensities and the expression levels.
On the plus side, there is a fairly clear demarcation between CD9-positive and CD9-negative cells.
We can use this to throw out the latter.

```{r}
discard <- log(cd9.int[cd9.match,3]) < 4
sum(discard)
counts <- counts[,!discard]
totals <- totals[!discard]
tcond <- tcond[!discard]
```

## Checking the sex

All mice are male, so there shouldn't be any expression of female-specific transcripts like *Xist*.
Funnily enough, there is one cell that contains high read counts for *Xist*, so we throw it out.
```{r}
xist.counts <- as.integer(counts[rownames(counts)=="ENSMUSG00000086503",])
summary(xist.counts)
discard <- xist.counts > 1e4
sum(discard)
counts <- counts[,!discard]
totals <- totals[!discard]
tcond <- tcond[!discard]
```

# Filtering out useless genes

I only keep genes that have an average count above 1 and a count above 20 in more than five cells.
The idea is to remove lowly expressed genes while protecting against genes that are only expressed in a small number of outlier samples.
This filtering strategy is statistically *ad hoc*, and will probably distort downstream estimates.
Nonetheless, I'm using it as it avoids problems with genes that have high average counts due to one or two highly-expressing outlier cells.

```{r}
keep <- rowSums(counts > 20) > 5 & rowMeans(counts) > 1
counts <- counts[keep,]
gene.lengths <- gene.lengths[keep]
summary(keep)
```

# Exploratory data analysis

We compute log-counts-per-million for each gene in each cell, and use these to make a PCA plot.
We add a fairly large prior count before the log-transformation.
This avoids undefined or large negative values from zero counts, which would otherwise inflate the variance of each gene.

```{r, message=FALSE}
require(edgeR)
evals <- cpm(counts, log=TRUE, prior.count=3)
pca.out <- prcomp(t(evals), center=TRUE, scale.=TRUE)
plot(pca.out$x[,1], pca.out$x[,2], col=ifelse(tcond=="Hom", "red", "blue"), pch=16)
```

That doesn't really inspire confidence.
There is no great separation between groups, and the first principal component is dominated by a sparse spread of cells.
Examination of GO categories for the genes with the top PC loadings doesn't suggest any consistent function beyond some rather vague terms.
Spermatogenesis and differentiation do pop up, which suggests that the first component might represent something relevant.
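To unpack the log-CPM calculation: each count is offset by the prior count, scaled to its library size, multiplied by a million and log2-transformed. A simplified base-R sketch with a made-up matrix (edgeR's `cpm(log=TRUE)` additionally scales the prior count by library size, so its exact values differ):

```r
# Simplified log2-CPM with a prior count (values are made up).
toy <- matrix(c(0, 10, 100,
                5,  0,  50), nrow=2, byrow=TRUE)
lib.size <- colSums(toy)
prior <- 3
log.cpm <- log2(sweep(toy + prior, 2, lib.size, "/") * 1e6)
log.cpm # no -Inf values, despite the zero counts
```

Without the prior, the two zero counts would map to `-Inf` and dominate any per-gene variance.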
```{r, message=FALSE}
top.genes <- rownames(pca.out$rotation)[order(abs(pca.out$rotation[,1]), decreasing=TRUE)[1:1000]]
require(org.Mm.eg.db)
goterms <- select(org.Mm.eg.db, keys=top.genes, keytype="ENSEMBL", column="GO")
top.go <- sort(table(goterms$GO), decreasing=TRUE)[1:20]
require(GO.db)
cbind(select(GO.db, keys=names(top.go), columns="TERM"), top.go)
```

There is a weak association between PC1 and the mitochondrial read proportions.
The cells that are dispersed further along PC1 seem to have lower mitochondrial read proportions than other cells in either population.
Oddly enough, none of the top-loading genes are located within the mitochondrial genome.
All in all, I don't think that the PC1 separation is being driven by apoptotic cells.

```{r, message=FALSE}
pc1 <- pca.out$x[,1]
rematch <- match(names(pc1), names(mito.prop))
plot(pc1, mito.prop[rematch], col=ifelse(tcond=="Hom", "red", "blue"), pch=16)
anno <- select(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=top.genes,
    keytype="GENEID", columns="TXCHROM")
sum(anno$TXCHROM=="chrM", na.rm=TRUE)
```

The *t*-SNE plots look a bit better.
I suspect that this is because they place less weight on the effect of outlier expression patterns, which would otherwise affect the choice of PCA axes.
Of course, this doesn't mean that such outliers don't exist.

```{r, message=FALSE}
set.seed(1000)
require(Rtsne)
tsne.out <- Rtsne(t(evals), perplexity=10)
plot(tsne.out$Y[,1], tsne.out$Y[,2], col=ifelse(tcond=="Hom", "red", "blue"), pch=16)
```

Interestingly enough, the cells that are so widely dispersed on the first PC form a distinct cluster in *t*-SNE.
This might indicate that the first PC isn't just a result of outlier cells, but may instead represent some biologically meaningful expression pattern.
```{r, message=FALSE}
chosen <- pca.out$x[,1]*-2 + 40 > pca.out$x[,2]
par(mfrow=c(1,2))
plot(pca.out$x[,1], pca.out$x[,2], col=ifelse(chosen, "grey", "black"), pch=16)
plot(tsne.out$Y[,1], tsne.out$Y[,2], col=ifelse(chosen, "grey", "black"), pch=16)
```

We can see whether we do any better by restricting the clustering to only spermatogenesis-related genes.
It seems that this doesn't help, which suggests that the failure to separate the conditions is fundamentally tied to their sperm-related activities.

```{r, message=FALSE}
sperm <- select(org.Mm.eg.db, key="GO:0007283", keytype="GO", column="ENSEMBL")
sperm.genes <- evals[rownames(evals) %in% sperm$ENSEMBL,]
sperm.out <- prcomp(t(sperm.genes), center=TRUE, scale.=TRUE)
plot(sperm.out$x[,1], sperm.out$x[,2], col=ifelse(tcond=="Hom", "red", "blue"), pch=16)
set.seed(10000)
sperm.tsne <- Rtsne(t(sperm.genes), perplexity=10)
plot(sperm.tsne$Y[,1], sperm.tsne$Y[,2], col=ifelse(tcond=="Hom", "red", "blue"), pch=16)
```

Finally, we have a look at whether the pool/plate effects have any influence.
There is no indication that these effects are responsible for the structure in the PCA or *t*-SNE plots.
One of the regenerating pools (silver) is more dispersed than the other (gold) in the *t*-SNE map - perhaps more of the homeostatic-like cells got sorted into the same pool.

```{r, message=FALSE}
batching <- sub("_15.*", "", colnames(counts))
col <- c("red", "black", "blue", "darkgreen", "grey", "orange")[as.integer(factor(batching))]
plot(pca.out$x[,1], pca.out$x[,2], col=col, pch=16)
plot(tsne.out$Y[,1], tsne.out$Y[,2], col=col, pch=16)
```

# Eliminating cell cycle effects

This is probably a futile effort here, given the mess above.
Besides, wouldn't it remove genuine biological differences, given that you'd expect more cycling in the regenerating condition?

# Differential expression analysis

We use `edgeR` to pick up differential expression between our two conditions, given that `BASiCS` depends on spike-in data (which is absent).
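For reference, `edgeR` models each count as negative binomial with variance mu + phi\*mu^2, where phi is the dispersion; the BCV reported by `plotBCV` is sqrt(phi). A quick numerical illustration (the values of `mu` and `phi` below are made up):

```r
nb.var <- function(mu, phi) mu + phi*mu^2 # NB mean-variance relationship
phi <- 4                                  # a large single-cell-scale dispersion
mu <- c(1, 10, 100)
nb.var(mu, phi) # the quadratic term dominates at higher means
sqrt(phi)       # BCV of 2, i.e., 200% variability between cells
```

This is why large dispersions translate into very wide spreads of counts between cells at the same mean expression.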
There's not much point doing normalization beyond that based on library size, as I'm not convinced that the assumptions of TMM normalization are safe in a single-cell setting.

```{r}
require(edgeR)
y <- DGEList(counts)
#y <- calcNormFactors(y)
boxplot(split(y$samples$lib.size, tcond))
```

We estimate the NB dispersion to model biological variability between cells.
The dispersion estimates are quite large, even for a scRNA-seq experiment (where a common dispersion of 4-5 seems to be the norm).
In general, we would expect more genes at the base or plateau of the mean-dispersion trend.
The size of these dispersions may be indicative of strong technical noise or substantial cellular heterogeneity within each group.

```{r}
design <- model.matrix(~factor(tcond))
y <- estimateDisp(y, design, prior.df=0)
y$common.dispersion
plotBCV(y)
```

We can't distinguish between technical and biological variability as we don't have spike-ins.
If the large dispersions were caused by the latter, we could try to drive down the variability by regrouping cells based on the observed expression patterns.
I'm rather unwilling to do so, though -- that seems like a recipe for generating false positives, especially in the absence of clear clusters.

Well, let's try to get something out of it anyway.
We estimate the QL dispersion for each gene, which allows us to model the variability in the dispersions across different genes.
We then apply the QL F-test to test for significant DE between conditions.

```{r}
fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit)
res <- glmQLFTest(fit)
summary(decideTestsDGE(res))
```

```{r, eval=FALSE, echo=FALSE}
# Quasi-Poisson model; probably liberal, doesn't model the mean-variance
# relationship properly when library sizes are different.
fit1 <- glmQLFit(y, design, dispersion=0)
res1 <- glmLRT(fit1)
fstat <- res1$table$LR/res1$df.test/(res1$deviance/res1$df.residual.zeros)
pval1 <- pmax(pf(fstat, res1$df.test, res1$df.residual.zeros, lower=FALSE), res1$table$PValue)
sum(p.adjust(pval1, method="BH")<=0.05, na.rm=TRUE)
```

That's a very ugly-looking trend, and I'm not sure how reliable it is.
In fact, almost all of the QL dispersion estimates are below unity.
I'm guessing that this is caused by difficulties with getting the NB dispersion trend right when there is extreme variability in the dispersions.
In such cases, `edgeR` will switch to a LRT with the trended dispersion instead.
This is not ideal, as the test results for genes with high dispersions will be liberal.

To avoid this, I use the LRT with the per-gene NB dispersions.
This does not rely on the fitted trend, as the estimate is obtained for each gene in isolation.
The flipside is that it fails to account for estimation uncertainty in the NB dispersions, such that the resulting DE test will likely be liberal.
This bias is mitigated by the presence of many residual d.f. in this data set, which reduces the variance of the dispersion estimates.

```{r}
fit <- glmFit(y, design, dispersion=y$tagwise.dispersion)
res <- glmLRT(fit)
summary(decideTestsDGE(res))
```

We save these results to file, along with some annotation indicating what they are.

```{r, message=FALSE}
all.res <- topTags(res, n=Inf)$table
anno <- select(org.Mm.eg.db, keys=rownames(all.res), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
is.first <- !duplicated(anno$ENSEMBL)
all.res$Symbol <- anno$SYMBOL[is.first]
all.res$Description <- anno$GENENAME[is.first]
write.table(all.res, file="de_out.tsv", row.names=TRUE, quote=FALSE,
    col.names=NA, sep="\t")
```

As it happens, there's a fair number of sperm-related genes that appear to be upregulated in the regenerating condition.
However, I'm not sure whether this is due to genuine upregulation, or whether it is caused by irrelevant differences in population composition (given the variety in *Miwi2* and *Cd9* expression).

```{r, message=FALSE}
sperm <- select(org.Mm.eg.db, key="GO:0007283", keytype="GO", column="ENSEMBL")
of.interest <- all.res$FDR <= 0.05 & (rownames(all.res) %in% sperm$ENSEMBL |
    grepl("sperm", all.res$Description, ignore.case=TRUE))
all.res[of.interest,c("logFC", "Symbol")]
```

# Identifying high-variance genes

## Defining the mean-variance trend

We identify high-variance genes in each population, as potential contributors to population heterogeneity.
Without spike-ins, it's hard to tell whether a large variance is driven by biological factors or by technical noise.
Instead, we just rank the genes based on the variances of the log-CPMs.
The log-transformation provides some variance stabilization, to protect against the fact that the variance increases with the mean for count data.

```{r}
get.var <- function(y) {
    as.cpm <- cpm(y, log=TRUE, prior.count=3)
    ave.ab <- aveLogCPM(y)
    all.var <- apply(as.cpm, 1, FUN=var)
    fit <- loessFit(y=all.var, x=ave.ab)
    smoothScatter(ave.ab, all.var)
    o <- order(ave.ab)
    lines(ave.ab[o], fit$fitted[o], col="red")
    return(list(variance=all.var, ab=ave.ab, fitted=fit$fitted,
        zero=rowSums(y$counts!=0L)))
}
```

Again, we use a large prior count to protect against zero counts.
This shouldn't really matter, as low-abundance genes will have low variances and will generally be ignored.
The above code also lets us observe the empirical mean-variance relationship, though this will be largely ignored; the relative contributions of technical artifacts and genuine biology cannot be determined without spike-in data.

## Fitting for the homeostatic condition

Okay, so we can now do this for the homeostatic population.
We rank based on the variances and save the most variable genes to file.
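As an aside on the variance stabilization mentioned above, its effect can be checked on simulated NB counts with a common dispersion (a standalone sanity check; the parameters are made up):

```r
set.seed(42)
phi <- 0.5 # common dispersion for the simulation
low <- rnbinom(5000, mu=50, size=1/phi)
high <- rnbinom(5000, mu=5000, size=1/phi)
c(var(low), var(high))                     # raw variance explodes with the mean
c(var(log2(low + 1)), var(log2(high + 1))) # log-scale variances are comparable
```

On the raw scale, the high-abundance gene would dominate any variance ranking regardless of its relative variability; on the log scale, the two are on a similar footing.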
Highly variable genes may be markers for distinct subpopulations, though they may also represent uninteresting heterogeneity, e.g., where the expression profile changes gradually across the population due to natural stochasticity.

```{r}
y.hom <- y[,tcond=="Hom"]
hom.var <- get.var(y.hom)
anno <- select(org.Mm.eg.db, keys=rownames(y.hom$counts), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
output <- data.frame(anno[!duplicated(anno$ENSEMBL),], Variance=hom.var$variance,
    AveLogCPM=hom.var$ab, NonZero=hom.var$zero)
output <- output[order(output$Variance, decreasing=TRUE),]
write.table(output, file="hom_var.tsv", quote=FALSE, sep="\t", row.names=FALSE)
head(output)
```

We can have a look at what the GO terms are generally doing for the top set of genes.

```{r gocheck, message=FALSE}
N <- 500
goterms <- select(org.Mm.eg.db, keys=output$ENSEMBL[1:N], keytype="ENSEMBL", column="GO")
top.go <- sort(table(goterms$GO), decreasing=TRUE)[1:20]/N
cbind(select(GO.db, keys=names(top.go), columns="TERM"), Prop=top.go)
```

We can also take the top 1000 most variable genes and use them for clustering.
There seems to be a general increase in expression from left to right, possibly associated with cell size or cycle phase.
Otherwise, there doesn't seem to be any specific substructure.
```{r}
most.var <- output$ENSEMBL[1:1000]
all.cpm.hom <- cpm(y.hom, log=TRUE, prior.count=3)
current.genes <- all.cpm.hom[rownames(y.hom) %in% most.var,]
current.genes <- current.genes - rowMeans(current.genes)
hom.heat <- heatmap.2(current.genes, trace="none", symbreaks=TRUE, col=bluered)
```

```{r, echo=FALSE, eval=FALSE}
new.grouping <- numeric(ncol(y.hom))
new.grouping[unlist(hom.heat$colDendrogram[[2]])] <- 1
design.hom <- model.matrix(~new.grouping)
y.hom <- estimateDisp(y.hom, design.hom, prior.df=0, trend="none")
fit.hom <- glmFit(y.hom, design.hom)
res.hom <- glmLRT(fit.hom)
all.res.hom <- topTags(res.hom, n=Inf)$table
anno.hom <- select(org.Mm.eg.db, keys=rownames(all.res.hom), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
is.first <- !duplicated(anno.hom$ENSEMBL)
all.res.hom$Symbol <- anno.hom$SYMBOL[is.first]
all.res.hom$Description <- anno.hom$GENENAME[is.first]
all.res.hom$PValue <- all.res.hom$FDR <- NULL
write.table(all.res.hom, file="de_hom.tsv", row.names=TRUE, quote=FALSE,
    col.names=NA, sep="\t")
head(anno.hom[is.first,], 20)
```

## Fitting for the regenerating condition

And the same again, for the regenerating condition.

```{r, message=FALSE}
y.reg <- y[,tcond=="Reg"]
reg.var <- get.var(y.reg)
anno <- select(org.Mm.eg.db, keys=rownames(y.reg$counts), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
output <- data.frame(anno[!duplicated(anno$ENSEMBL),], Variance=reg.var$variance,
    AveLogCPM=reg.var$ab, NonZero=reg.var$zero)
output <- output[order(output$Variance, decreasing=TRUE),]
write.table(output, file="reg_var.tsv", quote=FALSE, sep="\t", row.names=FALSE)
head(output)
```

Again, looking at the GO terms:

```{r, ref.label="gocheck", message=FALSE, echo=FALSE}
```

And looking at the effects on clustering.
This time, we get a very clear partition between several clusters.
```{r}
most.var <- output$ENSEMBL[1:1000]
all.cpm.reg <- cpm(y.reg, log=TRUE, prior.count=3)
current.genes <- all.cpm.reg[rownames(y.reg) %in% most.var,]
current.genes <- current.genes - rowMeans(current.genes)
reg.heat <- heatmap.2(current.genes, trace="none", symbreaks=TRUE, col=bluered)
```

Notice that the far-right cluster should be more closely related to the far-left cluster than to the central cluster.
This odd partitioning arises from the large number of genes with gradual variability.
If we reduce the number of highly variable genes used to construct the heatmap, we get more sensible cluster formation.

```{r}
most.var <- output$ENSEMBL[1:100]
current.genes <- all.cpm.reg[rownames(y.reg) %in% most.var,]
current.genes <- current.genes - rowMeans(current.genes)
reg.heat <- heatmap.2(current.genes, trace="none", symbreaks=TRUE, col=bluered)
```

We can peel apart the DE genes involved.
There's a bunch of spermatogenic genes, which is reassuring.
Note that we don't report the *p*-value and FDR, as these cannot be relied upon when the clusters are empirically defined.
```{r, message=FALSE}
new.grouping <- numeric(ncol(y.reg))
new.grouping[unlist(reg.heat$colDendrogram[[2]])] <- 1
design.reg <- model.matrix(~new.grouping)
y.reg <- estimateDisp(y.reg, design.reg, prior.df=0, trend="none")
fit.reg <- glmFit(y.reg, design.reg)
res.reg <- glmLRT(fit.reg)
all.res.reg <- topTags(res.reg, n=Inf)$table
anno.reg <- select(org.Mm.eg.db, keys=rownames(all.res.reg), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
is.first <- !duplicated(anno.reg$ENSEMBL)
all.res.reg$Symbol <- anno.reg$SYMBOL[is.first]
all.res.reg$Description <- anno.reg$GENENAME[is.first]
all.res.reg$PValue <- all.res.reg$FDR <- NULL
write.table(all.res.reg, file="de_reg.tsv", row.names=TRUE, quote=FALSE,
    col.names=NA, sep="\t")
head(anno.reg[is.first,], 20)
```

### Comparing each regenerating subcluster to the homeostatic population

One possibility is that one of the clusters in the regenerating condition simply represents a set of cells that failed to respond to the toxin treatment.
We can test this by examining the number of DE genes relative to the homeostatic condition.
If there was a failure to respond, there should be few DE genes when the corresponding regenerating cluster is compared to the homeostatic population.

```{r}
all.cond <- tcond
all.cond[tcond=="Reg"][new.grouping==1L] <- "Reg2"
re.design <- model.matrix(~all.cond)
re.y <- estimateDisp(y, re.design, prior.df=0, trend="none")
re.fit <- glmFit(re.y, re.design)
```

Fewer DE genes are observed for the first regenerating cluster, consistent with a weakened response in that cluster compared to the second cluster.
That said, there are enough DE genes to suggest some differences from the homeostatic population.
Whether this reflects genuine biology or plate effects is not clear.

```{r, message=FALSE}
res.HvR <- glmLRT(re.fit, coef=2) # homeostatic, against the first cluster.
sig.HvR <- decideTestsDGE(res.HvR)
summary(sig.HvR)
res.HvR2 <- glmLRT(re.fit, coef=3) # homeostatic, against the second cluster.
sig.HvR2 <- decideTestsDGE(res.HvR2)
summary(sig.HvR2)
overall.sig <- sapply(-1:1, FUN=function(x) {
    sapply(-1:1, FUN=function(y) {
        sum(sig.HvR==x & sig.HvR2==y)
    })
})
colnames(overall.sig) <- rownames(overall.sig) <- -1:1
overall.sig
```

These DE lists can be dumped to file, for posterity.
Again, ignore the *p*-value and FDR, as they make no sense for empirically defined groupings.

```{r, message=FALSE}
all.res.HvR <- topTags(res.HvR, n=Inf)$table
anno.HvR <- select(org.Mm.eg.db, keys=rownames(all.res.HvR), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
is.first <- !duplicated(anno.HvR$ENSEMBL)
all.res.HvR$Symbol <- anno.HvR$SYMBOL[is.first]
all.res.HvR$Description <- anno.HvR$GENENAME[is.first]
all.res.HvR$PValue <- all.res.HvR$FDR <- NULL
write.table(all.res.HvR, file="de_HvR.tsv", row.names=TRUE, quote=FALSE,
    col.names=NA, sep="\t")
head(anno.HvR[is.first,], 20)

all.res.HvR2 <- topTags(res.HvR2, n=Inf)$table
anno.HvR2 <- select(org.Mm.eg.db, keys=rownames(all.res.HvR2), keytype="ENSEMBL",
    columns=c("SYMBOL", "GENENAME"))
is.first <- !duplicated(anno.HvR2$ENSEMBL)
all.res.HvR2$Symbol <- anno.HvR2$SYMBOL[is.first]
all.res.HvR2$Description <- anno.HvR2$GENENAME[is.first]
all.res.HvR2$PValue <- all.res.HvR2$FDR <- NULL
write.table(all.res.HvR2, file="de_HvR2.tsv", row.names=TRUE, quote=FALSE,
    col.names=NA, sep="\t")
head(anno.HvR2[is.first,], 20)
```

# Examining characteristics of the newly defined subpopulations

We can have a look at how these clusters separate on the *t*-SNE map.
As you might expect, the clustering is consistent with the previously observed partition.
The first regenerating cluster consists of cells that are located close to the homeostatic population on the map.

```{r}
new.colors <- c(Hom="red", Reg="blue", Reg2="grey50")[all.cond]
plot(tsne.out$Y[,1], tsne.out$Y[,2], col=new.colors, pch=16)
```

We can also have a look at the mitochondrial proportions again.
It looks like the first regenerating subcluster has higher proportions.
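As an aside, the nested `sapply` used earlier to cross-tabulate the DE calls can be written more directly with `table()` on factors; a standalone sketch with made-up calls in {-1, 0, 1}:

```r
callA <- c(-1, 0, 0, 1, 1, 0, -1, 0, 1, 0) # made-up decideTestsDGE-style calls
callB <- c(-1, 0, 1, 1, 0, 0,  0, 0, 1, -1)
tab <- table(factor(callA, levels=-1:1), factor(callB, levels=-1:1))
tab # rows: calls in comparison A; columns: calls in comparison B
```

Coercing to factors with explicit levels keeps empty categories in the table, matching the 3x3 layout above.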
Perhaps stress is driving the formation of the two subclusters within the regenerating condition.

```{r}
rematch <- match(colnames(counts), names(mito.prop))
boxplot(split(mito.prop[rematch], all.cond), ylab="Mitochondrial proportions")
```

Finally, we can construct a big heatmap with all surviving cells.
Genes are chosen for this heatmap based on their variability across the entire data set.
Note that this particular list won't provide additional information beyond the DE lists that have already been generated; it is constructed purely to select genes for visualization.

```{r}
all.var <- get.var(y)
most.var <- rank(-all.var$variance) <= 1000
current.genes <- cpm(y[most.var,], log=TRUE, prior.count=3)
current.genes <- current.genes - rowMeans(current.genes)
heatmap.2(current.genes, trace="none", symbreaks=TRUE, col=bluered,
    ColSideColors=new.colors)
```

The colors on the top indicate the assigned (sub)cluster for each cell.
Note that the second regenerating subcluster is split into two, due to the aforementioned issues involving a greater number of genes with weak differences within that subcluster.
It also seems that some of the cells in the homeostatic population exhibit expression profiles similar to the second regenerating population.
This is consistent with the presence of a couple of homeostatic cells in that subcluster on the *t*-SNE map.

# Session information

```{r}
sessionInfo()
```