---
title: "Single-cell RNA-seq of naive/primed embryonic stem cells: primed-only analysis"
author: Tobias Messmer and Aaron Lun
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    depth: 3
    number_sections: true
    theme: united 
    highlight: tango 
    fig_caption: false
---

```{r, echo=FALSE, results="hide"}
dir.create("figure-primed/", showWarning=FALSE)
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE, fig.path="figure-primed/")
```

# Selecting only the primed subpopulations

We select only the primed cells and attempt to identify subpopulations within the primed condition.
This is more sensitive than doing so on the entire data set, where the naive/primed differences dominate.

```{r primedin}
library(scran)
sce <- readRDS("results-preprocess/sce_all.rds")
sce_primed <- sce[,colData(sce)$phenotype=="primed"] 
dim(sce_primed)
```

We re-normalize to rescale the size factors at unity, so that they are comparable between spike-ins and endogenous genes.

```{r}
sce_primed <- normalize(sce_primed)
```

We can examine the expression of the various ESC markers in more detail.

```{r primedExpMarker}
library(scater)
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotExpression(sce_primed, c("KLF4", "KLF17", "DPPA3", "TFCP2L1", "NANOG", "ZFP42", "POU5F1"),
    colour_by="batch") + fontsize
```

We also set up an output directory for the results.

```{r primedout}
resdir <- "results-primed"
dir.create(resdir, showWarning=FALSE)
```

# Detecting primed-only HVGs

We detect HVGs in the primed population, using the variance of the normalized log-expression values.
A trend is fitted to the spike-in variances and represents the technical aspect of variability.
This is done separately for each batch due to the differences in spike-in quantity.
The trend is okay, though some deviation of the spike-ins is attributable to amplification-induced variability.

```{r primed.vartrend, fig.height=6, fig.width=10}
sce_primed <- multiBlockNorm(sce_primed, sce_primed$batch)
var.out_primed <- multiBlockVar(sce_primed, block=sce_primed$batch,
    trend.args=list(parametric=TRUE, loess.args=list(span=0.4)))

par(mfrow=c(1,2))
is.spike <- isSpike(sce_primed)
for (batch in levels(sce_primed$batch)) {
    cur.out <- var.out_primed$per.block[[batch]]
    plot(cur.out$mean, cur.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
        ylab="Variance of log-expression", main=batch)
    curve(metadata(cur.out)$trend(x), col="dodgerblue", lwd=2, add=TRUE)
    points(cur.out$mean[is.spike], cur.out$total[is.spike], col="red", pch=16)
}
```

We save the resulting table to file.

```{r naive.varplot}
write.table(file=file.path(resdir, "var_primed.tsv"), var.out_primed, sep="\t", quote=FALSE, col.names=NA)
```

Genes with significantly non-zero biological components at a FDR of 5% are defined as HVGs.

```{r var.primed}
hvg.out_primed <- var.out_primed[which(var.out_primed$FDR <= 0.05 & var.out_primed$bio >= 0.5),]
hvg.out_primed <- hvg.out_primed[order(hvg.out_primed$bio, decreasing=TRUE),]
nrow(hvg.out_primed)
write.table(file= file.path(resdir, "hvg_primed.tsv"), hvg.out_primed, sep="\t", quote=FALSE, col.names = NA) 
head(hvg.out_primed)
```

We check that the HVGs are not being driven by outliers or batch effects.

```{r violin.hvg}
plotExpression(sce_primed, rownames(hvg.out_primed)[1:10], point_alpha=0.05, 
    colour_by="batch", jitter_type="jitter") + fontsize
```

# Performing dimensionality reduction

We remove the batch effect prior to performing dimensionality reduction.
The old `norm_exprs` is not sufficient as the batch effect for the primed-only cells may be different from the batch effect of the entire data set.
Note that we're dealing with one phenotype here so there's no need for `design=`.

```{r}
library(limma)
norm_exprs(sce_primed) <- removeBatchEffect(exprs(sce_primed), batch=sce_primed$batch) 
sce_primed <- denoisePCA(sce_primed, var.out_primed, assay.type="norm_exprs")
ncol(reducedDim(sce_primed))
```

We also examine the variance explained by the first few PCs.

```{r}
pc.out <- attr(reducedDim(sce_primed), "percentVar")
plot(pc.out)
```

# Visualization in low-dimensional space

We then create _t_-SNE plots, using only the set of correlated HVGs.

```{r tsne.primed, fig.width=15, fig.height=6}
set.seed(100)
sce_primed <- runTSNE(sce_primed, use_dimred="PCA", perplexity=30)
tsne1_primed <- plotTSNE(sce_primed, colour_by=rownames(hvg.out_primed)[1]) +  fontsize
tsne2_primed <- plotTSNE(sce_primed, colour_by="KLF4") + fontsize
tsne3_primed <- plotTSNE(sce_primed, colour_by="batch") + fontsize
multiplot(tsne1_primed, tsne2_primed, tsne3_primed, cols=3)
```

And again, for PCA plots.

```{r pca.primed, fig.width=15, fig.height=6}
pca1_primed <- plotPCA(sce_primed, colour_by=rownames(hvg.out_primed)[1]) + fontsize
pca2_primed <- plotPCA(sce_primed, colour_by="KLF4") + fontsize
pca3_primed <- plotPCA(sce_primed, colour_by="batch") + fontsize
multiplot(pca1_primed, pca2_primed, pca3_primed, cols=3)
```

Both plots suggest that there is a small subpopulation off the main bulk of cells.

# Clustering to identify putative subpopulations

We perform hierarchical clustering on the Euclidean distance in the PC space.

```{r primedclust}
pcs_primed <- reducedDim(sce_primed, "PCA")
my.dist_primed <- dist(pcs_primed)
my.tree_primed <- hclust(my.dist_primed)
```

To identify the "best" number of clusters, we vary the `k` for tree cutting and examine the resulting silhouette plots.
There's not much evidence for clear separation of the clusters at any `k`.
Many clusters have cells have negative widths that should be allocated to other clusters, indicating that the separation of cells here is quite weak.
As such, we do not perform any clustering.

```{r silhouette, fig.width=10, fig.height=10}
library(cluster)
par(mfrow=c(2,2))
for (k in 2:5){
    my.clusters_primed <- unname(cutree(my.tree_primed, k=k))
    col <- rainbow(k)
    sil <- silhouette(my.clusters_primed, dist = my.dist_primed)
    sil.cols <- col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
    sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
    plot(sil, main = paste(k, "clusters"), border=sil.cols, 
        col=sil.cols, do.col.sort=FALSE)
}
```

# Wrapping up

We save our object for later use and report the session information.

```{r}
saveRDS(sce_primed, file=file.path(resdir, "sce_primed.rds"))
sessionInfo()
```