---
title: "Single-cell RNA-seq of naive/primed embryonic stem cells: naive-only analysis"
author: Tobias Messmer and Aaron Lun
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    depth: 3
    number_sections: true
    theme: united 
    highlight: tango 
    fig_caption: false
---

```{r, echo=FALSE, results="hide"}
dir.create("figure-transition/", showWarning=FALSE)
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE, fig.path="figure-transition/")
```

# Selecting only the naive subpopulations

We select only the naive cells, in which we have previously identified the transition subpopulation
This is more sensitive than doing so on the entire data set, where the naive/primed differences dominate.

```{r naivein}
library(scran)
sce_naive <- readRDS("results-naive/sce_naive.rds")
```

We also set up an output directory for the results.

```{r naiveout}
resdir <- "results-transition"
dir.create(resdir, showWarning=FALSE)
```

# Characterizing cluster 2

## Checking diagnostics

Cluster 2 seems like the most interesting cluster, with a number of cells breaking away from the bulk primed population.
We examine the expression of marker genes for this cluster. 
As cluster 3 consists of only 1 cell, we consider this an outlier that is not representing a different population.

```{r naiveexpclust}
library(scater)
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotExpression(sce_naive, c("KLF4", "KLF17", "DPPA3", "TFCP2L1", "NANOG", "ZFP42", "POU5F1"), 
    x="cluster") + fontsize
```

We also examine the distribution of size factors and number of expressed genes for each cluster.
Nothing particularly untoward.

```{r, fig.width=10, fig.height=6}
par(mfrow=c(1,2))
boxplot(split(sizeFactors(sce_naive), sce_naive$cluster), 
    xlab="Cluster", ylab="Size factor", log="y")
boxplot(split(sce_naive$total_features_by_counts, sce_naive$cluster), 
    xlab="Cluster", ylab="Size factor", log="y")
```

Most cells in this cluster are in batch 2, but so are most cells in general, so that's not really a problem.

```{r}
lapply(split(sce_naive$batch, sce_naive$cluster), table)
```

We also check the cell cycle phase for each of the clusters.
It's a fairly decent mix, so it's unlikely to be driven by cell cycle phase alone.

```{r}
lapply(split(sce_naive$phase, sce_naive$cluster), table)
```

## Identifying DE between clusters 

With these clusters identified, we can assess DE of cluster 2 relative to each other cluster in the naive population using the `findMarkers` function.
We also block on the batch to avoid batch effects.
Note that single-cell clusters are automatically ignored here by the t-test.

```{r marker_clust_naive}
all.markers_naive_up <- findMarkers(sce_naive, clusters=sce_naive$cluster, 
    block=sce_naive$batch, direction="up")
all.markers_naive_down <- findMarkers(sce_naive, clusters=sce_naive$cluster, 
    block=sce_naive$batch, direction="down")
```

We then focus on the DE genes in cluster 2.
We need to rank the up- and down-regulated genes separately due to asymmetry in power when using the Welch t-test for unbalanced designs.

```{r}
trans.cluster <- 2
marker.set_naive_up <- all.markers_naive_up[[as.character(trans.cluster)]]
write.table(marker.set_naive_up, 
    file=file.path(resdir, paste0("markers_", trans.cluster, "_up.tsv")), 
    sep="\t", quote=FALSE, col.names=NA) 
head(marker.set_naive_up)

marker.set_naive_down <- all.markers_naive_down[[as.character(trans.cluster)]]
write.table(marker.set_naive_down, 
    file=file.path(resdir, paste0("markers_", trans.cluster, "_down.tsv")), 
    sep="\t", quote=FALSE, col.names=NA) 
head(marker.set_naive_down)
```

We create a heatmap that shows differentially expressed genes relative to the chosen cluster.
Clearly, some strong differential expression exists.

```{r clust.naiv2, fig.height=10, fig.width=6}
top.markers_naive <- unique(c(
    rownames(marker.set_naive_up)[marker.set_naive_up$Top <= 50],
    rownames(marker.set_naive_down)[marker.set_naive_down$Top <= 50]
))

plotHeatmap(sce_naive, exprs_values="norm_exprs",
    center=TRUE, symmetric=TRUE, features=top.markers_naive,
    columns=order(sce_naive$cluster), cluster_cols=FALSE,
    colour_columns_by="cluster", show_colnames = FALSE, show_rownames=FALSE)
```

## Comparing to all cells

We record the identity of cells in this cluster in the context of the entire data set.

```{r}
trans.cluster <- 2
sce <- readRDS("results-preprocess/sce_all.rds")
all.pheno <- as.character(sce$phenotype)
all.pheno[colnames(sce) %in% colnames(sce_naive)[sce_naive$cluster==trans.cluster]] <- "transition"
write.table(data.frame(Cell=colnames(sce), Type=all.pheno), 
    file=file.path(resdir, "groups.tsv"),
    row.names=FALSE, quote=FALSE, sep="\t")
```

If we map the transition cells back to the whole-data PCA plot, they show up in the middle.
This motivates their description as a "transition" subpopulation.

```{r}
sce$phenotype <- all.pheno
reducedDim(sce, "PCA") <- readRDS("results-overall/pcs.rds")
plotPCA(sce, colour_by="phenotype") + fontsize
```

# Identifying markers relative to primed/naive cells

We do the same for the transition population against the other naive or primed cells in general.
This is done using _limma_, which seems to provide more protection against outliers than _edgeR_ as you go down the list.

```{r}
library(limma)
collected <- list()
for (mode in c("naive", "primed")) {
    current <- all.pheno %in% c("transition", mode)
    curE <- exprs(sce)[!isSpike(sce),current]
    keep <- rowMeans(curE) > 0.1 # modest filtering to avoid distorted mean-var trend.
    
    curB <- factor(sce$batch[current])
    curP <- factor(all.pheno[current])
    curX <- model.matrix(~0 + curP + curB)

    fit <- lmFit(curE, curX)
    con <- makeContrasts(contrasts=sprintf("curPtransition - curP%s", mode), levels=curX)
    fit2 <- contrasts.fit(fit, con)
    fit2 <- eBayes(fit2, robust=TRUE, trend=TRUE)
    res <- topTable(fit2, n=Inf, sort.by="p")

    collected[[mode]] <- res
    write.table(file=file.path(resdir, sprintf("markers_trans_vs_%s.tsv", mode)), res, 
        sep="\t", quote=FALSE, col.names = NA) 
}
```

We briefly examine the top candidate markers in each comparison.
There's both up- and down-regulation, which indicates that the transition population isn't just a bunch of dead cells with low global expression.

```{r}
table(sign(collected$naive$logFC[1:100]))
head(collected$naive)
table(sign(collected$primed$logFC[1:100]))
head(collected$primed)
```

We make a heatmap to visualize the expression of these candidates.

```{r heatNvPvT}
use.markers <- unique(c(rownames(collected$naive)[1:50], rownames(collected$primed)[1:50]))
o <- order(all.pheno)
top.exprs <- norm_exprs(sce)[use.markers,o,drop=FALSE]
heat.vals <- top.exprs - rowMeans(top.exprs)

library(pheatmap)
clust.cols <- c("black", "grey", "red")
pheatmap(heat.vals, cluster_cols=FALSE, 
    annotation_col=data.frame(Phenotype=all.pheno[o], row.names=colnames(top.exprs)),
    show_rownames = FALSE, show_colnames = FALSE,
    annotation_colors=list(Cluster=setNames(clust.cols, unique(all.pheno))))
```

We also have a look at the top genes that are uniquely expressed in the transition population.
These genes must be DE against both naive and primed cells, _and_ in the same direction against both.

```{r}
norder <- collected$naive[order(rownames(collected$naive)),]
porder <- collected$primed[order(rownames(collected$primed)),]
maxp <- pmax(porder$P.Value, norder$P.Value)
maxp <- ifelse(sign(norder$logFC)==sign(porder$logFC), maxp, 1)
unique.marker <- data.frame(vsNaive=norder$logFC, vsPrimed=porder$logFC,
    AveExpr=norder$AveExpr, P.Value=maxp, row.names=rownames(porder))
unique.marker <-  unique.marker[order(maxp),]
write.table(unique.marker, file=file.path(resdir, "markers_unique_trans.tsv"),
    sep="\t", quote=FALSE, col.names=NA)
head(unique.marker, 20)
```

# Gene set analysis for unique transition genes

We do gene set analyses on the top set of genes that are uniquely expressed in this transition population.
First, gene ontology terms:

```{r naivego}
unique.genes <- rownames(unique.marker)[1:200]
is.unique <- rownames(sce) %in% unique.genes
entreznames <- as.character(rowData(sce)$entrezgene)
has.entrez <- !is.na(entreznames)
go <- goana(entreznames[has.entrez & is.unique], universe=entreznames[has.entrez], species="Hs")
topgo <- topGO(go, n=Inf, ontology = "BP") 
write.table(file=file.path(resdir, "go_unique_trans.tsv"), topgo, sep="\t", quote=FALSE, col.names = NA) 
head(topgo)
```

And again with KEGG terms:

```{r naivekegg}
kegg <- kegga(entreznames[has.entrez & is.unique], universe=entreznames[has.entrez], species="Hs")
topkegg <- topKEGG(kegg, n=Inf)
write.table(file=file.path(resdir, "kegg_unique_trans.tsv"), topkegg, sep="\t", quote=FALSE, col.names = NA) 
head(topkegg)
```

We also do this for the top set of genes that are DE between the transition and naive populations.

```{r geneset_vs_naive}
vs.naive.genes <- rownames(collected$naive)[1:200]
vs.naive <- rownames(sce) %in% vs.naive.genes
go <- goana(entreznames[has.entrez & vs.naive], universe=entreznames[has.entrez], species="Hs")
topgo <- topGO(go, n=Inf, ontology = "BP") 
write.table(file=file.path(resdir, "go_trans_vs_naive.tsv"), topgo, sep="\t", quote=FALSE, col.names = NA) 
head(topgo)
kegg <- kegga(entreznames[has.entrez & vs.naive], universe=entreznames[has.entrez], species="Hs")
topkegg <- topKEGG(kegg, n=Inf)
write.table(file=file.path(resdir, "kegg_trans_vs_naive.tsv"), topkegg, sep="\t", quote=FALSE, col.names = NA) 
head(topkegg)
```

We repeat this for the top set of genes that are DE between transition and primed populations.

```{r geneset_vs_primed}
vs.primed.genes <- rownames(collected$primed)[1:200]
vs.primed <- rownames(sce) %in% vs.primed.genes
go <- goana(entreznames[has.entrez & vs.primed], universe=entreznames[has.entrez], species="Hs")
topgo <- topGO(go, n=Inf, ontology = "BP") 
write.table(file=file.path(resdir, "go_trans_vs_primed.tsv"), topgo, sep="\t", quote=FALSE, col.names = NA) 
head(topgo)
kegg <- kegga(entreznames[has.entrez & vs.primed], universe=entreznames[has.entrez], species="Hs")
topkegg <- topKEGG(kegg, n=Inf)
write.table(file=file.path(resdir, "kegg_trans_vs_primed.tsv"), topkegg, sep="\t", quote=FALSE, col.names = NA) 
head(topkegg)
```

# Looking for imprinted genes

Here, we examine a set of imprinted genes supplied by Ferdinand.
The idea is to see if these genes change in expression between naive, transition and primed cells.
If the transition cells have naive-like levels of these genes, it suggests that they once were naive cells themselves.

```{r}
imprinted <- c("ENSG00000162595", # DIRAS3
               "ENSG00000177432", # NAP1L5
               "ENSG00000145945", # FAM50B 
               "ENSG00000118495", # PLAGL1
               "ENSG00000197081", # IGF2R
               "ENSG00000106070", # GRB10
               "ENSG00000242265", # PEG10
               "ENSG00000106484", # MEST
               "ENSG00000167632", # TRAPPC9
               "ENSG00000198825", # INPP5F
               "ENSG00000139687", # RB1
               "ENSG00000179455", # MKRN3
               "ENSG00000265673", # MIR4508
               "ENSG00000254585", # MAGEL2
               "ENSG00000182636", # NDN
               "ENSG00000128739", # SNRPN
               "ENSG00000273173", # SNURF
               "ENSG00000140443", # IGF1R
               "ENSG00000130844", # ZNF331
               "ENSG00000198300", # PEG3
               "ENSG00000101898", # MCTS2P
               "ENSG00000101294", # HM13
               "ENSG00000166619", # BLCAP
               "ENSG00000053438", # NNAT
               "ENSG00000185513", # L3MBTL
               "ENSG00000235590", # GNAS-AS1
               "ENSG00000087460", # GNAS
               "ENSG00000204186", # ZDBF2
               "ENSG00000130600", # H19
               "ENSG00000167244", # IGF2
               "ENSG00000214548", # MEG3
               "ENSG00000167981", # ZNF597
               "ENSG00000122390") # NAA60
```

Making the heatmap of expression values for these guys.
It seems like there's changes in expression both ways, not just an increase in naive cells upon loss of imprinting.
Apparently, this is due to loss of imprinting for the antisense locus that leads to silencing of the corresponding gene.

```{r, fig.height=10, fig.width=6}
o <- order(all.pheno)
top.exprs <- norm_exprs(sce)[rowData(sce)$ensembl_gene_id %in% imprinted,o,drop=FALSE]
heat.vals <- top.exprs - rowMeans(top.exprs)

clust.cols <- c("black", "grey", "red")
pheatmap(heat.vals, cluster_cols=FALSE, 
    annotation_col=data.frame(Phenotype=all.pheno[o], 
        row.names=colnames(top.exprs)),
    annotation_colors=list(Cluster=setNames(clust.cols, unique(all.pheno))))
```

# Wrapping up

We save our object for later use and report the session information.

```{r}
sessionInfo()
```