--- title: A simple analysis of single-cell RNA seq data with Bioconductor packages author: Aaron Lun date: "`r Sys.Date()`" output: BiocStyle::html_document: fig_caption: false --- ```{r style, echo=FALSE, results='hide', message=FALSE} library(BiocStyle) library(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) #options(bitmapType="cairo", width=100) # if transparencies don't work on your machine. ``` # Introduction This practical will perform some of the initial steps in a basic scRNA-seq data analysis, including: - Quality control on the cells - Cell cycle phase assignment - Normalization for cell-specific biases - Modelling technical noise - Dimensionality reduction - Some visualization and clustering - Differential gene detection This will use R version 3.5.0 or higher, with a number of Bioconductor packages. If you haven't downloaded and installed them already, you can do so by running the code below. **This only needs to be done once** - the packages will be on your computer once installed, and can be loaded with `library`. ```{r, eval=FALSE} source("https://bioconductor.org/biocLite.R") biocLite(c("knitr", "BiocStyle", "org.Mm.eg.db", "scater", "Rtsne", "TxDb.Mmusculus.UCSC.mm10.ensGene", "scran", "pheatmap")) ``` # Setting up the data ## Reading in the counts To demonstrate, we'll use a subset of the data from a study of mouse embryonic stem cells (mESCs). These mESCs were cultured under various conditions - 2i (ground state), lif (serum) and a2i (alternative ground state). To keep things simple, I've only taken only cells in the single batch that contains spike-in transcripts. A more detailed description of the study is available at http://www.ebi.ac.uk/teichmann-srv/espresso/. Our first task is to read in the counts and the associated metadata. Here, both files are stored in the CSV format so we can just use the `read.csv` command. (The `.gz` suffix just indicates that it is compressed, which is automatically handled by `read.csv`.) ```{r} count.data <- read.csv("es_data.csv.gz", header=TRUE, row.names=1) head(count.data[,1:10]) ``` The original study used _HTSeq_ to assign reads to genes to obtain gene counts in each cell. It's worth pointing out that _HTSeq_ puts some gunk at the end of the count matrix, e.g., number of unassigned reads. ```{r} tail(count.data[,1:10]) ``` We throw these out because they're not counts for actual genes. ```{r} count.data <- count.data[!grepl("^_", rownames(count.data)),] tail(count.data[,1:10]) ```

**Exercise:** Regular expressions: what does the caret do? ```{r} grepl("^_", "__UNMAPPED") ``` ```{r} grepl("_", "I_AM_A_GENE") ``` ```{r} grepl("^_", "I_AM_A_GENE") ```

## Organizing the cell-based metadata We read in the metadata: ```{r} metadata <- read.csv("metadata.csv", header=TRUE, row.names=1) head(metadata) ``` It's a good idea to check that the metadata actually matches up with the count data. The `match()` command below ensures that the ordering of metadata rows are the same as the ordering of count columns. ```{r} m <- match(colnames(count.data), rownames(metadata)) metadata <- metadata[m,] head(metadata) ``` This information can be stored alongside the counts in a `SingleCellExperiment` object. By storing everything together, we avoid book-keeping errors, e.g., when one matrix is subsetted and the other is not. ```{r} library(SingleCellExperiment) sce <- SingleCellExperiment( list(counts=as.matrix(count.data)), colData=metadata) sce ```

**Exercise:** What column metadata are available for this dataset? ```{r} #H# sce #A# colData(sce) ``` How can you get column metadata values for each cell? ```{r} #H# sce #A# sce$Culture ``` How do you get the counts? ```{r} # str() => structure of the object #H# str() #A# str(counts(sce)) ```

## Adding gene-based annotation We pull out annotation from `r Biocpkg("org.Mm.eg.db")` to relate the ENSEMBL identifiers to the gene symbols. The `mapIds()` function ensures that only one gene symbol is returned if two symbols map to the same Ensembl ID. ```{r} library(org.Mm.eg.db) symbols <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL") anno <- data.frame(ENSEMBL=rownames(sce), SYMBOL=symbols, stringsAsFactors=FALSE) head(anno) ``` To identify which rows correspond to mitochondrial genes, we need to use extra annotation describing the genomic location of each gene. For Ensembl, this involves using the `r Biocpkg("TxDb.Mmusculus.UCSC.mm10.ensGene")` package. ```{r} library(TxDb.Mmusculus.UCSC.mm10.ensGene) location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rownames(sce), column="CDSCHROM", keytype="GENEID") anno$CHR <- location table(anno$CHR) ``` We add all of this information to our `SingleCellExperiment` object. ```{r} rowData(sce) <- anno rowData(sce) ```

**Exercise:** How do you recover the row metadata for each gene? ```{r} #H# sce #A# head(rowData(sce)$SYMBOL) ``` How do you add an extra row metadata variable? ```{r} BLAH <- runif(nrow(sce)) #H# rowData(sce) #A# rowData(sce)$BLAH <- BLAH #A# rowData(sce) ```

Identification of rows that correspond to spike-in transcripts is much easier, given that the ERCC spike-ins were used. (If you're doing this on the gene symbols, beware of the human gene family that also starts with "ERCC".) ```{r} is.spike <- grepl("^ERCC", rownames(sce)) sum(is.spike) ``` Note that we need to explicitly indicate that the ERCC set is, in fact, a spike-in set. This is necessary as spike-ins require special treatment in some downstream steps such as variance estimation and normalization. ```{r} isSpike(sce, "ERCC") <- is.spike sce ``` To make things easier to interpret, we'll use the gene symbols as row names. This requires some fiddling to avoid non-unique gene symbols (in which case we paste the Ensembl ID after it) or missing gene symbols (in which case we use the Ensembl ID). `r Biocpkg("scater")` provides the `uniquifyFeatureNames()` function to do this conveniently: ```{r} library(scater) rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL, rowData(sce)$SYMBOL) head(rownames(sce), 50) ``` **** If you're releasing your own data with a count table, be sure to put a stable ID as the row name, e.g., Ensembl or Entrez. People can always easily convert from these IDs to gene symbols; it is much harder to go from symbols to umambiguous IDs. (And put _just_ the ID; don't paste the ID with something else, as this makes life difficult for people who want to use your data.) **** ## Quality control on the cells We need to remove low-quality cells to ensure that technical effects do not distort downstream analyses. We have a number of simple metrics that we use to assess quality: - Library size - Number of expressed features - Proportion of spike-in reads - Proportion of mitochondrial reads (if spike-ins are not available) Together, these catch failures in cDNA capture or sequencing, especially for the endogenous transcripts. Spike-in information is automatically extracted from the `SingleCellExperiment` object, but we need to explicitly specify genes on the mitochondrial genome: ```{r} is.mito <- which(rowData(sce)$CHR == "chrM") is.mito ```

**Exercise:** Why which()? See this neat trick: ```{r} Y <- LETTERS[1:5] X <- c(TRUE, NA, FALSE, NA, TRUE) Y[X] ``` ```{r} Y[which(X)] ``` Why use it with 'is.mito'? ```{r} #H# rowData(sce)$CHR #A# sum(is.na(rowData(sce)$CHR)) ```

For each cell, we calculate quality control metrics such as the total number of counts or the proportion of counts in mitochondrial genes or spike-in transcripts. These are stored in the `colData` of the `SingleCellExperiment` for future reference. ```{r} library(scater) sce <- calculateQCMetrics(sce, feature_controls=list(Mt=is.mito)) head(colnames(colData(sce))) ``` We create plots of these statistics below for each culture condition. ```{r qualplot, fig.height=10, fig.width=10} multiplot(cols=2, plotColData(sce, x="Culture", y="log10_total_counts"), plotColData(sce, x="Culture", y="total_features_by_counts"), plotColData(sce, x="Culture", y="pct_counts_ERCC"), plotColData(sce, x="Culture", y="pct_counts_Mt") ) ``` We assume that, in each culture condition, most cells are of high quality. Cells with outlier values for each of the QC metrics are identified based on some number of MADs from the median value. ```{r} low.lib <- isOutlier(sce$total_counts, log=TRUE, nmads=3, type="lower", batch=sce$Culture) # using log-values, here. low.nfeatures <- isOutlier(sce$total_features_by_counts, log=TRUE, nmads=3, type="lower", batch=sce$Culture) high.ercc <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher", batch=sce$Culture) data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.nfeatures), HighSpike=sum(high.ercc)) ```

**Exercise:** What does `batch=` do in `isOutlier()`? ```{r} x <- c(1,2,0,1,3,2,1,3,0,1,3,2,4,2,0,1,1,5,2,50,52,53,51) b <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2) #H# isOutlier(x) #A# isOutlier(x, batch=b) # last four are outliers #A# isOutlier(x, batch=b) # last four are not outliers ```

We only retain cells that pass all of the specified criteria. Of course, this involves some assumptions about independence from biology. (For example, don't use the mitochondrial proportions if the number/activity of mitochondria changes between cell types.) ```{r} discard <- low.lib | low.nfeatures | high.ercc data.frame(TotalLost=sum(discard), TotalLeft=sum(!discard)) ``` We toss out the cells that we consider to be low-quality, and keep the rest. Here, most cells are retained - which makes sense, as some QC was already applied to the published data. ```{r} sce <- sce[,!discard] ncol(sce) ``` Of course, more sophisticated QC procedures can be used, e.g., `r Biocpkg("cellity")`. We stick to the simple stuff because it's easier to interpret and troubleshoot. # Classification of cell cycle phase We use the `cyclone` method to classify cells into different cell cycle phases (http://dx.doi.org/10.1016/j.ymeth.2015.06.021). This was previously trained on some mouse data with known cell cycle classifications - we can get the trained classifier with `mm.pairs`. ```{r} library(scran) mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran")) str(mm.pairs) ``` The method uses a randomization step, so we use `set.seed()` to obtain consistent results from different runs. ```{r} set.seed(100) assignments <- cyclone(sce, mm.pairs, gene.names=rowData(sce)$ENSEMBL) plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16) ``` Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. It seems that most cells here are in the S phase.

**Exercise:** ```{r} # How do we figure out the specific calls? #H# assignments #A# table(assignments$phases) ```

We store the cell cycle phases for future use, e.g., to check whether any downstream results are driven by cell cycle effects. Phase assignment is difficult so I generally don't use the assignments for anything more than diagnostics. ```{r} sce$cycle_phase <- assignments$phases tail(colnames(colData(sce))) ``` # Examining the genes We inspect the distribution of log-mean counts across all genes. The peak represents the bulk of moderately expressed genes while the rectangular component corresponds to lowly expressed genes. ```{r} ave.counts <- calcAverage(sce) hist(log10(ave.counts), breaks=100, main="", col="grey80", xlab=expression(Log[10]~"average count")) ``` We also look at the identities of the most highly expressed genes. This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. ```{r} plotHighestExprs(sce, n=50) ``` We can also have a look at the number of cells expressing each gene. This is usually well-correlated to the average expression of each gene. ```{r} numcells <- nexprs(sce, byrow=TRUE) smoothScatter(log2(ave.counts), numcells, xlab=expression(Log[2]~"average count"), ylab="Number of expressing cells") ``` We discard genes that are not expressed in any cell, as these are obviously uninformative. ```{r} sce <- sce[numcells > 0,] summary(numcells > 0) ``` # Normalization of cell-specific biases ## For the endogenous genes We apply the deconvolution method (https://dx.doi.org/10.1186/s13059-016-0947-7) to eliminate biases in the counts for the endogenous transcripts. This computes size factors for each cell representing the scaling bias in the counts. Some filtering is required to remove low-abundance genes prior to normalization, see `min.mean=`. ```{r} sce <- computeSumFactors(sce, min.mean=1) summary(sizeFactors(sce)) ```

**Exercise:** It's a good idea to plot the size factors against the library sizes as a sanity check. There should be some positive correlation. ```{r} #H# plot( , #A# plot(sizeFactors(sce), sce$total_counts/1e6, log="xy", ylab="Library size (millions)", xlab="Size factor") ```

For highly heterogeneous data sets with multiple cell types, we should cluster first with `quickCluster()`, and then normalize within each cluster with `cluster=`. This avoids trying to pool very different cells together during the deconvolution process. We don't bother doing this here, as all the cells are mESCs. ## Computing separate size factors for spike-in transcripts We need to normalize spike-in transcripts separately as they are not subject to all biases affecting endogenous transcripts - in particular, total RNA content. Applying the endogenous size factors would over-normalize, so we define separate size factors for the spike-ins. This is simply defined for each cell as the total count across all transcripts in the spike-in set. ```{r} sce <- computeSpikeFactors(sce, type="ERCC", general.use=FALSE) summary(sizeFactors(sce, "ERCC")) ``` These size factors are stored in a separate field of the `SingleCellExperiment` object by setting `general.use=FALSE` in `computeSpikeFactors`. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.

**Exercise:** The two sets of size factors tend to agree less due to the effects of heterogeneity in total RNA content between cells - this is expected. ```{r} #H# plot( , #A# plot(sizeFactors(sce, 'ERCC'), sizeFactors(sce), log="xy", xlab="Size factor (ERCC)", ylab="Size factor (genes)") ``` Note that if you _do_ want to use the spike-in size factors to normalize all genes, set `general.use=TRUE` instead.

## Applying the size factors to normalize gene expression Counts are transformed into normalized log-expression values for use in downstream analyses. Each value is defined as the log-ratio of each count to the size factor for the corresponding cell, after adding a pseudo-count of 1 to avoid undefined values at zero counts. Division of the counts for each gene by its appropriate size factor ensures that any cell-specific biases are removed. ```{r} sce <- normalize(sce) sce ``` The log-transformation provides some measure of variance stabilization, so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an `logcounts` matrix in addition to the other assay elements.

**Exercise:** How do we find the available assays? ```{r} #H# sce #A# assayNames(sce) ``` How to we get a particular assay? ```{r} #H# sce #A# str(assay(sce, 'logcounts')) ``` How do we get the logcounts? ```{r} #H# sce #A# str(logcounts(sce)) ```

# Modelling the technical component of variation We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. ```{r} var.fit <- trendVar(sce, method="loess", loess.args=list(span=0.2)) var.out <- decomposeVar(sce, var.fit) head(var.out) ``` We can have a look at the fitted trend to the spike-in variances. Some tinkering may be required to get a good fit, usually by modifying `span=`. ```{r hvgplothsc} plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", ylab="Variance of log-expression") curve(var.fit$trend(x), add=TRUE, col="dodgerblue", lwd=2) cur.spike <- isSpike(sce) points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16) ``` If you don't have spike-ins, you can fit the trend to the variances of the genes with `use.spikes=FALSE` (but this probably overestimates the technical component).

**Exercise:** It's wise to check the distribution of expression values for the top HVGs to ensure that the variance estimate is not being dominated by one or two outlier cells. ```{r} top <- rownames(var.out)[order(var.out$bio, decreasing=TRUE)[1:10]] #H# plotExpression() #A# plotExpression(sce, features = top) ```

An alternative approach is to use `technicalCV2`, which implements the method described by Brennecke _et al._ (http://dx.doi.org/10.1038/nmeth.2645). This has some pros and cons compared to the log-variance method described above. # Dimensionality reduction based on the technical noise We use all genes with a positive biological component in `denoisePCA`. This performs a principal components analysis on the expression profiles, choosing the number of PCs to retain based on the total technical noise in the data set. The idea is to discard later PCs that contain random technical noise, thus enriching for early biological signal (and also reducing work in downstream steps). ```{r} sce <- denoisePCA(sce, technical=var.fit$trend) sce ``` We can have a look at the PCs directly: ```{r} pcs <- reducedDim(sce) # stored in the object dim(pcs) # Cells are rows, PCs are columns ``` Creating pairwise plots between the first four PCs: ```{r pcaplothsc, fig.height=10, fig.width=10} plotPCA(sce, ncomponents=4, colour_by="Culture") ```

**Exercise:** What if we want to look at the proportion of variance explained by each PC? ```{r} prop.var <- attr(pcs, "percentVar") #H# plot(.var, xlab="PC", ylab="Proportion of variance explained") #A# plot(prop.var, xlab="PC", ylab="Proportion of variance explained") ```

We also use _t_-SNE, which is very good at displaying distinct clusters of cells and resolving complex structure. Note that we use the PCA results as "denoised expression values" for input into downstream functions like _t_-SNE. This is valid as it is the distance between cells that is important. ```{r} sce <- runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100) plotTSNE(sce, colour_by="Culture") ``` However, _t_-SNE is stochastic and more complicated than PCA. Testing different settings of the "perplexity" parameter is recommended, as well as running multiple times to check that the conclusions are the same. (Check out http://distill.pub/2016/misread-tsne/ for examples of odd _t_-SNE behaviour.) # Clustering into putative subpopulations Clearly there's some structure here. Here, we know that they're associated with the different culture conditions, but if we didn't we'd have to cluster the cells. A quick and dirty approach with hierarchical clustering on Euclidean distances: ```{r} my.dist <- dist(pcs) # Making a distance matrix my.tree <- hclust(my.dist, method="ward.D2") # Building a tree plot(my.tree) ``` Cutting the tree into three clusters, and seeing how these correspond to the known culture conditions: ```{r} my.clusters <- cutree(my.tree, k=3) table(my.clusters, sce$Culture) ```

**Exercise:** How do I make a _t_-SNE plot coloured by cluster? ```{r} sce$Cluster <- factor(my.clusters) #H# plotTSNE() #A# plotTSNE(sce, colour_by='Cluster') ```

We use a heatmap to visualize the expression profiles of the top 100 genes with the largest biological components. We see "blocks" in expression that correspond nicely to the known culture conditions. We possibly could have subclustered further, in which case we would subset and repeat the above process. ```{r} plotHeatmap(sce, features=rownames(var.out)[order(var.out$bio, decreasing=TRUE)[1:100]], cluster_cols=my.tree, colour_columns_by="Culture", center=TRUE, symmetric=TRUE) ``` When clustering, it is often useful to look at silhouette plots to assess cluster separatedness. Each bar corresponds to a cell, and is proportional to the difference in the average distances to all other cells in the same cluster versus cells in the nearest neighbouring cluster. A good gauge for the number of clusters is that which maximizes the average silhouette width. ```{r silhouette, fig.width=10, fig.height=10} library(cluster) par(mfrow=c(2,2)) for (k in 2:5) { example.clusters <- cutree(my.tree, k=k) sil <- silhouette(example.clusters, dist=my.dist) plot(sil, col=rainbow(k)[sort(sil[,1])]) } ``` Other options for clustering are: - Use the `cutreeDynamic()` function in the `r CRANpkg("dynamicTreeCut")` package, for toplogy-aware cutting of the tree. - Use graph-based methods such as `buildSNNGraph()` or `buildKNNGraph()`, followed by clustering methods from `r CRANpkg("igraph")`. - Use methods with pre-specified number of clusters, e.g., k-means with `kmeans()` and `r Biocpkg("SC3")`, self-organizing maps in `r Biocpkg("flowSOM")`. # Identifying marker genes between subpopulations We use the `findMarkers` function to detect differences between clusters. This will perform pairwise DE analyses between clusters, and consolidate the results into a single table of marker genes per cluster. The tricky part is how to summarize results from many pairwise comparisons into a single ranking of genes. ```{r} out <- findMarkers(sce, clusters=my.clusters) ``` Consider the results of cluster 2: ```{r} marker.set <- out[[2]] # Marker set for cluster 2 head(marker.set) ```

**Exercise:** What does the `Top` parameter represent? ```{r} marker.set[marker.set$Top <= 1,] ``` Try it yourself: ```{r} #H# marker.set[marker.set$Top <= ,] #A# marker.set[marker.set$Top <= 10,] ```

We can visualize this more clearly with a heatmap of the top 10 genes for the pairwise comparisons. ```{r} plotHeatmap(sce, features=rownames(marker.set)[marker.set$Top <= 10], cluster_cols=my.tree, colour_columns_by="Cluster", center=TRUE, symmetric=TRUE) ``` A valid alternative strategy is to detect marker genes that are uniquely up-regulated or down-regulated in each cluster (set `pval.type="all"`). However, be aware that no such genes may exist. For example, in a mixed population of T cells, you could have CD4^+^ T cells, CD8^+^ T cells, double negative and double positive cells. If each of these formed a cluster, and we only looked for unique genes, neither CD4 or CD8 would be detected! # Additional comments It's a good idea to save the `SingleCellExperiment` object to file with the `saveRDS` function. The object can then be easily restored into new R sessions using the `readRDS` function. ```{r} saveRDS(file="data.rds", sce) copy <- readRDS("data.rds") copy ``` Data within it can be extracted and used for more complex analyses. - Droplet-based or UMI data analysis. - Batch correction, see `?mnnCorrect`. - Trajectory reconstruction, see `r Biocpkg("destiny")` and `r Biocpkg("monocle")`. See https://www.bioconductor.org/packages/devel/workflows/vignettes/simpleSingleCell/inst/doc/work-0-intro.html for more details. Meanwhile, show the session information for record-keeping: ```{r} sessionInfo() ```