---
title: Analyzing single-cell RNA-seq data from droplet-based protocols
author: Aaron Lun
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    fig_caption: false
---

```{r style, echo=FALSE, results="hide", message=FALSE}
library(BiocStyle)
library(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
#options(bitmapType="cairo", width=100) # if transparencies don't work on your machine.
```

# Overview

Droplet-based scRNA-seq protocols encapsulate cells in water-in-oil droplets for massively multiplexed library preparation.
We will be examining how to analyze this type of data, using a publicly available data set generated by 10X Genomics.
We assume that you have already taken the previous workshop describing the basics of scRNA-seq data analysis.

This workshop requires R version 3.5.0 or higher, along with a number of Bioconductor packages.
If you haven't downloaded and installed them already, you can do so by running the code below.
**This only needs to be done once** - the packages will stay on your computer once installed, and can be loaded with `library()`.

```{r, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite(c("knitr", "BiocStyle", "EnsDb.Hsapiens.v86", "scater", "Rtsne",
    "DropletUtils", "scran", "pheatmap"))
```

# Setting up the data

## Reading in a sparse matrix

We are using a data set containing 3000 T cells from a healthy human donor, available from https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/t_3k.
The data have already been run through the _CellRanger_ pipeline to obtain unique molecular identifier (UMI) counts for each gene in each cell.

We load in the raw count matrix using the `read10xCounts()` function from the `r Biocpkg("DropletUtils")` package.
This creates a `SingleCellExperiment` object where each column corresponds to a cell barcode.
```{r}
library(DropletUtils)
fname <- "t_3k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce
```

Here, each count represents the number of UMIs assigned to a gene for a cell barcode.
Note that the counts are loaded as a sparse matrix object - specifically, a `dgCMatrix` instance from the `r CRANpkg("Matrix")` package.
This avoids allocating memory to hold the zero counts.

```{r}
class(counts(sce))
```
**Exercise:**

```{r}
# How many non-zeroes are there?
#A# mean(counts(sce)!=0)
```

```{r}
# What is the difference in memory usage?
#H# object.size()
#A# object.size(counts(sce))
#A# as.numeric(ncol(sce))*as.numeric(nrow(sce)) * 4
```

```{r}
# How can I get column sums?
#H# colSums(counts(sce))
#A# library(Matrix)
#A# head(colSums(counts(sce)))
#A# head(Matrix::colSums(counts(sce)))
```
## Annotating the rows

We relabel the rows with the gene symbols for easier reading, using the `uniquifyFeatureNames()` function.

```{r}
library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
```

We also identify the chromosomal location of each gene.
The mitochondrial location is particularly useful for later quality control.

```{r}
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID, 
    column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")
```

# Calling cells from empty droplets

In droplet-based data, we do not know in advance which droplets actually contain cells and which are empty.
We need to call cells from empty droplets based on the observed expression profiles, which is not always easy due to the presence of extracellular RNA in the empty droplets.
A good place to start is to examine a "barcode rank" plot:

```{r}
br.out <- barcodeRanks(counts(sce))
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total count")
abline(h=br.out$inflection, col="forestgreen")
abline(h=br.out$knee, col="dodgerblue")
legend("bottomleft", legend=c("inflection", "knee"), 
    col=c("forestgreen", "dodgerblue"), lwd=2)
```

We should see a sharp drop, above which are presumably cells (high RNA content) and below which are empty droplets (low RNA content).
We could draw a horizontal line to use as a threshold for defining cells, which is pretty much what _CellRanger_ does.

A more objective approach is implemented in the `emptyDrops()` function (see https://www.biorxiv.org/content/early/2018/04/04/234872).
This tests whether the expression profile for each cell barcode is significantly different from the ambient pool.
Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.
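To build some intuition for this test, here is a toy sketch with hypothetical numbers. Note that `emptyDrops()` itself uses a Dirichlet-multinomial model with Monte Carlo _p_-values, not the simple chi-squared test shown here - this sketch only illustrates the general idea of comparing a barcode's profile against the ambient proportions.

```{r}
# Toy sketch: does this barcode's profile deviate from the ambient pool?
# Hypothetical numbers; emptyDrops() uses a Dirichlet-multinomial model.
ambient.prop <- c(0.5, 0.3, 0.2)  # ambient proportions for three genes
barcode.counts <- c(10, 40, 50)   # observed counts for one barcode
chisq.test(barcode.counts, p=ambient.prop)$p.value # small => likely a cell
```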
We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.

```{r}
set.seed(100)
e.out <- emptyDrops(counts(sce))
sig <- e.out$FDR <= 0.01
sum(sig, na.rm=TRUE)
```

We can look at how the likelihoods (of originating from the ambient pool) _decrease_ with increasing total UMI count.
Note that the negative log-probabilities are plotted below, which _increase_ with increasing total count.

```{r}
plot(e.out$Total, -e.out$LogProb, log="xy", col=ifelse(sig, "red", "black"),
    xlab="Total count", ylab="-Log probability")
legend("topleft", legend=c("Putative cell", "Empty"), 
    col=c("red", "black"), pch=16)
```

`emptyDrops()` computes Monte Carlo _p_-values, so it is important to set the random seed to obtain reproducible results.
The number of Monte Carlo iterations also determines the lower bound for the _p_-values.
If any non-significant barcodes are `TRUE` for `Limited`, we may need to increase the number of iterations to ensure that they can be detected.

```{r}
table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)
```

We then subset our `SingleCellExperiment` object to retain only the detected cells.

```{r}
# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]
```
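As an aside on the `Limited` field above: the _p_-value lower bound can be worked out directly. This is a minimal sketch, independent of the workshop data, using a hypothetical number of iterations.

```{r}
# With niters Monte Carlo iterations, the smallest achievable p-value is
# (0 + 1)/(niters + 1); barcodes stuck at this floor are flagged as Limited.
niters <- 10000
min.p <- (0 + 1)/(niters + 1)
min.p
```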
**Exercise:**

```{r}
# How could I get all cells above the knee point?
#H# keep <- br.out$total >=
#A# keep <- br.out$total >= br.out$knee
#A# summary(keep)
```

```{r}
# How can I recapitulate _CellRanger_'s cell calling?
#H# ?defaultDrops
#A# cr.keep <- defaultDrops(counts(sce), expected=3000)
#A# summary(cr.keep)
```

```{r}
# What does table() do with two variables?
fruits <- sample(c("Apple", "Banana", "Cucumbers"), 10, replace=TRUE)
people <- sample(c("Alex", "Brett", "Cameron"), 10, replace=TRUE)
#H# table()
#A# table(fruits, people)
```
# Quality control on the cells

It is possible for non-empty droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis.
We compute some QC metrics using `calculateQCMetrics()` and examine their distributions:

```{r}
sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
    xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
    xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
    xlab="Proportion of reads in mitochondrial genes")
```

We remove cells with low library sizes, low total numbers of expressed features or high mitochondrial proportions.
The last metric serves as a proxy for cell damage in the absence of spike-in RNA.

```{r}
low.lib <- isOutlier(sce$log10_total_counts, nmads=3, type="lower")
low.nfeat <- isOutlier(sce$log10_total_features_by_counts, nmads=3, type="lower")
high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
discard <- low.lib | low.nfeat | high.mito
data.frame(LowLib=sum(low.lib), LowNFeat=sum(low.nfeat), 
    HighMito=sum(high.mito), discard=sum(discard))
```

We subset the `SingleCellExperiment` object to remove the low-quality cells.

```{r}
sce <- sce[,!discard]
summary(discard)
```

Outlier-based QC requires some care in heterogeneous data sets, where particular cell types might naturally express fewer features.
In this case, the population should be fairly homogeneous (all T cells), so we don't have to worry too much.

# Examining gene expression

The average expression of each gene is much lower here compared to the read count data set, due to:

- the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.
- the collapsing of multiple reads from PCR amplicons of the same transcript into a single UMI.
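The second point can be illustrated with a toy sketch (hypothetical UMI sequences): reads sharing a UMI are PCR duplicates of the same molecule, and collapse into a single count.

```{r}
# Toy sketch of UMI collapsing: 6 reads, but only 3 distinct molecules.
umis <- c("AACGT", "AACGT", "TTGCA", "AACGT", "GGACT", "TTGCA")
length(umis)         # number of reads
length(unique(umis)) # number of UMIs, i.e., the count that gets reported
```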
```{r}
ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")
```

The set of most highly expressed genes is dominated by ribosomal protein and mitochondrial genes, as expected.

```{r}
plotHighestExprs(sce)
```

# Normalizing for cell-specific biases

We perform some pre-clustering to break up obvious clusters and avoid pooling cells that are very different.
The `min.mean=` argument simply avoids using genes with lots of zero counts.

```{r}
library(scran)
clusters <- quickCluster(sce, method="igraph", min.mean=0.1)
table(clusters)
```

We then apply the deconvolution method to compute size factors for all cells.

```{r}
sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(sce))
```

The size factors are well correlated with the library sizes, indicating that capture efficiency and sequencing depth are the major biases.

```{r sfplot, fig.cap="Size factors for all cells in the T cell dataset, plotted against the library size."}
plot(sce$total_counts, sizeFactors(sce), log="xy")
```

Finally, we compute normalized log-expression values.
There is no need to call `computeSpikeFactors()` here, as there are no spike-in transcripts available.

```{r}
sce <- normalize(sce)
assayNames(sce)
```
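For comparison, the simple library-size approach that deconvolution improves upon can be sketched on a toy count matrix (hypothetical values): size factors are just the total counts, scaled to have a mean of unity.

```{r}
# Toy sketch of library-size normalization, the baseline that
# deconvolution refines by pooling across pre-clustered cells.
set.seed(0)
toy <- matrix(rpois(40, lambda=5), nrow=8)  # 8 genes x 5 cells
sf <- colSums(toy)/mean(colSums(toy))       # mean size factor is 1
normed <- log2(t(t(toy)/sf) + 1)            # normalized log-expression
round(sf, 2)
```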
**Exercise:**

```{r}
# What do we do about negative size factors?
lib.factors <- librarySizeFactors(sce)
is.neg <- sizeFactors(sce) < 0
sizeFactors(sce)[is.neg] <- lib.factors[is.neg]
```
# Modelling the mean-variance trend

We don't have spike-ins, so we can't directly model the technical noise.
One option is to assume that most genes do not exhibit strong biological variation, and to fit a trend to the variances of the endogenous genes.

```{r}
fit <- trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
plot(fit$mean, fit$var, pch=16, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE)
```

Another option is to assume that the technical noise is Poisson, and to create a fitted trend on that basis using the `makeTechTrend()` function.

```{r}
new.trend <- makeTechTrend(x=sce)
plot(fit$mean, fit$var, pch=16, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(new.trend(x), col="red", add=TRUE)
```

Assuming Poisson noise, we decompose the total variance into its technical and biological components:

```{r}
fit$trend <- new.trend # tricking decomposeVar into thinking this is the trend!
dec <- decomposeVar(fit=fit)
top.dec <- dec[order(dec$bio, decreasing=TRUE),] 
head(top.dec)
```

We have a look at the genes with the largest biological components.

```{r}
plotExpression(sce, features=rownames(top.dec)[1:10])
```

# Dimensionality reduction

We use the `denoisePCA()` function with the assumed Poisson technical trend to choose the number of dimensions to retain after PCA.
Note the `approx=TRUE`, which uses `r CRANpkg("irlba")` to perform a fast approximate PCA.

```{r}
sce <- denoisePCA(sce, technical=new.trend, approx=TRUE)
ncol(reducedDim(sce, "PCA"))
```

Examination of the first few PCs already reveals two clear clusters in the data:

```{r}
plotPCA(sce, ncomponents=3, colour_by="log10_total_features_by_counts")
```

This is recapitulated with a _t_-SNE plot computed _on the PCs_ - note the use of `use_dimred=`.

```{r}
sce <- runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)
plotTSNE(sce, colour_by="log10_total_features_by_counts")
```
**Exercise:**

```{r}
# How do we check the percentage of variance explained?
#A# plot(attr(reducedDim(sce), "percentVar"), xlab="PC",
#A#     ylab="Proportion of variance explained")
#A# abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")
```
# Clustering with graph-based methods

We start by building a shared nearest neighbour graph, where each cell is connected to its neighbours by an edge.
Unlike hierarchical clustering, we do not need to construct a distance matrix for a large number of cells.

```{r}
snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
snn.gr
```

We then use the Walktrap algorithm to identify clusters, i.e., "communities" in the graph.
These are groups of cells that are highly connected within each group, with relatively few connections between groups.

```{r}
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
```

We examine the "modularity" of the clusters, i.e., how enriched each cluster is for intra-group connections relative to what is expected under a null model.
At least a few of the clusters are not very modular (strong off-diagonal entries) - not surprising for T cell subsets.

```{r}
cluster.mod <- clusterModularity(snn.gr, sce$Cluster, get.values=TRUE)
log.ratio <- log2(cluster.mod$observed/cluster.mod$expected + 1)
library(pheatmap)
pheatmap(log.ratio, cluster_rows=FALSE, cluster_cols=FALSE, 
    color=colorRampPalette(c("white", "blue"))(100))
```

We examine the cluster identities on a _t_-SNE plot:

```{r}
plotTSNE(sce, colour_by="Cluster")
```
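The community-detection idea can be seen on a toy graph (a hypothetical example, unrelated to the SNN graph above): two tightly-connected groups joined by a single edge should be recovered as separate communities.

```{r}
# Two 5-cliques joined by one bridging edge; Walktrap should
# recover the two cliques as separate communities.
g <- igraph::disjoint_union(igraph::make_full_graph(5), igraph::make_full_graph(5))
g <- igraph::add_edges(g, c(1, 6))
comm <- igraph::cluster_walktrap(g)
igraph::membership(comm)
```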
**Exercise:**

```{r}
# Why did I use igraph::?
library(igraph)
#A# normalize
```
# Marker gene detection

We detect marker genes for each cluster using `findMarkers()`.
We only look at upregulated genes in each cluster, as these are more useful for positive identification of cell types in a heterogeneous population.

```{r}
markers <- findMarkers(sce, clusters=sce$Cluster, direction="up")
```

We examine the markers for cluster 9 in more detail.

```{r}
marker.set <- markers[["9"]]
head(marker.set, 10)
```

The transcriptional profile of cluster 9 is clearly distinct from the others:

```{r}
chosen <- rownames(marker.set)[marker.set$Top <= 10]
plotHeatmap(sce, features=chosen, exprs_values="logcounts", 
    zlim=3, center=TRUE, symmetric=TRUE, cluster_cols=FALSE,
    colour_columns_by="Cluster", columns=order(sce$Cluster))
```

... which can be further examined on the _t_-SNE plot:

```{r}
plotTSNE(sce, colour_by="GNLY")
```

So these are probably cytotoxic T cells, with some Jun/Fos activity.
**Exercise:**

```{r}
# What's cluster 8? How is it different from 3 and 10?
marker.set2 <- markers[["8"]]
chosen <- rownames(marker.set2)[marker.set2$Top <= 10]
plotHeatmap(sce, features=chosen, exprs_values="logcounts", 
    zlim=3, center=TRUE, symmetric=TRUE, cluster_cols=FALSE,
    colour_columns_by="Cluster", columns=order(sce$Cluster))
```

```{r}
# Try looking at all genes in both directions:
markers2 <- findMarkers(sce, clusters=sce$Cluster)
marker.set2 <- markers2[["8"]]
chosen <- rownames(marker.set2)[marker.set2$Top <= 10]
plotHeatmap(sce, features=chosen, exprs_values="logcounts", 
    zlim=3, center=TRUE, symmetric=TRUE, cluster_cols=FALSE,
    colour_columns_by="Cluster", columns=order(sce$Cluster))
```

Check out _CD7_, _CD8_, _S100A4_.
# Concluding remarks

Having completed the basic analysis, we save the `SingleCellExperiment` object with its associated data to file.
This avoids having to repeat all of the pre-processing steps described above prior to further analyses.

```{r}
saveRDS(sce, file="t3k_data.rds")
```

See https://www.bioconductor.org/packages/devel/workflows/vignettes/simpleSingleCell/inst/doc/work-3-tenx.html for more details.
Meanwhile, we show the session information for record-keeping:

```{r}
sessionInfo()
```