---
title: Analysis of APL mass cytometry data with `cydar`
author: Arianne Richard, with adaptations by Aaron Lun
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: yes
    fig_caption: false
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Overview

This script analyzes the mass cytometry data for the CD8+ T cell stimulation project by Richard et al. (2018, https://www.nature.com/articles/s41590-018-0160-9). The aim is to look for changes in population abundance upon treatment with different TCR ligands, using methods from the `r Biocpkg("cydar")` package. See `README.md` for instructions on how to obtain the data and prepare it for analysis.

# Loading the data

First, we load in the data. This is done using the `r Biocpkg("ncdfFlow")` package, which provides facilities for rapidly loading FCS files into memory.

```{r}
library(ncdfFlow)
fnames <- list.files(path='data', pattern='*0.fcs', full.names=TRUE)
ncdf <- read.ncdfFlowSet(files=fnames)
```

We double-check the identities of the channels. We also identify the subset of channels that were actually used.

```{r}
channels <- data.frame(channel=colnames(ncdf),
    marker=parameters(ncdf[[1]])$desc, stringsAsFactors=FALSE)
channels
in.use <- grepl("^[0-9]+[A-Za-z]+_", channels$marker)
channels[in.use,]
```

We also read in the sample annotations.

```{r}
samples <- read.table('data/experiment_66456_annotations.tsv',
    sep='\t', header=TRUE, stringsAsFactors=FALSE)
samples
```

# Estimating the transformation

We use the "logicle" transform (https://www.ncbi.nlm.nih.gov/pubmed/16604519), which provides linearity at zero and logarithmic behaviour at large intensities. However, this involves some parameter estimation to match the range of intensities observed in the data. To estimate these parameters, we pool all the cells together to use information from all samples.
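To build intuition for this class of transformations, here is a small base-R sketch using `asinh()`, which shares the key property of the logicle transform: approximately linear near zero and logarithmic at large intensities. This is only an illustration with arbitrary example values, not the parameter estimation performed by `estimateLogicle`.

```{r}
# Illustration only (base R): asinh(), like the logicle transform,
# behaves linearly near zero and logarithmically at large intensities.
near.zero <- abs(asinh(0.001) - 0.001)      # asinh(x) ~ x for small x
at.large <- abs(asinh(1e6) - log(2 * 1e6))  # asinh(x) ~ log(2x) for large x
c(near.zero=near.zero, at.large=at.large)   # both effectively zero
```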
Note that we only compute transformations for the channels that actually contain marker intensities (plus a few others that we will use during gating).

```{r}
library(cydar)
pool.ff <- poolCells(ncdf, equalize=FALSE)
lgcl <- estimateLogicle(pool.ff, c(channels$channel[in.use],
    c("Eu153Di", "Ir191Di", "Ir193Di", "Pt195Di")))
```

We then transform the pooled intensities for use in gating.

```{r}
pool.ff <- transform(pool.ff, lgcl)
nrow(pool.ff)
```

Other transformations are also possible, e.g., `hyperlog`, `arcsinhTransform` or `logTransform`, but we find that the logicle method works well here.

# Defining gates for quality control

## Checking the event time

First, we deal with an obvious clog in the machine.

```{r}
plot(exprs(pool.ff)[,"Time"], exprs(pool.ff)[,"Y89Di"], cex=0.1,
    xlab='time', ylab='CD45')
abline(v=2700000, col='dodgerblue')
abline(v=2950000, col='dodgerblue')
```

We define a gate to remove cells within this time period.

```{r}
clog.gate <- rectangleGate(filterId="timegate",
    list("Time"=c(2700000, 2950000)))
clog.filt <- filter(pool.ff, clog.gate)
clog.filt
```

We then subset the pooled set to keep only the events outside of the gate. Note the use of `!` to invert the gate.

```{r}
pool.ff <- Subset(pool.ff, !clog.filt)
nrow(pool.ff)
```

## Checking the beads

We also check that the normalization beads have done their job with respect to correcting for signal decay over time.

```{r}
smoothScatter(exprs(pool.ff)[,"Time"], exprs(pool.ff)[,"Eu153Di"],
    xlab="Time", ylab="Bead intensity")
smoothScatter(exprs(pool.ff)[,"Eu151Di"], exprs(pool.ff)[,"Eu153Di"],
    xlab="Eu151", ylab="Eu153")
```

We define a gate to retain only non-bead events.

```{r}
bead.gate <- rectangleGate(filterId="beadgate", list("Eu153Di"=c(0, 2.5)))
smoothScatter(exprs(pool.ff)[,"Y89Di"], exprs(pool.ff)[,"Eu153Di"],
    xlab='CD45', ylab='Eu153')
abline(h=2.5, col="red")
```

Note that if we want to keep everything inside a gate, we do not need to go through `filter`.
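Conceptually, a rectangle gate just keeps events whose intensities fall within the specified range. A minimal base-R sketch with a hypothetical toy matrix (not the `flowCore` implementation):

```{r}
# Conceptual sketch only (base R, toy values): a rectangle gate is a
# simple range check on a channel's intensities.
toy <- cbind(Eu153Di=c(0.5, 3.1, 1.2, 2.7))
in.gate <- toy[,"Eu153Di"] >= 0 & toy[,"Eu153Di"] <= 2.5
sum(in.gate)  # number of toy events that this range check would keep
```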
```{r}
pool.ff <- Subset(pool.ff, bead.gate)
nrow(pool.ff)
```

## Checking DNA content

We can use the DNA markers to gate for singlets. Unfortunately, most of this data is in doublets (based on previous experiments with B/T cell mixtures), so we use some options in `dnaGate` to select the smaller peak.

```{r}
smoothScatter(exprs(pool.ff)[,"Ir191Di"], exprs(pool.ff)[,"Ir193Di"],
    xlab="DNA1", ylab="DNA2")
dna.gate <- dnaGate(pool.ff, 'Ir191Di', 'Ir193Di', type='both', rank=2)
polygon(dna.gate@boundaries[,1], dna.gate@boundaries[,2], border="red")
```

We apply this gate to our data set.

```{r}
pool.ff <- Subset(pool.ff, dna.gate)
nrow(pool.ff)
```

## Checking dead-or-alive status

This experiment was stained with cisplatin to mark the dead cells. We exclude dead cells as those that are `Pt195Di`-high, i.e., above a threshold of 2.

```{r}
hist(exprs(pool.ff)[,"Pt195Di"], breaks=200, xlab='Pt195', main='')
abline(v=2, col='dodgerblue')
```

We retain only cells that are `Pt195Di`-low.

```{r}
live.gate <- rectangleGate(filterId="livegate", list("Pt195Di"=c(0, 2)))
pool.ff <- Subset(pool.ff, live.gate)
nrow(pool.ff)
```

## Gating for specific cells of interest

We exclude cells with extremely low CD45 intensities as non-haematopoietic. We use a threshold of 5 MADs in `outlierGate()` so that only the properly negative events are removed.

```{r}
hist(exprs(pool.ff)[,"Y89Di"], breaks=200, xlab='CD45', main='')
haem.gate <- outlierGate(pool.ff, 'Y89Di', nmads=5, type='lower')
abline(v=haem.gate@min, col="red")
pool.ff <- Subset(pool.ff, haem.gate)
nrow(pool.ff)
```

We then remove non-T cells. Here we gate out events more than 6 MADs below the median, because TCR levels can fall dramatically in activating cells and we do not want to miss these. (We also started with primarily CD8+ T cells in the live fraction by experimental design.)
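The thresholding logic behind this kind of MAD-based lower gate can be sketched in base R. This is a simplified illustration on simulated intensities, not the `outlierGate()` implementation itself:

```{r}
# Simplified sketch of a MAD-based lower gate on simulated intensities.
set.seed(1)
intensity <- c(rnorm(9000, mean=4, sd=0.5),  # main population
               rnorm(1000, mean=0, sd=0.3))  # "negative" events
nmads <- 5
threshold <- median(intensity) - nmads * mad(intensity)
keep <- intensity > threshold
summary(keep)  # the negative events fall below the threshold
```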
```{r}
hist(exprs(pool.ff)[,"Tm169Di"], breaks=200, xlab='TCRb', main='')
t.gate <- outlierGate(pool.ff, 'Tm169Di', nmads=6, type='lower')
abline(v=t.gate@min, col="red")
pool.ff <- Subset(pool.ff, t.gate)
nrow(pool.ff)
```

Finally, we remove non-CD8+ T cells. We gate out events more than 5 MADs below the median, as CD8 expression can also drop dramatically upon activation.

```{r}
hist(exprs(pool.ff)[,"Er168Di"], breaks=200, xlab='CD8a', main='')
CD8.gate <- outlierGate(pool.ff, 'Er168Di', nmads=5, type='lower')
abline(v=CD8.gate@min, col="red")
pool.ff <- Subset(pool.ff, CD8.gate)
nrow(pool.ff)
```

# Transforming and gating individual samples

First, we apply the logicle transformation:

```{r}
processed <- transform(ncdf, lgcl)
```

We apply the clog filter, and visualize sample 2 to check that the filter worked.

```{r}
clog.filt <- filter(processed, clog.gate)
processed <- split(processed, clog.filt)["timegate-"][[1]]
plot(exprs(processed[[2]])[,"Time"], exprs(processed[[2]])[,"Y89Di"],
    cex=0.1, xlab='time', ylab='CD45')
```

Applying the bead filter.

```{r}
processed <- Subset(processed, bead.gate)
smoothScatter(exprs(processed[[2]])[,"Y89Di"],
    exprs(processed[[2]])[,"Eu153Di"], xlab='CD45', ylab='Eu153')
```

Applying the DNA gate.

```{r}
processed <- Subset(processed, dna.gate)
par(mfrow=c(1,2))
hist(exprs(processed[[2]])[,"Ir191Di"], breaks=50, xlab='Ir191', main='')
hist(exprs(processed[[2]])[,"Ir193Di"], breaks=50, xlab='Ir193', main='')
```

Applying the live cell gate.

```{r}
processed <- Subset(processed, live.gate)
hist(exprs(processed[[2]])[,"Pt195Di"], breaks=50, xlab='Pt195', main='')
```

And the CD45 haematopoietic cell gate.

```{r}
processed <- Subset(processed, haem.gate)
hist(exprs(processed[[2]])[,"Y89Di"], breaks=50, xlab='CD45', main='')
```

And the TCRb T cell gate.

```{r}
processed <- Subset(processed, t.gate)
hist(exprs(processed[[2]])[,"Tm169Di"], breaks=50, xlab='TCRb', main='')
```

And the CD8+ cell gate.
```{r}
processed <- Subset(processed, CD8.gate)
hist(exprs(processed[[2]])[,"Er168Di"], breaks=50, xlab='CD8a', main='')
```

# Performing a differential abundance analysis

## Counting cells into hyperspheres

We restrict our analysis to the channels that measure protein markers. We do not exclude the markers we gated on, because changes in CD45, TCRb and CD8 expression reflect activation status and can be interesting. However, we do remove the bead, DNA and live/dead channels, which are not of interest beyond quality control.

```{r}
of.interest <- in.use & !channels$marker %in%
    c("191Ir_DNA1", "193Ir_DNA2", "195Pt_Live")
channels[of.interest,]
processed <- processed[,colnames(processed) %in% channels$channel[of.interest]]
processed
```

We also clean up the names to make things more palatable later on.

```{r}
true.names <- with(channels, marker[match(colnames(processed), channel)])
true.names <- sub("^[0-9]+[A-Za-z]+_", "", true.names)
colnames(processed) <- as.character(true.names)
colnames(processed)
```

We compute the distances to neighbouring cells. The aim is to determine whether the default size of our hyperspheres (`tol=0.5`) will contain enough cells for statistical modelling. We see that we have more than 50 cells per hypersphere on average, which is more than enough.

```{r}
cd <- prepareCellData(processed)
dist <- neighborDistances(cd)
boxplot(dist, xlab="Number of neighbors", ylab="Tolerance")
abline(h=0.5, col="red", lwd=2, lty=2)
```

We now count cells into hyperspheres. Each hypersphere describes a region of the expression space, centred on a cell and including neighbouring cells from any sample.

```{r}
cd <- countCells(cd, tol=0.5)
cd
```

The count matrix contains the number of cells from each sample (column) in each hypersphere (row):

```{r}
head(assay(cd, "counts"))
```

...
while the intensity matrix contains the median intensity of each hypersphere (row) for each marker (column):

```{r}
head(intensities(cd))
```

## Preparing for the DA analysis

We consider only hyperspheres with 5 or more cells on average, as determined using functions from the `r Biocpkg("edgeR")` package. This avoids analyzing hyperspheres with very low cell counts, which are underpowered and generally less interesting anyway.

```{r}
library(edgeR)
y <- DGEList(assay(cd), lib.size=cd$totals)
keep <- aveLogCPM(y) >= aveLogCPM(5, mean(cd$totals))
table(keep)
cd2 <- cd[keep,]
y <- y[keep,]
```

We can create an MDS plot to visualize the differences between samples.

```{r}
adjc <- edgeR::cpm(y, log=TRUE, prior.count=3)
label <- sub(" .*", "", samples$Condition)
col <- c("blue", "red", "black", "cyan2", "green3")[factor(label)]
plotMDS(adjc, label=label, col=col)
```

## Modelling biological variability

We use an additive model that accounts for the peptide (the factor of interest) and blocks on the mouse of origin for each sample. **It is important to ensure that your metadata matches up with your samples!**

```{r}
m <- match(colnames(cd), samples$FCS.Filename)
samples <- samples[m,] # reorder the metadata to match the samples.
peptide <- factor(sub(" .*", "", samples$Condition))
mouse <- factor(sub(".* ", "", samples$Individuals))
design <- model.matrix(~ 0 + peptide + mouse)
design
```

We then model biological variability using the quasi-likelihood framework in `r Biocpkg("edgeR")`. This estimates two "dispersion" parameters: the first models the mean-dispersion trend, while the second models variation around the trend. Here we have increased `span` to smooth the fitted trend, as there are relatively few (fewer than 10000) points.

```{r}
y <- estimateDisp(y, design, span=1)
plotBCV(y)
fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit)
```

## Testing for changes in abundance

Our aim here is to identify hyperspheres that exhibit any changes in abundance with respect to peptide strength.
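Testing several group-versus-control differences simultaneously amounts to an F-test on the corresponding coefficients, which is what supplying a multi-column contrast to `glmQLFTest()` does. A minimal base-R analogy on simulated (hypothetical) data, using nested linear models rather than the quasi-likelihood GLMs of the actual analysis:

```{r}
# Illustration only (simulated data, not the mass cytometry counts):
# an F-test yields one p-value for "any difference among groups".
set.seed(42)
group <- factor(rep(c("ctrl", "A", "B"), each=10), levels=c("ctrl", "A", "B"))
yy <- rnorm(30) + c(0, 1, 2)[as.integer(group)]  # groups A and B are shifted
full <- lm(yy ~ group)  # separate mean per group
null <- lm(yy ~ 1)      # single mean, i.e., no group effect
res <- anova(null, full)
res$"Pr(>F)"[2]         # small p-value: some group differs from control
```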
We set up a contrast matrix, comparing each peptide to the NP68 control:

```{r}
con <- makeContrasts(
    N4=peptideN4 - peptideNP68,
    G4=peptideG4 - peptideNP68,
    Q4H7=peptideQ4H7 - peptideNP68,
    T4=peptideT4 - peptideNP68,
    levels=design)
con
```

... and compute a $p$-value for each hypersphere:

```{r}
res.all <- glmQLFTest(fit, contrast=con)
head(res.all$table)
```

Of course, we need to correct for multiple testing by controlling the false discovery rate (FDR). However, hyperspheres are not evenly distributed across the high-dimensional intensity space. More hyperspheres will be observed in highly abundant subpopulations, due to the manner in which hyperspheres are formed (by centering them on existing cells). We want to control the FDR across the intensity space, regardless of cell density. This motivates the notion of a "spatial FDR", which can be computed and controlled below a threshold, e.g., of 5%. In other words, of the volume of space that exhibits changes in abundance, at most 5% is occupied by false positives.

```{r}
qval.all <- spatialFDR(intensities(cd2), res.all$table$PValue)
is.sig.all <- qval.all <= 0.05
summary(is.sig.all)
```

# Interpreting the results

## Creating dimensionality reduction plots

The simplest approach to interpretation is to use non-linear dimensionality reduction methods. The most popular is _t_-SNE, though UMAP and self-organizing maps are also widely used. Most of these methods involve randomization, so make sure to use `set.seed()` to obtain reproducible results. _t_-SNE also requires some tinkering with the perplexity; see https://distill.pub/2016/misread-tsne/ for some thoughts.

```{r}
set.seed(100)
library(Rtsne)
t.out <- Rtsne(intensities(cd2), perplexity=30)
head(t.out$Y)
```

We can then colour the coordinates by the log-fold change relative to NP68 (remember that the first level of `peptide` is treated as the reference). Only the hyperspheres with significant changes for any peptide are shown here.
```{r, fig.height=15, fig.width=6}
layout(cbind(c(1:4), 5), widths=c(8, 1.5))
par(mar=c(2.1, 2.1, 2.1, 1.1), oma=c(3,3,1,1), mgp=c(2.2,1,0))
for (peptide in c('N4', 'T4', 'Q4H7', 'G4')) {
    fc.col <- plotSphereLogFC(t.out$Y[is.sig.all,1], t.out$Y[is.sig.all,2],
        logFC=res.all$table[is.sig.all, paste0("logFC.", peptide)],
        max.logFC=5, cex.axis=1.5, cex.lab=1.5, cex.main=1.5, cex=1,
        main=peptide)
}
par(mar=c(0,0,0,0))
plot(0, 0, axes=FALSE, xlab="", ylab="")
createColorBar(fc.col)
```

To annotate them, we will make similar plots coloured by the intensity of all relevant markers:

```{r, fig.height=20, fig.width=20}
par(oma=c(4.1,4.1,1.1,1.1))
lmat <- cbind(matrix(seq_len(5*4), ncol=4, nrow=5), 21)
layout(lmat, widths=c(rep(1, 5), 0.2))
reranges <- intensityRanges(cd2)
par(mar=c(2.1, 2.1, 2.1, 2.1))
for (i in order(markernames(cd2))) {
    int.col <- plotSphereIntensity(t.out$Y[,1], t.out$Y[,2],
        intensities(cd2)[,i], irange=reranges[,i],
        main=markernames(cd2)[i], cex=1.5, cex.lab=1.5,
        cex.main=1.5, cex.axis=1.5)
}

# Just filling up the leftover panels.
for (j in seq_len(20 - ncol(intensities(cd2)))) {
    plot.new()
}

par(mar=c(0,0,0,0))
plot(0, 0, type="n", axes=FALSE, ylab="", xlab="")
createColorBar(int.col, top.text="High", bottom.text="Low")
```

## Creating heatmaps

An alternative visualization strategy is to create heatmaps of the hypersphere intensities. First, we add some row names to our intensity matrix (otherwise `pheatmap()` becomes quite unhappy).

```{r}
library(pheatmap)
mat <- intensities(cd2)
rownames(mat) <- seq_len(nrow(mat))
```

We indicate whether each hypersphere was significant, and if so, the direction in which its abundance changed for each peptide against NP68.

```{r}
statuses <- list()
for (i in colnames(res.all$table)[1:4]) {
    cur.lfc <- res.all$table[,i]
    cur.sign <- ifelse(cur.lfc > 0, "up", "down")
    cur.sign[!is.sig.all] <- "ns"
    statuses[[i]] <- cur.sign
}

# Adding colors for each status.
fc.colors <- c(up="red", down="blue", ns="grey80")
fc.colors <- rep(list(fc.colors), 4)
names(fc.colors) <- names(statuses)
```

Finally, we create a heatmap using the eponymous function from the `r CRANpkg("pheatmap")` package.

```{r, fig.width=7, fig.height=7}
pheatmap(mat, col=viridis::viridis(101), cluster_cols=FALSE,
    annotation_row=do.call(data.frame, statuses),
    annotation_colors=fc.colors, show_rownames=FALSE)
```

This tends to be more compact with respect to showing the intensity information, though some find that this makes it difficult to interpret.

## Using the shiny app

Another alternative is to use an interactive application. Pointing and clicking on the _t_-SNE plot will show the median intensity of the chosen hypersphere in the context of the distribution across all cells. It will also show the spread of intensities within each hypersphere and the associated differential testing statistics.

```{r}
app <- interpretSpheres(cd2, markers=sort(markernames(cd2)),
    metrics=cbind(res.all$table, FDR=qval.all),
    red.coords=t.out$Y, red.highlight=is.sig.all, run=FALSE)
# app # uncomment to run 'live'.
```

# Wrapping up

```{r}
sessionInfo()
```