---
title: Preliminary analysis of the centrosome enrichment experiment
author: Aaron Lun
date: 2016-06-14
output:
   html_document:
      fig_caption: false
---

```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
options(width=100)
```

# Introduction

This analysis is a preliminary analysis of a centrosome enrichment experiment.
The aim is to identify centrosome-associating transcripts that are upregulated in the pulldown vs the "soup" (i.e., supernatant).
We load in the counts and we set up the metadata.

```{r}
counts <- read.table("genic_counts.tsv", header=TRUE, row.names=1)[,-1]
groupings <- c("ip", "ip", "soup", "input")
data.frame(groupings, colnames(counts))
```

We construct our objects and add some annotation to get some human-readable gene symbols.

```{r}
library(edgeR)
colnames(counts) <- sub("_.*", "", colnames(counts))
y <- DGEList(counts)
library(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keytype="ENSEMBL", keys=rownames(counts), column="SYMBOL")
y$genes <- anno[match(rownames(counts), anno$ENSEMBL),]
dim(y)
```

# Filtering, data exploration and normalization

We remove low-abundance genes with an average log-CPM below 1.
This removes lowly expressed genes that do not have enough power to detect DE.

```{r}
keep <- aveLogCPM(y) > 1
y <- y[keep,]
sum(keep)
```

We construct a MDS plot to see how the libraries are distributed.


```{r}
plotMDS(y, labels=paste0(groupings, seq_along(groupings)))
```

Finally, we perform some normalization to remove library-specific biases.
The normalization factors are close to unity, and the library sizes are quite even, so there doesn't seem to be many problems here.

```{r}
y <- calcNormFactors(y)
y$samples
```

Still, we make some plots for peace of mind.
The clouds are all centred at a log-fold change of zero, indicating that normalization was successful.

```{r, fig.width=10, fig.height=10}
par(mfrow=c(2,2))
for (i in seq_len(ncol(y))) { 
    plotMD(y, column=i)
    abline(h=0, col="red", lwd=2, lty=2)
}
```

# Modelling biological variability

First we set up the design matrix.
We treat the two IP samples as replicates.

```{r}
g <- factor(groupings)
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)
```

We then log-transform the counts for linear modelling.
We fit a mean-variance trend and compute precision weights to account for the fact that larger counts are less variable (relative to their mean).

```{r}
v <- voom(y, design, plot=TRUE)
```

We fit a linear model to the log-transformed counts using the previously calculated weights.

```{r}
fit <- lmFit(v, design)
```

As a check, we also examine the negative binomial dispersion estimate.
The values are low indicating that the replication was tight.

```{r}
y <- estimateDisp(y, design)
plotBCV(y)
```

# Testing for differential expression

We test for DE between each pair of conditions, using a log-fold change threshold of 0.5 to only consider biologically meaningful changes.

```{r}
con <- makeContrasts(ip - input, input - soup, ip - soup, levels=design)
fit2 <- contrasts.fit(fit, con)
fit2 <- treat(fit2, lfc=0.5)
```

We examine the numbers of DE in each comparison.
It looks like the IP vs input has the fewest, while many genes seem to be lost in the soup compared to the input (but not gained in the IP compared to the input).

```{r}
summary(decideTests(fit2))
```

We can proceed through the specific comparisons. Firstly, for IP vs input:

```{r}
ipvin <- topTreat(fit2, coef=1, n=Inf)
write.table(file="de_IPvInput.tsv", ipvin, row.names=FALSE, sep="\t", quote=FALSE)
head(ipvin[,-1], n=10)
```

We report the CPMs for all libraries, just to check that the results are, in fact, sensible:

```{r}
cpm(y[ipvin$ENSEMBL[1:10],])
```

Then for input vs soup:

```{r}
invso <- topTreat(fit2, coef=2, n=Inf)
write.table(file="de_InputvSoup.tsv", invso, row.names=FALSE, sep="\t", quote=FALSE)
head(invso[,-1], n=10)
cpm(y[invso$ENSEMBL[1:10],])
```

Finally, for IP vs soup:

```{r}
ipvso <- topTreat(fit2, coef=3, n=Inf)
write.table(file="de_IPvSoup.tsv", ipvso, row.names=FALSE, sep="\t", quote=FALSE)
head(ipvso[,-1], n=10)
cpm(y[ipvso$ENSEMBL[1:10],])
```

We also look at genes that are only upregulated in IP vs input *and* input vs soup.
This is the most conservatively selected set of genes:

```{r}
pvals <- pmax(fit2$p.value[,1], fit2$p.value[,2])
keep <- fit2$coefficients[,1] > 0 & fit2$coefficients[,2] > 0
collected <- data.frame(y$genes, fit2$coefficients[,1:2], P.Value=pvals)[keep,]
collected$adj.P.Val <- p.adjust(collected$P.Value, method="BH")
sum(collected$adj.P.Val <= 0.05)
collected <- collected[order(collected$P.Value),]
write.table(file="de_best.tsv", collected, row.names=FALSE, sep="\t", quote=FALSE)
head(collected)
```

# Session information

```{r}
sessionInfo()
```