---
title: Differential expression analysis in hESCs after silencing HUSH components
author: Aaron Lun
date: 28 March 2016
---

```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(error=FALSE, message=FALSE)
knitr::opts_knit$set(width=100)
options(width=100)
```

# Background

This project involves using shRNAs to silence components of the HUSH complex in human embryonic stem cells.
Data were generated by Dan Greaves in the Lehner group (CIMR) and obtained from the CI Genomics Core servers on March 17, 2016.
Paired-end reads were uniquely aligned to the hg38 build of the human genome using `subread`.
Reads were counted into exonic regions of genes in the ENSEMBL GRCh38.83 annotation _and_ into repeat calls from the RepeatMasker software, using `featureCounts` with stringent settings (minimum mapping quality of 50, both reads in the pair mapped to the same feature).
To reduce file size, repeat elements with fewer than 10 counts across all samples were removed.

```{r}
all.counts <- read.table("genic_subcounts.tsv", row.names=1, header=TRUE, 
    colClasses=rep(c("character", "integer"), c(1, 67)))
gene.length <- all.counts[,1]
all.counts <- all.counts[,-1]
dim(all.counts)
```

We determine the experimental design, and we report it along with the mapping statistics for each file.

```{r}
fnames <- strsplit(colnames(all.counts), split=".", fixed=TRUE)
barcodes <- sub("_", "-", sapply(fnames, "[", 3))
index <- sub("s_", "", sapply(fnames, "[", 5))
tab <- read.csv("contents.csv", header=TRUE, stringsAsFactors=FALSE)
sample.names <- tab$Sample.name[match(barcodes, tab$Barcode)]
mapstat <- read.table("my_qual.tsv", header=TRUE)
totals <- rowSums(mapstat)
mapped <- totals - mapstat$Unassigned_Unmapped
data.frame(Sample=sample.names, Total=totals, Mapped=mapped/totals, Genic=mapstat$Assigned/totals)
```

We then aggregate each set of technical replicates into a single entry.

```{r}
require(edgeR)
summed <- sumTechReps(all.counts, sample.names)
dim(summed)
```

We set up an appropriate design matrix to reflect the groupings for each sample.
We use a one-way layout without an intercept, for simplicity.

```{r}
groups <- factor(sub("_[0-9]+$", "", colnames(summed)))
design <- model.matrix(~0 + groups)
colnames(design) <- sub("-", ".", levels(groups))
colnames(design)
```

We also identify which rows correspond to genes, and which correspond to repeat elements.
As these can have quite different properties, we shall analyze them separately.

```{r}
is.ens <- grepl("^ENS", rownames(summed))
sum(is.ens)
```

# Setting up the analysis of the genes

## Filtering and normalization

Our first job is to filter out the low-abundance genes.
We define these as those with an average count-per-million (CPM) less than 1.
This is equivalent to an average count of around 5-10, given the library sizes involved here.

```{r}
gene.counts <- summed[is.ens,]
gy <- DGEList(gene.counts)
keep <- aveLogCPM(gy) > 0
gy <- gy[keep,]
sum(keep)
```

We proceed to normalize using the TMM normalization.
This removes any biases between libraries due to differences in RNA composition.
Sequencing depth is automatically normalized and does not need to be considered here.
In general, the normalization factors are fairly mild which suggests that not much normalization was required.

```{r}
gy <- calcNormFactors(gy)
gy$samples
```

## Data exploration and annotation

We can have a look at the similarities and differences between samples.
We colour each sample by its group and construct a MDS plot.
Libraries that are close together on the plot are more similar.
We would expect that replicates would cluster into distinct groups, though we don't really observe this here.

```{r, fig.width=9, fig.height=6}
colors <- c("black", "grey50", "grey80", "blue", "red", "orange", "brown", 
            "yellow", "green", "forestgreen", "lightblue", "purple", "pink")
par(mar=c(5.1, 4.1, 1.1, 11.1), xpd=TRUE)           
plotMDS(cpm(gy, log=TRUE, prior.count=3), pch=16, col=colors[as.integer(groups)])
legend("topright", legend=levels(groups), col=colors, inset=c(-0.4,0.1), pch=16)
```


First, we convert the ENSEMBL identifiers into gene symbols that are more informative.

```{r}
library(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keys=rownames(gy$counts), columns="SYMBOL", keytype="ENSEMBL")
m <- match(rownames(gy$counts), anno$ENSEMBL)
gy$genes <- anno[m,]
rownames(gy$counts) <- NULL
head(gy$genes)
```

## Log-transforming for linear modelling

We then use the `voom` framework to convert the counts into a suitable format for linear modelling.
This fits a mean-variance trend and computes precision weights to account for the greater precision (relative to the mean) for large counts.
There aren't quite enough samples in each group to compute sample-specific quality weights, so we'll give that a miss.

```{r}
gv <- voom(gy, design, plot=TRUE)
```

<!--
# There's practically no correlations, though:
batches <- sub(".*_", "", colnames(gv))
dc <- duplicateCorrelations(gv, design, batch=batches) 
# ... yields a value of -0.04.
# This is largely borne out by the MDS plots. 
-->

We go on to fit a linear model to the data.
We use empirical Bayes shrinkage to stabilize the variance estimates -- the extent of shrinkage is determined by the prior degrees of freedom shown below.
We perform robust shrinkage to reduce the effect of outliers.

```{r}
gfit <- lmFit(gv, design) 
gfit <- eBayes(gfit, robust=TRUE)
summary(gfit$df.prior)
```

As a diagnostic, we estimate the NB dispersion across replicates.
Values of around 0.05 to 0.1 should be expected in well-replicated studies.
This is consistent with what we observ here (note that the square-root of the dispersion is shown below).

```{r}
gy <- estimateDisp(gy, design)
plotBCV(gy, ylim=c(0, 0.5))
```

## Performing the differential analyses for periphilin knockdown

### Testing for differential induction response 

We have a look at the expression profile for periphilin.
Ideally, the shRNA'd samples should show a decrease in expression values.

```{r, fig.width=10, fig.height=6}
adj <- cpm(gy[gy$genes$ENSEMBL=="ENSG00000134283",])
par(mar=c(10.1, 4.1, 2.1, 2.1))
plot(as.integer(groups), adj, pch=16, xaxt="n", ylab="CPM", xlab="")
axis(1, at=seq_len(nlevels(groups)), label=levels(groups), las=2)
```

We test for differences in the response to induction upon KD of periphilin.
In other words, we look for genes where the effect of induction differs between the scrambled control and the periphilin-KD samples.

```{r}
con <- makeContrasts((shPeriphilin_induced - shPeriphilin_neg) - 
                     (Scrambled_induced - Scrambled_neg), levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
```

It looks like there's not a great effect on the transcriptional profile.
The top gene is periphilin itself, which is somewhat reassuring.

```{r}
out.dr <- topTable(gfit2, n=Inf, sort.by="none")
head(out.dr[order(out.dr$P.Value),])
```

### Testing for any effect of knockdown

For comparison, we also see what happens when we compare the shRNA-treated samples to the scramble control or to the uninduced sample.
This results in slightly more DE genes, though that's not particularly surprising as most of them seem to have the same log-fold changes with and without induction.

```{r}
con <- makeContrasts(Scramble=shPeriphilin_induced - Scrambled_induced,
                     Induced=shPeriphilin_induced - shPeriphilin_neg, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
out.any <- topTable(gfit2, n=Inf, sort.by="none")
head(out.any[order(out.any$P.Value),])
```

We compile these results into a single structure and save it to file.

```{r}
drop.any <- colnames(out.any) %in% c("ENSEMBL", "SYMBOL", "AveExpr")
output <- data.frame(gy$genes, AveExpr=out.dr$AveExpr, 
    Response.logFC=out.dr$logFC, Response.PValue=out.dr$P.Value, Response.FDR=out.dr$adj.P.Val,
    VsScramble.logFC=out.any$Scramble, VsUninduced.logFC=out.any$Induced,
    Any.PValue=out.any$P.Value, Any.FDR=out.any$adj.P.Val)
output <- output[order(output$Response.PValue),]
write.table(output, file="DEgenes_Periphilin.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

### Testing for recovery off tetracyclin

When tetracyclin is removed, shRNA is no longer produced and periphilin levels should recover.
We test for genes where there is any DE between the uninduced, induced and off-tet samples.

```{r}
con <- makeContrasts(Induced=shPeriphilin_induced - shPeriphilin_neg,
                     Recovery=shPeriphilin_off_tet - shPeriphilin_induced, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
out.rec <- topTable(gfit2, n=Inf, sort.by="none")
```

We select those genes where the sign of the log-fold change between induced and uninduced is opposite to that between off-tet and induced.
This is consistent with a recovery of the baseline transcriptional profile.
All other genes are assigned _p_-values of unity, under the null hypothesis that the signs are the same.

```{r}
opp.sign <- sign(out.rec$Induced)!=sign(out.rec$Recovery)
out.rec$P.Value[!opp.sign] <- 1
out.rec$adj.P.Val <- p.adjust(out.rec$P.Value, method="BH")
sum(out.rec$adj.P.Val <= 0.05)
out.rec <- out.rec[order(out.rec$P.Value),]
head(out.rec)
write.table(out.rec, file="recovered_Periphilin.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

## Performing the differential analyses for TASOR knockdown

### Testing for differential induction response 

We have a look at the expression profile for TASOR.

```{r, fig.width=10, fig.height=6}
adj <- cpm(gy[gy$genes$ENSEMBL=="ENSG00000163946",])
par(mar=c(10.1, 4.1, 2.1, 2.1))
plot(as.integer(groups), adj, pch=16, xaxt="n", ylab="CPM", xlab="")
axis(1, at=seq_len(nlevels(groups)), label=levels(groups), las=2)
```

We test for differences in the response to induction upon KD of TASOR.

```{r}
con <- makeContrasts((shTASOR_induced - shTASOR_neg) - 
                     (Scrambled_induced - Scrambled_neg), levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
```

Again, it looks like there's not a great effect on the transcriptional profile.

```{r}
out.dr <- topTable(gfit2, n=Inf, sort.by="none")
head(out.dr[order(out.dr$P.Value),])
```

### Testing for any effect of knockdown

For comparison, we also see what happens when we compare the TASOR-treated samples to their respective scramble controls.

```{r}
con <- makeContrasts(Scramble=shTASOR_induced - Scrambled_induced,
                     Induced=shTASOR_induced - shTASOR_neg, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
out.any <- topTable(gfit2, n=Inf, sort.by="none")
head(out.any[order(out.any$P.Value),])
```

We compile these results into a single structure and save it to file.

```{r}
drop.any <- colnames(out.any) %in% c("ENSEMBL", "SYMBOL", "AveExpr")
output <- data.frame(gy$genes, AveExpr=out.dr$AveExpr,
    Response.logFC=out.dr$logFC, Response.PValue=out.dr$P.Value, Response.FDR=out.dr$adj.P.Val,
    VsScramble.logFC=out.any$Scramble, VsUninduced.logFC=out.any$Induced,
    Any.PValue=out.any$P.Value, Any.FDR=out.any$adj.P.Val)
output <- output[order(output$Response.PValue),]
write.table(output, file="DEgenes_TASOR.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

### Testing for recovery off tetracyclin

We test for genes where there is any DE between the uninduced, induced and off-tet samples.

```{r}
con <- makeContrasts(Induced=shTASOR_induced - shTASOR_neg,
                     Recovery=shTASOR_off_tet - shTASOR_induced, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
out.rec <- topTable(gfit2, n=Inf, sort.by="none")
```

We select those genes where the sign of the log-fold change between induced and uninduced is opposite to that between off-tet and induced.

```{r}
opp.sign <- sign(out.rec$Induced)!=sign(out.rec$Recovery)
out.rec$P.Value[!opp.sign] <- 1
out.rec$adj.P.Val <- p.adjust(out.rec$P.Value, method="BH")
sum(out.rec$adj.P.Val <= 0.05)
out.rec <- out.rec[order(out.rec$P.Value),]
head(out.rec)
write.table(out.rec, file="recovered_TASOR.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

## Performing the differential analyses for SETDB-1 knockdown

### Testing for differential induction response

We have a look at the expression profile for SETDB-1.

```{r, fig.width=10, fig.height=6}
adj <- cpm(gy[gy$genes$ENSEMBL=="ENSG00000143379",])
par(mar=c(10.1, 4.1, 2.1, 2.1))
plot(as.integer(groups), adj, pch=16, xaxt="n", ylab="CPM", xlab="")
axis(1, at=seq_len(nlevels(groups)), label=levels(groups), las=2)
```

We test for differences in the response to induction upon KD of SETDB-1.

```{r}
con <- makeContrasts((shSETDB.1_induced - shSETDB.1_neg) - 
                     (Scrambled_induced - Scrambled_neg), levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
```

Again, it looks like there's not a great effect on the transcriptional profile.

```{r}
out.dr <- topTable(gfit2, n=Inf, sort.by="none")
head(out.dr[order(out.dr$P.Value),])
```

### Testing for any effect of knockdown

For comparison, we also see what happens when we compare the SETDB-1-treated samples to their respective scramble controls.

```{r}
con <- makeContrasts(Scramble=shSETDB.1_induced - Scrambled_induced,
                     Induced=shSETDB.1_induced - shSETDB.1_neg, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
out.any <- topTable(gfit2, n=Inf, sort.by="none")
head(out.any[order(out.any$P.Value),])
```

We compile these results into a single structure and save it to file.

```{r}
drop.any <- colnames(out.any) %in% c("ENSEMBL", "SYMBOL", "AveExpr")
output <- data.frame(gy$genes, AveExpr=out.dr$AveExpr,
    Response.logFC=out.dr$logFC, Response.PValue=out.dr$P.Value, Response.FDR=out.dr$adj.P.Val,
    VsScramble.logFC=out.any$Scramble, VsUninduced.logFC=out.any$Induced, 
    Any.PValue=out.any$P.Value, Any.FDR=out.any$adj.P.Val)
output <- output[order(output$Response.PValue),]
write.table(output, file="DEgenes_SETDB-1.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

### Testing for recovery off tetracyclin

We test for genes where there is any DE between the uninduced, induced and off-tet samples.

```{r}
con <- makeContrasts(Induced=shSETDB.1_induced - shSETDB.1_neg,
                     Recovery=shSETDB.1_off_tet - shSETDB.1_induced, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
out.rec <- topTable(gfit2, n=Inf, sort.by="none")
```

We select those genes where the sign of the log-fold change between induced and uninduced is opposite to that between off-tet and induced.
This is consistent with a recovery of the baseline transcriptional profile.
All other genes are assigned _p_-values of unity, under the null hypothesis that the signs are the same.

```{r}
opp.sign <- sign(out.rec$Induced)!=sign(out.rec$Recovery)
out.rec$P.Value[!opp.sign] <- 1
out.rec$adj.P.Val <- p.adjust(out.rec$P.Value, method="BH")
sum(out.rec$adj.P.Val <= 0.05)
out.rec <- out.rec[order(out.rec$P.Value),]
head(out.rec)
write.table(out.rec, file="recovered_SETDB-1.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

## Performing the differential analyses for MPP8 knockdown

### Testing for differential induction response 

We have a look at the expression profile for MPP8.

```{r, fig.width=10, fig.height=6}
adj <- cpm(gy[gy$genes$ENSEMBL=="ENSG00000196199",])
par(mar=c(10.1, 4.1, 2.1, 2.1))
plot(as.integer(groups), adj, pch=16, xaxt="n", ylab="CPM", xlab="")
axis(1, at=seq_len(nlevels(groups)), label=levels(groups), las=2)
```

We test for differences in the response to induction upon KD of MPP8.

```{r}
con <- makeContrasts((shMPP8_induced - shMPP8_neg) - 
                     (Scrambled_induced - Scrambled_neg), levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
```

Again, it looks like there's not a great effect on the transcriptional profile.

```{r}
out.dr <- topTable(gfit2, n=Inf, sort.by="none")
head(out.dr[order(out.dr$P.Value),])
```

### Testing for any effect of knockdown

For comparison, we also see what happens when we compare the MPP8-treated samples to their respective scramble controls.

```{r}
con <- makeContrasts(Scramble=shMPP8_induced - Scrambled_induced,
                     Induced=shMPP8_induced - shMPP8_neg, levels=design)
gfit2 <- contrasts.fit(gfit, con)
gfit2 <- eBayes(gfit2, robust=TRUE)
summary(decideTests(gfit2))
out.any <- topTable(gfit2, n=Inf, sort.by="none")
head(out.any[order(out.any$P.Value),])
```

We compile these results into a single structure and save it to file.

```{r}
drop.any <- colnames(out.any) %in% c("ENSEMBL", "SYMBOL", "AveExpr")
output <- data.frame(gy$genes, AveExpr=out.dr$AveExpr,
    Response.logFC=out.dr$logFC, Response.PValue=out.dr$P.Value, Response.FDR=out.dr$adj.P.Val,
    VsScramble.logFC=out.any$Scramble, VsUninduced.logFC=out.any$Induced,
    Any.PValue=out.any$P.Value, Any.FDR=out.any$adj.P.Val)
output <- output[order(output$Response.PValue),]
write.table(output, file="DEgenes_MPP8.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

# Setting up the analysis of the specific repeat elements

## Filtering, normalization and dispersion estimation

We repeat the analysis on the counts for the repeat elements.
We re-use the normalization factors from the genic analysis.
This is because it may not be reasonable to assume that most repeats are not DE when you perturb the silencing machinery.
In contrast, the assumption that most genes are not DE is potentially safer due to other layers of regulation.
Also, the counts for the repeats are prohibitively low and preclude stable normalization.

```{r}
ry <- DGEList(summed[!is.ens,], lib.size=gy$samples$lib.size * gy$samples$norm.factors)
```

We filter to remove repeats with an average count below 5.
This is less stringent than for the genic analysis, due to the fact that repeats are lowly expressed at best.
We don't use `aveLogCPM` here as it requires too much memory when there are so many rows.

```{r}
keep <- rowMeans(ry$counts) >= 5
ry <- ry[keep,]
```

We proceed to estimate the NB dispersion across all repeats.
It's probably unwise to use `voom` here as the range of counts is too low to accurately model the mean-variance relationship.

```{r}
ry <- estimateDisp(ry, design)
plotBCV(ry)
```

We use the quasi-likelihood methods to account for the uncertainty of dispersion estimation.
This also involves empirical Bayes shrinkage to stabilize the repeat-specific dispersions, the prior d.f. for which is shown below.

```{r}
qfit <- glmQLFit(ry, design, robust=TRUE)
summary(qfit$df.prior)
```

## Testing for differential repeat expression

We repeat the contrasts from before, but using the repeat counts this time.
For each comparison, we store the differential response along with the DE between shRNA and scrambled and that upon induction.
First, we do periphilin as shown below.

```{r}
con <- makeContrasts((shPeriphilin_induced - shPeriphilin_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qfit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shPeriphilin_induced - Scrambled_induced,
                     Induced=shPeriphilin_induced - shPeriphilin_neg, levels=design)
qres2 <- glmQLFTest(qfit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="repeats_Periphilin.tsv", data.frame(rownames(ry), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Next, TASOR.

```{r}
con <- makeContrasts((shTASOR_induced - shTASOR_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qfit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shTASOR_induced - Scrambled_induced,
                     Induced=shTASOR_induced - shTASOR_neg, levels=design)
qres2 <- glmQLFTest(qfit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="repeats_TASOR.tsv", data.frame(rownames(ry), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Then SETDB-1.

```{r}
con <- makeContrasts((shSETDB.1_induced - shSETDB.1_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qfit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shSETDB.1_induced - Scrambled_induced,
                     Induced=shSETDB.1_induced - shSETDB.1_neg, levels=design)
qres2 <- glmQLFTest(qfit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="repeats_SETDB-1.tsv", data.frame(rownames(ry), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Finally, MPP8.

```{r}
con <- makeContrasts((shMPP8_induced - shMPP8_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qfit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shMPP8_induced - Scrambled_induced,
                     Induced=shMPP8_induced - shMPP8_neg, levels=design)
qres2 <- glmQLFTest(qfit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="repeats_MPP8.tsv", data.frame(rownames(ry), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

# Setting up the analysis of repeat families

## Loading the counts

We need to load in a new set of counts for the repeat families.
These were obtained by using `subread` to align reads to the sequences for each repeat family in RepBase.
Read pairs were only counted into a repeat family if both reads were aligned to the same repeat with mapping quality scores above 50.
We sum together technical replicates as before.

```{r}
fam.counts <- read.table("family_counts.tsv", header=TRUE, row.names=1)[,-1]
stopifnot(identical(colnames(fam.counts), colnames(all.counts)))
fam.sum <- sumTechReps(fam.counts, sample.names)
```

## Filtering, normalization and dispersion estimation

We re-use the normalization factors from the genic analysis, as described before.

```{r}
rfy <- DGEList(fam.sum, lib.size=gy$samples$lib.size * gy$samples$norm.factors)
```

We filter to remove repeat families with an average count below 5.

```{r}
keep <- rowMeans(rfy$counts) >= 5
rfy <- rfy[keep,]
```

We estimate the NB dispersion across all repeat families.

```{r}
rfy <- estimateDisp(rfy, design)
plotBCV(rfy)
```

We also estimate the QL dispersion across all repeat families.

```{r}
qffit <- glmQLFit(rfy, design, robust=TRUE)
summary(qffit$df.prior)
```

## Testing for differential repeat expression

We repeat the contrasts from before, but using the family counts this time.
First, periphilin.

```{r}
con <- makeContrasts((shPeriphilin_induced - shPeriphilin_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qffit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shPeriphilin_induced - Scrambled_induced,
                     Induced=shPeriphilin_induced - shPeriphilin_neg, levels=design)
qres2 <- glmQLFTest(qffit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="family_Periphilin.tsv", data.frame(rownames(rfy), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Next, TASOR.

```{r}
con <- makeContrasts((shTASOR_induced - shTASOR_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qffit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shTASOR_induced - Scrambled_induced,
                     Induced=shTASOR_induced - shTASOR_neg, levels=design)
qres2 <- glmQLFTest(qffit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="family_TASOR.tsv", data.frame(rownames(rfy), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Then SETDB-1.

```{r}
con <- makeContrasts((shSETDB.1_induced - shSETDB.1_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qffit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shSETDB.1_induced - Scrambled_induced,
                     Induced=shSETDB.1_induced - shSETDB.1_neg, levels=design)
qres2 <- glmQLFTest(qffit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="family_SETDB-1.tsv", data.frame(rownames(rfy), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

Finally, MPP8.

```{r}
con <- makeContrasts((shMPP8_induced - shMPP8_neg) - (Scrambled_induced - Scrambled_neg), levels=design)
qres <- glmQLFTest(qffit, contrast=con)                     
summary(decideTestsDGE(qres))
topTags(qres)
con <- makeContrasts(Scramble=shMPP8_induced - Scrambled_induced,
                     Induced=shMPP8_induced - shMPP8_neg, levels=design)
qres2 <- glmQLFTest(qffit, contrast=con)
o <- order(qres$table$PValue)
write.table(file="family_MPP8.tsv", data.frame(rownames(rfy), logCPM=qres$table$logCPM,
    Response.logFC=qres$table$logFC, Response.PValue=qres$table$PValue, 
    Response.FDR=p.adjust(qres$table$PValue, method="BH"), 
    VsScramble.logFC=qres2$table$logFC.Scramble, VsUninduced.logFC=qres2$table$logFC.Induced,
    Any.PValue=qres2$table$PValue, Any.FDR=p.adjust(qres2$table$PValue, method="BH"))[o,],
    sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
```

# Session information

```{r}
sessionInfo()
```