# A deeper explanation of spike-in normalization 

The assumption is that the coverage of endogenous genes are also affected by $L_i$ and $R_{i0}$ (which we assume to be a scalar transformation of $R_{is}$),
while everything else is constant across cells (see Equation 1 and the equation in Supplementary Section 1).
This means that dividing by the spike-in coverage will eliminate the effect of these cell-specific variables.
The normalized expression for gene $g$ in each cell $i$ can then be written as $zN_{ig}$,
where $z$ is constant across all cells and includes all of the endogenous and spike-in constants, i.e., $r_s$, $c_s$, $l_s$, $V_s$ (presumably) and so on.
This ensures that the normalized values are comparable across cells.

We can extend this reasoning by considering the case where we add different _known_ amounts of spike-in RNA.
For example, consider a cell $i^*$ in which we added twice as much spike-in RNA as the others.
The coverage of the spike-in RNA may not necessarily be doubled due to the composition effects in $L_i$.
Nonetheless, if we proceed to the calculation of the normalized values, we will obtain a normalized expression of $zN_{i^*g}/2$.
This is half the correct value, and can be fixed by halving the spike-in size factor prior to normalization.
We don't have to worry about the effect of doubling on composition bias, as the composition bias cancels out anyway during normalization.

# Explaining the simulation results

Variance ranking should largely be robust to spike-in variability.
Any variability should affect all genes equally, such that the ranking shouldn't change much.
Testing for significance over tehnical noise will be more sensitive, as the technical estimates are normalized by assuming no variability.
Any variability is absorbed into the total variability, resulting in an overestimate of the biological variability.
The extent to which this affects things will depend on the magnitude of the variability.

For DE analyses, the ranking should also be robust, provided that sufficient cells are present in each condition.
Any variability will average out across the condition, such that the log-fold change between conditions is unaffected.
Of course, the variance of each gene will be increased by the same amount.
This won't affect the rankings, as each gene will be affected; however, it will result in loss of power.
Whether this results in a substantial effect in detection depends on the size of the increase.

Clustering is likely to be most sensitive to spike-in variability.
This is because clustering (and dimension reduction) treats each cell as its own entity.
Fluctuations in spike-in quantities will affect the quantification and inferences for that cell.
There is no averaging out that can robustify against changes at a per-cell level.

# Alternative methods to accounting for total RNA content

Assume that the expectation of the bias in capture efficiency is the same between cell types.
This means that the average size factor across cells in each cluster can be used as a relative measure of total RNA content, without needing spike-ins.
One could theoretically use this to adjust the log-fold changes in expression between clusters to incorporate changes in total RNA content.
However, this strategy assumes that no library quantification was performed, and that there is no competition for capture/sequencing resources between transcripts in each cell.
Such effects would result in non-equal expectations of the biases between cell types.
It also assumes that you can obtain sensible clusters in the first place without using information about total RNA content.
This may not be possible, and not normalizing at all would require the clustering to deal with large variability in capture efficiency or sequencing depth across cells.
In such cases, use of spike-ins would be preferable for directly separating the technical effects from biological changes in RNA content.

# Concerning accuracy of spike-in normalization

We did not discuss the _accuracy_ of spike-in normalization, only focusing on _precision_.
In hindsight, we should have also done the former.
This would ideally be done with a pool of endogenous RNA at varying (known) dilutions, mixed with a constant amount of spike-in RNA.
The aim would be to see if the dilutions could be successfully recovered.
I guess this would be expected, given the history of work that has gone into showing that sequencing is a linear measurement technique.