Thursday, December 13, 2007

Why simulations are important

Now first of all I should say that I find the hypothesis that human evolution has sped up quite plausible. Large populations are generally able to respond more readily to new selection pressures than small ones. Human populations have been exposed to (and created) a range of new environments, to which our increasingly large population size may have allowed us to adapt rapidly.

However, the plausibility of a hypothesis does not mean that it is right, or that the burden of proof is any less. Hawks et al. have proposed an interesting hypothesis, but I feel that they have gone only a little way toward providing a convincing demonstration of it. I try to outline here some of the reasons that I have doubts, and what the authors could have done to do a better job of convincing me. Now this is obviously a somewhat selfish exercise, and I'm sure there are many other reasons why people accept or don't accept Hawks et al.'s work. This is not a complete list of the reasons why I am not convinced, but I thought that I would make a start. Also, these comments are written without having seen the supplementary material of Hawks et al. (I could not find the material as yet), so perhaps some of these issues are addressed there. This entry was written hastily and so will be a bit rough (I also accidentally published an earlier version, but it did not differ much from this).

The authors' argument can be crudely put as "we find a lot of selected events, and none of these events seem to be very old (and the number of selected events we find is inconsistent with long-term genomic patterns)". My main reasons for doubting this come down to statistical power and false positives. These issues are the make-or-break of many statistical genomics analyses.

I see relatively little evidence that the tail of the distribution of the test statistic entirely (or even mostly) represents positive selection. The simulations performed by Wang et al. are not sufficient to show that this is the case. If the number of sweeps observed cannot be trusted (or is not shown to be trustworthy), then its relationship to the expected number of sweeps under some model is dubious. Now this is not to say that the number is wrong, but what evidence do we have that it is right? Saying that certain gene categories are over-represented in the tail is not sufficient 'proof' that the cutoff is appropriate. The fact that certain gene categories are over-represented in the tail is also consistent with the authors' test having different false positive rates across different recombination environments and SNP densities. The presence of previously known targets of selection in the tail is also not sufficient evidence of an appropriate cutoff.

Extensive simulations under various null models (which are easy to conduct) are needed to show that the cutoff is appropriate. If we do not have these, we cannot trust the claimed number of ongoing sweeps, which means that we don't know what we would expect. Perhaps there are only 300 strong sweeps currently happening, perhaps there are 2000; the truth is that we (or at least I) don't really know, and in the absence of simulations we cannot know.
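To make concrete what I have in mind, here is a minimal sketch of such a null calibration, assuming the msprime and numpy libraries and using a crude haplotype-homozygosity statistic as a stand-in for the actual LDD test; the sample sizes, rates, and the 0.1% tail cutoff are illustrative assumptions, not the authors' values.

    import msprime
    import numpy as np

    def haplotype_homozygosity(ts):
        """Fraction of pairs of sampled haplotypes that are identical across
        the region -- a crude proxy for an excess of long shared haplotypes."""
        haps = ts.genotype_matrix().T  # rows = haplotypes, columns = SNPs
        _, counts = np.unique(haps, axis=0, return_counts=True)
        n = haps.shape[0]
        return np.sum(counts * (counts - 1)) / (n * (n - 1))

    def null_distribution(n_reps=1000, rho=1e-8, seed=1):
        """Distribution of the statistic under a neutral, constant-size null."""
        rng = np.random.default_rng(seed)
        stats = []
        for _ in range(n_reps):
            ts = msprime.sim_ancestry(
                samples=60, population_size=10_000,
                sequence_length=200_000, recombination_rate=rho,
                random_seed=int(rng.integers(1, 2**31)))
            ts = msprime.sim_mutations(
                ts, rate=1e-8, random_seed=int(rng.integers(1, 2**31)))
            stats.append(haplotype_homozygosity(ts))
        return np.array(stats)

    null = null_distribution()
    cutoff = np.quantile(null, 0.999)  # empirical 0.1% tail cutoff

Re-running the null with different recombination rates (the rho argument) would also show directly whether a single genome-wide cutoff has uneven false positive rates across recombination environments, as worried about above.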

The authors modify their test to look for older sweeps, as they need to be able to detect older sweeps in order to say that there is a lack of them. However, the fact that they have redesigned their test should not have been the end of the story. The authors do not provide simulations to show that the test has good power to detect old sweeps. If the authors' test does not do a good job of detecting old sweeps, then they should not be surprised by a lack of old sweeps. Without simulations one cannot judge whether the lack of old sweeps is truly a lack of old sweeps or merely a lack of power.
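Again to be concrete, here is a rough power sketch under the same illustrative assumptions, this time using msprime's single-sweep model as a caricature of a completed sweep; "detection" reuses the stand-in statistic and empirical cutoff from the null sketch above, not the real test.

    import msprime
    import numpy as np

    def power_vs_sweep_age(ages, cutoff, s=0.05, n_reps=200,
                           pop_size=10_000, seed=2):
        """Fraction of simulated sweeps of a given age that exceed the cutoff."""
        rng = np.random.default_rng(seed)
        sweep = msprime.SweepGenicSelection(
            position=100_000, s=s, dt=1e-6,
            start_frequency=1 / (2 * pop_size),
            end_frequency=1 - 1 / (2 * pop_size))
        power = {}
        for age in ages:
            hits = 0
            for _ in range(n_reps):
                ts = msprime.sim_ancestry(
                    samples=60, population_size=pop_size,
                    sequence_length=200_000, recombination_rate=1e-8,
                    # neutral drift for `age` generations since the sweep ended
                    model=[msprime.StandardCoalescent(duration=age),
                           sweep, msprime.StandardCoalescent()],
                    random_seed=int(rng.integers(1, 2**31)))
                ts = msprime.sim_mutations(
                    ts, rate=1e-8, random_seed=int(rng.integers(1, 2**31)))
                if haplotype_homozygosity(ts) > cutoff:
                    hits += 1
            power[age] = hits / n_reps
        return power

    # e.g. power_vs_sweep_age([100, 1_000, 4_000, 16_000], cutoff)

If the power curve collapses for old sweeps, then a dearth of old hits is exactly what we would see even under a constant rate of adaptive substitution.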

The authors calculate the age of the selected alleles from the extent of the long haplotypes on an allelic background and the frequency of the allele. Now the age distribution of the detected alleles shows a distinctive peak and trails off rapidly, consistent with a lack of old sweeps. However, if the tail is mostly neutral false positives, that is also what we would expect. False positives are false positives because they look interesting; to look interesting to the Wang et al. test (and other tests based on haplotypes), one of the alleles at a SNP must be associated with long haplotypes, and so the test will find interesting precisely those alleles that look young.
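For intuition about why the hits will look young, consider a standard back-of-the-envelope age estimator of this flavour (not necessarily the precise calculation in Hawks et al.): under a star genealogy, an ancestral haplotype of genetic length c Morgans survives g generations of recombination with probability roughly exp(-cg), so the age can be estimated from the fraction of carrier chromosomes with intact haplotypes.

    import math

    def allele_age_generations(p_intact, c_morgans):
        """Estimate allele age from p_intact, the fraction of carrier
        chromosomes whose ancestral haplotype is unbroken out to a genetic
        distance of c_morgans from the focal allele."""
        return -math.log(p_intact) / c_morgans

    # e.g. if half the carriers share an intact 1 cM haplotype:
    # allele_age_generations(0.5, 0.01) -> ~69 generations (~1,700 years)

By construction, an estimator of this kind returns a small age exactly when the haplotypes are long, and long haplotypes are precisely what puts an allele in the tail of the test in the first place.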

Even if the tail contained only true positives, the distribution of ages would still be strongly biased towards young, strongly selected alleles. The true positives (the loci that are under selection and end up in the tail) will be associated with recent allele ages, as they have to have unusually long haplotypes to be in the tail of the test. The lack of old 'sweeps' is therefore not surprising: the authors' test is predicated on finding young, strongly selected alleles. Weakly or anciently selected alleles will be preferentially absent from the tail.

Now it is possible that the skew towards young sweeps (if we believe they are sweeps) is inconsistent with a constant influx of selected mutations. But as we do not know how the power of the authors' method depends on allele age, we cannot judge this. The authors could simulate a constant-sized population with a selected mutation having arisen at some (uniformly distributed) point in the past, keep the simulations that were significant by their cutoff, and then calculate the sweep age distribution of the significant simulations. If this distribution looked different from the observed distribution for a wide range of selection coefficients, then the authors could offer this as evidence for their hypothesis. A sketch of this experiment is given below.
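A minimal sketch of that experiment, assuming a constant-sized Wright-Fisher population and additive selection; looks_significant is a hypothetical placeholder for simulating haplotype data around each trajectory and applying the actual test at its cutoff.

    import numpy as np

    rng = np.random.default_rng(3)

    def wf_trajectory(t_origin, s, n=10_000):
        """Forward-simulate a selected allele from a single copy, starting
        t_origin generations ago; return its present-day frequency."""
        x = 1 / (2 * n)
        for _ in range(t_origin):
            x = x * (1 + s) / (x * (1 + s) + (1 - x))  # selection
            x = rng.binomial(2 * n, x) / (2 * n)       # binomial drift
            if x == 0.0 or x == 1.0:
                break
        return x

    def looks_significant(present_freq):
        # HYPOTHETICAL placeholder: the real experiment would simulate
        # haplotype data around each trajectory and apply the authors'
        # test at its cutoff, so that power as a function of allele age
        # is measured rather than assumed.
        return 0.2 < present_freq < 0.98

    origin_times = rng.integers(40, 2_000, size=1_000)  # uniform influx
    kept = [t for t in origin_times
            if looks_significant(wf_trajectory(int(t), s=0.02))]
    # The histogram of `kept` is the age distribution conditional on
    # detection; compare it with the observed distribution across a
    # range of selection coefficients s.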

Now there is an obvious and frequently cited caveat to the mantra 'do simulations', which is that simulations are based on a particular model of human evolution. Models are based on questionable assumptions and so cannot be completely trusted. However, showing that for a particular set of assumptions your test works and gives you what you expect is very helpful. In the absence of simulations, I think we lack assurance that the authors have strong evidence for their hypothesis.

1 comment:

John Hawks said...

I appreciate your concern for simulations to model the ascertainment biases of our inferred selection events.

But of course, Wang et al. (2006) performed extensive simulations to validate the performance of the test. Their threshold was chosen at a value where false positives were observed to be exceedingly rare.

Nobody has yet shown that they were wrong, to my knowledge, and of course the current paper relies on the earlier work validating the method.

So, notwithstanding what you say about the importance of simulations, your only real criticism is the assertion that "simulations performed by Wang et al are not sufficient to show that this is the case."

What is your support for that assertion? If the LDD test is really that bad -- if it is really identifying more than 95 percent false positives, for instance -- then how does this come about?

Now, I agree that the test must miss a substantial fraction of older selection events, and not only because they have risen above the ascertainment range. But "older" in the context of the sample means older than 20,000 years old. If we added some substantial number of such older events to our total, it doesn't weaken the evidence of acceleration -- it strengthens it.

Every positively selected allele that is still segregating in humans makes the evidence of acceleration stronger, because once the count is above a few hundred, it is at least 10-fold higher than any credible rate across the previous six million years of our evolution.