Friday, December 28, 2007

Another comment on Hawks et al

John Hawks has recently published a post commenting on what I thought was a pretty decent post by evolgen, about the rate of adaptive vs neutral evolution. I take issue with a number of points raised in the Hawks post, and in Hawks et al. Firstly, as I discussed previously, I think that Hawks et al have not shown a speed-up in the rate of selective sweeps. Also, there are a number of aspects of the Hawks et al analysis (particularly the cutoff and the inclusion of the parents and children in the HapMap analysis; see the post and comments at gene expression) which are very non-standard and which (at least to my knowledge) have not been shown to be appropriately robust, nor seemingly been discussed in the paper.

The issue that I want to discuss here is the one that Hawks raises: that the rate of genetic drift has slowed and that the rate of sweeps is expected to have increased dramatically in the past 40,000 years. This idea is presumably based on the graph of population size (Fig 2) in Hawks et al. However, this graph seems to be of census population size rather than effective population size, though in the paper the two seem to be treated interchangeably.

Now the rate of genetic drift and the efficiency of selection in a population are governed by the effective population size. The effective size of a population (Ne) is often much smaller than the census size of the population, as it is affected by the variance in reproductive success between individuals. The effective population size is very hard to predict as it depends on many demographic factors. Ne can be estimated from patterns of genetic diversity, and the long-term estimated effective population size of humans is ~10,000 (similar to many great apes), reflecting the strong effect of genetic drift in the evolutionary history of humans. Now human populations have recently expanded dramatically in size. All else being equal, human effective population sizes will have increased as well, but there is no good way of guessing how much they have increased by. For example, rapidly fluctuating population sizes result in a smaller Ne. If the fluctuations are sufficiently rapid, Ne can be approximated by the harmonic mean of the population size over time. Thus Ne is closer to the lowest values taken by the population over time than to the average population size. If the population size increases on average over time but with frequent crashes, the effective population size will also increase, but much less dramatically.
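To make the harmonic-mean point concrete, here is a minimal Python sketch. The census sizes are invented for illustration (they are not estimates for any human population) and the function name `harmonic_mean_ne` is my own. It assumes that the fluctuations are rapid enough for the harmonic-mean approximation to apply, and that each generation's size can stand in for that generation's effective size.

```python
from statistics import harmonic_mean, mean


def harmonic_mean_ne(census_sizes):
    """Approximate long-term Ne as the harmonic mean of per-generation sizes.

    Assumes the size in each generation can stand in for that generation's
    effective size, which ignores variance in reproductive success.
    """
    return harmonic_mean(census_sizes)


# Hypothetical history: eight generations at 100,000 and two crashes to 1,000.
sizes = [100_000, 100_000, 1_000, 100_000, 100_000,
         100_000, 1_000, 100_000, 100_000, 100_000]

average_size = mean(sizes)            # arithmetic mean of the sizes
approx_ne = harmonic_mean_ne(sizes)   # dominated by the two crashes

print(f"arithmetic mean: {average_size:,.0f}")
print(f"harmonic mean:   {approx_ne:,.0f}")
```

In this made-up history the two crashes pull the harmonic mean down to roughly 4,800, against an arithmetic mean of 80,200.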

The usual (and perhaps only real) way to learn about how Ne has changed through time is to model patterns of genetic diversity, and numerous authors have done this (see for recent example ). The genetic diversity data is consistent with a history of mild recent growth in Bantu-speaking Africans and a moderate strength bottleneck (a burst of genetic drift) into Europe and East Asia.

All this is a long way of saying that the graph of population size increase in Hawks et al is, I think, somewhat misleading (as is Hawks' claim that "human populations reached an effective size on the order of 100,000 -- certainly by 40,000 years ago"). The graph is of population size estimated from archaeological data, not effective population size. I have no doubt that the effective population size of humans is now large, but when and how this came to be is much harder to tell. The estimation of effective population sizes through time must be from population genetic data. The archaeological data is hopefully a good predictor of average census population size, but it cannot be taken at face value as a predictor of genetic drift or the rate of selective sweeps; to do this you need estimates of the effective population size through time. Now the estimation of effective population sizes from genetic diversity data can be confounded by selection (if selection is very pervasive), and so is perhaps somewhat conservative, but it is the correct approach to take.

Many of the patterns we see in genetic diversity in humans are likely to be the result of genetic drift. Now I have no problem with the idea that the dramatic differences in allele frequencies between populations seen at some loci are due to selection. But I believe (and I am willing to change my mind) that the bulk of the patterns seen (e.g. the reduction of genetic diversity moving away from Africa) are due to genetic drift, not hitch-hiking. Thus the majority of markers are useful for learning about human history, but we should always be careful to consider selection as an alternative explanation.

Hawks' post also criticized the phrase 'invoke selection'. I think that invoke is a somewhat charged word (to some people), and perhaps should be avoided. But the fact remains that population genetic tests of selection are, and will continue to be, based on rejecting the null hypothesis of neutral or nearly neutral evolution.

Saturday, December 22, 2007

Are males just simpler?

There is an interesting article in PNAS on the differences in inheritance in male and female Drosophila by Wayne M et al.

Sexual selection is a key driving force in evolution. Sexual dimorphisms (differences between the sexes) are one of the most obvious patterns in nature, from peacocks' tails to the elaborate genitalia of insects. So there is obviously a lot of interest in understanding how sexual dimorphisms arise, and a lot of evolutionary theory (see evolgen for example) devoted to understanding how they evolve. Another pattern in nature is that male traits evolve quickly (compared to female traits) between species. The authors discuss how differences in the form taken by genetic variation in males and females could be a somewhat neglected component of this rapid evolution in male traits.

Before I can discuss their results we need a brief aside into additive and dominant components of genetic variance. See also gene expression for an explanation (which may well be clearer than my own).
A mutation is said to act in an additive manner if the phenotype of heterozygotes is exactly intermediate between those of the two homozygotes. If the heterozygote phenotype is not intermediate between the homozygotes (for example, it resembles one of them more) then the mutation shows dominance. Now often we don't know (or perhaps care about) the mutations underlying a trait, as there may be many mutations controlling it.

Crosses of individuals can be performed to dissect the genetic basis of a trait, and we can learn about the amount of variation in a trait that can be explained by different kinds of 'variance' (additive or dominant, and higher epistatic terms that we shall not worry about here). Now the additive and dominant contributions to the variance are not the properties of individual mutations, but are the properties of the (perhaps many) mutations present in the parents in a cross (which depends on the make-up of the population). A trait can show purely additive variation, with the offspring of a cross always being exactly intermediate between the parents. Or the offspring can resemble one of the parents more, suggesting that the phenotype is not determined merely by the additive sum of the mutations contributed by the parents and has some dominance component. The additive component of a trait in a cross (or a population) is defined to be the amount of genetic variation in a trait attributable to the average contribution of a parent's genotype to the phenotype of the child (i.e. what a parent contributes to a child, averaged across the possible matings). The dominance component is the remaining proportion of variation in the phenotype assignable to the specific genotype of the child (beyond the average contribution of the allele from each parent), i.e. the particular combination of alleles that the child receives from its parents.

The proportion of the variation in a trait that is additive genetic variation (the narrow-sense heritability) is a very important quantity, as natural selection changing the mean of a trait usually only acts on the additive variation in a trait. Imagine natural selection acting to change the mean of a phenotype, e.g. selection for longer horns. A mother with a particular genotype might do very well for herself (i.e. have many successful offspring due to her genotype giving her a phenotype of long horns), but she only passes one of her alleles to each child. The phenotype of the child is determined by the combination of the allele from the mother and the allele from the father. Thus it is the average (across fathers) contribution of one of the mother's alleles to the child which matters for how the mean of the trait changes. (Note that this is not to say that dominant mutations are not selected upon, as dominant mutations also contribute to additive variation.)
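The parenthetical point, that a dominant mutation still contributes additive variance, can be checked with the standard single-locus quantitative genetics formulas. The sketch below is only an illustration: it assumes one biallelic locus under random mating, uses the usual textbook parameterization (genotypic values +a, d and -a), and the function names and example frequencies are my own choices.

```python
def variance_components(p, a, d):
    """Return (additive, dominance) variance for one biallelic locus.

    p: frequency of allele A1 (A2 has frequency q = 1 - p).
    a: genotypic value of A1A1 is +a and of A2A2 is -a.
    d: genotypic value of the heterozygote (d = 0 means purely additive).
    Assumes random mating (Hardy-Weinberg genotype frequencies).
    """
    q = 1.0 - p
    alpha = a + d * (q - p)          # average effect of an allele substitution
    additive = 2.0 * p * q * alpha ** 2
    dominance = (2.0 * p * q * d) ** 2
    return additive, dominance


def total_genetic_variance(p, a, d):
    """Variance of genotypic values computed directly, as a cross-check."""
    q = 1.0 - p
    freqs = [p * p, 2.0 * p * q, q * q]
    values = [a, d, -a]
    mean_value = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean_value) ** 2 for f, v in zip(freqs, values))


# A fully dominant allele (d = a) at two hypothetical frequencies.
for p in (0.5, 0.1):
    va, vd = variance_components(p, a=1.0, d=1.0)
    print(f"p={p}: additive={va:.4f}, dominance={vd:.4f}, "
          f"total={total_genetic_variance(p, 1.0, 1.0):.4f}")
```

Even with complete dominance, most of the genetic variance at this hypothetical locus is additive at both frequencies, and the two components sum to the directly computed total.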

So traits that have large additive genetic components are more easily changed by selection. The authors set out to investigate the genetic basis of gene expression level in Drosophila (for many genes measured on arrays) and how this differs between males and females. By a series of crosses, the authors find that many more genes show additive genetic variation in their expression level in males than in females, and that a number of these genes are found on the X chromosome (as well as on the autosomes). Now the X chromosome seems to be the key to this difference in the form of genetic variation (and the authors conduct further experiments to show this). As males have only a single X chromosome, there simply is not any dominance due to genetic variation on the X chromosome (at least not simple non-epistatic dominance). The genetic variation on the X chromosome in males mainly contributes to the additive genetic component of variation (as there is no second allele to cause dominance). So genes on the X with cis-acting mutations that affect gene expression will be inherited in an additive manner. Also, mutations on the X which affect gene expression in trans (on the autosomes) will contribute to the additive genetic component of the trait in males but can be involved in the dominance component in females. So the expression level of many genes is more additive, and so more easily selected upon, in males than in females, because the single copy of the X chromosome makes things simpler in males. I particularly like this idea that trans modifiers of gene expression on the X chromosome make gene expression simpler to select upon in males.

This paper is a nice demonstration of the power of crosses combined with gene expression traits to reveal general patterns of inheritance. These patterns are key to understanding the raw material that evolution acts upon, and such studies are vital if we are to dissect the historical patterns found in genomic data.

Saturday, December 15, 2007

RE why-simulations-are-important

Thank you for commenting on my post John. I appreciate you discussing your paper with me, I understand that this is a very busy time for you.

My criticism of Hawks et al is that neither Wang et al nor Hawks et al did sensible simulations, which is a concern as the false positive rate and power of the test are key to Hawks et al's statements about the number of sweeps. Most previous selection-scan papers used empirical cutoffs and shied away from making strong statements about the true number of selective events. They often (though there are also plenty of papers which do not even do this) simply performed simulations to show that their tails are likely to be enriched for true positives. This is not because researchers are not interested in the rate of adaptation (or whether it has changed), or lack the 'mathematics' to understand the concept, but because it is a very tough statistical problem, one which I don't think that Wang et al (or the subsequent analysis by Hawks et al) solved.

I know that Wang et al did simulations. However, I would argue against the relevance of the simulations. For example, one of the simulations permuted the SNPs on a chromosome, which does not simulate under a sensible null as it removes the correlation between neighboring SNPs (which is exactly what can generate false positives). In other simulations they generated bottleneck data via an ad-hoc resampling procedure, but I am not convinced that this is a sensible procedure, as there is little to show that this is an appropriate null (the concordance of D' is not sufficient, and none of the simulations look to my (admittedly tired) eye very like the data outside of the tails, which is somewhat worrying). Thus I do not know whether the cutoff proposed by Wang et al is appropriately conservative. Now nobody has shown that the cutoff is inappropriate, but previously no one was asked to place a very strong belief in the cutoff. Thus the Wang et al paper is a potentially interesting source of loci worth further investigation (as are many other scans of selection), but I've not seen any evidence to convince me that the majority (or even a reasonable fraction) of the Wang et al candidates are true sweeps. I'm not saying that I disbelieve in tests of selection, but to make high-profile statements about the rate of selected mutations (and how that has changed through time) requires good evidence.

If no power simulations have been performed it is hard to judge whether the method has good power. Poor power would not automatically imply that you should assume that there are many more selective events than given by your cutoff, merely that the test only recovers a small, biased subset of the true events (which might be a small number). The tail of the distribution of the test will (like all tests) contain false positives and true positives; the ratio of these two quantities depends on the false positive rate, the true number of selective sweeps and the power of the test to detect these sweeps. The absence of good simulations means that people are unable to judge whether the 0.5% tail of the test contains mostly false positives or mostly true positives (or somewhere in between). The statement that the tail (thousands of putative sweeps) contains mostly true positives is obviously only true if there are thousands of sweeps. If, on the other hand, there are only a few detectable sweeps, the tail will contain mostly false positives. Now you might argue that there are thousands of sweeps (and I might be inclined to agree with you), but in the absence of good simulations/evidence this is just an assertion.
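The arithmetic behind this argument is short enough to sketch in Python. Every number below is hypothetical: the count of tested loci, the false positive rate, the power and the numbers of true sweeps are chosen only to show how the composition of the tail swings with the true number of sweeps, and treating the cutoff as a 0.5% false positive rate is itself an assumption, since nobody has shown what the real rate is.

```python
def tail_composition(n_loci, n_sweeps, false_positive_rate, power):
    """Expected true and false positives among loci that pass the cutoff."""
    expected_false = (n_loci - n_sweeps) * false_positive_rate
    expected_true = n_sweeps * power
    fraction_true = expected_true / (expected_true + expected_false)
    return expected_true, expected_false, fraction_true


# Hypothetical scan: 100,000 tested loci, 0.5% false positive rate, 50% power.
for n_sweeps in (50, 3000):
    true_pos, false_pos, frac = tail_composition(100_000, n_sweeps, 0.005, 0.5)
    print(f"{n_sweeps} sweeps: {true_pos:.0f} true, {false_pos:.0f} false, "
          f"{frac:.0%} of the tail is real")
```

With these made-up inputs the tail holds about 500 false positives either way, so it is mostly noise if there are 50 real sweeps and mostly signal if there are 3,000. Without simulations to pin down the false positive rate and the power, the data alone cannot tell us which case we are in.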

If you want to show that there is an absence of older sweeps in the data, a natural null to simulate from would be one with a constant influx of selected mutations. The method of Wang et al (or more likely a combination of methods, to improve power across the frequency range) could then be applied to the simulated data, to see whether a very different pattern is found from that observed in the real data. If it was (for a wide range of demographic parameters) then this would offer a reasonable demonstration of the hypothesis.

Now on population genetic theory grounds I'm inclined to believe the assertion that the rate of evolution might well have changed throughout human history. But I'm not convinced that Hawks et al shows that human evolution has sped up, though I'm quite willing to believe that the assertion might be proved correct.

Thursday, December 13, 2007

Why simulations are important

Now first of all I should say that I find the hypothesis that human evolution has sped up quite plausible. Large populations are generally capable of responding more readily to new selection pressures than small ones. Human populations have been exposed to (and created) a range of new environments, to which our increasingly large population size may have allowed us to adapt rapidly.

However, the plausibility of a hypothesis does not mean that it is right, or that the burden of proof is less. Hawks et al have proposed an interesting hypothesis, but I feel that they have gone relatively little of the way to providing a convincing demonstration of it. I try to outline here some of the reasons that I have doubts and what the authors could have done to do a better job of convincing me. Now this is obviously a somewhat selfish exercise, and I'm sure there are many other reasons why people accept or don't accept Hawks et al's work. This is not a complete list of reasons why I am not convinced by Hawks et al, but I thought that I would make a start. Also, these comments are written without having seen the supp. material of Hawks et al (I could not find the material as yet), so perhaps some of these issues are addressed there. This entry was written hastily and so will be a bit rough (I also accidentally published an earlier version, but it did not differ much from this).

The authors' argument can be crudely put as "we find a lot of selected events and none of these events seem to be very old (and the number of selected events we find is inconsistent with long-term genomic patterns)". My main reasons for doubting this come down to statistical power and false positives. These issues are the make or break of many statistical genomics analyses.

I see relatively little evidence that the tail of the distribution of the test all (or even mostly) represents positive selection. The simulations performed by Wang et al are not sufficient to show that this is the case. If the number of sweeps observed cannot be trusted (or is not shown to be trustworthy) then its relationship to the expected number of sweeps under some model is dubious. Now this is not to say that the number is wrong, but what evidence do we have that it is right? Saying that certain gene categories are over-represented in the tail is not sufficient 'proof' that the cutoff is appropriate. The fact that certain gene categories are over-represented in the tail is also consistent with the authors' test having different false positive rates over different recombination environments and SNP densities. The presence of previously known targets of selection in the tails is also not sufficient evidence of an appropriate cutoff.

Extensive simulations under various null models (which are easy to conduct) are needed to show that the cutoff is appropriate. If we do not have these, we can not trust the claimed number of on-going sweeps, which means that we don't know what we would expect. Perhaps there are only 300 strong sweeps currently happening, perhaps there are 2000, the truth is that we (or at least I) don't really know, and in the absence of simulations we can not know.

The authors modify their test to look for older sweeps, as they need to be able to detect older sweeps in order to say that there is a lack of older sweeps. However, the fact that they have redesigned their test should not have been the end. The authors do not provide simulations to show that they have good power to detect old sweeps. If the authors' test does not do a good job of detecting old sweeps then they should not be surprised by a lack of old sweeps. Without simulations one cannot judge whether the lack of old sweeps is truly a lack of old sweeps or a lack of power.

The authors calculate the age of the selected alleles by a calculation based on the extent of the long haplotypes on an allelic background and the frequency of the allele. Now the age distribution of the alleles detected shows a distinctive peak and trails off rapidly, consistent with a lack of old sweeps. However, if the tail is mostly neutral false positives, that is also what we would expect. False positives are false positives because they look interesting. To look interesting to the Wang et al test (and other tests based on haplotypes), one of the alleles at a SNP must be associated with long haplotypes, and so the test will find interesting the alleles that look young.

Even if the tail only contained true positives, the distribution of ages would still be strongly biased towards young, strongly selected alleles, as the true positives (the loci that are under selection and end up in the tail) have to have unusually long haplotypes to be in the tail of the test, and so will be associated with recent allele ages. The lack of old 'sweeps' is therefore not surprising: the authors' test is predicated on finding young, strongly selected alleles. Weak or old selected alleles will be preferentially absent from the tail.

Now it is possible that the skew towards young sweeps (if we believe they are sweeps) is inconsistent with a constant influx of selected mutations. But as we do not know how the power of the authors' method depends on the allele age, we cannot judge this. The authors could do simulations of a constant population, with a selected mutation having arisen at some (uniformly distributed) point in the past. They could then keep the simulations that were significant by their cutoff and calculate the sweep age distribution of those significant simulations. Now if this distribution looked different from the observed distribution for a wide range of selection coefficients, then the authors could offer this as evidence of their hypothesis.
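A toy version of that exercise shows why the power curve matters so much. The Python sketch below is not a population genetic simulation: it draws sweep ages uniformly (a constant influx) and then keeps each sweep with a probability that halves every `half_life` generations. That decaying power curve, the parameter values and the function name are all invented for illustration; a real analysis would replace the power curve with the detection rate of the actual test on properly simulated sweeps.

```python
import random
from statistics import mean


def simulate_detected_ages(n_sweeps, max_age, half_life, seed=1):
    """Draw uniformly distributed sweep ages and keep those that are 'detected'.

    Detection probability is a hypothetical curve that halves every
    half_life generations; it is a stand-in for the unknown power of a test.
    """
    rng = random.Random(seed)
    all_ages, detected_ages = [], []
    for _ in range(n_sweeps):
        age = rng.uniform(0.0, max_age)       # constant influx: uniform ages
        power = 0.5 ** (age / half_life)      # hypothetical power curve
        all_ages.append(age)
        if rng.random() < power:
            detected_ages.append(age)
    return all_ages, detected_ages


all_ages, detected_ages = simulate_detected_ages(
    n_sweeps=20_000, max_age=80_000, half_life=10_000)
print(f"mean age of all sweeps:      {mean(all_ages):,.0f}")
print(f"mean age of detected sweeps: {mean(detected_ages):,.0f}")
```

Under these assumptions the detected sweeps are far younger on average than the full set, even though the influx of sweeps was constant by construction. An observed excess of young sweeps therefore only counts as evidence once it has been compared against what the test's real power curve would produce.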

Now there is an obvious and frequently cited caveat to the mantra 'do simulations', which is that simulations are based on a particular model of human evolution. Models are based on questionable assumptions and so cannot be completely trusted. However, showing that for a particular set of assumptions your test works and gives you what you expect is very helpful. In the absence of simulations, I think we lack assurance that the authors have strong evidence for their hypothesis.

Saturday, December 8, 2007

a couple of interesting blog pieces

Pondering Pikaia on how the globin genes of Penguins have evolved to help them dive.

and Jonathan Eisen at tree of life on Nancy Moran's work on the genomes of symbionts (just out in PNAS). I saw Nancy Moran give a talk on her work (a year or two ago) and I have to agree with Eisen that she does really great and pretty work. Her talk has definitely stuck with me for a long time.

Asymmetric interactions between species and invasion.

First of all I should start by saying that everything on this blog (about peer-reviewed research) comes with the disclaimers 'I may well have misunderstood the paper or be incorrect in my thinking' and 'I think that this is interesting and novel, but I may have missed a flaw in the authors' logic or prior work on this topic'. This is especially true when I talk about papers outside my area. While I'm pointing out (fairly obvious/pointless) disclaimers, spelling and grammar are not my strengths (especially as I'm trying to minimize my time writing this blog), so I apologize in advance for that. One of the reasons for writing this blog is to improve (and speed up) my science writing skills, so hopefully I'll get better with time.

Anyway, all of that aside, here's an interesting article on the pre-release server in Science.

I came across it in a short segment in the Science magazine podcast, which might be a good way of hearing the main results if you can't access the article (one of the annoying things about blogging on non-open-access research). The paper discusses the invasion/replacement of one species of whitefly by another. Let's call the invading species A and the species being replaced B. This replacement has happened quickly (in the early 90's and 00's) and in at least two distinct locations (China and Australia). The authors investigate one cause of this replacement, asymmetric breeding behavior between the two species.

The species are haploid-diploid: males are born from unfertilized eggs (hence they are haploid) while females are born from fertilized eggs (diploid). Thus for a mother to give birth to females she has to mate.

The authors note that the two species (A and B) don't produce viable (female) offspring, and while they perform courtship they do not copulate. The fact that cross-species pairings court is key, as it means that A and B waste each other's time (if they occur in the same location). The authors perform experiments that show that the A females cope with this by mating a lot more, while B females fail to do this. This reduces the number of matings between B males and B females if species A is present, which reduces the number of fertilizations of the eggs of B females. This in turn reduces the number of B females in the population, while the A females mate more and so produce more females. This allows species A to (potentially) expand more rapidly and so replace species B. The authors record this sex-ratio change in both species in the wild (where they overlap during the invasion), suggesting that this process does occur.

We often expect mating interaction between closely related species to be asymmetric so this is a really fun example of how mating interactions between closely related species can lead to the differential success of species.

Also, I think that the species don't have to be haploid-diploid for this to be a factor in the survival of the species (though this form of sex-ratio change does require it). Imagine two closely related species C and D. Further, imagine that C males waste D females' reproductive opportunities more than D males waste C females' reproductive opportunities. Then, all else being equal, C females will have more successful matings than D females and so species C will potentially outcompete species D. This form of between-species sabotage could take many forms, from simply getting in the way of courtship to successfully mating with females of the other species and making them invest in sterile/inviable offspring.

I'm guessing that these competition ideas have been explored in depth (the authors of the Science paper cite a couple of book chapters that I intend to read someday).

I think that these ideas also link into why females might be the ones to initially evolve reinforcement, which is the evolution of behavior to avoid mating with closely related species (see Coyne and Orr's chapter on reinforcement for a discussion of potential differences between the sexes).

Thus the process of speciation and the survival of recently diverged species can both depend on these concepts.

Sunday, December 2, 2007

sex ratio distortion

There are a couple of interesting papers (1, 2) over at PLoS Biology (also discussed here) studying a sex ratio distortion system. This is a type of mutation which causes the sex ratio of the bearer's offspring to differ from the usual 50:50. If the mutation occurs on a sex chromosome, such a mutation can spread through the population. For example, a mutation on the X chromosome which causes an XY father to have more female offspring (i.e. XX offspring) will benefit itself, as the offspring will carry the X chromosome more often. This can lead to situations where the population is very strongly biased towards a particular sex, e.g. females.

Now, as this sex-ratio mutation spreads through the population, mutations that block this distorting mutation can arise on the autosomes, returning an individual's sex ratio (of their offspring) to 50:50. These mutations are in turn strongly favoured. For example, if the population is mostly female (due to the sex-ratio mutation) and you start having sons, your sons will father a lot of children.
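The advantage of producing the rare sex comes from simple bookkeeping: every child has exactly one mother and one father, so males as a group and females as a group are each parents to all of the offspring. A minimal Python sketch, with made-up population numbers and a function name of my own choosing:

```python
def mean_offspring_per_parent(n_offspring, n_females, n_males):
    """Average number of offspring per female and per male.

    Every offspring has one mother and one father, so each sex as a whole
    is credited with all n_offspring; the rarer sex shares them among
    fewer individuals.
    """
    per_female = n_offspring / n_females
    per_male = n_offspring / n_males
    return per_female, per_male


# Hypothetical population distorted to 90 females for every 10 males.
per_daughter, per_son = mean_offspring_per_parent(1000, n_females=90, n_males=10)
print(f"average offspring per female: {per_daughter:.1f}")
print(f"average offspring per male:   {per_son:.1f}")
```

In this hypothetical population a son leaves nine times as many offspring on average as a daughter, which is why a mutation that restores the production of sons is so strongly favoured.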

These papers study such a system between two closely related species of Drosophila. One of these species has both the sex-ratio mutation and the autosomal restorer mutation, while the other has neither. By performing crosses between the two species the authors unmask the sex-ratio mutation and the autosomal mutation, and so can map the location in the genome of both mutations. They have characterized both the sex-ratio mutation on the X chromosome and the mutation on the autosomes that masks its effect, returning the sex ratio to normal. It is a really pretty system.

The authors of the paper point to a number of other examples of sex-ratio distortion loci (and their suppressors) in Drosophila. These kinds of conflicts are happening all of the time.
Another example of sex-ratio distortion is parasites (e.g. Wolbachia) that are transmitted only to daughters. These parasites often kill the male offspring to further their own transmission, and the host then evolves mechanisms to suppress these effects.

The rate at which these sex-ratio distorters evolve and are suppressed can be phenomenal. Here, for example, is a paper which reports on a population of butterflies where the sex ratio was distorted 100:1 and then returned to 50:50 within 10 generations.

In fact one wonders whether evolution to resolve these conflicts is as common as, or more common than, evolution in response to the external environment. A sex-ratio distorting mutation can actually drive a population extinct if a suppressor does not arise in time.

I love these sex-ratio distortion systems as they are a wonderful example of evolution in action. They also show that evolution does not always act to improve an organism or a population; in fact, the initial distorter is deleterious for the population.