Friday, April 25, 2008

Would a gene by any other name smell as sweet?

I was thinking about blogging about the paper on metal tolerance evolution in Arabidopsis halleri via a cis-regulatory change (Hanikenne et al), but I see that gnxp has already done so here. So, in the vein of the 'cis-regulatory vs protein' evolution debate, I thought I would point people towards a recent paper (Scalliet et al, and here for a commentary) looking at a phenotypic change involving a coding change in roses. The plant in question is the Chinese rose, which apparently is where the garden variety hybrid tea rose (Chinese x European) gets its scent from. The final part of the pathway underlying the rose's scent involves two genes (OOMT1 and OOMT2), and the authors identify a single residue underlying the crucial difference in the specificity of the two proteins. OOMT1 appears to have arisen by a gene duplication in Chinese roses, as it is absent in other roses. So here is a case of recent gene duplication followed by protein divergence underlying a novel phenotype.

Interestingly, the Arabidopsis halleri paper identifies both a change in copy number and a cis-regulatory mutation underlying a phenotype. Changes in copy number can be considered regulatory mutations (as they can change the expression level of a protein), and can be selected for because of this; a recent example is the amylase copy number variation (Perry et al; see the commentary on this paper by Coyne and Hoekstra). Subsequent selection pressures may favour divergence of the duplicates' functions, through amino-acid substitutions or changes in where the two duplicates are expressed. Thus it is very likely that evolution proceeds by a combination of regulatory changes (cis and copy number) and protein changes (including mutations in trans factors etc.).


References:
Evolution of metal hyperaccumulation required cis-regulatory changes and triplication of HMA4
Hanikenne et al. Nature 2008

Evolution of Protein Expression: New Genes for a New Diet
Jerry A. Coyne, Hopi E. Hoekstra. Current Biology 2007

Diet and the evolution of human amylase gene copy number variation
Perry et al 2007. Nature Genetics

Plant biology: Scent of a rose
Shadan S. Nature (News and Views) 2008

Scent evolution in Chinese roses
Scalliet et al. PNAS 2008

Monday, April 21, 2008

Sines of expansion

Cavalli-Sforza and colleagues used Principal Component Analysis (PCA) to summarize general spatial patterns of human allele frequencies across continents into maps. They interpreted peaks in these PCA maps as indicating sources of colonization, e.g. Neolithic expansions etc. However, a new paper by Novembre and Stephens (Novembre and Stephens 2008) questions this interpretation. They show that many forms of spatial covariance of allele frequencies between nearby populations can generate characteristic peaks in PCA maps. For example, simple isolation-by-distance or stepping-stone models also give rise to peaks in the PCA maps, despite having homogeneous migration. These peaks arise because PCA of data whose covariance decreases with spatial distance leads to components that are sines and cosines.
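A toy version of this point can be seen directly. Below is a minimal sketch (not the authors' analysis; the habitat size and covariance scale are invented) of a 1D habitat where allele-frequency covariance simply decays with distance, as under isolation by distance. The leading principal components come out as wave-like patterns of increasing frequency, even though nothing about the model singles out any location:

```python
import numpy as np

# Hypothetical 1D habitat: 100 demes on a line, with allele-frequency
# covariance decaying exponentially with distance (isolation by distance).
n = 100
x = np.arange(n)
cov = np.exp(-np.abs(x[:, None] - x[None, :]) / 10.0)

# The eigenvectors of the covariance matrix are the principal components.
vals, vecs = np.linalg.eigh(cov)
vecs = vecs[:, np.argsort(vals)[::-1]]  # sort by decreasing eigenvalue

def sign_changes(v):
    """Count sign changes along the habitat; sinusoid-like components
    of increasing frequency change sign 0, 1, 2, ... times."""
    s = np.sign(v)
    s = s[s != 0]
    return int(np.sum(s[:-1] != s[1:]))

# Despite perfectly homogeneous migration, the top PCs look like
# successive sine/cosine waves along the habitat.
waves = [sign_changes(vecs[:, k]) for k in range(4)]
print(waves)  # [0, 1, 2, 3]
```

Plotting `vecs[:, 1]` or `vecs[:, 2]` against position makes the sinusoidal shape obvious, which is exactly why peaks in such maps need not mark expansion origins.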

The paper has some really pretty examples; I particularly enjoy the fact that these patterns also appear in greenish warblers (a famous example of a ring species, i.e. isolation by distance). These results are not to say that human populations did not expand out of particular regions, just that PCA maps are not the best tool to judge this. The authors also note that this does not invalidate the use of PCA to correct for structure in association studies, and in fact might aid its interpretation in epidemiological models.

Reference
Interpreting principal component analyses of spatial population genetic variation
John Novembre, Matthew Stephens. Nature Genetics
Published online: 20 April 2008 doi:10.1038/ng.139

Saturday, April 12, 2008

Mapping cellular susceptibility to HIV

A really neat paper (Loeuillet et al.) in PLoS Biology identifies a candidate SNP for cellular susceptibility to the HIV-1 virus. The paper adds to our growing knowledge of the genetics of HIV susceptibility (see a review of the paper by David Goldstein).

Rather than investigating this in patients with the disease, the authors initially measured how susceptible different cell lines are to HIV. They used the CEPH cell lines (immortalized lymphoblastoid B cells) to build an initial linkage map and identified a broad candidate region. They followed this up using the CEU HapMap cell lines to confirm and fine-map the variant. The authors then confirmed that the SNP association was also present in the more biologically relevant CD4+ T cells. Finally, on the association front, they showed that the SNP is associated with disease progression in patients.

The use of the HapMap and CEPH cell lines to map variants affecting cellular phenotypes is a really interesting approach, and one which I'm sure we will see a lot more of in the future. At least some cellular phenotypes are likely to be easier to map, as they likely have a simpler basis than complex diseases. I'm slightly surprised that the authors did not do a genome-wide association study of this cellular trait (after all, they are HapMap cell lines) and instead restricted themselves to an association study in the region of significant linkage. Obviously a GWAS would have to meet genome-wide significance, but the region of significant linkage could have been up-weighted or considered separately in this analysis.
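One simple way to do that up-weighting is a weighted Bonferroni correction: spend a fixed fraction of the overall alpha on SNPs under the linkage peak and the rest on the remainder of the genome. A minimal sketch, where the SNP counts and the 50/50 split are entirely hypothetical (not taken from the paper):

```python
# Weighted Bonferroni: allocate a fraction of alpha to SNPs under the
# linkage peak, the remainder to the rest of the genome.
# All counts and the 50/50 split below are hypothetical.
alpha = 0.05
n_total = 500_000   # genotyped SNPs genome-wide
n_region = 2_000    # SNPs under the linkage peak
w_region = 0.5      # fraction of alpha given to the candidate region

thr_region = w_region * alpha / n_region
thr_rest = (1 - w_region) * alpha / (n_total - n_region)

print(thr_region)  # 1.25e-05: ~125x more lenient than a flat 0.05/500000
print(thr_rest)    # ~5.02e-08: about twice as strict as the flat 1e-07
```

The appeal is that the genome-wide scan loses very little power (the threshold outside the region only halves), while SNPs under the linkage peak get a far more achievable bar.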



Hat tip to Tree of Life.

Reference:
Loeuillet C, Deutsch S, Ciuffi A, Robyr D, Taffé P, Muñoz M, Beckmann JS, Antonarakis SE, Telenti A.
In vitro whole-genome analysis identifies a susceptibility locus for HIV-1.
PLoS Biol. 2008 Feb;6(2):e32

Goldstein DB.
Genomics and biology come together to fight HIV.
PLoS Biol. 2008 Mar 25;6(3):e76

Sunday, April 6, 2008

Common variants, when do we stop looking?

Just a few thoughts on genome-wide association studies, prompted by Genetic Future's recent posts on the low returns of some genome scans (here and here). Now, meta-analysis of combined studies will get us a long way towards finding small-effect alleles without the expense of typing additional cases (and we've seen quite a bit of this already), as will methods for studying epistatic interactions. So people will definitely squeeze the current data sets more. But my question/thought is: when do we stop looking for common variants for a particular common disease by increasing our sample size? Now this is a silly question, because the answer is mainly determined by practical constraints like funding and the ease of phenotyping cases. I also suspect that in practice we'll keep doing genome-wide association studies until resequencing is cheap enough to become a common tool, and then rare variants will be popular. But theoretically, when should we stop? Thinking about this might help us weigh the merits of different studies/study designs.

I think the answer depends strongly on the reasons for doing genome-wide association studies in the first place. I think there are two main reasons: predicting disease risk and understanding the pathways involved in the disease (though obviously these are not distinct aims).

If you are interested in predicting disease risk from a person's genotype, you need to ask: 'if I increase my sample size dramatically, will I get much better at predicting disease risk?' The answer perhaps will be no: most of the common variants known are not very predictive, so the ones discovered next will be even less helpful in predicting risk. Now, there will be a vast number of tiny-effect loci, but it seems to me that we rapidly hit diminishing returns for predicting risk.

If on the other hand you want to learn about the pathways involved in the disease (for drug targeting etc.), then perhaps the size of the effect is not important; just finding a new region will be informative about some part of the pathway (if you can understand what the region is telling you). One perhaps serious wrinkle on this is: are tiny-effect loci really informative about the pathways? The effect size may be small because the effect of the allele is very remote in the network from the main pathways, in which case it might be very hard to work your way back to understanding something new about the disease.

Now, an additional benefit of finding a tiny-effect common allele is that the region containing it might also be the target of large-effect rare variants. One view might be that by discovering tiny-effect common alleles, people are preparing the ground (i.e. finding candidate loci) for resequencing studies of rare variants. Obviously the resequencing will be done genome-wide, but researchers will up-weight interesting rare variants in loci where weak-effect common alleles are already known. If so, this will be a funny twist, because genes involved in Mendelian diseases (rare, strong-effect mutations) were originally seen as candidate locations for common disease variation.

I think the exciting thing is we really don't know what we will learn, nor when to stop. The great thing about the WTCCC (and other major efforts) is that by concentrating a lot of effort on a few diseases, we might quickly learn what does and doesn't work.

Saturday, April 5, 2008

"Exons, Schmexons"

A summary by PZ Myers of Coyne and Wray's keynote speeches on evo-devo. It sounds like it would have been fun to see, particularly the dueling t-shirts (one is quoted in the title of this post). I think that Coyne is right that the only real way to know where selected changes occur, and what type of mutations they are, is to do very detailed follow-up work.

I thought I would give a link to a relatively recent paper by Wray's group looking for positive selection in promoters using the human, chimp and macaque sequences (Haygood et al). Their main point (at least in the coding vs noncoding debate) is that many promoters seem to undergo positive selection compared to exons (especially in interesting categories of genes). The paper detects positive selection by identifying promoter regions that have significantly more substitutions than nearby intronic regions.

I've not read it in a while so I'll avoid commenting on the technical details. However, a lack of genes with d_N/d_S>1 is not proof that genes do not often experience positive selection, just that d_N/d_S>1 is a pretty crappy measure of positive selection. The problem is that d_N incorporates all of the selection against amino-acid changes plus any weak signal of positive selection. For a gene to meet the d_N/d_S>1 criterion it has to have had a whole bunch of amino-acid changes; if positive selection on a gene often involves just a few amino-acid changes, it will not satisfy d_N/d_S>1. Promoter sequences, on the other hand, could be made up of a mixture of near-neutrally evolving regions plus a small number of more constrained regions. A few additional substitutions in the promoter due to positive selection could easily tip the balance and make the promoter 'rapidly evolving' (i.e. faster than the nearby intron), because the promoter's rate of substitution was not that different from the intronic rate anyway. That's not to say that the promoters found to be rapidly evolving are not interesting, just that the results should not be taken to mean that there is more positive selection on promoters than exons, as this is like comparing apples to oranges.
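To put rough numbers on this asymmetry (all substitution and site counts below are invented for illustration, not taken from either paper): a handful of adaptive amino-acid changes is swamped by constraint in the d_N/d_S ratio, yet the same handful can tip a mostly-neutral promoter above the rate of a nearby intron:

```python
# Coding sequence: strong purifying selection keeps dN low, so a few
# adaptive changes still leave dN/dS far below 1. (Hypothetical counts.)
syn_sites, nonsyn_sites = 100, 300
syn_subs = 8              # near-neutral synonymous substitutions
background_nonsyn = 2     # substitutions slipping past constraint
adaptive = 3              # positively selected amino-acid changes

dS = syn_subs / syn_sites
dN = (background_nonsyn + adaptive) / nonsyn_sites
print(dN / dS)            # ~0.21, nowhere near the >1 criterion

# Promoter vs intron: the promoter is mostly near-neutral, so its
# baseline substitution count is close to the intron's, and the same
# 3 adaptive changes push it over the intronic rate.
intron_subs, promoter_background = 40, 38
print((promoter_background + adaptive) / intron_subs)  # 1.025 > 1
```

The same three selected substitutions leave the coding test cold but make the promoter look 'rapidly evolving', which is the apples-to-oranges problem in miniature.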

References:
Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA.
Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution.
Nat Genet. 2007

Wednesday, April 2, 2008

Ancient mtDNA: clocking up the mutations

A quick post about Hay et al, which I spotted thanks to Pondering Pikaia (Nature news also has a piece on it). The paper is interesting, but I think there are some issues with the interpretation.

The paper estimates the mtDNA mutation rate of the tuatara (an ancient lineage of reptile living in New Zealand), and finds that it is one of the highest estimated so far. The usual way of estimating mutation rates is to compare sequence divergence between two species (or a set of species) where the divergence time of the pair (or of nodes in the tree relating the species) is known with some precision (this technique is not without its problems, and a lot of work in molecular phylogenetics is devoted to solving them). However, if you don't have a set of relatively close species, or they lack a fossil record, then you are usually out of luck, and the mutation rate cannot be estimated. To get around this, the authors use patterns of mtDNA diversity in extant and ancient mtDNA to estimate the mutation rate. Usually, patterns of population genetic diversity are not in themselves informative about mutation rates, because the diversity within a species is determined by the product of the mutation rate and the effective population size (i.e. the two are confounded). High diversity within a species could be due to a high mutation rate and a low effective population size, or a low mutation rate and a high effective population size. The inclusion of ancient DNA samples can potentially resolve this confounding: the temporal spacing between the samples gives information about the rate of coalescence (i.e. the effective population size), and this information in turn provides a way of disentangling the effective population size from the mutation rate (see Drummond et al.). The authors use this idea to estimate the mtDNA mutation rate, and have used it before for penguin ancient mtDNA (Lambert et al).
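The logic of how serial samples break the confounding can be sketched with a method-of-moments calculation (the real analyses use full likelihood methods like Drummond et al.'s; this is just the intuition, and all the numbers are invented). For a haploid locus, two modern sequences differ on average by theta = 2*Ne*mu per site, while a modern-ancient pair picks up an extra mu*t from the modern lineage's additional t generations, so two observed quantities give both parameters:

```python
# Method-of-moments sketch under a constant-size, panmictic coalescent.
# E[diffs, modern-modern]  = theta = 2 * Ne * mu   (per site, haploid)
# E[diffs, modern-ancient] = theta + mu * t        (extra branch of t generations)
# All numbers below are hypothetical, not from Hay et al.
d_modern = 0.010   # pairwise diversity among modern samples
d_ancient = 0.013  # modern-vs-ancient pairwise divergence
t = 3000           # age of the ancient sample, in generations

mu = (d_ancient - d_modern) / t  # ~1e-06 per site per generation
theta = d_modern
Ne = theta / (2 * mu)            # ~5000

print(mu, Ne)
```

This also makes the caveats below concrete: the subtraction assumes the same Ne and panmixia across the whole sampling period, so size changes or structure feed straight into the mutation rate estimate.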

However, as with all population genetic analyses, things are not quite that simple. Violations of the simple coalescent model used to estimate the population size, and hence the mutation rate, could be problematic. Population size changes (such as a bottleneck) or selection could alter the rate of coalescence through time, and so potentially confound the estimation of the effective population size and thus the mutation rate. Population structure is another potential problem: if the present-day samples and ancient DNA samples are not all drawn from the same panmictic population, this would also violate the model assumptions and so potentially bias the estimate of the mutation rate. Now, the authors sample both extant and ancient specimens from around New Zealand, so population structure may not be a problem, though this cannot be assumed to be the case. The authors also check whether exponential population growth changes their results and find that it does not, but it is not clear that other plausible demographic models could not cause biases.

Estimates of mutation rate from this kind of analysis are clearly not problem-free, and so may need to be treated with some caution. If you are somewhat skeptical of estimating population demography from mtDNA sequences (and you should be), then you should be somewhat skeptical of estimating mutation rates via this technique. I think that the idea behind this study and the results are really neat, but for the moment I would regard this study as supporting/suggesting the hypothesis that the mutation rate is high, rather than definitively showing that this is the case.


References

Hay JM, Subramanian S, Millar CD, Mohandesan E, Lambert DM.
Rapid molecular evolution in a living fossil.
Trends Genet. 2008 Mar;24(3):106-9

Drummond AJ, Nicholls GK, Rodrigo AG, Solomon W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data.
Genetics. 2002 Jul;161(3):1307-20.

Lambert DM, Ritchie PA, Millar CD, Holland B, Drummond AJ, Baroni C. Rates of evolution in ancient DNA from Adélie penguins.
Science. 2002 Mar 22;295(5563):2270-3.