Monday, April 21, 2008

Sines of expansion

Cavalli-Sforza and colleagues used Principal Component Analysis (PCA) to summarize broad spatial patterns of human allele frequencies across continents as maps. They interpreted peaks in these PCA maps as indicating sources of colonization, e.g. Neolithic expansions. However, a new paper by Novembre and Stephens (2008) questions this interpretation. They show that many forms of spatial covariance of allele frequencies between nearby populations can generate characteristic peaks in PCA maps. For example, simple isolation-by-distance or stepping-stone models also give rise to peaks in the PCA maps, despite having homogeneous migration. These peaks arise because PCA of data whose covariance decreases with spatial distance leads to principal components shaped like sines and cosines.
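To make that last point concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code, and every specific in it is an assumption chosen for illustration: 50 populations evenly spaced along a line, a covariance of exp(-distance / 20) between them, and the helper names `spatial_pcs` and `cosine_match`. It centers the covariance matrix across populations, takes the leading eigenvectors, and compares each one with a plain cosine wave.

```python
import numpy as np


def spatial_pcs(n_pops=50, length_scale=20.0, n_components=2):
    """Leading PCs of a covariance that decays exponentially with distance.

    Populations sit at positions 0..n_pops-1 on a line (an assumed toy
    habitat); nothing in the matrix singles out a source population.
    """
    positions = np.arange(n_pops)
    distance = np.abs(positions[:, None] - positions[None, :])
    cov = np.exp(-distance / length_scale)

    # Center across populations, as PCA does before the eigendecomposition.
    centering = np.eye(n_pops) - np.ones((n_pops, n_pops)) / n_pops
    centered = centering @ cov @ centering

    # eigh returns eigenvalues in ascending order, so reverse to get the top PCs.
    eigvals, eigvecs = np.linalg.eigh(centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


def cosine_match(pc, k):
    """Absolute correlation between a PC and a cosine with k half-periods."""
    n = pc.shape[0]
    reference = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return abs(np.corrcoef(pc, reference)[0, 1])


if __name__ == "__main__":
    pcs = spatial_pcs()
    for k in (1, 2):
        match = cosine_match(pcs[:, k - 1], k)
        print(f"PC{k} vs cosine with {k} half-period(s): |r| = {match:.3f}")
```

In this toy setup the first PC is a smooth gradient from one end of the line to the other and the second is a single mound or trough in the middle, even though migration is the same everywhere. Mapped in two dimensions, shapes like these are what produce the peaks.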

The paper has some really pretty examples; I particularly enjoy the fact that these patterns also appear in greenish warblers (a famous example of a ring species, i.e. isolation by distance). These results do not say that human populations did not expand out of particular regions, just that PCA maps are not the best tool for judging this. The authors also note that this does not invalidate the use of PCA to correct for structure in association studies, and in fact might aid in their interpretation in epidemiological models.

Interpreting principal component analyses of spatial population genetic variation
John Novembre, Matthew Stephens. Nature Genetics
Published online: 20 April 2008 doi:10.1038/ng.139


John Hawks said...

Some years ago, somebody demonstrated that the PCA axes found by Cavalli-Sforza were a topologically necessary consequence of European geography -- that is, the first axis would always have a SE-NW direction because that is the longest axis of the European continent, and the second axis is necessarily perpendicular to the first. I forget who this was; it may have been Sokal. Anyway, the PC-migration interpretation may be widely believed, but really has no evidentiary basis.

G said...

Yes, it was Sokal (Sokal et al. 1998), thanks for pointing that out. The major geographic axis will only be the first PC axis if migration is relatively uniform. If migration was strongly anisotropic, the direction of the lowest gene flow would probably be the first PC. Also, the PCs are orthogonal with respect to the data, not to geography, so the 2nd PC does not automatically have to be at a right angle to the first PC on a map (but under uniform migration it usually will be).
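As a rough illustration of the anisotropy point, here is a toy sketch in Python with NumPy. It is not taken from either paper, and its setup is assumed throughout: a 12 x 6 grid of populations (long axis along x), a separable covariance exp(-dx/scale_x - dy/scale_y), and low gene flow along an axis modeled simply as a short correlation length in that direction. The helper names `leading_pc` and `gradient_axis` are hypothetical.

```python
import numpy as np


def leading_pc(nx, ny, scale_x, scale_y):
    """First PC of a grid of populations with direction-dependent covariance.

    scale_x and scale_y are assumed correlation lengths; a small value
    stands in for low gene flow along that axis.
    """
    xs, ys = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    x = xs.ravel().astype(float)
    y = ys.ravel().astype(float)

    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    cov = np.exp(-dx / scale_x - dy / scale_y)

    # Center across populations before the eigendecomposition.
    n = nx * ny
    centering = np.eye(n) - np.ones((n, n)) / n
    eigvals, eigvecs = np.linalg.eigh(centering @ cov @ centering)
    pc1 = eigvecs[:, -1]  # eigh sorts eigenvalues in ascending order
    return pc1, x, y


def gradient_axis(pc1, x, y):
    """Report which map coordinate the first PC tracks more closely."""
    rx = abs(np.corrcoef(pc1, x)[0, 1])
    ry = abs(np.corrcoef(pc1, y)[0, 1])
    return "x" if rx > ry else "y"


if __name__ == "__main__":
    # Same correlation length in both directions: PC1 follows the long axis.
    print("uniform:    ", gradient_axis(*leading_pc(12, 6, 10.0, 10.0)))
    # Much shorter correlation length along y: PC1 switches to the short axis.
    print("anisotropic:", gradient_axis(*leading_pc(12, 6, 10.0, 1.0)))
```

With equal correlation lengths the first PC in this sketch is a gradient along the long axis of the grid; shortening the correlation length along the short axis makes the first PC run along that axis instead. It is only a toy covariance model, so it shows the effect is possible rather than how strong it would be in real data.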

I think the helpful thing about the Novembre and Stephens paper is that it explains why the PC axes come out the way they do, and shows that any form of spatially decreasing covariance will result in this kind of pattern. While historical interpretation of PCA might have had a number of detractors, it is still widely used. I think that this kind of prominent paper pointing out the flaws in a clear mathematical way can be very helpful, especially as PCA is making a comeback due to its usefulness in dealing with spatial confounding in association studies.

John Hawks said...

I think that this kind of prominent paper pointing out the flaws in a clear mathematical way can be very helpful,

Hear, hear!