The coalescent: what is it? And, perhaps more importantly: what is it not?

[edit: the reference in the first paragraph pointed to a review on ABC, not the coalescent, which is quite ironic. I’ve fixed the mistake]

We population geneticists very often use the coalescent (Kingman 1982) to analyse data, infer demographic histories, and estimate population parameters. Yet do we know exactly what “the coalescent” is? Let us have a look. Note that this is a blog post, not a course. There will be very little formalism, if any, and the goal is to clear up some misunderstandings, not to provide demonstrations. There are plenty of review papers (my favourite is Nordborg 2019), courses and textbooks out there to grasp the maths behind the coalescent, and I’m certainly not the best person to explain them to you, readers.

The coalescent is a modelling tool in which, going backward in time, at each generation the different lineages can “coalesce”, that is, share an ancestor. You can view this as links relating siblings to one of their parents. If you look at a particular set of “lineages” at the present time and follow their coalescence events back in time, at some point you’ll end up with only one “ancestor” for those lineages. This ancestor is called the “Most Recent Common Ancestor” (MRCA) and the generation at which it occurs is the Time to MRCA, or TMRCA. Here, a seven-lineage coalescence is shown, with a TMRCA equal to 10 generations. Notice that lineages that never made it to the present time are also shown in grey.

There is an intuitive link between the representation above and drift in a constant-sized population: due to drift, some lineages are lost, while others increase in frequency over time and eventually become fixed (here, the blue lineage becomes fixed in the observed sample).

The most common version of the coalescent, the one used in most off-the-shelf population-genetic software, is the one where any two lineages coalesce with probability 1/2N per generation (for diploids, with N the effective size of the population the sample is taken from, so that 2N is the number of gene copies).

Because of this 1/2N rule, the probability that any two copies coalesce in a given generation is driven by population size alone. This way, you do not even need to simulate the whole population to describe your sample! It suffices to say that, in your sample, any two lineages can coalesce at a given generation with probability 1/2N; so you only need to simulate the lineages you have in your sample. This makes coalescent simulations exceedingly fast and flexible. Another handy consequence of the 1/2N rule is that, if you have a large sample, there will be many coalescence events in the first generations (“first” backward: that is, those closest to the present), and then the “waiting time” for further coalescence events will increase as the number of remaining lineages decreases. When only a few lineages remain, you wait longer and longer, and indeed the longest “waiting time” is, on average, the last one, the one leading to the MRCA. When you count everything, the TMRCA is largely dominated by the last steps, those leading to the coalescence of the very last lineages. Put differently, it is almost useless to simulate (and genotype!) large samples if you are using the coalescent to analyse your populations.
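The idea that only the sampled lineages need to be simulated, and that the last coalescence dominates the TMRCA, can be sketched in a few lines of Python. This is a minimal illustration, not what any particular simulator does: it assumes the usual continuous-time approximation (with k lineages left, the waiting time to the next coalescence is exponential with rate k(k−1)/2 × 1/2N, which is reasonable when the sample is much smaller than 2N). The function names and the parameter values are made up for the example.

```python
import random
from math import comb


def simulate_waiting_times(n_lineages, two_n, rng):
    """Simulate one genealogy; return {k: generations spent with k lineages}."""
    times = {}
    for k in range(n_lineages, 1, -1):
        # Each of the C(k, 2) pairs coalesces at rate 1/2N per generation.
        rate = comb(k, 2) / two_n
        times[k] = rng.expovariate(rate)
    return times


def mean_share_of_last_step(n_lineages, two_n, replicates=2000, seed=1):
    """Average time spent with 2 lineages, divided by the average TMRCA."""
    rng = random.Random(seed)
    total_last = 0.0
    total_tmrca = 0.0
    for _ in range(replicates):
        times = simulate_waiting_times(n_lineages, two_n, rng)
        total_last += times[2]
        total_tmrca += sum(times.values())
    return total_last / total_tmrca


if __name__ == "__main__":
    # Hypothetical values: 50 sampled gene copies, 2N = 2000.
    print(mean_share_of_last_step(n_lineages=50, two_n=2000))
```

With these made-up values, roughly half of the average TMRCA is spent waiting for the last two lineages to coalesce, even though the sample started with 50 of them.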

The other nice thing about the coalescent is that it provides very useful predictions. For example, the expected TMRCA of a pair of lineages equals 2N generations (for a sample of n lineages it is 4N(1 − 1/n), which approaches 4N in large samples), which means that if you are able to estimate the TMRCA, then you have an estimate of 2N, which is quite handy. This expectation, and all its variations based on changes in population size, migration, population splits (not to mention selection), are at the root of the use of the coalescent for population-genetic inference.
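To make the prediction concrete, here is a small sketch of the standard expectation under a constant population size. With k lineages left, the expected waiting time is 2N divided by the number of pairs, k(k−1)/2; summing over k gives 2N generations for a pair of lineages and 4N(1 − 1/n) for a sample of n. The helper names are hypothetical, and the inversion in `estimate_two_n` assumes that the TMRCA you plug in is a good estimate of its expectation, which is a strong assumption for a single locus.

```python
from math import comb


def expected_tmrca(n_lineages, two_n):
    """Expected TMRCA in generations: sum over k of 2N / C(k, 2)."""
    return sum(two_n / comb(k, 2) for k in range(2, n_lineages + 1))


def estimate_two_n(mean_tmrca, n_lineages):
    """Invert E[TMRCA] = 4N (1 - 1/n) to get 2N from an average TMRCA."""
    return mean_tmrca / (2.0 * (1.0 - 1.0 / n_lineages))


if __name__ == "__main__":
    two_n = 2000  # hypothetical number of gene copies
    for n in (2, 10, 100):
        # Expected TMRCA expressed in units of 2N generations.
        print(n, expected_tmrca(n, two_n) / two_n)
```

The ratio goes from 1 for a pair of lineages towards 2 as the sample grows, which is another way of seeing that adding lineages to the sample barely moves the TMRCA.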

Yet, how do you know the TMRCA of an empirical sample? Well, you don’t. To obtain inferences, you need to compare patterns observed in the simulations with those observed in the empirical data. But you are a geneticist, and what you like to compare is genetic diversity patterns. Enter the next step of the coalescent simulation: introducing genetic diversity. To do this, your favourite coalescent simulator will pepper the genealogy with mutations: here, new colours appearing in the genealogy, and being inherited down the genealogy, represent mutations. At the end of the process (that is, at present time, the last row), you’ll have genetic diversity in your sample.
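The “peppering” step can be sketched as follows. This is only an illustration under stated assumptions: an infinite-sites model, a hypothetical mutation rate `mu` per locus per generation, and the number of segregating sites as the only summary of diversity. It does not track which lineage carries which mutation; it only uses the fact that the number of mutations falling on the genealogy is Poisson-distributed, with a mean equal to `mu` times the total branch length.

```python
import random
from math import comb


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n_lineages, two_n, mu, rng):
    """Simulate one genealogy and return the number of mutations on it."""
    total_branch_length = 0.0
    for k in range(n_lineages, 1, -1):
        waiting_time = rng.expovariate(comb(k, 2) / two_n)
        # While k lineages coexist, k branches accumulate mutations.
        total_branch_length += k * waiting_time
    return poisson(mu * total_branch_length, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values: 10 gene copies, 2N = 2000, mu = 0.001.
    draws = [simulate_segregating_sites(10, 2000, 1e-3, rng) for _ in range(2000)]
    print(sum(draws) / len(draws))
```

Under these assumptions the average should be close to 4Nμ × (1 + 1/2 + … + 1/(n − 1)), the classical expectation for the number of segregating sites. This is the kind of quantity you can then compare with the empirical data.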

Here, the “dead-end” lineages have been removed from the picture: anyway, you cannot have sampled them, and therefore you have no genotype for them (notice, though, that many programs can accommodate data from past generations, thus possibly allowing sampling “dead-end” lineages).

The coalescent simulation part of the analytical pipeline stops here: your simulator has produced a sample with its genetic characteristics. Usually, the simulation is repeated a very large number of times (generally drawing the simulation parameters from prior distributions), so that the genetic properties of the empirical population under study can be compared with a very large array of simulations.

This is what the coalescent is and does, in a nutshell. What is it not, and what does it not do?

What the coalescent is not and will not do

The Coalescent is not ABC

I have very often seen people conflate, in their thoughts, the coalescent with ABC (Approximate Bayesian Computation). They are not at all the same thing, nor do they need to be glued together. The former is, as explained above, a tool to simulate genealogies and data; the latter is a method to infer parameters from the data, which does involve simulations. Using ABC to estimate the parameters of an empirical population-genetic sample, by comparing its genetic properties with those of a very large number of coalescent-simulated samples, is indeed an extremely effective way of taking advantage of the power of both the coalescent and ABC (see Csilléry et al. (2010) for a clear explanation).

Yet, the simulations that ABC rests upon do not need to be based on the coalescent. Provided you have a tool to simulate some system (which may have nothing to do with genetics) starting from a set of parameter values, you can use ABC. True, ABC was invented by population geneticists (Mark Beaumont, Wenyang Zhang, and David J Balding; Beaumont et al. (2002)), but I am assured they’re not jealous if one uses ABC for anything other than genetic data (I’m not going to detail why population geneticists use it, nor am I going to describe the differences between ABC and “true” Bayesian methods). It is also true that famous software packages use the coalescent and ABC together (e.g., DIYABC), which may contribute to the feeling that the coalescent and ABC are one and the same thing.
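To show that nothing in ABC is specific to the coalescent, here is a minimal sketch of the simplest flavour of ABC, rejection sampling, applied to a toy example that has nothing to do with genealogies: estimating a germination probability from the number of seeds that germinated out of 100. Everything here (the function names, the uniform prior, the tolerance, the observed count of 70) is made up for illustration; real analyses use several summary statistics and more refined acceptance schemes.

```python
import random


def abc_rejection(observed, simulate, draw_prior, distance, tolerance, n_draws, rng):
    """Keep the parameter values whose simulated data fall close to the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = draw_prior(rng)           # 1. draw a parameter value from the prior
        simulated = simulate(theta, rng)  # 2. simulate data with that value
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)        # 3. accept if close enough to the observation
    return accepted


def simulate_germination(p, rng, n_seeds=100):
    """Toy simulator: number of seeds germinating, each with probability p."""
    return sum(rng.random() < p for _ in range(n_seeds))


rng = random.Random(42)
posterior = abc_rejection(
    observed=70,                          # hypothetical observation: 70 seeds out of 100
    simulate=simulate_germination,
    draw_prior=lambda r: r.random(),      # uniform prior on [0, 1]
    distance=lambda a, b: abs(a - b),
    tolerance=2,
    n_draws=20000,
    rng=rng,
)
posterior_mean = sum(posterior) / len(posterior)

if __name__ == "__main__":
    print(len(posterior), posterior_mean)
```

Replacing `simulate_germination` with a coalescent simulator, and the seed count with genetic summary statistics, gives the population-genetic use of ABC; the inference machinery itself does not change.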

The coalescent does not suggest a time when there was only one individual of a species

The Most Recent Common Ancestor (MRCA) is not the ancestor of all living beings of a given species. It is the ancestor of all the currently existing copies of the locus for which the coalescent is built, and this does not mean that, at some point back in time, there was only one copy of that locus around, either. It means that all copies belonging to all other “branches”, or genealogies, have gone extinct, very likely by drift (see above). This is of course quite counterintuitive, but yes, if the postulates of the coalescent hold true, all current copies of a locus derive from a single copy that was around at the TMRCA. Of course, at the TMRCA, that particular copy co-existed with (possibly plenty of) other copies, but these have not left any descendants that survived in the population (or, more precisely, in the observed sample) up to the present.

So, sorry, but you cannot use the coalescent to infer when Adam and Eve were walking in the Garden of Eden. Use a Bible for that purpose.

The coalescent does not help to study very recent (and some very short) events

Because everything in coalescent-based inference revolves around estimating the TMRCA, which is what allows one to estimate population parameters (and particularly the effective population size), and because any two randomly picked copies of a locus may share a very old ancestor (including the MRCA), inferences from the coalescent may sometimes be highly biased.

Suppose for example you have a very nice, large, historically stable forest stand (it could be a population of any other organism, but hey, this is a blog about forests); one day, you fell the whole forest, leaving one tree. Suppose you now analyse the coalescence of the loci carried by the chromosomes in this single tree (I suppose it is at least diploid: truly haploid tree species are really rare!). If that single tree is the outcome of a long history of random mating, then the chromosomes it carries are a random sample of the chromosomes of the population, and their coalescence may, on average, go back to a very distant past. Based on this, you will likely conclude that the population’s effective size is very large (remember: the TMRCA is proportional to effective population size)! This of course does not make sense, because there is only one tree left (I’ll come back in another post to the multifarious meanings and interpretations of the concept of effective population size).

The same applies when the population has undergone a very short bottleneck not very long ago: once again, if there has been an “instant bottleneck”, the lineages have had almost no time to coalesce within the bottleneck, and the coalescence signal reflects the historical population size much more than the bottleneck population size.
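A back-of-the-envelope calculation shows why a short bottleneck leaves so little trace. Under the 1/2N rule, the probability that two lineages coalesce at some point during t generations spent at a size of 2N gene copies is 1 − (1 − 1/2N)^t. The sketch below simply evaluates this expression; the bottleneck sizes and durations are hypothetical, and the population size is assumed constant within the bottleneck.

```python
def prob_coalescence_within(generations, two_n):
    """Probability that two lineages coalesce within a given number of generations."""
    return 1.0 - (1.0 - 1.0 / two_n) ** generations


if __name__ == "__main__":
    # Hypothetical bottleneck down to 2N = 200 gene copies.
    print(prob_coalescence_within(generations=5, two_n=200))     # short bottleneck
    print(prob_coalescence_within(generations=1000, two_n=200))  # long bottleneck
```

With these made-up numbers, two lineages have only a few percent chance of coalescing during the five-generation bottleneck, so most pairs of lineages still find their common ancestor in the large historical population. After a thousand generations at the same reduced size, almost all pairs would have coalesced within the bottleneck.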

[more “what is not” items can be added: I invite the readers to make suggestions]

So, in conclusion: the coalescent is a wonderful tool, but one has to know exactly what it means and how to use it.

Bibliography

Beaumont, Mark, Wenyang Zhang, David J Balding. « Approximate Bayesian Computation in Population Genetics ». Genetics 162, 4 (2002): 2025‑35.

Csilléry, Katalin, Michael G B Blum, Oscar E Gaggiotti, Olivier François. « Approximate Bayesian Computation (ABC) in practice ». Trends in Ecology & Evolution 25, 7 (2010): 410‑18.

Kingman, J F C. « The Coalescent ». Stochastic processes and their applications 13, 3 (1982): 235‑48.

Nordborg, Magnus. « Coalescent Theory ». In Handbook of Statistical Genomics, 1:145‑30. John Wiley & Sons, Ltd, 2019.