Information, opinions, discovery.

Let us have a chat about science.

 

Do you ever feel the need to go beyond your everyday research activity, to stop and think about how science actually works?

Do you ever feel that only part of what you do, think and find in your research fits the strict frame of the peer-reviewed paper and the conference talk? And that such things nonetheless deserve to be said and written?

I do. So I invite you to come by my campfire and have a chat around forest science.

Scroll down for the latest content.

To Ne or not to Ne? that is the question

This blog post is written with guest blogger Bruno Fady, so the pronoun used is “we”. Welcome Bruno!

A few days ago, on 13 June 2023, we attended a very interesting webinar about the estimation of effective population size (Ne) and its use as a genetic indicator for (forest population) conservation. The webinar is part of the EUFORGEN webinar series.

The webinar consisted of a trio of very insightful talks given by one of us (Bruno Fady, INRAE URFM, Avignon, France), by Juan José Robledo-Arnuncio (INIA-CSIC, Spain), and by Sean Hoban (Morton Arboretum, IL, USA). We should perhaps mention that Robin S. Waples (University of Washington, Seattle, WA, USA), one of the single largest contributors to the literature on effective population size, also attended and took part in the very dense debate that followed the talks. Overall, it was a great piece of science!

Here, we'd like to highlight some key points of the discussion and to develop them a little further with some provocative thoughts.

The first point we'd like to stress: notwithstanding the complications of estimating it, effective population size is a key demographic parameter in population genetics. It has a direct relationship with genetic drift, and we should do all we can to obtain estimates of it in order to characterise the evolutionary potential of species and populations.

In nature, reproductive success is very often unequal. Fierce struggles are sparked by the drive to gain (exclusive) access to reproduction (Picture credits: Heather Smithers, from Wikimedia Commons)

That said, we could not help noticing that the three speakers, and all those who asked questions and made comments, spent a good deal of time and words explaining how biased Ne estimates can be, and how easily they become biased. It is actually far more likely that an estimate is biased than unbiased. There are truckloads of reasons for this; indeed, the only case in which genetically derived estimators are unbiased seems to be an ideal Wright-Fisher population (a finite, constant-sized, panmictic, isolated population without selection or overlapping generations). In all other cases, Ne estimates will likely end up estimating something (sub-population size, neighbourhood size, inbreeding, you name it), but not Ne itself, and there is no way to tell exactly what is being estimated. This is particularly true when one tries to estimate contemporary Ne (as opposed to historical Ne), the one that matters most for conservation, the one liable to change dramatically in case of population collapse or breakdown of the population's mating system.

Alas, it seems almost impossible to get a decent estimate of Ne from genetic data, in the case of large, continuous, structured populations, such as the ones that we very often deal with in forestry.

(We are currently running some simulations to check the effect of limited sample size and demographic change on genetically based estimates of contemporary Ne: from preliminary results, we can say that it looks hopeless. More on this when the simulations are complete. On the theoretical side, you can see this paper about how Ne is affected by meta-population dynamics.)

So, the relevant question to ask is: why should we try to estimate Ne from genotypic data? Should we give up attempts at estimating Ne in this way for practical purposes, such as delivering an indicator of adaptation risk?

Let us take a step back and look at what Ne is and why we try to estimate it that way.

Ne is not per se a genetic parameter. Estimated locally (contemporary Ne), it describes how, and how many, fertile individuals contribute to reproduction: do they all contribute equally? Or is their contribution uneven? How many actually reproduce? These are demographic, or population-dynamic, matters. You can handle them (and measure them!) without the slightest knowledge of genetics.

In a population with unequal contributions to the next generation, effective population size can be very small relative to the total number of potential parents (and relative to the number of actual parents)

Of course, Ne matters to geneticists because it has enormous consequences for true genetic parameters: levels of inbreeding and of genetic diversity in particular. In a finite population, inbreeding increases, and genetic diversity decreases, at a rate of 1/(2Ne) per generation, that is, inversely proportional to Ne. So, of course, it is super relevant for geneticists. But here is the twist: because Ne drives population-genetic quantities, we have become accustomed to the very convenient strategy of estimating Ne from estimates of those quantities (linkage disequilibrium, genetic diversity, and so on). The high-throughput sequencing bonanza has only worsened our addiction to genetically derived Ne estimates.
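To make this concrete, here is a tiny back-of-the-envelope sketch in R (ours, not from the webinar; the effective size and starting heterozygosity are invented) of how expected heterozygosity erodes under drift alone, at a rate of 1/(2Ne) per generation:

# expected heterozygosity under pure drift (all starting values are assumed)
Ne   <- 50                         # a small effective population size
H0   <- 0.30                       # initial expected heterozygosity
gens <- c(0, 50, 100, 200)         # generations elapsed
Ht   <- H0 * (1 - 1/(2*Ne))^gens   # expected heterozygosity at each time point
round(Ht, 3)                       # diversity melts away; halve Ne and it melts roughly twice as fast

That single line of algebra is the whole reason conservation geneticists care so much about the parameter.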

The chain of reasoning ends up being somewhat tautological: Ne is an important driver of genetic parameters; we know how to estimate genetic parameters; we use genetic parameters to derive Ne; Ne thus becomes itself a genetic parameter, and we restrict ourselves to estimating it from other genetic parameters.

Yet all those Ne estimation methods come with biases severe enough to render the estimates useless, and, in spite of our beliefs and desires, effective population size is still not a population-genetic quantity.

This is not, though, a reason for despair. Working with trees comes with disadvantages, surely (like all those biases that make Ne estimations useless), but also with some assets.

For example, seed dispersal is mostly local for most species, and getting a crude estimate of the contribution of individual trees to reproduction by counting young and recruited seedlings in quadrats should not be too complicated (there are more sophisticated methods to assess single-tree fecundity, of course, like the SEMM-based methods developed by Klein et al. (2013), and direct observations of cone and fruit production through aerial surveys, like the ones we develop in the FORGENIUS project). Individual contributions to reproduction are the very building blocks of effective population size, which is, conceptually, the number of individuals reproducing with equal contributions. Thus, counting the individuals that reproduce, even without accounting for their relative contributions, will likely be less biased than any indirect, genetic marker-based estimate.
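To illustrate with numbers (a toy sketch of ours, with invented seedling counts, using Wright's classic variance-in-offspring-number formula; mind that it assumes a roughly stable population, i.e. an average of two offspring per parent):

# demographic Ne from counts of offspring per candidate parent (invented data)
k  <- c(0, 0, 0, 0, 1, 1, 2, 3, 5, 8)   # offspring counted for each of N = 10 parents
N  <- length(k)
Vk <- var(k)                            # variance in reproductive success
Ne <- (4*N - 2) / (Vk + 2)              # Wright's formula for a stable population
c(N = N, Vk = round(Vk, 1), Ne = round(Ne, 1))   # Ne is about 4, far below N = 10

Crude, certainly, but the ingredients (who reproduces, and how unequally) are measured in the field, not inferred through a chain of genetic assumptions.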

So, back to basics: if we want to estimate a demographic parameter, like Ne, let us estimate it directly! No need to worry about the bias of indirect methods, resting heavily on assumptions that almost never hold. Let us go out there and count reproducing trees. It will be sweet to discuss those numbers in the evening, once back to the Forest Genetics Campsite.

Is it real? A reflection on what to call simulation data.

We scientists are used to doing the occasional "back-of-the-envelope" calculation, shorthand for an approximate calculation from fuzzy data and hypotheses, used to grasp the order of magnitude of some quantity before starting the real work in a more formal way (I suspect that, on the contrary, in some decision-making situations the analysis stops with the envelope itself, hence some major policy failures).

Today we're trying an even more ambitious, and even fuzzier, exercise: back-of-the-envelope philosophy of science. We're going to think about what we call certain objects of everyday scientific use, objects of different nature and status to which (here is my argument) we assign the wrong names, leading to mistakes in our perception of what they actually are. I'm talking about two categories of objects in particular: on the one hand, data generated by running some computer code, without collecting or analysing any sample from the outer world; on the other hand, data that require inspecting non-computer-generated objects.

Typically, when developing a new method to analyse or model data, the scientist first applies it to digitally generated data, and then provides an example of application to "real-life" data (quotes obligatory here, if you subscribe to my view below). This pair of classes of objects goes by different names in the literature, but none of those name pairs is, to me, satisfactory.

In my (incomplete, biased, etc.) library of scientific papers, under the subset "population genetics methods", there are 140 papers with the keyword "simulation". Of these, 41 use "empirical" to describe "field" data; 26 use "real" to describe such data; 27 use "empirical" and "real" as synonyms, or to indicate closely related concepts pertaining to "field" data; 8 use "real" for field data and "empirical" for simulation data; and 4 refer to simulation output as "empirical".

Theoretical vs empirical vs real?

"Empirical" is the formulation sometimes used when producing an expected distribution of some statistic based on simulations. So, what comes from simulations is called "empirical"; yet "empirical evidence" in science is associated with observations out there, in the field, based on data that, so to speak, arise from processes the investigator does not entirely control (unlike simulations, where the starting conditions are entirely chosen by the researcher). So it would seem that the "empirical envelopes" (on the back of which this post is written) that one computes from simulations are empirical in a different way than empirical evidence is. No good. And then there are "real" data, which are, of course, also empirical (more empirical, I should say, from a philosophical standpoint). Even worse.
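For the record, this is the kind of thing meant by that first usage of "empirical": a simulation envelope for a statistic, built roughly as in the sketch below (mine, with an arbitrary null model and an arbitrary statistic, purely for illustration).

# a 95% simulation envelope for a statistic (here, the variance of 30 values)
# under an arbitrary null model; every choice below is invented
set.seed(1)
nSims    <- 5000
stat.sim <- replicate(nSims, var(rnorm(30)))             # statistic under the null model
envelope <- quantile(stat.sim, probs = c(0.025, 0.975))  # the "envelope"
observed <- var(rnorm(30, sd = 1.8))                     # a pretend field value to compare
c(envelope, observed = observed)

Everything above comes out of the machine; nothing was observed in the field, which is precisely why calling such an envelope "empirical" bothers me.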

Theoretical vs simulated vs real?

Most of the time, though (26+27 instances out of 140), the chosen pair is “simulated vs real”.

The "simulated vs real" alternative is, to me, not very good either. Indeed, if the simulated data are not "real", then what are they? Imaginary? Unreal? If they're imaginary, are they as imaginary as a unicorn? (I hope my 7-year-old daughter will not read this post any time soon.)

________________________________________________________________________

A unicorn is likely an imaginary thing. Yet there are real unicorns, too.

In Edinburgh, for example.

________________________________________________________________________

And would one have scientific results rest on imaginary stuff? If they are "unreal", does that relate to "unrealistic"? Yet the whole point of simulations is that they should mimic reality. Plus, simulations are really real to me: they do exist, at least as combinations of zeroes and ones in some memory storage device. They are something you can "touch" and "see", provided you have the right device. So they are about as real (actually, exactly as real) as the same type of data obtained by applying some inspection technique to samples collected in the field. They are actually more real than the academic institution that pays my salary (you can certainly touch and see the buildings and the staff of INRAE, but can you touch and see the institution itself? No, you cannot. It only exists as a social convention. Social conventions are also real, of course, but they exist out of mutual agreement among humans, not because they are stored on a physical support).

A proposal: theoretical vs simulated vs empirical

"Theoretical" sits on safe ground. Nobody will challenge what a theoretical expectation is. It is derived mathematically from first principles and is "unaffected" by data (it may be flawed, of course, and data may contribute to rejecting it; but it does not build upon data). And, of course, a theoretical expectation is also real.

But then, what to call those things that exist in a computer and are generated inside it, without ever receiving any input from the outside world? I suggest calling them by the most natural name, even if it is not precise (see below): let us stay with simulated (if you wish to be verbose, you could say "computer-generated"). Saying you have produced a "simulated envelope" for a statistic should not shock anyone, and it has the advantage of making it entirely clear that no field data were harmed in the production of such an envelope (something that is not entirely clear if one says "empirical envelope").

As a consequence, this leaves the good old "empirical" for things arising from the analysis of data coming from outside the machine (with the notable exception of the sciences that study the behaviour of the machine itself; this blog is about forests, so we will not consider that case). I call "empirical" any data that the natural-world objects which are the primary focus of your science can provide (even if you have manipulated them, as in a reciprocal transplant test). I am aware this definition also comes with drawbacks and limitations, but I reckon it to be the least bad alternative. It is the choice made in 41 of the 140 examples, and I suggest we side with them.

If you want to be verbose, you can say "experiment-generated" or "observation-generated". The [something]-generated option is entirely explicit and unambiguous. It is probably the best one, but perhaps too verbose.

A posh solution: enter Latin.

The computer-generated vs experiment/observation-generated alternative may benefit from some ancient tongue. We could, quite elegantly and concisely, say in silico (for things generated on a computer) and in natura (for data obtained from the outer world; it could be in vivo as well, but that term is locked in a pair with another opponent, in vitro, and may lead to confusion). As for marking the difference, with a Latin expression, between experiment and observation, I do not currently have any solution to offer, except perhaps recycling the in situ / ex situ pair, which is not entirely satisfactory.

Of course, this view of mine should be taken cum grano salis.

And very likely, the best solution will be a compromise: in medio stat virtus.

The coalescent: what is it? And, perhaps more importantly: what is it not?

[Edit: the reference in the first paragraph pointed to a review of ABC, not of the coalescent, which is quite ironic. I've fixed the mistake.]

We population geneticists very often use the coalescent (Kingman 1982) to analyse data, infer demographic histories, and estimate population parameters. Yet do we know exactly what "the coalescent" is? Let us have a look. Notice that this is not a course, it is a blog post: there will be very little formalism, if any, and the goal is to clear up some misunderstandings, not to provide demonstrations. There are plenty of review papers (my favourite is Nordborg 2019), courses and textbooks out there to grasp the maths behind the coalescent, and I'm certainly not the best person to explain them to you readers.

The coalescent is a modelling tool in which, going backward in time, at each generation different lineages can "coalesce", that is, find a shared ancestor. You can view this as links relating siblings to one of their parents. If you look at a particular set of "lineages" at the current time and follow their coalescence events back in time, at some point you'll end up with only one "ancestor" for those lineages. This ancestor is called the "Most Recent Common Ancestor" (MRCA) and the generation at which it occurs gives the Time to the MRCA, or TMRCA. Here, a seven-lineage coalescence is shown, with a TMRCA equal to 10 generations. Notice that lineages that never made it to the present time are also shown, in grey.

There is an intuitive link between the representation above and drift in a constant-sized population: due to drift, some lineages are lost, while others increase in frequency over time and eventually become fixed (here, the blue lineage becomes fixed in the observed sample).

The standard version of the coalescent, the one most commonly used in off-the-shelf population-genetic software, is the one where the probability that any two lineages coalesce is 1/(2N) per generation, per pair of lineages (for diploids, with N the effective size of the population the sample is taken from, and therefore 2N copies of each locus).

Because of this 1/(2N) probability of coalescence, the probability that any two copies coalesce in a given generation is driven by population size alone. This way, you do not even need to simulate the whole population to describe your sample! It suffices to say that, in your sample, any lineage can coalesce with any other in a given generation with probability 1/(2N); so you only need to simulate the lineages you have in your sample. This makes coalescent simulations exceedingly fast and flexible. Another handy consequence of the 1/(2N) rule is that, if you have a large sample, there will be many coalescence events in the first generations ("first" going backward, that is, those closer to the present), and then the "waiting time" for further coalescence events will increase as the number of independent lineages left decreases. When only a few lineages remain, you wait longer and longer; indeed, the longest "waiting time" is the last one, the one leading to the MRCA. When you count everything, the TMRCA is largely dominated by the last steps, those leading to the coalescence of the very last lineages. Put otherwise, it is almost useless to simulate (and genotype!) large samples if you are using the coalescent to analyse your populations.
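A handful of lines of R makes the arithmetic explicit (a sketch of mine; N and n are arbitrary). While k lineages remain, the expected waiting time to the next coalescence is 2N divided by the number of pairs, k(k-1)/2:

# expected coalescent waiting times for n sampled lineages, diploid population of size N
N <- 1000                                 # effective population size (arbitrary)
n <- 20                                   # sample size (arbitrary)
k <- n:2                                  # number of lineages still to coalesce
waiting <- 2 * N / choose(k, 2)           # expected time spent with k lineages
c(TMRCA = sum(waiting), theory = 4 * N * (1 - 1/n))   # the two match
waiting[length(waiting)] / sum(waiting)   # the last coalescence alone takes about half the time

Increasing n from 20 to 200 barely changes the expected TMRCA (from 3800 to 3980 generations here), which is the arithmetic behind the remark that huge samples buy you little.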

The other nice thing about the coalescent is that it provides very useful predictions. For example, the expected TMRCA of a pair of lineages equals 2N generations (for larger samples it approaches 4N), which means that if you are able to estimate the TMRCA, then you have an estimate of 2N, which is quite handy. This expectation, and all its variations based on changes in population size, migration, and population splits (not to mention selection), are the root of the use of the coalescent for population-genetic inference.

Yet how do you know the TMRCA of an empirical sample? Well, you don't. To obtain inferences, you need to compare patterns observed in the simulations with those observed in the empirical data. But you are a geneticist, and what you like to compare are genetic diversity patterns. Enter the next step of the coalescent simulation: introducing genetic diversity. To do this, your favourite coalescent simulator will pepper the genealogy with mutations: here, new colours appearing in the genealogy, and being inherited down the genealogy, represent mutations. At the end of the process (that is, at present time, the last row), you'll have genetic diversity in your sample.
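Under the hood, most simulators do something like the following (a minimal sketch of mine, with invented branch lengths and an assumed locus-wide mutation rate): the number of mutations falling on a branch is drawn from a Poisson distribution with mean equal to the mutation rate times the branch length.

# sprinkling mutations on the branches of a genealogy (all values are invented)
set.seed(3)
mu <- 1e-3                                  # locus-wide mutation rate per generation
branch.lengths <- c(120, 450, 800, 3000)    # branch lengths, in generations
rpois(length(branch.lengths), lambda = mu * branch.lengths)   # mutations per branch

Longer (typically older) branches carry more mutations, which is how the shape of the genealogy ends up written into the diversity of the present-day sample.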

Here, the "dead-end" lineages have been removed from the picture: you cannot have sampled them anyway, and therefore you have no genotypes for them (notice, though, that many programs can accommodate data from past generations, thus possibly allowing the sampling of "dead-end" lineages).

The coalescent simulation part of the analytical pipeline stops here: your simulator has produced a sample with its genetic characteristics. Usually, the simulation is repeated a very large number of times (generally drawing the simulation parameters from prior distributions), so that the genetic properties of the empirical population under study can be compared with a very large array of simulations.

This is what the coalescent is and does, in a nutshell. What is it not, and what does it not do?

What the coalescent is not and will not do

The coalescent is not ABC

I have very often seen people conflate the coalescent with ABC (Approximate Bayesian Computation). They are not at all the same thing, nor do they need to be glued together. The former is, as explained above, a tool to simulate genealogies and data; the latter is a method to infer parameters from data, which does indeed involve simulations. Using ABC to estimate the parameters of an empirical population-genetic sample, by comparing its genetic properties with those of a very large number of coalescent-simulated samples, is an extremely effective way of combining the power of the coalescent and of ABC (see Csilléry et al. (2010) for a clear explanation).

Yet the simulations that ABC rests upon do not necessarily need to be based on the coalescent. Provided you have a tool to simulate some system (which may have nothing to do with genetics) starting from a set of parameter values, you can use ABC. True, ABC was invented by population geneticists (Mark Beaumont, Wenyang Zhang and David J Balding; Beaumont et al. (2002)), but I am assured they're not jealous if one uses ABC for anything other than genetic data (I'm not going to detail why population geneticists use it, nor am I going to describe the differences between ABC and "true" Bayesian methods); and it is true that famous software packages use the coalescent and ABC together (e.g., DIYABC), which maybe contributes to the feeling that the coalescent and ABC are one and the same thing.
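To make the separation tangible, here is a bare-bones rejection-ABC sketch (mine, not DIYABC's algorithm) in which the simulator is deliberately not a coalescent, and not even genetic: we estimate the mean of a Poisson process from invented count data.

# rejection ABC with a non-genetic simulator (all numbers are invented)
set.seed(42)
observed  <- rpois(50, lambda = 3.2)        # pretend field counts (say, seedlings per quadrat)
obs.stat  <- mean(observed)                 # summary statistic of the observed data
nSims     <- 50000
lambda.prior <- runif(nSims, 0, 10)         # candidate parameter values drawn from a prior
sim.stat  <- sapply(lambda.prior, function(l) mean(rpois(50, l)))   # simulate and summarise
keep      <- abs(sim.stat - obs.stat) < 0.1           # keep simulations close to the data
quantile(lambda.prior[keep], c(0.025, 0.5, 0.975))    # approximate posterior for lambda

Swap the rpois() simulator for a coalescent one and the sample mean for a battery of genetic summary statistics, and you get the population-genetic use case; the ABC machinery itself does not change.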

The coalescent does not suggest a time when there was only one individual of a species

The Most Recent Common Ancestor (MRCA) is not the ancestor of all living individuals of a given species. It is the ancestor of all the currently existing copies of the locus on which the coalescent is built; nor does it mean that, at some point back in time, there was only one copy of that locus around. It means that all copies belonging to all other "branches", or genealogies, have gone extinct, very likely by drift (see above). This is of course quite counterintuitive, but yes, if the postulates of the coalescent hold true, all current copies of a locus are derived from a single copy that was around at the TMRCA. Of course, at the TMRCA, that particular locus copy co-existed with (possibly plenty of) other copies, but those have not left any descendants surviving in the population (or, more precisely, in the observed sample) up to the present.

So, sorry, but you cannot use the coalescent to infer when Adam and Eve have been walking in the Garden of Eden. Use a Bible for that purpose.

The coalescent does not help to study very recent (and some very short) events

Because everything in the coalescent revolves around estimating the TMRCA, which is what allows one to estimate population parameters, particularly the effective population size, and because any two randomly picked copies of a locus may share a very old ancestor (possibly the MRCA itself), inferences from the coalescent may sometimes be highly biased.

Suppose, for example, that you have a very nice, large, historically stable forest stand (it could be a population of any other organism, but hey, this is a blog about forests); one day, you fell the whole forest, leaving one tree. Suppose you now analyse the coalescence of the loci carried by the chromosomes of this single tree (I suppose it is at least diploid: truly haploid tree species are really rare!). If that single tree is the outcome of a long history of random mating, then the chromosomes it carries are a random sample of the chromosomes of the population, and their coalescence may, on average, go back to a very distant past; based on this, you will likely conclude that the population's effective size is very large (remember: the TMRCA is proportional to effective population size)! Which of course does not make sense, because there is only one tree left (I'll come back in another post to the multifarious meanings and interpretations of the concept of effective population size).
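A quick numerical check of this claim (my own sketch, with arbitrary values; pairwise coalescence times are drawn from their geometric distribution and mutations from a Poisson):

# the two gene copies of the single surviving tree still "remember" the historical Ne
set.seed(7)
N.hist <- 10000                                # historical effective size (arbitrary)
mu     <- 1e-8                                 # per-site mutation rate per generation (assumed)
L      <- 1e6                                  # number of sites compared (assumed)
T2    <- rgeom(10000, prob = 1/(2*N.hist))     # 10000 replicate coalescence times of the pair
ndiff <- rpois(10000, lambda = 2 * mu * L * T2)  # differences accumulated along both branches
mean(ndiff) / (4 * mu * L)                     # diversity-based Ne estimate

The lone tree yields an Ne estimate close to the historical 10,000, which says something about the stand's past and nothing about its present.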

The same applies when the population has undergone a very short bottleneck not very long ago: once again, if there has been an "instant bottleneck", the lineages cannot have coalesced within the bottleneck, and the coalescence signal reflects the historical population size much more than the bottleneck size.

[more “what is not” items can be added: I invite the readers to make suggestions]

So, in conclusion: the coalescent is a wonderful tool, but one has to know exactly what it means and how to use it.

Bibliography

Beaumont, Mark, Wenyang Zhang, and David J. Balding. "Approximate Bayesian Computation in Population Genetics." Genetics 162, no. 4 (2002): 2025-2035.

Csilléry, Katalin, Michael G. B. Blum, Oscar E. Gaggiotti, and Olivier François. "Approximate Bayesian Computation (ABC) in Practice." Trends in Ecology & Evolution 25, no. 7 (2010): 410-418.

Kingman, J. F. C. "The Coalescent." Stochastic Processes and Their Applications 13, no. 3 (1982): 235-248.

Nordborg, Magnus. "Coalescent Theory." In Handbook of Statistical Genomics, vol. 1, 145-30. John Wiley & Sons, Ltd, 2019.

Summer school, and the teaching is easy

People with diverse backgrounds and objectives meet and mix in this particular type of course

The GenTree project held its 2018 Summer School, "From genotypes to phenotypes: assessing forest tree diversity in the wild", on 4-7 June 2018 in Kaunas, Lithuania. About twenty "students" (most of them Ph.D. students, but some of them established scientists) convened to learn the theory and practice of population (and quantitative) genetic analysis from five teachers (including myself).


We were warmly hosted by colleagues at Aleksandras Stulginskis University (ASU); the course was organised by fellow forest scientist Darius Danusevičius, with essential support from his students.

The course covered a variety of subjects, from the basics of population genetics theory and the coalescent to the application of multiple programs for Genotype-Environment Association and Genotype-Phenotype Association, and included a day out in the forest, where a demonstration of the use of drones for surveying forest stands was held. Very interesting, with plenty of information, although sometimes with a quite steep learning curve!

A summer school has the great advantage of letting us explore new teaching – and learning – strategies, because the goals are sometimes left 'open' and the teachers can adjust to the students' needs and limits (and of course, the students adjust to the teachers' limits!).

[photo: students and teachers visiting Lithuanian historical sites (credit: ASU Kaunas)]


Informal learning sessions extend beyond the official program, very often during the night, when the traditional tools facilitating the transmission of knowledge (slides, computer scripts, whiteboards, chalk, paper and pencil) are replaced by more unconventional ones (jokes, crisps, beers). We saw multiple teaching approaches, spanning from the "zero-electronics restless teacher" (myself: chalk and blackboard only, several kilometres walked while teaching), through the "activity-time interactive teacher" (Tanja Pyhäjärvi: having the students stand up and do some exercise, then sit down and do some more exercises, this time through shinyApps), to "100% hands-on teaching" (Santi González-Martínez and Leo Sanchez, with their rich array of software packages, scripts and datasets to put to the test) (I cannot say what Basti Richter did with his drones out in the forest; I had to leave earlier). All this was peppered with contests (spelling out population genetics laws, presenting a piece of one's country's popular culture, declaring one's favourite sports team, movie, even philosopher – which actually provided some surprises: for example, I was unaware that Donald Trump was a philosopher at all, but then again I am not a philosopher, so how on earth was I supposed to know?).

The mystery of number 19

At some point we thought we were close to uncovering some fundamental natural pattern, when the number nineteen started popping up recurrently in our lives. For example, there were nineteen of us in the bus that took us from Vilnius to Kaunas, and we concurrently learned that the first Lithuanian Republic was founded in 1919. Some of us were even reported to have drunk in excess of nineteen drinks in a single evening. After observing that there were no public seats on the otherwise very green ASU campus, we even formulated the hypothesis that there may be only nineteen public seats in the whole country (a short walk in downtown Kaunas allowed us to reject that hypothesis).
In the end, we dropped the idea that the number 19 carried some deeper meaning; so the only universally meaningful number is still 42.

Baroque paradise

And finally, on my way back I had the opportunity to do some walking in downtown Vilnius. In my ignorance, I did not know that its city centre is a UNESCO World Heritage site, and that it harbours some very nice examples of baroque architecture. Nice place; you should all go visit it.



In GPS we trust

Of lotteries and men

Nowadays, everybody relies on GPS, and nobody is capable of reading a map anymore.
Good old forest plot maps, drawn with compasses and distances measured on the ground? Gone. Trees are mapped by GPS, with variable precision and success. And there is no way anybody will give you directions on the road. You've got a GPS? Use it, for Hermes' sake.

But this post is not an old man's rant about technology.
I’m not talking about Global Positioning System.
I’m thinking of Grant Proposal Selection.

As everybody knows, the way grant proposals are chosen for funding is a lottery (when your favourite proposal is turned down) or a very meritocratic process carried out by clever, competent reviewers (when you get funded).

Yet, it must be either one or the other, or a mix of the two.
So, while I was submitting my latest grant proposal, I wondered: is this all worth the effort? What's the point of all the energy, the stress, the nights up moving a sentence there and changing a word here, the days on the phone discussing with collaborators, if it all boils down to a random outcome?
When success rates are very low, one may suspect that all the money spent by funding agencies to rank all those very good proposals is a waste: in the end, who's in and who's out may just be a matter of a fluke in the review process. Maybe a reviewer had a bad night, is upset today, and turns a mark down a notch, and out goes your great idea. It would be better to draw tickets from a lottery.

How can one check whether this is true, or not? To know whether the real good proposals are really the ones that have been funded, one should know beforehand which are the good ones. But this is tantamount to evaluating the proposals, which brings us back to square one.

There is a way to assess the process, though: playing games. I mean: doing some modelling.

So I set out and built a simple model mimicking the French ANR's selection process, which proceeds in two steps: a selection on short pre-proposals, followed by a second round of selection on the full proposals that passed the first check.
According to the data provided by the agency, in 2017 about 3500 proposals, out of about 7000, passed the pre-proposal phase, and in the end about 900 were funded, for a success rate of about 12-13%.

So I simulated “true” scores for 1000 “proposals” according to a gamma distribution.

The distribution of proposal true values looks like this:

[figure: histTrueValues.jpg – histogram of the simulated "true" project values]
Then I supposed that two reviewers examined each proposal, each providing a score built by adding to the true value an "error" drawn from a Gaussian with mean zero; basically, each reviewer introduced white noise into the score. The final pre-proposal score was the mean of the two reviewers' scores. True to the fact that I had introduced noise, the scores were dispersed around the true values.
[figure: TrueValVsEvalMean.jpg – first-round mean scores plotted against true values]
The top half of the ranking (on the y axis of the plot above, the reviewers' scores) went on to phase two, and then the process started all over again, with a smaller error (full proposals provide more details, so it should be easier to assess their "true" value). The top 20% tier was "funded". Then I compared the "true" scores of the winners with the final reviewers' scores. In a perfect world, the successful proposals should be those with the best "true" values. Is this the case?

Yes and no.
The dispersion of ranks looked large: the relationship between the "true" ranking of a given proposal (x axis) and its final ranking (y axis) did not look very tight, even though there was a clear trend in favour of the best proposals:

[figure: rankVsRank.jpg – final evaluation ranks against true ranks]

How many of the proposals belonging to the top 20% post-evaluation are also in the top 20% of the “true” values? Around 70%.

[figure: selection.png – Venn diagram of the "true" top 20% versus the selected proposals]

In other words, approximately one third of the proposals that should have been selected were turned down, and were replaced by proposals that do not belong there. Is that satisfactory? Unsatisfactory? I'll let you decide.
What is the alternative (apart from scrapping this system for funding science altogether)? Let us suppose that we skip the second round of selection, and that we randomly draw the proposals to fund from those that passed phase one. How do we fare?
[figure: random.png – Venn diagram of the "true" top 20% versus randomly drawn proposals]

Quite poorly. Only 16% of the funded projects are “good” ones. So, after all, there seems to be some value in the GPS (even though it can be as imprecise as a GPS under thick forest cover).

Of course, the outcome depends on the size of the error introduced by the evaluation process: increase it, and the funded projects will belong less and less to the group of the "good" ones. And I am not accounting for the effect of the PI's previous track record, nor for the fact that, if you have been funded previously, there is a high chance that you'll be funded again. Actually, a recent study – which I warmly recommend reading – shows that, when the funding system has such a "memory", it produces large inequality and favours luck over merit!

The R code I used to run the simulations is posted below. You can play around with it and see what happens. Enjoy the game – if you are not busy with a GPS.

——————–

#rules of the game:
#overall funding rate is 10%
#we start with 1000 projects.
#at the first round, 50% of the proposals are selected
#at the second round, 20% of the remaining projects are
#selected for funding.
#Project "true" values are gamma-distributed.
#At each step, the reviewers' evaluation equals
#the "true" value of each project plus white (gaussian) noise
#noise is twice as strong in the first round as in the second
#
#generating the distribution of "true" project marks
nSubmitted = 1000
excludedFirstRound = 0.5
funded = 0.1
marks.gamma <- rgamma(nSubmitted, shape = 1)
#plotting:
jpeg(filename = "histTrueValues.jpg")
hist(marks.gamma, breaks = 20, main = "True project values",
     xlab = "Project value", col = "blue")
dev.off()
#
#generating first-round evaluators' marks (by introducing noise)
evaluator1.noise <- marks.gamma + rnorm(n = nSubmitted, sd = 2)
evaluator2.noise <- marks.gamma + rnorm(n = nSubmitted, sd = 2)
#producing final marks (mean of the two reviewers)
evalMean.noise <- rowMeans(cbind(evaluator1.noise, evaluator2.noise))
#visualising the relationship:
#plotting:
jpeg(filename = "TrueValVsEvalMean.jpg")
plot(evalMean.noise ~ marks.gamma,
     xlab = "True Values",
     ylab = "Mean 1st round mark",
     pch = 21, bg = "red")
dev.off()
#building a data frame
projects.df <- data.frame(seq(1, nSubmitted, 1), marks.gamma, evalMean.noise)
names(projects.df) <- c("projId", "trueVal", "score1stRound")
#1st round of selection:
excluded1round <- which(projects.df$score1stRound <= quantile(projects.df$score1stRound,
                                                              probs = excludedFirstRound))
#generating 2nd round scores:
evaluator1.noise <- marks.gamma + rnorm(n = nSubmitted, sd = 1)
evaluator2.noise <- marks.gamma + rnorm(n = nSubmitted, sd = 1)
evalMean.noise <- rowMeans(cbind(evaluator1.noise, evaluator2.noise))
projects.df$score2ndRound <- evalMean.noise
projects.df$score2ndRound[excluded1round] <- NA
#computing ranks:
projects.df$trueRanking <- rank(-projects.df$trueVal)
projects.df$ranking2ndRound <- rank(-projects.df$score2ndRound)
projects.df$ranking2ndRound[excluded1round] <- NA

#let us have a look at the rankings based on true scores vs final rankings:
#plotting:
jpeg(filename = "rankVsRank.jpg")
plot(projects.df$ranking2ndRound ~ projects.df$trueRanking,
     xlab = "True Ranks",
     ylab = "Final evaluation ranks",
     pch = 21, bg = "aquamarine")
dev.off()
#
projectsFunded.df <- projects.df[which(projects.df$ranking2ndRound <= nSubmitted*funded), ]
projectsRandomlyFunded.df <- projects.df[sample(
  which(!is.na(projects.df$ranking2ndRound)), size = nSubmitted*funded), ]
projectsBestTrueRanks.df <- projects.df[which(projects.df$trueRanking <= nSubmitted*funded), ]
#how effective is the selection process?
length(intersect(projectsBestTrueRanks.df$projId, projectsFunded.df$projId))
length(intersect(projectsBestTrueRanks.df$projId, projectsRandomlyFunded.df$projId))
library(VennDiagram)
#plotting
venn.diagram(list(True = projectsBestTrueRanks.df$projId,
                  Selected = projectsFunded.df$projId),
             filename = "selection.png",
             imagetype = "png",
             fill = c("palegreen", "palevioletred"))
#
venn.diagram(list(True = projectsBestTrueRanks.df$projId,
                  Random = projectsRandomlyFunded.df$projId),
             filename = "random.png",
             imagetype = "png",
             fill = c("palegreen", "sandybrown"))
#

 

Did you say, “gradualist”?

How gradual must evolution be for an evolutionist to be called “gradualist”?

Today I was reading the interesting, and by all means very good, paper by Lowe et al. (2017) Trends Ecol. Evol. 32: 141-152, where the authors say in the Introduction:

“In the past decade, ecologists have embraced the concept of eco-evolutionary dynamics, which emphasizes the power of ecological selection to cause rapid adaptation and, likewise, for adaptive evolution to influence ecological processes in real time [10,11]. The perceived novelty of this concept appears to stem from the fast rate of interaction between ecological conditions and phenotypic adaptation, which contrasts with traditional, gradualistic models of adaptation.” (my highlight).

A quick question and comment crossed my mind.

This whole idea that evolutionary processes were "traditionally" considered gradual is a vast hoax.

Consider uncle Charlie (Darwin) himself. Where did he get (the mechanism behind) the theory of evolution by natural selection? From fossils? From rare mutations in DNA sequences?

NO. He got it from observing the people around him selecting pigeons, sheep and ornamental plants. Did that happen over millions of years? No. It happened over a few generations (of pigeons!).

So the whole story that we realised only in the last decade that evolution can happen quickly, while "traditional" science (read: conservative, backward, "standard" evolutionary biologists) thought it impossible, is plain wrong.

It is true that one (generally) only finds the things she looks for. Consequently, if we think that evolution over a few generations (or, for that matter, genetic divergence over a few tens of metres) cannot happen, then we'll never look for it and never find it.

I invite you, reader, to take half an hour and go wander through the older ecological-genetics literature, and you'll find abundant proof that people have kept looking for – and finding! – fast evolutionary shifts for at least a century. I'm not even talking about Biston betularia or the LTEE. I'm talking about plenty of observations that have repeatedly documented fast evolution everywhere.

 

It’s the demography, stupid!

Of Sharks, Giraffes and Malthus. Mostly, Malthus.

First, let me make it clear that, as far as I know, the sentence “It’s the economy, stupid!” was never used, orally or in writing, by the (Bill) Clinton campaigns. It may be a nice summary, but it was never used as such. But let us go back to our topic.

One day, I was teaching teachers who teach biology teachers how to teach biology (yes, it is a true sentence). And I asked the teachers’ teachers to spell out the mechanism of evolution by natural selection. Everybody told me: there is variation; variation is heritable; then the best individuals survive/produce more offspring and evolution happens.

That's right. Selection is perceived as a sort of mechanism testing the 'adequacy' (we call it 'adaptation') of individuals to their environment. Somehow, we're still pretty much innately Lamarckian*, after all: we think in terms of how an individual copes with its everyday problems. While this is certainly an important component of 'adaptation' in general terms, and while for sure it is individuals (and not genes, or populations, or – heavens forbid – species) who survive or die, reproduce or not, the description falls short of capturing the actual mechanism of evolution, because it misses an essential component.

Thinking in terms of individual properties and problems is not exactly the right way of looking at selection and adaptation. As the Australian saying goes, "when there is a shark in the water, you do not need to swim faster than the shark: you need to swim faster than the slowest swimmer" (I'll let you generalise to the case where there are n sharks in the water, I know you can manage; and I'm sure you know a regional version of the saying, with threats other than sharks). The point is that selection is not about an individual's relationship to the environment, but about whether you do better or worse than somebody else.

[photo: a great white shark scattering mackerel scad]

And this brings to the fore the essential cog in selection's machinery that many people (including biologists – except evolutionary biologists themselves) miss: demography. It is because there are always many more offspring than the environment can carry that, eventually, some of them die or do not reproduce.

[photo: king penguins]

The "fitness" part of the game is, of course, that those who better exploit the available resources, or better cope with stress, do better in terms of survival and reproduction and leave more offspring (the way the parents' traits are inherited does not change a thing). If there were infinite space and resources, nobody would suffer from selection, and everything would behave according to neutral evolution. Darwin borrowed the idea from Malthus, as everybody knows, and this is the piece that makes the difference, in terms of explanatory power, between any other evolutionary hypothesis and the Modern Synthesis' successful one**. It is because some individuals die, or do not reproduce, that there is adaptation. Somehow, we should be happy to observe (moderate amounts of) mortality (in forests there's a lot of it) and unequal fecundity in populations, because this is how adaptation occurs.

The Malthusian piece of Darwin's genius idea is understandably hard to swallow. As one of the teachers' teachers exclaimed, after I had pointed out the strict necessity of the cruel Malthusian piece of the theory: "oh, that's so SAD". Yes, life is unfair, but adaptive biological evolution happens only if there are winners and losers. Now, if I were in the losers' camp, I'd rather no evolution-by-selection happened at all, but this is the way it is – no social or anthropocentric judgement attached.

 

 

*Lamarck, in spite of his post-Darwin very bad press covfefe, was a true evolutionary biologist, and a clever one at that. He lacked some important pieces of understanding of how selection works, but then again, Darwin too had silly views about heritability of traits.

** I refrain from attributing the idea entirely to dear uncle Charles for two reasons. First, he lacked the mathematical formalism to model the mechanism; second, I disagree with the identification of Evolutionary theory with one person, no matter how grateful we should all be to the genius that was Charles Darwin. After all, nobody talks in terms of “Einsteinism” or “Röntgenism”, so why should we talk about “Darwinism”?

Where have all the forest geneticists gone?

The missing mass of forest population geneticists at conferences leaves me wondering why they stay home

I’m back from a couple of conferences: the ESEB meeting in Groningen and the SIBE meeting in Rome.

Both were terrific, and both allowed me to come back home with the usual mix of excitement (for the impressive amount of good science that people do, and for the truckload of good ideas I could grab) and frustration (for not having done all that good science myself!).

Among other things, I must stress the feeling of being (at 47) among the eldest at both conferences – and this is a very positive remark: of course, one gets older and thus climbs the pyramid of ages, but I reckon that evolutionary biology conference-goers are, on average, pretty young and impressively competent. This bodes well for the future of evolutionary biology!


Yet I kept wondering, throughout both conferences, where all my fellow forest evolutionary biologists were hiding. Certainly, those two conferences do not focus on forests, but they do not focus on fruit flies and mice either, and I heard plenty of talks on those critters. For sure, forest trees are not "model" species, but the share taken by model species at both conferences was, overall, very small, so there cannot have been a "filter" against papers on trees. The fact is, there were very few forests across the conference landscape. Somehow, I felt slightly lonely with my forest population genetics talks and posters.


Yet – although I'll provide no list, for fear of omitting somebody – I know plenty of forest scientists who have made major contributions to asking and answering overarching (*) evolutionary questions and to developing evolutionary theory: evolutionary biology is a relevant playground for forest geneticists. So why was I so lonely? Why is the attendance of forest geneticists, young and old, at general conferences decreasing? Are they all busy tending to their science, with nothing worth sharing in their hands? Or is their budget, in terms of both time and money, shrinking so abruptly that they cannot afford those meetings any more? Or maybe they are falling back on their own community?

To check, I had a look at the program of the IUFRO general meeting, which will be held in Freiburg next week – IUFRO is the United Nations of forestry research, and every forest scientist goes to an IUFRO meeting every so often. Even there, although I carefully scrolled through all the symposia and checked the speaker lists, I could barely find the names of acclaimed or lesser-known forest geneticists. Essentially, our research field will not be represented there either (well, I confess: I am not attending, but I could not go to three conferences in less than a month).

Forest geneticists are deserting both general evolution / evolutionary genetics events and forest-focused meetings. Why? And – apart from forest genetics conferences – where do they go? I'd very much like to know the answers to those questions. I would also like to say that it is very important, for junior and senior scientists alike, to get out of our "comfort zone" and mix with people doing (relatively speaking) entirely different things. As I said above, one comes home with a suitcase full of great ideas.

(*) It is good to fit the word overarching into a text, from time to time. It makes you feel important.

A whole biome ablaze?

And it burns, burns, burns,
The ring of fire, the ring of fire.

Mediterranean forests are burning.
All of a sudden, Portugal, France, Italy, Greece… fires – sometimes large, out-of-control, deadly wildfires – are burning all over Mediterranean forest ecosystems.

Hot temperatures, little rainfall, strong winds, and a dense human population: all factors are there for the perfect firestorm. If this is what climate change has in store for us, well, the outlook for Mediterranean forests is bleak.

[photo: the June 2017 Pedrógão Grande wildfires, Portugal]

Besides stopping climate change (ha ha ha!) and fencing humans out of forests (unlikely to work either), what can be done?

There is only one word: MANAGEMENT (the alternative is: ashes).

My lab's director, Eric Rigolot, has provided some clues in an interview (in French) with the French Huffington Post website. What does he say? That we have to use managed fires (prescribed burning) to prevent big, uncontrolled wildfires. This technique is common practice on other continents, but not in Europe.

I would add: the vegetation itself (the fuel) must be managed in ways that minimise fire spread, if not ignition. This is particularly true where human beings are likely to wander, because they are, most of the time, albeit often unwittingly, the source of fires.

Forests must be tended, must be gardened. In Europe, they stopped being wilderness a long time ago, so the potential argument that, by managing forests, we alter some fancy natural equilibrium is nonsense. It may be valid for some truly pristine biomes (if any are left), but not in Europe, not around the Mediterranean basin.

This means we are responsible for the health of our forests, including by limiting the effects of fires that we are the primary cause of.