Is it real? A reflection on what to call simulation data.

We scientists are used to doing the occasional “back-of-the-envelope” calculation: an approximate calculation from fuzzy data and hypotheses, used to grasp the order of magnitude of some quantity before starting the real work in a more formal way (I suspect that, on the contrary, in some decision-making situations the analysis stops with the envelope itself, hence some major policy failures).

Today we’re trying an even more ambitious, and even fuzzier, exercise: back-of-the-envelope philosophy of science. We’re going to think about what we call objects of everyday scientific use that differ in nature and status, and to which (here’s my argument) we assign the wrong names, leading to mistakes in our perception of what they actually are. I’m talking about two categories of objects in particular: on the one hand, data that are generated by running some computer code, without collecting or analysing any sample from the outer world; on the other hand, data that require inspecting non-computer-generated objects.

Typically, when developing a new method to analyse or model data, the scientist first applies it to digitally generated data, and then provides an example of application to “real-life” data (quotes obligatory here, if you subscribe to my view below). This pair of classes of objects goes by different names in the literature, but none of those name pairs is, to me, satisfactory.

In my (incomplete, biased etc.) library of scientific papers, under the subset “population genetics methods”, there are 140 papers with the keyword “simulation”. Within these, 41 use “empirical” to describe “field” data; 26 use “real” to describe such data; 27 use “empirical” and “real” as synonyms, or indicating closely related concepts pertaining to “field” data; 8 use “real” for field data and “empirical” for simulation data; 4 refer to simulation output as “empirical”.

Theoretical vs empirical vs real?

“Empirical” is the term sometimes used for an expected distribution of some statistic produced from simulations. So, what comes from simulations is called “empirical”, yet “empirical evidence” in science is associated with observations out there, in the field, based on data that, so to speak, arise from processes that the investigator does not entirely control (unlike simulations, where starting conditions are entirely chosen by the researcher). So, it would seem that the “empirical envelopes” (on the back of which this post is written) that one computes from simulations are empirical in a different way than empirical evidence. No good. And then there are “real” data, which are, of course, also empirical (more empirical, I should say, from a philosophical standpoint). Even worse.

Theoretical vs simulated vs real?

Most often, though (26+27 = 53 instances out of 140), the chosen pair is “simulated vs real”.

The “simulated vs real” alternative is, to me, not very good either. Indeed, if the simulated data are not “real”, then what are they? Imaginary? Unreal? If they’re imaginary, are they as imaginary as a unicorn? (I hope my 7-year-old daughter will not read this post any time soon.)


A unicorn is likely an imaginary thing. Yet there are real unicorns, too.

In Edinburgh, for example.


And would one have scientific results rest on imaginary stuff? If they are “unreal”, does that mean “unrealistic”? Yet the whole point of simulations is that they should mimic reality. Plus, simulations are really real to me: they do exist, at least as combinations of zeroes and ones in some memory storage device. It’s something you can “touch” and “see”, provided you have the right device. So they are about as real (actually, exactly as real) as data of the same type obtained by applying some inspection technique to samples collected in the field. They are actually more real than the academic institution that pays my salary: you can certainly touch and see the buildings and the staff of INRAE, but can you touch and see the institution itself? No, you cannot. It only exists as a social convention. Social conventions are also real, of course, but they exist out of mutual agreement among humans, not because they are stored on a physical medium.

A proposal: theoretical vs simulated vs empirical

“Theoretical” sits on safe ground. Nobody will challenge what a theoretical expectation is. It is derived mathematically from first principles and is “unaffected” by data (it may be flawed, of course, and data may contribute to rejecting it; but it does not build upon data). And, of course, a theoretical expectation is also real.

But then, what to call those things that exist in a computer and are generated inside it, without ever receiving any input from the outside world? I suggest calling them by the most natural name, even if it is not precise (see below): let us stay with simulated (if you wish to be verbose, you could say “computer-generated”). Saying you have produced a “simulated envelope” for a statistic should not be a source of shock to anyone, and it has the advantage of making it entirely clear that no field data were harmed in the production of such an envelope (something that is not entirely clear if one says “empirical envelope”).
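For readers who have never built one, here is a minimal sketch of what a “simulated envelope” could look like in practice. Everything in it is an assumption made for illustration: the statistic (the mean of a sample), the null model (a standard normal distribution), the function names and the default settings are all hypothetical, not taken from any particular paper or package.

```python
import random
import statistics


def simulate_statistic(n_individuals, rng):
    """Hypothetical statistic: mean of n draws from a standard normal null model."""
    return statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_individuals)])


def simulated_envelope(n_replicates=2000, n_individuals=50, level=0.95, seed=42):
    """Return (lower, upper) bounds holding roughly `level` of the simulated values."""
    rng = random.Random(seed)  # fixed seed so the envelope is reproducible
    values = sorted(
        simulate_statistic(n_individuals, rng) for _ in range(n_replicates)
    )
    tail = (1.0 - level) / 2.0
    # Simple order-statistic quantiles; no interpolation between ranks.
    lower = values[int(tail * (n_replicates - 1))]
    upper = values[int((1.0 - tail) * (n_replicates - 1))]
    return lower, upper


def inside_envelope(observed, envelope):
    """True if an observed value of the statistic falls within the envelope."""
    lower, upper = envelope
    return lower <= observed <= upper


if __name__ == "__main__":
    envelope = simulated_envelope()
    print("simulated envelope:", envelope)
    print("observed 0.1 inside?", inside_envelope(0.1, envelope))
```

Note that nothing from outside the machine enters this computation until the very last step, when a value measured in the field is compared with the envelope.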

As a consequence, this leaves the good old “empirical” for things arising from the analysis of data coming from outside the machine (with the notable exception of science studying the behaviour of the machine itself; the post you’re reading belongs to a blog about forests, so we will not consider this case). I call “empirical” any data provided by the natural-world objects that are the primary focus of your science (even if you have manipulated them, as in a reciprocal transplant test). I am aware this definition also comes with drawbacks and limitations, but I reckon it to be the least bad alternative. This is the choice made by 41 examples out of 140, and I suggest siding with them.

If you want to be verbose, you can say “experiment-generated” or “observation-generated”. The [something]-generated option is entirely explicit and unambiguous. It is probably the best one, but maybe too verbose.

A posh solution: enter Latin.

The computer-generated vs experiment/observation-generated alternative may benefit from some ancient tongue. We could, quite elegantly and concisely, say in silico for things generated on a computer, and in natura for data obtained from the outer world (it could be in vivo as well, but this is locked in a pair with another opponent, in vitro, and may lead to confusion). As for distinguishing experiment from observation with a Latin expression, I do not currently have any solution to offer, except perhaps recycling the in situ / ex situ pair, which is not entirely satisfactory.

Of course, this view of mine should be taken cum grano salis.

And very likely, the best solution will be a compromise: in medio stat virtus.