These days, the scientific community seems concerned about the lack of reproducibility of experimental results in biology and related fields (e.g. medicine, neuroscience, psychology), particularly where statistics are involved. Say some study claims that there is a significant correlation between X (eating potatoes) and Y (developing lung cancer), with p<0.05. The next study tries to replicate it and finds no significant correlation. This happens all the time. Every week we read a media report about a peer-reviewed study finding a significant correlation between something we eat and some disease. But in general those reports are not taken too seriously, at least by scientists and doctors. Why is that? Isn't “statistically significant” precisely supposed to mean that the observed outcome was not a matter of chance and that we should trust it? If yes, then why shouldn't we take those reports seriously? And if not, then why do we keep on backing up scientific claims with statistics?

I have to say I am a bit surprised that the biology fields generally don't seem to fully draw the implications of this rather obvious observation. Paper after paper, X and Y are “significantly different” or “not significantly different”, and some scientific conclusions seem to be drawn from this statistical statement.

Let us briefly recall what “significantly different with p<0.05” means. You have two sets of quantitative observations, obtained in two different conditions. You calculate the difference of their means, and you want to know whether the result is just due to irrelevant variability or whether it really reflects a difference due to the different conditions. To say that the difference is “significant with p<0.05” essentially means that if the two sets of observations were drawn from the same distribution (with the same mean), then there would be less than a 5% chance of observing a difference at least that large. For example, if you do the same experiment 20 times, then you should expect to find a “statistically significant” difference about once, even if the experimental condition actually has no effect at all on what you are measuring. For the biologists: out of every 20 experiments, you will find on average one statistically significant result even when there is actually nothing interesting. I bet only that experiment will be published. And that is if only one particular condition is monitored. For the epidemiologists: if you monitor 20 correlations, chances are that one of them will be statistically significant by pure chance.
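The "one in twenty" arithmetic can be checked with a small simulation. This is only a sketch under stated assumptions: the group size (50), the number of simulated experiments (2000), the Gaussian data and the large-sample z approximation for the p-value are all illustrative choices of mine, not anything prescribed by a particular study. The function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance

def two_sample_p_value(a, b):
    """Two-sided p-value for a difference of means (large-sample z approximation)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=2000, n_per_group=50, alpha=0.05, seed=0):
    """Fraction of 'significant' results when both groups share one distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]  # no real effect
        if two_sample_p_value(control, treated) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
# Chance that at least one of 20 independent null tests comes out "significant".
p_at_least_one = 1 - (1 - 0.05) ** 20

print(f"false positive rate with no real effect: {rate:.3f}")
print(f"P(at least one hit among 20 null tests): {p_at_least_one:.2f}")
```

The simulated rate should land close to 5%, and the second number, about 64%, is the chance that at least one of 20 independent tests is "significant" when nothing at all is going on.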

It is ironic that this replicability crisis comes at the same time as the hype around “big data”. Michael Jordan, one of the leading experts in machine learning, pointed out very clearly that we don't have the statistical tools to deal with big data. Looking for statistically significant correlations among piles of data is nonsense. Of course you will find them, tons of them. And how will you know which ones are meaningful?

As I pointed out above, in biology there is a huge selection bias, i.e. you publish only statistically significant results. Some say that we should also publish negative results, and not just positive results. But the problem is that “not statistically significant” is not a negative result. It's no result at all. It says that we haven't seen anything. It doesn't mean there is nothing. Maybe we would see something with more observations. We just can't know. I often read in biology papers that “X and Y are not significantly correlated” as if it were a result, i.e. as if X and Y had nothing to do with each other. But that is not true at all! It's a lack of result, neither a positive nor a negative result.

So some have argued for increasing the number of observations (i.e. cells/animals/subjects), so that we can say something like: X and Y are significantly different with p<0.005. I think this misses the point and reveals a deeper problem, which is epistemological in nature, not just statistical. Here is a quote I like, attributed to Ernest Rutherford, a physicist who won the Nobel Prize in Chemistry: "If your experiment needs statistics, you ought to have done a better experiment". That's a bit exaggerated, probably, but quite true in my opinion. If you need statistics, it's because the result is not obvious; in other words, the difference between X and Y might be “statistically significant” but it's tiny. Statistical significance is not at all about significance in the usual sense of the word. A difference of 0.1% can be statistically significant, if you have enough observations. But think about it: in a complex system, like a living organism for example, would you expect one part of the system to be absolutely uncorrelated with another part of the system? That would basically mean that the two parts are unconnected and therefore not part of the same system. For example, I'm pretty sure that eating potatoes is positively or negatively correlated with developing lung cancer. There might be a correlation between potato eating and income, and there is clearly a correlation between income and any kind of disease. I'm not even mentioning the effect of potatoes on metabolism, which certainly has some slight correlation with cancer development or the immune system. However tiny these correlations might be, you will still find a significant correlation between potato eating and lung cancer, if you look at enough cases.
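To make the point about sample size concrete, here is a minimal sketch that computes the p-value you would get for a fixed, tiny standardized difference as the number of observations grows. It assumes two equal-sized groups with equal, known variance (a plain z-test), and the effect of 1% of a standard deviation is an arbitrary value chosen for illustration. The function name is hypothetical.

```python
import math

def p_value_for_effect(effect_size, n_per_group):
    """Two-sided p-value when the observed standardized mean difference is effect_size.

    Assumes two groups of n_per_group observations with equal, known variance,
    so the z statistic is effect_size * sqrt(n_per_group / 2).
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.01  # difference of 1% of a standard deviation (illustrative)
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {p_value_for_effect(tiny_effect, n):.3g}")
```

The effect is the same in all three rows; only the number of observations changes, and that alone takes the result from nowhere near significance to overwhelmingly "significant".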

So in reality, in a complex system, one should expect that every single pair of variables is correlated, and any condition that affects the system should affect all its components in various, possibly tiny, amounts. Therefore “significantly different” and “significantly correlated” are really not very useful statistical concepts for biology. The first thing to do should be to start reporting how big those differences or correlations actually are, and not just whether they exist. A useful statistical concept, for example, is effect size: the difference between the means of the two groups of observations divided by their standard deviation. So for example, an effect size of 1 tells you that the mean difference you observe is of the same order as the intrinsic variability in each group, which would be considered quite a strong effect. Effect size is much closer to what we would naturally mean by “significant” than statistical significance. If we reported effect sizes for all the observed cognitive differences between men and women that have been publicized in media and books, we would find in most cases that they are statistically significant but that their effect size is very small. As Dindia put it concisely, “Men are from North Dakota, women are from South Dakota”.
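For reference, effect size as defined above takes only a few lines to compute. The sketch below uses the pooled standard deviation of the two groups (the variant usually called Cohen's d), which is one common choice and an assumption on my part; the two small data sets are made up purely for illustration.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference of the group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5

# Made-up measurements, for illustration only.
control = [1, 2, 3, 4, 5]
treated = [2, 3, 4, 5, 6]

d = cohens_d(treated, control)
print(f"effect size: {d:.2f}")
```

Unlike a p-value, this number does not shrink or grow just because you add observations; more data only makes the estimate of it more precise.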

In this context, I don't find the call for increasing the number of observations in biology papers (i.e. of animals) very ethical. Reporting effect sizes would be the minimum that everyone should do in biology papers, and if the effect is tiny, then why bother increasing the number of observations to show that it is indeed tiny, yes, but statistically significant?

This obviously doesn't solve all the issues. I'll mention just one more. Very often, groups are compared that differ not by one condition, but by many. This is typical in epidemiology, for example. You are looking for the effect of obesity on heart disease, say, and you find a strong correlation. But you want to make sure that it isn't just due to the fact that obese people tend to exercise less, for example. It's crucial because in that case dieting would probably not be effective. The standard way to deal with this issue is to do multilinear regression or analysis of variance, that is, to fit a statistical model that includes all the variables that you think might be important. These are almost always linear models, i.e. you assume that the observation you are interested in scales linearly with every variable. Then you will read in the paper that the authors have taken into account the other possible factors and so they can be sure that those are not involved.

I find this sort of statement hilarious. Who would think that a living organism is a linear system? A linear system makes no decisions. In a linear system, there is no state that is qualitatively different from any other state. There is no life and death. There is no cancer. If you inject a dose of anesthetic into a human and monitor, say, heart rate, I bet that you won't find a linear relation between dose and heart rate.
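Here is a toy illustration of why linear adjustment can mislead, offered as a sketch under assumptions and not as a model of any real study. A hidden factor z acts through a threshold (a "decision") on two measured variables x and y, and x has no influence on y at all. We then "control for z" linearly by removing the best linear fit on z from both x and y and regressing the residuals, which gives the same coefficient for x as a multiple linear regression of y on x and z. All names, the threshold at zero and the noise level are hypothetical choices.

```python
import random
from statistics import mean

def slope_and_residuals(x, y):
    """Least-squares fit of y on x (with intercept); returns slope and residuals."""
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    residuals = [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]
    return slope, residuals

def adjusted_slope(n=5000, noise=0.2, seed=0):
    """Coefficient of x on y after linearly adjusting both for the confounder z."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    step = [1.0 if zi > 0 else 0.0 for zi in z]      # threshold effect of z
    x = [s + rng.gauss(0, noise) for s in step]      # x depends on z only
    y = [s + rng.gauss(0, noise) for s in step]      # y depends on z only, never on x
    _, x_resid = slope_and_residuals(z, x)           # "control for z" linearly
    _, y_resid = slope_and_residuals(z, y)
    slope, _ = slope_and_residuals(x_resid, y_resid)
    return slope

print(f"effect of x on y after linear adjustment for z: {adjusted_slope():.2f}")
```

The true effect of x on y is zero by construction, yet the adjusted coefficient comes out far from zero, because a straight line cannot absorb a threshold. Having included z in the model is not the same as having ruled it out.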

I am not saying that multilinear regressions and similar tools are completely useless (although possibly close to it in many cases); after all, linear approximations often work in some limited ranges (but that would need to be demonstrated specifically!). I simply mean that you can't trust their results. They should be taken as suggestive at best. If you look at the epidemiology literature, or simply at reviews and reports from the WHO, you'll find that there are not so many cases where experts hold strong convictions about causal relations between diet or habits and diseases. When they do, it is never based only on statistical correlations. It is a combination of epidemiological studies, which are difficult to interpret because of the issues I have mentioned, intervention studies (i.e. experiments with a control group) and detailed biological knowledge about the pathophysiology of the disease, i.e. biological experiments. Such is the case of the relation between smoking and lung cancer, for example. Hence the relevance of Rutherford's quote: if you need statistics, then you ought to have done a better experiment.

I would like to conclude by saying that, in my opinion, the problem with the use of statistics in biology and related fields is not due to a lack of mathematical (specifically, statistical) education. Rather, it is due to a lack of epistemological education or reflection, which is a much broader problem, not specific to biologists at all. The question is not whether you know all the statistical tests and how to use them. The question is whether you understand the epistemological value of results, statistical or not, i.e.: what does it tell me exactly about the system I am interested in, and what exact question, beyond the suggestive words (“significant”), does it provide an answer to?