On the expanding volume of experimental data – (I) So... where is the data?

One of the major motivations of the HBP, as I have heard from Henry Markram, is that we already produce tons of experimental data, and yet we still don't understand much about the brain. So what we need is not more data, but rather to do something with all that data. I don't think that the specific way in which the HBP proposes to deal with the data necessarily follows from this remark, but I do think that the remark is generally correct. There are many more papers than one can read, and much more data than one can analyze or comprehend. I would like to give my view on this problem as a theoretical neuroscientist.

First of all, as a theoretical neuroscientist, I have to ask the question that many of my colleagues probably have in mind: yes, there is an enormous amount of data, but where is it? Most theoreticians crave data. They want data. So where is it? On the hard drives of the scientists who produced it, where it usually stays. So there is a very simple reason why this enormous amount of data is not exploited: it is not shared in the first place. In my view, this is the first problem to solve. How is it that the data are not shared? If you read the instructions to authors of journals carefully, you generally find that they explicitly ask the authors to share all the data analyzed in the paper with anyone who asks. And authors have to sign that statement. I can hear some theoreticians laughing in the back! Let's face it: it almost never happens, unless you personally know the authors, and even then it is complicated. Why? I have heard mainly two reasons. There is the “I have to dig out the data” type of reason. That is, the data lie in a disorganized constellation of files and folders, in custom formats, etc.: basically, it's a mess. This is probably the same reason why many modelling papers don't share the code for their models: the authors are not proud of their code! There are some efforts underway to address this issue by developing standardized formats – for example NeuroML for neuron models, and a similar effort for experimental data led by the Allen Institute. I doubt that this will entirely solve the problem, but it is something.
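To make concrete what “standardized” can mean in practice, here is a minimal sketch (plain Python with hypothetical field names, nothing like the real NeuroML or NWB schemas) of the basic idea: store the raw numbers together with the metadata needed to reinterpret them, so the file describes itself and a stranger can use it without emailing the authors.

```python
import json

def save_recording(path, spike_times_s, metadata):
    """Bundle raw data with self-describing metadata in one file.

    Field names here are invented for illustration; real standards
    (NeuroML, NWB) define far richer, validated schemas.
    """
    record = {
        "format_version": "0.1",   # hypothetical version tag
        "units": "seconds",        # units travel with the data
        "metadata": metadata,      # who, when, preparation, etc.
        "spike_times": spike_times_s,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

def load_recording(path):
    """Anyone can reload and reanalyze without asking the producer."""
    with open(path) as f:
        return json.load(f)

# Usage sketch: save, then reload as a third party would.
save_recording("cell42.json", [0.012, 0.034, 0.107],
               {"experimenter": "A. Nonyme", "preparation": "slice"})
rec = load_recording("cell42.json")
```

The point is not the particular format but the principle: once the conventions are fixed in advance, “digging out the data” stops being a per-request chore.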

In any case, I think this first issue is solvable (see below). The second reason why people don't share their data is much more profound: they want to own the data they have produced. They want to own it because it has value. In the academic world of neuroscience, data is wealth. It takes effort and resources to produce data, and you don't want someone taking your data, analyzing it in a different way and publishing it. Many scientists have a few ideas in mind for using their data in different ways, and so they want to protect it. Yet this attitude directly contradicts the statement they signed when submitting their papers.

There is an additional factor that has more to do with the epistemological biases of the community. Compare these two abstracts. (1) “We wanted to test hypothesis A, and so we did the following experiment: it works”. Here the authors used data that they previously acquired, and confronted it with some hypothesis. (2) “We wanted to test hypothesis A, and so we checked previously published data: it works”. Here the authors used data that other authors previously acquired, and confronted it with some hypothesis. Exactly the same methodology. But you certainly know that abstract (2) will probably get rejected by the highest-impact journals. It is not “new data”. It has happened to me a number of times that reviewers complained that we used “old data”. Apparently the timestamp of the data was a problem. This might mean two things. One is that the value of a paper is essentially in the experimental data, and so if the raw data is the same, then the paper has no additional value. I find this statement philosophically inept, but in any case it is rather hypocritical, as it does not actually distinguish between abstracts (1) and (2). A second interpretation is that data is only meaningful if it was acquired after rather than before a hypothesis was expressed. From a logical standpoint, this is absurd. But I think it stems from the widespread Popperian fantasy that science is about testing theories with critical experiments. That is, at a given moment in time, there are a number of theories that are consistent with current facts, and you distinguish between them by a carefully chosen experiment. This is certainly not true from a historical point of view (see Thomas Kuhn), and it is naïve from a philosophical point of view (see Imre Lakatos). I have discussed this elsewhere in detail.
More trivially, there is the notion in the mind of many scientists (reviewers) that having the data beforehand is sort of “cheating”, because then it is trivial to come up with a hypothesis or theory that is consistent with it. I would argue that it is trivial only if a small amount of data is considered and the explanation is allowed to be as complex as the data; otherwise it is not trivial at all – rather, it is what science is all about. But in any case, it is again hypocritical, because when you read abstract (1) you don't know whether the data was actually produced before or after the hypothesis was expressed.

So what can we do about it? First, there is a trivial way to make a lot of progress on data sharing. Instructions to authors already specify that the data should be public. So let's just stop being hypocritical and make the submission of data mandatory at the time of paper submission or acceptance. This could be handled by journals, or by archiving initiatives (such as arXiv for preprints or ModelDB for models). I believe this will partly solve the “I have to dig out the data” issue, because authors will know in advance that they will have to submit the data. So they will do what is necessary – let's be honest for a minute: this is probably not the most complicated thing scientists are supposed to deal with. Quite possibly, standardized formats might help, but the first significant step is really to make the data accessible.

For the epistemological issue, the problem is more profound. Scientists have to accept that producing data is not the only valuable thing in science, and that making sense of the data is also valuable. I will address this specific point in the next post of this short series.

Update (20.08.2014). I just realized that Neuron actually makes it mandatory that some types of datasets (but not all) be submitted along with the paper.
