Back from a panel discussion on connectomics

I just came back from a panel discussion on connectomics between Moritz Helmstaedter and myself, at the Donders summer school on neural metrics. I will share my presentation when I figure out how to upload a 50 MB file to this website! (there are a few videos) In my presentation, I essentially point out that connectivity, especially anatomical connectivity (as opposed to functional connectivity), generally tells you very little about function. In particular, it doesn't help you distinguish between general theories of nervous system function (say, liquid state machines or attractor networks), because those theories could easily accommodate very different types of connectivity (as long as, say, network connectivity is recurrent).

What came up in reaction to that remark is classical Popperianism. That is, the notion that a scientific theory should aim for critical experiments, experiments that could immediately disqualify that theory. So theories of brain function ought to make falsifiable predictions about connectivity, and if they don't they hardly qualify as scientific theories (the term “laziness” was used).

I have two remarks. First of all, the idea that scientific theories are theories that are falsifiable by critical experiments, and that scientific progress essentially consists in performing those experiments, dates back to Popper's 1934 book, and a few things have been written since then. Apparently many scientists are philosophically stuck in the 1930s. Thomas Kuhn's historical analysis shows that science rarely progresses in this way. Of course it happens sometimes, but it's not the generic case. There are good reasons for that, which have been analyzed by philosophers such as Imre Lakatos. The basic remark is that a scientific theory is one that is falsifiable (the “demarcation criterion”), yes, but in practice it is also one that is falsified. There are always countless observations that do not fit the theoretical framework, and those are either ignored or the theory is amended with ad hoc assumptions, which might later be explained in a more satisfactory way (e.g. the feather falls more slowly than the hammer because of some other force; let's call it “friction”). So it is very rare that a single experiment can discard a broad theory, because the outcome can often be accommodated by the theory. This can seem like a flaw in the scientific discovery process, but it's not: it's unavoidable if we are dealing with the complexity of nature; an experimental outcome can be negative because the theory is wrong, or because, say, there might be a new planet that we didn't know about (“let's call it Neptune”). This is why science progresses through the long-term interaction of competing theories (what Lakatos calls “research programs”), and this is why insisting that scientific theories should produce critical experiments is a fundamental epistemological error.
Anyone who has spent a little time in research must have noticed that most hypothesis-driven papers actually test positive predictions of theories, the success of which they interpret as support for those theories.

The second remark is that, nonetheless, there is a bit of truth in the claim that theories of neural network function are difficult to confront with experiments. Certainly they are not very mature. I wouldn't say it is out of laziness, though. It is simply a very difficult task to build meaningful theories of the brain! But it is absolutely not true that they are insufficiently constrained for lack of data. Not only are they constrained, but I do not know of any such theory that is not immediately falsified by countless observations. There is not a single model of brain function that comes close to accounting for the complexity of animal behavior, let alone of physiological properties. How many theories in systems neuroscience are actually about systems, i.e. about how an organism might interact with an ecological environment, as opposed to describing responses of some neurons to some stimuli, interpreted as a “code”? The biggest challenge is not to distinguish between different theories that would all account for current data (none does), but to build at least one that could qualify as a quantitative theory of brain function.

Importantly, if this diagnosis is correct, then our efforts should rather be spent on developing theories (by this I mean broad, ambitious theories) than on producing yet more data when we have no theoretical framework to make use of them. This will be difficult as long as the field lives in the 1930s when it comes to epistemology, because any step towards an ambitious theory will be a theory that is falsified by current data, especially if we produce much more data. Can you make a scientific career out of publishing theories that are empirically wrong (but interesting)? As provocative as it might sound, I believe you should be able to, if we ever want to make progress on the theory of brain function – isn't that the goal of neuroscience?

On the expanding volume of experimental data – (I) So... where is the data?

One of the major motivations of the HBP, as I have heard from Henry Markram, is that we already produce tons of experimental data, and yet we still don't understand much about the brain. So what we need is not more data, but rather to do something with all that data. I don't think that the specific way in which the HBP proposes to deal with the data necessarily follows from this remark, but I do think that the remark is generally correct. There are so many more papers than one can read, and so much more data than one can analyze or comprehend. I would like to give my view on this problem as a theoretical neuroscientist.

First of all, as a theoretical neuroscientist, I have to ask the question that many of my colleagues probably have in mind: yes, there is an enormous amount of data, but where is it? Most theoreticians crave data. They want data. Where is it? On the hard drives of the scientists who produced it, and there it usually stays. So there is a very simple reason why this enormous amount of data is not exploited: it is not shared in the first place. In my view, this is the first problem to solve. How is it that the data are not shared? If you carefully read the instructions to authors of journals, you generally find that they explicitly ask the authors to share all the data analyzed in the paper with anyone who asks. And authors have to sign that statement. I can hear some theoreticians laughing in the back! Let's face it: it almost never happens, unless you personally know the authors, and even then it is complicated. Why? I have heard mainly two reasons. There is the “I have to dig out the data” type of reason. That is, the data lie in a disorganized constellation of files and folders, with custom formats etc.: basically, it's a mess. This is probably the same reason why many modelling papers don't share the code for their models: the authors are not proud of their code! There are some efforts underway to address this issue by developing standardized formats – for example NeuroML for neuron models, and a similar effort for experimental data led by the Allen Institute. I doubt that this will entirely solve the problem, but it is something.
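To make the “dig out the data” problem concrete, here is a minimal sketch of what a sharing-friendly export could look like: raw data in a plain, tool-agnostic format with explicit units, plus a machine-readable metadata file sitting next to it. The file names and fields below are purely illustrative assumptions of mine, not NeuroML or any actual standard.

```python
# Minimal sketch of a sharing-friendly export (hypothetical layout, not any
# actual standard): plain CSV for the raw trace, plus a JSON metadata
# "sidecar" so a stranger can reuse the data without emailing the authors.
import csv
import json
from pathlib import Path

def export_recording(times, voltages, outdir, metadata):
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Raw data in a plain, tool-agnostic format, with units in the header
    with open(outdir / "trace.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["t_ms", "v_mV"])
        w.writerows(zip(times, voltages))
    # Self-describing metadata stored next to the data
    with open(outdir / "trace.json", "w") as f:
        json.dump(metadata, f, indent=2)

export_recording(
    [0.0, 0.1, 0.2], [-65.0, -64.8, -64.5], "shared_dataset",
    {"species": "mouse", "region": "V1", "units": {"t": "ms", "v": "mV"}},
)
```

The point is not the specific format but the habit: if the metadata travels with the data in a documented layout, “digging out the data” stops being an excuse.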

In any case, I think this first issue is solvable (see below). The second reason why people don't share their data is much more profound: they want to own the data they have produced. They want to own it because it has value. In the academic world of neuroscience, data is wealth. It takes effort and resources to produce data, and you don't want someone taking your data, analyzing it in a different way and publishing it. Many scientists have in mind a few ideas to use their data in different ways, and so they want to protect it. Yet this attitude directly contradicts the statement they have signed when submitting their papers.

There is an additional factor that has more to do with the epistemological biases of the community. Compare these two abstracts. (1) “We wanted to test hypothesis A, and so we did the following experiment: it works”. Here the authors used data that they previously acquired, and confronted it with some hypothesis. (2) “We wanted to test hypothesis A, and so we checked previously published data: it works”. Here the authors used data that other authors previously acquired, and confronted it with some hypothesis. Exactly the same methodology. But you surely know that abstract (2) will probably get rejected by the highest-impact journals. It is not “new data”. It has happened to me a number of times that reviewers complained that we actually used “old data”. Apparently the timestamp of the data was a problem. This might mean two things. One is that the value of a paper lies essentially in the experimental data, so that if the raw data is the same, the paper has no additional value. I find this statement philosophically inept, but in any case it is rather hypocritical, as it does not actually distinguish between abstracts (1) and (2). A second interpretation is that data is only meaningful if it was acquired after rather than before a hypothesis was expressed. From a logical standpoint, this is absurd. But I think it stems from the widespread Popperian fantasy that science is about testing theories with critical experiments. That is, at a given moment in time, there are a number of theories that are consistent with current facts, and you distinguish between them by a carefully chosen experiment. This is certainly not true from a historical point of view (see Thomas Kuhn), and it is naïve from a philosophical point of view (see Imre Lakatos). I have discussed this elsewhere in detail.
More trivially, there is the notion in the minds of many scientists (reviewers) that having the data beforehand is sort of “cheating”, because then it is trivial to come up with a hypothesis or theory that is consistent with it. I would argue that it is trivial only if a small amount of data is considered and the explanation is allowed to be as complex as the data; otherwise it is not trivial at all – rather, it is actually what science is all about. But in any case, it is again hypocritical, because when you read abstract (1) you don't know whether the data was actually produced after or before the hypothesis was expressed.

So what can we do about it? First, there is a trivial way to make a lot of progress on data sharing. Instructions to authors already specify that the data should be public. So let's just stop being hypocritical and make the submission of data mandatory at the time of paper submission, or acceptance. This could be handled by journals, or by archiving initiatives (such as arXiv for preprints or ModelDB for models). I believe this would partly solve the “I have to dig out the data” issue, because authors would know in advance that they will have to submit the data. So they will do what is necessary – let's be honest for a minute: this is probably not the most complicated thing scientists are supposed to deal with. Quite possibly, standardized formats might help, but the first significant step is really to make the data accessible.

For the epistemological issue, the problem is more profound. Scientists have to accept that producing data is not the only valuable thing in science, and that making sense of the data is also valuable. I will address this specific point in the next post of this short series.

Update (20.08.2014). I just realized that Neuron actually makes it mandatory to submit some types of datasets along with the paper (but not all types).