In sensory systems, one of the hardest computational problems is the “invariance problem”: the same perceptual category can be associated with a large diversity of sensory signals. A classical example is the problem of recognizing a face: the same face can appear in different orientations relative to the observer and under different lighting conditions, and it is a challenge to design a recognition system that is invariant to these sources of variation.
In computational neuroscience, the problem is usually framed within the paradigm of statistical learning theory as follows. Perceptual categories belong to some set Y (the set of faces). Sensory signals belong to some high-dimensional sensory space X (e.g. pixels). Each particular category (a particular face) corresponds to a specific set of signals in X (different views of the face) or to a distribution on X. The goal is to find the correct mapping from X to Y from particular labeled examples (a particular view x of a face, the name y corresponding to that face). This is also the view that underlies the “neural coding” paradigm, where there is a communication channel between Y and X, and X contains “information” about Y.
Framed in this way, this is a really difficult problem in general, and it requires many examples to form categories. However, there is a different way of approaching the problem, which follows from the concept of “invariant structure” developed by James Gibson. It starts with the observation that a sensory system does not receive a static input (an image) but rather a sensory flow. This is obvious in hearing (sounds are carried by acoustic waves, which vary in time), but it is also true of vision: the eyes are constantly moving even when fixating an object (e.g. high-frequency tremors). A perceptual system is looking for things that do not vary within this sensory flow, the “invariant structure”, because this is what defines the essence of the world.
I will develop the example of sound localization. When a source produces a sound, time-varying acoustical waves propagate in the air and possibly reach the ears of a listener. The input to the auditory system is two time-varying signals. Throughout the sensory flow, the identity and spatial location of the source are unchanged. Therefore, any piece of information about these two things must be found in properties of the auditory signals that are invariant through the sensory flow. For example, if we neglect sound diffraction, the fact that one signal is a delayed copy of the other, with a particular delay, is true as long as the sound exists. An invariant property of the acoustic signals is not necessarily about the location of the sound source. It could be about the identity of the source, for example the speaker. However, if that property is no longer invariant when the organism moves, then it cannot be an intrinsic property of the source; it must instead be about the location of the source relative to the organism.
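To make this concrete, here is a small numerical illustration (a sketch of my own, assuming NumPy, broadband noise and a pure delay with no diffraction): the lag at which the cross-correlation between the two ear signals peaks, i.e. the interaural delay, is the same whatever waveform the source happens to emit, as long as the source stays at the same place relative to the head.

```python
import numpy as np

def estimated_itd(left, right, max_lag=50):
    """Lag (in samples) at which the cross-correlation between the two
    ear signals peaks: an estimate of the interaural time difference."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(left[max_lag:-max_lag] * np.roll(right, -lag)[max_lag:-max_lag])
            for lag in lags]
    return lags[int(np.argmax(corr))]

rng = np.random.default_rng(0)
true_delay = 7                      # delay of the right-ear signal, in samples
for _ in range(3):                  # three different waveforms from the same location
    source = rng.normal(size=2000)  # broadband noise
    left = source
    right = np.roll(source, true_delay)  # the right ear receives a delayed copy
    print(estimated_itd(left, right))    # prints 7 every time
```

The waveform changes from one sound to the next, but the delay, which depends only on the relative position of the source, does not.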
In this framework, the computational problem of sound localization has two stages: 1) for any single example, pick up an acoustical invariant that is affected by head movements; 2) associate these acoustical invariants with sound location (either externally labeled, or defined through head movements). The second stage is essentially the computational problem defined in the neural coding/statistical learning framework. But the first stage is entirely different. It is about finding an invariant property within a single example, and this only makes sense if there is a sensory flow, i.e., if time is involved within a single example and not just across examples.
There is a great benefit to this approach: part of the invariance problem is solved from the beginning, before any category is assigned to an example. For example, a property of the binaural structure produced by a broadband sound source at a given position will also hold for another sound source at the same position. In this case, the invariance problem has disappeared entirely.
Within this new paradigm, the learning problem becomes: given a set X of time-varying sensory signals produced by sound sources, how do we find a mapping from X to some other space Y such that the images of sensory signals under this mapping do not vary over time, but do vary across sources? Phrased in this way, this is essentially the goal of slow feature analysis. However, the slow feature analysis algorithm is a machine learning technique whose biological instantiation is not straightforward.
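As a point of comparison, here is a minimal sketch of the linear version of that objective (an illustrative implementation of my own, assuming NumPy, not the published algorithm): among all unit-variance linear projections of the sensory signals, it picks the ones whose temporal derivative has the smallest variance.

```python
import numpy as np

def linear_sfa(x, n_components=1):
    """Minimal linear slow feature analysis.
    x: array of shape (T, D), a time series of sensory signals.
    Returns weights W such that y = (x - mean(x)) @ W varies as slowly as
    possible over time, under zero-mean, unit-variance, decorrelation constraints."""
    x = x - x.mean(axis=0)
    # Whiten the signals (identity covariance)
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    whitener = eigvec / np.sqrt(eigval)
    z = x @ whitener
    # Directions in which the whitened signal varies most slowly
    dval, dvec = np.linalg.eigh(np.cov(np.diff(z, axis=0), rowvar=False))
    slowest = dvec[:, np.argsort(dval)[:n_components]]
    return whitener @ slowest
```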
There have been similar ideas in the field. In a highly cited paper, Peter Földiak proposed a very simple unsupervised Hebbian rule based on related considerations (Földiak, Neural Comp 1991). The study focused on the development of complex cells in the visual system, which respond to edges independently of their location. The complex cell combines inputs from simple cells, which respond to specific edges, and the neuron must learn the right combination. The invariance is learned by presenting moving edges; that is, it is sought within the sensory flow and not across independent examples. The rule is very simple: it is a Hebbian rule in a rate-based model, where the instantaneous postsynaptic activity is replaced by a moving average. The idea is simply that, if the output must be temporally stable, then the presynaptic activity should be paired with the output at any time. Another paper, by Schraudolph and Sejnowski (NIPS 1992), is actually about finding the “invariant structure” (with no mention of Gibson) using an anti-Hebbian rule, but this means that neurons signal the invariant structure by not firing, which is not what neurons in the MSO seem to be doing (although perhaps the idea might be applicable to the LSO).
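For concreteness, here is a sketch of Földiak’s trace idea (my own minimal version, assuming NumPy; the exact normalization in the original paper differs in its details): the Hebbian update pairs the presynaptic rates with a running average of the postsynaptic output instead of its instantaneous value.

```python
import numpy as np

def trace_hebbian_rule(x_seq, lr=0.02, trace_decay=0.2, seed=0):
    """Sketch of a Földiak-style trace rule in a rate model.
    x_seq: array (T, D) of presynaptic rates along a sensory flow.
    The postsynaptic term of the Hebbian update is a moving average (trace)
    of the output, so inputs occurring anywhere during a temporally
    extended stimulus get strengthened together."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(size=x_seq.shape[1])
    y_trace = 0.0
    for x in x_seq:
        y = np.dot(w, x)                                   # instantaneous output rate
        y_trace = (1 - trace_decay) * y_trace + trace_decay * y
        w += lr * y_trace * (x - w)                        # Hebbian term, weights decay toward paired inputs
    return w
```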
There is a more recent paper in which slow feature analysis is formally related to Hebbian rules and to STDP (Sprekeler et al., PLoS CB 2007). Essentially, the argument is that minimizing the temporal variation of the output is equivalent to maximizing the variance of the low-pass filtered output. In other words, they provide a link between slow feature analysis and Földiak’s simple algorithm. There are also constraints, in particular the synaptic weights must be normalized. Intuitively this is obvious: aiming for a slowly varying output is the same thing as aiming to increase the low-frequency power of the signal. The paper focuses on rate models, but it gives a simple rationale for designing learning rules that promote slowness. In fact, it appears that the selection of slow features follows from the combination of three homeostatic principles: maintaining a target mean potential, maintaining a target variance, and minimizing the temporal variation of the potential (by maximizing the variance of the low-pass filtered signal). The potential may be replaced by the calcium trace of spike trains, for example.
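The intuition can be checked numerically. In the small sketch below (assuming NumPy; the first-order filter and its time constant are arbitrary choices of mine), two signals are normalized to zero mean and unit variance; the slower one has both a smaller mean squared temporal derivative and a larger variance after low-pass filtering.

```python
import numpy as np

def variation_and_lowpass_power(y, tau=20.0, dt=1.0):
    """For a signal normalized to zero mean and unit variance, return
    (mean squared temporal derivative, variance of the low-pass filtered signal)."""
    y = (y - y.mean()) / y.std()
    variation = np.mean(np.diff(y) ** 2) / dt ** 2
    y_lp = np.zeros_like(y)                  # first-order low-pass filter
    alpha = dt / tau
    for t in range(1, len(y)):
        y_lp[t] = y_lp[t - 1] + alpha * (y[t] - y_lp[t - 1])
    return variation, np.var(y_lp)

t = np.arange(1000)
print(variation_and_lowpass_power(np.sin(2 * np.pi * t / 500)))  # slow: small variation, large low-pass power
print(variation_and_lowpass_power(np.sin(2 * np.pi * t / 20)))   # fast: large variation, small low-pass power
```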
It is relatively straightforward to see how this might be applied to learning to decode the activity of ITD-sensitive neurons in the MSO into the location of the sound source. For example, a target neuron combines inputs from the MSO into its membrane potential, and the slowness principle is applied either to the membrane potential or to the output spike train. As a result, we expect the membrane potential and the firing rate of this neuron to depend only on sound location. These neurons could be in the inferior colliculus, for example.
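A possible online rule for such a target neuron, combining the three homeostatic objectives above, could look like the following sketch (entirely hypothetical, assuming NumPy; the rule and its constants are mine, not taken from the cited papers): the “potential” is a weighted sum of MSO rates, and the weights are nudged to keep its mean and variance at target values while increasing the power of its low-pass filtered version.

```python
import numpy as np

def init_state(n_inputs):
    """Running estimates used by the update rule below."""
    return {'mean': 0.0, 'var': 1.0, 'v_lp': 0.0, 'r_lp': np.zeros(n_inputs)}

def slowness_update(w, r, state, lr=1e-3, tau=50.0,
                    target_mean=0.0, target_var=1.0, dt=1.0):
    """One online update of a hypothetical decoder neuron whose potential
    v = w @ r combines MSO firing rates r. Three terms: keep the mean of v
    at a target, keep its variance at a target, and increase the variance
    of its low-pass filtered version (i.e., promote slowness)."""
    v = w @ r
    alpha = dt / tau
    state['mean'] += alpha * (v - state['mean'])
    state['var'] += alpha * ((v - state['mean']) ** 2 - state['var'])
    state['v_lp'] += alpha * (v - state['v_lp'])
    state['r_lp'] += alpha * (r - state['r_lp'])
    dw = (-(state['mean'] - target_mean) * r                        # mean homeostasis
          - (state['var'] - target_var) * (v - state['mean']) * r   # variance homeostasis
          + state['v_lp'] * state['r_lp'])                          # ascent on low-pass power
    return w + lr * dw
```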
But can this principle also be applied to the MSO itself? In fact, the output of a single MSO neuron does not depend only on sound location, even for those neurons with a frequency-dependent best delay; it also depends on sound frequency, for example. Could it be that their output is as slow as possible, given these constraints? Perhaps, but another possibility is that only some property of the entire population is slow, and not the activity of individual neurons. For example, in the Jeffress model, only the identity of the maximally active neuron is invariant. But then we face a difficult question: what learning criterion should be applied at the level of an individual neuron so that some combination of the activities of all neurons is slow?
I can imagine two principles. One is a backpropagation principle: the value of the criterion in the target neurons, i.e., slowness, is backpropagated to the MSO neurons and acts as a reward. The second is that the slowness criterion is applied at the cellular level, not to the output of the cell but to a signal representing the combined activity of MSO neurons, for example the activity of neighboring neurons.