Computational neuroscience is the science of how the brain “computes”: how it recognizes faces or identifies words in speech. In computational neuroscience, standard approaches to perception are representational: they describe how neural networks represent in their firing some aspect of the external world. This means that a particular pattern of activity is associated with a particular face. But who makes this association? In the representational approach, it is the external observer. The approach only describes a mapping between patterns of pixels (say) and patterns of neural activity. The key step, of relating the pattern of neural activity to a particular face (which is in the world, not in the brain), is done by the external observer. How then is this about perception?
This is an intrinsic weakness of the concept of a “representation”: a representation is something (a painting, etc.) that has a meaning for some observer; the concept says nothing about how this meaning is formed. Ultimately, it does not say much about perception, because it simply replaces the problem of how patterns of photoreceptor activity lead to perception with the problem of how patterns of neural activity lead to perception.
A simple example is the neural representation of auditory space. There are neurons in the auditory brainstem whose firing is sensitive to the direction of a sound source. One theory proposes that the sound's direction is signaled by the identity of the most active neuron (the one that is “tuned” to that direction). Another proposes that it is the total firing rate of the population, which covaries with direction, that indicates sound direction. A third theory holds that sound direction is computed as a “population vector”: each neuron codes for a direction and is assigned a vector oriented in that direction, with a magnitude equal to its firing rate; the population vector is the sum of all these vectors.
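To make the three readouts concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not a model from the literature: the bell-shaped tuning curves, the sigmoidal neurons used for the summed-rate scheme, the population sizes, and all function names are hypothetical. Note that in each case the last step, turning activity into a number of degrees, is a formula written by us, the external observer.

```python
import math

# Hypothetical population: 36 neurons with preferred directions every 10 degrees.
PREFERRED = [10.0 * k for k in range(36)]

def tuned_rates(source_deg, peak_rate=50.0, kappa=4.0):
    """Assumed bell-shaped (von Mises-like) tuning around each preferred direction."""
    return [
        peak_rate * math.exp(kappa * (math.cos(math.radians(source_deg - p)) - 1.0))
        for p in PREFERRED
    ]

def peak_decode(rates):
    """Scheme 1: report the preferred direction of the most active neuron."""
    best = max(range(len(rates)), key=lambda k: rates[k])
    return PREFERRED[best]

def population_vector_decode(rates):
    """Scheme 3: sum one vector per neuron (preferred direction, length = rate)."""
    x = sum(r * math.cos(math.radians(p)) for p, r in zip(PREFERRED, rates))
    y = sum(r * math.sin(math.radians(p)) for p, r in zip(PREFERRED, rates))
    return math.degrees(math.atan2(y, x)) % 360.0

def summed_rate(source_deg, n_neurons=20, max_rate=50.0, slope_deg=20.0):
    """Scheme 2: assumed identical sigmoidal neurons; total rate grows with azimuth."""
    return n_neurons * max_rate / (1.0 + math.exp(-source_deg / slope_deg))

def summed_rate_decode(total, candidates=range(-90, 91)):
    """Invert the total rate using a calibration curve supplied by the observer."""
    return min(candidates, key=lambda d: abs(summed_rate(d) - total))

if __name__ == "__main__":
    rates = tuned_rates(72.0)
    print("peak decoder:", peak_decode(rates))
    print("population vector:", round(population_vector_decode(rates), 1))
    print("summed rate:", summed_rate_decode(summed_rate(30.0)))
```

In this toy setting the peak decoder can only return one of the preferred directions, while the population vector interpolates between them. The summed-rate readout needs a calibration curve relating total rate to azimuth, and that curve is again provided from outside the model.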
Implicit in these representational theories is the idea that some other part of the brain “decodes” the neural representation into the sound's direction, which ultimately leads to perception and behavior. However, this part is left unspecified in the model: neural models stop at the representational level, and the decoding is done by the external observer (using some formula). But the postulate of a subsequent neural decoder is problematic. Let us assume there is one. It takes the “neural representation” and transforms it into the target quantity, which is sound direction. But the output of a neuron is not a direction: it is a firing pattern or rate that can perhaps be interpreted as a direction. So how is sound direction represented in the output of the neural decoder? It appears that the decoder faces the same conceptual problem, which is that the relationship between output neural activity and the actual quantity in the world (sound direction) has to be interpreted by the external observer. In other words, the output is still a representation. The representational approach leads to an infinite regress.
Since neurons are in the brain and things (sound sources) are in the world, the only way to avoid an external “decoding” stage that relates the two is to include both the world and the brain in the perceptual model. In the example above, this means that, to understand how neurons estimate the direction of a sound source, one would not look for the “neural representation” of sound sources but for neural mechanisms that, embedded in an environment, lead to some appropriate orienting behavior. In other words, neural models of perception are not complete without an interaction with the world (i.e., without action). In this new framework, “neural representations” become a minor issue, one for the external observer looking at neurons.