The machine learning analogy of perception

To cast the problem of neural computation in sensory systems, one often refers to the standard framework of machine learning. A typical example is as follows: there is a dataset, for example a set of images, and the goal is to learn a mapping between these images and categories, for example faces or cars. In the learning phase, labels are externally given to these images, and the machine learning algorithm builds a mapping between images and labels. As an analogy of what sensory systems do, the question is then: how do neurons learn this mapping, e.g. to fire when they are presented with an image of a given category? This question is the starting point of many theories in computational neuroscience. It is essentially an inference problem: to each category corresponds a distribution of images, so what a sensory system must do is learn these distributions and compute the most likely category for a given image. This is why Bayesian approaches are appealing from this point of view: an efficient sensory system should then be an ideal Bayesian observer. This conclusion just follows from the way the problem of perception is cast.
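To make the analogy concrete, here is a minimal sketch of what such an ideal Bayesian observer could look like. Everything in it is an illustrative assumption rather than anyone's actual model: two hypothetical categories, tiny 4-pixel "images", and Gaussian class-conditional distributions with known parameters.

```python
# Minimal sketch of an "ideal Bayesian observer" for categorization,
# assuming (hypothetically) that each category generates images from a
# Gaussian distribution with known mean and isotropic variance.
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical categories ("face", "car"), each a distribution over
# 4-pixel "images" for illustration.
means = {"face": np.array([1.0, 0.0, 1.0, 0.0]),
         "car":  np.array([0.0, 1.0, 0.0, 1.0])}
sigma = 0.5          # common standard deviation of each pixel
prior = {"face": 0.5, "car": 0.5}

def log_posterior(image, category):
    """Log P(category | image) up to a constant, for isotropic Gaussians."""
    residual = image - means[category]
    log_likelihood = -np.sum(residual ** 2) / (2 * sigma ** 2)
    return log_likelihood + np.log(prior[category])

def classify(image):
    """Return the most likely category for a given image."""
    return max(means, key=lambda c: log_posterior(image, c))

# A noisy sample drawn from the "face" distribution is labeled correctly.
sample = means["face"] + sigma * rng.standard_normal(4)
print(classify(sample))   # -> "face" (with high probability)
```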

But is this actually a good analogy? In fact, it differs from the problems sensory systems actually face in at least three important ways:

1) elements of the data set are considered independent;

2) these elements are externally given;

3) the labels are externally defined.

First of all, elements of the dataset are never independent in a real perceptual system. On the contrary, there is a continuous flow of sensory input. Vision is not a slideshow. The visual field changes continuously through time, and more importantly the changes are lawful, because objects are embedded in the physical world. We can perceive these laws, for example the rigidity of movements, and this is something that cannot be found in the “slideshow” view of vision implied by the machine learning analogy. I believe this is the main message of James Gibson. Moreover, lawful relationships hold not only within the sensory inputs but also between actions and sensory inputs: sensorimotor relationships. This information can be picked up from the sensory or sensorimotor flow, not inferred from the distribution of slides in the slideshow. This means that perception is not (or not only) inferential but relational: sensory inputs are analyzed in reference to themselves (their internal structure), and not (only) to memory.

A second point is that in the machine learning analogy, elements of the dataset are considered given, and the algorithm reacts to them. In psychology, this view corresponds to behaviorism, in which the organism is considered only from a stimulus-reaction point of view. But in fact a more ecologically accurate view is that data are in general produced by the actions of the organism, rather than passively received. Gibson criticized the information-processing viewpoint for this reason: the world does not produce messages to be decoded by a receiver; on the contrary, a perceptual system samples its environment. It is really the opposite view: the organism does not react to a stimulus; rather, the environment reacts to the actions of the organism, and it is this reaction that the organism analyzes. In the machine learning field, a newer framework tries to address this aspect under the name of “active learning”: the algorithm chooses a data element and asks for its label, for example so as to maximize the information that can be gained.
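As an illustration of this idea, here is a minimal sketch of active learning by uncertainty sampling, under a deliberately simple, hypothetical scenario: the learner must locate an unknown decision threshold on a line, and at each step it queries the label of the unlabeled point closest to its current decision boundary. The names here (ask_oracle, the 1-D threshold model) are illustrative assumptions; the point is that choosing queries extracts far more information per label than passive sampling would.

```python
# Minimal sketch of active learning by uncertainty sampling: instead of
# passively receiving labeled data, the learner picks the unlabeled example
# it is least certain about and queries an oracle for its label.
import numpy as np

rng = np.random.default_rng(1)

true_threshold = 0.62                     # unknown to the learner
pool = rng.uniform(0, 1, size=200)        # unlabeled 1-D data points

def ask_oracle(x):
    """Stand-in for an external labeler (e.g., a human annotator)."""
    return int(x > true_threshold)

low, high = 0.0, 1.0                      # current bracket on the threshold
for _ in range(10):
    # Uncertainty sampling: the most informative point is the one closest
    # to the current decision boundary (here, the middle of the bracket).
    boundary = (low + high) / 2
    query = pool[np.argmin(np.abs(pool - boundary))]
    if ask_oracle(query):                 # label 1: threshold lies below query
        high = min(high, query)
    else:                                 # label 0: threshold lies above query
        low = max(low, query)

print(f"threshold is in [{low:.3f}, {high:.3f}]")  # narrows roughly exponentially
```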

Finally, in the machine learning analogy, the label is externally defined. But in a closed system, this is not possible. The organism must define the relevant categories by itself. But how can these categories be defined a priori? Often, this problem is dismissed by what I would call “evolutionary magic”: the categories are provided by “evolution” because they are important for the survival and reproduction of the animal. I call it “magic” because the teleological argument provides no explanation at all: it is about as metaphysical as if “evolution” were replaced by “God”, in the sense that it has the same explanatory power. Invoking intergenerational changes of the organism does not solve the problem: whatever mechanism is involved, the pressure for change still has to come from the environment and the way the organism can interact with it, not from an external source.

In fact, this problem was addressed by the development of phenomenology in philosophy, introduced by Husserl about a century ago. Followers of the phenomenological approach include Merleau-Ponty and Sartre. The idea is the following. What “really” exists in the world is a metaphysical question: it actually does not matter to the organism if it makes no difference to its experience. For example, is there such a thing as “absolute space”, an absolute location of things? The question is metaphysical because only relative changes in space can be experienced (the relative location of things) – a point noted by Henri Poincaré. In the phenomenological approach, “essence” is what remains invariant under changes of perspective. I believe this is related to a central point in Gibson’s theory: information is given in the “structural invariants” present in the sensory inputs. These invariants do not need an external reference to be noticed.

For example, consider a sound source, which produces an acoustical wave at each of the two ears. Neglecting sound diffraction, these two acoustical waves are identical apart from a propagation delay (the interaural time difference, or ITD). While the source produces sound, this property is invariant through time – it is a law that is always satisfied. But what makes it a spatial property? It is spatial because the property is broken by the organism's own movements (e.g. head movements). In addition, there is a higher-order property: the relationship between the interaural delay and the position of the head, which always holds as long as the source does not move. This structural invariant is then information about the location of the sound source; in fact, the relationship can be mapped to the physical location of the source. But the “label” here is intrinsically defined: it is precisely the relationship between head position and ITD. Thus labels can be intrinsically defined, as the sensory and sensorimotor structure. This is the postulate of the sensorimotor account of perception, according to which perception is precisely the anticipated effect of the organism’s actions on the sensory inputs.
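A toy numerical version of this example may help, under simplifying assumptions: a far-field source, no diffraction, and the textbook geometric approximation ITD ≈ (d/c)·sin(source angle − head angle), with illustrative constants. It exhibits the first-order invariance through time, its breaking by head movements, and the higher-order invariant that maps onto source location.

```python
import numpy as np

d = 0.20        # distance between the ears (m) -- illustrative value
c = 343.0       # speed of sound (m/s)

def itd(source_angle, head_angle):
    """Interaural time difference (s), far-field geometric approximation."""
    return (d / c) * np.sin(source_angle - head_angle)

source = np.deg2rad(40.0)   # fixed source direction, unknown to the organism

# With the head still, the ITD is the same whenever the source emits sound:
# a first-order sensory law, invariant through time.
print(itd(source, head_angle=0.0))                    # constant, ~0.000375 s

# A head movement breaks this first-order law, which is precisely what
# makes the property spatial rather than a property of the sound itself:
print(itd(source, head_angle=np.deg2rad(30.0)))       # a different ITD

# But the higher-order relationship head angle -> ITD is itself invariant
# as long as the source does not move, and it pinpoints the source: the
# ITD vanishes exactly when the head points at the source.
head_angles = np.deg2rad(np.linspace(-90.0, 90.0, 181))
aligned = head_angles[np.argmin(np.abs(itd(source, head_angles)))]
print(np.rad2deg(aligned))                            # -> ~40 degrees
```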

The fact that these labels can be intrinsically defined is, I believe, what James Gibson means when he states that information is “picked up” and that perception is “direct”. But I would like to go further: there is no doubt that there can be inference in perception, and so in that sense perception cannot be entirely direct. For example, one can visually recognize an object that is partially occluded, and imagine the rest of the object (“amodal perception”). But the point is that what is inferred, i.e., the “label” in machine learning terminology, is not an externally given category, but the sensory or sensorimotor structure, part of which is hidden. The main difference is that there is no need for an external reference. For example, in the sound localization example, a brief sound may be presented from a given direction. Then the sensorimotor structure that defines source direction for the organism is hidden, since the sound is over by the time the organism can turn its head. So this structure is inferred from the ITD. In other words, what is inferred is not an angle, which would make no sense for an animal that has no measurement tool; it is the effect of its own movements on the perceived ITD. So there is inference, but inference is not the basis of perception. It cannot be, for how would you know what should be inferred? For this reason, Gibson rejected inference, by the argument that it would lead to an infinite regress. As I have tried to explain, it is not inference per se that is problematic, but the idea that it might be the basis of perception.
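Continuing the toy model above (same simplifying assumptions and illustrative constants), here is a sketch of what such inference could amount to: from a single measured ITD, the organism recovers not an angle as such, but the anticipated effect of its own head movements on the ITD.

```python
import numpy as np

d, c = 0.20, 343.0            # same illustrative constants as above

measured_itd = 0.000375       # s, observed once, head at angle 0 (hypothetical)

# Inverting the geometric model gives the source direction implicitly...
implied_source = np.arcsin(np.clip(measured_itd * c / d, -1.0, 1.0))

def anticipated_itd(head_angle):
    """The inferred 'label': the predicted effect of a head movement on the ITD."""
    return (d / c) * np.sin(implied_source - head_angle)

# ...but what the organism ends up with is this action -> input mapping,
# e.g. turning the head by implied_source should null out the ITD:
print(anticipated_itd(implied_source))   # ~0.0
print(np.rad2deg(implied_source))        # ~40 degrees, for reference only
```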

This is quite important for our view of neural computation: it means that Bayesian inference is no longer so central to the function of sensory systems. Certainly, inference is useful and perhaps necessary in many cases. But perhaps more important is the discovery of sensory and sensorimotor structure, that is, the elaboration of what is to be inferred. This requires the development of a theory of neural computation that is primarily relational rather than inferential.

In summary, labels can be intrinsically defined by the invariant structure of sensory and sensorimotor signals. I would like to end this post with another important Gibsonian notion: “affordances”. Gibson thought that we perceive “affordances”, which are what the objects of perception allow in terms of interaction. For example, a door affords opening, a wall affords blocking, etc. This is an important notion, because it defines meaning in terms of things that make sense to the organism, rather than in externally defined terms.

To conclude, a theory of neural computation that takes these points into account should differ from standard theories in the following ways: it should be

1) relational (discovering internal structure) rather than inferential (comparing with memory),

2) active (actions are questions, inputs are answers) rather than passive (inputs are questions, actions are answers), and

3) subjective (meaning is defined by the interaction with the environment) rather than objective (objects are externally defined).
