Sensory modalities and the sense of temperature

Perception is traditionally categorized into five senses: hearing, vision, touch, taste and olfaction. These categories seem to reflect the organs of sense, rather than the sensory modalities themselves. For example, the sense of taste is generally (in the neuroscience literature) associated with the taste receptors in the tongue (sweet, salty, etc.). But what we refer to as taste in daily experience actually involves the tongue, including “taste” receptors (sweet, salty) but also “tactile” receptors (the texture of food), the nose (“olfactory” receptors), and in fact probably also the eyes (color) and the ears (chewing sounds). All these are involved in a unitary experience that seems to be perceptually localized in the mouth, or on the tongue – despite the fact that the most informative stimuli, which are chemical, are actually captured in the nose. One may consider that taste is then a “multimodal” experience, but this is not a very good description. If you eat a crisp, you experience the taste of a crisp. But if you isolate any of the components that make this unitary experience, you will not experience taste. For example, imagine a crisp without any chemically active component and no salt: you experience touch with your tongue, and the crisp has “no taste”. If you only experience the smell, then you have an experience of smell, not of taste. This is another sensory modality, despite the fact that the same chemical elements are involved. If only the “taste” receptors on your tongue were stimulated, you would have an experience of “salty”, not of a crisp. So the modality of taste involves a variety of receptors, but that does not make it multimodal, any more than vision is multimodal because it involves many photoreceptors.

“Touch” is also very complex. There is touch as in touching something: you make contact with objects and you feel their texture or shape. There is also being touched. There is also the feeling of weight, which involves gravity, and also movement. There is the feeling of pain, which is related to touch, but not classically included in the 5 senses. Finally there is the feeling of temperature, which I will discuss now from an ecological point of view (in the way of Gibson).

The sense of temperature is not usually listed in the 5 senses. It is often associated with touch, because by touch you can feel that an object is hot or cold. But you can also feel that “it” (=the weather) is cold, in a way that is not well localized. Physically, it is a quantity that is not mechanical, and in this sense it is completely different from touch. But like touch, it is a proximal sense that involves the interface between the body and either the medium (air or water) or substances (object surfaces). The sense of temperature is much more interesting than it initially seems. First, there is of course “how hot it is”, the temperature of the medium. The image that comes to mind is that of the thermometer. But temperature can be experienced all over the body. So spatial gradients of temperature can be sensed. When touching an object, parts of the object can be more or less hot. So spatial gradients of temperature can potentially be sensed through an object, in the same way as the mechanical texture can be sensed. Are there temperature textures?

The most interesting and, as far as I know, underappreciated aspect of the temperature sense is its sensorimotor structure. The body produces heat. Objects react to heat by warming up. Some materials, like metal, conduct heat well; others, like wood, don’t. So both the temporal changes in temperature when an object is touched, and the spatial gradient of temperature that develops, depend on the material and possibly specify it. So it seems that the sense of temperature is rich enough to qualify as a modality in the same way as touch.


What is time and how is it perceived? This is of course a vast philosophical question, which I will only scratch.

1) Time, space and existence

It is customary to describe time as “the fourth dimension”. This point of view comes from the equations of mechanics and is highly misleading, because it seems to imply that time is of the same kind as space. A century ago, Henri Poincaré noted that our concept of space, both perceptually and scientifically, derives from our physical interactions with the world. That is to say, knowing where something is is knowing how to get there. Space is defined by the laws that govern movements in the physical world and the structure of these laws (Euclidean geometry). A law, some property that does not change, can only be defined with respect to something that changes. Therefore, time, defined as the source of change in the world, is a prerequisite to space. Space exists only by its persistence through the passing of time.

2) Time and change

In fact, nothing exists without the passing of time, because the essence is precisely what does not change through the flow of time. If we see someone throwing a ball, that ball is moving. Our visual sensations change, but we see a ball in movement: this is to say that there is something in the visual signals that does not change, which characterizes the ball as such. We do not see an object in the flickering white noise of a TV set.

In the TV series Bewitched, Samantha the housewife twitches her nose and everyone freezes except her. Then she twitches her nose and everyone unfreezes, without noticing that anything happened. For them, time has effectively stopped. This is to say that time is not perceived as such, but only through the changes it causes in our body. It is these changes that are perceived, not time per se (i.e., not time as in the variable in the equations of mechanics).

3) Irreversibility of time

From the fact that time is the perceived cause of changes, it follows that time has a direction, because physical processes are generally irreversible. This is also related to the theorem in information theory that states that information can only be lost, and never gained, when a process is applied to a variable. The current state of a physical system results from previous processes only, which constitutes “the past”.
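The theorem alluded to here is presumably the data processing inequality. A minimal numerical sketch, under assumptions of my own: a variable X uniform on 0..7, a perfect sensor Y = X, and a lossy process Z = Y // 2 (all three are made up for illustration). The information that Z carries about X cannot exceed what Y carried.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits, from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in p_ab.items()
    )


# Hypothetical setup: X uniform on 0..7, Y = X, Z = Y // 2.
xs = list(range(8))
info_before = mutual_information([(x, x) for x in xs])       # I(X;Y)
info_after = mutual_information([(x, x // 2) for x in xs])   # I(X;Z)
```

Here `info_before` is 3 bits and `info_after` is 2 bits: applying the process has lost one bit, and no further processing of Z alone could recover it.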

A physical system in which events occur (our body) can be seen as a dynamical system, or series of processes that make the state of the system evolve. From one state s, the system changes subsequently to state s’. There is a direction to this change: s -> s’. This is the action of time on the system, and it is directed (the “arrow of time”). If the system were isolated, then time would be arbitrary. One could consider any dimension that is isomorphic to time and preserves directionality, and call it “time”, without changing the organization of changes within the system. It would make no difference for the system.

4) The unity of time

This raises the question of the perceptual unity of time: if time is perceived through changes in our body, then why do we feel that time is a single thing, when lots of different things change in our body? How is it that an auditory event and a visual event can appear to occur “at the same time”, given that they impact different receptors? Why isn’t there a different time for each process in our body? What does it mean that an event occurs “before” another one?

Imagine two independent processes that are spatially separated. From the perspective of these processes, it would make no difference if time passed at a different pace. The unity of time must come from an interaction between processes. The interaction between different processes defines a common flow of time.

Going further would probably require a discussion of consciousness and working memory, so I will leave these questions mostly unanswered for now.

5) The grain of time

How fine is our perception of time? When we listen to an auditory click played through headphones with a 500 µs delay between the two ears, we do not hear two clicks. We hear a single click, lateralized towards one side. If we repeatedly play clicks at 50 Hz (every 20 ms), we do not hear a series of clicks. We hear a single continuous sound. When we listen to a pure tone at 50 Hz, the instantaneous pressure of the tone varies all the time but we do not hear this variation. On the contrary, it feels like the tone has constant loudness.

These remarks suggest that our perception of time has a “grain” of a few tens of ms. That is, processes occurring within a few tens of ms are perceived as being caused by the same event, and the temporal occurrence of events within that time window is not perceived as time. Why?

To see how tricky this is, consider again the first example, when we listen to two clicks delayed by 500 µs between the two ears. The temporal order of the clicks can be clearly distinguished: if the click is first played in the left earphone, the sound is perceived as coming from the left, and conversely if the click is first played in the right earphone. In addition, if the delay between the two clicks is changed, then the sound is perceived as coming from a different direction (usually somewhere between the two ears), in a way that is reproducible. Such changes are perceived when the delay is changed by about 20 µs.

So from a computational point of view, time is processed with a grain of 20 µs. But phenomenologically, time appears to have a grain about a thousand times larger. Why such a difference? The perceptual grain of time does not appear to reflect the precision of neural processing, or in other words, the timescale at which states of the brain seem constant.
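To make the two grains concrete, here is a small sketch under assumptions of my own: a hypothetical sampling rate of 100 kHz (one sample is 10 µs), a click that reaches the right channel 500 µs after the left one, and a brute-force cross-correlation to recover the delay. Cross-correlation is used only as a convenient illustration of "processing time at a fine grain"; it is not a claim about how neurons do it.

```python
SAMPLE_RATE_HZ = 100_000  # assumed for illustration: 1 sample = 10 µs


def click(n_samples, onset):
    """A unit impulse at sample index `onset`."""
    return [1.0 if i == onset else 0.0 for i in range(n_samples)]


def best_lag(left, right, max_lag):
    """Lag (in samples) of `right` relative to `left` that maximizes the
    cross-correlation. A positive lag means the left ear leads."""
    def corr(lag):
        return sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
    return max(range(-max_lag, max_lag + 1), key=corr)


def samples_to_us(n):
    """Convert a number of samples to microseconds."""
    return n * 1_000_000 / SAMPLE_RATE_HZ


delay_samples = 50  # 500 µs at 100 kHz
left = click(400, onset=100)
right = click(400, onset=100 + delay_samples)
itd_us = samples_to_us(best_lag(left, right, max_lag=100))

# Ratio between the two "grains" quoted in the text.
perceptual_grain_us = 20_000   # about 20 ms
computational_grain_us = 20    # about 20 µs
grain_ratio = perceptual_grain_us / computational_grain_us
```

The sketch recovers the 500 µs delay, and the ratio of the two grains is 1000: the delay is easily measurable in the signals, yet it lies far below the grain at which time is experienced as time.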

6) Duration

This post probably raised more questions than I could answer. I will end it with a discussion of the concept of duration. Spinoza described it as follows: “Duration is an attribute under which we conceive the existence of created things insofar as they persevere in their actuality”. This is essentially the point I have developed at the beginning of this post. In contrast with, say, color and pitch, duration is not a quality of things. Duration is about existence (the fact that a thing exists), while color or pitch is about essence (what this thing is). Properties of objects are defined by their persistence through time, but duration does not persist through time. Rather, duration quantifies for how long some properties exist. For example, it can be said that a musical note has a timbre (the instrument), a pitch and a duration. These are not three independent qualities: duration is about the pitch and timbre (for how much time they can be said to exist), but timbre is not about duration.

In summary: time is about existence, space is about essence.

Is perception about inference?

One philosophical theory about perception claims that perceiving is inferring the external world from the sensory signals. The argumentation goes as follows. Consider the retina: there is a spatially inhomogeneous set of photoreceptors; the image projected onto the retina is inverted, but you don’t see the world upside down; there are blood vessels that you normally don’t see; there is a blind spot where the optic nerve starts that you normally don’t notice; color properties of photoreceptors and their spatial sampling are inhomogeneous and yet color doesn’t change when you move the eyes. Perceptually, the visual field seems homogeneous and independent of the position of the eyes, apart from a change of perspective. So certainly what you perceive is not the raw sensations coming from your photoreceptors. These raw sensations are indirectly produced by things in the world, which have some constancy (compared to eye movements, for example). The visual signals in your retina are not constant, but somehow your perception is constant. Therefore, so the argument goes, your mind must be reconstructing the external world from the sensory signals, and what you perceive is this reconstruction.

Secondly, visual signals are ambiguous. A classical example is the Necker cube: a wire frame cube drawn in isometric perspective on a piece of paper, which can be perceived in two different ways. More generally, the three-dimensional world is projected on your retina as a two-dimensional image, and yet we see in three dimensions: the full 3D shape of objects must then be inferred. Another example is that in the dark, visual signals are noisy and yet you can see the world, although less clearly, and you don’t see noise.

I would then like to consider the following question: why, when I am looking at an apple, do I not see the back of the apple?

The answer is so obvious that the question sounds silly. Obviously, there is no light going through the object to our eyes, so how could we see anything behind it? Well precisely, the inference view claims that we perceive things that are not present in the sensory signals but inferred from them. In the case of the Necker cube, there is nothing in the image itself that informs us of the true three-dimensional shape of the cube; there are just two consistent possibilities. But in the same way, when we see an apple, there are a number of plausible possibilities about how the back of the apple should be, and yet we only see the front of the apple. Certainly we see an apple, and we can guess what the back of the apple looks like, but we do not perceive it. A counter-argument would be that inference about the world is partial: of course we cannot infer what is visually occluded by an object. But this is circular reasoning: perception is the result of inference, but we only infer what can be perceived.

One line of criticism of the objectivist/inferential view starts from Kant’s remark that anything we can ever experience comes from our senses, and therefore one cannot experience the objective world as such, even through inference, since we have never had access to the things to be inferred. This leads to the ecological theory of perception of James Gibson, who considered that the (phenomenal) world is directly perceived as the invariant structure in the sensory signals (the laws that the signals follow, potentially including self-generated movements). This view is appealing in many respects because it solves the problem raised by Kant (who concluded that there must be an innate notion of space). But it does not account for the examples that motivate the inferential view, such as the Necker cube (or in fact the perception of drawings in general). A related view, O’Regan’s sensorimotor theory of perception, also considers that objects of perception must be defined in terms of relationships between signals (including motor signals) but does not reject the possibility of inference. Simply, what is to be inferred is not an external description of the world but the effect of actions on sensory signals.

So some of the problems of the objectivist inferential view can be solved by redefining what is to be inferred. However, it still remains that in an inferential process, the result of inference is in a sense always greater than its premises: there is more than is directly implied by the current sensory signals. For example, if I infer that there is an apple, I can have some expectations about what the apple should look like if I turn it, and I may be wrong. But this part where I may be wrong, the predictions that I haven’t checked, I actually don’t see it – I can imagine it, perhaps.

Therefore, perception cannot be the result of inference. I suggest that perception involves two processes: 1) an inferential process, which consists in making a hypothesis about sensory signals and their relationship with action; 2) a testing process, in which the hypothesis is tested against sensory signals, possibly involving an action (e.g. an eye movement). These two processes can be seen as coupled, since new sensory signals are produced by the second process. I suggest that it is the second process (which is conditioned by the first one) that gives rise to conscious perception. In other words, to perceive is to check a hypothesis about the senses (possibly involving action). According to this proposition, subliminal perception is possible. That is, a hypothesis may be formed with insufficient time to test it. In this case, the stimulus is not perceived. But it may still influence the way subsequent stimuli are perceived, by influencing future hypotheses or tests.
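The two coupled processes can be sketched as a toy program. Everything in it is hypothetical: the two objects, the three "viewpoints" standing in for actions, and the names `infer` and `perceive`. The point is only that the hypothesis may predict more than what has been checked, and that only the checked part counts as perceived.

```python
# Toy objects: each maps an action (a viewpoint) to the signal it yields.
OBJECTS = {
    "apple": {"front": "red", "side": "red", "back": "red"},
    "half_apple": {"front": "red", "side": "white", "back": "white"},
}


def infer(seen):
    """Process 1: hypotheses compatible with the signals seen so far."""
    return [name for name, views in OBJECTS.items()
            if all(views[action] == signal for action, signal in seen.items())]


def perceive(world, actions, budget):
    """Process 2: test the current hypothesis with at most `budget` actions.

    Returns (hypothesis, checked); `checked` holds only the signals that
    were actually obtained by acting, not the untested predictions.
    Assumes `world` is one of OBJECTS, so a compatible hypothesis exists."""
    seen = {"front": world["front"]}      # initial glimpse
    hypothesis = infer(seen)[0]
    checked = {}
    for action in actions[:budget]:
        seen[action] = world[action]      # acting produces new signals
        if OBJECTS[hypothesis][action] != seen[action]:
            hypothesis = infer(seen)[0]   # prediction failed: revise
        checked[action] = seen[action]
    return hypothesis, checked


world = OBJECTS["half_apple"]
no_time_to_test = perceive(world, ["side", "back"], budget=0)
fully_tested = perceive(world, ["side", "back"], budget=2)
```

With `budget=0` a hypothesis ("apple") is formed but nothing has been checked, which corresponds to the subliminal case; with `budget=2` the hypothesis is revised to "half_apple" and the side and back have been tested.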

Update. In The world as an outside memory, Kevin O'Regan expressed a similar view: "It is the act of looking that makes things visible".

The villainous monster recursion

In O’Regan’s paper about the sensorimotor theory of perception (O’Regan and Noë, BBS 2001), he uses the analogy of the “villainous monster”. I quote it in full:

“Imagine a team of engineers operating a remote-controlled underwater vessel exploring the remains of the Titanic, and imagine a villainous aquatic monster that has interfered with the control cable by mixing up the connections to and from the underwater cameras, sonar equipment, robot arms, actuators, and sensors. What appears on the many screens, lights, and dials, no longer makes any sense, and the actuators no longer have their usual functions. What can the engineers do to save the situation? By observing the structure of the changes on the control panel that occur when they press various buttons and levers, the engineers should be able to deduce which buttons control which kind of motion of the vehicle, and which lights correspond to information deriving from the sensors mounted outside the vessel, which indicators correspond to sensors on the vessel’s tentacles, and so on.”

It is meant here that all knowledge must come from the sensors and the effect of actions on them, because there is just no other source of knowledge. This point of view changes the computational problem of perception from inferring objective things about the physical world from the senses to finding relations between actions and sensor data.
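A minimal sketch of the engineers' situation, under strong simplifying assumptions that are mine and not O’Regan and Noë’s: the monster's scrambling is a fixed permutation, each button drives exactly one indicator, and there is no noise. The names `make_scrambled_vessel` and `recover_wiring` are hypothetical.

```python
import random


def make_scrambled_vessel(n_channels, seed=0):
    """Hidden wiring set by the monster: button i drives indicator wiring[i]."""
    rng = random.Random(seed)
    wiring = list(range(n_channels))
    rng.shuffle(wiring)

    def press(button):
        """Return the state of the control panel after pressing one button."""
        panel = [0] * n_channels
        panel[wiring[button]] = 1
        return panel

    return press, wiring


def recover_wiring(press, n_channels):
    """Deduce the wiring purely from relations between actions and
    the changes they cause on the panel."""
    baseline = [0] * n_channels
    learned = []
    for button in range(n_channels):
        panel = press(button)
        changed = [i for i, (before, after) in enumerate(zip(baseline, panel))
                   if before != after]
        learned.append(changed[0])
    return learned


press, hidden_wiring = make_scrambled_vessel(6)
learned_wiring = recover_wiring(press, 6)
```

`recover_wiring` never looks at `hidden_wiring`; it only presses buttons and watches the panel, which is all the engineers can do.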

This remark is not specific to the brain. It would apply whether the perceptual system is made of neurons or not – for example it could be an engineered piece of software for a robot. So what in fact is specific about the brain? The question is perhaps too broad, but I can at least name one specificity. The brain is made of neurons, and each neuron is a separate entity (with a membrane) that interacts with other neurons, which are relatively elementary (compared to the entire organism) and essentially identical (in broad outline). Each entity has sensors (dendrites) and can act by sending spikes through its axon (and also in other ways, but on a slower timescale). So in fact we could think of the villainous monster concept at different levels. The higher level is the organism, with sensors (photoreceptors) and actuators (muscle contraction). At a lower level, we could consider a brain structure, for example the hippocampus, and see it as a system with sensors (spiking inputs to the hippocampus) and actuators (spiking outputs). What can be said about the relationship between actions and sensor inputs? In fact, we could arbitrarily define a system by doing a graph cut in the connectivity graph of the brain. At the final level of analysis, we might analyze the neuron as a perceptual system, with a set of sensors (dendrites) and one possible action (to produce a spike). At this level, it may also be possible to define the same neuron as a different perceptual system by redefining the set of sensors and actions. For example, sensors could be a number of state variables, such as membrane potential at different points along the dendritic tree, calcium concentration, etc; actions could be changes in channel densities, in synaptic weights, etc. This is not completely crazy because in a way, these sensed properties and the effect of cellular actions are all that the cell can ever know about the “outside world”.
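The graph-cut idea can be written down in a few lines. The connectivity below is a made-up toy graph, not anatomy, and the function name `cut` is hypothetical: given a set of nodes declared to be "inside", the edges entering the set play the role of sensors and the edges leaving it play the role of actuators.

```python
# Hypothetical toy connectivity: directed edges (source, target).
EDGES = [
    ("photoreceptor", "n1"),
    ("n1", "n2"),
    ("n2", "n3"),
    ("n3", "n1"),
    ("n3", "muscle"),
]


def cut(edges, inside):
    """Split edges by a cut around the node set `inside`.

    Edges entering the set are the subsystem's sensors; edges leaving it
    are its actuators. Edges fully inside or fully outside are ignored."""
    sensors = [(u, v) for u, v in edges if u not in inside and v in inside]
    actuators = [(u, v) for u, v in edges if u in inside and v not in inside]
    return sensors, actuators


organism = cut(EDGES, {"n1", "n2", "n3"})
single_neuron = cut(EDGES, {"n2"})
```

The same graph yields a different perceptual system for each choice of cut: for the whole set, the photoreceptor is the sensor and the muscle the actuator; for the single node `n2`, its input from `n1` is the sensor and its output to `n3` is the action.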

One might call this conceptual framework the “villainous monster recursion”. I am not sure where it could lead, but it seems intriguing enough to think about it!

On imitation

How is it possible to learn by imitation? For example, consider a child learning to speak. She reproduces a word produced by an adult, for example “Mom”. How is this possible? At first sight, it seems like there is an obvious answer: the child tries to activate her muscles so that the sound produced is similar. But that’s the thing: the sound is not similar at all. A child is much smaller than an adult, which implies that: 1) the pitch is higher, 2) the whole spectrum of the sound is shifted towards higher frequencies (the “acoustic scale” is smaller). So if one were to compare the two acoustic waves, one would find little similarity (both in the time domain and in the spectral domain). Therefore, learning by imitation must be based on a notion of similarity that resides at a rather conceptual level – not at all the direct comparison of sensory signals. Note that the sensorimotor account of perception (in this case the motor theory of speech) does not really help here, because it still requires explaining why the two vastly different acoustic waves should relate to similar motor programs. To be more precise: the two acoustic waves actually do relate to similar motor programs, but the adult’s motor program cannot be observed by the child: the child has to relate the acoustic result of the adult’s motor program with her own motor program, when the latter does not produce the same acoustic result. Could there be something in the acoustic wave that directly suggests the motor program?

This was the easy problem of imitation. But here’s a harder one: how can you imitate a smile? In this case, you can only see the smile you want to imitate on the teacher’s face, but you cannot see your own smile. In addition, it seems unlikely that the ability is based on prior practicing in front of a mirror. Thus, somehow, there is something in the visual signals that suggests the motor program. These are two completely different physical signals, therefore the resemblance must lie somewhere in the higher-order structure of the signals. This means that the perceptual system is able to extract an amodal notion of structure, and compare two structures independently of their sensory origin.

Memory as an inside world

A number of thinkers oppose the notion of pictorial representations, or even of any sort of representation, in the brain. In robotics, Rodney Brooks is often quoted for this famous statement: “the world is its own best model”. In a previous post, I commented on the fact that slime molds can solve complex spatial navigation problems without an internal representation of space – in fact, without a brain! This relies on using the world as a sort of outside memory: the slime mold leaves some extracellular trace on the floor, where it has previously been, so that it avoids being stuck in any one place.

This idea is also central in the sensorimotor theory of perception, and in fact Kevin O’Regan argued about “the world as an outside memory” in an early paper. This is related to a number of psychological findings about change blindness, but I will rephrase the argument from a more computational perspective. Imagine you are making a robot with a moveable eye that has a fovea. At any given moment, you only have a limited view of the world. You could obtain a detailed representation of the visual scene by scanning the scene with your eye and storing the images in memory. This memory would then be a highly detailed pictorial representation of the world. When you want to know some information about an object in any part of the visual scene, you can then look at the right place in the memory. But then why look at the memory if you can directly look at the scene? If moving the eye is very fast, which is the case for humans, then from an operational point of view, there is no difference between the two. It is then simply useless and inefficient to store the information in memory if the information is immediately available in the world. What might need to be stored, however, is some information about how to find the relevant information (what eye movements to produce), but this is not a pictorial representation of the visual scene.
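The robot argument can be sketched as two strategies for answering questions about a scene. This is a toy under my own assumptions (a scene is just a mapping from gaze direction to what is there, and all class names are hypothetical): one strategy stores a copy of what was seen, the other stores only where to look.

```python
class Scene:
    """The world: a mapping from gaze direction to what is seen there."""

    def __init__(self, contents):
        self._contents = dict(contents)

    def look(self, direction):
        return self._contents[direction]

    def change(self, direction, new_value):
        self._contents[direction] = new_value


class PictorialMemory:
    """Scan the scene once and answer later from the stored copy."""

    def __init__(self, scene, where):
        self._snapshot = {name: scene.look(d) for name, d in where.items()}

    def query(self, name):
        return self._snapshot[name]


class OutsideMemory:
    """Store only where to look; the world itself holds the detail."""

    def __init__(self, scene, where):
        self._scene = scene
        self._where = dict(where)   # place name -> gaze direction

    def query(self, name):
        return self._scene.look(self._where[name])


scene = Scene({"left": "cup", "right": "book"})
where = {"table": "left", "shelf": "right"}
pictorial = PictorialMemory(scene, where)
outside = OutsideMemory(scene, where)

scene.change("left", "phone")
stored_answer = pictorial.query("table")   # answer from the stored copy
live_answer = outside.query("table")       # answer from one more eye movement
```

When looking is cheap, the second strategy returns the same answers as the first as long as nothing changes, and it stays correct when the scene does change, whereas the stored copy goes stale.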

Despite what the title of this post might suggest, I am not going to contradict this view. But we also know that visual memory exists: for example, we can remember a face, or we can remember what is behind us if we have seen it before (although it is not highly detailed). Now I am throwing out an idea here, or perhaps an analogy, which might initially sound a bit crazy: how about if memory were like an inside world? In other words, how about interpreting the metaphor “looking at something in memory” in a literal way?

The idea of the world as an external memory implicitly relies on a boundary between mind and world that is put at the interface of our sensors (say, the retina). Certainly this is a conceptual boundary. Our brain interacts with the environment through an interface (sensors/muscles), but we could equally say that any part of the brain interacts with its environment, made of everything outside it, including other parts of the brain. So let us imagine for a while that we put the mind-world boundary in such a way that the hippocampus, which is involved in memory (working memory and spatial memory), is outside it. Then the mind can request information about the world from the sensory system (moving the eyes, observing the visual inputs), or in the same way from the hippocampus (making some form of action on the hippocampus, observing the hippocampal inputs).

Perhaps this might seem somehow like a homunculus thinking exercise, but I think there is something interesting in this perspective. In particular, it puts memory and perception at the same level of description, in terms of sensorimotor interaction. This is interesting because from a phenomenological point of view, there is a similarity between memory and perception: the memory of an image feels (a bit) like an image, or one can say that one “hears a melody in one’s head”. At the same time, memory has distinct phenomenal properties: for example, one cannot interact with memory in the same way as with the physical world; it is also less detailed; and finally there are no “events” in memory (something unpredictable happening).

In other words, this view may suggest a sensorimotor account of memory (where “sensorimotor” must be understood in a very broad sense).

Robots and jobs

Are robots going to free us from the slavery of work, or are they going to steal people’s jobs?

As a computational neuroscientist, this is a question I sometimes think about. For a long time, I have followed a self-reassuring reasoning, which seems to make sense from a logical point of view, that having robots do the work for us means that either we get more products for the same amount of work or each person works less for the same quantity of products. So it has to be a good thing: ideally, robots would do the work we don’t want to do, and we would just do what we are interested in doing – maybe travel, write books, see our friends or play music.

This is a fine theoretical argument, but unfortunately it is also one that ignores the economy we live in. Maybe we could (or should) think of an economy that would make this work, but how about our current capitalist economy? Very concretely, if robots arrive on the market that are able to do the job that people previously did, but more cheaply, then these people would simply lose their jobs. If work can be outsourced to poorer countries, then in the same way it can also be outsourced to robots.

One counter-argument, of course, is that in a free market economy, people would temporarily lose their job but then they would be reassigned to other jobs and the whole economy would be more productive. This is a classical free-market fundamentalist argument. But there are at least two problems with this argument. The first is that it commits the mistake of thinking of the economy as a quasi-static system: it changes, but it is always in equilibrium. It is implicitly assumed that it is easy to change jobs, that it has a negligible cost, that large-scale changes in the labor market have no significant impact on the rest of the economy (think for example of the effect on the financial system of thousands of sacked people being unable to pay their mortgage). Now if we think of a continuous progress, in which innovations regularly arrive and continuously change the structure of the labor market, then it is clear that the economy can never be in the ideal equilibrium state in which jobs are perfectly allocated. At any given time there would be a large fraction of the population that would be unemployed. In addition, anyone would then face a high risk of going through such a crisis in the course of their work life. This would then have major consequences for the financial system, as it would make loans and insurance riskier, and therefore more expensive. These additional costs to society (cost of unemployment and reconversion, financial risk, etc) are what economists call “externalities”: these are costs that have to be paid by society, but they are not supported by the ones that take the decisions that are responsible for these costs. For the company that replaces a human by a robot, the decision is based on the salary of the human vs. the cost of the robot, but it does not include the cost of the negative externalities. For this reason, it is possible that companies take decisions that seem beneficial for each one of them, and yet that have a negative impact on the global economy (not even considering the human factor).
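The externality argument is simple arithmetic, and a few lines make the gap between the two points of view explicit. All figures below are invented for illustration, in arbitrary currency units per year; the function names are hypothetical.

```python
def private_gain(salary, robot_cost):
    """What the company sees: the yearly saving from replacing the worker."""
    return salary - robot_cost


def social_gain(salary, robot_cost, external_cost):
    """The same decision, including the cost borne by the rest of society."""
    return private_gain(salary, robot_cost) - external_cost


# Hypothetical yearly figures (arbitrary currency units).
salary, robot_cost, external_cost = 30_000, 25_000, 8_000

company_replaces = private_gain(salary, robot_cost) > 0
society_benefits = social_gain(salary, robot_cost, external_cost) > 0
```

With these made-up numbers the company gains 5,000 by replacing the worker, so it does, while the overall balance is a loss of 3,000. The decision is rational for the company and harmful globally, because the 8,000 never appears in the company's accounts.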

A second problem is that the argument neglects a critical aspect of capitalist systems, which is the division between capital and work. When a human is replaced by a robot, what was previously the product of work is now the product of capital (investment in buying the robot) – see this blog post by Paul Krugman. Very concretely, this means that a larger part of the wealth goes to the owners rather than to the workers. As a thought experiment, we could imagine that the workforce is completely replaced by robots, and that the owner would only buy the robots and then get the money from customers without doing anything. Wealth would then be distributed according to how many robots one owns. This might seem far-fetched, but if you think about it, this is pretty much how real estate works.

So concretely, introducing robots in a capitalist economy means increasing productivity, but it also means that owners get an increasingly bigger part of the pie. In such an economy, the ideal robotic world is a dystopia in which wealth is distributed exclusively in proportion to what people own.

This thought is very troubling for scientists like me, who are more or less trying to make this ideal robotic world happen, with the utopia of the no-forced-work society in mind. How could one avoid the dystopian nightmare? I do not think that it is possible to just stop working on robots. I could personally decide not to work on robots, and maybe I would feel morally right and good about myself, having no responsibility for what happens next, but that would just be burying my head in the sand. The only way it will not happen is if all scientists in the world, in all countries, stopped working on robots or any sort of automation that would increase productivity (internet?). We don’t even seem to be able to stabilize our production of carbon dioxide even when we agree on the consequences, so I don’t think this is very realistic.

So if we can’t stop scientific progress from happening, then the only other way is to adapt our economy to it. Imagine a society with robots doing all the work, entirely. Since there is no work at all in such a society, then in an unregulated free market economy wealth can only be distributed according to the amount of capital people have. There is simply no other way it could be distributed. Such an economy is bound to lead to the robotic nightmare.

Therefore, society has to take global measures to regulate the economy and make the distribution of wealth fairer. I don’t have any magical answer, but we could throw out a few ideas. For example, one could get rid of inheritance (certainly not easy in practice), and transmit capital from the deceased to the newborn in equal proportion. Certainly some people would get richer than others by the end of their lives, but the inequality would be limited. As a transition policy, one could allow the replacement of people by robots, but the fired worker would own part of the robot. Alternatively, robots could only be owned by people and not by companies. A robot could then replace a worker only when a worker buys the robot and rents it to the company. Another alternative is that robot-making companies belong to the State and can only rent the robots to companies. The wealth would then be shared among citizens.

Certainly all these ideas come with difficulties, and none of them is ideal, but one has to keep in mind that not implementing any regulation of this type can only lead to the robotic dystopia.

The machine learning analogy of perception

To cast the problem of neural computation in sensory systems, one often refers to the standard framework in machine learning. A typical example is as follows: there is a dataset, which could be for example a set of images, and the goal is to learn a mapping between these images and categories, for example faces or cars. In the learning phase, labels are externally given to these images, and the machine learning algorithm builds a mapping between images and labels. As an analogy of what sensory systems do, the question is then: how do neurons learn this mapping, e.g. to fire when they are presented with an image of a given category? This question is the starting point of many theories in computational neuroscience. It is essentially an inference problem: to each category corresponds a distribution of images, and so what sensory systems must do is learn this distribution and compute what the most likely category is for a given presented image. This is why Bayesian approaches are appealing from this point of view, because an efficient sensory system should then be an ideal Bayesian observer. It just follows from the way the problem of perception is cast.
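To make the "ideal Bayesian observer" of this framework concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration and not part of the argument above: each image is reduced to a single number, there are only two categories, and each category produces Gaussian-distributed values with equal priors. The names and numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Likelihood of observing x under a Gaussian category model."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical categories: name -> (prior, mean, std) of a 1-D "image feature".
CATEGORIES = {"face": (0.5, 0.0, 1.0), "car": (0.5, 3.0, 1.0)}

def posterior(x, categories=CATEGORIES):
    """Bayes' rule: P(category | x) is proportional to P(x | category) * P(category)."""
    joint = {name: prior * gaussian_pdf(x, mean, std)
             for name, (prior, mean, std) in categories.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}

def most_likely_category(x, categories=CATEGORIES):
    """The observer's answer: the category with the highest posterior."""
    post = posterior(x, categories)
    return max(post, key=post.get)
```

Note that both the categories and their distributions are handed to the observer from outside, which is exactly the feature of the analogy questioned below.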

But is this actually a good analogy? In fact, it differs from the problems sensory systems actually face in at least three important ways:

1) elements of the data set are considered independent;

2) these elements are externally given;

3) the labels are externally defined.

First of all, elements of the data set are never independent in a real perceptual system. On the contrary, there is a continuous flow of sensory input. Vision is not a slideshow. The visual field changes through time in a continuous way, and more importantly the changes are lawful because objects are embedded in the physical world. We can perceive these laws, for example the rigidity of movements, and this is something that cannot be found in the “slideshow” view of vision that is implied by the machine learning analogy. I believe this is the main message of James Gibson. Moreover, there are lawful relationships in the sensory inputs, but there are also sensorimotor relationships. This is information that can be picked up from the sensory or sensorimotor flow, not by inference from the distribution of slides in the slideshow. This means that perception is not (or not only) inferential but relational: sensory inputs are analyzed in reference to themselves (their internal structure), and not (only) to memory.

A second point is that in the machine learning analogy, elements of the dataset are considered given, and the algorithm reacts to them. In psychology, this view corresponds to behaviorism, in which the organism is only considered from a stimulus-reaction point of view. But in fact a more ecologically accurate view is that data are in general produced by the actions of the organism, rather than passively received. Gibson criticized the information processing viewpoint for this reason: the world does not produce messages to be decoded by a receiver. On the contrary, a perceptual system samples its environment. It is really the opposite view: the organism does not react to a stimulus, but rather the environment reacts to the actions of the organism, and it is this reaction that is analyzed by the organism. In the machine learning field, there are new frameworks that try to address this aspect, named “active learning”: the algorithm chooses a data element and asks for its label, for example to maximize the information that can be gained.
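As an illustration of the active learning idea, here is a minimal Python sketch under strong assumptions of my own: the data are numbers in a pool, the hidden category boundary is a single threshold, and the learner always asks for the label of the pool element closest to the middle of the region where the threshold could still be. The function names and the threshold value are hypothetical, and real active learning algorithms use more general selection criteria.

```python
def true_label(x, threshold=0.62):
    """Hypothetical environment: answers label queries for a hidden threshold."""
    return x >= threshold

def active_learn(pool, ask_label, n_queries):
    """Choose which pool elements to label so as to narrow down the threshold.

    Assumes the smallest pool element is labeled False and the largest True.
    Returns the final bracket (lo, hi) and the list of queried elements.
    """
    lo, hi = min(pool), max(pool)
    queried = []
    for _ in range(n_queries):
        candidates = [x for x in pool if lo < x < hi]
        if not candidates:
            break  # nothing left in the pool that could be informative
        middle = (lo + hi) / 2
        # The most informative query roughly halves the remaining uncertainty.
        x = min(candidates, key=lambda c: abs(c - middle))
        queried.append(x)
        if ask_label(x):
            hi = x
        else:
            lo = x
    return lo, hi, queried

pool = [i / 1000 for i in range(1001)]
lo, hi, queried = active_learn(pool, true_label, n_queries=20)
```

With a pool of about a thousand elements, the learner brackets the threshold after roughly a dozen label requests, whereas a passive learner would need labels for most of the pool to do as well.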

Finally, in the machine learning analogy, the label is externally defined. But in a closed system, this is not possible. The organism must define the relevant categories by itself. But how can these categories be a priori defined? Often, this problem is discarded by what I would call “evolutionary magic”: these categories are provided by “evolution” because they are important to the survival and reproduction of the animal. I call it “magic” because the teleological argument does not provide any explanation at all: it is about as metaphysical as if “evolution” were replaced by “God”, in the sense that it has the same explanatory power. Bringing in intergenerational changes of the organism does not solve the problem: whatever mechanism is involved, pressure for change still has to come from the environment and the way the organism can interact with it, not from an external source.

In fact, this problem was addressed by the development of phenomenology in philosophy, introduced by Husserl about a century ago. Followers of the phenomenological approach include Merleau-Ponty and Sartre. The idea is the following. What “really” exists in the world is a metaphysical question: it actually does not matter for the organism if it makes no difference to its experience. For example, is there such a thing as “absolute space”, the existence of an absolute location of things? The question is metaphysical because only relative changes in space can be experienced (the relative location of things) – this point was noted by Henri Poincaré. In the phenomenological approach, “essence” is what remains invariant under changes of perspective. I believe this is related to a central point in Gibson’s theory: information is given in the “structural invariants” present in the sensory inputs. These invariants do not need an external reference to be noticed.

For example, consider a sound source that produces two acoustical waves at the two ears. Neglecting sound diffraction, these acoustical waves are identical apart from a propagation delay (the interaural time difference or ITD). When a sound is produced by the source, this property is invariant through time – it is a law that is always satisfied. But what makes it a spatial property? It is spatial because the property is broken when movements are produced by the organism (e.g. head movements). In addition, there is a higher-order property, which is the relationship between the interaural delay and the movements of the head, which is always true, as long as the source does not move. This structural invariant is then information about the location of the sound source; in fact, the relationship can be mapped to the physical location of the source. But the “label” here is intrinsically defined: it is precisely the relationship between head position and ITD. Thus labels can be intrinsically defined, as the sensory and sensorimotor structure. This is the postulate of the sensorimotor account of perception, according to which perception is precisely the anticipated effect of the organism’s action on the sensory inputs.
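The relationship between head position and ITD can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine: a distant source, no diffraction (as above), a simple sine law for the delay, a hypothetical ear spacing of 18 cm, and a source in the frontal hemifield so that the delay determines the relative direction without front-back ambiguity.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air
EAR_DISTANCE = 0.18     # m, hypothetical distance between the two ears

def itd(source_azimuth, head_azimuth):
    """Interaural time difference (s) for a distant source; angles in radians."""
    return EAR_DISTANCE / SPEED_OF_SOUND * math.sin(source_azimuth - head_azimuth)

def anticipated_itd(current_itd, head_turn):
    """Sensorimotor law: the ITD expected after turning the head by head_turn.

    Only the current ITD and the movement are used; no external angle is
    given to the function. Assumes the source is in the frontal hemifield.
    """
    ratio = max(-1.0, min(1.0, current_itd * SPEED_OF_SOUND / EAR_DISTANCE))
    relative_azimuth = math.asin(ratio)
    return EAR_DISTANCE / SPEED_OF_SOUND * math.sin(relative_azimuth - head_turn)
```

In this sketch the "label" of the source is the function anticipated_itd itself, seen as a relation between movements and delays. The intermediate angle is just a convenient parametrization of that relation, and turning the head to face the source is the movement that brings the ITD to zero.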

The fact that these labels can be intrinsically defined is, I believe, what James Gibson means when he states that information is “picked-up” and that perception is “direct”. But I would like to go further: there is no doubt that there can be inference in perception, and so in that sense perception cannot be entirely direct. For example, one can visually recognize an object that is partially occluded, and imagine the rest of the object (“amodal perception”). But the point is that what is inferred, i.e., the “label” in the machine learning terminology, is not an externally given category, but the sensory or sensorimotor structure, part of which is hidden. The main difference is that there is no need for an external reference. For example, in the sound localization example, a brief sound may be presented at a given direction. Then the sensorimotor structure that defines source direction for the organism is hidden, since the sound is already over by the time the organism can turn its head. So this structure is inferred from the ITD. In other words, what is inferred is not an angle, which would make no sense for an animal that has no measurement tool; it is the effect of its own movements on the perceived ITD. So there is inference, but inference is not the basis of perception. It cannot be, for how would you know what should be inferred? For this reason, Gibson rejects inference by the argument that it would lead to infinite regress. As I have tried to explain, it is not inference per se that is problematic, but the idea that it might be the basis of perception.

This is quite important for our view of neural computation: this means that Bayesian inference is not so central anymore in the function of sensory systems. Certainly, inference is useful and perhaps necessary in many cases. But perhaps more important is the discovery of sensory and sensorimotor structure, that is, the elaboration of what is to be inferred. This requires the development of a theory of neural computation that is primarily relational rather than inferential.

In summary, labels can be intrinsically defined by the invariant structure of sensory and sensorimotor signals. I would like to end this post with another important Gibsonian notion: “affordances”. Gibson thought that we perceive “affordances”, which are what the objects of perception allow in terms of interaction. For example, a door affords opening, a wall affords blocking, etc. This is an important notion, because it defines meaning in terms of things that make sense to the organism, rather than in externally defined terms.

To conclude, a theory of neural computation that takes into account these points should differ from standard theories in the following way: it should be

1) relational (discovering internal structure) rather than inferential (comparing with memory),

2) active (inputs are not questions but answers) rather than passive (inputs are questions, actions are answers), and

3) subjective (meaning is defined by the interaction with the environment) rather than objective (objects are externally defined).

The intelligence of slime molds

Slime molds are fascinating: these are unicellular organisms that can display complex behaviors such as finding the shortest path in a maze and developing an efficient transportation network. Actually each of these two findings generated a high-impact publication (Science and Nature) and an Ig Nobel Prize. In the latter study, the authors grew a slime mold on a map of Japan, with food placed at the biggest cities, and demonstrated that it developed a transportation network that looked very much like the railway network of Japan (check out the video!).

More recently, a PNAS paper showed that a slime mold can solve the “U-shaped trap problem”. This is a classic spatial navigation problem in robotics: the organism is inside a U-shaped barrier and the food is on the other side of it. It cannot navigate to the food using local rules (e.g. following a path along which the distance to the food continuously decreases), and therefore it requires some form of spatial memory. This is not a trivial task for robots, but the slime mold can do it (check out the video).

What I find particularly interesting is that the slime mold has no brain (it is a single cell!), and yet it displays behavior that requires some form of spatial memory. The way it manages to do the task is that it leaves extracellular slime behind it and uses it to mark the locations it has already visited. It can then explore its environment by avoiding extracellular slime, and it can go around the U-shaped barrier. Thus it uses an externalized memory. This is a concrete example that shows that (neural) representation is not always necessary for complex cognition. It nicely illustrates Rodney Brooks’s famous quote: “The world is its own best model”. That is, why develop a complex map of the external world when you can directly interact with it?
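The principle can be illustrated with a toy simulation in Python. This is a sketch under my own assumptions, not the model from the paper: the world is a small grid, the agent is a point that always moves to a neighboring cell, it prefers cells with less slime and, among those, cells closer to the food, and slime accumulates with each visit. The grid layout and all names are hypothetical.

```python
GRID = [
    "...F...",   # F: food
    ".......",
    ".#####.",   # #: wall of the U-shaped trap (opening at the bottom)
    ".#...#.",
    ".#.S.#.",   # S: starting position, inside the trap
    ".......",
]

def find(symbol):
    """Return the (row, column) of the first cell containing symbol."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c >= 0:
            return (r, c)
    raise ValueError(symbol)

def neighbours(cell):
    """Free cells directly above, below, left and right of cell."""
    r, c = cell
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc)

def forage(use_slime, max_steps=200):
    """Return the path to the food, or None if it is not reached in max_steps."""
    start, food = find("S"), find("F")

    def distance(cell):
        return abs(cell[0] - food[0]) + abs(cell[1] - food[1])

    slime = {}                      # the memory lives in the world, not the agent
    position, path = start, [start]
    for _ in range(max_steps):
        if position == food:
            return path
        if use_slime:
            slime[position] = slime.get(position, 0) + 1
        # Local rule: least slime first, then closest to the food.
        position = min(neighbours(position),
                       key=lambda n: (slime.get(n, 0), distance(n)))
        path.append(position)
    return None
```

Without slime (forage(False)), the agent oscillates at the bottom of the U and never reaches the food. With slime (forage(True)), the same local rule takes it out of the trap and around the barrier, although the agent itself stores nothing about where it has been.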

Of course, we humans don’t usually leave slime on the floor to help us navigate. But this example should make us think about the nature of spatial memory. We tend to think of spatial memory in terms of maps, in analogy with actual maps that we can draw on paper. However, it is now possible to imagine other ways in which a spatial memory could work, in analogy with the slime mold. For example, one might imagine a memory system that leaves “virtual slime” in places that have already been explored, that is, that associates environmental cues about location with a “slime signal”. This would confer the same navigational abilities as those of slime molds, without a map-like representation of the world. For the organism, having markers in the hippocampus (the brain area involved in spatial memory) or outside the skull might not make a big difference (does the mind stop at the boundary of the skull?).

It is known that in mammals, there are cells in the hippocampus that fire at a specific (preferred) location. These are called “place cells”. How about if the meaning of spikes fired by these place cells were that there is “slime” in their favorite place? Of course I realize that this is a provocative question, which might not go so well with other known facts about the hippocampus, such as grid cells (cells that fire when the animal is at nodes of a regular spatial grid). But it makes the point that maps, in the usual sense, may not be the only way in which these experimental observations can be interpreted. That is, the neural basis of spatial memory could be thought of as operational (neurons fire to trigger some behavior) rather than representational (the world is reconstructed from spike trains).

On the role of voluntary action in perception

The sensorimotor theory of perception considers that to perceive is to understand the effect of active movements on sensory signals. Gibson’s ecological theory also places an emphasis on movements: information about the visual world is obtained by producing movements and registering how the visual field changes in lawful ways. Poincaré also defined the notion of space in terms of the movements required to reach an object or compensate for movements of an object.

Information about the world is contained in the sensorimotor “contingencies” or “invariants”, but why should it be important that actions are voluntary? Indeed, one could see movements as another kind of sensory information (e.g. proprioceptive information, or “efference copy”), and a sensorimotor law is then just a law defined on the entire set of accessible signals. I will propose two answers below. I only address the computational problem (why it is useful), not the problem of consciousness.

Why would it make a difference that action is voluntary? The first answer I will give comes from ideas discussed in robotics and machine learning, and known as active learning, curiosity or optimal experiment design. Gibson remarks that the term “information” is misleading when talking about the sensory inputs. Senses cannot be seen as a communication channel, because the world does not send messages to be decoded by the organism. In fact rather the opposite is true: the organism actively seeks information about the world by making specific actions that improve its knowledge. A good analogy is the game “20 questions”. One participant thinks of an object or person. The other tries to discover it by asking questions that can only be answered by yes or no. She wins if she can guess the object with no more than 20 questions. Clearly it is very difficult to guess using the answer to a random question. But by asking smart questions, one can quickly narrow the search to the right object. In fact with 20 questions, one can distinguish between up to 2^20 objects, which is about a million. Thus voluntary action is useful for efficiently exploring the world. Here by “voluntary” it is simply meant that action is a decision based on previous knowledge, which is intended to maximally increase future knowledge.
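The arithmetic of the game can be checked with a short Python sketch. The setup is an assumption made for illustration: the "objects" are simply the integers from 0 to 2^20 - 1, and each question has the form "is the secret at least m?", chosen so that either answer rules out half of the remaining candidates.

```python
def play_twenty_questions(secret, n_objects=2**20):
    """Identify secret among n_objects using yes/no questions that halve the search."""
    lo, hi = 0, n_objects          # the secret is known to lie in [lo, hi)
    questions = 0
    while hi - lo > 1:
        middle = (lo + hi) // 2
        questions += 1
        if secret >= middle:       # answer to "is it at least middle?"
            lo = middle
        else:
            hi = middle
    return lo, questions

guess, questions = play_twenty_questions(secret=123456)
```

Since 2^20 = 1,048,576, exactly 20 such questions are needed whatever the secret is. A question chosen at random, such as "is it object number 7?", would almost always eliminate a single candidate only.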

I can see another way in which voluntary action is useful, by drawing an analogy with philosophy of science. If perception is about inferring sensory or sensorimotor laws, then it raises an issue common to the development of science, which is how to infer universal laws from a finite set of observations. Indeed there are an infinite number of universal laws that are consistent with any finite set of observations – this is the problem of induction. Karl Popper argued that science progresses not by inferring laws, but by postulating falsifiable theories and testing them with critical experiments. Thus action can be seen as the test of a perceptual hypothesis. Perception without action is like science based on inductivism. Action can decide between several consistent hypotheses, and the fact that it is voluntary is what makes it possible to distinguish between causality and correlation (a fundamental problem raised by Hume). Here “voluntary” means that the action could have been different.

In summary, voluntary action can be understood as the test of a perceptual hypothesis, and it is useful both in establishing causal relationships and in efficiently exploring relevant hypotheses.