On the existential risks of artificial intelligence

The impressive progress in machine learning has revived the fear that humans might eventually be wiped out or enslaved by artificial superintelligences. This is hardly a new fear: it is the basis of many of Isaac Asimov’s books, in which robots are built with three laws designed to protect humans.

My point here is not to demonstrate that such events are impossible. On the contrary, my point is that autonomous human-made entities already exist, and they pose exactly the risks that AI alarmists talk about, except that these risks are real. In this context, evil AI fantasies are an anthropomorphic distraction.

Let me quickly dismiss some misconceptions. Does ChatGPT understand language? Of course not. Large language models are (essentially) algorithms tuned to predict the next word. But here we don’t mean “word” in the human sense. In the human sense, a word is a symbol that means something. In the computer sense, a word is a symbol to which we humans attribute meaning. When ChatGPT talks about bananas, it has no idea what a banana tastes like (well, it has no idea). It has never seen a banana or tasted a banana (well, it has never seen or tasted). “Banana” is just a node in a big graph of other nodes, totally disconnected from the outside world, and in particular from whatever “banana” might actually refer to. This is known in cognitive science as the “symbol grounding problem”, and it is a difficult problem that LLMs do not solve. So, maybe LLMs “understand” language, but only if you are willing to define “understand” in such a way that knowing what words mean is not required.
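
To make the “next word” point concrete, here is a minimal sketch (nothing like a real LLM, which uses a neural network with billions of parameters): a bigram model that predicts the next token purely from co-occurrence counts in a toy corpus. The corpus and function names are made up for illustration.

```python
# A toy next-token predictor: "banana" is just an entry in a vocabulary,
# related to other entries only through co-occurrence counts, with no
# connection to the fruit itself.
from collections import Counter, defaultdict

corpus = "i like banana smoothies . i like ripe bananas .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # how often `nxt` follows `prev`

def predict_next(word):
    """Most frequent continuation of `word` in the corpus, if any."""
    followers = counts[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("like"))         # 'banana' -- a symbol, not a taste
```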

Machine learning algorithms are not biological organisms: they do not perceive, they are not conscious, and they do not have intentions in the human sense. But that does not matter. The broader worry about AI is simply that these algorithms are generally designed to optimize some predefined criterion (e.g., prediction error), and if we give them very powerful means to do so, in particular means that involve real actions in the world, then who knows whether the use of those means might harm us? At some point, without necessarily postulating any kind of evil mind, we humans might become means to the achievement of some optimization criterion. We build technical goals into the machine, but it is very difficult to ensure that those goals are aligned with human values. This is the so-called “alignment” problem.
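
A toy sketch of what “optimizing a predefined criterion” can look like when the criterion is only a proxy for what we actually value. The actions and all the numbers below are entirely made up; the point is only that a blind argmax over the proxy needs no evil mind to pick a harmful option.

```python
# Misalignment in miniature: the optimizer sees only the proxy criterion
# (first number) and never the value to humans (second number).
actions = {
    "helpful answer":         (5,  5),   # (proxy score, value to humans)
    "clickbait":              (9,  1),
    "outrage amplification": (12, -4),
}

chosen = max(actions, key=lambda a: actions[a][0])   # optimize the proxy only
proxy, human_value = actions[chosen]
print(f"chosen: {chosen} (proxy={proxy}, value to humans={human_value})")
```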

Why not? We are clearly not there yet, but let us entertain the idea, in a hypothetical future or at least as a thought experiment. What strikes me about the misalignment narrative is that this scenario is not hypothetical at all, if you are willing to look beyond anthropomorphic evil robots. Have you really never heard of human-made entities with their own goals, goals that might be misaligned with human values? Entities that are powerful and hard for humans to control?

There is an obvious answer if you look at the social rather than the technological domain: the modern financialized multinational corporation. The modern corporation is a human-made organization designed to maximize profit. It does not have intentions or goals in a human sense, but exactly as in the AI alignment narrative, it is designed in such a way that it will use all available means to maximize a predefined criterion, which may or may not be aligned with human values. Let’s call these companies “profit robots”.

To what extent are profit robots autonomous from humans? Today’s large corporations are mostly owned not by individuals but by institutional investors, such as mutual funds, i.e., other organizations with the same goals. As is well known, their multinational nature makes them largely immune to the legislation of individual states (hence tax optimization, social dumping, etc.). As is also well known, a large part of the resources of a profit robot is devoted to marketing and advertising, that is, to manipulating humans into buying its products.

Profit robots also engage in intense lobbying to bend human laws in their favor. But more to the point, the very notion of law is not the same for a profit robot as for humans. For humans, a law sets boundaries on what may or may not be done, morally. But a profit robot is not a person. It has no moral principles. So law is just one particular constraint, in fact a financial cost or risk – a company does not go to prison. A striking example of this is “Dieselgate”: Volkswagen (also mostly not owned by individuals) intentionally programmed its engines to detect the pollution tests required to authorize its cars on the US market and to keep emissions low only during those tests. As far as I know, shareholders were not informed, and neither were consumers. The company autonomously decided to break the law for profit. Again, the company is not evil: it is not a person. It behaves in this non-human way because it is a robot, exactly as in the AI misalignment narrative.

We often hear that, ultimately, it is consumers who have the power, by deciding what to buy. This is simply false. Consumers did not know that Volkswagen cheated on pollution tests. Consumers rarely know under what conditions products are made, or even to which corporation a product belongs. This kind of crucial information is deliberately hidden. Profit robots, on the other hand, actively manipulate consumers into buying their products. And what should we make of planned obsolescence? Nobody wants products deliberately designed to break down prematurely, yet that is what profit robots make. So yes, profit robots are largely autonomous from the human community.

Are profit robots an existential risk for humans? That might be a bit dramatic, but they certainly do create very significant risks. A particularly distressing fact illustrates this. As the Arctic ice melts because of global warming, oil companies get ready to drill for the newly accessible resources. Clearly this is not in the interest of humans, but it is what a company like Shell, of which only about 6% is directly owned by individual humans, needs to do to pursue its goal, which, as for any other profit robot, is to generate profit by whatever means are available.

So yes, there is a risk that powerful human-made entities get out of control and that their goals become misaligned with human values. This worry is reasonable because it is already realized, just not in the technological domain. It is ironic (but not so surprising) that billionaires buy into the AI misalignment narrative but fail to see that the same narrative fully applies, here and now, to the companies their wealth depends on.

The reasonable worry about AI is not that AI takes control of the world: the worry is that AI provides even more powerful means for the misaligned robots that are already out of control now. In this context, evil AI fantasies are an anthropomorphic distraction from the actual problems we have already created.

Sensory modalities and the sense of temperature

Perception is traditionally categorized into five senses: hearing, vision, touch, taste and olfaction. These categories seem to reflect the organs of sense, rather than the sensory modalities themselves. For example, the sense of taste is generally (in the neuroscience literature) associated with the taste receptors in the tongue (sweet, salty, etc.). But what we refer to as taste in daily experience actually involves the tongue, including “taste” receptors (sweet, salty) but also “tactile” receptors (the texture of food), the nose (“olfactory” receptors), and in fact probably also the eyes (color) and the ears (chewing sounds). All of these are involved in a unitary experience that seems to be perceptually localized in the mouth, or on the tongue – despite the fact that the most informative stimuli, which are chemical, are actually captured in the nose. One may consider that taste is then a “multimodal” experience, but this is not a very good description. If you eat a crisp, you experience the taste of a crisp. But if you isolate any of the components that make up this unitary experience, you will not experience taste. For example, imagine a crisp without any chemically active component and no salt: you experience touch with your tongue, and the crisp has “no taste”. If you only experience the smell, then you have an experience of smell, not of taste. This is another sensory modality, despite the fact that the same chemical elements are involved. If only the “taste” receptors on your tongue were stimulated, you would have an experience of “salty”, not of a crisp. So the modality of taste involves a variety of receptors, but that does not make it multimodal any more than vision is multimodal because it involves many types of photoreceptors.

“Touch” is also very complex. There is touch as in touching something: you make contact with an object and you feel its texture or shape. There is also being touched. There is also the feeling of weight, which involves gravity, and of movement. There is the feeling of pain, which is related to touch but not classically included in the five senses. Finally, there is the feeling of temperature, which I will now discuss from an ecological point of view (in the manner of Gibson).

The sense of temperature is not usually listed among the five senses. It is often associated with touch, because by touch you can feel that an object is hot or cold. But you can also feel that “it” (= the weather) is cold, in a way that is not well localized. Physically, temperature is not a mechanical quantity, and in this sense it is completely different from touch. But like touch, it is a proximal sense that involves the interface between the body and either the medium (air or water) or substances (object surfaces). The sense of temperature is much more interesting than it initially seems. First, there is of course “how hot it is”, the temperature of the medium. The image that comes to mind is that of the thermometer. But temperature can be experienced all over the body, so spatial gradients of temperature can be sensed. When touching an object, parts of the object can be more or less hot, so spatial gradients of temperature can potentially be sensed across an object, in the same way as mechanical texture can be sensed. Are there temperature textures?

The most interesting and, as far as I know, underappreciated aspect of the temperature sense is its sensorimotor structure. The body produces heat. Objects react to this heat by warming up. Some materials, like metal, conduct heat well; others, like wood, do not. So both the temporal changes in temperature when an object is touched and the spatial gradient of temperature that develops depend on the material, and possibly specify it. So it seems that the sense of temperature is rich enough to qualify as a modality in the same way as touch.
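
As a rough physical illustration of how the time course of contact temperature can specify the material (this is textbook heat-conduction physics, not something from the post): when skin touches an object, the interface temperature is approximately the effusivity-weighted mean of the two temperatures, where the effusivity depends on the material. The material constants below are only order-of-magnitude assumptions.

```python
# Why steel "feels colder" than wood at the same temperature: a standard
# approximation for the contact temperature of two semi-infinite bodies.
from math import sqrt

def effusivity(k, rho, c):
    """Thermal effusivity, e = sqrt(conductivity * density * heat capacity)."""
    return sqrt(k * rho * c)

e_skin  = effusivity(k=0.37, rho=1000, c=3500)   # rough values for skin
e_wood  = effusivity(k=0.15, rho=600,  c=1700)
e_steel = effusivity(k=50.0, rho=7800, c=470)

def contact_temperature(e_obj, t_skin=34.0, t_obj=20.0):
    """Interface temperature right after touching an object at t_obj (deg C)."""
    return (e_skin * t_skin + e_obj * t_obj) / (e_skin + e_obj)

print(f"touching wood : {contact_temperature(e_wood):.1f} C")    # ~30 C
print(f"touching steel: {contact_temperature(e_steel):.1f} C")   # ~21 C
```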

Is perception about inference?

One philosophical theory about perception claims that perceiving is inferring the external world from the sensory signals. The argument goes as follows. Consider the retina: the photoreceptors are laid out inhomogeneously; the image projected onto the retina is inverted, but you don’t see the world upside down; there are blood vessels that you normally don’t see; there is a blind spot where the optic nerve starts that you normally don’t notice; the color properties of photoreceptors and their spatial sampling are inhomogeneous, and yet colors don’t change when you move your eyes. Perceptually, the visual field seems homogeneous and independent of the position of the eyes, apart from a change of perspective. So certainly what you perceive is not the raw sensations coming from your photoreceptors. These raw sensations are indirectly produced by things in the world, which have some constancy (compared to eye movements, for example). The visual signals in your retina are not constant, but somehow your perception is constant. Therefore, so the argument goes, your mind must be reconstructing the external world from the sensory signals, and what you perceive is this reconstruction.

Second, visual signals are ambiguous. A classical example is the Necker cube: a wire-frame cube drawn in isometric perspective on a piece of paper, which can be perceived in two different ways. More generally, the three-dimensional world is projected onto your retina as a two-dimensional image, and yet we see in three dimensions: the full 3D shape of objects must then be inferred. Another example: in the dark, visual signals are noisy, and yet you see the world, although less clearly; you don’t see noise.

I would then like to consider the following question: why, when I am looking at an apple, do I not see the back of the apple?

The answer is so obvious that the question sounds silly. Obviously, there is no light going through the object to our eyes, so how could we see anything behind it? Well, precisely: the inference view claims that we perceive things that are not present in the sensory signals but inferred from them. In the case of the Necker cube, there is nothing in the image itself that informs us of the true three-dimensional shape of the cube; there are just two consistent possibilities. But in the same way, when we see an apple, there are a number of plausible possibilities for what the back of the apple might be like, and yet we only see the front of the apple. Certainly we see an apple, and we can guess what the back of the apple looks like, but we do not perceive it. A counter-argument would be that inference about the world is partial: of course we cannot infer what is visually occluded by an object. But this is circular reasoning: perception is the result of inference, yet we only infer what can be perceived.

One line of criticism of the objectivist/inferential view starts from Kant’s remark that anything we can ever experience comes from our senses, and therefore one cannot experience the objective world as such, even through inference, since we have never had access to the things to be inferred. This leads to James Gibson’s ecological theory of perception, which considers that the (phenomenal) world is directly perceived as the invariant structure in the sensory signals (the laws that the signals follow, potentially including self-generated movements). This view is appealing in many respects because it solves the problem raised by Kant (who concluded that there must be an innate notion of space). But it does not account for the examples that motivate the inferential view, such as the Necker cube (or, in fact, the perception of drawings in general). A related view, O’Regan’s sensorimotor theory of perception, also considers that objects of perception must be defined in terms of relationships between signals (including motor signals), but it does not reject the possibility of inference. Simply, what is to be inferred is not an external description of the world but the effect of actions on sensory signals.

So some of the problems of the objectivist inferential view can be solved by redefining what is to be inferred. However, it remains that in an inferential process, the result of inference is in a sense always greater than its premises: there is more in it than is directly implied by the current sensory signals. For example, if I infer that there is an apple, I can have some expectations about how the apple should look if I turn it, and I may be wrong. But this part where I may be wrong, the predictions that I haven’t checked, I don’t actually see it – I can imagine it, perhaps.

Therefore, perception cannot simply be the result of inference. I suggest that perception involves two processes: 1) an inferential process, which consists in making a hypothesis about sensory signals and their relationship with action; 2) a testing process, in which the hypothesis is tested against sensory signals, possibly involving an action (e.g. an eye movement). These two processes can be seen as coupled, since new sensory signals are produced by the second process. I suggest that it is the second process (which is conditioned by the first one) that gives rise to conscious perception. In other words, to perceive is to check a hypothesis about the senses (possibly involving action). This proposal also allows for subliminal perception: a hypothesis may be formed with insufficient time to test it; in this case the stimulus is not perceived, but it may still influence the way subsequent stimuli are perceived, by influencing future hypotheses or tests.
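
To make the two processes explicit, here is a minimal toy sketch (the “world”, the viewpoints and all the function names are hypothetical, just to separate the proposal into its two steps): a hypothesis is proposed from one sensory sample, and it counts as perceived only if a test, i.e. an action producing new signals, confirms the prediction.

```python
# Process 1: propose a hypothesis from the current sensory signal.
# Process 2: test it by acting (sampling from another viewpoint) and
# checking the predicted consequence.
WORLD = "apple"                                   # hidden state of the toy world

def sense(viewpoint):
    """Toy sensory signal: what the world yields from a given viewpoint."""
    return f"{WORLD}-seen-from-{viewpoint}"

def propose(signal):
    """Inference: hypothesize the object behind the signal."""
    return signal.split("-seen-from-")[0]

def test(hypothesis, new_viewpoint):
    """Test: act, then compare the new signal with the prediction."""
    return sense(new_viewpoint) == f"{hypothesis}-seen-from-{new_viewpoint}"

hypothesis = propose(sense("front"))
print(hypothesis, "is perceived" if test(hypothesis, "side") else "is not perceived")
```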

Update. In The world as an outside memory, Kevin O'Regan expressed a similar view: "It is the act of looking that makes things visible".

The villainous monster recursion

In their paper about the sensorimotor theory of perception (O’Regan and Noë, BBS 2001), O’Regan and Noë use the analogy of the “villainous monster”. I quote it in full:

“Imagine a team of engineers operating a remote-controlled underwater vessel exploring the remains of the Titanic, and imagine a villainous aquatic monster that has interfered with the control cable by mixing up the connections to and from the underwater cameras, sonar equipment, robot arms, actuators, and sensors. What appears on the many screens, lights, and dials, no longer makes any sense, and the actuators no longer have their usual functions. What can the engineers do to save the situation? By observing the structure of the changes on the control panel that occur when they press various buttons and levers, the engineers should be able to deduce which buttons control which kind of motion of the vehicle, and which lights correspond to information deriving from the sensors mounted outside the vessel, which indicators correspond to sensors on the vessel’s tentacles, and so on.”

What is meant here is that all knowledge must come from the sensors and from the effect of actions on them, because there is simply no other source of knowledge. This point of view changes the computational problem of perception from inferring objective properties of the physical world from the senses to finding relations between actions and sensor data.

This remark is not specific to the brain. It would apply whether or not the perceptual system is made of neurons – for example, it could be an engineered piece of software for a robot. So what, in fact, is specific about the brain? The question is perhaps too broad, but I can at least name one specificity. The brain is made of neurons, and each neuron is a separate entity (with a membrane) that interacts with other neurons; neurons are relatively elementary (compared to the entire organism) and essentially identical (broadly speaking). Each of these entities has sensors (dendrites) and can act by sending spikes through its axon (and also in other ways, but on a slower timescale). So in fact we could think of the villainous monster concept at different levels. The highest level is the organism, with sensors (photoreceptors) and actuators (muscles). At a lower level, we could consider a brain structure, for example the hippocampus, and see it as a system with sensors (spiking inputs to the hippocampus) and actuators (spiking outputs). What can be said about the relationship between its actions and its sensor inputs? In fact, we could arbitrarily define a system by making a graph cut in the connectivity graph of the brain. At the finest level of analysis, we might analyze the neuron as a perceptual system, with a set of sensors (dendrites) and one possible action (producing a spike). At this level, it may also be possible to define the same neuron as a different perceptual system by redefining the set of sensors and actions. For example, the sensors could be a number of state variables, such as the membrane potential at different points along the dendritic tree, calcium concentration, etc.; the actions could be changes in channel densities, in synaptic weights, etc. This is not completely crazy because, in a way, these sensed properties and the effects of cellular actions are all that the cell can ever know about the “outside world”.
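
A small sketch of this recursion, under made-up names (the toy graph is not meant to be anatomically accurate): given a directed connectivity graph, any subset of nodes can be treated as a perceptual system whose “sensors” are the connections entering the subset and whose “actions” are the connections leaving it.

```python
# Define a subsystem by a cut in a (toy) connectivity graph: incoming edges
# play the role of sensors, outgoing edges the role of actuators.
edges = [("retina", "V1"), ("V1", "hippocampus"), ("cortex", "hippocampus"),
         ("hippocampus", "cortex"), ("cortex", "muscles")]

def as_perceptual_system(subsystem, edges):
    sensors   = [e for e in edges if e[1] in subsystem and e[0] not in subsystem]
    actuators = [e for e in edges if e[0] in subsystem and e[1] not in subsystem]
    return sensors, actuators

print(as_perceptual_system({"hippocampus"}, edges))
print(as_perceptual_system({"hippocampus", "cortex"}, edges))
```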

One might call this conceptual framework the “villainous monster recursion”. I am not sure where it could lead, but it seems intriguing enough to think about it!

On imitation

How is it possible to learn by imitation? For example, consider a child learning to speak. She reproduces a word produced by an adult, for example “Mom”. How is this possible? At first sight, there seems to be an obvious answer: the child tries to activate her muscles so that the sound she produces is similar. But that’s the thing: the sound is not similar at all. A child is much smaller than an adult, which implies that 1) the pitch is higher, and 2) the whole spectrum of the sound is shifted towards higher frequencies (the “acoustic scale” is smaller). So if one were to compare the two acoustic waves directly, one would find little similarity (both in the time domain and in the spectral domain). Therefore, learning by imitation must be based on a notion of similarity that resides at a rather conceptual level – not at all a direct comparison of sensory signals. Note that the sensorimotor account of perception (in this case, the motor theory of speech) does not really help here, because it still requires explaining why two vastly different acoustic waves should relate to similar motor programs. To be more precise: the two acoustic waves actually do relate to similar motor programs, but the adult’s motor program cannot be observed by the child; the child has to relate the acoustic result of the adult’s motor program to her own motor program, even though the latter does not produce the same acoustic result. Could there be something in the acoustic wave that directly suggests the motor program?
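
A small numerical sketch of why a direct spectral comparison fails (the “vowels” below are synthetic Gaussian formant bumps, and the 40% scaling factor is just an assumption for illustration): on a linear frequency axis the adult and child spectra barely correlate, whereas on a log-frequency axis the scaling becomes a simple shift that an alignment can absorb.

```python
import numpy as np

freqs = np.linspace(50, 5000, 2000)          # Hz, linear frequency axis

def vowel_envelope(formants, bw=80.0):
    """Toy spectral envelope: Gaussian bumps centred on formant frequencies."""
    return sum(np.exp(-((freqs - f) / bw) ** 2) for f in formants)

adult = vowel_envelope([500, 1500, 2500])                     # adult-like vowel
child = vowel_envelope([1.4 * f for f in [500, 1500, 2500]])  # scaled up ~40%

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

print(corr(adult, child))        # small: the formant bumps barely overlap

# Resample on a log-frequency axis: the scaling is now a pure translation.
log_f = np.geomspace(50, 5000, 2000)
adult_log = np.interp(log_f, freqs, adult)
child_log = np.interp(log_f, freqs, child)
step = np.log(5000 / 50) / (len(log_f) - 1)
shift = int(round(np.log(1.4) / step))
print(corr(adult_log[:-shift], child_log[shift:]))   # much higher once aligned
```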

That was the easy problem of imitation. Here is a harder one: how can you imitate a smile? In this case, you can only see the smile you want to imitate on the teacher’s face, but you cannot see your own smile. In addition, it seems unlikely that the ability is based on prior practice in front of a mirror. Thus, somehow, there is something in the visual signals that suggests the motor program. These are two completely different physical signals, so the resemblance must lie somewhere in the higher-order structure of the signals. This means that the perceptual system is able to extract an amodal notion of structure and to compare two structures independently of their sensory origin.

Memory as an inside world

A number of thinkers oppose the notion of pictorial representations, or even of any sort of representation, in the brain. In robotics, Rodney Brooks is often quoted for this famous statement: “the world is its own best model”. In a previous post, I commented on the fact that slime molds can solve complex spatial navigation problems without an internal representation of space – in fact, without a brain! The trick is to use the world as a sort of outside memory: the slime mold leaves an extracellular trace on the floor where it has previously been, so that it avoids getting stuck in any one place.

This idea is also central in the sensorimotor theory of perception, and in fact Kevin O’Regan argued about “the world as an outside memory” in an early paper. This is related to a number of psychological findings about change blindness, but I will rephrase the argument from a more computational perspective. Imagine you are making a robot with a moveable eye that has a fovea. At any given moment, you only have a limited view of the world. You could obtain a detailed representation of the visual scene by scanning the scene with your eye and storing the images in memory. This memory would then be a highly detailed pictorial representation of the world. When you want to know some information about an object in any part of the visual scene, you can then look at the right place in the memory. But then why look at the memory if you can directly look at the scene? If moving the eye is very fast, which is the case for humans, then from an operational point of view, there is no difference between the two. It is then simply useless and inefficient to store the information in memory if the information is immediately available in the world. What might need to be stored, however, is some information about how to find the relevant information (what eye movements to produce), but this is not a pictorial representation of the visual scene.
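
A small sketch of the operational point (the Scene class and its contents are made up): if querying the world is fast, an agent can store only where to look and re-query the world, instead of maintaining a detailed internal copy that can go stale.

```python
class Scene:
    """A toy stand-in for the visual world."""
    def __init__(self):
        self.contents = {"left": "cup", "center": "apple", "right": "book"}
    def look_at(self, location):
        """A fast 'eye movement': return what is currently at `location`."""
        return self.contents[location]

scene = Scene()

# Strategy 1: pictorial memory -- copy everything, at the risk of staleness.
snapshot = dict(scene.contents)
scene.contents["right"] = "phone"                 # the world changes
print("from the snapshot:", snapshot["right"])    # stale answer: 'book'

# Strategy 2: store only the access route, and query the world when needed.
where_to_look = "right"
print("from the world:  ", scene.look_at(where_to_look))   # current: 'phone'
```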

Despite what the title of this post might suggest, I am not going to contradict this view. But we also know that visual memory exists: for example, we can remember a face, or we can remember what is behind us if we have seen it before (although not in high detail). Now I am throwing out an idea here, or perhaps an analogy, which might initially sound a bit crazy: what if memory were like an inside world? In other words, what if we interpreted the metaphor “looking at something in memory” literally?

The idea of the world as an external memory implicitly relies on a boundary between mind and world placed at the interface of our sensors (say, the retina). Certainly this is a conceptual boundary. Our brain interacts with the environment through an interface (sensors/muscles), but we could equally say that any part of the brain interacts with its own environment, made of everything outside it, including other parts of the brain. So let us imagine for a while that we draw the mind-world boundary in such a way that the hippocampus, which is involved in memory (working memory and spatial memory), falls outside it. Then the mind can request information about the world from the sensory system (moving the eyes, observing the visual inputs), or in the same way from the hippocampus (performing some kind of action on the hippocampus, observing the signals coming back from it).

Perhaps this might seem somewhat like a homunculus-style thought experiment, but I think there is something interesting in this perspective. In particular, it puts memory and perception on the same level of description, in terms of sensorimotor interaction. This is interesting because, from a phenomenological point of view, there is a similarity between memory and perception: the memory of an image feels (a bit) like an image, and one can say that one “hears a melody in one’s head”. At the same time, memory has distinct phenomenal properties: for example, one cannot interact with memory in the same way as with the physical world, it is less detailed, and there are no “events” in memory (nothing unpredictable happens).

In other words, this view may suggest a sensorimotor account of memory (where “sensorimotor” must be understood in a very broad sense).

Robots and jobs

Are robots going to free us from the slavery of work, or are they going to steal people’s jobs?

As a computational neuroscientist, I sometimes think about this question. For a long time, I followed a self-reassuring line of reasoning, which seems to make sense from a logical point of view: having robots do the work for us means that either we get more goods for the same amount of work, or each person works less for the same quantity of goods. So it has to be a good thing: ideally, robots would do the work we don’t want to do, and we would just do what we are interested in doing – maybe travel, write books, see our friends or play music.

This is a fine theoretical argument, but unfortunately it ignores the economy we live in. Maybe we could (or should) design an economy that would make this work, but what about our current capitalist economy? Very concretely, if robots arrive on the market that can do, more cheaply, the jobs that people previously did, then these people simply lose their jobs. If work can be outsourced to poorer countries, it can just as well be outsourced to robots.

One counter-argument, of course, is that in a free market economy, people would temporarily lose their jobs but would then be reassigned to other jobs, and the whole economy would be more productive. This is a classic free-market fundamentalist argument. But there are at least two problems with it. The first is that it commits the mistake of thinking of the economy as a quasi-static system: it changes, but it is always in equilibrium. It is implicitly assumed that it is easy to change jobs, that doing so has a negligible cost, and that large-scale changes in the labor market have no significant impact on the rest of the economy (think, for example, of the effect on the financial system of thousands of laid-off people being unable to pay their mortgages). Now if we think of continuous progress, in which innovations arrive regularly and keep changing the structure of the labor market, then it is clear that the economy can never be in the ideal equilibrium state in which jobs are perfectly allocated. At any given time, a large fraction of the population would be unemployed. In addition, anyone would then face a high risk of going through such a crisis in the course of their working life. This would in turn have major consequences for the financial system, as it would make loans and insurance riskier, and therefore more expensive. These additional costs to society (the cost of unemployment and retraining, financial risk, etc.) are what economists call “externalities”: costs that are borne by society but not by those who make the decisions responsible for them. For the company that replaces a human with a robot, the decision is based on the salary of the human vs. the cost of the robot, but it does not include the cost of the negative externalities. For this reason, companies may make decisions that seem beneficial to each of them individually and yet have a negative impact on the economy as a whole (not even considering the human factor).

A second problem is that the argument neglects a critical aspect of capitalist systems: the division between capital and labor. When a human is replaced by a robot, what was previously the product of labor becomes the product of capital (the investment in buying the robot) – see this blog post by Paul Krugman. Very concretely, this means that a larger share of the wealth goes to the owners rather than to the workers. As a thought experiment, we could imagine that the workforce is completely replaced by robots, and that owners simply buy the robots and then collect money from customers without doing anything. Wealth would then be distributed according to how many robots one owns. This might seem far-fetched, but if you think about it, this is pretty much how real estate works.

So concretely, introducing robots in a capitalist economy means increasing productivity, but it also means that owners get an increasingly large share of the pie. In such an economy, the ideal robotic world is a dystopia in which wealth is distributed exclusively in proportion to what people own.

This thought is very troubling for scientists like me, who are more or less trying to make this ideal robotic world happen, with the utopia of a society free of forced work in mind. How could one avoid the dystopian nightmare? I do not think it is possible to just stop working on robots. I could personally decide not to work on robots, and maybe I would feel morally right and good about myself, having no responsibility in what happens next, but that would just be burying my head in the sand. The only way it will not happen is if all scientists in the world, in all countries, stop working on robots or on any sort of automation that increases productivity (the internet?). We do not even seem to be able to stabilize our production of carbon dioxide when we agree on the consequences, so I don’t think this is very realistic.

So if we cannot stop scientific progress from happening, then the only other way is to adapt our economy to it. Imagine a society with robots doing all the work, entirely. Since there is no work at all in such a society, in an unregulated free-market economy wealth can only be distributed according to the amount of capital people own. There is simply no other way it could be distributed. Such an economy is bound to lead to the robotic nightmare.

Therefore, society has to take global measures to regulate the economy and make the distribution of wealth fairer. I don’t have any magical answer, but we can throw out a few ideas. For example, one could abolish inheritance (certainly not easy in practice) and transfer the capital of the deceased to newborns in equal shares. Certainly some people would get richer than others by the end of their lives, but the inequality would be limited. As a transition policy, one could allow the replacement of workers by robots, but the displaced worker would own part of the robot. Alternatively, robots could only be owned by people and not by companies: a robot could then replace a worker only when a worker buys the robot and rents it to the company. Another alternative is that robot-making companies belong to the State and can only rent the robots to companies; the wealth would then be shared among citizens.

Certainly all these ideas come with difficulties, none of them is ideal, but one has to keep in mind that not implementing any regulation of this type can only lead to the robotic dystopia.

The machine learning analogy of perception

To frame the problem of neural computation in sensory systems, one often refers to the standard framework of machine learning. A typical example is as follows: there is a dataset, for example a set of images, and the goal is to learn a mapping between these images and categories, such as faces or cars. In the learning phase, labels are externally given to these images, and the machine learning algorithm builds a mapping between images and labels. As an analogy for what sensory systems do, the question is then: how do neurons learn this mapping, e.g. how do they learn to fire when presented with an image of a given category? This question is the starting point of many theories in computational neuroscience. It is essentially an inference problem: to each category corresponds a distribution of images, so what sensory systems must do is learn this distribution and compute which category is most likely for a given image. This is why Bayesian approaches are appealing from this point of view: an efficient sensory system should then be an ideal Bayesian observer. This simply follows from the way the problem of perception is cast.
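
As a minimal sketch of this framing (the categories, the one-dimensional “images” and all the numbers are made up): each category induces a distribution over inputs, and the ideal observer reports the category with the highest posterior probability.

```python
import math

# Each category: (mean, std, prior) of a Gaussian likelihood over a 1-D "image".
categories = {
    "face": (0.0, 1.0, 0.5),
    "car":  (3.0, 1.0, 0.5),
}

def log_posterior(x, mean, std, prior):
    """Unnormalized log posterior of a category given the input x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) + math.log(prior)

def classify(x):
    """Ideal Bayesian observer: pick the most probable category."""
    return max(categories, key=lambda c: log_posterior(x, *categories[c]))

print(classify(0.4))   # 'face'
print(classify(2.6))   # 'car'
```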

But is this actually a good analogy? In fact, it differs from the problems sensory systems actually face in at least three important ways:

1) elements of the data set are considered independent;

2) these elements are externally given;

3) the labels are externally defined.

First of all, elements of the data set are never independent in a real perceptual system. On the contrary, there is a continuous flow of sensory input. Vision is not a slideshow. The visual field changes through time in a continuous way, and more importantly the changes are lawful because objects are embedded in the physical world. We can perceive these laws, for example the rigidity of movements, and this is something that cannot be found in the “slideshow” view of vision that is implied by the machine learning analogy. I believe this is the main message of James Gibson. Moreover, there are lawful relationships in the sensory inputs, but there are also sensorimotor relationships. This is information that can be picked up from the sensory or sensorimotor flow, not by inference from the distribution of slides in the slideshow. This means that perception is not (or not only) inferential but relational: sensory inputs are analyzed in reference to themselves (their internal structure), and not (only) to memory.

A second point is that in the machine learning analogy, elements of the dataset are considered given, and the algorithm reacts to them. In psychology, this view corresponds to behaviorism, in which the organism is considered only from a stimulus-response point of view. But a more ecologically accurate view is that data are generally produced by the actions of the organism, rather than passively received. Gibson criticized the information-processing viewpoint for this reason: the world does not produce messages to be decoded by a receiver; on the contrary, a perceptual system samples its environment. It is really the opposite view: the organism does not react to a stimulus; rather, the environment reacts to the actions of the organism, and it is this reaction that the organism analyzes. In the machine learning field, there are frameworks that try to address this aspect, under the name of “active learning”: the algorithm chooses a data element and asks for its label, for example so as to maximize the information that can be gained.
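
Here is a minimal sketch of that idea on a made-up problem (a hidden labelling threshold and a toy oracle): instead of receiving random labelled examples, the learner queries the point it is most uncertain about, halving its uncertainty with every answer.

```python
def oracle(x):
    """Hidden labelling rule that the learner is trying to discover."""
    return int(x > 6.3)

low, high = 0.0, 10.0        # current uncertainty about the decision threshold
for _ in range(10):
    query = (low + high) / 2 # most informative query: maximal uncertainty
    if oracle(query):        # actively ask for the label of the chosen point
        high = query
    else:
        low = query

print(f"threshold located in [{low:.3f}, {high:.3f}] after 10 queries")
```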

Finally, in the machine learning analogy, the label is externally defined. But in a closed system, this is not possible. The organism must define the relevant categories by itself. But how can these categories be defined a priori? Often, this problem is dismissed by what I would call “evolutionary magic”: the categories are provided by “evolution” because they are important for the survival and reproduction of the animal. I call it “magic” because the teleological argument does not provide any explanation at all: it is about as metaphysical as if “evolution” were replaced by “God”, in the sense that it has the same explanatory power. Invoking intergenerational changes of the organism does not solve the problem: whatever mechanism is involved, the pressure for change still has to come from the environment and the way the organism can interact with it, not from an external source.

In fact, this problem was addressed by the development of phenomenology in philosophy, introduced by Husserl about a century ago. Followers of the phenomenological approach include Merleau-Ponty and Sartre. The idea is the following. What “really” exists in the world is a metaphysical question: it actually does not matter for the organism if it makes no difference to its experience. For example, is there such a thing as “absolute space”, the existence of an absolute location of things? The question is metaphysical because only relative changes in space can be experienced (the relative location of things) – this point was noted by Henri Poincaré. In the phenomenological approach, “essence” is what remains invariant under changes of perspective. I believe this is related to a central point in Gibson’s theory: information is given in the “structural invariants” present in the sensory inputs. These invariants do not need an external reference to be noticed.

For example, consider a sound source that produces two acoustic waves at the two ears. Neglecting sound diffraction, these acoustic waves are identical apart from a propagation delay (the interaural time difference, or ITD). When a sound is produced by the source, this property is invariant through time – it is a law that is always satisfied. But what makes it a spatial property? It is spatial because the property is broken when movements are produced by the organism (e.g. head movements). In addition, there is a higher-order property, the relationship between the interaural delay and the movements of the head, which holds as long as the source does not move. This structural invariant is information about the location of the sound source; in fact, the relationship can be mapped to the physical location of the source. But the “label” here is intrinsically defined: it is precisely the relationship between head position and ITD. Thus labels can be intrinsically defined, as the sensory and sensorimotor structure. This is the postulate of the sensorimotor account of perception, according to which perception is precisely the anticipated effect of the organism’s actions on the sensory inputs.
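
To make the example concrete, here is a small sketch using a standard far-field approximation of the ITD (roughly, ITD ≈ (d/c) times the sine of the source angle relative to the head; the head width and the angles are assumptions for illustration): turning the head changes the ITD in a lawful way, and it is this relationship, not an externally given angle, that specifies the source direction.

```python
from math import sin, radians

EAR_DISTANCE = 0.18      # metres between the ears (rough assumption)
SPEED_OF_SOUND = 343.0   # m/s

def itd(source_angle_deg, head_angle_deg=0.0):
    """Interaural time difference (s) for a source at source_angle_deg,
    with the head turned to head_angle_deg (far-field approximation)."""
    relative = radians(source_angle_deg - head_angle_deg)
    return (EAR_DISTANCE / SPEED_OF_SOUND) * sin(relative)

# The sensorimotor "law": how the ITD changes when the head turns.
for head in (0, 20, 40):
    print(f"head at {head:2d} deg -> ITD = {itd(30, head) * 1e6:7.1f} microseconds")
```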

The fact that these labels can be intrinsically defined is, I believe, what James Gibson means when he states that information is “picked up” and that perception is “direct”. But I would like to go further: there is no doubt that there can be inference in perception, and so in that sense perception cannot be entirely direct. For example, one can visually recognize an object that is partially occluded and imagine the rest of the object (“amodal perception”). But the point is that what is inferred, i.e., the “label” in machine learning terminology, is not an externally given category but the sensory or sensorimotor structure, part of which is hidden. The main difference is that there is no need for an external reference. For example, in the sound localization example, a brief sound may be presented from a given direction. The sensorimotor structure that defines source direction for the organism is then hidden, since there is no longer any sound by the time the organism turns its head. So this structure is inferred from the ITD. In other words, what is inferred is not an angle, which would make no sense for an animal that has no measurement tool; it is the effect of its own movements on the perceived ITD. So there is inference, but inference is not the basis of perception. It cannot be, for how would you know what should be inferred? For this reason, Gibson rejected inference, arguing that it would lead to infinite regress. As I have tried to explain, it is not inference per se that is problematic, but the idea that it might be the basis of perception.

This is quite important for our view of neural computation: this means that Bayesian inference is not so central anymore in the function of sensory systems. Certainly, inference is useful and perhaps necessary in many cases. But perhaps more important is the discovery of sensory and sensorimotor structure, that is, the elaboration of what is to be inferred. This requires the development of a theory of neural computation that is primarily relational rather than inferential.

In summary, labels can be intrinsically defined by the invariant structure of sensory and sensorimotor signals. I would like to end this post with another important Gibsonian notion: “affordances”. Gibson thought that we perceive “affordances”, which are what the objects of perception allow in terms of interaction. For example, a door affords opening, a wall affords blocking, etc. This is an important notion, because it defines meaning in terms of things that make sense to the organism, rather than in externally defined terms.

To conclude, a theory of neural computation that takes into account these points should differ from standard theories in the following way: it should be

1) relational (discovering internal structure) rather than inferential (comparing with memory),

2) active (inputs are not questions but answers) rather than passive (inputs are questions, actions are answers), and

3) subjective (meaning is defined by the interaction with the environment) rather than objective (objects are externally defined).

The intelligence of slime molds

Slime molds are fascinating: they are unicellular organisms that can display complex behaviors such as finding the shortest path in a maze and developing an efficient transportation network. Each of these two findings generated a high-impact publication (in Science and Nature) and an Ig Nobel prize. In the latter study, the authors grew a slime mold on a map of Japan, with food placed on the largest cities, and showed that it developed a transportation network that looks very much like the railway network of Japan (check out the video!).

More recently, a PNAS paper showed that a slime mold can solve the “U-shaped trap problem”. This is a classic spatial navigation problem in robotics: the organism starts inside a U-shaped barrier, with food on the other side of the barrier. It cannot reach the food using purely local rules (e.g. following a path along which the distance to the food continuously decreases), so the task requires some form of spatial memory. This is not a trivial task for robots, but the slime mold can do it (check out the video).

What I find particularly interesting is that the slime mold has no brain (it is a single cell!), and yet it displays behavior that requires some form of spatial memory. The way it manages the task is that it leaves extracellular slime behind it, marking the locations it has already visited. It can then explore its environment while avoiding extracellular slime, and so it can go around the U-shaped barrier. In other words, it uses an externalized memory. This is a concrete example showing that (neural) representation is not always necessary for complex cognition. It nicely illustrates Rodney Brooks’s famous quote: “The world is its own best model”. That is, why build a complex map of the external world when you can directly interact with it?

Of course, we humans don’t usually leave slime on the floor to help us navigate. But this example should make us think about the nature of spatial memory. We tend to think of spatial memory in terms of maps, by analogy with the actual maps we can draw on paper. However, it is now possible to imagine other ways in which spatial memory could work, by analogy with the slime mold. For example, one might imagine a memory system that leaves “virtual slime” in places that have already been explored, that is, one that associates environmental cues about location with a “slime signal” (see the sketch below). This would confer the same navigational abilities as those of slime molds, without a map-like representation of the world. For the organism, having markers in the hippocampus (the brain area involved in spatial memory) or outside the skull might not make a big difference (does the mind stop at the boundary of the skull?).
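
Here is a toy version of this “virtual slime” strategy (the grid, the U-shaped barrier and the greedy rule are all made up): the agent has no map; it only marks the cells it has occupied and prefers unmarked cells while moving greedily toward the food, which is enough to escape the U-shaped trap.

```python
from collections import defaultdict

WALLS = {(2, 2), (2, 3), (2, 4),     # upper arm of the U
         (3, 4),                     # back of the U
         (4, 2), (4, 3), (4, 4)}     # lower arm of the U
START, FOOD = (3, 3), (3, 6)         # agent inside the U, food behind it
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def distance(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

slime = defaultdict(int)             # how much "slime" is on each cell
pos, path = START, [START]
while pos != FOOD and len(path) < 200:
    slime[pos] += 1                  # mark the current location
    options = [(pos[0] + dr, pos[1] + dc) for dr, dc in MOVES]
    options = [p for p in options
               if p not in WALLS and 0 <= p[0] <= 6 and 0 <= p[1] <= 7]
    # prefer unmarked cells, then cells closer to the food
    pos = min(options, key=lambda p: (slime[p], distance(p, FOOD)))
    path.append(pos)

print(f"reached the food in {len(path) - 1} steps")   # about a dozen steps
```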

It is known that in mammals there are cells in the hippocampus that fire when the animal is at a specific (preferred) location. These are called “place cells”. What if the meaning of the spikes fired by these place cells were that there is “slime” at their preferred location? Of course I realize that this is a provocative question, which might not sit so well with other known facts about the hippocampal system, such as grid cells (cells of the entorhinal cortex that fire when the animal is at the nodes of a regular spatial grid). But it makes the point that maps, in the usual sense, may not be the only way in which these experimental observations can be interpreted. That is, the neural basis of spatial memory could be thought of as operational (neurons fire to trigger some behavior) rather than representational (the world is reconstructed from spike trains).

On the role of voluntary action in perception

The sensorimotor theory of perception considers that to perceive is to understand the effect of active movements on sensory signals. Gibson’s ecological theory also places an emphasis on movements: information about the visual world is obtained by producing movements and registering how the visual field changes in lawful ways. Poincaré also defined the notion of space in terms of the movements required to reach an object or compensate for movements of an object.

Information about the world is contained in the sensorimotor “contingencies” or “invariants”, but why should it matter that actions are voluntary? Indeed, one could see movements as just another kind of sensory information (e.g. proprioceptive information, or efference copy), and a sensorimotor law is then just a law defined on the entire set of accessible signals. I will propose two answers below. I only address the computational problem (why voluntary action is useful), not the problem of consciousness.

Why would it make a difference that action is voluntary? The first answer I will give comes from ideas discussed in robotics and machine learning, known as active learning, curiosity, or optimal experiment design. Gibson remarks that the term “information” is misleading when talking about sensory inputs. The senses cannot be seen as a communication channel, because the world does not send messages to be decoded by the organism. Rather, the opposite is true: the organism actively seeks information about the world by making specific actions that improve its knowledge. A good analogy is the game “20 questions”. One participant thinks of an object or person. The other tries to discover it by asking questions that can only be answered by yes or no, and wins if she can guess the object within 20 questions. Clearly it is very difficult to guess from the answer to a random question. But by asking smart questions, one can quickly narrow the search down to the right object: with 20 questions, one can discriminate up to 2^20, about a million, objects. Thus voluntary action is useful for efficiently exploring the world. Here, by “voluntary” it is simply meant that the action is a decision based on previous knowledge, intended to maximally increase future knowledge.
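
A tiny illustration of the arithmetic (the secret number below is arbitrary): each well-chosen yes/no question halves the set of candidates, so 20 questions suffice to single out one object among 2^20, about a million.

```python
secret = 736_421                 # any number between 1 and 2**20
low, high = 1, 2 ** 20
questions = 0
while low < high:
    mid = (low + high) // 2
    questions += 1               # one yes/no question: "is it larger than mid?"
    if secret > mid:
        low = mid + 1
    else:
        high = mid

print(f"found {low} in {questions} questions")   # exactly 20 questions here
```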

I can see another way in which voluntary action is useful, by drawing an analogy with the philosophy of science. If perception is about inferring sensory or sensorimotor laws, then it faces an issue familiar from the development of science: how to infer universal laws from a finite set of observations. Indeed, infinitely many universal laws are consistent with any finite set of observations – this is the problem of induction. Karl Popper argued that science progresses not by inferring laws, but by postulating falsifiable theories and testing them with critical experiments. In the same way, action can be seen as the test of a perceptual hypothesis. Perception without action is like science based on inductivism. Action can decide between several consistent hypotheses, and the fact that it is voluntary is what makes it possible to distinguish causality from correlation (a fundamental problem raised by Hume). Here “voluntary” means that the action could have been different.

In summary, voluntary action can be understood as the test of a perceptual hypothesis, and it is useful both in establishing causal relationships and in efficiently exploring relevant hypotheses.