Marr’s levels of analysis and embodied approaches

Marr described the brain as an information-processing system, and argued it had to be understood at three distinct conceptual levels:

1) The computational level: what does the system do? (for example: estimating the location of a sound source)

2) The algorithmic/representational level: how does it do it? (for example: by calculating the maximum of cross-correlation between the two monaural signals)

3) The physical level: how is it physically realized? (for example: with axonal delay lines and coincidence detectors)

This is what Francisco Varela describes as “computational objectivism”: the purpose of the computation is to extract information about the world, in an externally defined representation. For example, to extract the interaural time difference between the two monaural signals. Varela describes the opposite view as “neurophysiological subjectivism”, according to which perception is a result of neural network dynamics.

Neurophysiological subjectivism is problematic because it fails to fully recognize the defining property of living beings, which is teleonomy. Jacques Monod (who received the Nobel prize for his work in molecular biology) articulated this idea by explaining that living beings, through the mechanics of evolution, differ from non-living things (say, a mountain) in that they have a fundamental teleonomic project, which is “invariant reproduction” (in Le Hasard et la Nécessité). The achievement of this project relies on specific molecular mechanisms, but it would be a mistake to think that the achievement of the project is a consequence of these mechanisms. Rather, the existence of mechanisms consistent with the project is a consequence of evolutionary pressure selecting those mechanisms: the project defines the mechanisms rather than the other way round. This fundamental aspect of life is downplayed in neurophysiological subjectivism.

Thus computational objectivism improves on neurophysiological subjectivism by acknowledging the teleonomic nature of living beings. However, a critical problem is that the goal (first level) is defined in terms that are external to the organism; in other words, the issue is whether the three levels are really independent. For example, in sound localization, a typical engineering approach is to calculate the interaural time differences as a function of sound direction, then estimate these differences by cross-correlation and invert the mapping. This approach fails in practice because these binaural cues depend on the shape of the head (among other things), which varies across individuals. One would then have to specify a mapping that is specific to each individual, and it is not reasonable to think that this might be hard-coded in the brain. This simply means that the algorithmic level (#2) must in fact be defined in relation to the embodiment, which is part of level #3. This is in line with Gibson’s ecological approach, in which information about the world is obtained by detecting sensory invariants, a notion that depends on the embodiment. Essentially, this is the idea of the “synchrony receptive field” that I developed in a recent general paper (Brette, PLoS CB 2012), and before that in the context of sound localization (Goodman and Brette, PLoS CB 2010).
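
To make the engineering approach concrete, here is a minimal sketch under an idealized far-field spherical-head model (ITD = (2r/c)·sin θ). The head radius, sampling rate and noise stimulus are illustrative assumptions; the point above is precisely that such a fixed model does not match the binaural cues of a real, individual head.

```python
# Sketch of the "engineering approach" to sound localization: assume an
# idealized head model, estimate the interaural time difference (ITD) by
# cross-correlation, then invert the model to get a direction.
# All parameters (head radius, sampling rate) are illustrative assumptions.
import numpy as np

fs = 44100.0          # sampling rate (Hz)
r = 0.09              # assumed head radius (m)
c = 343.0             # speed of sound (m/s)

def itd_from_angle(theta):
    """Idealized far-field model: ITD = (2r/c) * sin(theta)."""
    return 2 * r / c * np.sin(theta)

def estimate_itd(left, right, max_lag):
    """Return the lag (in samples) maximizing the cross-correlation."""
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left[max(0, -k):len(left) - max(0, k)],
                    right[max(0, k):len(right) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(xcorr))]

# Simulate a source at 30 degrees: the right channel is a delayed copy of the left
theta_true = np.deg2rad(30)
delay = int(round(itd_from_angle(theta_true) * fs))
x = np.random.randn(int(0.1 * fs))        # 100 ms of broadband noise
left, right = x, np.roll(x, delay)

itd = estimate_itd(left, right, max_lag=int(0.001 * fs)) / fs
theta_est = np.arcsin(np.clip(itd * c / (2 * r), -1, 1))
print(np.rad2deg(theta_est))   # close to 30, but only if the head model is correct
```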

However, this still leaves the computational level (#1) defined in external terms, although the algorithmic level (#2) is defined in more ecological terms (sound location rather than ITD). The sensorimotor approach (and related approaches) closes the loop by proposing that the computational goal is to predict the effect of movements on sensory inputs. This implies the development of an internal representation of space, but space is a consequence of this goal, rather than an ad hoc assumption about the external world.

Thus I propose a redefinition of the three levels of analysis of a perceptual system that is more in line with embodied approaches:

1) Computational level: to predict the sensory consequences of actions (sensorimotor approach) or to identify the laws that govern sensory and sensorimotor signals (ecological approach). Embodiment (previously in level 3) is taken into account in this definition.

2) Algorithmic/representational level: how to identify these laws or predict future sensory inputs? (the kernel in the kernel-envelope theory in robotics)

3) Neurophysiological level (previously physical level): how are these principles implemented by neurons?

Here I am also postulating that these three levels are largely independent, but the computational level is now defined in relation to the embodiment. Note: I am not postulating independence as a hypothesis about perception, but rather as a methodological choice.

Update. In a later post about rate vs. timing, I refine this idea by noting that, in a spike-based theory, levels 2 and 3 are in fact not independent, since algorithms are defined at the spike level.

 


"The brain uses all available information"

In discussions of “neural coding” issues, I have often heard the idea that “the brain uses all available information”. This idea generally pops up in response to the observation that neural responses are complex and vary with stimuli in ways that are difficult to comprehend. In this variability there is information about stimuli, and as complex as the mapping from stimuli to neural responses may be, the brain might well be able to invert this mapping. I sympathize with the notion that neural heterogeneity is information rather than noise, but I believe that, phrased in this way, this idea reveals two important misconceptions.

First of all, there is often a confusion between sensitivity (responses vary along several stimulus dimensions) and information (you can recover these dimensions from the responses). I made this point in a specific paper two years ago (pdf). Neural responses are observed for a specific experimental protocol, which is always constrained to a limited set of stimuli. One can often recover stimulus dimensions from the responses within this set, but it is a mistake to conclude that the brain can do it, because this inverse mapping depends on the particular experimental set of stimuli. In other words, the mapping is in fact from the observed neural responses and the knowledge of the experimental protocol to the stimulus. The brain does not have access to such external knowledge. Therefore, information is always highly overestimated in this type of analysis. This is in fact a classical problem in machine learning, related to the issues of training vs. test error, generalization and overfitting. The key concept is robustness: the hypothesized inverse mapping should be robust to large changes in the set of stimuli.
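
Here is a toy illustration of this point (not the analysis of the paper mentioned above): a linear decoder fitted on responses to a narrow “experimental” stimulus set recovers the stimulus well within that set, but fails on stimuli outside it. The response model and all numbers are invented for illustration.

```python
# Toy illustration of how decoding within a fixed experimental stimulus set
# overestimates the information available to the brain: a decoder fitted on
# the experimental set does not generalize to new stimuli.
import numpy as np

rng = np.random.default_rng(0)

def responses(stim):
    """Hypothetical population response: nonlinear in the stimulus, plus noise."""
    return np.column_stack([np.tanh(3 * stim), stim ** 2, np.sin(5 * stim)]) \
           + 0.05 * rng.standard_normal((len(stim), 3))

# "Experimental protocol": a narrow range of stimuli
train_stim = rng.uniform(0.0, 0.5, 200)
R_train = responses(train_stim)

# Fit a linear decoder (least squares) from responses to stimulus
A = np.column_stack([R_train, np.ones(len(train_stim))])
w, *_ = np.linalg.lstsq(A, train_stim, rcond=None)

def decode(R):
    return np.column_stack([R, np.ones(len(R))]) @ w

train_err = np.std(decode(R_train) - train_stim)

# New stimuli outside the experimental set: the same decoder generalizes poorly
test_stim = rng.uniform(0.5, 1.0, 200)
test_err = np.std(decode(responses(test_stim)) - test_stim)

print(train_err, test_err)   # the test error is typically much larger
```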

The second misconception is more philosophical, and has to do with the general investigation of “neural codes”. What is a code? It is a way of representing information. But sensory information is already present at the level of sensory inputs, and it is a theorem that information can only decrease along a processing chain. So if we say that the goal of a code is only to represent the maximum amount of information about stimuli, then what is gained by having a second (central) code, which can only be a degraded version of the initial sensory inputs? Thinking in this way is in fact committing the homunculus fallacy: looking at the neural responses as a projection of sensory inputs, which “the brain” observes. This projection achieves nothing, for it still leaves unexplained how the brain makes sense of sensory inputs – nothing has been gained in terms of what these inputs mean. At some point there needs to be something other than just representing sensory inputs in a high-dimensional space.
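
The theorem alluded to above is the data processing inequality: if the stimulus, the sensory input and the central response form a Markov chain, then the central response cannot carry more information about the stimulus than the sensory input does. In standard notation:

```latex
% Data processing inequality: for a Markov chain X -> Y -> Z
% (stimulus -> sensory input -> central response),
% mutual information cannot increase along the chain.
\[
  X \to Y \to Z
  \quad \Longrightarrow \quad
  I(X;Z) \le I(X;Y).
\]
```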

The answer, of course, is that the goal of a “neural code” is not just to represent information, but to do so in a way that makes it easier to process relevant information. This is the answer provided by representational theories (e.g. David Marr). One might also argue that the very notion of a neural code is misleading, because the role of a perceptual system is not to encode sensory inputs but to guide behavior, and that it is therefore more appropriate to speak of computation rather than of a code. In either view, the relevant question when interpreting neural responses is not how the rest of the brain can make use of them, but rather how they participate in solving the perceptual problem. I believe one key aspect is behavioral invariance, for example the fact that you can localize a sound source independently of its level (within a certain range). Another key aspect is that the “code” is in some way easier to decode for “neural observers” (not just any observer).

A note on computing with neural synchrony

In a recent paper, I explained how to compute with neural synchrony, by relating synchrony with the Gibsonian notion of sensory invariants. Here I will briefly recapitulate the arguments and try to explain what can and cannot be done with this approach.

First of all, neural synchrony, like any other concept of neural coding, should be defined from the observer’s point of view, that is, from the postsynaptic point of view. Detecting synchrony is detecting coincidences; that is, a neural observer of neural synchrony is a coincidence detector. Now coincidences are observed when spikes arrive together at the postsynaptic neuron, not when they are produced by the presynaptic neurons. Spikes travel along axons and therefore generally arrive after some delay, which we may consider fixed. This means that, in fact, coincidence detectors do not detect synchrony but rather specific time differences between spike trains.

I will call these spike trains Ti(t), where i is the index of the presynaptic neuron. Detecting coincidences means detecting relationships Ti(t)=Tj(t-d), where d is a delay (for all t). Of course we may interpret this relationship in a probabilistic (approximate) way. Now if one assumes that the neuron is a somewhat deterministic device that transforms a time-varying signal S(t) into a spike train T(t), then detecting coincidences is about detecting relationships Si(t)=Sj(t-d) between analog input signals.

To make the connection with perception, I then assume that the input signals are determined by the sensory input X(t) (which could be a vector of inputs), so that Si(t)=Fi(X)(t), where Fi is a linear or nonlinear filter. Computing with neural synchrony then means detecting relationships Fi(X)(t)=Fj(X)(t-d), that is, specific properties of the stimulus X. You could see such a relationship as a sensory law that the stimulus X(t) follows, or, in Gibson’s terminology, a sensory invariant (some property of the sensory inputs that does not change with time).
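
Here is a toy sketch of this idea: two neurons encode delayed copies of the same signal by threshold crossings, and a coincidence detector that reads them with a preferred delay d responds most when d matches the delay in the relationship Si(t)=Sj(t-d). The encoding model and all parameters are illustrative, not those of the paper.

```python
# Toy sketch: a coincidence detector with a preferred delay detects the
# relationship S_j(t) = S_i(t - d_true) between two analog signals encoded
# as spike trains (threshold crossings). Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
dt = 1e-4                        # time step (s)
t = np.arange(0, 2.0, dt)        # 2 s of simulated time

# Common stimulus; the two "presynaptic" signals satisfy S_j(t) = S_i(t - d_true)
X = np.convolve(rng.standard_normal(len(t)), np.ones(50) / 50, mode="same")
d_true = 30                      # delay in time steps (3 ms)
S_i, S_j = X, np.roll(X, d_true)

def spike_times(signal, threshold=0.2):
    """Encode a signal as the indices of its upward threshold crossings."""
    above = signal > threshold
    return np.flatnonzero(above[1:] & ~above[:-1])

T_i, T_j = spike_times(S_i), spike_times(S_j)

def coincidence_count(t_i, t_j, delay, window=2):
    """Spikes of neuron i that, shifted by `delay`, coincide (within `window`
    time steps) with a spike of neuron j."""
    return sum(np.any(np.abs(t_j - (s + delay)) <= window) for s in t_i)

# The coincidence detector whose delay matches the relationship responds most
counts = {d: coincidence_count(T_i, T_j, d) for d in range(0, 60, 5)}
print(max(counts, key=counts.get))   # close to d_true = 30
```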

So this theory describes computing with synchrony as the extraction of sensory invariants. The first question is: can we extract all sensory invariants in this way? The answer is no; only those relationships that can be written as Fi(X)(t) = Fj(X)(t-d) can be detected. But then isn’t the computation already done by the primary neurons themselves, through the filters Fi? This would imply that synchrony does not achieve anything, computationally speaking. But this is not true. The set of relationships between the signals Fi(X)(t) is not the same thing as the set of signals themselves. For one thing, there are more relationships than signals: if there are N encoding neurons, then there are N² relationships, times the number of allowed delays. But more importantly, a relationship between signals does not have the same nature as a signal. To see this, consider just two auditory neurons, one that responds to sounds from the left ear only, and one that responds to sounds from the right ear (and neglect sound diffraction by the head to simplify things). Neither of these neurons is sensitive at all to the location of the sound source. But the relationships between the input signals to these two neurons are informative of sound location. Relationships and signals are two different things: a signal is a stream of numbers, while a relationship is a universal statement about these numbers (aka an “invariant”). So to summarize: synchrony represents sensory invariants, which are not represented in the individual neurons, but only a limited number of sensory invariants. For example, if the filters Fi are linear, then only linear properties of the sensory input can be detected. Thus, sensory laws are not produced but rather detected, among a set of possible laws.

Now the second question: is computing with synchrony only about extracting sensory invariants? The answer is also no, because the theory is based on the assumption that the input signals to the neurons and their synchrony are mostly determined by the sensory inputs. But they could also depend on “top-down” signals. Synchrony could be generated by recurrent connections, that is, synchrony could be the result of a computation rather than (or in addition to) the basis of computation. Thus, to be more precise, this theory describes what can be computed with stimulus-induced synchrony. In Gibson’s terminology, this would correspond to the “pick-up” of information, i.e., the information is present in the primary input, preexisting in the form of the relationships between transformed sensory signals (Fi(X)), and one just needs to observe these relationships.

But there is an entire part of the field that is concerned with the computational role of neural oscillations, for example. If oscillations are spatially homogeneous, then they do not affect the theory – they may in fact simply be a way to transform the similarity of slowly varying signals into synchrony (this mechanism is the basis of Hopfield and Brody’s olfactory model). If they are not, in particular if they result from interactions between neurons, then this is a different matter.

What is sound? (VI) Sounds inside the head

When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.

But why is this so? The first answer is that sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences, not time differences. Such recordings also don’t reproduce the diffraction by the head, which can be captured using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can that be, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.
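
For concreteness, here is a minimal sketch of how a virtual source is rendered over earphones: the mono source is convolved with a left and a right head-related impulse response (HRIR) for the desired direction. The HRIRs below are crude placeholders (a pure interaural delay and attenuation); actual externalization experiments would use individually measured HRTFs, and possibly reverberation.

```python
# Minimal sketch of binaural rendering over earphones: convolve a mono source
# with left and right head-related impulse responses (HRIRs). The HRIRs here
# are crude placeholders (pure delay plus attenuation), not measured ones.
import numpy as np

fs = 44100
mono = np.random.randn(fs)                           # 1 s of source signal

# Placeholder HRIRs for a source on the left: right ear delayed and attenuated
hrir_left = np.zeros(128); hrir_left[0] = 1.0
hrir_right = np.zeros(128); hrir_right[20] = 0.6     # ~0.45 ms ITD, ~4 dB ILD

left_ear = np.convolve(mono, hrir_left)
right_ear = np.convolve(mono, hrir_right)
binaural = np.stack([left_ear, right_ear], axis=1)   # to be played over earphones
```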

This could be the reason, but even if it is true, it still leaves an open question: why would sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel sounds inside the head, not just inside the body. And finally, is it actually true that there are no sounds coming from inside the head? In fact, not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.

The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears, that is, of the head itself, from the top of the neck up and excluding the jaw. This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious why we should experience this feel as sounds being produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel by e.g. touching one’s head or chewing. But these are sound sources localized on the surface of the head, or on the inner surface of the mouth, not exactly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.

I will come back to speech later, but I will finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs and the subject’s head is held fixed, it should sound externalized; or that natural sounds should feel inside the head until one turns her head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated with the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving her head, the hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when sounds with such features are presented, it is inferred that ITDs and IIDs would be unaffected by movements, which is to say that the sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and not IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musical recordings).

The remarks above suggest the following predictions:

  • When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
  • When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
  • When the subject is allowed to move, sounds should feel (perhaps after a while) inside the head.
  • When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.

To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.

What I find intriguing in this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what are the tactile signals produced by earphones, do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations and he showed that we can actually feel the ITD of binaural sounds acting on the skin of the two arms rather than on the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies on profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).

I now summarize this post by listing a number of questions I have raised:

  • What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
  • Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
  • Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
  • What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?

Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. As soon as there is a reflecting surface, which does not move with the head, moving the head has an effect on the sound waves (specifically, on the echoes). In other words, the fact that echoic cues are affected (in a lawful way) by movements is characteristic of sound sources outside the head, whether they are rigidly attached to the head or not. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues also change for an external source that is free from the head.

Natural sensory signals

I am writing this post from the Sensory Coding and Natural Environment conference in Vienna. It’s a very interesting conference about a topic that I like very much, but it strikes me that many approaches I have seen seem to miss the point of what is natural about natural sensory signals.

So what is natural about natural sensory signals? It seems that a large part of the field, from what I have heard, answers that these are signals that have natural statistics. For example, they have particular second and higher order statistics, both spatially and temporally. While this is certainly true to some extent, I don’t find it a very satisfying answer.

Suppose I throw a rock in the air, and I can see its movement until it reaches the ground. The visual signals that I capture can be considered “natural”. What is natural about the motion of the rock: is it that the visual signals have particular statistics? Probably they do, but to me a more satisfying answer is that it follows the law of gravitation. Efficient coding approaches often tend to focus on statistics, because “the world is noisy” (or, “the brain is noisy”). However, even though there is turbulence in the air, describing the motion of the rock as obeying the law of gravitation (possibly with some noise) is still more satisfying than describing its higher order statistics – and possibly more helpful for an animal too.

In other words, I propose that what is natural about sensory signals is that they follow the laws of nature.

By the way, this view is completely in agreement with Barlow’s efficient coding principle, which postulates that neurons encode sensory information in an efficient way, i.e., they convey a maximum amount of information with a minimum number of spikes. Indeed representing the laws that govern sensory signals leads to a parsimonious description of these signals.

What is sound? (V) The structure of pitch

Musical notes have a particular perceptual quality called “pitch”. Pitch is the percept corresponding to how low or high a musical note is. Vowels also have a pitch. To a large extent, the pitch of a periodic sound corresponds to its repetition rate. The important point is that what matters for pitch is the periodicity more than the frequency content. For example, a periodic sound with repetition rate f0 has frequency components at multiples of f0 (n·f0), which are called harmonics. A pure tone of frequency f0 and a complex tone with all harmonics except the first one, i.e., which does not contain the frequency component f0, will evoke the same pitch. It is in fact a little more complex than that; there are many subtleties, but I will not go into these details in this post. Here I simply want to describe the kind of sensory or sensorimotor structure there is in pitch. It turns out that pitch has a surprisingly rich structure.

The most obvious type of structure is periodicity. Pitch-evoking sounds have this very specific property that the acoustical wave is unchanged when temporally shifted by some delay. This delay is characteristic of the sound’s pitch (i.e., same period means same pitch). This is the type of structure that is emphasized in temporal theories of pitch. This is what I call the “similarity structure” of the acoustical signal, and this notion can in fact be extended and accounts for a number of interesting phenomena related to pitch. But this is work in progress, so I will discuss it further at a later time.
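
As a small illustration of this similarity structure, the sketch below builds a complex tone whose fundamental component is missing and checks that the waveform is nevertheless unchanged by a shift of one period, with an autocorrelation peak at the pitch period 1/f0. The parameters are arbitrary.

```python
# A complex tone with harmonics 2..10 of f0 (missing fundamental) is still
# invariant under a shift of one period: its autocorrelation peaks at 1/f0.
import numpy as np

fs = 44100
f0 = 210.0                                    # repetition rate (Hz); period = 210 samples
t = np.arange(0, 0.5, 1 / fs)

# Harmonics 2..10 only: the component at f0 itself is absent
x = sum(np.sin(2 * np.pi * n * f0 * t) for n in range(2, 11))

period = int(round(fs / f0))
print(np.corrcoef(x, np.roll(x, period))[0, 1])    # ~1: unchanged by a one-period shift

# The normalized autocorrelation peaks at the period 1/f0, the pitch period
lags = np.arange(1, 2 * period)
acf = [np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in lags]
print(lags[int(np.argmax(acf))], period)           # both are 210 samples
```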

Another way to see periodic sounds is to realize that a periodic sound is predictable. That is, after a couple of periods, one can predict the future acoustical wave. Compared to most other sounds, periodic sounds have a very high degree of predictability. Perhaps the perceptual strength of pitch (which depends on a number of factors) is related to the degree of predictability of the sound.

There is another type of structure that is in some sense orthogonal to the similarity structure I just described, which one might call the “dissimilarity structure”. Natural sounds (apart from vocalizations) tend to have a smooth spectrum. Periodic sounds, on the other hand, have a discrete spectrum. Thus, in some sense, periodic sounds have a “surprisingly discontinuous” spectrum. Suppose for example that two auditory receptors respond to different but overlapping parts of the spectrum (e.g., two nearby points on the basilar membrane). Then one can usually predict the sensory input to the second receptor given the sensory input to the first receptor, because natural sounds tend to have a continuous spectrum. But this prediction would fail for a periodic sound. Periodic sounds are maximally surprising in this sense. The interesting thing about the dissimilarity structure of pitch is that it accounts for binaural pitch phenomena such as Huggins’ pitch: noise with a flat spectrum is presented to both ears, and the interaural phase difference changes abruptly at a given frequency; a tone is then perceived, with a pitch corresponding to that frequency.
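
As an illustration of this binaural effect, here is a sketch of a Huggins-pitch stimulus: the same broadband noise at both ears, except for an interaural phase shift that ramps from 0 to 2π across a narrow band around the frequency of the perceived tone. The construction follows the classic recipe only loosely, and the bandwidth is an assumption.

```python
# Sketch of a Huggins-pitch stimulus: identical broadband noise at both ears
# except for an interaural phase transition in a narrow band around f_pitch,
# which evokes a faint tone at that frequency (headphone presentation required).
import numpy as np

fs = 44100
f_pitch = 600.0              # frequency of the perceived tone (Hz)
bw = 0.16 * f_pitch          # width of the phase-transition band (assumption)
n = fs                       # 1 s of noise

noise = np.random.randn(n)
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(n, 1 / fs)

# Phase shift ramping from 0 to 2*pi across the narrow band around f_pitch
phase = np.interp(freqs, [f_pitch - bw / 2, f_pitch + bw / 2], [0.0, 2 * np.pi])
left = noise                                      # left ear: the original noise
right = np.fft.irfft(spectrum * np.exp(1j * phase), n)

stereo = np.stack([left, right], axis=1)
```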

Thus, pitch-evoking sounds simultaneously have two types of structure that distinguish them from other types of sounds: the similarity structure, which consists of different views of the acoustical signal that are unusually similar, and the dissimilarity structure, which consists of different views of the acoustical signal that are unusually dissimilar. The first type of structure corresponds to what I examined in my paper on computing with neural synchrony. It is important to notice that these two types of structure have a different nature. The similarity structure corresponds to a law that the sensory signals follow. Here the percept is associated with the specific law that these signals follow. The dissimilarity structure corresponds to the breaking of a law that sensory signals usually follow. Here the percept is associated with a law that is specific not to the presented sensory signals, but to the usual sensory signals. Thus we might relate the similarity structure to the notion of discovery, and the dissimilarity structure to the notion of surprise (and perhaps the term “structure” is not appropriate for the latter).

So far, I have only considered the structure of the acoustical signal, but one may also consider the sensorimotor structure of pitch. As I mentioned in another post, periodic sounds are generally produced by living beings, so it makes sense to examine these sounds from the viewpoint of their production. When one produces a pitch-evoking sound (for example a vowel, or when one sings), there is a very rich structure that goes beyond the acoustical structure. First, there is proprioceptive information about the vocal muscles and tactile information about the vibrations of the larynx, and both are directly related to the period of the sound. There is also the efference copy, i.e., the motor commands issued to make the vocal folds vibrate in the desired way. For a person who can produce sounds, pitch is then associated with a rich and meaningful sensorimotor structure. In fact, the sensorimotor theory of pitch perception would be that to perceive the pitch of a sound is, perhaps, to perceive the movements that would be required to produce such acoustical structure. An interesting aspect of this view is that it provides some meaning to the notion of how low or high a pitch-evoking sound is, by associating it with the state of the different elements involved in sound production. For example, producing a high sound requires increasing the tension of the vocal folds and moving the larynx up (higher!). One question then is whether congenitally mute people have a different perception of pitch.

Observe that, as for binaural hearing, the sensorimotor structure of pitch should not be understood as the relationship between motor commands and auditory signals, but rather as the relationship between motor commands and the structure of auditory signals (e.g. the periodicity). In this sense, it is higher-order structure.

What is sound? (IV) Ecological ontology of sounds

What kinds of sounds are there in the world? This is essentially the question William Gaver addresses in a very interesting paper (Gaver, 1993), in which he describes an ontology of sounds, categorized by the type of interaction. There are three categories: sounds made by solids, liquids and gases. An example of a sound made by a liquid is dripping. There are also hybrid sounds, such as rain falling on a solid surface. It makes sense to categorize sounds based on the nature of the objects involved, because the mechanical events are physically very different. For example, in sounds involving solids (e.g. a footstep), energy is transmitted at the interface between two solids, which is a surface, and the volumes are put in motion (i.e., they are deformed). This is completely different for sounds involving gases, e.g. wind. In mechanical events involving solids, the shape is essentially unchanged (only transiently deformed). This is a sort of structural invariance that ought to leave a specific signature on the sounds (more on this in another post). Sounds made by gases, on the other hand, correspond to irreversible changes.

These three categories correspond to the physical nature of the sound producing substances. There are subcategories that correspond to the nature of the mechanical interaction. For example, a solid object can be hit or it can be scraped. The same object vibrates but there is a difference in the way it is made to vibrate. This also ought to produce some common structure in the auditory signals, as is explained in Gaver's companion article. For example, a vibrating solid object has modes of vibration that are determined by its shape (more on this in another post). These modes do not depend on the type of interaction with the object.

Interactions that are localized in time produce impact sounds, while continuous interactions produce auditory textures. These are two very distinct types of sounds. Both have a structure, but auditory textures, it seems, only have a structure in a statistical sense (see McDermott & Simoncelli, 2011). Another kind of auditory texture is the type of sound produced by a river, for example. These sounds also have a structure in a statistical sense. An interesting aspect, in this case, is that these sounds are not spatially localized: they do have an auditory size (see my post on spatial hearing).

The examples I have described correspond to what Gaver calls "basic level events", elementary sounds produced by a single mechanical interaction. There are also complex events, which are composed of simple events. For example, a breaking sound is composed of a series of impact sounds. A bouncing sound is also composed of a series of impact sounds, but the temporal patterning is different, because it is lawful (predictable) in the case of a bouncing sound. Walking is yet another example of a series of impact sounds, which is also lawful, but it differs in the temporal patterning: it is approximately periodic.

Gaver only describes sounds made by non-living elements of the environment (except perhaps for walking). But there are also sounds produced by animals. I will describe them now. First, some animals can produce vocalizations. In Gaver’s terminology, vocalizations are a sort of hybrid gas-solid mechanical event: air pushed from the lungs makes the vocal folds vibrate, producing periodic pulses of air. The sound then resonates in the vocal tract, which shapes its spectrum (much as the shape of an object determines the resonant modes of impact sounds). One special type of structure in these sounds is the periodicity of the sound wave. The fact that a sound is periodic is highly meaningful, because it means that energy is continuously provided, and therefore that a living being is most likely producing it. There are also many other interesting aspects that I will describe in a later post.

Animals also produce sounds by interacting with the environment. These are the same kinds of sounds as described by Gaver, but I believe there is a distinction. How can you tell that a sound has been produced by a living being? Apart from identifying specific sounds, I have two possible answers to offer. First, in natural non-living sounds, energy typically decays. This distinguishes walking sounds from bouncing sounds, for example. In a bouncing sound, the energy decreases at each impact. This means that both the intensity of the impacts and the intervals between successive impacts decay. This is simply because a bouncing ball starts its movement with a fixed amount of potential energy, which can only decay. In a walking sound, roughly the same energy is delivered at each impact, so it cannot be produced by a passive collision of two solids: energy must be supplied at each step. Therefore, sounds contain a signature of whether they are produced by a continuous source of energy. But a river is also a continuous source of energy (and the same would apply to all auditory textures). Another specificity is that sounds produced by the non-living environment are governed by the laws of physics, and therefore they are lawful in a sense, i.e., they are predictable. A compound sound with an unpredictable pattern (even in a statistical sense) is most likely produced by a living being. In a sense, unpredictability is a signature of decision making. This remark is not specific to hearing.
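
To make the bouncing example concrete: for a ball dropped from height h0 with restitution coefficient e, both the intervals between impacts and the impact energies decay geometrically (by factors e and e² respectively), whereas footsteps deliver roughly the same energy at roughly regular intervals. Here is a toy calculation with illustrative numbers.

```python
# Impact times and energies of a bouncing ball: both decay geometrically,
# unlike footsteps, which inject roughly constant energy at regular intervals.
import numpy as np

g = 9.81        # gravity (m/s^2)
h0 = 1.0        # initial drop height (m)
e = 0.8         # coefficient of restitution (assumption)

v = np.sqrt(2 * g * h0)              # speed at the first impact
t = np.sqrt(2 * h0 / g)              # time of the first impact
impacts = []
for k in range(6):
    impacts.append((t, 0.5 * v ** 2))    # (impact time, kinetic energy per unit mass)
    v *= e                               # speed just after the bounce
    t += 2 * v / g                       # flight time until the next impact

print(np.diff([ti for ti, _ in impacts]))        # each interval is e times the previous
print([energy for _, energy in impacts])         # energies decay by a factor e**2
```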

These are specificities of sounds produced by living beings, as heard by another observer. But one can also hear self-produced sounds. There are two new specificities about these types of sounds. First, they also make the body vibrate, for example when a foot hits the ground. This produces sound waves with a specific structure. But more importantly, self-produced sounds have a sensorimotor structure. Scraping corresponds to a particular way in which one interacts with an object. The time of impact corresponds to the onset of the sound. The intensity of the sound is directly related to the energy with which an object is hit. Finally, the periodicity of vocalizations (i.e., the pitch) corresponds to the periodicity of self-generated air pulses through the vocal folds, and the formant frequencies correspond to the shape of the vocal tract. Self-generated sounds also have a multimodal structure. For example, they produce vibrations in the body that can be perceived by tactile receptors. In the next post, I will look at the structure of pitch.

Perceptual invariants: representational vs. structural theories

In his book on vision, David Marr acknowledges that a major computational issue for sensory systems is to extract relevant information in a way that is invariant to a number of changes in the world, for example to recognize a face independently of its orientation and distance. Here we hit a major difference between representational theories and what I shall call structural theories, such as Gibson’s ecological theory (see my post on the difference between these two theories). In a representational theory, invariant processing is obtained by building a representation that is itself invariant to a number of transformations (e.g. translations, rotations). How can this representation be built? There are two ways: either it is hard-wired (innate) or it is acquired, learned by associating many transformed instances of the same object with the same “percept”. So in a representational theory, dealing with invariance is a tedious learning process requiring supervision. In a structural theory, the problem does not really exist, because the basis of perception is precisely the invariants.

I will give an example from hearing. There are two theories of pitch perception. Pitch is the percept associated with how low or high a musical note is. It mostly corresponds to the periodicity of the sound wave. Two periodic sounds with the same repetition rate will generally have the same pitch, but they may have different timbres, i.e., different spectral contents. In the spectral or template theory, there is an initial representation of sounds consisting of a spectral pattern. It is then compared with the spectral patterns of reference periodic sounds with various pitches, the templates. These templates need to be learned, and the task is not entirely trivial because periodic sounds with the same pitch can have non-overlapping spectra (for example a pure tone, and a complex tone without the first harmonic). The spectral theory of pitch is a representational theory of pitch. In this account, there is nothing special about pitch; it is just a category of sound spectra.

The temporal theory of pitch, on the other hand, postulates that the period of a sound is detected. I call it a structural theory because pitch corresponds to a structural property of sounds, their periodicity. One can observe that the same pattern in the sound wave is repeated, at a particular rate, and this observation does not require learning. Now this means that if two sounds with the same period are presented, I can immediately recognize that they share the same structural property, i.e., they have the same pitch. Learning, in a structural theory, only means associating a particular structure with a label (say, the name of a musical note). The invariance problem disappears in a structural theory, because the basis of the percept is an invariant: the periodicity does not depend on the sound’s spectrum. This also means that sounds that elicit a pitch percept are special because they have a particular structure. In particular, periodic sounds are predictable. White noise, on the other hand, has no structure and does not elicit a pitch percept.

David Marr vs. James Gibson

In his book “Vision”, David Marr briefly comments on James Gibson’s ecological approach, and rejects it. He makes a couple of criticisms that I think are fair, for example that Gibson seemed to believe that extracting meaningful invariants from sensory signals is somehow trivial, whereas it is in fact a difficult computational problem. But David Marr seems to have missed the important philosophical points in James Gibson’s work. These points have also been made by others, for example Kevin O’Regan and Alva Noë, but also by Merleau-Ponty and many others. I will try to summarize a few of these points here.

I quote from David Marr: “Vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information”. There are two philosophical errors in this sentence. First, that perception is the production of a representation. This is a classical philosophical mistake, the homunculus fallacy. Who then sees this representation? Marr even explicitly mentions a “viewer” of this representation. One would have to explain the perception of this viewer, and this reasoning leads to an infinite regress.

The second philosophical mistake is more subtle. It is to postulate that there is an external source of information, the images on the retina, that the sensory system interprets. This is made explicit later in the book: “(...) the initial representation is in no doubt – it consists of arrays of image intensity values as detected by the photoreceptors in the retina”. This is precisely what Gibson doubts at the very beginning of his book, The Ecological Approach to Visual Perception. Although it is convenient to speak of information in sensory signals, it can be misleading. It suggests a parallel with Shannon’s theory of communication, but the environment does not communicate with the observer. Surfaces reflect light waves in all directions; there is no message in these waves. So the analogy between a sensory system and a communication channel is misleading. The fallacy of this view is fully revealed when one considers the voluntary movements of the observer. The observer can decide to move and capture different sensory signals. In Gibson’s terminology, the observer samples the ambient optic array. So what is primary is not the image, it is the environment. Gibson insists that a sensory system cannot be reduced to the sensory organ (say, the eyes and the visual cortex): it must include active movements, embedded in the environment. This is related to theories of embodiment.

We tend to feel that what we see is like the image of a high-resolution camera. This is a mistake due to the immediate availability of visual information (through eye movements). In reality, only a very small part of the visual field has high resolution, and one part of the retina has no photoreceptors at all (the blind spot). We do not notice this because when we need the information, we can immediately direct our eyes towards the relevant target in the visual field. There is no need to postulate an internal high-resolution representation in which we can move our “inner eye”. Rodney Brooks, a successful researcher in artificial intelligence and robotics, once stated that “the world is its own best model”. The fact that we do not actually have a high-resolution mental representation of the visual world (an image in the mind) has been demonstrated spectacularly through the phenomena of change blindness and inattentional blindness, in which a major change in an image or movie goes unnoticed (see for example this movie).