What is sound? (III) Spatial hearing

In my previous post, I argued that the spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. This happens even though sounds contain highly structured information about shape, because this structure does not relate to self-generated movements. One may then wonder whether the notion of auditory space in general, for example the spatial location of a sound source, is also secondary. One may postulate that the spatial content of auditory spatial cues is only acquired through their contingency with visual spatial cues. In fact, this idea is supported by an intriguing study showing a very strong correlation across species between visual acuity in the fovea and auditory spatial acuity (Heffner & Heffner, 1992, Fig. 6). More precisely, the authors show that sound localization acuity is better predicted by visual acuity than by acoustical factors (essentially, interaural distance). In this interpretation, animals have poor sound localization acuity not so much because they lack the physiological mechanisms to correctly analyze spatial information, but because in the absence of precise vision, auditory spatial cues cannot acquire precise spatial content. This does not imply that the auditory system of these animals cannot decode these spatial cues, only that they cannot make sense of them. [Update: the results in Heffner & Heffner are in fact more subtle, see a more recent post]

This being said, there is in fact some intrinsic spatial content in sounds, which I will describe now. When a sound is produced, it arrives first at the ear closer to the source, then at the other ear. The intensity will also be higher at the first ear. This is the binaural structure of sounds produced by a single source, and captured by two ears that are spatially separated. This is similar to stereoscopic vision. But observe one difference: in vision, as Gibson noted, having two eyes is essentially the same thing as having one eye, combined with lateral head movements; in hearing, this is not the same because of the non-persistent nature of sounds. If one turns the head to sample another part of the “acoustic array” (in analogy with Gibson’s optic array), the sound field will have changed already (and possibly faded out), so the spatial structure will not be directly captured in the same way. Thus, to capture spatial structure in sound, it is crucial that acoustic signals are simultaneously captured at different locations.

This binaural structure in sounds is often described in terms of "spatial cues" (binaural cues). Quantitatively, there is a relationship between the spatial location of the source and the binaural structure, e.g., the interaural time difference (ITD). However, these "cues" are not intrinsically spatial, in the sense that they are not defined in relation to self-generated movements. For example, what is the spatial meaning of an ITD of 100 µs? Intrinsically, there is none. As discussed above, one way for spatial cues to acquire spatial content is by association with the spatial content of another modality (vision). But now I will consider the effects of self-generated movements, that is, what is intrinsically spatial in sounds.
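To make the 100 µs example concrete, here is a minimal sketch (in Python) using Woodworth's classical approximation for a rigid spherical head; the head radius and the formula itself are textbook idealizations, not something discussed in this post:

```python
import numpy as np

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth's approximation for a rigid spherical head:
    ITD = (a/c) * (theta + sin(theta)), theta = azimuth in radians.
    head_radius ~ 8.75 cm for an adult human; c = speed of sound (m/s)."""
    theta = np.radians(azimuth_deg)
    return (head_radius / c) * (theta + np.sin(theta))

for az in (0, 11, 45, 90):
    print(f"azimuth {az:2d} deg -> ITD = {itd_woodworth(az)*1e6:6.1f} us")
```

Under these assumptions, an ITD of about 100 µs corresponds to an azimuth of roughly 11°, and the maximum ITD (source at 90°) is about 650 µs. But the point stands: this mapping is a relationship between numbers, and nothing in it is intrinsically spatial.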

When the head turns, the binaural structure changes in specific ways. That is, there is a sensorimotor structure that gives spatial content to binaural structure. More precisely, two different binaural structures can be related to each other by a specific movement. But an important distinction from vision must be made. Because of the non-persistent nature of sounds, the relationship is not between movements and sensory signals; it is between movements and the structure of sensory signals. It is not possible to predict the auditory signals from the auditory signals captured before a specific movement; for one thing, there might be no sound produced after the movement. What is predictable is the binaural structure of the sound, if indeed a sound is produced by a source that has a persistent location. If the location of the source is persistent, then the binaural structure is persistent, but the auditory signals themselves are not.
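A toy simulation of this point (Python; the sampling rate, the 100 µs ITD, and the use of white noise are illustrative assumptions): two entirely different, unpredictable waveforms emitted from the same persistent location yield the same binaural structure, here measured as the ITD recovered by cross-correlation.

```python
import numpy as np
rng = np.random.default_rng(1)

fs = 192000                    # sampling rate (Hz), high enough to resolve small ITDs
itd_true = 100e-6              # persistent source location -> persistent ITD
shift = round(itd_true * fs)   # ITD in samples (~19)

def binaural(sound, shift):
    """Source on the left: the right-ear signal is a delayed copy."""
    right = np.concatenate([np.zeros(shift), sound])[:len(sound)]
    return sound, right

def measured_itd(left, right, max_lag=50):
    """Recover the ITD as the location of the cross-correlation peak (in seconds)."""
    lags = np.arange(-max_lag, max_lag + 1)
    xc = [np.dot(left[max_lag:-max_lag], np.roll(right, -l)[max_lag:-max_lag])
          for l in lags]
    return lags[np.argmax(xc)] / fs

# Two unpredictable sounds, same location: the signals differ, the structure persists.
for trial in range(2):
    left, right = binaural(rng.standard_normal(4096), shift)
    print(f"sound {trial+1}: measured ITD = {measured_itd(left, right)*1e6:.0f} us")
```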

Another point to notice is that this provides only a relative sense of space. That is, one can say whether a sound source is 20° to the left of another sound source, but this does not produce an absolute egocentric notion of space. What is lacking is a reference point. I will propose two ways to solve this problem.

What is special, for example, about a source that is in front of the observer? Gibson noted that, in vision, the direction in which the optic flow is constant indicates the direction of movement. Similarly, when one moves in the direction of a sound source, the direction of that sound source is unchanged, and therefore the binaural structure of sound is unchanged. In other words, the direction of a sound source in front is the direction of a self-generated movement that would leave the binaural structure unchanged (we could also extend this definition to the monaural spectral information). In fact, the binaural structure can depend on distance when the source is near, but this is a minor point: we can simply consider the direction in which the binaural structure changes minimally (see also the second way below).

One problem, however, is that moving toward a source and moving away from it both satisfy this definition. Although these two cases can be distinguished by head movements, the definition itself does not distinguish moving closer to the source from moving further away from it. One obvious remark is that moving toward a source increases the intensity of the sound. The notion of intensity here should be understood as a change in information content: in the same way that moving toward an object in vision increases the level of visual detail, moving toward a sound source increases the signal-to-noise ratio, and therefore the level of auditory detail available. This makes sense independently of the perceptual notion of loudness; in fact it is rather related to the notion of intelligibility (a side note: this is consistent with the fact that an auditory cue to distance is the ratio between direct and reverberated sound energy). Of course, because sounds are not persistent, the notion of a change in level is again weak: one needs to assume that the intensity of the sound persists. However, I do not think this is a critical problem, for even if intensity is variable, what is needed is only to observe how intensity at the ear correlates with self-generated movements. This is possible because self-generated movements are (or at least can be) independent of the intensity variations of the source.
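This two-step definition can be illustrated numerically. The sketch below (Python; the 2-D geometry, positions, step size, and the inverse-square intensity law are all illustrative assumptions) finds the headings that leave the source direction, and hence the binaural structure, unchanged, then uses the intensity change to tell "toward" from "away":

```python
import numpy as np

source = np.array([2.0, 1.0])     # source position (m), arbitrary
listener = np.array([0.0, 0.0])
step = 0.05                        # small self-generated step (m)

def azimuth(pos):
    """Direction of the source as seen from position `pos` (degrees)."""
    d = source - pos
    return np.degrees(np.arctan2(d[1], d[0]))

headings = np.arange(0.0, 360.0, 1.0)
moves = step * np.stack([np.cos(np.radians(headings)),
                         np.sin(np.radians(headings))], axis=1)
changes = np.array([abs(azimuth(listener + m) - azimuth(listener)) for m in moves])

best = headings[np.argmin(changes)]   # direction of the source...
opposite = (best + 180) % 360         # ...or directly away from it
print(f"direction unchanged when heading {best:.0f} or {opposite:.0f} deg "
      f"(true source direction: {azimuth(listener):.1f} deg)")

# Disambiguate with intensity (~1/r^2): moving toward the source raises the level.
for h in (best, opposite):
    m = step * np.array([np.cos(np.radians(h)), np.sin(np.radians(h))])
    r0, r1 = np.linalg.norm(source - listener), np.linalg.norm(source - (listener + m))
    print(f"heading {h:5.1f} deg: intensity changes by x{(r0 / r1) ** 2:.3f}")
```

Both candidate headings leave the direction essentially unchanged, but only one of them makes the sound louder (and more detailed): the one pointing at the source.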

This indeed seems to provide some intrinsic spatial content to sounds. But we note that it is quite indirect (compared to vision), and made weaker by the fact that sounds are not persistent. There is another, more direct way in which sounds can acquire spatial content: through the active production of sounds. For example, one can produce sounds by hitting objects. This provides a direct link between the spatial location of the object, relative to the body, and the auditory structure of the sound. Even though sounds are not persistent, they can be repeated. But we note that this can only apply to objects that are within reach.

This discussion shows that while there is no intrinsic spatial content about shape in sounds, there is intrinsic spatial content about source location. This seems to contradict the discussion at the beginning of this post, in which I pointed out that spatial auditory acuity is well predicted across species by visual acuity, suggesting that spatial content is acquired. Here is a possible way to reconcile these two viewpoints. In vision, an object in a specific direction relative to the observer projects light rays onto specific photoreceptors of the retina, so there is little ambiguity in vision about spatial location. In hearing, this is completely different: sounds coming from a particular direction are not captured by a specific receptor. Information about direction lies in the structure of the signals captured at the two ears. The difficulty is that this structure depends on the direction of the sound source but also on other, uncontrolled factors. For example, reflections, in particular early reflections, modify the binaural cues (Gourévitch & Brette 2012). These effects are deterministic but situation-dependent, which implies that there is no fixed mapping from binaural structure to spatial location. This makes auditory spatial content weaker, even though auditory spatial structure is rich. Because visual location is more invariant, it is perhaps not surprising that vision dominates hearing in localization tasks.

What is sound? (II) Sensorimotor contingencies

In a previous post, I emphasized the differences between vision and hearing from an ecological point of view. Here I want to comment on the sensorimotor theory of perception (O’Regan & Noë 2001), or the enactive approach, applied to sounds. According to this theory, perception is the implicit knowledge of the effects of self-generated movements on sensory signals. Henri Poincaré made this point a long time ago: "To localize an object simply means to represent to oneself the movements that would be necessary to reach it". For example, perceiving the spatial location of an object is knowing the movements one should make to move to that object, to grasp it, or to direct one’s fovea to it.

There are two implicit assumptions here: 1) that there is some persistence in the sensory signals, and 2) that the relevant information is spatial in nature. I will start with the issue of persistence. As I previously argued, a defining characteristic of sounds is that they are not persistent: they happen. For example, the sound of someone else hitting an object is transient; one cannot interact with it. So there cannot be any sensorimotor contingency in this experience. It could be argued that one relies on the memory of previous sensorimotor contingencies, that is, the memory of producing an impact sound oneself. This is a fair remark, I think, but it overestimates the amount of information there is in this contingency. When an impact sound is produced, the only relationships between motor commands and the acoustic signals are the impact timing and the sound level (related to the strength of the impact). But there is much more information in the acoustic signal of an impact sound, because the structure of this signal is related to properties of the sounding object, in particular material and shape (Gaver, 1993). For example, the resonant modes are informative of the shape, and the decay rates of these modes indicate the nature of the material (wood, metal, etc.), properties that we can very easily identify. So there is informative sensory structure independent of sensorimotor contingencies.
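In the spirit of Gaver's analysis, an impact sound can be sketched as a sum of exponentially damped sinusoids, where the mode frequencies relate to shape and size and the decay rates to material. A minimal synthesis sketch in Python (all frequencies, decay constants, and amplitudes are invented for illustration):

```python
import numpy as np

def impact_sound(mode_freqs, decay_rates, amps, fs=44100, dur=1.0):
    """Toy modal synthesis of an impact: a sum of damped sinusoids.
    mode_freqs  : resonant mode frequencies (Hz), related to shape/size
    decay_rates : per-mode decay rates (1/s), related to material damping
    amps        : per-mode amplitudes, set by where/how the object is struck"""
    t = np.arange(int(fs * dur)) / fs
    s = sum(a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
            for f, d, a in zip(mode_freqs, decay_rates, amps))
    return s / np.max(np.abs(s))

modes, amps = [400.0, 930.0, 1520.0], [1.0, 0.5, 0.3]
wood_like  = impact_sound(modes, decay_rates=[60.0, 90.0, 120.0], amps=amps)  # fast decay
metal_like = impact_sound(modes, decay_rates=[3.0, 4.0, 5.0], amps=amps)      # slow ringing
```

Same mode frequencies, different decay rates: the "shape" information is identical while the "material" changes, precisely the kind of sensory structure that exists independently of any sensorimotor contingency.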

Now I think we are hitting an interesting point. Even though the resonant modes are informative of the shape (the size, essentially) of an object, they cannot provide any perceptual spatial content by themselves. That is, the frequency of a resonant mode is just a number, and a number has no meaning without context. Compare with the notion of object size for the tactile system: the size of a (small) object is the extent to which one must stretch the hand to grasp it. There is no such thing in hearing; there is nothing intrinsically spatial in auditory size, it seems. If one moves and the sound is repeated, the same resonant modes will be excited. Therefore, it seems that auditory shape can only be a derived property. That is, the specific sensory structure of sounds that corresponds to shape acquires perceptual content by association with another sense that has intrinsic spatial content, i.e., vision or touch.

Now we get to Gibson’s notion of invariant structure: auditory size is the structure in the auditory signals that remains the same when aspects other than size change (where the notion of size is not auditory). Here I am imagining that one hears sounds produced by various sources whose size is known, and one can identify that some auditory structure is the same for all sources that have the same size. Note the important point: what persists is not the sensory signals, it is not the relationship between movements and sensory signals, it is not even the relationship between size and sensory signals; it is the relationship between size and the structure of auditory signals, which is a higher-order kind of relationship. That is, one cannot predict the auditory signals from the size: one can predict some aspect of the structure of these signals from the size.
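As a concrete instance of predicting structure (not signals) from size, consider the standard scaling law that, for geometrically similar objects of the same material, resonant frequencies scale inversely with linear size (a textbook idealization, not a claim from this post). A sketch:

```python
import numpy as np

reference_modes = np.array([400.0, 930.0, 1520.0])  # modes of a reference object (Hz)

def modes_for_size(size_ratio):
    """For geometrically similar objects of the same material,
    every resonant frequency scales as 1/size."""
    return reference_modes / size_ratio

for ratio in (0.5, 1.0, 2.0):
    print(f"size x{ratio}: modes = {np.round(modes_for_size(ratio))} Hz")
```

Knowing the size predicts this structural regularity (doubling the size shifts every mode down an octave) but nothing about the waveform itself, which depends on when, how, and whether the object is struck.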

Here I have highlighted the fact that the auditory shape of an object is a structure of auditory signals, not a kind of sensorimotor structure. The spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. But there can also be intrinsic spatial content in sounds, and in my next post, I will discuss spatial hearing.

What is sound? (I) Hearing vs. seeing

What is sound? Physically, sounds are mediated by acoustical waves. But vision is mediated by light waves and yet hearing does not feel like vision. Why is that?

There are two wrong answers to this question. The first is that the neural structures are different: sounds are processed by the cochlea and the auditory cortex, images by the retina and the visual cortex. But then why doesn’t a sound evoke some sort of image, as if hearing were a second visual system? This point of view explains little about perception, only about which brain areas “light up” when a specific type of stimulus is presented. The second is that the physical substrate is different: light waves vs. acoustic waves. This is also a weak answer, for what is fundamentally different between light waves and acoustic waves that would make them “feel” different?

I believe the ecological approach provides a more satisfying answer. By this, I am referring to the ecological theory of visual perception developed by James Gibson. It emphasizes the structure of sensory signals collected by an observer in an ecological environment. It is also related to the sensorimotor account of perception (O’Regan & Noë 2001), which puts the emphasis on the relationship between movements and sensory signals, but I will show below that this emphasis is less relevant in hearing (except in spatial hearing).

I will quickly summarize what vision is in Gibson’s ecological view. Illumination sources (the sun) produce light rays that are reflected by objects. More precisely, light is reflected at the interface between the surfaces of objects and the medium (air, or possibly water). What is available for visual perception are surfaces and their properties (color, texture, shape...). Both the illumination sources and the surfaces in the environment are generally persistent. The observer can move, and this changes the light rays received by the retina. But these changes are highly structured because the surfaces persist, and this structure is informative of the surfaces in the environment. Thus what the visual system perceives is the arrangement and properties of persistent surfaces. Persistence is crucial here, because it allows observers to use their own movements to learn about the world; in the sensorimotor account of perception, perception is precisely the implicit knowledge of the effect of one’s actions on sensory signals.

On the other hand, sounds are produced by the mechanical vibration of objects. This means that sounds convey information about volumes rather than surfaces; they depend on the shape but also on the material and internal structure of objects. It also means that what is perceived in sounds is the source of the waves rather than their interaction with the environment. Crucially, contrary to vision, the observer cannot directly interact with sound waves, because a sound happens; it is not persistent. An observer can produce a sound wave, for example by hitting an object, but once the sound is produced there is no possible further interaction with it. The observer cannot move around to analyze the structure of acoustic signals; the only available information is in the sound signal itself. In this sense, sounds are events.

These ecological observations highlight major differences between vision and hearing, which go beyond the physical basis of these two senses (light waves and acoustic waves). Vision is the perception of persistent surfaces. Hearing is essentially the perception of mechanical events involving volumes. These remarks are independent of the fact that vision is mediated by a retina and hearing by a cochlea.

The impact of early reflections on binaural cues

Boris Gourévitch and I have just published a paper on ecological acoustics:

Gourévitch B and Brette R (2012). The impact of early reflections on binaural cues. JASA 132(1):9-27.

This is a rather technical paper in which we investigate how binaural cues (interaural time and level differences, ITDs and ILDs) are modified in an ecological environment in which there are reflections. Indeed, most sound localization studies use HRTFs recorded in anechoic conditions, but apart perhaps from flying animals, anechoic conditions are highly unecological: even in free field, there is always at least a ground on which sound waves reflect. In this paper, we focus on early reflections. In the introduction, we motivate this choice by the fact that the precedence effect (the perceptual suppression of echoes) only acts when echoes arrive after a few ms, and therefore early reflections should not be suppressed. Another, perhaps simpler, argument is that in a narrow frequency band, a sound will always interfere with its echo when the echo arrives less than a couple of periods after the direct sound. Therefore, early reflections produce interferences, which are seen in the binaural cues. An important point is that these are deterministic effects, not random variability. In the paper, we analyze these effects quantitatively with models (rigid spheres and sophisticated models of sound absorption by the ground). One implication is that in ecological environments, even with a single sound source and in the absence of noise, there may be very large interaural time differences, which carry spatial information.
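To give a feel for the effect, here is a single-frequency toy example (Python; the path delays, echo gain, and frequency are invented, and head diffraction is ignored, so this is far cruder than the models in the paper): each ear receives the direct sound plus one attenuated, delayed ground reflection, and the ITD is read off the interaural phase difference.

```python
import numpy as np

def ear_phasor(f, t_direct, t_echo, echo_gain):
    """Narrow-band response at one ear: direct path + one ground reflection."""
    return np.exp(-2j * np.pi * f * t_direct) + echo_gain * np.exp(-2j * np.pi * f * t_echo)

f = 600.0                                  # narrow-band center frequency (Hz)
left  = ear_phasor(f, t_direct=3.0e-3, t_echo=3.8e-3, echo_gain=0.7)
right = ear_phasor(f, t_direct=3.3e-3, t_echo=4.3e-3, echo_gain=0.7)

ipd = np.angle(left * np.conj(right))      # interaural phase difference (rad)
itd = ipd / (2 * np.pi * f)                # phase-derived ITD (s)
print(f"ITD with ground reflections: {itd*1e3:+.2f} ms (anechoic: +0.30 ms)")
```

In this toy case, a 0.30 ms anechoic ITD collapses to about 0.02 ms at 600 Hz; at other frequencies the interference shifts the phase differently and, as we show in the paper, can instead produce ITDs much larger than the anechoic value.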