In my previous post, I argued that the spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. This happens even though sounds contain highly structured information about shape, because this structure does not relate to self-generated movements. One may then wonder whether the notion of auditory space in general, for example the spatial location of a sound source, is also secondary. One may postulate that the spatial content of auditory spatial cues is only acquired by their contingency with visual spatial cues. In fact, this idea is supported by an intriguing study showing a very strong correlation across species between visual acuity in the fovea and auditory spatial acuity (Heffner & Heffner, 1992, Fig. 6). More precisely, the authors show that sound localization acuity is better predicted by visual acuity than by acoustical factors (essentially, interaural distance). In our interpretation, animals have poor sound localization acuity not so much because they lack the physiological mechanisms to correctly analyze spatial information, but because in the absence of precise vision, auditory spatial cues cannot acquire precise spatial content. This does not imply that the auditory system of these animals cannot decode these spatial cues, but only that they cannot make sense of them. [Update: the results in Heffner & Heffner are in fact more subtle, see a more recent post]
This being said, there is in fact some intrinsic spatial content in sounds, which I will describe now. When a sound is produced, it arrives first at the ear closer to the source, then at the other ear. The intensity will also be higher at the first ear. This is the binaural structure of sounds produced by a single source, and captured by two ears that are spatially separated. This is similar to stereoscopic vision. But observe one difference: in vision, as Gibson noted, having two eyes is essentially the same thing as having one eye, combined with lateral head movements; in hearing, this is not the same because of the non-persistent nature of sounds. If one turns the head to sample another part of the “acoustic array” (in analogy with Gibson’s optic array), the sound field will have changed already (and possibly faded out), so the spatial structure will not be directly captured in the same way. Thus, to capture spatial structure in sound, it is crucial that acoustic signals are simultaneously captured at different locations.
This binaural structure in sounds is often described as “spatial cues” (binaural cues). Quantitatively, there is a relationship between the spatial location of the source and binaural structure, e.g. the interaural time difference (ITD). However, these “cues” are not intrinsically spatial in the sense that they are not defined in relationship to self-generated movements. For example, what is the spatial meaning of an ITD of 100 µs? Intrinsically, there is none. As discussed above, one way for spatial cues to acquire spatial content is by association, i.e., with the spatial content of another modality (vision). But now I will consider the effects of self-generated movements, that is, what is intrinsically spatial in sounds.
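To make the question about 100 µs concrete, here is a minimal sketch. It assumes a simplified far-field model in which the ITD equals (d / c) · sin(azimuth), with d the interaural distance and c the speed of sound; the model, the function names and the two head sizes are illustrative assumptions, not taken from the studies cited here. Under these assumptions, the same ITD corresponds to different directions for different heads, so the number by itself has no spatial meaning.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def itd_seconds(azimuth_deg, interaural_distance_m):
    """ITD under a simplified far-field model (assumption): (d / c) * sin(azimuth)."""
    return interaural_distance_m / SPEED_OF_SOUND * math.sin(math.radians(azimuth_deg))

def azimuth_for_itd(itd_s, interaural_distance_m):
    """Invert the model: azimuth (degrees) that yields the given ITD."""
    return math.degrees(math.asin(itd_s * SPEED_OF_SOUND / interaural_distance_m))

# Two hypothetical interaural distances: a small head and a larger one.
for d in (0.05, 0.18):
    azimuth = azimuth_for_itd(100e-6, d)
    print(f"d = {d * 100:.0f} cm: an ITD of 100 µs corresponds to {azimuth:.1f} degrees")
```

The mapping from ITD to direction depends on a parameter (here, head size) that the ITD itself does not carry, which is one way to see why the cue must get its spatial content from somewhere else.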
When the head turns, the binaural structure changes in specific ways. That is, there is a sensorimotor structure that gives spatial content to binaural structure. More precisely, two different binaural structures can be related to each other by a specific movement. But an important distinction with vision must be made. Because of the non-persistent nature of sounds, the relationship is not between movements and sensory signals, it is between movements and the structure of sensory signals. It is not possible to predict the auditory signals from auditory signals captured before a specific movement. For one thing, there might be no sound produced after the movement. What is predictable is the binaural structure of the sound, if indeed a sound is produced by a source that has a persistent location. If the location of the source is persistent, then the binaural structure is persistent, but not the auditory signals themselves.
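The distinction between predicting signals and predicting their structure can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the interaural lag follows a sine model of the source azimuth relative to the head, delays are whole samples, and `MAX_LAG`, the sign convention and all function names are hypothetical. Two unrelated noise bursts are heard before and after a 30° head turn. Nothing about the second waveform can be predicted from the first, but its interaural lag is the one predicted by the movement.

```python
import math
import random

MAX_LAG = 20  # hypothetical largest interaural lag, in samples (source fully to one side)

def predicted_lag(source_az_deg, head_az_deg):
    """Interaural lag (samples) under an assumed sine model of the relative azimuth."""
    return round(MAX_LAG * math.sin(math.radians(source_az_deg - head_az_deg)))

def ear_signals(source, lag):
    """Return (left, right); a positive lag delays the right ear, a negative one the left."""
    pad = [0.0] * abs(lag)
    delayed = pad + source[:len(source) - abs(lag)]
    return (source, delayed) if lag >= 0 else (delayed, source)

def estimate_lag(left, right):
    """Lag that maximizes the cross-correlation between the two ear signals."""
    def xcorr(lag):
        return sum(left[i] * right[i + lag]
                   for i in range(len(left)) if 0 <= i + lag < len(right))
    return max(range(-MAX_LAG, MAX_LAG + 1), key=xcorr)

rng = random.Random(0)
sound_before = [rng.gauss(0, 1) for _ in range(400)]  # heard before the head turn
sound_after = [rng.gauss(0, 1) for _ in range(400)]   # a different sound, heard after

source_az = 60.0  # the source direction stays fixed in the world
for head_az, sound in ((0.0, sound_before), (30.0, sound_after)):
    left, right = ear_signals(sound, predicted_lag(source_az, head_az))
    print(f"head at {head_az:.0f} degrees: predicted lag {predicted_lag(source_az, head_az)}, "
          f"measured lag {estimate_lag(left, right)}")
```

The measured lag is a property of the relationship between the two ear signals, so it can be checked against the movement even when the sound itself has changed.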
Another point we notice is that this provides only a relative sense of space. That is, one can say whether a sound source is 20° left of another sound source, but it does not produce an absolute egocentric notion of space. What is lacking is a reference point. I will propose two ways to solve this problem.
What is special, for example, about a source that is in front of the observer? Gibson noted that, in vision, the direction in which the optic flow is constant indicates the direction of movement. Similarly, when one moves in the direction of a sound source, the direction of that sound source is unchanged, and therefore the binaural structure of sound is unchanged. In other words, the direction of a sound source in front is the direction of a self-generated movement that would leave the binaural structure unchanged (we could also extend this definition to the monaural spectral information). In fact the binaural structure can depend on distance, when the source is near, but this is a minor point because we can simply state that we are considering the direction that makes binaural structure minimally changed (see also the second way below). One problem with this, however, is that moving to and moving away from a source both satisfy this definition. Although these two cases can be distinguished by head movements, this definition does not make a distinction between what is moving closer and what is moving further away from the source. One obvious remark is that moving to a source increases the intensity of the sound. The notion of intensity here should be understood as a change in information content. In the same way as in vision where moving to an object increases the level of visual detail, moving to a sound source increases the signal-to-noise ratio, and therefore the level of auditory detail available. This makes sense independently of the perceptual notion of loudness – in fact it is rather related to the notion of intelligibility (a side note: this is consistent with the fact that an auditory cue to distance is the ratio between direct sound energy and reverberated energy). Of course again, because sounds are not persistent, the notion of change in level is weak. One needs to assume that the intensity of the sound persists. 
However, I do not think this is a critical problem, for even if intensity is variable, what is needed is only to observe how intensity at the ear correlates with self-generated movements. This is possible because self-generated movements are (or at least can be) independent of the intensity variations of the sound.
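As a rough illustration of this argument, the sketch below simulates an observer taking small random steps along one axis while the source level fluctuates on its own. It assumes free-field inverse-square spreading (about 6 dB per halving of distance), a 10 m starting distance and a 2 dB fluctuation; these values and the function names are hypothetical choices for the example. Because the steps are independent of the fluctuations, the sign of the correlation between steps and level changes separates moving toward the source from moving away from it.

```python
import math
import random

def level_change_db(distance_m, step_m, source_fluctuation_db):
    """Change in received level after moving `step_m` toward the source (negative = away).

    Assumes free-field inverse-square spreading; the source's own level also changes.
    """
    return source_fluctuation_db + 20 * math.log10(distance_m / (distance_m - step_m))

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def movement_level_correlation(direction, n_trials=10000, seed=0):
    """Correlate self-generated steps with level changes at the ear.

    direction=+1: the tested axis points toward the source; -1: away from it.
    """
    rng = random.Random(seed)
    steps, changes = [], []
    for _ in range(n_trials):
        step = rng.uniform(-0.5, 0.5)      # self-generated, independent of the source
        fluctuation = rng.gauss(0.0, 2.0)  # uncontrolled source level change (dB)
        steps.append(step)
        changes.append(level_change_db(10.0, direction * step, fluctuation))
    return correlation(steps, changes)

print("axis toward the source:", movement_level_correlation(+1))
print("axis away from the source:", movement_level_correlation(-1))
```

The correlation is weak on any single step, since the fluctuation is larger than the effect of the movement, but it has a consistent sign over many movements.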
This indeed seems to provide some intrinsic spatial content to sounds. But we note that it is quite indirect (compared to vision), and made more evident by the fact that sounds are not persistent. There is another, more direct, way in which sounds can acquire spatial content: by the active production of sounds. For example, one can produce sounds by hitting objects. This provides a direct link between the spatial location of the object, relative to the body, and the auditory structure of the sound. Even though sounds are not persistent, they can be repeated. But we note that this can only apply to objects that are within reach.
This discussion shows that while there is no intrinsic spatial content about shape in sounds, there is intrinsic spatial content about source location. This seems to stand in contradiction with the discussion at the beginning of this post, in which I pointed out that spatial auditory acuity seems to be well predicted across species by visual acuity, suggesting that spatial content is acquired. Here is a possible way to reconcile these two viewpoints. In vision, an object at a specific direction relative to the observer will project light rays in that direction to the retina, which will be captured by specific photoreceptors. Therefore, there is little ambiguity in vision about spatial location. However, in hearing, this is completely different. Sounds coming from a particular direction are not captured by a specific receptor. Information about direction is in the structure of the signals captured at the two ears. The difficulty is that this structure depends on the direction of the sound source but also on other uncontrolled factors. For example, reflections, in particular early reflections, modify the binaural cues (Gourévitch & Brette 2012). These effects are deterministic but situation-dependent. This implies that there is no fixed mapping from binaural structure to spatial location. This makes the auditory spatial content weaker, even though auditory spatial structure is rich. Because visual location is more invariant, it is perhaps not surprising that it dominates hearing in localization tasks.