In my previous post, I argued that the spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. This happens even though sounds contain highly structured information about shape, because this structure does not relate to self-generated movements. One may then wonder whether the notion of auditory space in general, for example the spatial location of a sound source, is also secondary. One may postulate that the spatial content of auditory spatial cues is only acquired through their contingency with visual spatial cues. In fact, this idea is supported by an intriguing study showing a very strong correlation across species between visual acuity in the fovea and auditory spatial acuity (Heffner & Heffner, 1992, Fig. 6). More precisely, the authors show that sound localization acuity is better predicted by visual acuity than by acoustical factors (essentially, interaural distance). In our interpretation, animals have poor sound localization acuity not so much because they lack the physiological mechanisms to correctly analyze spatial information, but because in the absence of precise vision, auditory spatial cues cannot acquire precise spatial content. This does not imply that the auditory system of these animals cannot decode these spatial cues, but only that they cannot make sense of them. [Update: the results in Heffner & Heffner are in fact more subtle; see a more recent post]
This being said, there is in fact some intrinsic spatial content in sounds, which I will describe now. When a sound is produced, it arrives first at the ear closer to the source, then at the other ear. The intensity will also be higher at the first ear. This is the binaural structure of sounds produced by a single source, as captured by two spatially separated ears. It is similar to stereoscopic vision. But observe one difference: in vision, as Gibson noted, having two eyes is essentially the same thing as having one eye combined with lateral head movements; in hearing, this is not the same, because of the non-persistent nature of sounds. If one turns the head to sample another part of the “acoustic array” (in analogy with Gibson’s optic array), the sound field will have changed already (and possibly faded out), so the spatial structure will not be directly captured in the same way. Thus, to capture spatial structure in sound, it is crucial that acoustic signals are simultaneously captured at different locations.
This binaural structure in sounds is often described as “spatial cues” (binaural cues). Quantitatively, there is a relationship between the spatial location of the source and the binaural structure, e.g. the interaural time difference (ITD). However, these “cues” are not intrinsically spatial, in the sense that they are not defined in relation to self-generated movements. For example, what is the spatial meaning of an ITD of 100 µs? Intrinsically, there is none. As discussed above, one way for spatial cues to acquire spatial content is by association with the spatial content of another modality (vision). But now I will consider the effects of self-generated movements, that is, what is intrinsically spatial in sounds.
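For scale, the quantitative relationship between source azimuth and ITD can be sketched with the standard far-field approximation ITD ≈ d·sin(θ)/c. The sketch below is my own illustration, not part of the original argument, and the interaural distance is an assumed round value; under these numbers, an ITD of about 100 µs corresponds to roughly 10° azimuth, yet nothing in that number is spatial by itself:

```python
import math

SPEED_OF_SOUND = 343.0      # m/s, at roughly 20 degrees C
INTERAURAL_DISTANCE = 0.2   # m, an illustrative round value

def itd_far_field(azimuth_deg):
    """Far-field approximation: ITD = d * sin(azimuth) / c."""
    theta = math.radians(azimuth_deg)
    return INTERAURAL_DISTANCE * math.sin(theta) / SPEED_OF_SOUND

# an ITD of ~100 us corresponds to about 10 degrees under these parameters
for azimuth in (0, 10, 30, 90):
    print(f"{azimuth:3d} deg -> ITD = {itd_far_field(azimuth) * 1e6:6.1f} us")
```

The mapping is monotonic in the frontal hemifield, but it is a fact about acoustics and head geometry, not about the meaning of the cue for the listener.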
When the head turns, the binaural structure changes in specific ways. That is, there is a sensorimotor structure that gives spatial content to binaural structure. More precisely, two different binaural structures can be related to each other by a specific movement. But an important distinction with vision must be made. Because of the non-persistent nature of sounds, the relationship is not between movements and sensory signals, it is between movements and the structure of sensory signals. It is not possible to predict the auditory signals from auditory signals captured before a specific movement. For one thing, there might be no sound produced after the movement. What is predictable is the binaural structure of the sound, if indeed a sound is produced by a source that has a persistent location. If the location of the source is persistent, then the binaural structure is persistent, but not the auditory signals themselves.
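The point that movements predict the structure of the signals, not the signals themselves, can be made concrete in a toy far-field model (ITD = d·sin(θ)/c; the model and its parameters are my illustrative assumptions). Given the ITD before a head turn and the turn itself, the ITD afterwards is fully determined, whether or not the source happens to emit again:

```python
import math

SPEED_OF_SOUND = 343.0      # m/s
INTERAURAL_DISTANCE = 0.2   # m, illustrative

def itd(azimuth_deg):
    """Far-field ITD for a source at a given head-relative azimuth."""
    return INTERAURAL_DISTANCE * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND

def azimuth(itd_s):
    """Inverse mapping (frontal hemifield only; front-back ambiguity ignored)."""
    return math.degrees(math.asin(itd_s * SPEED_OF_SOUND / INTERAURAL_DISTANCE))

itd_before = itd(40.0)   # binaural structure before the movement
head_turn = 25.0         # self-generated rotation, in degrees

# If the source's *location* persists, the new binaural structure is
# determined by the old one plus the movement -- no sound needs to persist:
itd_predicted = itd(azimuth(itd_before) - head_turn)
itd_observed = itd(40.0 - head_turn)   # what a persistent source would yield
assert abs(itd_predicted - itd_observed) < 1e-12
```

The sensorimotor structure here is the lawful pairing (ITD before, movement) → (ITD after); the acoustic waveform itself appears nowhere in that law.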
Another point to notice is that this provides only a relative sense of space. That is, one can say whether a sound source is 20° to the left of another sound source, but it does not produce an absolute egocentric notion of space. What is lacking is a reference point. I will propose two ways to solve this problem.
What is special, for example, about a source that is in front of the observer? Gibson noted that, in vision, the direction in which the optic flow is constant indicates the direction of movement. Similarly, when one moves in the direction of a sound source, the direction of that source is unchanged, and therefore the binaural structure of the sound is unchanged. In other words, the direction of a sound source in front is the direction of a self-generated movement that leaves the binaural structure unchanged (we could also extend this definition to monaural spectral information). In fact the binaural structure can depend on distance when the source is near, but this is a minor point: we can simply consider the direction that makes the binaural structure minimally changed (see also the second way below). One problem, however, is that moving toward a source and moving away from it both satisfy this definition. Although these two cases could be distinguished by head movements, the definition itself does not tell which movement brings one closer to the source and which takes one further away. One obvious remark is that moving toward a source increases the intensity of the sound. The notion of intensity here should be understood as a change in information content. In the same way as moving toward an object in vision increases the level of visual detail, moving toward a sound source increases the signal-to-noise ratio, and therefore the level of auditory detail available. This makes sense independently of the perceptual notion of loudness; in fact it is rather related to the notion of intelligibility (a side note: this is consistent with the fact that an auditory cue to distance is the ratio between direct sound energy and reverberated energy). Of course, again, because sounds are not persistent, the notion of a change in level is weak: one needs to assume that the intensity of the sound persists.
However, I do not think this is a critical problem, for even if intensity is variable, what is needed is only to observe how intensity at the ear correlates with self-generated movements. This is possible because self-generated movements are (or at least can be) independent of the intensity variations of the sound.
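The side note above, the direct-to-reverberant energy ratio as a distance cue, can be sketched with a textbook idealization (my own illustration, not from the post): direct energy from a point source falls off as 1/r², while diffuse reverberant energy is roughly independent of distance, so their ratio indexes distance. The critical distance below, where the two energies are equal, is an assumed value:

```python
import math

CRITICAL_DISTANCE = 2.0   # m, where direct and reverberant energy are equal (illustrative)

def direct_to_reverberant_db(distance_m):
    """Direct energy ~ 1/r^2, diffuse reverberant energy ~ constant,
    so the ratio in dB is 20 * log10(r_c / r)."""
    return 20.0 * math.log10(CRITICAL_DISTANCE / distance_m)

# each halving of the distance adds about 6 dB of "auditory detail"
for r in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"{r:4.1f} m -> {direct_to_reverberant_db(r):+6.1f} dB")
```

This is consistent with the point in the text: what the approaching listener gains is not loudness per se but signal-to-noise ratio relative to the diffuse background.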
This indeed seems to provide some intrinsic spatial content to sounds. But we note that it is quite indirect (compared to vision), and made more tenuous by the fact that sounds are not persistent. There is another, more direct, way in which sounds can acquire spatial content: through the active production of sounds. For example, one can produce sounds by hitting objects. This provides a direct link between the spatial location of the object, relative to the body, and the auditory structure of the sound. Even though sounds are not persistent, they can be repeated. But we note that this can only apply to objects that are within reach.
This discussion shows that while there is no intrinsic spatial content about shape in sounds, there is intrinsic spatial content about source location. This seems to stand in contradiction with the discussion at the beginning of this post, in which I pointed out that spatial auditory acuity seems to be well predicted across species by visual acuity, suggesting that spatial content is acquired. Here is a possible way to reconcile these two viewpoints. In vision, an object in a specific direction relative to the observer projects light onto a specific part of the retina, where it is captured by specific photoreceptors. Therefore, there is little ambiguity in vision about spatial location. In hearing, however, this is completely different. Sounds coming from a particular direction are not captured by a specific receptor. Information about direction is in the structure of the signals captured at the two ears. The difficulty is that this structure depends on the direction of the sound source but also on other, uncontrolled factors. For example, reflections, in particular early reflections, modify the binaural cues (Gourévitch & Brette 2012). These effects are deterministic but situation-dependent. This implies that there is no fixed mapping from binaural structure to spatial location. This makes auditory spatial content weaker, even though auditory spatial structure is rich. Because visual location is more invariant, it is perhaps not surprising that it dominates hearing in localization tasks.
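How a reflection distorts a binaural cue can be illustrated with a minimal phasor model for a pure tone (my own sketch, far simpler than the measurements in Gourévitch & Brette 2012; all numbers are illustrative): adding a single attenuated, delayed copy of the signal at one ear shifts the interaural phase, and hence the ITD that this phase implies.

```python
import cmath
import math

FREQ = 500.0        # tone frequency in Hz, illustrative
ITD_TRUE = 300e-6   # anechoic interaural delay in seconds, illustrative

def ear_phasor(path_delay, reflection_delay=None, reflection_gain=0.7):
    """Complex amplitude of the tone at one ear: the direct path, plus an
    optional attenuated early reflection arriving reflection_delay later."""
    p = cmath.exp(-2j * math.pi * FREQ * path_delay)
    if reflection_delay is not None:
        p += reflection_gain * cmath.exp(
            -2j * math.pi * FREQ * (path_delay + reflection_delay))
    return p

def phase_itd(left, right):
    """ITD implied by the interaural phase of the tone."""
    return cmath.phase(left / right) / (2 * math.pi * FREQ)

anechoic = phase_itd(ear_phasor(0.0), ear_phasor(ITD_TRUE))
# a 1.2 ms early reflection reaching the left ear only:
echoic = phase_itd(ear_phasor(0.0, reflection_delay=1.2e-3), ear_phasor(ITD_TRUE))
# anechoic recovers the 300 us delay; the reflection biases the implied ITD
# by more than 100 us, although the source has not moved
```

Since the bias depends on the reflection's delay and gain, i.e. on the room, there is indeed no fixed mapping from the measured cue back to source location.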
I like the idea that spatial audition should be acquired in relation to visual spatial understanding. More generally, my own work suggests that our understanding of one sensory modality is always and necessarily determined by the other sensory modalities. I think what we mean by space is something which is determined across all the modalities we have access to.
Nevertheless I'm a bit skeptical about the Heffner result. In near space, even blind people can do spatial localization with their limbs, so this would allow auditory localisation to develop through the tactile sense. If what Romain says, referring to Heffner et al., is true, then you would expect blind people to have poor auditory acuity...?? Is this found?
I think an important issue is: what do you mean by space in the auditory domain? There are different aspects of space: there is shape, size, spatial acuity (between two auditory objects), localisation with respect to the body,... among others. I'm not sure which, or how, these different aspects of space should be linked with visual acuity. Visual acuity is not strongly linked to localisation, it seems to me.
Concerning what you say is an essential difference with vision, namely the fact that sounds do not persist: I'm not sure. I would have thought that though a particular sound does not persist, sound emitted by the same animal, or the same source, persist in their localisation. So the spatial localisation of a sound is something that persists, despite changes in the sound itself. (Ah, reading further I see you do say something like this!)
Concerning the idea that vision is more easily linked with space because of the structure of our visual sensors: I'm not so sure. Distance is confounded with size on the retina. Position in space is confounded by eye and body movements. Shape is not so obviously extractable from the retinal image, as is well shown by the failure of visual object recognition algorithms: object shape is confounded with shadow and texture cues and is difficult to distinguish from the background information. What you say about the ambiguity of the auditory array caused by, for example, reflections, also applies to the visual array, where lighting and inter-reflections affect lightness and color in an important way.
(N.B. in an article with Philipona, we attempted to show, taking the example of a "multimodal rat" with whiskers, eyes and ears, how spatial cues can be extracted independently of sensory modality: Philipona, D., O'Regan, J. K., Nadal, J.-P. & Coenen, O. J-M. D. (2004). Perception of the structure of the physical world using unknown multimodal sensors and effectors. Advances in Neural Information Processing Systems, 16, 945-952.)
Your remark about blind people is very interesting. I would say it is probably not true that blind people have poorer spatial acuity, but I don't know for sure (and there is the notion of absolute localization vs. discrimination, which is quite a different problem). But this is actually not what Heffner is concluding. He offers a more evolutionary explanation, in which the acuity of spatial hearing matches visual acuity. In this case, you would not expect blind people to have poorer auditory spatial acuity. But in any case, I also think that there actually is intrinsic spatial content in sounds (as I argue in the rest of the post), so I am not strongly advocating that auditory space is only acquired through vision.
About the notion of auditory space: you are absolutely right that there are in fact different notions or tasks. I was only referring to spatial location (assuming a point source, i.e., with negligible size), mainly in the sense of Poincaré: the movements that one would need to make in order to reach the source. But for the cocktail party problem, this is not the relevant notion, because what you want to do is to separate sound sources, not localize them.
In the comparison between vision and hearing, you are mentioning ambiguities that also arise in vision. Although this is true, part of these ambiguities disappear if you consider movements or combine different cues. You could say that position in space is confounded by eye and body movements, but you have information about your own movements, so this is a source of complexity rather than of ambiguity. I am not saying that it is an easy task, but at least the information is potentially there.
In hearing, on the other hand, the point I want to make is that the structure of auditory signals is also affected by factors that you cannot control or have information about (I am assuming that you do not use visual information). For example, in a room, you do not know the absorption properties of the walls and the ground, but these affect the ITD (I address this in a recent paper: http://audition.ens.fr/brette/papers/GourevitchBrette2012.html). It is a source of ambiguity that cannot be resolved. Lighting, on the other hand, has a great effect on retinal images, but it still leaves a large part of the visual structure unchanged (e.g. the retinal location of edges). Finally, there is still the issue of non-persistence. It is very rare that a source in the environment produces a continuous stream of acoustic waves, so in this sense it is much easier to "experiment" with visual space than with auditory space.