An intriguing fact about the pitch of tones is that we tend to describe it using spatial characteristics such as “high” and “low”. In the same way, we speak of a rising intonation when the pitch increases. A sequence of notes with increasing frequency played on a piano is described as going “up” (even though it is going right on a piano, and going down on a guitar). Yet there is nothing intrinsically spatial in the frequency of a tone. Why do we use these spatial attributes? An obvious possibility is that it is purely cultural: “high” and “low” are just arbitrary words that we happen to use to describe these characteristics of sounds. However, the following observations should be made:
- We use the terms low and high, which are also used for spatial height, and not specific words such as blurp and zboot. But we don’t use spatial words for colors and odors. Instead we use specific words (red, green) or sometimes words used for other senses (a hot color). Why use space and not something else?
- All languages seem to use more or less the same type of words.
- In an experiment done in 1930 by Carroll Pratt (“The spatial character of high and low tones”), subjects were asked to locate tones of various frequencies on a numbered scale running from the floor to the ceiling. The tones were presented through a speaker behind a screen, placed at a random height. It turned out that the judgment of spatial height made by subjects was very consistent, but was entirely determined by tone frequency rather than actual source position. High frequency tones were placed near the ceiling, low frequency tones near the floor. The result was later confirmed in congenitally blind persons and in young children (Roffler & Butler, JASA 1968).
Thus, there is some support for the hypothesis that tones are perceived to have a spatial character, which is reflected in language. But why? Here I will just speculate wildly and make a list of possibilities.
1. Sensorimotor hypothesis related to vocal production: when one makes sounds (sings or speaks), sounds of high pitch are felt to be produced higher than low pitch sounds. This could be related to the spatial location of tactile vibrations on the skin depending on fundamental frequency or timbre. Professional singers indeed use spatial words to describe where the voice “comes from” (which has no physical basis as such). This could be tested by measuring skin vibrations. In addition, congenitally mute people would be expected to show different patterns of tone localization.
2. Natural statistics: high frequency sounds tend to come from sources that are physically higher than low frequency sounds. For example, steps on the ground tend to produce low frequency sounds. Testing this hypothesis would require an extensive collection of natural recordings tagged with their spatial position. But note that the opposite trend is true for sounds produced by humans: adults have a lower voice than children, who are physically shorter.
3. Elevation-dependent spectral cues: to estimate the elevation of a sound source, we rely on spectral cues introduced by the pinnae. Indeed the convolutions of the pinnae introduce elevation-dependent notches in the spectrum. By association, the frequency of a tone would be associated with the spectral characteristics of a particular elevation. This could be tested by doing a tone localization experiment and comparing with individual head-related transfer functions.
When one thinks of sounds, the image that comes to mind is a speaker playing back a sound wave, which travels through air to the ears of the listener. But not all sounds are like that. I will give two examples: head scratching and footsteps.
When you scratch your head, a sound is produced that travels in the air to your ears. But there is another pathway: the sound is actually produced by the skull and the skin, and it propagates through the skull directly to the inner ear. This is called “bone conduction”. A lot of the early work on this subject was done by von Békésy (see e.g. Hood, JASA 1962). Normally, bone conduction represents a negligible part of sounds that we hear. When an acoustical wave reaches our head, the skull is put in vibration and can transmit sounds directly to the inner ear by bone conduction. But because of the difference in acoustical impedance between air and skin, the wave is very strongly attenuated, on the order of 60 dB according to these early works. It is actually the function of the middle ear to match these two impedances.
But in the case of head scratching, the sound is actually already produced on the skull, so it is likely that a large proportion of the sound is transmitted by bone conduction, if not most of it. This implies that sound localization cues (in particular binaural cues) are completely different from those of airborne sounds. For example, sound propagates faster (as in water) and there are resonances. Cues might also depend on the position of the jaw. There is a complete set of binaural cues that are specific to the location of scratching on the skull, and which are directly associated with tactile cues. To my knowledge, this has not been measured. This also applies to chewing sounds, and to the sound of one’s own voice. In fact, it is thought that one’s own voice sounds higher when it is played back because our perception of our own voice relies on bone conduction, which transmits lower frequencies better than higher frequencies.
Let us now turn to footsteps. A footstep is a very interesting sound, not to mention the multisensory information it carries. When the ground is impacted, an airborne sound is produced, coming from the location of the impact. However, the ground is not a point source. Therefore when it vibrates, the sound comes from a larger piece of material than just the location of the impact. This produces binaural cues that are unlike those of sounds produced by a speaker. In particular, the interaural correlation is lower for larger sources, and you would expect that the frequency-dependence of this correlation depends on the size of the source (the angular width, from the perspective of the listener).
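To make the claim about interaural correlation concrete, here is a minimal simulation sketch. It rests on several assumptions that are mine and not measurements: the extended source is modelled as a row of independent white-noise emitters spread evenly over an angular width, each emitter only contributes a pure interaural delay following the simplified law ITD = max_itd × sin(azimuth), and there is no head filtering or reverberation. The function name and all parameter values are hypothetical.

```python
import numpy as np

def interaural_correlation(width_deg, n_sources=50, fs=44100, duration=0.2,
                           max_itd=650e-6, seed=0):
    """Peak normalized cross-correlation between the two ear signals for a
    source made of independent noise emitters spread over width_deg degrees
    around the midline (toy model, delays only)."""
    rng = np.random.default_rng(seed)
    n = int(fs * duration)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    azimuths = np.radians(np.linspace(-width_deg / 2, width_deg / 2, n_sources))
    left = np.zeros(n)
    right = np.zeros(n)
    for az in azimuths:
        spectrum = np.fft.rfft(rng.standard_normal(n))
        itd = max_itd * np.sin(az)  # assumed simplified ITD law
        # Split the delay symmetrically: +itd/2 on the left, -itd/2 on the right.
        left += np.fft.irfft(spectrum * np.exp(-1j * np.pi * freqs * itd), n)
        right += np.fft.irfft(spectrum * np.exp(+1j * np.pi * freqs * itd), n)
    # Circular cross-correlation computed in the frequency domain.
    xcorr = np.fft.irfft(np.fft.rfft(left) * np.conj(np.fft.rfft(right)), n)
    return np.max(xcorr) / np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))

narrow = interaural_correlation(10)
wide = interaural_correlation(120)
```

In this toy model a zero-width source gives a correlation of 1, and the value drops as the width grows. Band-pass filtering the ear signals before correlating would be one way to look at the frequency dependence mentioned above.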
When you walk in a noisy street, you may notice that you can hear your own footsteps but not those of other people walking next to you, even though the distance to the feet might be similar. Why is that? In addition to the airborne sound, your entire skeleton vibrates. This implies that there should be a large component of the sound that you hear that is in fact coming from bone conduction through your body. Again these sounds should have quite peculiar binaural cues, in addition to having stronger low frequencies. In particular, there should be a different set of cues for the left foot and for the right foot.
You might also hear someone else’s footsteps. In this case there is of course the airborne sound, but there is also another pathway, through the ground. Through this other pathway, the sound reaches your feet before the airborne sound, because sound propagates much faster in a solid substance than in air. Depending on the texture of the ground, higher frequencies would also be more attenuated. In principle, this vibration in your feet (perhaps if you are barefoot) will then propagate through your body to your inner ear. But it is not so clear how strong this bone conducted sound might be. Clearly it should be much softer than for your own footstep, since in that case there is an impact on your skeleton. But perhaps it is still significant. In this case, there are again different binaural cues, which should depend on the nature of the ground (since this affects the speed of propagation).
In the same way, sounds made by touching or hitting an object might also include a bone conducted component. It will be quite challenging to measure these effects, since ideally one should measure the vibration of the basilar membrane. Indirect methods might include: measurements on the skull (to have an idea of the magnitude), psychoacoustic methods using masking sounds, measuring otoacoustic emissions, electrophysiological methods (cochlear microphonics, ABR).
In a previous post, I argued that some artificial sounds might be wrongly presented as if they were not natural, because ecological environments are complex and so natural sounds are diverse. But what if they were actually not natural? Perhaps these particular sounds can be encountered in a natural environment, but there might be other sounds that can be synthesized and heard but that are never encountered in nature.
Why exactly do we care about this question? If we are interested in knowing whether these sounds exist in nature, it is because we hypothesize that they acquire a particular meaning that is related to the context in which they appear (e.g. a binaural sound with a large ITD is produced by a source located on the side of the leading ear). This is a form of objectivism: it is argued that if we subjectively lateralize a binaural sound with a 10 ms ITD to the right, it is because in nature, such a sound would actually be produced by a source located on the right. So in fact, what we are interested in is not only whether these sounds exist in nature, but also whether we have encountered them in a meaningful situation.
So have we previously encountered all the sounds that we subjectively localize? Certainly this cannot be literally true, for a new sound (e.g. in a new acoustical environment) could then never be localized. Therefore there must be some level of extrapolation in our perception. It cannot be that what we perceive is a direct reflection of the world. In fact, there is a relationship between this point and the question of inductivism in philosophy of science. Inductivism is the position that a scientific theory can be deduced from the facts. But this cannot be true, for a scientific theory is a universal statement about the world, and no finite set of observations can imply a universal statement. No scientific theory is ever “true”: rather, it agrees with a large body of data collected so far, and it is understood that any theory is bound to be amended or changed for a new theory at some point. The same can be said about perception, for example sound localization: given a number of past observations, a perceptual theory can be formed that relates some acoustical properties and the spatial location of the source. This implies that there should be sounds that have never been heard but that can still be associated with a specific source location.
Now we reach an interesting point, because it means that there may be a relationship between phenomenology and biology. When sounds are presented that deviate from the set of natural sounds, their perceived quality says something about the perceptual theory that the individual has developed. This provides some credit to the idea that the fact we lateralize binaural sounds with large ITDs might say something about the way the auditory system processes binaural sounds – but of course this is probably not the best example since it may well be in agreement with an objectivist viewpoint.
Some types of artificial sounds presented through headphones are sometimes described as not natural, in the sense that they have binaural relationships that sounds in a natural environment do not have. In general, this qualification refers to the qualities of point sources in an anechoic environment, but real environments reflect sounds and there are also more complex sound sources. I will discuss two types of “unnatural” binaural sounds.
1) Binaural noise with long interaural delays (ITD). In an anechoic environment, the ITD of a sound can reach 600-700 µs in high frequency for humans, and perhaps up to 800-900 µs in low frequency. Yet if we listen to binaurally delayed noise through headphones with an ITD of about 10 ms, we hear a single source, lateralized to one side. When the ITD is increased, starting from 0 µs, perceived lateralization progressively increases up to about 1 ms or a bit less, then reaches a plateau. We hear two separate noises only when the ITD is larger than about 10 ms. This is surprising because 10 ms is much larger than the maximal ITD that can be produced by a single sound source in an anechoic environment. However, let us consider a situation where there is an acoustically reflecting surface, a vertical wall, which is on our left side, about two meters away. A sound source is far on the opposite side. In this case, the right ear receives the direct wave from the source and the left ear receives the reflected wave. It follows that the ITD is about 10 ms. In addition, the direction of the sound source is consistent with the headphone experiments. Therefore, large ITDs may not be unlikely in natural environments, even with a simple point sound source.
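The orders of magnitude in this example can be checked with a short calculation. This is only a sketch under assumptions of mine: the head is a rigid sphere of radius 8.75 cm, the anechoic ITD follows Woodworth's high-frequency formula, the speed of sound is 343 m/s, the source is far away on the right, and the reflected wave has to travel from the head to the wall and back, so that the extra path is about twice the wall distance.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 °C (assumed)

def woodworth_itd(azimuth_deg, head_radius=0.0875):
    """High-frequency ITD (s) of a distant source for a rigid spherical head,
    using Woodworth's formula (a / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return head_radius / SPEED_OF_SOUND * (theta + math.sin(theta))

def reflection_delay(wall_distance):
    """Extra travel time (s) of a wave that passes the head, hits a wall
    wall_distance metres further on, and comes back to the head."""
    return 2 * wall_distance / SPEED_OF_SOUND

max_anechoic_itd = woodworth_itd(90)   # about 0.66 ms
wall_itd = reflection_delay(2.0)       # about 11.7 ms
```

With these assumed values, the reflection off a wall two meters away adds a delay more than ten times larger than the largest anechoic ITD, which is the point of the example.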
2) Uncorrelated binaural noise. If acoustical noise is the sum of many independent sound sources, one would expect that the signals at the two ears are correlated at low frequencies, that is, when the period is large compared to the maximum ITD of the sound sources. Are there sounds in an ecological environment that are binaurally uncorrelated? I would suggest the following situation. You are riding a bicycle, and you feel the wind in your face and in your ears. The sound of the wind in the ears does not feel localized anywhere else than at the ears. There is also little reason to believe that the pressure of the air is highly correlated at the two ears, except perhaps at very low frequency, although this should be measured. In fact, this acoustical situation correlates with mechanical pressure on the ears, which can be captured by tactile receptors. I would suggest that this type of sound is perceptually localized at the ears because of this ecological association.
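The expectation that many independent sources give correlated ear signals at low frequencies can be illustrated with the textbook result for a diffuse field: for two points a distance d apart in free space, the coherence is sin(kd)/(kd), with k = 2πf/c. This is a sketch only. It ignores the head entirely, and the ear distance of 18 cm is an assumed value.

```python
import numpy as np

def diffuse_field_coherence(freq_hz, ear_distance=0.18, c=343.0):
    """Coherence sin(kd)/(kd) between two points in an ideal diffuse field
    (free-field approximation, no head)."""
    kd = 2 * np.pi * np.asarray(freq_hz, dtype=float) * ear_distance / c
    # np.sinc(x) is sin(pi x) / (pi x), hence the division by pi.
    return np.sinc(kd / np.pi)

freqs = np.array([50.0, 100.0, 500.0, 2000.0])
coherence = diffuse_field_coherence(freqs)
```

Under these assumptions the coherence stays close to 1 up to a few hundred Hz and becomes small in magnitude in the kHz range. Wind noise generated locally at each ear is a different situation, which is why it might be uncorrelated down to lower frequencies.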
In summary, acoustical situations are considerably diverse in ecological environments, and therefore there might be fewer “unnatural” sounds than often assumed.
In this post, I want to come back to a remark I made in a previous post, on the relationship between vision and spatial hearing. It appears that my account of the comparative study of Heffner and Heffner (Heffner & Heffner, 1992) was not accurate. Their findings are in fact even more interesting than I thought. They find that sound localization acuity across mammalian species is best predicted not by visual acuity, but by the width of the field of best vision.
Before I comment on this result, I need to explain a few details. Sound localization acuity was measured behaviorally in a left/right discrimination task near the midline, with broadband sounds. The authors report this discrimination threshold for 23 mammalian species, from gerbils to elephants. They then try to relate this value to various other quantities: the largest interaural time difference (ITD), which is directly related to head size; visual acuity (highest angular density of retinal cells); whether the animals are predators or prey; and the field of best vision. The latter quantity is defined as the angular width of the retina in which angular cell density is at least 75% of the highest density. So this quantity is directly related to the inhomogeneity of cell density in the retina.
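The definition of the field of best vision can be written down in a few lines. The sketch below assumes that cell density is sampled along a single horizontal meridian and that the region above threshold is one contiguous interval. The Gaussian density profile is purely hypothetical and is not taken from the study.

```python
import numpy as np

def field_of_best_vision(angles_deg, density, fraction=0.75):
    """Angular width (deg) of the region where density is at least
    `fraction` of its maximum (assumes one contiguous region)."""
    angles = np.asarray(angles_deg, dtype=float)
    density = np.asarray(density, dtype=float)
    above = angles[density >= fraction * density.max()]
    return above.max() - above.min()

# Hypothetical retina with a pronounced central peak (sigma = 10 degrees).
angles = np.linspace(-90, 90, 1801)
peaked = np.exp(-angles ** 2 / (2 * 10.0 ** 2))
width_peaked = field_of_best_vision(angles, peaked)       # roughly 15 degrees

# Hypothetical perfectly homogeneous retina: the whole sampled range qualifies.
width_flat = field_of_best_vision(angles, np.ones_like(angles))
```

A sharply peaked density gives a narrow field of best vision, and a homogeneous retina gives a wide one, which is the sense in which this quantity measures inhomogeneity.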
The results of the comparative study are not straightforward (I find). Let us consider a few hypotheses. One hypothesis goes as follows. Sound localization acuity is directly related to the temporal precision of firing of auditory nerve fibers. If this precision is similar for all mammals, then this should correspond to a constant ITD threshold. In terms of angular threshold, sound localization acuity should then be inversely proportional to the largest ITD, and to head size. The same reasoning would go for intensity differences. Philosophically speaking, this corresponds to the classical information-processing view of perception: there is information about sound direction in the ITD, as reflected in the relative timing of spikes, and so sound direction can be estimated with a precision that is directly related to the temporal precision of neural firing. As I have argued many times in this blog, the flaw in the information-processing view is that information is defined with respect to an external reference (sound direction), which is accessible for an external observer. Nothing in the spikes themselves is about space: why would a difference in timing between two specific neurons produce a percept of space? It turns out that, of all the quantities the authors looked at, largest ITD is actually the worst predictor of sound localization acuity. Once the effect of best field of vision is removed, it is essentially uncorrelated (Fig. 8).
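The first hypothesis makes a quantitative prediction that is easy to spell out. As a sketch, assume the simplified law ITD = max_itd × sin(angle) near the midline and a fixed ITD threshold shared by all species. The 10 µs threshold and the two maximum ITDs below are illustrative values of mine, not data from the study.

```python
import math

def angular_threshold_deg(itd_threshold, max_itd):
    """Smallest detectable angle off the midline (deg), assuming
    ITD = max_itd * sin(angle) and a fixed ITD threshold."""
    return math.degrees(math.asin(itd_threshold / max_itd))

large_head = angular_threshold_deg(10e-6, 650e-6)  # under 1 degree
small_head = angular_threshold_deg(10e-6, 130e-6)  # about five times larger
```

For small angles the threshold is close to itd_threshold / max_itd radians, so dividing the largest ITD by five multiplies the predicted angular threshold by about five. This is the inverse proportionality that the data fail to show.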
A second hypothesis goes as follows. The auditory system can estimate the ITD of sounds, but to interpret this ITD as the angle of the sound source requires calibration (learning), and this calibration requires vision. Therefore, sound localization acuity is directly determined by visual acuity. At first sight, this could be compatible with the information processing view of perception. However, the sound localization threshold is determined in a left/right localization task near the midline, and in fact this task does not require calibration. Indeed, one only needs to know which of the two ITDs is larger. Therefore, in the information-processing view, sound localization acuity should still be related to the temporal precision of neural “coding”. To make this hypothesis compatible with the information-processing view requires an additional evolutionary argument, which goes as follows. The sound localization system is optimized for a different task, absolute (not relative) localization, which requires calibration with vision. Therefore the temporal precision of neural firing, or of the binaural system, should match the required precision for that task. The authors find again that, once the effect of best field of vision is removed, visual acuity is essentially uncorrelated with sound localization acuity (Fig. 8).
Another evolutionary hypothesis could be that sound localization acuity is tuned for the particular needs of the animal. So a predator, like a cat, would need a very accurate sound localization system to be able to find prey that is hiding. A prey animal would probably not require such high accuracy to be able to escape from a predator. An animal that is neither prey nor predator, like an elephant, would also not need high accuracy. It turns out that the elephant has one of the lowest localization thresholds of all mammals. Again there is no significant correlation once the best field of vision is factored out.
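To clarify what "factoring out" a variable means, here is a generic partial correlation sketch: regress both quantities on the third one and correlate the residuals. I do not know the exact statistical procedure used in the paper, so this is an illustration of the idea rather than a reproduction of their analysis, and the data below are synthetic numbers generated for the example.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic example: x and y are related only through their common dependence on z.
rng = np.random.default_rng(0)
z = rng.standard_normal(2000)
x = z + 0.5 * rng.standard_normal(2000)
y = z + 0.5 * rng.standard_normal(2000)

raw = np.corrcoef(x, y)[0, 1]          # strong apparent correlation
partial = partial_correlation(x, y, z)  # close to zero
```

In this synthetic case the raw correlation is high even though the two variables have no direct link, and the partial correlation reveals it. This is the pattern reported for largest ITD, visual acuity and predator/prey status once the field of best vision is taken into account.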
In this study, it appears rather clearly that the single quantity that best predicts sound localization acuity is the width of the best field of vision. First of all, this goes against the common view of the interaction between vision and hearing. According to this view, the visual system localizes the sound source, and this estimation is used to calibrate the sound localization system. If this were right, we would rather expect that localization acuity corresponds to visual acuity.
In terms of function, the results suggest that sound localization is used by animals to move their eyes so that the source is in the field of best vision. There are different ways to interpret this. The authors seem to follow the information-processing view, with the evolutionary twist: sound localization acuity reflects the precision of the auditory system, but that precision is adapted for the function of sound localization. One difficulty with this interpretation is that the auditory system is also involved in many other tasks that are unrelated to sound localization, such as sound identification. Therefore, only the precision of the sound localization system should be tuned to the difficulty of the task, for example the size of the medial superior olive, which is involved in the processing of ITDs. However, when thinking of intensity rather than timing differences, this view seems to imply that the precision of encoding of monaural intensities should be tuned to the difficulty of the binaural task.
Another difficulty comes from studies of vision-deprived or blind animals. There are a few of them, which tend to show that sound localization acuity actually tends to get better. This could not occur if sound localization acuity reflected genetic limitations. The interpretation can be saved by replacing evolution by development. That is, the sound localization system is tuned during development to reach a precision appropriate for the needs of the animal. For a sighted animal, these needs would be moving the eyes to the source, but for a blind animal it could be different.
An alternative interpretation that rejects the information-processing view is to consider that the meaning of binaural cues (ITD, ILD) can only come from what they imply for the animal, independently of the “encoding” precision. For a sighted animal, observing a given ITD would imply that moving the eyes or the head by a specific angle would put a moving object in the best field of view. If perceiving direction is perceiving the movement that must be performed to put the source in the best field of view, then sound localization acuity should correspond to the width of that field. For a blind animal, the connection with vision disappears, and so binaural cues must acquire a different meaning. This could be, for example, the movements required to reach the source. In this case, sound localization acuity could well be better than for a sighted animal.
In more operational terms, learning the association between binaural cues and movements (of the eyes or head) requires a feedback signal. In the calibration view, this feedback is the error between the predicted retinal location of the sound source and the actual location, given by the visual system. Here the feedback signal would rather be something like the total amount of motion in the visual field, or its correlation with sound, a quantity that would be maximized when the source is in the best field of vision. This feedback is more like a reward than a teacher signal.
Finally, I suggest a simple experiment to test this hypothesis. Gerbils have a rather homogeneous retina, with a best field of vision of 200°. Accordingly, their sound localization threshold is large, about 27°. The hypothesis would predict that, if gerbils were raised with an optical system (glasses) that creates an artificial fovea (one that enlarges a central part of the visual field), then their sound localization acuity should improve. Conversely, for an animal with a small field of best vision (cats), using an optical system that magnifies the visual field should degrade sound localization acuity. Finally, in humans with corrected vision, there should be a correlation between the type of correction and sound localization acuity.
This discussion also raises two points I will try to address later:
- If sound localization acuity reflects visual factors, then it should not depend on properties of the sound, as long as there are no constraints in the acoustics themselves (e.g. a pure tone may provide ambiguous cues).
- If sound localization is about moving the eyes or the head, then how about the feeling of distance, and other aspects of spatial hearing?
When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.
But why is it so? The first answer is: sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences but not time differences. They also don’t reproduce the diffraction by the head, which one can reproduce using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can it be so, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.
It could be the reason, but even if it’s true, it still leaves an open question: why would sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel sounds inside the head, not just inside the body. And finally, is it actually true that there are no sounds coming from inside the head? In fact, not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.
The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears, that is, to the head: the part of the body from the top of the neck up, excluding the jaws. This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious how we should experience this feel as sounds being produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel by e.g. touching one’s head or chewing. But these are sound sources localized on the surface of the head, or the inner surface of the mouth, not exactly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.
I will come back to speech later, but I will finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs and the subject’s head is held fixed, it should sound externalized; or natural sounds should feel inside the head until one turns her head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated with the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving her head, this hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when presenting sounds with such features, it is inferred that ITDs and IIDs should be unaffected by movements, which is to say that sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and not IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musical recordings).
The remarks above would imply the following facts:
- When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
- When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
- When the subject is allowed to move, sounds should feel (perhaps after a while) inside the head.
- When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.
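The sensorimotor relationship behind these predictions can be sketched in a few lines. The model is deliberately crude and the names are hypothetical: only the ITD is rendered, it follows the assumed law ITD = max_itd × sin(angle), and head movement is reduced to a yaw angle. Without head tracking the rendered ITD ignores the head, as with ordinary earphones. With head tracking it depends on the angle of the source relative to the head.

```python
import math

def rendered_itd(source_az_deg, head_yaw_deg, head_tracking, max_itd=650e-6):
    """ITD (s) delivered through earphones for a virtual source at
    source_az_deg (world coordinates) when the head is turned by head_yaw_deg."""
    if head_tracking:
        relative_az = source_az_deg - head_yaw_deg  # source angle re: the head
    else:
        relative_az = source_az_deg                 # head movement is ignored
    return max_itd * math.sin(math.radians(relative_az))

# Without tracking, turning the head changes nothing.
fixed_a = rendered_itd(30, 0, head_tracking=False)
fixed_b = rendered_itd(30, 45, head_tracking=False)

# With tracking, turning to face the source brings the ITD to zero.
tracked = rendered_itd(30, 30, head_tracking=True)
```

The last prediction in the list amounts to saying that this lawful dependence on head yaw, rather than the realism of the cue itself, is what should make the sound feel outside the head.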
To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.
What I find intriguing in this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what are the tactile signals produced by earphones? Do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations and he showed that we can actually feel the ITD of binaural sounds acting on the skin of the two arms rather than on the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies on profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).
I now summarize this post by listing a number of questions I have raised:
- What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
- Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
- Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
- What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?
Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. But as soon as there is a reflecting surface, which does not move with the head, moving the head has an effect on sound waves (specifically, on echoes). In other words, the fact that echoic cues are affected (in a lawful way) by movements is characteristic of sounds outside the head, whether they are rigidly attached to the head or not. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues do for an external source free from the head.
In my previous post, I argued that the spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. This happens even though sounds contain highly structured information about shape, because this structure does not relate to self-generated movements. One may then wonder whether the notion of auditory space in general, for example the spatial location of a sound source, is also secondary. One may postulate that the spatial content of auditory spatial cues is only acquired by their contingency with visual spatial cues. In fact, this idea is supported by an intriguing study showing a very strong correlation across species between visual acuity in the fovea and auditory spatial acuity (Heffner & Heffner, 1992, Fig. 6). More precisely, the authors show that sound localization acuity is better predicted by visual acuity than by acoustical factors (essentially, interaural distance). In our interpretation, animals have poor sound localization acuity not so much because they lack the physiological mechanisms to correctly analyze spatial information, but because in the absence of precise vision, auditory spatial cues cannot acquire precise spatial content. This does not imply that the auditory system of these animals cannot decode these spatial cues, but only that they cannot make sense of them. [Update: the results in Heffner & Heffner are in fact more subtle, see a more recent post]
This being said, there is in fact some intrinsic spatial content in sounds, which I will describe now. When a sound is produced, it arrives first at the ear closer to the source, then at the other ear. The intensity will also be higher at the first ear. This is the binaural structure of sounds produced by a single source, and captured by two ears that are spatially separated. This is similar to stereoscopic vision. But observe one difference: in vision, as Gibson noted, having two eyes is essentially the same thing as having one eye, combined with lateral head movements; in hearing, this is not the same because of the non-persistent nature of sounds. If one turns the head to sample another part of the “acoustic array” (in analogy with Gibson’s optic array), the sound field will have changed already (and possibly faded out), so the spatial structure will not be directly captured in the same way. Thus, to capture spatial structure in sound, it is crucial that acoustic signals are simultaneously captured at different locations.
This binaural structure in sounds is often described as “spatial cues” (binaural cues). Quantitatively, there is a relationship between the spatial location of the source and binaural structure, e.g. the interaural time difference (ITD). However, these “cues” are not intrinsically spatial in the sense that they are not defined in relationship to self-generated movements. For example, what is the spatial meaning of an ITD of 100 µs? Intrinsically, there is none. As discussed above, one way for spatial cues to acquire spatial content is by association, i.e., with the spatial content of another modality (vision). But now I will consider the effects of self-generated movements, that is, what is intrinsically spatial in sounds.
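The quantitative relationship between source direction and ITD can be sketched with Woodworth's classical spherical-head approximation, ITD = (a/c)(θ + sin θ), for a distant source at azimuth θ. This is only an illustration: the head radius, the speed of sound and the function names are assumed values, and a real head is not a sphere.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air
HEAD_RADIUS = 0.0875    # m, assumed typical human value

def itd_woodworth(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """ITD (s) of a distant source for a rigid spherical head.
    Azimuth in degrees, 0 = straight ahead, 90 = to the side."""
    theta = math.radians(azimuth_deg)
    return head_radius / c * (theta + math.sin(theta))

def azimuth_from_itd(itd, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Invert the model by bisection on [0, 90] degrees
    (the ITD grows monotonically with azimuth on that interval)."""
    low, high = 0.0, 90.0
    for _ in range(60):
        mid = (low + high) / 2
        if itd_woodworth(mid, head_radius, c) < itd:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# The same ITD maps to different directions for different head sizes.
for radius in (0.0875, 0.03):
    angle = azimuth_from_itd(100e-6, head_radius=radius)
    print(f"head radius {radius * 100:.2f} cm: 100 µs <-> {angle:.1f} degrees")
```

Under these assumptions, 100 µs corresponds to about 11° for the human-sized head and to more than 30° for the 3 cm head. The number by itself says nothing about direction until it is tied to something else, here an acoustical model of the head.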
When the head turns, the binaural structure changes in specific ways. That is, there is a sensorimotor structure that gives spatial content to binaural structure. More precisely, two different binaural structures can be related to each other by a specific movement. But an important distinction with vision must be made. Because of the non-persistent nature of sounds, the relationship is not between movements and sensory signals, it is between movements and the structure of sensory signals. It is not possible to predict the auditory signals from auditory signals captured before a specific movement. For one thing, there might be no sound produced after the movement. What is predictable is the binaural structure of the sound, if indeed a sound is produced by a source that has a persistent location. If the location of the source is persistent, then the binaural structure is persistent, but not the auditory signals themselves.
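Here is a minimal sketch of this sensorimotor structure. It assumes a deliberately crude model, two point receivers and a distant source, so that the ITD is (d/c) sin(azimuth); the ear distance, the sign conventions and the function names are mine. The point is that the prediction takes the previous ITD and the movement as inputs, and never the waveform.

```python
import math

EAR_DISTANCE = 0.18     # m, assumed distance between the two ears
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air

def itd_from_azimuth(azimuth_deg):
    """ITD (s) in a simple two-receiver, far-field model.
    Positive azimuth and positive ITD both mean 'source to the right'."""
    return EAR_DISTANCE / SPEED_OF_SOUND * math.sin(math.radians(azimuth_deg))

def azimuth_from_itd(itd):
    """Azimuth (degrees) for a source assumed to be in the frontal hemifield."""
    ratio = itd * SPEED_OF_SOUND / EAR_DISTANCE
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding errors
    return math.degrees(math.asin(ratio))

def predict_itd_after_turn(itd_before, turn_deg):
    """ITD expected after turning the head by turn_deg (positive = rightward),
    for a source whose location in the world is persistent."""
    return itd_from_azimuth(azimuth_from_itd(itd_before) - turn_deg)

itd_before = itd_from_azimuth(30.0)
for turn in (10.0, 30.0):
    itd_after = predict_itd_after_turn(itd_before, turn)
    print(f"turn {turn:.0f} degrees: {itd_before * 1e6:.0f} µs -> {itd_after * 1e6:.0f} µs")
```

Turning by 30° toward a source at 30° brings the predicted ITD to zero, whatever sound the source happens to emit after the movement, if it emits any.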
Another point we notice is that this provides only a relative sense of space. That is, one can say whether a sound source is 20° left of another sound source, but it does not produce an absolute egocentric notion of space. What is lacking is a reference point. I will propose two ways to solve this problem.
What is special, for example, about a source that is in front of the observer? Gibson noted that, in vision, the direction in which the optic flow is constant indicates the direction of movement. Similarly, when one moves in the direction of a sound source, the direction of that sound source is unchanged, and therefore the binaural structure of sound is unchanged. In other words, the direction of a sound source in front is the direction of a self-generated movement that would leave the binaural structure unchanged (we could also extend this definition to the monaural spectral information). In fact the binaural structure can depend on distance, when the source is near, but this is a minor point because we can simply state that we are considering the direction that makes binaural structure minimally changed (see also the second way below). One problem with this, however, is that moving to and moving away from a source both satisfy this definition. Although these two cases can be distinguished by head movements, this definition does not make a distinction between what is moving closer and what is moving further away from the source. One obvious remark is that moving to a source increases the intensity of the sound. The notion of intensity here should be understood as a change in information content. In the same way as in vision where moving to an object increases the level of visual detail, moving to a sound source increases the signal-to-noise ratio, and therefore the level of auditory detail available. This makes sense independently of the perceptual notion of loudness – in fact it is rather related to the notion of intelligibility (a side note: this is consistent with the fact that an auditory cue to distance is the ratio between direct sound energy and reverberated energy). Of course again, because sounds are not persistent, the notion of change in level is weak. One needs to assume that the intensity of the sound persists. 
However, I do not think this is a critical problem, for even if intensity is variable, what is needed is only to observe how intensity at the ear correlates with self-generated movements. This is possible because self-generated movements are (or at least can be) independent of the intensity variations of the sound.
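The definition of "in front" proposed above can be illustrated with a small geometric sketch. It assumes a 2D free field with no reflections, a distant-enough source so that direction stands in for binaural structure, and an inverse-square law for level; the coordinates and function names are hypothetical.

```python
import math

def azimuth(observer, source):
    """Direction of the source in degrees, 0 = along +y, positive to the right.
    The orientation of the head is kept fixed."""
    return math.degrees(math.atan2(source[0] - observer[0], source[1] - observer[1]))

def level_db(observer, source):
    """Relative sound level under an assumed inverse-square law."""
    return -20 * math.log10(math.dist(observer, source))

def effect_of_step(source, heading_deg, length=0.5, observer=(0.0, 0.0)):
    """Change in source direction (degrees) and in level (dB)
    after a step of the given length along heading_deg."""
    h = math.radians(heading_deg)
    moved = (observer[0] + length * math.sin(h), observer[1] + length * math.cos(h))
    d_azimuth = azimuth(moved, source) - azimuth(observer, source)
    d_level = level_db(moved, source) - level_db(observer, source)
    return d_azimuth, d_level

# Hypothetical source 4 m away, 40 degrees to the right.
source = (4 * math.sin(math.radians(40)), 4 * math.cos(math.radians(40)))
for heading in (40, 220, 0):
    d_az, d_lvl = effect_of_step(source, heading)
    print(f"step along {heading:3d} degrees: direction change {d_az:+.2f}, level change {d_lvl:+.2f} dB")
```

Stepping along 40° or along 220° both leave the direction unchanged, which is the ambiguity noted above; only the sign of the level change separates moving toward the source from moving away. A step in any other direction, such as 0°, changes the direction.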
This indeed seems to provide some intrinsic spatial content to sounds. But we note that it is quite indirect (compared to vision), and made more evident by the fact that sounds are not persistent. There is another, more direct, way in which sounds can acquire spatial content: by the active production of sounds. For example, one can produce sounds by hitting objects. This provides a direct link between the spatial location of the object, relative to the body, and the auditory structure of the sound. Even though sounds are not persistent, they can be repeated. But we note that this can only apply to objects that are within reach.
This discussion shows that while there is no intrinsic spatial content about shape in sounds, there is intrinsic spatial content about source location. This seems to stand in contradiction with the discussion at the beginning of this post, in which I pointed out that spatial auditory acuity seems to be well predicted across species by visual acuity, suggesting that spatial content is acquired. Here is a possible way to reconcile these two viewpoints. In vision, an object at a specific direction relative to the observer will project light rays in that direction to the retina, which will be captured by specific photoreceptors. Therefore, there is little ambiguity in vision about spatial location. However, in hearing, this is completely different. Sounds coming from a particular direction are not captured by a specific receptor. Information about direction is in the structure of the signals captured at the two ears. The difficulty is that this structure depends on the direction of the sound source but also on other uncontrolled factors. For example, reflections, in particular early reflections, modify the binaural cues (Gourévitch & Brette 2012). These effects are deterministic but situation-dependent. This implies that there is no fixed mapping from binaural structure to spatial location. This makes the auditory spatial content weaker, even though auditory spatial structure is rich. Because visual location is more invariant, it is perhaps not surprising that it dominates hearing in localization tasks.
In a previous post, I emphasized the differences between vision and hearing, from an ecological point of view. Here I want to comment on the sensorimotor theory of perception (O’Regan & Noë 2001) or the enactive approach, applied to sounds. According to this theory, perception is the implicit knowledge of the effects of self-generated movements on sensory signals. Henri Poincaré made this point a long time ago: "To localize an object simply means to represent to oneself the movements that would be necessary to reach it". For example, perceiving the spatial location of an object is knowing the movements that one should make to move to that object, or to grasp it, or to direct one’s fovea to it.
There are two implicit assumptions here: 1) that there is some persistence in the sensory signals, 2) that the relevant information is spatial in nature. I will start with the issue of persistence. As I previously argued, a defining characteristic of sounds is that they are not persistent, they happen. For example, the sound of someone else hitting an object is transient. One cannot interact with it. So there cannot be any sensorimotor contingency in this experience. It could be argued that one relies on the memory of previous sensorimotor contingencies, that is, the memory of one producing an impact sound. This is a fair remark, I think, but it overestimates the amount of information there is in this contingency. When an impact sound is produced, the only relationships between motor commands and the acoustic signals are the impact timing and the sound level (related to the strength of the impact). But there is much more information in the acoustic signal of an impact sound, because the structure of this signal is related to properties of the sounding object, in particular material and shape (Gaver, 1993). For example, the resonant modes are informative of the shape and the decay rate of these modes indicates the nature of the material (wood, metal, etc), properties that we can very easily identify. So there is informative sensory structure independent of sensorimotor contingencies.
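A common way to picture this is to model an impact sound as a sum of exponentially damped sinusoids, one per resonant mode. The sketch below follows that idea; the mode frequencies, decay rates and amplitudes are made-up values chosen only to contrast a fast-decaying ("wood-like") and a slow-decaying ("metal-like") object, and the function name is mine.

```python
import math

def impact_sound(modes, strike_time, strike_strength, duration=0.5, sample_rate=8000):
    """Sum of damped sinusoids starting at strike_time (s).
    modes: list of (frequency_hz, decay_rate_per_s, relative_amplitude).
    The motor side only sets strike_time and strike_strength;
    the modes belong to the object."""
    samples = []
    for i in range(int(duration * sample_rate)):
        t = i / sample_rate - strike_time
        if t < 0:
            samples.append(0.0)  # nothing before the impact
        else:
            samples.append(strike_strength * sum(
                amp * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
                for freq, decay, amp in modes))
    return samples

# Hypothetical objects: same modal frequencies, different damping.
WOOD_LIKE = [(400.0, 60.0, 1.0), (1100.0, 90.0, 0.5)]
METAL_LIKE = [(400.0, 5.0, 1.0), (1100.0, 8.0, 0.5)]

def late_energy(samples, n=800):
    """Energy in the last n samples, a crude measure of how long the sound rings."""
    return sum(x * x for x in samples[-n:])

wood = impact_sound(WOOD_LIKE, strike_time=0.1, strike_strength=1.0)
metal = impact_sound(METAL_LIKE, strike_time=0.1, strike_strength=1.0)
print(late_energy(wood) < late_energy(metal))
```

In this toy model, changing the motor parameters only shifts the sound in time and scales it. The frequencies and decay rates, which carry the information about shape and material, are untouched by anything the hitter does.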
Now I think we are hitting an interesting point. Even though the resonant modes are informative of the shape (the size, essentially) of an object, they cannot provide any perceptual spatial content by themselves. That is, the frequency of a resonant mode is just a number, and a number has no meaning without context. Compare with the notion of object size for the tactile system: the size of a (small) object is the extent to which one must stretch the hand to grasp it. There is no such thing in hearing. There is nothing intrinsically spatial in auditory size, it seems. If one moves and the sound is repeated, the same resonant modes will be excited. Therefore, it seems that auditory shape can only be a derived property. That is, the specific sensory structure of sounds that corresponds to shape acquires perceptual content by association with another sense that has intrinsic spatial content, i.e., visual or tactile. Now we get to Gibson’s notion of invariant structure: auditory size is the structure in the auditory signals that remains the same when other aspects than size change (where the notion of size is not auditory). Here I am imagining that one hears sounds produced by various sources for which the size is known, and one can identify that some auditory structure is the same for all sources that have the same size. Note the important point here: what persists here is not the sensory signals, it is not the relationship between movements and sensory signals, it is not even the relationship between size and sensory signals, it is the relationship between size and the structure of auditory signals, which is a kind of relationship. That is, one cannot predict the auditory signals from the size: one can predict some aspect of the structure of these signals from the size.
Here I have highlighted the fact that the auditory shape of an object is a structure of auditory signals, not a kind of sensorimotor structure. The spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. But there can also be intrinsic spatial content in sounds, and in my next post, I will discuss spatial hearing.
What is sound? Physically, sounds are mediated by acoustical waves. But vision is mediated by light waves and yet hearing does not feel like vision. Why is that?
There are two wrong answers to this question. The first one is that the neural structures are different. Sounds are processed in the cochlea and in the auditory cortex, images by the retina and visual cortex. But then why doesn’t a sound evoke some sort of image, like a second visual system? This point of view does not explain much about perception, only about what brain areas “light up” when a specific type of stimulus is presented. The second one is that the physical substrate is different: light waves vs. acoustic waves. This is also a weak answer, for what is fundamentally different between light and acoustic waves that would make them “feel” different?
I believe the ecological approach provides a more satisfying answer. By this, I am referring to the ecological theory of visual perception developed by James Gibson. It emphasizes the structure of sensory signals collected by an observer in an ecological environment. It is also related to the sensorimotor account of perception (O’Regan & Noë 2001), which puts the emphasis on the relationship between movements and sensory signals, but I will show below that this emphasis is less relevant in hearing (except in spatial hearing).
I will quickly summarize what vision is in Gibson’s ecological view. Illumination sources (the sun) produce light rays that are reflected by objects. More precisely, light is reflected at the surfaces of objects, that is, at their interface with the medium (air, or possibly water). What is available for visual perception are surfaces and their properties (color, texture, shape...). Both the illumination sources and the surfaces in the environment are generally persistent. The observer can move, and this changes the light rays received by the retina. But these changes are highly structured because the surfaces persist, and this structure is informative of the surfaces in the environment. Thus what the visual system perceives is the arrangement and properties of persistent surfaces. Persistence is crucial here, because it allows the observer to use its own movements to learn about the world – in the sensorimotor account of perception, perception is precisely the implicit knowledge of the effect of one’s actions on sensory signals.
On the other hand, sounds are produced by the mechanical vibration of objects. This means that sounds convey information about volumes rather than surfaces. They depend on the shape but also on the material and internal structure of objects. It also means that what is perceived in sounds is the source of the waves rather than their interaction with the environment. Crucially, contrary to vision, the observer cannot directly interact with sound waves, because a sound happens, it is not persistent. An observer can produce a sound wave, for example by hitting an object, but once the sound is produced there is no possible further interaction with it. The observer cannot move to analyze the structure of acoustic signals. The only available information is in the sound signal itself. In this sense, sounds are events.
These ecological observations highlight major differences between vision and hearing, which go beyond the physical basis of these two senses (light waves and acoustic waves). Vision is the perception of persistent surfaces. Hearing is essentially the perception of mechanical events on volumes. These remarks are independent from the fact that vision is mediated by a retina and hearing by a cochlea.