When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.
But why is it so? The first answer is: sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences but not time differences. They also don’t reproduce the diffraction by the head, which one can reproduce using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can it be so, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.
It could be the reason, but even if it’s true, it still leaves an open question: why would sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel sounds inside the head, not just inside the body. And finally, is it actually true that there are no sounds coming from inside the head? In fact, not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.
The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears. This is the head, from the top of the neck, excluding the jaws. This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious how we should experience this feel as sounds being produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel by e.g. touching one’s head or chewing. But these are sound sources localized on the surface of the head, or the inner surface of the mouth, not exactly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.
I will come back to speech later, but I will finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs and the subject’s head is held fixed, it should sound externalized; or natural sounds should feel inside the head until one turns her head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated to the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving her head, this hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when presenting sounds with such features, it is inferred that ITDs and IIDs should be unaffected by movements, which is to say that sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and not IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musically recordings).
The remarks above would imply the following facts:
- When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
- When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
- When the subject is allowed to move, sounds should feel (perhaps after a while) inside the head.
- When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.
To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.
What I find intriguing in this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what are the tactile signals produced by earphones, do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations and he showed that we can actually feel the ITD of binaural sounds acting on the skin of the two arms rather than on the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies on profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).
I now summarize this post by listing a number of questions I have raised:
- What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
- Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
- Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
- What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?
Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. But as soon as there is a reflecting surface, which does not move with the head, moving the head has an effect on sound waves (specifically, on echoes). In other words, the fact that echoic cues are affected (in a lawful way) by movements is characteristic of sounds outside the head, whether they are rigidly attached to the head or not. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues do for an external source free from the head.