What is sound? (IX) Sound localization and vision

In this post, I want to come back to a remark I made in a previous post about the relationship between vision and spatial hearing. It appears that my account of the comparative study of Heffner and Heffner (Heffner & Heffner, 1992) was not accurate. Their findings are in fact even more interesting than I thought. They find that sound localization acuity across mammalian species is best predicted not by visual acuity, but by the width of the field of best vision.

Before I comment on this result, I need to explain a few details. Sound localization acuity was measured behaviorally in a left/right discrimination task near the midline, with broadband sounds. The authors report this discrimination threshold for 23 mammalian species, from gerbils to elephants. They then try to relate this value to various other quantities: the largest interaural time difference (ITD), which is directly related to head size, visual acuity (highest angular density of retinal cells), whether the animals are predators or prey, and the field of best vision. The latter quantity is defined as the angular width of the retina in which angular cell density is at least 75% of the highest density. So this quantity is directly related to the inhomogeneity of cell density in the retina.
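To make the 75% criterion concrete, here is a minimal sketch (with a made-up Gaussian density profile rather than real retinal data) of how the width of the field of best vision would be computed:

```python
import numpy as np

# Hypothetical retinal cell density profile (not real data): the field of best
# vision is the angular range where density is at least 75% of its maximum.
eccentricity = np.linspace(-100, 100, 2001)       # visual angle (deg)
density = np.exp(-(eccentricity / 30.0) ** 2)     # made-up density profile
best = eccentricity[density >= 0.75 * density.max()]
print("width of the field of best vision:", round(best.max() - best.min(), 1), "deg")
```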

The results of the comparative study are not straightforward (I find). Let us consider a few hypotheses. One hypothesis goes as follows. Sound localization acuity is directly related to the temporal precision of firing of auditory nerve fibers. If this precision is similar for all mammals, then this should correspond to a constant ITD threshold. In terms of angular threshold, sound localization acuity should then be inversely proportional to the largest ITD, and to head size. The same reasoning would go for intensity differences. Philosophically speaking, this corresponds to the classical information-processing view of perception: there is information about sound direction in the ITD, as reflected in the relative timing of spikes, and so sound direction can be estimated with a precision that is directly related to the temporal precision of neural firing. As I have argued many times in this blog, the flaw in the information-processing view is that information is defined with respect to an external reference (sound direction), which is accessible for an external observer. Nothing in the spikes themselves is about space: why would a difference in timing between two specific neurons produce a percept of space? It turns out that, of all the quantities the authors looked at, largest ITD is actually the worst predictor of sound localization acuity. Once the effect of best field of vision is removed, it is essentially uncorrelated (Fig. 8).
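To make hypothesis 1 concrete, here is a small sketch (illustrative numbers, not taken from the paper) of what a fixed temporal precision would predict: near the midline the ITD grows roughly linearly with azimuth at a rate set by the largest ITD, so a constant ITD threshold translates into an angular threshold inversely proportional to head size.

```python
import numpy as np

c = 343.0          # speed of sound (m/s)
dITD = 20e-6       # hypothetical fixed ITD threshold (20 microseconds)

def largest_itd(interaural_distance):
    """Largest ITD under the simple approximation ITD ~ (d/c) * sin(azimuth)."""
    return interaural_distance / c

def angular_threshold(interaural_distance):
    """Predicted left/right threshold near the midline, where dITD/dazimuth = d/c."""
    return np.degrees(dITD / largest_itd(interaural_distance))

# Interaural distances below are rough, illustrative values
for name, d in [("gerbil", 0.03), ("cat", 0.08), ("human", 0.20), ("elephant", 0.50)]:
    print(f"{name:9s} largest ITD = {1e6 * largest_itd(d):5.0f} us, "
          f"predicted threshold = {angular_threshold(d):4.1f} deg")
```

The point of the comparative data is precisely that measured thresholds do not follow this 1/head-size prediction.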

A second hypothesis goes as follows. The auditory system can estimate the ITD of sounds, but to interpret this ITD as the angle of the sound source requires calibration (learning), and this calibration requires vision. Therefore, sound localization acuity is directly determined by visual acuity. At first sight, this could be compatible with the information processing view of perception. However, the sound localization threshold is determined in a left/right localization task near the midline, and in fact this task does not require calibration. Indeed, one only needs to know which of the two ITDs is larger. Therefore, in the information-processing view, sound localization acuity should still be related to the temporal precision of neural “coding”. To make this hypothesis compatible with the information-processing view requires an additional evolutionary argument, which goes as follows. The sound localization system is optimized for a different task, absolute (not relative) localization, which requires calibration with vision. Therefore the temporal precision of neural firing, or of the binaural system, should match the required precision for that task. The authors find again that, once the effect of best field of vision is removed, visual acuity is essentially uncorrelated with sound localization acuity (Fig. 8).

Another evolutionary hypothesis could be that sound localization acuity is tuned for the particular needs of the animal. So a predator, like a cat, would need a very accurate sound localization system to be able to find a prey that is hiding. A prey would probably not require such high accuracy to be able to escape from a predator. An animal that is neither a prey nor a predator, like an elephant, would also not need high accuracy. It turns out that the elephant has one of the lowest localization thresholds of all mammals. Again there is no significant correlation once the best field of vision is factored out.

In this study, it appears rather clearly that the single quantity that best predicts sound localization acuity is the width of the best field of vision. First of all, this goes against the common view of the interaction between vision and hearing. According to this view, the visual system localizes the sound source, and this estimation is used to calibrate the sound localization system. If this were right, we would rather expect that localization acuity corresponds to visual acuity.

In terms of function, the results suggest that sound localization is used by animals to move their eyes so that the source is in the field of best vision. There are different ways to interpret this. The authors seem to follow the information-processing view, with the evolutionary twist: sound localization acuity reflects the precision of the auditory system, but that precision is adapted for the function of sound localization. One difficulty with this interpretation is that the auditory system is also involved in many other tasks that are unrelated to sound localization, such as sound identification. Therefore, only the precision of the sound localization system itself should be tuned to the difficulty of the task (for example, the size of the medial superior olive, which is involved in the processing of ITDs). However, when thinking of intensity rather than timing differences, this view seems to imply that the precision of encoding of monaural intensities should be tuned to the difficulty of the binaural task.

Another difficulty comes from studies of vision-deprived or blind animals. There are a few such studies, and they tend to show that sound localization acuity actually improves. This could not occur if sound localization acuity reflected genetic limitations. The interpretation can be saved by replacing evolution by development. That is, the sound localization system is tuned during development to reach a precision appropriate for the needs of the animal. For a sighted animal, these needs would be moving the eyes to the source, but for a blind animal it could be different.

An alternative interpretation that rejects the information-processing view is to consider that the meaning of binaural cues (ITD, ILD) can only come from what they imply for the animal, independently of the “encoding” precision. For a sighted animal, observing a given ITD would imply that moving the eyes or the head by a specific angle would put a moving object in the best field of view. If perceiving direction is perceiving the movement that must be performed to put the source in the best field of view, then sound localization acuity should correspond to the width of that field. For a blind animal, the connection with vision disappears, and so binaural cues must acquire a different meaning. This could be, for example, the movements required to reach the source. In this case, sound localization acuity could well be better than for a sighted animal.

In more operational terms, learning the association between binaural cues and movements (of the eyes or head) requires a feedback signal. In the calibration view, this feedback is the error between the predicted retinal location of the sound source and the actual location, given by the visual system. Here the feedback signal would rather be something like the total amount of motion in the visual field, or its correlation with sound, a quantity that would be maximized when the source is in the best field of vision. This feedback is more like a reward than a teacher signal.
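Here is a toy sketch of why such a reward-like signal would tie localization acuity to the width of the field of best vision. All numbers are made up; the idea is simply that if the only feedback is whether the source ended up within the field of best vision, then any ITD-to-movement mapping that always "wins" this reward is tolerated, so pointing errors up to about half the field width (or the whole range of directions tested, if smaller) go uncorrected.

```python
import numpy as np

rng = np.random.default_rng(0)
max_itd = 250e-6                                   # hypothetical largest ITD (s)

def itd_of(angle_deg):
    return max_itd * np.sin(np.radians(angle_deg))

def tolerated_error(field_width_deg, n_gains=2000, n_sources=200):
    """Largest pointing error among candidate ITD->angle gains that are never
    contradicted by the binary reward 'the source landed in the field of best vision'."""
    ideal_gain = np.degrees(1.0) / max_itd          # gain that is correct for small angles
    gains = ideal_gain * rng.uniform(0.0, 2.0, n_gains)
    sources = rng.uniform(-20.0, 20.0, n_sources)   # sources near the midline (deg)
    errors = np.abs(sources[None, :] - gains[:, None] * itd_of(sources)[None, :])
    always_rewarded = (errors < field_width_deg / 2).all(axis=1)
    return errors[always_rewarded].max()

for width in [2.0, 20.0, 180.0]:                    # narrow fovea ... homogeneous retina
    print(f"field of best vision {width:5.1f} deg -> errors up to "
          f"{tolerated_error(width):4.1f} deg go uncorrected")
```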

Finally, I suggest a simple experiment to test this hypothesis. Gerbils have a rather homogeneous retina, with a best field of vision of 200°. Accordingly, their sound localization threshold is large, about 27°. The hypothesis would predict that, if gerbils were raised with an optical system (glasses) that creates an artificial fovea (magnifying a central part of the visual field), then their sound localization acuity should improve. Conversely, for an animal with a small field of best vision (cats), an optical system that widens the field of best vision (compressing a wide view onto the retina) should degrade sound localization acuity. Finally, in humans with corrected vision, there should be a correlation between the type of correction and sound localization acuity.

This discussion also raises two points I will try to address later:

- If sound localization acuity reflects visual factors, then it should not depend on properties of the sound, as long as there are no constraints in the acoustics themselves (e.g. a pure tone may provide ambiguous cues).

- If sound localization is about moving the eyes or the head, then how about the feeling of distance, and other aspects of spatial hearing?

 

What is sound? (VIII) Sounds: objects or events?

In my first post in this series, I described the differences between seeing and hearing. I noted that what characterizes sounds is that they are not persistent. One may say that sounds are “events”, as opposed to “objects”. I avoided this term because it is implied that an event has a definite start and end. Although this is sometimes true (for example speech), many sounds actually do not have a definite end. For example, the sound produced when striking an object has a definite start (the impact) but not a definite end (energy decays exponentially). This is not to say that we hear these sounds as lasting forever, but simply that it is somewhat arbitrary to define a clear ending time. Worse, a number of sounds also have no start and no end. For example, the sound made by a river, or by wind. So what characterizes sounds is not exactly that they have a clear start and end, but rather that they are not persistent, they change through time. So, it could be said that sounds are events, but in the sense that they “happen”. When the sound is heard, the acoustical wave responsible for it is actually not here anymore (this is related to Husserl’s phenomenological description of time).

Now it could be argued that, if one could repeat the sound (with a recording for example, or less accurately by physically producing the sound several times), then perhaps it could qualify as an object. The notion of “repeatable object” is discussed by Jérôme Dokic (“Two ontologies of sound”), where there is an interesting remark about the notion of depiction. When seeing a painting, one sees both the content in the painting and the painting itself. But at first sight, it seems that sounds are not like this: the reproduction of a sound is like the original sound – possibly altered, but not a representation of the sound. But in fact there is an interesting auditory example: when a loud voice is heard through a phone and the volume is low, you actually hear a loud sound (the voice) inside a soft sound (the sound coming out of the phone).

Nevertheless, I think even in this case, describing the sound as a sort of “object” is misleading. An object is something that can be manipulated. For example, if you are looking at a box on the table, you can change your perspective on it, turn around it, see a face disappear behind an edge, etc. You can do this exploration because the object is persistent. In the same way, you could touch it, hold it, turn it, etc. So it makes sense to say that visual or tactile experience is about objects. But the same does not hold for sounds because they are transient: you cannot explore them. If you read my post on spatial hearing, you could object that you actually can: some of the properties of sound change when you move around the source. That is true, but precisely: you do not hear these changes as changes in the sound, but as changes in its localization. You feel the same sound, coming from some other direction. How about being able to repeat the sound with a recording? The point is that repeating is not manipulating. To manipulate, you need to change the perspective on the object, and this change of perspective tells you something about the object that you could not know before the manipulation (for example looking behind it) – to be more precise, it can be said that visual shape is isomorphic to the relationship between viewing angle and visual field. If you repeat a recording exactly as in the original production, there is no manipulation at all. If you repeat it but, say, filter it in some way, you change it but the change does not reveal anything about the sound, so it is not a change in perspective. You just produce a different sound, or possibly a depiction of the sound as in the phone example. The right visual analogy would be to insert a colored filter in front of your eyes, which does not reveal anything about visual shape. Finally, it could be objected that a sound can be repeatedly produced, for example by hitting the box several times, and the sound can be manipulated by hitting it with different strengths. But this is in fact not accurate: when the box is hit with a different strength, a different sound is produced, not a new perspective on the same sound. Here the object, what is persistent and can be manipulated, is not the sound: it is the material that produces the sound.

In fact, there is a well-known example in which the environment is probed using acoustical waves: the ultrasound hearing of bats. Bats produce loud ultrasound clicks or chirps and use the echoes to navigate in caves or localize small insects. In this case, acoustical waves are used to construct some sort of objects (the detailed shape of the cave), but I think this is really not what we usually mean by “hearing”, it seems rather closer to what we mean by seeing. I can of course only speculate about the phenomenological experience of bats, but I would guess that their experience is that of seeing, not of hearing.

To summarize: sounds are not like objects, which you can physically manipulate, i.e., have some control over the sensory inputs, in a way that is specific to the object. One possibility, perhaps, is to consider sounds as mental objects: things that you can manipulate in your mind, using your memory – but this is quite different from the notion of visual or tactile object.

What is sound? (VII) The phenomenology of pitch

So far, I have focused on an ecological description of sound, that is, how sounds appear from the perspective of an organism in its environment: the structure of sound waves captured by the ears in relationship with the sound-producing object, and the structure of interaction with sounds. There is nothing psychological per se in this description. It only specifies what is available to our perception, in a way that does not presuppose knowledge about the world. I now want to describe subjective experience of sounds in the same way, without preconceptions about what it might be. Such a preconception could be, for example, to say: pitch is the perceptual correlate of the periodicity of a sound wave. I am not saying that this is wrong, but I want to describe the experience of pitch as it appears subjectively to us, independently of what we may think it relates to.

This is in fact the approach of phenomenology. Phenomenology is a branch of philosophy that describes how things are given to consciousness, our subjective experience. It was introduced by Edmund Husserl and developed by a number of philosophers, including Merleau-Ponty and Sartre. The method of “phenomenological reduction” consists in suspending all beliefs we may have on the nature of things, to describe only how they appear to consciousness.

Here I will briefly discuss the phenomenology of pitch, which is the percept associated with how high or low a musical note is. A vowel also produces a similar experience. First of all, a pure tone feels like a constant sound, unlike a tone modulated at low frequency (say, a few Hz). This simple remark is already quite surprising. A pure tone is not a constant acoustical wave at all, it oscillates at a fast rate. Yet we feel it as a constant sound, as if nothing were changing at all in the sound. At the same time, we are not insensitive to this rate of change of the acoustical wave: if we vary the frequency of the pure tone, it feels very different. This feeling is what is commonly associated with pitch: when the frequency is increased, the tone feels “higher”; when it is decreased, it feels “lower”. Interestingly, the language we use to describe pitch is that of space. I am too influenced by my own language and my musical background to tell whether we actually feel high sounds as being physically high, but it is an interesting observation. For sure, though, low-pitched sounds tend to feel larger than high-pitched sounds, again a spatial dimension.

A very distinct property of pitch is that changing the frequency of the tone, i.e., the temporal structure of the sound wave, does not produce a perceptual change along a temporal dimension. Pitch is not temporal in the sense of: there is one thing, and then there is another thing. With a pure tone, there always seems to be a single thing, not a succession of things. In contrast, with an amplitude-modulated tone, one can feel that the sound is sequentially (but continuously) louder and weaker. In the same way, if one hits a piano key, the loudness of the sound decreases. In both cases there is a distinct feel of time associated to the change in amplitude of the sound wave. And this feel does not exist with the fast amplitude change of a tone. This simple observation demonstrates that phenomenological time is distinct from physical time.

Another very salient point is that when the loudness of the sound of the piano key decreases, the pitch does not seem to change. Somehow the pitch seems to be invariant to this change. I would qualify this statement, however, because this might not be true at low levels.

When the frequency of a tone is ramped up (the oscillation accelerates), the sound seems to go higher, as when one asks a question. When it is ramped down, it seems to go lower, as when one ends a sentence. Here there is a feeling of time (first it is lower, then it is higher), corresponding to the temporal structure of the frequency change at a fast timescale.

Now when one compares two different sounds from the same instrument in sequence, there is usually a distinct feeling of one sound being higher than the other one. However, when the two sounds are very close in pitch, for example when one tunes a guitar, it can be difficult to tell which one is higher, even though it may be clearer that they have distinct pitches. When one plays two notes of different instruments, it is generally easy to tell whether it is the same note, but not always which one is higher. In fact the confusion is related to the octave similarity: if two notes are played on a piano, differing by an octave (which corresponds to doubling the frequency), they sound very similar. If they are played together instead of sequentially, they seem to fuse, almost as a single note. It follows that pitch seems to have a somewhat circular or helicoidal topology: there is an ordering from low to high, but at the same time pitches of notes differing by an octave feel very similar.
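A common way to picture this helical topology, sketched below, is to map each note onto a “height” coordinate (log-frequency, one unit per octave) and a “chroma” angle (position within the octave): notes an octave apart share the same angle but differ in height. The note frequencies are standard values, used here only for illustration.

```python
import numpy as np

def pitch_helix(f0, f_ref=261.63):            # f_ref: C4, an arbitrary reference
    """Map a fundamental frequency to (height in octaves, chroma angle in radians)."""
    height = np.log2(f0 / f_ref)               # one unit per octave
    chroma = 2 * np.pi * (height % 1.0)        # angular position within the octave
    return height, chroma

for f in [261.63, 523.25, 392.00]:              # C4, C5 (one octave up), G4
    h, c = pitch_helix(f)
    print(f"{f:7.2f} Hz: height = {h:+.2f} octaves, chroma angle = {np.degrees(c):5.1f} deg")
```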

If one plays a melody on one instrument and then the same melody on another instrument, they feel like the same melody, even though the acoustic waves are very different, and certainly they sound different. If one plays a piano key, then it is generally easy to immediately sing the same note. Of course when we say “the same note”, it is actually a very different acoustical wave that is produced by our voice, but yet it feels like it is the same level of “highness”. These observations certainly support the theory that pitch is the perceptual correlate of the periodicity of the sound wave, with the qualification that low repetition rates (e.g. 1 Hz) actually produce a feel of temporal structure (change in loudness or repeated sounds, depending on what is repeated in the acoustical wave) rather than a lower pitch.

The last observation is intriguing. We can repeat the pitch of a piano key with our voice, and yet most of us do not possess absolute pitch, the ability to name the piano key, even with musical training. It is intriguing because the muscular commands to the vocal system required to produce a given note are absolute, in the sense that they do not depend on musical context. This means, for most of us who do not possess absolute pitch, that these commands are not available to our consciousness as such. We can sing a note that we just heard, but we cannot sing a C. This suggests that we actually possess absolute pitch at a subconscious level.

I will come back to this point. Before that, we need to discuss relative pitch. What is meant by “relative pitch”? Essentially, it is the observation that two melodies played in different keys sound the same. This is not a trivial fact at all. Playing a melody in a different key means scaling the frequency of all notes by the same factor, or equivalently, playing the fine structure of the melody at a different rate. The resulting sound wave is not at all like the original sound wave, either in the temporal domain (at any given time the acoustical pressures are completely different) or in the frequency domain (spectra could be non-overlapping). The melody sounds the same when fundamental frequencies are multiplied by the same factor, not when they are shifted by the same quantity. Note also that the melody is still recognizable when the duration of notes or gaps is changed, when the tempo is different, when expressivity is changed (e.g. loudness of notes) or when the melody is played staccato. This fact calls into question neurophysiological explanations based on adaptation.
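A tiny numerical illustration of this point (note frequencies are standard values, used only as an example): transposition multiplies every fundamental frequency by the same factor and preserves the pattern of frequency ratios, whereas adding a constant number of Hz destroys it.

```python
melody = [261.63, 293.66, 329.63, 261.63]            # C4 D4 E4 C4
transposed = [f * 1.5 for f in melody]               # same melody, a fifth higher
shifted = [f + 130.0 for f in melody]                # shifted by a constant in Hz

ratios = lambda seq: [round(b / a, 3) for a, b in zip(seq, seq[1:])]
print(ratios(melody))        # interval pattern of the original
print(ratios(transposed))    # identical: it is heard as the same melody
print(ratios(shifted))       # different: this is not the same melody
```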

Thus, it seems that, at a conscious level, what is perceived is primarily musical intervals. But even this description is probably not entirely right. It suggests that the pitch of a note is compared to the previous one to make sense. But if one hears the national anthem with a note removed, it will not feel like a different melody, but like the same melody with an ellipsis. It is thus more accurate to say that a note makes sense within a harmonic context, rather than with respect to the previous note.

This point is in fact familiar to musicians. If a song is played and then one is asked to sing another song, the singer will tend to start the melody in the same key as the previous song. The two songs are unrelated, so thinking in terms of intervals does not make sense. But somehow there seems to be a harmonic context in which notes are interpreted.

Now the fact that there is such an effect of the previous song means that the harmonic context is maintained in working memory. It does not seem to require any conscious effort or attention, as when one tries to remember a phone number. Somehow it stays there, unconsciously, and determines the way in which future sounds are experienced. It does not even appear clearly whether there is a harmonic context in memory or if it has been “forgotten”.

Melodies can also be remembered for a long time. A striking observation is that it is impossible for most people to recall a known melody in the right key, the key in which it was originally played, and it is also impossible to tell whether the melody, played by someone else, is played in the right key. Somehow the original key is not memorized. Thus it seems that it is not the fundamental frequency of notes that is memorized. One could imagine that intervals are memorized rather than notes, but as I noted earlier, this is probably not right either. More plausible is the notion that it is the pitch of notes relative to the harmonic structure that is stored (i.e., pitch is relative to the key, not to the previous note).

We arrive at the notion that both the perception and the memory of pitch are relative, and they seem to be relative in a harmonic sense, i.e., relative to the key and not in the sense of intervals between successive notes. Now what I find very puzzling is that the fact that we can sing at all means that, at a subconscious level but not at a conscious level, we must have a notion of absolute pitch.

Another intriguing point is that we can imagine a note, play it in our head, and then try to play it on a piano: it may sound like the note we played, or it may sound too high or too low. We are thus able to make a comparison between a note that is physically played and a note that we consciously imagine. But we are apparently not conscious of pitch in an absolute sense, in a way that relates directly to properties of physical sounds. The only way I can see to resolve this apparent contradiction is to say that we imagine notes as degrees in a harmonic context (or musical scale), i.e., “tonic” for the C note in a C key, “dominant” for the G note in a C key, etc, and in the same way we perceive notes as degrees. The absolute pitch, independent of the musical key, is also present but at a subconscious level.

I have only addressed a small portion of the phenomenology of pitch, since I have barely discussed harmony. But clearly, it appears that the phenomenology of pitch is very rich, and also not tied to the physics of sound in a straightforward way. It is deeply connected with the concepts of memory and time.

In light of these observations, it appears that current theories of pitch address very little of the phenomenology of pitch. In fact, all of them (both temporal and spectral theories) address the question of absolute pitch, something that most of us actually do not have conscious access to. It is even more limited than that: current models of pitch are meant to explain how the fundamental frequency of a sound can be estimated by the nervous system. Thus, they start from the physicalist postulate that pitch is the perceptual correlate of sound periodicity, which, as we have seen, is not unreasonable but remains a very superficial aspect of the phenomenology of pitch. They also focus on the problem of inference (how to estimate pitch) and not on the deeper problem of definition (what is pitch, why do some sounds produce pitch and not others, etc.).

What is sound? (VI) Sounds inside the head

When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.

But why is it so? The first answer is: sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences but not time differences. They also don’t reproduce the diffraction by the head, which one can reproduce using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can it be so, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.

It could be the reason, but even if it’s true, it still leaves an open question: why would sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel sounds inside the head, not just inside the body. And finally, is it actually true that there are no sounds coming from inside the head? In fact, not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.

The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears, that is, of sources located in the head itself (from the top of the neck up, excluding the jaw). This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious why we should experience this feel as that of sounds produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel by e.g. touching one’s head or chewing. But these are sound sources localized on the surface of the head, or the inner surface of the mouth, not exactly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.
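A rough sketch of the sensorimotor signature in question (free field, no reverberation, and a simple sin-based ITD model with illustrative numbers): for a source rigidly attached to the head, as with earphones, the ITD is unaffected by head turns, whereas for a source fixed in the world it changes lawfully with them.

```python
import numpy as np

max_itd = 650e-6                                   # roughly human-sized head (s)

def itd(source_azimuth_deg, head_azimuth_deg, attached_to_head):
    """ITD under a simple sin model; a head-attached source keeps a fixed relative angle."""
    angle = source_azimuth_deg if attached_to_head else source_azimuth_deg - head_azimuth_deg
    return max_itd * np.sin(np.radians(angle))

for head in [0, 20, 40]:                           # head orientation (deg)
    print(f"head at {head:2d} deg: world-fixed source ITD = {1e6 * itd(30, head, False):6.1f} us, "
          f"head-attached source ITD = {1e6 * itd(30, head, True):6.1f} us")
```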

I will come back to speech later, but I will finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs and the subject’s head is held fixed, it should sound externalized; or natural sounds should feel inside the head until one turns her head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated with the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving her head, this hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when presenting sounds with such features, it is inferred that ITDs and IIDs should be unaffected by movements, which is to say that sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and not IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musical recordings).

The remarks above would imply the following predictions:

  • When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
  • When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
  • When the subject is allowed to move, sounds should feel (perhaps after a while) inside the head.
  • When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.

To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.

What I find intriguing in this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what are the tactile signals produced by earphones, do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations and he showed that we can actually feel the ITD of binaural sounds acting on the skin of the two arms rather than on the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies on profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).

I now summarize this post by listing a number of questions I have raised:

  • What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
  • Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
  • Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
  • What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?

Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. But as soon as there is a reflecting surface, which does not move with the head, moving the head has an effect on the sound waves (specifically, on echoes). In other words, the fact that echoic cues are affected (in a lawful way) by movements is characteristic of sounds outside the head, whether the source is rigidly attached to the head or not. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues do so for an external source free from the head.

What is sound? (V) The structure of pitch

Musical notes have a particular perceptual quality called “pitch”. Pitch is the percept corresponding to how low or high a musical note is. Vowels also have a pitch. To a large extent, the pitch of a periodic sound corresponds to its repetition rate. The important point is that what matters for pitch is the periodicity more than the frequency content. For example, a periodic sound with repetition rate f0 has frequency components at multiples of f0 (n.f0), which are called harmonics. A pure tone of frequency f0 and a complex tone with all harmonics except the first one, i.e., one that does not contain the frequency component f0, will evoke the same pitch. It is in fact a little more complex than that (there are many subtleties), but I will not go into these details in this post. Here I simply want to describe the kind of sensory or sensorimotor structure there is in pitch. It turns out that pitch has a surprisingly rich structure.
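Here is a minimal sketch of that “missing fundamental” observation (the synthesis parameters are arbitrary): a complex tone built from harmonics 2 to 10 of f0 contains no spectral energy at f0 itself, yet its waveform repeats with period 1/f0 and it evokes the same pitch as a pure tone at f0.

```python
import numpy as np

fs, f0 = 44100, 100.0                      # sampling rate (Hz), fundamental (Hz)
t = np.arange(int(0.2 * fs)) / fs
complex_tone = sum(np.sin(2 * np.pi * n * f0 * t) for n in range(2, 11))

spectrum = np.abs(np.fft.rfft(complex_tone))
freqs = np.fft.rfftfreq(len(complex_tone), 1 / fs)
print("energy at 100 Hz:", spectrum[np.argmin(np.abs(freqs - 100))].round(1))  # ~0
print("energy at 200 Hz:", spectrum[np.argmin(np.abs(freqs - 200))].round(1))  # large
```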

The most obvious type of structure is periodicity. Pitch-evoking sounds have this very specific property that the acoustical wave is unchanged when temporally shifted by some delay. This delay is characteristic of the sound’s pitch (i.e., same period means same pitch). This is the type of structure that is emphasized in temporal theories of pitch. This is what I call the “similarity structure” of the acoustical signal, and this notion can in fact be extended and accounts for a number of interesting phenomena related to pitch. But this is work in progress, so I will discuss it further at a later time.
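A bare-bones illustration of this similarity structure (using the same illustrative missing-fundamental complex as above): the delay at which the signal best matches a shifted copy of itself, i.e. the main autocorrelation peak, recovers the period even though there is no energy at the fundamental. This is the essence of temporal models of pitch.

```python
import numpy as np

fs, f0 = 44100, 100.0
t = np.arange(int(0.2 * fs)) / fs
signal = sum(np.sin(2 * np.pi * n * f0 * t) for n in range(2, 11))   # no energy at f0

ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]     # non-negative lags
lag = np.argmax(ac[50:]) + 50                 # skip the trivial peak at very short lags
print("estimated period:", 1000 * lag / fs, "ms ->", fs / lag, "Hz")  # ~10 ms, ~100 Hz
```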

Another way to see periodic sounds is to realize that a periodic sound is predictable. That is, after a couple periods, one can predict the future acoustical wave. Compared to most other sounds, periodic sounds have a very high degree of predictability. Perhaps the perceptual strength of pitch (which depends on a number of factors) is related to the degree of predictability of the sound.

There is another type of structure that is in some sense orthogonal to the similarity structure I just described, which one might call the “dissimilarity structure”. Natural sounds (apart from vocalizations) tend to have a smooth spectrum. Periodic sounds, on the other hand, have a discrete spectrum. Thus, in some sense, periodic sounds have a “surprisingly discontinuous” spectrum. Suppose for example that two auditory receptors respond to different but overlapping parts of the spectrum (e.g., two nearby points on the basilar membrane). Then one can usually predict the sensory input to the second receptor given the sensory input to the first receptor, because natural sounds tend to have a continuous spectrum. But this prediction would fail with a periodic sound. Periodic sounds are maximally surprising in this sense. The interesting thing about the dissimilarity structure of pitch is that it accounts for binaural pitch phenomena such as Huggins’ pitch: noise with flat spectrum is presented on both ears, and the interaural phase difference changes abruptly at a given frequency; a tone is perceived, with the pitch corresponding to that frequency.
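For concreteness, here is one simple recipe for such a dichotic pitch (a simplified variant in which the interaural phase is inverted over a narrow band around the boundary frequency; the original Huggins stimulus used a progressive phase transition, and all parameters below are illustrative). Each ear alone receives flat, featureless noise; only the comparison between the ears reveals the “surprising” band.

```python
import numpy as np

fs, dur = 44100, 1.0
f_edge, bw = 600.0, 0.16 * 600.0                  # boundary frequency and band width (Hz)
rng = np.random.default_rng(1)

noise = rng.standard_normal(int(fs * dur))
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(len(noise), 1 / fs)

band = (freqs > f_edge - bw / 2) & (freqs < f_edge + bw / 2)
spectrum_right = spectrum.copy()
spectrum_right[band] *= -1                        # interaural phase difference of pi in the band
left = noise                                      # each ear alone: flat noise
right = np.fft.irfft(spectrum_right, n=len(noise))
# played dichotically (left to one ear, right to the other), a faint tone near 600 Hz is heard
```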

Thus, pitch-evoking sounds simultaneously have two types of structure that distinguish them from other types of sounds: the similarity structure, which consists of different views of the acoustical signal that are unusually similar, and the dissimilarity structure, which consists of different views of the acoustical signal that are unusually dissimilar. The first type of structure corresponds to what I examined in my paper on computing with neural synchrony. It is important to notice that these two types of structure have a different nature. The similarity structure corresponds to a law that the sensory signals follow. Here the percept is associated with the specific law that these signals follow. The dissimilarity structure corresponds to the breaking of a law that sensory signals usually follow. Here the percept is associated with a law that is specific not to the presented sensory signals, but to the usual sensory signals. Thus we might relate the similarity structure to the notion of discovery, and the dissimilarity structure to the notion of surprise (and perhaps the term “structure” is not appropriate for the latter).

So far, I have only considered the structure of the acoustical signal, but one may also consider the sensorimotor structure of pitch. As I mentioned in another post, periodic sounds are generally produced by living beings, so it makes sense to examine these sounds from the viewpoint of their production. When one produces a pitch-evoking sound (for example a vowel, or when one sings), there is a very rich structure that goes beyond the acoustical structure. First, there is proprioceptive information about the vocal muscles and tactile information about the vibrations of the larynx, and both are directly related to the period of the sound. There is also the efference copy, i.e., the motor commands issued to make the vocal folds vibrate in the desired way. For a person who can produce sounds, pitch is then associated with a rich and meaningful sensorimotor structure. In fact, the sensorimotor theory of pitch perception would be that to perceive the pitch of a sound is, perhaps, to perceive the movements that would be required to produce such acoustical structure. An interesting aspect of this view is that it provides some meaning to the notion of how low or high a pitch-evoking sound is, by associating it with the state of the different elements involved in sound production. For example, producing a high sound requires increasing the tension of the vocal folds and moving the larynx up (higher!). One question then is whether congenitally mute people have a different perception of pitch.

Observe that, as for binaural hearing, the sensorimotor structure of pitch should not be understood as the relationship between motor commands and auditory signals, but rather as the relationship between motor commands and the structure of auditory signals (e.g. the periodicity). In this sense, it is higher-order structure.

What is sound? (IV) Ecological ontology of sounds

What kinds of sounds are there in the world? This is essentially the question William Gaver addresses in a very interesting paper (Gaver, 1993), in which he describes an ontology of sounds, categorized by the type of interaction. There are three categories: sounds made by solids, liquids and gases. An example of a sound made by liquid is dripping. There are also hybrid sounds, such as the rain falling on a solid surface. It makes sense to categorize sounds based on the nature of the objects because the mechanical events are physically very different. For example, in sounds involving solids (e.g. a footstep), energy is transmitted at the interface between two solids, which is a surface, and the volumes are put in motion (i.e., they are deformed). This is completely different for sounds involving gases, e.g. wind. In mechanical events involving solids, the shape is essentially unchanged (only transiently deformed). This is a sort of structural invariance that ought to leave a specific signature on the sounds (more on this in another post). Sounds made by gases, on the other hand, correspond to irreversible changes.

These three categories correspond to the physical nature of the sound producing substances. There are subcategories that correspond to the nature of the mechanical interaction. For example, a solid object can be hit or it can be scraped. The same object vibrates but there is a difference in the way it is made to vibrate. This also ought to produce some common structure in the auditory signals, as is explained in Gaver's companion article. For example, a vibrating solid object has modes of vibration that are determined by its shape (more on this in another post). These modes do not depend on the type of interaction with the object.
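A minimal modal-synthesis sketch in the spirit of Gaver's analysis (all numbers are made up): an impact excites a set of damped resonant modes; the mode frequencies stand for the object's shape and the decay rates for its material. The same modes are excited whatever the interaction, which is the invariance mentioned above.

```python
import numpy as np

fs = 44100
t = np.arange(int(0.8 * fs)) / fs

def impact(mode_freqs, decay_s):
    """Sum of exponentially damped sinusoids; decay_s stands for the material."""
    return sum(np.exp(-t / decay_s) * np.sin(2 * np.pi * f * t) for f in mode_freqs)

modes = [523.0, 1310.0, 2480.0]            # 'shape': the set of mode frequencies
wooden = impact(modes, decay_s=0.05)        # fast decay: dull, wood-like
metallic = impact(modes, decay_s=0.6)       # slow decay: ringing, metal-like
```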

Interactions that are localized in time produce impact sounds, while continuous interactions produce auditory textures. These are two very distinct types of sounds. Both have a structure, but auditory textures, it seems, only have a structure in a statistical sense (see McDermott & Simoncelli, 2011). Another kind of auditory texture is the type of sound produced by a river, for example. These sounds also have a structure in a statistical sense. An interesting aspect, in this case, is that these sounds are not spatially localized to a point, but they do have an auditory size (see my post on spatial hearing).

The examples I have described correspond to what Gaver calls "basic level events", elementary sounds produced by a single mechanical interaction. There are also complex events, which are composed of simple events. For example, a breaking sound is composed of a series of impact sounds. A bouncing sound is also composed of a series of impact sounds, but the temporal patterning is different, because it is lawful (predictable) in the case of a bouncing sound. Walking is yet another example of a series of impact sounds, which is also lawful, but it differs in the temporal patterning: it is approximately periodic.

Gaver only describes sounds made by non-living elements of the environment (except perhaps for walking). But there are also sounds produced by animals. I will describe them now. First, some animals can produce vocalizations. In Gaver's terminology, vocalizations are a sort of hybrid gas-solid mechanical event: periodic pulses of air make the vocal folds vibrate. The sound then resonates in the vocal tract, which shapes the spectrum of the sound (in a similar way as the shape of an object determines the resonating modes of impact sounds). One special type of structure in these sounds is the periodicity of the sound wave. The fact that a sound is periodic is highly meaningful, because it means that energy is continuously provided, and therefore that a living being is most likely producing it. There are also many other interesting aspects that I will describe in a later post.
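A very reduced source-filter sketch of this hybrid event (with illustrative values, a pulse train standing in for the glottal source and two resonances standing in for the vocal tract): the pulse rate sets the periodicity, the resonances shape the spectrum.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 120.0                               # sampling rate, glottal pulse rate (Hz)
t = np.arange(int(0.5 * fs))
source = (t % int(fs / f0) == 0).astype(float)      # periodic pulse train (the 'source')

def resonance(x, freq, bandwidth):
    """Second-order resonator standing in for one vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], x)

vowel = resonance(resonance(source, 700, 110), 1200, 110)   # rough /a/-like formants
```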

Animals also produce sounds by interacting with the environment. These are the same kinds of sounds as described by Gaver, but I believe there is a distinction. How can you tell that a sound has been produced by a living being? Apart from identifying specific sounds, I have two possible answers to provide. First, in natural non-living sounds, energy typically decays. This distinguishes walking sounds from bouncing sounds, for example. In a bouncing sound, the energy decreases at each impact. This means both the intensity of the sound and the interval between successive impacts decay. This is simply because a bouncing ball starts its movement with some potential energy, which can only decay. In a walking sound, roughly the same energy is brought at each impact, so it cannot be produced by the passive collision of two solids left to themselves; energy must be supplied at each step. Therefore, sounds contain a signature of whether they are produced by a continuous source of energy. But a river is also a continuous source of energy (and the same would apply to all auditory textures). Another specificity is that sounds produced by the non-living environment are governed by the laws of physics, and therefore they are lawful in a sense, i.e., they are predictable. A composite sound with a non-predictable pattern (even in a statistical sense) is most likely produced by a living being. In a sense, non-predictability is a signature of decision making. This remark is not specific to hearing.
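The bouncing/walking contrast can be made explicit with a little idealized physics (a point mass bouncing under gravity, no air resistance, made-up numbers): both the impact energy and the interval between impacts shrink geometrically for a bounce, while a walker injects roughly the same energy at a roughly constant interval.

```python
restitution = 0.7                        # fraction of speed kept after each bounce
g, v = 9.81, 3.0                         # gravity (m/s^2), speed at the first impact (m/s)

bounce_intervals, bounce_energies = [], []
for _ in range(6):
    v *= restitution                     # speed just after the bounce
    bounce_intervals.append(2 * v / g)   # time of flight until the next impact
    bounce_energies.append(0.5 * v**2)   # kinetic energy per unit mass at the next impact

walking_intervals = [0.6] * 6            # roughly periodic, same energy at each step
print([round(dt, 2) for dt in bounce_intervals])   # decaying intervals
print([round(e, 2) for e in bounce_energies])      # decaying energies
print(walking_intervals)                           # constant intervals
```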

These are specificities of sounds produced by living beings, as heard by another observer. But one can also hear self-produced sounds. There are two new specificities about these types of sounds. First, they also make the body vibrate, for example when a foot hits the ground. This produces sound waves with a specific structure. But more importantly, self-produced sounds have a sensorimotor structure. Scraping corresponds to a particular way in which one interacts with an object. The time of impact corresponds to the onset of the sound. The intensity of the sound is directly related to the energy with which an object is hit. Finally, the periodicity of vocalizations (i.e., the pitch) corresponds to the periodicity of self-generated air pulses through the vocal folds, and the formant frequencies correspond to the shape of the vocal tract. Self-generated sounds also have a multimodal structure. For example, they produce vibrations in the body that can be perceived by tactile receptors. In the next post, I will look at the structure of pitch.

What is sound? (III) Spatial hearing

In my previous post, I argued that the spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. This happens even though sounds contain highly structured information about shape, because this structure does not relate to self-generated movements. One may then wonder whether the notion of auditory space in general, for example the spatial location of a sound source, is also secondary. One may postulate that the spatial content of auditory spatial cues is only acquired by their contingency with visual spatial cues. In fact, this idea is supported by an intriguing study showing a very strong correlation across species between visual acuity in the fovea and auditory spatial acuity (Heffner & Heffner, 1992, Fig. 6). More precisely, the authors show that sound localization acuity is better predicted by visual acuity than by acoustical factors (essentially, interaural distance). In our interpretation, animals have poor sound localization acuity not so much because they lack the physiological mechanisms to correctly analyze spatial information, but because in the absence of precise vision, auditory spatial cues cannot acquire precise spatial content. This does not imply that the auditory system of these animals cannot decode these spatial cues, but only that they cannot make sense of them. [Update: the results in Heffner & Heffner are in fact more subtle, see a more recent post]

This being said, there is in fact some intrinsic spatial content in sounds, which I will describe now. When a sound is produced, it arrives first at the ear closer to the source, then at the other ear. The intensity will also be higher at the first ear. This is the binaural structure of sounds produced by a single source, and captured by two ears that are spatially separated. This is similar to stereoscopic vision. But observe one difference: in vision, as Gibson noted, having two eyes is essentially the same thing as having one eye, combined with lateral head movements; in hearing, this is not the same because of the non-persistent nature of sounds. If one turns the head to sample another part of the “acoustic array” (in analogy with Gibson’s optic array), the sound field will have changed already (and possibly faded out), so the spatial structure will not be directly captured in the same way. Thus, to capture spatial structure in sound, it is crucial that acoustic signals are simultaneously captured at different locations.

This binaural structure in sounds is often described as “spatial cues” (binaural cues). Quantitatively, there is a relationship between the spatial location of the source and binaural structure, e.g. the interaural time difference (ITD). However, these “cues” are not intrinsically spatial in the sense that they are not defined in relationship to self-generated movements. For example, what is the spatial meaning of an ITD of 100 µs? Intrinsically, there is none. As discussed above, one way for spatial cues to acquire spatial content is by association, i.e., with the spatial content of another modality (vision). But now I will consider the effects of self-generated movements, that is, what is intrinsically spatial in sounds.
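A two-line illustration of this lack of intrinsic meaning (simple sin-based model, with rough effective interaural distances): the very same ITD corresponds to quite different directions depending on a quantity, head size, that is not contained in the signal itself.

```python
import numpy as np

c, itd = 343.0, 100e-6                          # speed of sound, an ITD of 100 us
for name, d in [("cat-sized head", 0.08), ("human-sized head", 0.22)]:
    angle = np.degrees(np.arcsin(itd * c / d))
    print(f"{name}: 100 us corresponds to about {angle:.0f} deg from the midline")
```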

When the head turns, the binaural structure changes in specific ways. That is, there is a sensorimotor structure that gives spatial content to binaural structure. More precisely, two different binaural structures can be related to each other by a specific movement. But an important distinction with vision must be made. Because of the non-persistent nature of sounds, the relationship is not between movements and sensory signals, it is between movements and the structure of sensory signals. It is not possible to predict the auditory signals from auditory signals captured before a specific movement. For one thing, there might be no sound produced after the movement. What is predictable is the binaural structure of the sound, if indeed a sound is produced by a source that has a persistent location. If the location of the source is persistent, then the binaural structure is persistent, but not the auditory signals themselves.

Another point we notice is that this provides only a relative sense of space. That is, one can say whether a sound source is 20° to the left of another sound source, but it does not produce an absolute egocentric notion of space. What is lacking is a reference point. I will propose two ways to solve this problem.

What is special, for example, about a source that is in front of the observer? Gibson noted that, in vision, the direction in which the optic flow is constant indicates the direction of movement. Similarly, when one moves in the direction of a sound source, the direction of that sound source is unchanged, and therefore the binaural structure of sound is unchanged. In other words, the direction of a sound source in front is the direction of a self-generated movement that would leave the binaural structure unchanged (we could also extend this definition to the monaural spectral information). In fact the binaural structure can depend on distance, when the source is near, but this is a minor point because we can simply state that we are considering the direction that makes binaural structure minimally changed (see also the second way below). One problem with this, however, is that moving to and moving away from a source both satisfy this definition. Although these two cases can be distinguished by head movements, this definition does not make a distinction between what is moving closer and what is moving further away from the source. One obvious remark is that moving to a source increases the intensity of the sound. The notion of intensity here should be understood as a change in information content. In the same way as in vision where moving to an object increases the level of visual detail, moving to a sound source increases the signal-to-noise ratio, and therefore the level of auditory detail available. This makes sense independently of the perceptual notion of loudness – in fact it is rather related to the notion of intelligibility (a side note: this is consistent with the fact that an auditory cue to distance is the ratio between direct sound energy and reverberated energy). Of course again, because sounds are not persistent, the notion of change in level is weak. One needs to assume that the intensity of the sound persists. However, I do not think this is a critical problem, for even if intensity is variable, what is needed is only to observe how intensity at the ear correlates with self-generated movements. This is possible because self-generated movements are (or at least can be) independent of the intensity variations of the sound.
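Here is a rough 2D check of this definition (illustrative geometry, with the source bearing standing in for the binaural structure): among all walking directions, the ones that leave the bearing essentially unchanged are precisely “toward the source” and “away from the source”, which also exhibits the toward/away ambiguity mentioned in the text.

```python
import numpy as np

source = np.array([3.0, 4.0])                 # source position (m); listener at the origin
step = 0.5                                    # step length (m)

def bearing(position):
    v = source - position
    return np.degrees(np.arctan2(v[1], v[0]))

headings = np.arange(0.0, 360.0, 1.0)
moves = step * np.stack([np.cos(np.radians(headings)), np.sin(np.radians(headings))], axis=1)
changes = np.abs(np.array([bearing(m) for m in moves]) - bearing(np.zeros(2)))
print("headings leaving the bearing nearly unchanged:",
      headings[np.argsort(changes)[:2]], "deg")   # toward (~53 deg) and away (~233 deg)
```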

This indeed seems to provide some intrinsic spatial content to sounds. But we note that it is quite indirect (compared to vision), and made more evident by the fact that sounds are not persistent. There is another, more direct, way in which sounds can acquire spatial content: by the active production of sounds. For example, one can produce sounds by hitting objects. This provides a direct link between the spatial location of the object, relative to the body, and the auditory structure of the sound. Even though sounds are not persistent, they can be repeated. But we note that this can only apply to objects that are within reach.

This discussion shows that while there is no intrinsic spatial content about shape in sounds, there is intrinsic spatial content about source location. This seems to stand in contradiction with the discussion at the beginning of this post, in which I pointed out that spatial auditory acuity seems to be well predicted across species by visual acuity, suggesting that spatial content is acquired. Here is a possible way to reconcile these two viewpoints. In vision, an object at a specific direction relative to the observer will project light rays in that direction to the retina, which will be captured by specific photoreceptors. Therefore, there is little ambiguity in vision about spatial location. However, in hearing, this is completely different. Sounds coming from a particular direction are not captured by a specific receptor. Information about direction is in the structure of the signals captured at the two ears. The difficulty is that this structure depends on the direction of the sound source but also on other uncontrolled factors. For example, reflections, in particular early reflections, modify the binaural cues (Gourévitch & Brette 2012). These effects are deterministic but situation-dependent. This implies that there is no fixed mapping from binaural structure to spatial location. This makes the auditory spatial content weaker, even though auditory spatial structure is rich. Because visual location is more invariant, it is perhaps not surprising that it dominates hearing in localization tasks.

What is sound? (II) Sensorimotor contingencies

In a previous post, I emphasized the differences between vision and hearing from an ecological point of view. Here I want to comment on the sensorimotor theory of perception (O’Regan & Noë 2001), or the enactive approach, applied to sounds. According to this theory, perception is the implicit knowledge of the effects of self-generated movements on sensory signals. Henri Poincaré made this point a long time ago: "To localize an object simply means to represent to oneself the movements that would be necessary to reach it". For example, perceiving the spatial location of an object is knowing the movements one would have to make to move to that object, to grasp it, or to direct one’s fovea to it.

There are two implicit assumptions here: 1) that there is some persistence in the sensory signals, 2) that the relevant information is spatial in nature. I will start with the issue of persistence. As I previously argued, a defining characteristic of sounds is that they are not persistent: they happen. For example, the sound of someone else hitting an object is transient. One cannot interact with it, so there cannot be any sensorimotor contingency in this experience. It could be argued that one relies on the memory of previous sensorimotor contingencies, that is, the memory of having produced an impact sound oneself. This is a fair remark, I think, but it overestimates the amount of information there is in this contingency. When an impact sound is produced, the only relationships between the motor commands and the acoustic signal are the timing of the impact and the sound level (related to the strength of the impact). But there is much more information in the acoustic signal of an impact sound, because the structure of this signal is related to properties of the sounding object, in particular its material and shape (Gaver, 1993). For example, the resonant modes are informative of the shape, and the decay rate of these modes indicates the nature of the material (wood, metal, etc.), properties that we can very easily identify. So there is informative sensory structure independent of sensorimotor contingencies.
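As a toy illustration of this structure (in the spirit of Gaver’s analysis of impact sounds, but with mode ratios and damping values that I am inventing for the example), an impact sound can be synthesized as a sum of decaying modes: the mode frequencies scale with the size of the object, the decay rates depend on the material, and none of this is determined by the motor command that produced the impact.

```python
import numpy as np

SAMPLE_RATE = 44100

def impact_sound(size, decay, duration=0.5, n_modes=5):
    """Crude modal synthesis of an impact sound.

    size  -- scales all mode frequencies (a larger object has lower modes)
    decay -- damping in 1/s (fast decay for wood, slow ringing for metal)
    """
    t = np.arange(int(duration * SAMPLE_RATE)) / SAMPLE_RATE
    base = 440.0 / size                          # fundamental mode (Hz), arbitrary scaling
    modes = base * np.array([1.0, 2.3, 3.9, 5.1, 6.8])[:n_modes]   # made-up mode ratios
    amps = 1.0 / np.arange(1, n_modes + 1)       # weaker higher modes
    return sum(a * np.exp(-decay * t) * np.sin(2 * np.pi * f * t)
               for a, f in zip(amps, modes))

small_metal = impact_sound(size=0.5, decay=5.0)    # high modes, long ring
large_wood  = impact_sound(size=2.0, decay=60.0)   # low modes, fast decay
```

The impact timing and the overall level are the only parameters a motor command could set; the mode frequencies and decay rates are properties of the object itself.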

Now I think we are hitting an interesting point. Even though the resonant modes are informative of the shape (essentially, the size) of an object, they cannot provide any perceptual spatial content by themselves. The frequency of a resonant mode is just a number, and a number has no meaning without context. Compare with the notion of object size for the tactile system: the size of a (small) object is the extent to which one must stretch the hand to grasp it. There is no such thing in hearing; there is nothing intrinsically spatial in auditory size, it seems. If one moves and the sound is repeated, the same resonant modes will be excited. Therefore, it seems that auditory shape can only be a derived property. That is, the specific sensory structure of sounds that corresponds to shape acquires perceptual content by association with another sense that has intrinsic spatial content, such as vision or touch. Now we get to Gibson’s notion of invariant structure: auditory size is the structure in the auditory signals that remains the same when aspects other than size change (where the notion of size is not auditory). Here I am imagining that one hears sounds produced by various sources whose size is known, and one can identify some auditory structure that is the same for all sources of the same size. Note the important point: what persists is not the sensory signals, nor the relationship between movements and sensory signals, nor even the relationship between size and sensory signals; it is the relationship between size and the structure of the auditory signals, structure being itself a kind of relationship. That is, one cannot predict the auditory signals from the size: one can only predict some aspect of the structure of these signals from the size.

Here I have highlighted the fact that the auditory shape of an object is a structure of auditory signals, not a kind of sensorimotor structure. The spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. But there can also be intrinsic spatial content in sounds, and in my next post, I will discuss spatial hearing.

What is sound? (I) Hearing vs. seeing

What is sound? Physically, sounds are mediated by acoustic waves. But vision is mediated by light waves, and yet hearing does not feel like vision. Why is that?

There are two wrong answers to this question. The first one is that the neural structures are different: sounds are processed by the cochlea and the auditory cortex, images by the retina and the visual cortex. But then why doesn’t a sound evoke some sort of image, as if hearing were a second visual system? This point of view does not explain much about perception; it only tells us which brain areas “light up” when a specific type of stimulus is presented. The second one is that the physical substrate is different: light waves vs. acoustic waves. This is also a weak answer, for what is fundamentally different between light and acoustic waves that would make them “feel” different?

I believe the ecological approach provides a more satisfying answer. By this, I am referring to the ecological theory of visual perception developed by James Gibson. It emphasizes the structure of the sensory signals collected by an observer in an ecological environment. It is also related to the sensorimotor account of perception (O’Regan & Noë 2001), which puts the emphasis on the relationship between movements and sensory signals, but I will show below that this emphasis is less relevant in hearing (except in spatial hearing).

I will quickly summarize what vision is in Gibson’s ecological view. Illumination sources (the sun) produce light rays that are reflected by objects. More precisely, light is reflected at the surfaces of objects, that is, at their interface with the medium (air, or possibly water). What is available for visual perception are surfaces and their properties (color, texture, shape...). Both the illumination sources and the surfaces in the environment are generally persistent. The observer can move, and this changes the light rays received by the retina. But these changes are highly structured because the surfaces persist, and this structure is informative of the surfaces in the environment. Thus what the visual system perceives is the arrangement and properties of persistent surfaces. Persistence is crucial here, because it allows observers to use their own movements to learn about the world – in the sensorimotor account of perception, perception is precisely the implicit knowledge of the effect of one’s actions on sensory signals.

On the other hand, sounds are produced by the mechanical vibration of objects. This means that sounds convey information about volumes rather than surfaces: they depend on the shape of objects, but also on their material and internal structure. It also means that what is perceived in sounds is the source of the waves rather than their interaction with the environment. Crucially, contrary to vision, the observer cannot directly interact with sound waves, because a sound happens: it is not persistent. An observer can produce a sound wave, for example by hitting an object, but once the sound has been produced there is no further interaction possible with it. The observer cannot move around to analyze the structure of the acoustic signals; the only information available is in the sound signal itself. In this sense, sounds are events.

These ecological observations highlight major differences between vision and hearing, which go beyond the physical basis of the two senses (light waves vs. acoustic waves). Vision is the perception of persistent surfaces; hearing is essentially the perception of mechanical events involving volumes. These remarks are independent of the fact that vision is mediated by a retina and hearing by a cochlea.