Some propositions for future spatial hearing research (III) - The coding problem

In the previous posts, I have proposed that we should look at the ecological problem of sound localization, and that in terms of physiology we should go beyond tuning curves. However, if all of this is addressed, we still have a big problem. We are looking for “neural representations” or “codes”, but neural representations are observer-centric concepts that make little sense from the viewpoint of the organism, as I have discussed a few times before (for example there). Neural responses are not there to be read by some little homunculus, they are just neurons that are exciting other neurons, which you are not recording. Those other neurons are not “reading the code”, you are. Those neurons are just reacting instantly to the electrical stimulation of the neurons that constitute what we like to call “neural representation”.

Not everyone is receptive to the philosophical points, so let me just give one example. You could look at the responses of lots of binaural neurons and realize they have lots of different tunings. So you could suggest: maybe sound location is represented by the most active neurons. But someone else realizes that the average response of all those neurons varies gradually with sound location, so maybe sound location is actually encoded in the average response? Wait a minute: why throw all this information away? Maybe sound location is represented by the entire pattern of activity? The problem we are facing here is not that we don't know how to determine which one is true, but rather that all of these are true (the last one being trivially true). Yes, sound location is represented by the identity of the most active neurons, the average response and the pattern of activity: there is a mapping between sound location and those different features. That is, you, the external observer, can look at those features and guess what the sound location was. What is this supposed to prove?
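To make this concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration (Gaussian tuning curves, 19 preferred azimuths, a gain gradient across the population, all function names); none of it comes from data. It only checks that three different features of the same population response each map one-to-one onto sound location.

```python
import math

# Hypothetical population: 19 neurons with Gaussian tuning to azimuth (degrees).
PREFERRED = [-90 + 10 * i for i in range(19)]
WIDTH = 20.0
# Assumed gain gradient: neurons tuned further to the right fire more strongly.
GAINS = [20 + 0.3 * (p + 90) for p in PREFERRED]

def population_response(location):
    """Firing rate of every neuron for a source at `location`."""
    return [g * math.exp(-0.5 * ((location - p) / WIDTH) ** 2)
            for p, g in zip(PREFERRED, GAINS)]

def peak_feature(rates):
    """Preferred location of the most active neuron."""
    return PREFERRED[rates.index(max(rates))]

def mean_feature(rates):
    """Average firing rate across the population."""
    return sum(rates) / len(rates)

def pattern_decode(rates, templates):
    """Location whose stored pattern is closest to `rates`."""
    def distance(loc):
        return sum((a - b) ** 2 for a, b in zip(rates, templates[loc]))
    return min(templates, key=distance)

LOCATIONS = [-40, -20, 0, 20, 40]
TEMPLATES = {loc: population_response(loc) for loc in LOCATIONS}
PEAKS = [peak_feature(TEMPLATES[loc]) for loc in LOCATIONS]
MEANS = [mean_feature(TEMPLATES[loc]) for loc in LOCATIONS]
DECODED = [pattern_decode(TEMPLATES[loc], TEMPLATES) for loc in LOCATIONS]

for loc, peak, mean in zip(LOCATIONS, PEAKS, MEANS):
    print(f"location {loc:+4d}: peak neuron {peak:+4d}, mean rate {mean:6.2f}")
```

In this toy population, an observer can recover the location from any of the three features, which is exactly why finding such a mapping says little about what downstream neurons do with the activity.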

By focusing on neural representations, we are not looking at the right problem. What we want to know in the end is not so much how neural activity varies with various parameters of an experiment, but how neural activity constitutes the spatial percept, or perhaps more modestly, how it drives behavioral orientation responses. Now certainly looking at neural responses is a necessary first step, but we can't answer the interesting question if we stop there. So how can we answer the interesting question?

Well, I won't claim that I have a good answer because I think that's one of the major conceptual problems in systems neuroscience today. But one proposition that I think goes in the right direction is to do stimulations instead of, or in conjunction with, recordings. Ideally, those stimulations should be such as to trigger behaviors. Is average activity the important feature? Stimulate neurons at different places and you should see the same orientation response. Is the identity of active neurons important? With the same experiment, you should see different responses, varying systematically with stimulated neurons.

It's possible: it was actually done 45 years ago (Syka and Straschill, 1970). Electrical stimulation of the inferior colliculus with microelectrodes can trigger specific orienting responses. These days one could also probably do optogenetic stimulation. It's not going to be simple, but I think it's worth it.

Some propositions for future spatial hearing research (II) - Tuning curves

In the previous post, I proposed to look at the ecological problem of sound localization, rather than the artificial and computationally trivial problem that is generally addressed. As regards physiology, this means that a neural representation of sound location is a property of collective neural responses that is unchanged for the class of stimuli that produce the same spatial percept. This is not a property that you will find at a single neuron level. To give a sense of what kind of property I am talking about, consider the Jeffress model, a classic model of sound localization. It goes as follows: each neuron is tuned to a particular location, and there are a bunch of neurons with different tunings. When a sound is presented, you identify the most active neuron, and that tells you where the sound comes from. If it is the same neuron that is most active for different sounds coming from the same location, then you have the kind of representation I am talking about: the maximally active neuron is a representation of (specifically) sound location.

The Jeffress model actually has this kind of nice property (unlike competitors), but only when you see it as a signal processing model (cross-correlation) applied to an idealized acoustical situation where you have no head (ie two mics with just air between them). What we pointed out in a recent paper in eLife is that it loses that property when you consider sound diffraction introduced by the head; quite intriguingly, it seems that binaural neurons actually compensate for that (ie their tunings are frequency-dependent in the same way as interaural time differences are frequency-dependent).
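As a rough illustration of the signal processing view, the sketch below applies a cross-correlation readout to the idealized no-head situation, where the right signal is just a delayed copy of the left one. The delay values (in samples), the noise stimuli and all function names are assumptions chosen for illustration; the sketch does not model head diffraction, so it says nothing about the frequency-dependent case.

```python
import random

MAX_LAG = 20  # range of internal delays, in samples (assumed)

def white_noise(seed, n=2100, level=1.0):
    """Seeded Gaussian noise, so that the example is reproducible."""
    rng = random.Random(seed)
    return [level * rng.gauss(0, 1) for _ in range(n)]

def lowpass(signal, taps=4):
    """Crude low-pass filter (moving average) to get a different spectrum."""
    return [sum(signal[i:i + taps]) / taps
            for i in range(len(signal) - taps + 1)]

def two_mics(source, delay, n=2000):
    """No-head idealization: right[k] equals left[k - delay]."""
    start = MAX_LAG
    left = source[start:start + n]
    right = source[start - delay:start - delay + n]
    return left, right

def most_active_detector(left, right):
    """Jeffress-style readout: internal delay with the largest cross-correlation."""
    def activity(lag):
        return sum(left[k] * right[k + lag]
                   for k in range(MAX_LAG, len(left) - MAX_LAG))
    return max(range(-MAX_LAG, MAX_LAG + 1), key=activity)

DELAY = 7  # interaural delay of the source, in samples (assumed)
SOUNDS = {
    "white noise": white_noise(1),
    "low-pass noise": lowpass(white_noise(2)),
    "soft noise": white_noise(3, level=0.1),
}
WINNERS = {name: most_active_detector(*two_mics(sound, DELAY))
           for name, sound in SOUNDS.items()}
print(WINNERS)
```

In this sketch, the same detector wins for three sounds that differ in spectrum and level, which is the invariance property discussed above.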

But I want to discuss a more fundamental point that has to do with tuning curves. By “tuning curve”, I am referring to a measurement of how the firing rate of a neuron varies when one stimulus dimension is varied. Suppose that indeed you do have neurons that are tuned to different sound locations. Then you present a stimulus (of the same kind) and you look for the maximally active neuron. The tuning of that neuron should match the location of the presented stimulus. Right? Well, actually no. At least not in principle. That would be true if all tuning curves had exactly the same shape and peak value and only differed by a translation, or at least if the shape and magnitude were not correlated with tuning. But otherwise it's just an incorrect inference. If you don't see what I mean, look at this paper on auditory nerve responses. Usually one would show selectivity curves of auditory nerve fibers, ie firing rate vs. sound frequency for a bunch of fibers (note that auditory scientists also use “tuning curve” to mean something else, which is the minimum sound level that elicits a response vs. frequency). Here the authors show the data differently on Fig. 1: responses of all fibers along the cochlea for a bunch of frequencies. I bet that it is not what you would expect from reading textbooks on hearing. Individually, fibers are tuned to frequency. Yet you can't really pick the most active fiber and tell what sound frequency was presented. Actually there are different frequencies at which the response peaks at the same place. It's basically a mess. But that is what the auditory system gets when you present a sound: the response of the entire cochlea for one sound, not the response of one neuron to lots of different stimuli.
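Here is a toy example of that incorrect inference, as a minimal sketch. It assumes Gaussian tuning curves of equal width whose peak firing rate grows exponentially with preferred location; the numbers and function names are made up for illustration. Each neuron, taken alone, is tuned to its preferred location, yet the most active neuron for a given stimulus is systematically tuned 20 degrees away from it.

```python
import math

PREFERRED = list(range(-90, 91, 10))   # preferred locations in degrees (assumed)
WIDTH = 30.0                           # common tuning width (assumed)

def rate(stimulus, preferred):
    """Gaussian tuning whose peak value is correlated with tuning (assumed)."""
    gain = 20.0 * math.exp(preferred / 45.0)
    return gain * math.exp(-0.5 * ((stimulus - preferred) / WIDTH) ** 2)

def tuning_curve_peak(preferred, stimuli):
    """Single-neuron view: the stimulus that drives this neuron best."""
    return max(stimuli, key=lambda s: rate(s, preferred))

def most_active_tuning(stimulus):
    """Population view: tuning of the most active neuron for one stimulus."""
    return max(PREFERRED, key=lambda p: rate(stimulus, p))

STIMULI = [-40, -20, 0, 20, 40]
BIAS = [most_active_tuning(s) - s for s in STIMULI]
print(BIAS)  # [20, 20, 20, 20, 20]
```

The bias follows from maximizing log-rate over preferred location: the gain term contributes a slope of 1/45 and the Gaussian term pulls back with strength 1/WIDTH², so the maximum sits WIDTH²/45 = 20 degrees to the right of the stimulus.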

So, what about sound localization and binaural neurons, do we have this kind of problem or not? Well I don't know for sure because no one actually shows whether the shape of tuning curves varies systematically with tuning or not. Most of the time, one shows a few normalized responses and then extracts a couple of features of the tuning curves for each cell (ie the tuning in frequency and ITD) and shows some trends. The problem is we can't infer the population response from tunings unless we know quite precisely how the tuning curves depend on tuning. That is particularly problematic when tuning curves are broad, which is the case for the rodents used in many physiological studies.

I see two ways to solve this problem. One is to prove that there is no problem. You look at tuning curves, and you show that there is no correlation between tuning and any other characteristic of tuning curves (for example, calculate average tuning curves with the same tuning, and compare across tunings). That would be quite reassuring. My intuition: that will work in high frequency, maybe, or in the barn owl perhaps (quite narrow curves), but not in low frequency, and not for most cells in rodents (guinea pigs and gerbils).

If it doesn't work and there are correlations, then the problem will get quite complicated. You could think of looking for a parametric representation of the responses. It's a possibility and one might make some progress this way, but it might become quite difficult to do when you add extra stimulus dimensions (level etc). There is also the issue of gathering data from several animals, which will introduce extra variability.

The only clean way I see of dealing with this problem is to actually record the entire population response (or a large part of the structure). It sounds very challenging, but large-scale recording techniques are really progressing quite fast these days. Very dense electrode arrays, various types of imaging techniques; it's difficult but probably possible at some point.

Some propositions for future spatial hearing research (I) – The ecological situation and the computational problem

In these few posts, I will be describing my personal view of the kind of developments I would like to see in spatial hearing research. You might wonder: if this is any good, then why would I put it on my blog rather than in a grant proposal? Well, I have hesitated for a while but there are only so many things you can do in your life, and in the end I would just be glad if someone picked up some of these ideas and made some progress in an interesting direction. Some of them are pretty demanding both in terms of effort and expertise, which is also a reason why I am not likely to pursue all of these myself. And finally I believe in open science, and it would be interesting to read some comments or have some discussions. All this being said, I am open to collaboration on these subjects if one is motivated enough.

The basic question is: how do we (or animals) localize sounds in space? (this does not cover all of spatial hearing)

My personal feeling is that the field has made some real progress on this question but has now exploited all there is to exploit in the current approaches. In a nutshell, those approaches are: consider a restricted set of lab stimuli, typically a set of sounds that are varied in one spatial dimension, and look at how physiological and behavioral responses change when you vary that spatial parameter (the “coding” approach).

Let us start with what I think is the most fundamental point: the stimuli. For practical reasons, scientists want to use nice clean reproducible sounds in their experiments, for example tones and bursts of white noise. There are very good reasons for that. One is that if you want to make your results reproducible by your peers, then it's simpler to write that you used a 70 dB pure tone of frequency 1000 Hz than the sound of a mouse scratching the ground, even though the latter is clearly a more ecologically relevant sound for a cat. Another reason is that you want a clean, non-noisy signal both for reproducibility reasons and because you don't want to do lots of experiments. Finally, you typically vary just one stimulus parameter (e.g. azimuthal angle of the source) because that already makes a lot of experiments.

All of this is very sensible, but it means that in terms of the computational task of localizing a sound, we are actually looking at a really trivial task. Think about it as if you were to design a sound localization algorithm. Suppose all sounds are going to be picked up from a set of tones that vary along a spatial dimension, say azimuth. How would you do it? I will tell you how I would do it: measure the average intensity at the left ear, and use a table to map it to sound direction. Works perfectly. Obviously that's not what actual signal processing techniques do, and probably that's not what the auditory system does. Why not? Because in real life, you have confounding factors. With my algorithm, you would think loud sounds come from the left and soft sounds from the right. Not a good algorithm. The difficulty of the sound localization problem is precisely to locate sounds despite all the possible confounding factors, ie all the non-spatial properties of sounds. There are many of them: level, spectrum, envelope, duration, source size, source directivity, early reflections, reverberation, noise, etc. That's why it's actually hard and algorithms are not that good in ecological conditions. That is the ecological problem, but there is actually very little research on it (in biology). As I argued in two papers (one about the general problem and one applied to binaural neurons), the problem that is generally addressed is not the ecological problem of sound localization, but the problem of sensitivity to sound location, a much simpler problem.
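The left-ear algorithm is easy to write down, and so is its failure. The sketch below assumes a toy head shadow of plus or minus 6 dB at the left ear depending on azimuth; that figure, the 15-degree grid and the function names are invented for illustration and are not measured acoustics.

```python
import math

AZIMUTHS = list(range(-90, 91, 15))   # degrees, negative = left (assumed grid)

def left_ear_level(source_level_db, azimuth):
    """Toy head shadow (assumed): up to 6 dB gained or lost at the left ear."""
    return source_level_db - 6.0 * math.sin(math.radians(azimuth))

# Calibration on the restricted stimulus set: identical 70 dB tones.
TABLE = {az: left_ear_level(70.0, az) for az in AZIMUTHS}

def localize(level_at_left_ear):
    """Look up the azimuth whose calibrated level is closest."""
    return min(TABLE, key=lambda az: abs(TABLE[az] - level_at_left_ear))

# Works perfectly on the calibration set.
SAME_LEVEL = [localize(left_ear_level(70.0, az)) for az in AZIMUTHS]

# Fails as soon as a non-spatial property (level) changes: source straight ahead.
LOUD_ESTIMATE = localize(left_ear_level(76.0, 0))   # heard far left
SOFT_ESTIMATE = localize(left_ear_level(64.0, 0))   # heard far right
print(SAME_LEVEL == AZIMUTHS, LOUD_ESTIMATE, SOFT_ESTIMATE)
```

On the restricted stimulus set the lookup is error-free, so any measure of performance computed on that set alone would rate it as an excellent code for location.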

This state of affairs is very problematic in my opinion when it comes to understanding “neural representations” of sound location, or more generally, how the auditory system deals with sound location. For example, many studies have looked at the information content of neural responses and connected it with behavioral measurements. There are claims such as: this neuron's firing contains as much information about sound location as the entire organism. Other studies have claimed to have identified optimal codes for sound location, all based on the non-ecological approach I have just described. Sorry to be blunt, but: this is nonsense. Such claims would have been meaningful if we actually lived in a world of entirely identical sounds coming from different directions. And so in that world my little algorithm based on left ear intensity would probably be optimal. But we don't live in that world, and I would still not use the left-ear algorithm even if I encountered one of those sounds. I would use the algorithm that works in general, and not care so much about algorithms that are optimal for imaginary worlds.

What do we mean when we say that “neurons encode sound location”? Certainly we can't mean that neurons' responses are sensitive to location, ie they vary when you vary sound location, because that would be true of basically all neurons that respond to sounds. If this is what we mean, then we are just saying that a sizeable portion of the brain is sensitive to auditory stimuli. Not that interesting. I think we mean, or at least we should mean, that neurons encode sound location specifically, that is, there is something in the collective response of the neurons that varies with sound location and not with other things. This something is the “representation”, and its most basic property is that it does not change if the sound location percept does not change. Unfortunately that property cannot be assessed if all you ever vary in your stimulus is the spatial dimension, and so in a nutshell: current approaches based on restricted stimulus sets cannot, by construction, address the question of neural representations of sound location. They address the question of sensitivity – a prerequisite, but really quite far from the actual ecological problem.

So I think the first thing to do would be to start actually addressing the ecological problem. This means essentially inverting the current paradigm: instead of looking at how responses (physiological/behavioral) change when a spatial dimension is varied, look at how they change (or at what doesn't change) when non-spatial dimensions are varied. I would proceed in 3 steps:

1) Acoustics. First of all, what are the ecological signals? Perhaps surprisingly, no one has measured that systematically (as far as I know). That is, for an actual physical source at a given location, not in a lab (say in a quiet field, to simplify things), what do the binaural signals look like? What is the structure of noise? How do the signals vary over repetitions, or if you use a different source? One would need to do lots of recordings with different sources and different acoustic configurations (we have started to do that a little bit in the lab). Then we would start to have a reasonable idea of what the sound localization problem really is.

2) Behavior. The ecological problem of sound localization is difficult, but are we actually good at it? So far, I have not seen this question addressed in the previous literature. Usually, there is a restricted set of sounds, with high signal-to-noise ratio, often noises or clicks. So actually, we don't know how good we (or animals) are at localizing sounds in ecological situations. Animal behavior experiments are difficult, but a lot could be done with humans. There is some psychophysical research that tends to show that humans are generally not too much affected by confounding factors (eg level); it's a good starting point.

3) Physiology. As mentioned above, the point is to identify what in neural responses is specifically about sound location (or more precisely, perceived sound location), as opposed to other things. That implies varying not only the spatial dimension but also other dimensions. That's a problem because you need more experiments, but you could start with one non-spatial dimension that is particularly salient. There is another problem, which is that you are looking for stable properties of neuron responses, but it's unlikely that you will find that in one or a few neurons. So probably, you would need to record from many neurons (next post), and this gets quite challenging.

Next post is a criticism of tuning curves; and I'll end on stimulating vs. recording.


Update (6 Jan 2021): I am sharing a grant proposal on this subject. I am unlikely to do it myself, so feel free to reuse the ideas. I am happy to help if useful.

What is sound? (XVI) On the spatial character of tones

An intriguing fact about the pitch of tones is that we tend to describe it using spatial characteristics such as “high” and “low”. In the same way, we speak of a rising intonation when the pitch increases. A sequence of notes with increasing frequency played on a piano scale is described as going “up” (even though it is going right on a piano, and going down on a guitar). Yet there is nothing intrinsically spatial in the frequency of a tone. Why do we use these spatial attributes? An obvious possibility is that it is purely cultural: “high” and “low” are just arbitrary words that we happen to use to describe these characteristics of sounds. However, the following observations should be made:

- We use the terms low and high, which are also used for spatial height, and not specific words such as blurp and zboot. But we don’t use spatial words for colors and odors. Instead we use specific words (red, green) or sometimes words used for other senses (a hot color). Why use space and not something else?
- All languages seem to use more or less the same type of words.
- In an experiment done in 1930 by Carroll Pratt (“The spatial character of high and low tones”), subjects were asked to locate tones of various frequencies on a numbered scale running from the floor to the ceiling. The tones were presented through a speaker behind a screen, placed at random height. It turned out that the judgment of spatial height made by subjects was very consistent, but was entirely determined by tone frequency rather than actual source position. High frequency tones were placed near the ceiling, low frequency tones near the floor. The result was later confirmed in congenitally blind persons and in young children (Roffler & Butler, JASA 1968).

Thus, there is some support for the hypothesis that tones are perceived to have a spatial character, which is reflected in language. But why? Here I will just speculate widely and make a list of possibilities.

1. Sensorimotor hypothesis related to vocal production: when one makes sounds (sings or speaks), sounds of high pitch are felt to be produced higher than low pitch sounds. This could be related to the spatial location of tactile vibrations on the skin depending on fundamental frequency or timbre. Professional singers indeed use spatial words to describe where the voice “comes from” (which has no physical basis as such). This could be tested by measuring skin vibrations. In addition, congenitally mute people would be expected to show different patterns of tone localization.
2. Natural statistics: high frequency sounds tend to come from sources that are physically higher than low frequency sounds. For example, steps on the ground tend to produce low frequency sounds. Testing this hypothesis would require an extensive collection of natural recordings tagged with their spatial position. But note that the opposite trend is true for sounds produced by humans: adults have a lower voice than children, who are shorter.
3. Elevation-dependent spectral cues: to estimate the elevation of a sound source, we rely on spectral cues introduced by the pinnae. Indeed the convolutions of the pinnae introduce elevation-dependent notches in the spectrum. By association, the frequency of a tone would be associated with the spectral characteristics of a particular elevation. This could be tested by doing a tone localization experiment and comparing with individual head-related transfer functions.

What is sound? (XV) Footsteps and head scratching

When one thinks of sounds, the image that comes to mind is a speaker playing back a sound wave, which travels through air to the ears of the listener. But not all sounds are like that. I will give two examples: head scratching and footsteps.

When you scratch your head, a sound is produced that travels in the air to your ears. But there is another pathway: the sound is actually produced by the skull and the skin, and it propagates through the skull directly to the inner ear. This is called “bone conduction”. A lot of the early work on this subject was done by von Békésy (see e.g. Hood, JASA 1962). Normally, bone conduction represents a negligible part of sounds that we hear. When an acoustical wave reaches our head, the skull is put in vibration and can transmit sounds directly to the inner ear by bone conduction. But because of the difference in acoustical impedance between air and skin, the wave is very strongly attenuated, on the order of 60 dB according to these early works. It is actually the function of the middle ear to match these two impedances.

But in the case of head scratching, the sound is actually already produced on the skull, so it is likely that a large proportion of the sound is transmitted by bone conduction, if not most of it. This implies that sound localization cues (in particular binaural cues) are completely different from airborne sounds. For example, sound propagates faster (as in water) and there are resonances. Cues might also depend on the position of the jaw. There is a complete set of binaural cues that are specific to the location of scratching on the skull, which are directly associated with tactile cues. To my knowledge, this has not been measured. This also applies to chewing sounds, and also to the sound of one’s own voice. In fact, it is thought that the reason why one’s own voice sounds higher when it is played back is that our perception of our own voice relies on bone conduction, which transmits lower frequencies better than higher frequencies.

Let us now turn to footsteps. A footstep is a very interesting sound – not even mentioning the multisensory information in a footstep. When the ground is impacted, an airborne sound is produced, coming from the location of the impact. However, the ground is not a point source. Therefore when it vibrates, the sound comes from a larger piece of material than just the location of the impact. This produces binaural cues that are unlike those of sounds produced by a speaker. In particular, the interaural correlation is lower for larger sources, and you would expect that the frequency-dependence of this correlation depends on the size of the source (the angular width, from the perspective of the listener).
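A crude way to see the effect of source size on interaural correlation is sketched below. It assumes that an extended source can be treated as a sum of independent noise sources, each reaching the right ear with its own delay (in samples); this decomposition, the delay values and the function names are illustrative assumptions, not a model of an actual footstep.

```python
import math
import random

MAX_LAG = 10  # largest interaural delay considered, in samples (assumed)

def noise(seed, n=3000):
    """Seeded Gaussian noise standing for one small patch of the source."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

def binaural_mix(delays, n=2000):
    """Sum of independent patches, each with its own interaural delay."""
    left, right = [0.0] * n, [0.0] * n
    for k, delay in enumerate(delays):
        patch = noise(seed=k)
        for i in range(n):
            left[i] += patch[MAX_LAG + i]
            right[i] += patch[MAX_LAG + i - delay]
    return left, right

def interaural_coherence(left, right):
    """Peak of the normalized cross-correlation over the allowed lags."""
    inner = range(MAX_LAG, len(left) - MAX_LAG)
    norm = math.sqrt(sum(left[i] ** 2 for i in inner)
                     * sum(right[i] ** 2 for i in inner))
    return max(sum(left[i] * right[i + lag] for i in inner) / norm
               for lag in range(-MAX_LAG, MAX_LAG + 1))

POINT = interaural_coherence(*binaural_mix([3]))
EXTENDED = interaural_coherence(*binaural_mix([-6, -3, 0, 3, 6]))
print(f"point source {POINT:.2f}, extended source {EXTENDED:.2f}")
```

With a single patch the two ear signals are shifted copies of each other, so the peak correlation is close to 1; with five patches no single lag aligns more than one of them, and the peak drops accordingly.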

When you walk in a noisy street, you may notice that you can hear your own footsteps but not those of other people walking next to you, even though the distance of the feet might be similar. Why is that? In addition to the airborne sound, your entire skeleton vibrates. This implies that there should be a large component of the sound that you hear that is in fact coming from bone conduction through your body. Again these sounds should have quite peculiar binaural cues, in addition to having stronger low frequencies. In particular, there should be a different set of cues for the left foot and for the right foot.

You might also hear someone else’s footstep. In this case there is of course the airborne sound, but there is also another pathway, through the ground. Through this other pathway, the sound reaches your feet before the airborne sound, because sound propagates much faster in a solid substance than in air. Depending on the texture of the ground, higher frequencies would also be more attenuated. In principle, this vibration in your feet (perhaps if you are barefoot) will then propagate through your body to your inner ear. But it is not so clear how strong this bone conducted sound might be. Clearly it should be much softer than for your own footstep, since in that case there is an impact on your skeleton. But perhaps it is still significant. In this case, there are again different binaural cues, which should depend on the nature of the ground (since this affects the speed of propagation).

In the same way, sounds made by touching or hitting an object might also include a bone conducted component. It will be quite challenging to measure these effects, since ideally one should measure the vibration of the basilar membrane. Indirect methods might include: measurements on the skull (to have an idea of the magnitude), psychoacoustic methods using masking sounds, measuring otoacoustic emissions, electrophysiological methods (cochlear microphonics, ABR).

What is sound? (XIV) Are there unnatural sounds?

In a previous post, I argued that some artificial sounds might be wrongly presented as if they were not natural, because ecological environments are complex and so natural sounds are diverse. But what if they were actually not natural? Perhaps these particular sounds can be encountered in a natural environment, but there might be other sounds that can be synthesized and heard but that are never encountered in nature.

Why exactly do we care about this question? If we are interested in knowing whether these sounds exist in nature, it is because we hypothesize that they acquire a particular meaning that is related to the context in which they appear (e.g. a binaural sound with a large ITD is produced by a source located on the ipsilateral leading side). This is a form of objectivism: it is argued that if we subjectively lateralize a binaural sound with a 10 ms ITD to the right, it is because in nature, such a sound would actually be produced by a source located on the right. So in fact, what we are interested in is not only whether these sounds exist in nature, but also whether we have encountered them in a meaningful situation.

So have we previously encountered all the sounds that we subjectively localize? Certainly this cannot be literally true, for a new sound (e.g. in a new acoustical environment) could then never be localized. Therefore there must be some level of extrapolation in our perception. It cannot be that what we perceive is a direct reflection of the world. In fact, there is a relationship between this point and the question of inductivism in philosophy of science. Inductivism is the position that a scientific theory can be deduced from the facts. But this cannot be true, for a scientific theory is a universal statement about the world, and no finite set of observations can imply a universal statement. No scientific theory is ever “true”: rather, it agrees with a large body of data collected so far, and it is understood that any theory is bound to be amended or changed for a new theory at some point. The same can be said about perception, for example sound localization: given a number of past observations, a perceptual theory can be formed that relates some acoustical properties and the spatial location of the source. This implies that there should be sounds that have never been heard but that can still be associated with a specific source location.

Now we reach an interesting point, because it means that there may be a relationship between phenomenology and biology. When sounds are presented that deviate from the set of natural sounds, their perceived quality says something about the perceptual theory that the individual has developed. This provides some credit to the idea that the fact we lateralize binaural sounds with large ITDs might say something about the way the auditory system processes binaural sounds – but of course this is probably not the best example since it may well be in agreement with an objectivist viewpoint.

What is sound? (XIII) Loudness constancy

Perhaps the biggest puzzle in loudness perception is why a pure tone, or a stationary sound such as a noise burst, feels like it has constant loudness. Or more generally: why does a pure tone feel like it is a constant sound? (both in loudness and other qualities like pitch)

The question is not obvious because physically, the acoustical wave changes all the time. Even though we are sensitive to this change in the temporal fine structure of the wave, because for example it contributes to our perception of pitch, we do not hear it as a change: we do not hear the amplitude rising and falling. Only the envelope remains constant, and this is an abstract property of the acoustical wave. We could have chosen another property. For example, in models of the auditory periphery, it is customary to represent the envelope as a low-pass filtered version of the rectified signal. But this does not produce an exactly constant signal for pure tones.
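The last point is easy to check numerically. The sketch below uses full-wave rectification followed by a one-pole low-pass filter, which is one simple version of such an envelope extractor; the sampling rate, tone frequency and filter coefficient are arbitrary values chosen for illustration.

```python
import math

FS = 16000     # sampling rate in Hz (assumed)
FREQ = 500     # tone frequency in Hz (assumed)
ALPHA = 0.05   # smoothing coefficient of the one-pole low-pass filter (assumed)

def envelope(signal, alpha=ALPHA):
    """Full-wave rectification followed by a one-pole low-pass filter."""
    out, state = [], 0.0
    for x in signal:
        state += alpha * (abs(x) - state)
        out.append(state)
    return out

tone = [math.sin(2 * math.pi * FREQ * n / FS) for n in range(3200)]
steady = envelope(tone)[-320:]   # last 10 periods, well after the onset transient
mean_env = sum(steady) / len(steady)
ripple = (max(steady) - min(steady)) / mean_env
print(f"mean envelope {mean_env:.3f}, relative ripple {ripple:.1%}")
```

The extracted envelope keeps oscillating at twice the tone frequency around its mean. A stronger low-pass filter reduces the ripple but never removes it exactly, and it also smears genuine changes in level.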

Secondly, at the physiological level nothing is constant either for pure tones. The basilar membrane follows the temporal fine structure of the acoustical wave. The auditory nerve fibers fire at several hundred Hz. At low frequency they fire at specific phases of the tone. At higher frequency their firing seems more random. In both cases we hear a pure tone with a constant loudness. What is more, fibers adapt: they fire more at the onset of a tone, then their firing rate decreases with time. Yet we do not hear the loudness decreasing. On the other hand, when we strike a piano key, the level (envelope) of the acoustical wave decreases and we can hear this very distinctly. In both cases (pure tone and piano key) the firing rate of fibers decreases, but in one case we hear a constant loudness and in the other case a decreasing loudness.

Finally, it is not just that some high-level property of sound feels constant, but with a pure tone we are simply unable to hear any variation in the sound at all, whether in loudness or in any other quality.

This discussion raises the question: what does it mean that something changes perceptually? To (tentatively) answer this question, I will start with pitch constancy. A pure tone feels like it has a constant pitch. If its frequency is progressively increased, then we feel that the pitch increases. If the frequency remains constant, then the pure tone feels like a completely constant percept. We do not feel the acoustical pressure going up and down. Why? The pure tone has this characteristic property that from the observation of a few periods of the wave, it is possible to predict the entire future wave. Pitch is indeed associated with the periodicity of the sound wave. If the basis of what we perceive as pitch is this periodicity relationship, then as the acoustical wave unfolds, this relationship (or law) remains constantly valid and so the perceived pitch should remain constant. There is some variation in the acoustical pressure, but not in the law that the signal follows. So there is in fact some constancy, but at the level of the relationships or laws that the signal follows. I would propose that the pure tone feels constant because the signal never deviates from the perceptual expectation.
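One way to picture such a law, as a sketch only: a sampled pure tone satisfies x[n+1] = c·x[n] − x[n−1] with c = 2cos(ω), so the constant c can be estimated from about two periods and then used to predict every later sample. The choice of this particular recurrence as "the law", the tone parameters and the function names are assumptions for illustration; nothing here is a claim about how the auditory system does it.

```python
import math

OMEGA = 2 * math.pi * 440 / 8000   # 440 Hz tone sampled at 8 kHz (assumed values)

def fit_law(samples):
    """Least-squares estimate of c in the law x[n+1] = c * x[n] - x[n-1]."""
    inner = range(1, len(samples) - 1)
    num = sum((samples[n + 1] + samples[n - 1]) * samples[n] for n in inner)
    den = sum(samples[n] ** 2 for n in inner)
    return num / den

def surprise(samples, c):
    """Deviation of each new sample from the value predicted by the law."""
    return [abs(samples[n + 1] - (c * samples[n] - samples[n - 1]))
            for n in range(1, len(samples) - 1)]

def tone(n_samples, glide_start=None):
    """Pure tone; after `glide_start` the frequency rises by 0.1% per sample."""
    out, phase = [], 0.3
    for n in range(n_samples):
        out.append(math.sin(phase))
        gliding = glide_start is not None and n >= glide_start
        rate = 1.0 + 0.001 * (n - glide_start) if gliding else 1.0
        phase += OMEGA * rate
    return out

steady = tone(800)
gliding = tone(800, glide_start=400)
law = fit_law(steady[:40])   # about two periods of the tone
steady_surprise = max(surprise(steady, law))
glide_surprise_before = max(surprise(gliding, law)[:390])
glide_surprise_after = max(surprise(gliding, law)[400:])
print(steady_surprise, glide_surprise_before, glide_surprise_after)
```

The pressure of the steady tone goes up and down, but its deviation from the fitted law stays at the level of rounding error; the deviation only becomes appreciable once the frequency starts to change.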

This hypothesis about perceptual constancy implies several non-trivial facts: 1) how sensory signals are presented to the system (in the form of spike trains or acoustical signals) is largely irrelevant, if these specific aspects of presentation (or “projection”) can be included in the expectation; 2) signal variations are not perceived as variations if they are expected; 3) signal variations are not perceived if there is no expectation. This last point deserves further explanation. To perceive a change, an expectation must be formed prior to this change, and then violated: the variation must be surprising, and surprise is defined by the relation between the expectation (which can be precise or broad) and the variation. So if there is no expectation (that is, the expectation is very broad), then no variation can violate it and we cannot perceive variation.

From this hypothesis it follows that a completely predictable acoustical wave such as a pure tone should produce a constant percept. Let us come back to the initial problem, loudness constancy, and consider that the firing rate of auditory nerve fibers adapts. For a tone of constant intensity, the firing rate decays at some speed. For tones of increasing intensity, the firing rate might decay at slower speed, or even increase. For tones of decreasing intensity, the firing rate would decay faster. How is it that constant loudness corresponds to the specific speed of decay that is obtained for the tone of constant intensity, if the auditory system never has direct access to the acoustical signals?

Loudness constancy seems more difficult to explain than pitch constancy. I will start with the ecological viewpoint. In an ecological environment, many natural sounds are transient (e.g. impacts) and therefore do not have constant intensity. However, even though the intensity of an impact sound decays, its perceived loudness may not decay, i.e., it may be perceived as a single timed sound (e.g. a footstep). There are also natural sounds that are stationary and therefore have constant intensity, at least at a large enough timescale: a river, the wind. However, these sounds do not address the problem of neural adaptation, as adaptation only applies to sounds with a sharp onset. Finally, vocalizations have a sharp onset and slowly varying intensity (although this might be questionable). Thus, for a vocalization, the expected intensity profile is constant, and therefore it could be speculated that this explains the relationship between constant loudness and constant intensity, despite variations at the neurophysiological level.

A second line of explanation is related to the view of loudness as a perceptual correlate of intelligibility. A pure tone presented in a stationary background has constant intelligibility (or signal-to-noise ratio), and this fact is independent of any further (non-destructive) processing applied to the acoustical wave. Therefore, the fact that loudness is constant for a pure tone is consistent with the view that loudness primarily reflects the intelligibility of sounds.

What is sound? (XII) Unnatural binaural sounds

Some types of artificial sounds presented through headphones are sometimes described as not natural, in the sense that they have binaural relationships that sounds in a natural environment do not have. In general, this qualification refers to the qualities of point sources in an anechoic environment, but real environments reflect sounds and there are also more complex sound sources. I will discuss two types of “unnatural” binaural sounds.

1) Binaural noise with long interaural delays (ITD). In an anechoic environment, the ITD of a sound can reach 600-700 µs in high frequency for humans, and perhaps up to 800-900 µs in low frequency. Yet if we listen to binaurally delayed noise through headphones with an ITD of about 10 ms, we hear a single source, lateralized to one side. When the ITD is increased, starting from 0 µs, perceived lateralization progressively increases up to about 1 ms or a bit less, then reaches a plateau. We hear two separate noises only when the ITD is larger than about 10 ms. This is surprising because 10 ms is much larger than the maximal ITD that can be produced by a single sound source in an anechoic environment. However, let us consider a situation where there is an acoustically reflecting surface, a vertical wall, which is on our left side, about two meters away. A sound source is far on the opposite side. In this case, the right ear receives the direct wave from the source and the left ear receives the reflected wave. It follows that the ITD is about 10 ms. In addition, the direction of the sound source is consistent with the headphone experiments. Therefore, large ITDs may not be unlikely in natural environments, even with a simple point sound source.

2) Uncorrelated binaural noise. If acoustical noise is made of the addition of many independent sound sources, one would expect that the signals at the two ears are correlated in low frequency, that is, when the period is large compared to the maximum ITD of the sound sources. Are there sounds in an ecological environment that are binaurally uncorrelated? I would suggest the following situation. You are riding a bicycle, and you feel the wind in your face and in your ears. The sound of the wind in the ears does not feel localized anywhere else than at the ears. There is also little reason to believe that the pressure of the air is highly correlated at the two ears, except perhaps at very low frequency - although this should be measured. In fact, this acoustical situation correlates with mechanical pressure on the ears, which can be captured by tactile receptors. I would suggest that this type of sound is perceptually localized at the ears because of this ecological association.
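The wall example in point 1 can be checked with a back-of-the-envelope calculation. This is only a sketch under simplifying assumptions of mine: the speed of sound is taken as 343 m/s, the source is far enough that the wavefront is plane, and the extra path of the reflected wave is approximated by the round trip from the head to the wall and back.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def reflection_delay_ms(wall_distance_m, c=SPEED_OF_SOUND):
    """Delay (ms) of the wall-reflected wave relative to the direct wave,
    for a distant source on the side opposite the wall.
    Assumption: extra path = head -> wall -> head (twice the wall distance)."""
    extra_path_m = 2.0 * wall_distance_m
    return 1000.0 * extra_path_m / c

# Wall two meters away, as in the example above.
delay = reflection_delay_ms(2.0)
```

With these assumptions the delay comes out between 11 and 12 ms, which is the same order of magnitude as the 10 ms quoted for the headphone experiments, and more than ten times the largest anechoic ITD.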

In summary, acoustical situations are considerably diverse in ecological environments, and therefore there might be fewer “unnatural” sounds than often assumed.

What is sound? (XI) What is loudness?

In the previous post, I discussed proximal aspects of loudness, which depend on the acoustical wave at the ear. For example, when we say that a sound is too loud, we are referring to an unpleasant feeling related to the effect of acoustical waves sensed at the ear. That is, the same sound source would feel less loud if it were far from the ear.

But we can also perceive the “intrinsic” loudness of a sound source, that is, that aspect of loudness that is not affected by distance. This is a distal property, which I will call source loudness. The loudness of a sound can be defined as proximal or distal in the same way as a visual object has a size on the retina (proximal) and a physical size as an external object (distal).

First of all, what can possibly be meant by source loudness? We may consider that it is a perceptual correlate of an acoustical property of the sound source, for example the energy radiated by a sound source. The acoustical energy at a given point depends on the distance to the source through the inverse square law, but the total energy at a given distance (integrated on the whole surface of a sphere) is constant (neglecting reflections). However, we cannot sense this kind of invariant since it implies sampling the acoustical wave in the entire space (but see the last comments in this post).
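As a minimal numerical sketch of this point, assume an idealized point source radiating uniformly in all directions, with no reflections and no absorption (the radiated power below is a made-up value). The intensity at a point falls as the inverse square of distance, while the intensity integrated over the whole sphere stays equal to the radiated power.

```python
import math

def intensity(power_w, r_m):
    """Intensity (W/m^2) at distance r from an idealized omnidirectional
    point source in free field."""
    return power_w / (4.0 * math.pi * r_m ** 2)

def total_flux(power_w, r_m):
    """Intensity integrated over the whole sphere of radius r."""
    return intensity(power_w, r_m) * 4.0 * math.pi * r_m ** 2

power = 0.01  # hypothetical radiated power, in watts

# Doubling the distance divides the local intensity by four (about 6 dB)...
ratio = intensity(power, 1.0) / intensity(power, 2.0)
level_drop_db = 10.0 * math.log10(ratio)

# ...but the total over the sphere is the same at any distance.
flux_near = total_flux(power, 1.0)
flux_far = total_flux(power, 7.5)
```

The invariant (`flux_near` equals `flux_far`) only appears after integrating over the entire sphere, which is exactly what a listener sampling the field at two ears cannot do.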

An alternative is to consider that source loudness is that property of sound that does not vary with distance, and more generally with the acoustical environment. The problem with this definition is that it applies to all distal properties of sound (pitch, speaker identity, etc). The fact that we refer to source loudness using the word loudness suggests that there is a relationship between proximal and distal loudness. Therefore, we may consider that source loudness is that property of sound that is univocally related to (expected) proximal loudness at a reference location (say, at arm distance). Defined in this way, source loudness indeed does not depend on distance. Indirectly, it is a property of the sound field radiated by the source, although it is defined with reference to the perceptual system (since proximal loudness is not an intrinsic property of acoustical waves, as I noted in the previous post).

Another way to define source loudness involves action. For example, we can produce sounds by hitting different objects. The loudness of the sound correlates with the energy we put into the action. So we could imagine that source loudness corresponds to the energy we estimate necessary to produce the sound. This also gives a definition that does not depend on source distance. However, hitting the ground produces a sound that feels louder when the ground is hard (concrete) than when it is soft (grass, snow). Some of the energy we put into the mechanical interaction is dissipated and some is radiated, and it seems that only the radiated energy contributes to perceived loudness. Therefore, this definition is not entirely satisfying. I would not entirely discard it, because as we have seen loudness is not a unitary percept. It might also be relevant for speech: when we say that someone screams or speaks softly, we are referring to the way the vocal cords are excited. Thus, this is a distal notion of loudness that is object-specific.

So we have two notions of source loudness, which are invariant with respect to distance. One is related to proximal loudness; the other one is related to the mechanical energy required to produce the sound and is source-specific (in the sense that the source must be known). The next question is: what exactly in the acoustical waves is invariant with respect to distance? In Gibsonian terms, what is the invariant structure related to source loudness?

Let us start with the second notion. What in the acoustical wave specifies the energy put into the mechanical interaction? If this property is invariant to distance, then it should be invariant with respect to scaling the acoustical wave. It follows that such information can only be captured if the relationship between interaction strength and the resulting wave is nonlinear. Source loudness in this sense is therefore a measure of nonlinearity that is source-specific.
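The argument can be illustrated with a toy example. Everything in it is an assumption made for illustration: propagation is modeled as pure scaling by 1/r, the "linear source" simply scales a fixed waveform, and the "nonlinear source" uses an arbitrary saturating function (tanh). The listener is assumed to have access only to the shape of the wave, that is, the wave normalized by its peak.

```python
import math

def linear_source(strength, n=64):
    """Wave produced by a linear interaction: a fixed shape, scaled."""
    return [strength * math.sin(2 * math.pi * k / n) for k in range(n)]

def nonlinear_source(strength, n=64):
    """Wave produced by a hypothetical saturating (tanh) interaction."""
    return [math.tanh(strength * math.sin(2 * math.pi * k / n)) for k in range(n)]

def at_distance(wave, r):
    """Toy propagation model: distance only scales the wave by 1/r."""
    return [x / r for x in wave]

def shape(wave):
    """Scale-invariant description of the wave: normalize by the peak."""
    peak = max(abs(x) for x in wave)
    return [x / peak for x in wave]

def shape_difference(w1, w2):
    """Largest pointwise difference between two normalized waves."""
    return max(abs(a - b) for a, b in zip(shape(w1), shape(w2)))

# Linear source: weak and strong interactions give the same shape.
linear_diff = shape_difference(linear_source(1.0), linear_source(3.0))
# Nonlinear source: the shape changes with interaction strength...
nonlinear_diff = shape_difference(nonlinear_source(1.0), nonlinear_source(3.0))
# ...but not with distance.
distance_diff = shape_difference(at_distance(nonlinear_source(3.0), 1.0),
                                 at_distance(nonlinear_source(3.0), 10.0))
```

In the linear case a strong interaction heard from far away is indistinguishable from a weak one heard nearby. In the nonlinear case the shape of the wave carries information about the strength of the interaction that survives scaling.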

The first notion defines source loudness in relationship to proximal loudness at a reference distance. What in the acoustical wave specifies the proximal loudness that would be perceived at a different distance? One possibility is that this involves a highly inferential process: source loudness is first perceived in a source-specific way (previous paragraph), and then associated with proximal loudness. Another inferential process would be: distance is estimated using various cues, and then proximal loudness at a reference distance is inferred from proximal loudness at the current location. One such cue is the spectrum: the air absorbs high frequencies more than low frequencies, and therefore distant sounds have less high frequency content. Of course this is an ambiguous cue since spectrum at the ear also depends on the spectrum at the source, so it is only a cue in a statistical sense (i.e., given the expected spectral shape of natural sounds).

There is another possibility that was tested with psychophysical experiments in a very interesting study (Zahorik & Wightman (2001)). The subjects listen to noise bursts played from various distances at various intensities, and are asked to evaluate source loudness. The results show that 1) the evaluation of loudness does not depend on distance, 2) the scale of loudness depends on source intensity in the same way as for proximal loudness (loudness at the ears). This may seem surprising, since the sounds have no structure, they do not have the typical spectrum of natural sounds (which tend to decay as 1/f) and there is no nonlinearity involved. The key is that the sounds were presented in a reverberating environment (room). The authors propose that loudness constancy is due to the properties of diffuse fields. In acoustics, a diffuse field has the property that it is identical at all spatial locations within the environment. This is never entirely true of natural environments, but some reverberant environments are close to it. This implies that the reverberant part of the signal depends linearly on the source signal but does not depend on distance. Therefore, the reverberant part is invariant with respect to source location and can provide the basis for the notion of source loudness that we are considering. Reverberation preserves the spectrum of the source signal, but not the temporal envelope (which is blurred). However, we note that since reverberation depends on the specific acoustical environment, it is in principle only informative about the relative loudness of different sources; but it is important to observe that it allows comparisons between different types of sources.
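Here is a rough sketch of why the reverberant part can serve as an invariant. It is not the analysis of the study; it uses the textbook steady-state approximation for a room, in which the direct part follows the inverse square law and the diffuse reverberant part is proportional to the source power divided by a "room constant". The room constant and source powers below are made-up values, and both parts are in arbitrary units.

```python
import math

ROOM_CONSTANT = 50.0  # m^2, hypothetical value for a moderately reverberant room

def direct_part(source_power, distance_m):
    """Direct-path contribution: falls off with the inverse square law."""
    return source_power / (4.0 * math.pi * distance_m ** 2)

def reverberant_part(source_power, room_constant_m2=ROOM_CONSTANT):
    """Diffuse-field contribution: the same everywhere in the room."""
    return 4.0 * source_power / room_constant_m2

def direct_to_reverberant(distance_m, room_constant_m2=ROOM_CONSTANT):
    """Ratio of the two parts; the source power cancels out."""
    return room_constant_m2 / (16.0 * math.pi * distance_m ** 2)

# Same source heard from 1 m and from 8 m: the direct part changes a lot,
# the reverberant part does not depend on distance at all.
direct_near, direct_far = direct_part(1.0, 1.0), direct_part(1.0, 8.0)
reverb_any_distance = reverberant_part(1.0)

# Doubling the source power doubles the reverberant part (linear dependence).
reverb_doubled = reverberant_part(2.0)
```

In this approximation the reverberant part depends on the source and the room but not on where the source is, while the ratio of direct to reverberant energy depends on distance and the room but not on the source.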

Alternatively, the ratio between direct and reverberant energy provides a way to estimate the distance of the source, from which source loudness can be deduced. But we note that estimating the distance is in fact not necessary to estimate source loudness. The study does not mention cues due to early reflections on the ground. Indeed a reflection on the ground interferes with the direct signal at a specific frequency that is inversely proportional to the delay between direct and reflected signals. This could be a monaural or binaural cue to distance (Gourévitch & Brette 2012).
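The ground reflection cue can be sketched with simple geometry. The assumptions are mine: a flat, rigid ground that reflects without phase inversion, a speed of sound of 343 m/s, and made-up heights for the source and the ear. The reflected path is computed with the usual mirror-image construction, and the first destructive interference occurs at the frequency whose half period equals the delay.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def ground_reflection_delay(distance_m, source_height_m, ear_height_m,
                            c=SPEED_OF_SOUND):
    """Delay (s) between the direct wave and the wave reflected on a flat
    ground, using the mirror image of the source below the ground."""
    direct = math.hypot(distance_m, source_height_m - ear_height_m)
    reflected = math.hypot(distance_m, source_height_m + ear_height_m)
    return (reflected - direct) / c

def first_notch_hz(delay_s):
    """Lowest frequency at which direct and reflected waves cancel,
    assuming the reflection does not invert the phase."""
    return 1.0 / (2.0 * delay_s)

# Hypothetical geometry: source 1 m above the ground, ear at 1.5 m.
notch_near = first_notch_hz(ground_reflection_delay(2.0, 1.0, 1.5))
notch_far = first_notch_hz(ground_reflection_delay(10.0, 1.0, 1.5))
```

As the source moves away, the two paths become more similar, the delay shrinks and the notch moves up in frequency, which is what makes it a potential cue to distance.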

To conclude this post, we have seen that loudness actually encompasses several distinct notions:

1) a proximal notion that is related to intelligibility (as in “not loud enough”), and therefore to the relationship between the signal of interest and the background, considered as a distracter;

2) a proximal notion that is related to biological responses to the acoustical signal (as in “too loud”), which may (speculatively) be numerous (energy consumption, risk of cochlear damage, startle reflex);

3) a distal notion that relates to the mechanical energy involved in producing the sound (a sensorimotor notion), which is source-specific;

4) a distal notion that relates to the sound field radiated from the source, independently of the distance, which may be defined as the expected proximal loudness at a reference distance.

What is sound? (X) What is loudness?

At first sight, it seems obvious what loudness is. A sound is loud when the acoustical wave carries a lot of energy. But if we think about it in detail, we quickly encounter difficulties. One obvious thing is that if we play the same sound at different levels, then clearly the feeling of loudness directly correlates with the amplitude of the sound, and therefore with the energy of the sound. But how about if we play two completely different sounds? Which one is louder? Should we consider the total energy? Probably not, because this would introduce a confusion with duration (the longer sound has more energy). So perhaps the average energy? But then what is the average energy of an impact sound, and how does it compare with a tone? Also, how about low sounds and high sounds, is there the same relationship between energy and loudness for both sounds? And does a sound feel as loud in a quiet environment as in a noisy environment? Does it depend on what sounds were played before?

I could go on indefinitely, but I have made the point that loudness is a complex concept, and its relationship with the acoustic signal is not straightforward at all.

Let us see what can be said about loudness. First of all, we can say that a sound is louder than another sound, even if the two sounds are completely different. This may not be true of all pairs of sounds, but certainly I can consider that a low amplitude tone is weak compared to the sound made by a glass breaking on the floor. So certainly there seems to be an order relationship in loudness, although perhaps partial. Also, it is true that scaling the acoustical wave has the effect of monotonically changing the loudness of the sound. So there is definitely a relationship with the amplitude, but only in that scaling sense: it is not determined by simple physical quantities such as the peak pressure or the total energy.

Now it is interesting to think for a while about the notion of a sound being “not loud enough” and of a sound being “too loud”, because it appears that these two phrases do not refer to the same concept. We say that a sound is “not loud enough” when we find it hard to hear, when it is difficult to make sense of it. For example we ask someone to speak louder. Thus this notion of loudness corresponds to intelligibility, rather than acoustical energy. In particular, this is a relative notion, in the sense that intelligibility depends on the acoustical environment – background noise, other sources, reverberation, etc.

But saying that a sound is “too loud” refers to a completely different concept. It means that the sound produces an uncomfortable feeling because of its intensity. This is unrelated to intelligibility: someone screaming may produce a sound that is “too loud”, but two people screaming would also produce a sound that is “too loud”, even though intelligibility decreases. Therefore, there are at least two different notions regarding loudness: a relative notion related to intelligibility, and a more absolute one related to an unpleasant or even painful feeling. Note that it can also be said that a sound is too loud in the sense of intelligibility. For example, it can be said that the TV is too loud because it makes it hard to understand someone speaking to us. So the notion of loudness is multiform, and therefore cannot be mapped to a single scale.
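The example of the two screaming people can be put into numbers with a toy calculation. The powers are arbitrary made-up units, and I assume that uncorrelated sources add in power and that, for a listener trying to understand the first person, the second one counts as background.

```python
import math

def level_db(power, reference=1.0):
    """Overall level in dB relative to an arbitrary reference power."""
    return 10.0 * math.log10(power / reference)

def snr_db(target_power, background_power):
    """Signal-to-noise ratio of the target in dB."""
    return 10.0 * math.log10(target_power / background_power)

scream = 100.0           # hypothetical power of one screaming person
quiet_background = 1.0   # hypothetical background power

# One person screaming in a quiet background.
one_level = level_db(scream + quiet_background)
one_snr = snr_db(scream, quiet_background)

# Two people screaming: the second one is background for the first.
two_level = level_db(2.0 * scream + quiet_background)
two_snr = snr_db(scream, scream + quiet_background)
```

The overall level goes up with the second screamer while the signal-to-noise ratio of the first collapses, so the two notions of loudness move in opposite directions.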

Loudness as in “not loud enough” (intelligibility) is rather simple to understand. If the signal-to-noise ratio is too low, then it is more difficult to extract the relevant information from the signal, and this is what is meant by “not loud enough”. Of course there are subtleties and the relationship between the acoustical signals and intelligibility is complex, but at least it is relatively clear what it is about. In contrast, it is not so straightforward what “too loud” means. Why would a sound be unpleasant because the acoustical pressure is large?

First of all, what does it mean that something is unpleasant or painful? Something unpleasant is something that we want to avoid. But this is not a complete characterization: it is not only a piece of information that is taken into account in decision making; it has the character of an uncontrollable feeling, something that we cannot avoid being subjected to. In other words, it is an emotion. Being controlled by this emotion means acting so as to escape the unpleasant sound, for example, by putting one’s hands on the ears. Consciously trying not to act in such a way would be considered as “resisting” this emotion. This terminology implies that loudness (as in “too loud”) is an involuntary avoidance reaction of the organism to sounds, one that implies attenuating the sounds. Therefore, loudness is not only about objective properties of the external world, but also about our biological self, or more precisely about the effect of sounds on our organism.

Why would a loud sound trigger an avoidance reaction? We can speculate on different possibilities.

1) A loud sound may indicate a threat. There is indeed a known reflex called the “startle reflex”, with a latency of around 10 ms (Yeomans and Frankland, Brain Research Reviews 1996). In response to sudden unexpected loud sounds, there is an involuntary contraction of muscles, which stiffens in particular the neck during a brief period of time. The reflex is found in all mammals and involves a short pathway in the brainstem. It is also affected by previous sounds and emotional state. However, this reflex only involves a small subset of sounds, which are sudden and normally very loud (over 80 dB).

2) A very loud sound can damage the cochlea (destroy hair cells). At very high levels, it can even be painful. Note that a moderately loud sound can also damage the cochlea if it lasts long. Thus, the feeling of loudness could be related to the emotional reaction aimed at avoiding damage to the cochlea. Note that while cochlear damage depends on duration, loudness does not. That is, a continuous pure tone seems just as loud at the beginning as 1 minute into it, and yet because damage depends on continuous exposure, an avoidance reaction should be more urgent in the latter case than in the former case. Even for very loud sounds, the feeling of loudness does not seem to increase with time: it may seem more and more urgent to avoid the sound, but it does not feel louder. We can draw two conclusions: 1) the feeling of loudness, or of a sound being too loud, cannot correspond to an accurate biological measurement of potential cochlear damage, as it seems to have a feeling of constancy when the sound is stationary; 2) the feeling of a sound being “too loud” probably doesn’t correspond to the urgency of avoiding that sound, since this urgency can increase (emotionally) without a corresponding increase in loudness. It could be that the emotional content (“too loud”) comes in addition to the perceptual content (a certain degree of loudness), and that only the latter is constant for a stationary sound.

3) Another possibility is that loudness correlates with the energy consumption of the auditory periphery (possibly of the auditory system in general). Indeed when the amplitude of an acoustical wave is increased, the auditory nerve fibers and most neurons in the auditory system fire more. Brain metabolism is tightly regulated, and so it is not at all absurd to postulate that there are mechanisms to sense the energy consumption due to a sound. However, this is not a very satisfying explanation of why a sound would feel “too loud”. Indeed why would the organism feel an urge to avoid a sound because it incurs a large energy consumption, when there could be mechanisms to reduce that consumption?

In this post, I have addressed two aspects of loudness: intelligibility (“not loud enough”) and emotional content (“too loud”). These two aspects are “proximal”, in the sense that they are determined not so much by the sound source as by the acoustical wave at the ear. In the next post, I will consider distal aspects of loudness, that is, those aspects of loudness that are determined by the sound source.