What is sound? (V) The structure of pitch

Musical notes have a particular perceptual quality called “pitch”. Pitch is the percept corresponding to how low or high a musical note is. Vowels also have a pitch. To a large extent, the pitch of a periodic sound corresponds to its repetition rate. The important point is that what matters for pitch is the periodicity more than the frequency content. For example, a periodic sound with repetition rate f0 has frequency components at multiples of f0 (n·f0), which are called harmonics. A pure tone of frequency f0 and a complex tone with all harmonics except the first one, i.e., one that does not contain the frequency component f0 at all, will evoke the same pitch. It is in fact a little more complex than that; there are many subtleties, but I will not enter into these details in this post. Here I simply want to describe the kind of sensory or sensorimotor structure there is in pitch. It turns out that pitch has a surprisingly rich structure.

The most obvious type of structure is periodicity. Pitch-evoking sounds have the very specific property that the acoustical wave is unchanged when temporally shifted by some delay. This delay is characteristic of the sound’s pitch (i.e., same period means same pitch). This is the type of structure that is emphasized in temporal theories of pitch. It is what I call the “similarity structure” of the acoustical signal, and the notion can in fact be extended to account for a number of interesting phenomena related to pitch. But this is work in progress, so I will discuss it further at a later time.
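This shift-invariance can be checked directly on a signal. Below is a minimal sketch in pure Python; the fundamental frequency, sampling rate, and number of harmonics are arbitrary choices for illustration. A harmonic complex correlates perfectly with a copy of itself shifted by one period, but not with a copy shifted by some other delay.

```python
import math

def sample_tone(t, f0=200.0):
    # Harmonic complex with fundamental f0: sum of a few harmonics.
    return sum(math.sin(2 * math.pi * n * f0 * t) / n for n in range(1, 5))

fs = 16000.0            # sampling rate (Hz); an arbitrary choice
f0 = 200.0              # repetition rate (Hz), period = 5 ms
period = int(fs / f0)   # period in samples (80 samples here)
x = [sample_tone(i / fs, f0) for i in range(2000)]

def similarity_at_lag(x, lag):
    # Normalized correlation between the signal and its shifted copy.
    pairs = list(zip(x[:-lag], x[lag:]))
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
    return num / den

print(similarity_at_lag(x, period))       # ~1: a shift by one period leaves the wave unchanged
print(similarity_at_lag(x, period // 2))  # clearly below 1: a half-period shift does not
```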

Another way to look at periodic sounds is to realize that a periodic sound is predictable. That is, after a couple of periods, one can predict the future acoustical wave. Compared to most other sounds, periodic sounds have a very high degree of predictability. Perhaps the perceptual strength of pitch (which depends on a number of factors) is related to the degree of predictability of the sound.

There is another type of structure that is in some sense orthogonal to the similarity structure I just described, which one might call the “dissimilarity structure”. Natural sounds (apart from vocalizations) tend to have a smooth spectrum. Periodic sounds, on the other hand, have a discrete spectrum. Thus, in some sense, periodic sounds have a “surprisingly discontinuous” spectrum. Suppose for example that two auditory receptors respond to different but overlapping parts of the spectrum (e.g., two nearby points on the basilar membrane). Then one can usually predict the sensory input to the second receptor given the sensory input to the first one, because natural sounds tend to have a continuous spectrum. But this prediction fails for a periodic sound. Periodic sounds are maximally surprising in this sense. The interesting thing about the dissimilarity structure of pitch is that it accounts for binaural pitch phenomena such as Huggins’ pitch: noise with a flat spectrum is presented to both ears, and the interaural phase difference changes abruptly at a given frequency; a tone is perceived, with a pitch corresponding to that frequency.
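The Huggins stimulus is easy to sketch in code. In the following illustration (pure Python; the sampling rate, component spacing, transition frequency, and band width are all arbitrary choices), noise is built as a sum of random-phase sinusoids. Each ear alone receives spectrally flat noise; the only binaural oddity is a phase flip in a narrow band around 600 Hz, which is exactly the abrupt interaural change described above.

```python
import math, random

random.seed(0)
fs = 8000.0        # sampling rate (Hz); an arbitrary choice for this sketch
n = 800            # 0.1 s of signal
f_pitch = 600.0    # frequency of the interaural phase transition (the perceived pitch)
band = 30.0        # width of the phase-flipped band (Hz); also an arbitrary choice

# Noise built as a sum of sinusoids with random phases (spectrally flat).
freqs = list(range(100, 3000, 4))
phases = {f: random.uniform(0, 2 * math.pi) for f in freqs}

def ear_signal(t, flipped):
    total = 0.0
    for f in freqs:
        phi = phases[f]
        # In one ear, flip the phase of the components near f_pitch.
        if flipped and abs(f - f_pitch) < band / 2:
            phi += math.pi
        total += math.sin(2 * math.pi * f * t + phi)
    return total

left = [ear_signal(i / fs, False) for i in range(n)]
right = [ear_signal(i / fs, True) for i in range(n)]

# Each ear alone is flat noise; the binaural difference is confined
# to the narrow band around f_pitch, where a tone is heard.
diff_power = sum((l - r) ** 2 for l, r in zip(left, right))
left_power = sum(l * l for l in left)
print(diff_power / left_power)  # small but nonzero
```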

Thus, pitch-evoking sounds simultaneously have two types of structure that distinguish them from other types of sounds: the similarity structure, which consists of different views of the acoustical signal that are unusually similar, and the dissimilarity structure, which consists of different views of the acoustical signal that are unusually dissimilar. The first type of structure corresponds to what I examined in my paper on computing with neural synchrony. It is important to notice that these two types of structure have a different nature. The similarity structure corresponds to a law that the sensory signals follow: the percept is associated with the specific law that these signals follow. The dissimilarity structure corresponds to the breaking of a law that sensory signals usually follow: here the percept is associated with a law that is specific not to the presented sensory signals, but to the usual sensory signals. Thus we might relate the similarity structure to the notion of discovery, and the dissimilarity structure to the notion of surprise (and perhaps the term “structure” is not appropriate for the latter).

So far, I have only considered the structure of the acoustical signal, but one may also consider the sensorimotor structure of pitch. As I mentioned in another post, periodic sounds are generally produced by living beings, so it makes sense to examine these sounds from the viewpoint of their production. When one produces a pitch-evoking sound (for example a vowel, or when one sings), there is a very rich structure that goes beyond the acoustical structure. First, there is proprioceptive information about the vocal muscles and tactile information about the vibrations of the larynx, and both are directly related to the period of the sound. There is also the efference copy, i.e., the motor commands issued to make the vocal folds vibrate in the desired way. For a person who can produce sounds, pitch is then associated with a rich and meaningful sensorimotor structure. In fact, a sensorimotor theory of pitch perception would be that to perceive the pitch of a sound is, perhaps, to perceive the movements that would be required to produce such acoustical structure. An interesting aspect of this view is that it gives some meaning to the notion of how low or high a pitch-evoking sound is, by associating it with the state of the different elements involved in sound production. For example, producing a high sound requires increasing the tension of the vocal folds and moving the larynx up (higher!). One question, then, is whether congenitally mute people have a different perception of pitch.

Observe that, as for binaural hearing, the sensorimotor structure of pitch should not be understood as the relationship between motor commands and auditory signals, but rather as the relationship between motor commands and the structure of auditory signals (e.g., the periodicity). In this sense, it is higher-order structure.

What is sound? (IV) Ecological ontology of sounds

What kinds of sounds are there in the world? This is essentially the question William Gaver addresses in a very interesting paper (Gaver, 1993), in which he describes an ontology of sounds categorized by the type of interaction. There are three categories: sounds made by solids, liquids, and gases. An example of a sound made by a liquid is dripping. There are also hybrid sounds, such as rain falling on a solid surface. It makes sense to categorize sounds based on the nature of the objects involved, because the mechanical events are physically very different. For example, in sounds involving solids (e.g., a footstep), energy is transmitted at the interface between two solids, which is a surface, and the volumes are set in motion (i.e., they are deformed). This is completely different for sounds involving gases, e.g., wind. In mechanical events involving solids, the shape is essentially unchanged (only transiently deformed). This is a sort of structural invariance that ought to leave a specific signature on the sounds (more on this in another post). Sounds made by gases, on the other hand, correspond to irreversible changes.

These three categories correspond to the physical nature of the sound producing substances. There are subcategories that correspond to the nature of the mechanical interaction. For example, a solid object can be hit or it can be scraped. The same object vibrates but there is a difference in the way it is made to vibrate. This also ought to produce some common structure in the auditory signals, as is explained in Gaver's companion article. For example, a vibrating solid object has modes of vibration that are determined by its shape (more on this in another post). These modes do not depend on the type of interaction with the object.

Interactions that are localized in time produce impact sounds, while continuous interactions produce auditory textures. These are two very distinct types of sounds. Both have structure, but auditory textures, it seems, only have structure in a statistical sense (see McDermott & Simoncelli, 2011). Another kind of auditory texture is the type of sound produced by a river, for example. These sounds also have structure in a statistical sense. An interesting aspect, in this case, is that these sounds are not spatially localized, but they do have an auditory size (see my post on spatial hearing).

The examples I have described correspond to what Gaver calls "basic level events", elementary sounds produced by a single mechanical interaction. There are also complex events, which are composed of simple events. For example, a breaking sound is composed of a series of impact sounds. A bouncing sound is also composed of a series of impact sounds, but the temporal patterning is different, because it is lawful (predictable) in the case of a bouncing sound. Walking is yet another example of a series of impact sounds, which is also lawful, but it differs in the temporal patterning: it is approximately periodic.
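The lawful temporal patterning of a bounce can be made explicit with elementary mechanics. In this sketch (an idealized point ball, with an assumed drop height and coefficient of restitution), each inter-impact interval is a fixed fraction of the previous one, producing the characteristic accelerating, decaying rhythm that distinguishes bouncing from the roughly periodic pattern of walking.

```python
import math

g = 9.81    # gravity (m/s^2)
h0 = 1.0    # initial drop height (m); an assumed value
e = 0.8     # coefficient of restitution; an assumed value

def impact_times(n_impacts):
    # Times of successive impacts of a ball dropped from height h0.
    t = math.sqrt(2 * h0 / g)   # time of the first impact
    v = math.sqrt(2 * g * h0)   # speed at the first impact
    times = [t]
    for _ in range(n_impacts - 1):
        v *= e                  # energy is lost at each impact
        t += 2 * v / g          # flight time until the next impact
        times.append(t)
    return times

times = impact_times(6)
intervals = [b - a for a, b in zip(times, times[1:])]
ratios = [b / a for a, b in zip(intervals, intervals[1:])]
# Each inter-impact interval is e times the previous one: a lawful,
# decaying pattern, the signature of a passive source of energy.
print(ratios)  # every ratio equals e (0.8)
```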

Gaver only describes sounds made by non-living elements of the environment (except perhaps for walking). But there are also sounds produced by animals. I will describe them now. First, some animals can produce vocalizations. In Gaver's terminology, vocalizations are a sort of hybrid gas-solid mechanical event: periodic pulses of air make the vocal folds vibrate. The sound then resonates in the vocal tract, which shapes the spectrum of the sound (in a similar way as the shape of an object determines the resonating modes of impact sounds). One special type of structure in these sounds is the periodicity of the sound wave. The fact that a sound is periodic is highly meaningful, because it means that energy is continuously provided, and therefore that a living being is most likely producing it. There are also many other interesting aspects that I will describe in a later post.

Animals also produce sounds by interacting with the environment. These are the same kinds of sounds as described by Gaver, but I believe there is a distinction. How can you tell that a sound has been produced by a living being? Apart from identifying specific sounds, I have two possible answers. First, in natural non-living sounds, energy typically decays. This distinguishes walking sounds from bouncing sounds, for example. In a bouncing sound, the energy decreases at each impact: both the intensity of the sound and the interval between successive impacts decay. This is simply because a bouncing ball starts its movement with some potential energy, which can only decrease. In a walking sound, roughly the same energy is delivered at each impact, so the sound cannot be produced by the passive collision of two solids. Therefore, sounds contain a signature of whether they are produced by a continuous source of energy. But a river is also a continuous source of energy (and the same applies to all auditory textures). Another specificity is that sounds produced by the non-living environment are governed by the laws of physics, and therefore they are lawful in a sense, i.e., predictable. A composite sound with a non-predictable pattern (even in a statistical sense) is most likely produced by a living being. In a sense, non-predictability is a signature of decision making. This remark is not specific to hearing.

These are specificities of sounds produced by living beings, as heard by another observer. But one can also hear self-produced sounds. There are two new specificities about these types of sounds. First, they also make the body vibrate, for example when a foot hits the ground. This produces sound waves with a specific structure. But more importantly, self-produced sounds have a sensorimotor structure. Scraping corresponds to a particular way in which one interacts with an object. The time of impact corresponds to the onset of the sound. The intensity of the sound is directly related to the energy with which an object is hit. Finally, the periodicity of vocalizations (i.e., their pitch) corresponds to the periodicity of self-generated air pulses through the vocal folds, and the formant frequencies correspond to the shape of the vocal tract. Self-generated sounds also have a multimodal structure: for example, they produce vibrations in the body that can be perceived by tactile receptors. In the next post, I will look at the structure of pitch.

Perceptual invariants: representational vs. structural theories

In his book on vision, David Marr acknowledges that a major computational issue for sensory systems is to extract relevant information in a way that is invariant to a number of changes in the world, for example, to recognize a face independently of its orientation and distance. Here we hit a major difference between representational theories and what I shall call structural theories, such as Gibson’s ecological theory (see my post on the difference between these two theories). In a representational theory, invariant processing is obtained by building a representation that is itself invariant to a number of transformations (e.g., translations, rotations). How can this representation be built? There are two ways: either it is wired in (innate), or it is acquired, i.e., learned by associating many transformed instances of the same object with the same “percept”. So in a representational theory, dealing with invariance is a tedious learning process requiring supervision. In a structural theory, the problem does not actually exist, because the basis of perception is precisely invariants.

I will give an example from hearing. There are two theories of pitch perception. Pitch is the percept associated with how low or high a musical note is. It mostly corresponds to the periodicity of the sound wave. Two periodic sounds with the same repetition rate will generally have the same pitch. But they may have different timbres, i.e., different spectral contents. In the spectral or template theory, there is an initial representation of sounds consisting of a spectral pattern. It is then compared with the spectral patterns of reference periodic sounds with various pitches, the templates. These templates need to be learned, and the task is not entirely trivial because periodic sounds with the same pitch can have non-overlapping spectra (for example, a pure tone and a complex tone without the first harmonic). The spectral theory of pitch is a representational theory of pitch. In this account, there is nothing special about pitch, it is just a category of sound spectra.

The temporal theory of pitch, on the other hand, postulates that the period of a sound is detected. I call it a structural theory because pitch corresponds to a structural property of sounds, their periodicity. One can observe that the same pattern in the sound wave is repeated, at a particular rate, and this observation does not require learning. Now this means that if two sounds with the same period are presented, I can immediately recognize that they share the same structural property, i.e., they have the same pitch. Learning, in a structural theory, only means associating a particular structure with a label (say, the name of a musical note). The invariance problem disappears in a structural theory, because the basis of the percept is an invariant: the periodicity does not depend on the sound’s spectrum. This also means that sounds that elicit a pitch percept are special because they have a particular structure. In particular, periodic sounds are predictable. White noise, on the other hand, has no structure and does not elicit a pitch percept.
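The simplest version of the temporal theory's period detector can be written down directly. In this sketch (the sampling rate, fundamental, and harmonic content are arbitrary choices), the same autocorrelation peak is found for a pure tone at f0 and for a complex tone whose spectrum excludes f0 entirely: same structural property, hence same pitch, despite disjoint spectra.

```python
import math

fs = 8000.0  # sampling rate (Hz); an arbitrary choice

def pure_tone(t, f0=200.0):
    return math.sin(2 * math.pi * f0 * t)

def missing_fundamental(t, f0=200.0):
    # Harmonics 2..5 only: the spectrum does not contain f0 at all.
    return sum(math.sin(2 * math.pi * n * f0 * t) for n in range(2, 6))

def estimate_period(signal, min_lag=20, max_lag=100):
    # Pick the lag maximizing the autocorrelation: the temporal
    # theory's period detector in its simplest form.
    def autocorr(lag):
        return sum(a * b for a, b in zip(signal[:-lag], signal[lag:]))
    return max(range(min_lag, max_lag + 1), key=autocorr)

n = 1600
tone = [pure_tone(i / fs) for i in range(n)]
complex_tone = [missing_fundamental(i / fs) for i in range(n)]

print(estimate_period(tone))          # 40 samples = 5 ms, i.e., 200 Hz
print(estimate_period(complex_tone))  # also 40: same pitch, non-overlapping spectra
```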

David Marr vs. James Gibson

In his book “Vision”, David Marr briefly comments on James Gibson’s ecological approach, and rejects it. He makes a couple of criticisms that I think are fair, for example that Gibson seemed to believe that extracting meaningful invariants from sensory signals is somehow trivial, while it is in fact a difficult computational problem. But David Marr seems to have missed the important philosophical points in James Gibson’s work. These points have also been made by others, for example Kevin O’Regan and Alva Noë, but also Merleau-Ponty and many others. I will try to summarize a few of these points here.

I quote from David Marr: “Vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information”. There are two philosophical errors in this sentence. First, that perception is the production of a representation. This is a classical philosophical mistake, the homunculus fallacy. Who then sees this representation? Marr even explicitly mentions a “viewer” of this representation. One would have to explain the perception of this viewer, and this reasoning leads to an infinite regress.

The second philosophical mistake is more subtle. It is to postulate that there is an external source of information, the images on the retina, that the sensory system interprets. This is made explicit later in the book: “(...) the initial representation is in no doubt – it consists of arrays of image intensity values as detected by the photoreceptors in the retina”. This is precisely what Gibson doubts at the very beginning of his book, The Ecological Approach to Visual Perception. Although it is convenient to speak of information in sensory signals, it can be misleading. It suggests a parallel with Shannon’s theory of communication, but the environment does not communicate with the observer. Surfaces reflect light waves in all directions; there is no message in these waves. So the analogy between a sensory system and a communication channel is misleading. The fallacy of this view is fully revealed when one considers the voluntary movements of the observer. The observer can decide to move and capture different sensory signals. In Gibson’s terminology, the observer samples the ambient optic array. So what is primary is not the image, it is the environment. Gibson insists that a sensory system cannot be reduced to the sensory organs (say, the eyes and the visual cortex). It must include active movements, embedded in the environment. This is related to the embodiment theory.

We tend to feel that what we see is like the image of a high-resolution camera. This is a mistake due to the immediate availability of visual information (through eye movements). In reality, only a very small part of the visual field has high resolution, and one region of the retina has no photoreceptors at all (the blind spot). We do not notice this because, when we need the information, we can immediately direct our eyes towards the relevant target in the visual field. There is no need to postulate an internal high-resolution representation in which we can move our “inner eye”. Rodney Brooks, a successful researcher in artificial intelligence and robotics, once stated that “the world is its own best model”. The fact that we do not actually have a high-resolution mental representation of the visual world (an image in the mind) has been demonstrated spectacularly through the phenomena of change blindness and inattentional blindness, in which a major change in an image or movie goes unnoticed (see for example this movie).

Correlation vs. synchrony

What is the difference between neural correlation and neural synchrony? As I am interested in the role of synchrony in neural computation, I am often asked this question. I will try to give a few answers here.

A simple answer is: it’s a question of timescale. That is, synchrony is correlation at a fine timescale, or more precisely, at a timescale shorter than the integration time constant of the neuron. In this sense, the term synchrony implicitly acknowledges that there is an observer of these correlations. This usage is consistent with the fact that neurons are very sensitive to the relative timing of their inputs within their integration time constant (see our recent paper in J Neurosci on the subject).

However, although I have been satisfied with this simple answer in the past, I now feel that it misses the point. I think the distinction rather has to do with the distinction between the two main theories of neural computation, rate-based theories vs. spike-based theories. The term “correlation” is often used in the context of rate-based theories, whereas the term “synchrony” is used in general in the context of spike-based theories (as in my recent paper on computing with neural synchrony). The difference is substantial, and it does not really have to do with the timescale. A correlation is an average, just as a firing rate is an average. Therefore, by using the term correlation, one implicitly assumes that the quantities of interest are averages. In this view, correlations are generally seen as modulating input-output properties of neurons, in a rate-based framework, rather than being the substance of computation. But when using the term synchrony, one does not necessarily refer to an average, simply to the fact that two spikes occur at a similar time. For example, in my recent paper on computing with neural synchrony, I view coincidence detection as the detection of a rare event, that is, a synchrony event that is unlikely to occur by chance. If one takes this view further, then meaningful synchrony is in fact transient, and therefore the concept cannot be well captured by an average, i.e., by correlation.

The distinction might not be entirely obvious, so I will give a simple example. Consider two independent Poisson inputs A and B with rate F. Given one spike from neuron A, the probability that neuron B spikes within time T after this spike can be calculated (the integral of an exponential distribution), and for small T it is essentially F·T; the expected rate of such coincidences is then proportional to T and to F squared. If T is very small and the two inputs are independent, this event will almost never happen. So if it does happen, even just once, it is unexpected and therefore meaningful, since it means that the assumption of independence was probably wrong. In a way, a coincidence detector can be seen as a statistical test: it tests the coincidence of input spikes against the null hypothesis that the inputs are independent. A single synchrony event can make this test fail, and so the concept cannot be fully captured by correlation, which is an average.
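This back-of-the-envelope calculation can be checked by simulation. The sketch below (the rates, window, and duration are arbitrary choices) generates two independent Poisson trains and counts how often a spike of A is followed by a spike of B within the window T; the count comes out close to F²·T coincidences per second.

```python
import bisect
import random

random.seed(1)

def poisson_train(rate, duration):
    # Homogeneous Poisson spike train: exponential inter-spike intervals.
    t, spikes = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t > duration:
            return spikes
        spikes.append(t)

def count_coincidences(a, b, window):
    # Number of spikes of A followed by at least one spike of B within `window`.
    count = 0
    for ta in a:
        i = bisect.bisect_right(b, ta)  # first B spike strictly after ta
        if i < len(b) and b[i] <= ta + window:
            count += 1
    return count

F = 20.0        # firing rate (Hz)
T = 0.001       # coincidence window (s): 1 ms
duration = 100.0

a = poisson_train(F, duration)
b = poisson_train(F, duration)
observed = count_coincidences(a, b, T)
expected = F * F * T * duration  # F^2 * T coincidences per second, on average
print(observed, expected)
```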

To summarize, synchrony is not about determinism vs. stochasticity, nor about correlation on a very fine timescale or very strong correlation; it is about the relative timing of individual spiking events, and about how likely such an event is to occur by chance under an independence hypothesis.