Marr’s levels of analysis and embodied approaches

Marr described the brain as an information-processing system, and argued it had to be understood at three distinct conceptual levels:

1) The computational level: what does the system do? (for example: estimating the location of a sound source)

2) The algorithmic/representational level: how does it do it? (for example: by calculating the maximum of cross-correlation between the two monaural signals)

3) The physical level: how is it physically realized? (for example: with axonal delay lines and coincidence detectors)

This is what Francisco Varela describes as “computational objectivism”: the purpose of the computation is to extract information about the world, in an externally defined representation. For example, to extract the interaural time difference between the two monaural signals. Varela describes the opposite view as “neurophysiological subjectivism”, according to which perception is a result of neural network dynamics.

Neurophysiological subjectivism is problematic because it fails to fully recognize the defining property of living beings, which is teleonomy. Jacques Monod (who received the Nobel prize for his work in molecular biology) articulated this idea by explaining that living beings, through the mechanics of evolution, differ from non-living things (say, a mountain) in that they have a fundamental teleonomic project, which is “invariant reproduction” (in Le Hasard et la Nécessité). The achievement of this project relies on specific molecular mechanisms, but it would be a mistake to think that the achievement of the project is a consequence of these mechanisms. Rather, the existence of mechanisms consistent with the project is a consequence of evolutionary pressure selecting those mechanisms: the project defines the mechanisms rather than the other way round. This fundamental aspect of life is downplayed in neurophysiological subjectivism.

Thus computational objectivism improves on neurophysiological subjectivism by acknowledging the teleonomic nature of living beings. However, a critical problem is that the goal (first level) is defined in terms that are external to the organism; in other words, the issue is whether the three levels are really independent. For example, in sound localization, a typical engineering approach is to calculate the interaural time differences as a function of sound direction, then estimate these differences by cross-correlation and invert the mapping. This approach fails in practice because these binaural cues depend on the shape of the head (among other things), which varies across individuals. One would then have to specify a mapping that is specific to each individual, and it is not reasonable to think that this might be hard-coded in the brain. This simply means that the algorithmic level (#2) must in fact be defined in relation to the embodiment, which is part of level #3. This is in line with Gibson’s ecological approach, in which information about the world is obtained by detecting sensory invariants, a notion that depends on the embodiment. Essentially, this is the idea of the “synchrony receptive field” that I developed in a recent general paper (Brette, PLoS CB 2012), and before that in the context of sound localization (Goodman and Brette, PLoS CB 2010).
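
To make the engineering approach concrete, here is a minimal sketch under an idealized far-field spherical-head model (ITD = (2r/c)·sin θ). The head radius, sampling rate and noise stimulus are illustrative assumptions; the point above is precisely that such a fixed model does not match the binaural cues of a real, individual head.

```python
# Sketch of the "engineering approach" to sound localization: assume an
# idealized head model, estimate the interaural time difference (ITD) by
# cross-correlation, then invert the model to get a direction.
# All parameters (head radius, sampling rate) are illustrative assumptions.
import numpy as np

fs = 44100.0          # sampling rate (Hz)
r = 0.09              # assumed head radius (m)
c = 343.0             # speed of sound (m/s)

def itd_from_angle(theta):
    """Idealized far-field model: ITD = (2r/c) * sin(theta)."""
    return 2 * r / c * np.sin(theta)

def estimate_itd(left, right, max_lag):
    """Return the lag (in samples) maximizing the cross-correlation."""
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left[max(0, -k):len(left) - max(0, k)],
                    right[max(0, k):len(right) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(xcorr))]

# Simulate a source at 30 degrees: the right channel is a delayed copy of the left
theta_true = np.deg2rad(30)
delay = int(round(itd_from_angle(theta_true) * fs))
x = np.random.randn(int(0.1 * fs))        # 100 ms of broadband noise
left, right = x, np.roll(x, delay)

itd = estimate_itd(left, right, max_lag=int(0.001 * fs)) / fs
theta_est = np.arcsin(np.clip(itd * c / (2 * r), -1, 1))
print(np.rad2deg(theta_est))   # close to 30, but only if the head model is correct
```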

However, this still leaves the computational level (#1) defined in external terms, although the algorithmic level (#2) is defined in more ecological terms (sound location rather than ITD). The sensorimotor approach (and related approaches) closes the loop by proposing that the computational goal is to predict the effect of movements on sensory inputs. This implies the development of an internal representation of space, but space is a consequence of this goal, rather than an ad hoc assumption about the external world.

Thus I propose a redefinition of the three levels of analysis of a perceptual system that is more in line with embodied approaches:

1) Computational level: to predict the sensory consequences of actions (sensorimotor approach) or to identify the laws that govern sensory and sensorimotor signals (ecological approach). Embodiment (previously in level 3) is taken into account in this definition.

2) Algorithmic/representational level: how to identify these laws or predict future sensory inputs? (the kernel in the kernel-envelope theory in robotics)

3) Neurophysiological level (previously physical level): how are these principles implemented by neurons?

Here I am also postulating that these three levels are largely independent, but the computational level is now defined in relation to the embodiment. Note: I am not postulating independence as a hypothesis about perception, but rather as a methodological choice.

Update. In a later post about rate vs. timing, I refine this idea by noting that, in a spike-based theory, levels 2 and 3 are in fact not independent, since algorithms are defined at the spike level.

 


"The brain uses all available information"

In discussions of “neural coding” issues, I have often heard the idea that “the brain uses all available information”. This idea generally pops up in response to the observation that neural responses are complex and vary with stimuli in ways that are difficult to comprehend. In this variability there is information about stimuli, and as complex as the mapping from stimuli to neural responses may be, the brain might well be able to invert this mapping. I sympathize with the notion that neural heterogeneity is information rather than noise, but I believe that, phrased in this way, this idea reveals two important misconceptions.

First of all, there is often a confusion between sensitivity (responses vary along several stimulus dimensions) and information (you can recover these dimensions from the responses). I made this point in a specific paper two years ago (pdf). Neural responses are observed for a specific experimental protocol, which is always constrained to a limited set of stimuli. One can often recover stimulus dimensions from the responses within this set, but it is a mistake to conclude that the brain can do it, because this inverse mapping depends on the particular experimental set of stimuli. In other words, the mapping is in fact from the observed neural responses and the knowledge of the experimental protocol to the stimulus. The brain does not have access to such external knowledge. Therefore, information is always highly overestimated in this type of analysis. This is in fact a classical problem in machine learning, related to the issues of training vs. test error, generalization and overfitting. The key concept is robustness: the hypothesized inverse mapping should be robust to large changes in the set of stimuli.
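
Here is a toy illustration of this point (not the analysis of the paper mentioned above): a linear decoder fitted on responses to a narrow “experimental” stimulus set recovers the stimulus well within that set, but fails on stimuli outside it. The response model and all numbers are invented for illustration.

```python
# Toy illustration of how decoding within a fixed experimental stimulus set
# overestimates the information available to the brain: a decoder fitted on
# the experimental set does not generalize to new stimuli.
import numpy as np

rng = np.random.default_rng(0)

def responses(stim):
    """Hypothetical population response: nonlinear in the stimulus, plus noise."""
    return np.column_stack([np.tanh(3 * stim), stim ** 2, np.sin(5 * stim)]) \
           + 0.05 * rng.standard_normal((len(stim), 3))

# "Experimental protocol": a narrow range of stimuli
train_stim = rng.uniform(0.0, 0.5, 200)
R_train = responses(train_stim)

# Fit a linear decoder (least squares) from responses to stimulus
A = np.column_stack([R_train, np.ones(len(train_stim))])
w, *_ = np.linalg.lstsq(A, train_stim, rcond=None)

def decode(R):
    return np.column_stack([R, np.ones(len(R))]) @ w

train_err = np.std(decode(R_train) - train_stim)

# New stimuli outside the experimental set: the same decoder generalizes poorly
test_stim = rng.uniform(0.5, 1.0, 200)
test_err = np.std(decode(responses(test_stim)) - test_stim)

print(train_err, test_err)   # the test error is typically much larger
```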

The second misconception is more philosophical, and has to do with the general investigation of “neural codes”. What is a code? It is a way of representing information. But sensory information is already present at the level of sensory inputs, and it is a theorem that information can only decrease along a processing chain. So if we say that the goal of a code is only to represent the maximum amount of information about stimuli, then what is gained by having a second (central) code, which can only be a degraded version of the initial sensory inputs? Thinking in this way is in fact committing the homunculus fallacy: looking at the neural responses as a projection of sensory inputs, which “the brain” observes. This projection achieves nothing, for it still leaves unexplained how the brain makes sense of sensory inputs – nothing has been gained in terms of what these inputs mean. At some point there needs to be something other than just representing sensory inputs in a high-dimensional space.
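
The theorem alluded to above is the data processing inequality: if the stimulus, the sensory input and the central response form a Markov chain, then the central response cannot carry more information about the stimulus than the sensory input does. In standard notation:

```latex
% Data processing inequality: for a Markov chain X -> Y -> Z
% (stimulus -> sensory input -> central response),
% mutual information cannot increase along the chain.
\[
  X \to Y \to Z
  \quad \Longrightarrow \quad
  I(X;Z) \le I(X;Y).
\]
```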

The answer, of course, is that the goal of a “neural code” is not just to represent information, but to do so in a way that makes it easier to process relevant information. This is the answer provided by representational theories (e.g. David Marr). One might also argue that the very notion of a neural code is misleading, because the role of a perceptual system is not to encode sensory inputs but to guide behavior, and that it is therefore more appropriate to speak of computation rather than of a code. In either view, the relevant question when interpreting neural responses is not how the rest of the brain can make use of them, but rather how they participate in solving the perceptual problem. I believe one key aspect is behavioral invariance, for example the fact that you can localize a sound source independently of its level (within a certain range). Another key aspect is that the “code” is in some way easier to decode for “neural observers” (not just any observer).

A note on computing with neural synchrony

In a recent paper, I explained how to compute with neural synchrony, by relating synchrony with the Gibsonian notion of sensory invariants. Here I will briefly recapitulate the arguments and try to explain what can and cannot be done with this approach.

First of all, neural synchrony, like any other concept of neural coding, should be defined from the observer’s point of view, that is, from the postsynaptic point of view. Detecting synchrony is detecting coincidences; that is, a neural observer of neural synchrony is a coincidence detector. Now coincidences are observed when spikes arrive together at the postsynaptic neuron, not when they are produced by the presynaptic neurons. Spikes travel along axons and therefore generally arrive after some delay, which we may consider fixed. This means that, in fact, coincidence detectors do not detect synchrony but rather specific time differences between spike trains.

I will call these spike trains Ti(t), where i is the index of the presynaptic neuron. Detecting coincidences means detecting relationships Ti(t)=Tj(t-d), where d is a delay (for all t). Of course we may interpret this relationship in a probabilistic (approximate) way. Now if one assumes that the neuron is a somewhat deterministic device that transforms a time-varying signal S(t) into a spike train T(t), then detecting coincidences is about detecting relationships Si(t)=Sj(t-d) between analog input signals.

To make the connection with perception, I then assume that the input signals are determined by the sensory input X(t) (which could be a vector of inputs), so that Si(t)=Fi(X)(t), where Fi is a linear or nonlinear filter. Computing with neural synchrony then means detecting relationships Fi(X)(t)=Fj(X)(t-d), that is, specific properties of the stimulus X. You could see such a relationship as a sensory law that the stimulus X(t) follows, or, in Gibson’s terminology, a sensory invariant (some property of the sensory inputs that does not change with time).
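
Here is a toy sketch of this idea: two neurons encode delayed copies of the same signal by threshold crossings, and a coincidence detector that reads them with a preferred delay d responds most when d matches the delay in the relationship Si(t)=Sj(t-d). The encoding model and all parameters are illustrative, not those of the paper.

```python
# Toy sketch: a coincidence detector with a preferred delay detects the
# relationship S_j(t) = S_i(t - d_true) between two analog signals encoded
# as spike trains (threshold crossings). Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
dt = 1e-4                        # time step (s)
t = np.arange(0, 2.0, dt)        # 2 s of simulated time

# Common stimulus; the two "presynaptic" signals satisfy S_j(t) = S_i(t - d_true)
X = np.convolve(rng.standard_normal(len(t)), np.ones(50) / 50, mode="same")
d_true = 30                      # delay in time steps (3 ms)
S_i, S_j = X, np.roll(X, d_true)

def spike_times(signal, threshold=0.2):
    """Encode a signal as the indices of its upward threshold crossings."""
    above = signal > threshold
    return np.flatnonzero(above[1:] & ~above[:-1])

T_i, T_j = spike_times(S_i), spike_times(S_j)

def coincidence_count(t_i, t_j, delay, window=2):
    """Spikes of neuron i that, shifted by `delay`, coincide (within `window`
    time steps) with a spike of neuron j."""
    return sum(np.any(np.abs(t_j - (s + delay)) <= window) for s in t_i)

# The coincidence detector whose delay matches the relationship responds most
counts = {d: coincidence_count(T_i, T_j, d) for d in range(0, 60, 5)}
print(max(counts, key=counts.get))   # close to d_true = 30
```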

So this theory describes computing with synchrony as the extraction of sensory invariants. The first question is: can we extract all sensory invariants in this way? The answer is no; only those relationships that can be written as Fi(X)(t) = Fj(X)(t-d) can be detected. But then isn’t the computation already done by the primary neurons themselves, through the filters Fi? This would imply that synchrony does not achieve anything, computationally speaking. But this is not true. The set of relationships between the signals Fi(X)(t) is not the same thing as the set of signals themselves. For one thing, there are more relationships than signals: if there are N encoding neurons, then there are N² relationships, times the number of allowed delays. But more importantly, a relationship between signals does not have the same nature as a signal. To see this, consider just two auditory neurons, one that responds to sounds from the left ear only, and one that responds to sounds from the right ear (and neglect sound diffraction by the head to simplify things). Neither of these neurons is sensitive at all to the location of the sound source. But the relationships between the input signals to these two neurons are informative of sound location. Relationships and signals are two different things: a signal is a stream of numbers, while a relationship is a universal statement about these numbers (aka an “invariant”). So to summarize: synchrony represents sensory invariants, which are not represented in the individual neurons, but only a limited number of sensory invariants. For example, if the filters Fi are linear, then only linear properties of the sensory input can be detected. Thus, sensory laws are not produced but rather detected, among a set of possible laws.

Now the second question: is computing with synchrony only about extracting sensory invariants? The answer is also no, because the theory is based on the assumption that the input signals to the neurons and their synchrony are mostly determined by the sensory inputs. But they could also depend on “top-down” signals. Synchrony could be generated by recurrent connections, that is, synchrony could be the result of a computation rather than (or in addition to) the basis of computation. Thus, to be more precise, this theory describes what can be computed with stimulus-induced synchrony. In Gibson’s terminology, this would correspond to the “pick-up” of information, i.e., the information is present in the primary input, preexisting in the form of the relationships between transformed sensory signals (Fi(X)), and one just needs to observe these relationships.

But there is an entire part of the field that is concerned with the computational role of neural oscillations, for example. If oscillations are spatially homogeneous, then they do not affect the theory – they may in fact simply be a way to transform the similarity of slowly varying signals into synchrony (this mechanism is the basis of Hopfield and Brody’s olfactory model). If they are not, in particular if they result from interactions between neurons, then this is a different matter.

What is sound? (VI) Sounds inside the head

When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.

But why is this so? The first answer is that sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences, not time differences. Such recordings also don’t reproduce the diffraction by the head, which can be captured using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can that be, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.
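
For concreteness, here is a minimal sketch of how a virtual source is rendered over earphones: the mono source is convolved with a left and a right head-related impulse response (HRIR) for the desired direction. The HRIRs below are crude placeholders (a pure interaural delay and attenuation); actual externalization experiments would use individually measured HRTFs, and possibly reverberation.

```python
# Minimal sketch of binaural rendering over earphones: convolve a mono source
# with left and right head-related impulse responses (HRIRs). The HRIRs here
# are crude placeholders (pure delay plus attenuation), not measured ones.
import numpy as np

fs = 44100
mono = np.random.randn(fs)                           # 1 s of source signal

# Placeholder HRIRs for a source on the left: right ear delayed and attenuated
hrir_left = np.zeros(128); hrir_left[0] = 1.0
hrir_right = np.zeros(128); hrir_right[20] = 0.6     # ~0.45 ms ITD, ~4 dB ILD

left_ear = np.convolve(mono, hrir_left)
right_ear = np.convolve(mono, hrir_right)
binaural = np.stack([left_ear, right_ear], axis=1)   # to be played over earphones
```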

This could be the reason, but even if it is true, it still leaves an open question: why would sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel sounds inside the head, not just inside the body. And finally, is it actually true that there are no sounds coming from inside the head? In fact, not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.

The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears, that is, of the head itself, from the top of the neck up and excluding the jaw. This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious why we should experience this feel as sounds being produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel by e.g. touching one’s head or chewing. But these are sound sources localized on the surface of the head, or on the inner surface of the mouth, not exactly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.

I will come back to speech later, but I will finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs and the subject’s head is held fixed, it should sound externalized; or that natural sounds should feel inside the head until one turns her head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated with the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving her head, the hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when sounds with such features are presented, it is inferred that ITDs and IIDs would be unaffected by movements, which is to say that the sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and not IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musical recordings).

The remarks above suggest the following predictions:

  • When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
  • When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
  • When the subject is allowed to move, sounds should feel (perhaps after a while) inside the head.
  • When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.

To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.

What I find intriguing in this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what are the tactile signals produced by earphones, do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations and he showed that we can actually feel the ITD of binaural sounds acting on the skin of the two arms rather than on the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies on profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).

I now summarize this post by listing a number of questions I have raised:

  • What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
  • Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
  • Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
  • What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?

Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. As soon as there is a reflecting surface, which does not move with the head, moving the head has an effect on the sound waves (specifically, on the echoes). In other words, the fact that echoic cues are affected (in a lawful way) by movements is characteristic of sound sources outside the head, whether they are rigidly attached to the head or not. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues also change for an external source that is free from the head.

Natural sensory signals

I am writing this post from the Sensory Coding and Natural Environment conference in Vienna. It’s a very interesting conference about a topic that I like very much, but it strikes me that many approaches I have seen seem to miss the point of what is natural about natural sensory signals.

So what is natural about natural sensory signals? It seems that a large part of the field, from what I have heard, answers that these are signals that have natural statistics. For example, they have particular second and higher order statistics, both spatially and temporally. While this is certainly true to some extent, I don’t find it a very satisfying answer.

Suppose I throw a rock in the air, and I can see its movement until it reaches the ground. The visual signals that I capture can be considered “natural”. What is natural about the motion of the rock: is it that the visual signals have particular statistics? Probably they do, but to me a more satisfying answer is that it follows the law of gravitation. Efficient coding approaches often tend to focus on statistics, because “the world is noisy” (or, “the brain is noisy”). However, even though there is turbulence in the air, describing the motion of the rock as obeying the law of gravitation (possibly with some noise) is still more satisfying than describing its higher order statistics – and possibly more helpful for an animal too.

In other words, I propose that what is natural about sensory signals is that they follow the laws of nature.

By the way, this view is completely in agreement with Barlow’s efficient coding principle, which postulates that neurons encode sensory information in an efficient way, i.e., they convey a maximum amount of information with a minimum number of spikes. Indeed representing the laws that govern sensory signals leads to a parsimonious description of these signals.

What is sound? (V) The structure of pitch

Musical notes have a particular perceptual quality called “pitch”. Pitch is the percept corresponding to how low or high a musical note is. Vowels also have a pitch. To a large extent, the pitch of a periodic sound corresponds to its repetition rate. The important point is that what matters for pitch is the periodicity more than the frequency content. For example, a periodic sound with repetition rate f0 has frequency components at multiples of f0 (n·f0), which are called harmonics. A pure tone of frequency f0 and a complex tone with all harmonics except the first one, i.e., which does not contain the frequency component f0, will evoke the same pitch. It is in fact a little more complex than that; there are many subtleties, but I will not go into these details in this post. Here I simply want to describe the kind of sensory or sensorimotor structure there is in pitch. It turns out that pitch has a surprisingly rich structure.

The most obvious type of structure is periodicity. Pitch-evoking sounds have this very specific property that the acoustical wave is unchanged when temporally shifted by some delay. This delay is characteristic of the sound’s pitch (i.e., same period means same pitch). This is the type of structure that is emphasized in temporal theories of pitch. This is what I call the “similarity structure” of the acoustical signal, and this notion can in fact be extended and accounts for a number of interesting phenomena related to pitch. But this is work in progress, so I will discuss it further at a later time.
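
As a small illustration of this similarity structure, the sketch below builds a complex tone whose fundamental component is missing and checks that the waveform is nevertheless unchanged by a shift of one period, with an autocorrelation peak at the pitch period 1/f0. The parameters are arbitrary.

```python
# A complex tone with harmonics 2..10 of f0 (missing fundamental) is still
# invariant under a shift of one period: its autocorrelation peaks at 1/f0.
import numpy as np

fs = 44100
f0 = 210.0                                    # repetition rate (Hz); period = 210 samples
t = np.arange(0, 0.5, 1 / fs)

# Harmonics 2..10 only: the component at f0 itself is absent
x = sum(np.sin(2 * np.pi * n * f0 * t) for n in range(2, 11))

period = int(round(fs / f0))
print(np.corrcoef(x, np.roll(x, period))[0, 1])    # ~1: unchanged by a one-period shift

# The normalized autocorrelation peaks at the period 1/f0, the pitch period
lags = np.arange(1, 2 * period)
acf = [np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in lags]
print(lags[int(np.argmax(acf))], period)           # both are 210 samples
```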

Another way to see periodic sounds is to realize that a periodic sound is predictable. That is, after a couple of periods, one can predict the future acoustical wave. Compared to most other sounds, periodic sounds have a very high degree of predictability. Perhaps the perceptual strength of pitch (which depends on a number of factors) is related to the degree of predictability of the sound.

There is another type of structure that is in some sense orthogonal to the similarity structure I just described, which one might call the “dissimilarity structure”. Natural sounds (apart from vocalizations) tend to have a smooth spectrum. Periodic sounds, on the other hand, have a discrete spectrum. Thus, in some sense, periodic sounds have a “surprisingly discontinuous” spectrum. Suppose for example that two auditory receptors respond to different but overlapping parts of the spectrum (e.g., two nearby points on the basilar membrane). Then one can usually predict the sensory input to the second receptor given the sensory input to the first receptor, because natural sounds tend to have a continuous spectrum. But this prediction would fail for a periodic sound. Periodic sounds are maximally surprising in this sense. The interesting thing about the dissimilarity structure of pitch is that it accounts for binaural pitch phenomena such as Huggins’ pitch: noise with a flat spectrum is presented to both ears, and the interaural phase difference changes abruptly at a given frequency; a tone is then perceived, with a pitch corresponding to that frequency.
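
As an illustration of this binaural effect, here is a sketch of a Huggins-pitch stimulus: the same broadband noise at both ears, except for an interaural phase shift that ramps from 0 to 2π across a narrow band around the frequency of the perceived tone. The construction follows the classic recipe only loosely, and the bandwidth is an assumption.

```python
# Sketch of a Huggins-pitch stimulus: identical broadband noise at both ears
# except for an interaural phase transition in a narrow band around f_pitch,
# which evokes a faint tone at that frequency (headphone presentation required).
import numpy as np

fs = 44100
f_pitch = 600.0              # frequency of the perceived tone (Hz)
bw = 0.16 * f_pitch          # width of the phase-transition band (assumption)
n = fs                       # 1 s of noise

noise = np.random.randn(n)
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(n, 1 / fs)

# Phase shift ramping from 0 to 2*pi across the narrow band around f_pitch
phase = np.interp(freqs, [f_pitch - bw / 2, f_pitch + bw / 2], [0.0, 2 * np.pi])
left = noise                                      # left ear: the original noise
right = np.fft.irfft(spectrum * np.exp(1j * phase), n)

stereo = np.stack([left, right], axis=1)
```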

Thus, pitch-evoking sounds simultaneously have two types of structure that distinguish them from other types of sounds: the similarity structure, which consists of different views of the acoustical signal that are unusually similar, and the dissimilarity structure, which consists of different views of the acoustical signal that are unusually dissimilar. The first type of structure corresponds to what I examined in my paper on computing with neural synchrony. It is important to notice that these two types of structure have a different nature. The similarity structure corresponds to a law that the sensory signals follow. Here the percept is associated with the specific law that these signals follow. The dissimilarity structure corresponds to the breaking of a law that sensory signals usually follow. Here the percept is associated with a law that is specific not to the presented sensory signals, but to the usual sensory signals. Thus we might relate the similarity structure to the notion of discovery, and the dissimilarity structure to the notion of surprise (and perhaps the term “structure” is not appropriate for the latter).

So far, I have only considered the structure of the acoustical signal, but one may also consider the sensorimotor structure of pitch. As I mentioned in another post, periodic sounds are generally produced by living beings, so it makes sense to examine these sounds from the viewpoint of their production. When one produces a pitch-evoking sound (for example a vowel, or when one sings), there is a very rich structure that goes beyond the acoustical structure. First, there is proprioceptive information about the vocal muscles and tactile information about the vibrations of the larynx, and both are directly related to the period of the sound. There is also the efference copy, i.e., the motor commands issued to make the vocal folds vibrate in the desired way. For a person who can produce sounds, pitch is then associated with a rich and meaningful sensorimotor structure. In fact, the sensorimotor theory of pitch perception would be that to perceive the pitch of a sound is, perhaps, to perceive the movements that would be required to produce such acoustical structure. An interesting aspect of this view is that it provides some meaning to the notion of how low or high a pitch-evoking sound is, by associating it with the state of the different elements involved in sound production. For example, producing a high sound requires increasing the tension of the vocal folds and moving the larynx up (higher!). One question then is whether congenitally mute people have a different perception of pitch.

Observe that, as for binaural hearing, the sensorimotor structure of pitch should not be understood as the relationship between motor commands and auditory signals, but rather as the relationship between motor commands and the structure of auditory signals (e.g. the periodicity). In this sense, it is higher-order structure.

What is sound? (IV) Ecological ontology of sounds

What kinds of sounds are there in the world? This is essentially the question William Gaver addresses in a very interesting paper (Gaver, 1993), in which he describes an ontology of sounds, categorized by the type of interaction. There are three categories: sounds made by solids, liquids and gases. An example of a sound made by a liquid is dripping. There are also hybrid sounds, such as rain falling on a solid surface. It makes sense to categorize sounds based on the nature of the objects involved, because the mechanical events are physically very different. For example, in sounds involving solids (e.g. a footstep), energy is transmitted at the interface between two solids, which is a surface, and the volumes are put in motion (i.e., they are deformed). This is completely different for sounds involving gases, e.g. wind. In mechanical events involving solids, the shape is essentially unchanged (only transiently deformed). This is a sort of structural invariance that ought to leave a specific signature on the sounds (more on this in another post). Sounds made by gases, on the other hand, correspond to irreversible changes.

These three categories correspond to the physical nature of the sound producing substances. There are subcategories that correspond to the nature of the mechanical interaction. For example, a solid object can be hit or it can be scraped. The same object vibrates but there is a difference in the way it is made to vibrate. This also ought to produce some common structure in the auditory signals, as is explained in Gaver's companion article. For example, a vibrating solid object has modes of vibration that are determined by its shape (more on this in another post). These modes do not depend on the type of interaction with the object.

Interactions that are localized in time produce impact sounds, while continuous interactions produce auditory textures. These are two very distinct types of sounds. Both have a structure, but auditory textures, it seems, only have a structure in a statistical sense (see McDermott & Simoncelli, 2011). Another kind of auditory texture is the type of sound produced by a river, for example. These sounds also have a structure in a statistical sense. An interesting aspect, in this case, is that these sounds are not spatially localized: they do have an auditory size (see my post on spatial hearing).

The examples I have described correspond to what Gaver calls "basic level events", elementary sounds produced by a single mechanical interaction. There are also complex events, which are composed of simple events. For example, a breaking sound is composed of a series of impact sounds. A bouncing sound is also composed of a series of impact sounds, but the temporal patterning is different, because it is lawful (predictable) in the case of a bouncing sound. Walking is yet another example of a series of impact sounds, which is also lawful, but it differs in the temporal patterning: it is approximately periodic.

Gaver only describes sounds made by non-living elements of the environment (except perhaps for walking). But there are also sounds produced by animals. I will describe them now. First, some animals can produce vocalizations. In Gaver’s terminology, vocalizations are a sort of hybrid gas-solid mechanical event: air pushed from the lungs makes the vocal folds vibrate, producing periodic pulses of air. The sound then resonates in the vocal tract, which shapes its spectrum (much as the shape of an object determines the resonant modes of impact sounds). One special type of structure in these sounds is the periodicity of the sound wave. The fact that a sound is periodic is highly meaningful, because it means that energy is continuously provided, and therefore that a living being is most likely producing it. There are also many other interesting aspects that I will describe in a later post.

Animals also produce sounds by interacting with the environment. These are the same kinds of sounds as described by Gaver, but I believe there is a distinction. How can you tell that a sound has been produced by a living being? Apart from identifying specific sounds, I have two possible answers to offer. First, in natural non-living sounds, energy typically decays. This distinguishes walking sounds from bouncing sounds, for example. In a bouncing sound, the energy decreases at each impact. This means that both the intensity of the impacts and the intervals between successive impacts decay. This is simply because a bouncing ball starts its movement with a fixed amount of potential energy, which can only decay. In a walking sound, roughly the same energy is delivered at each impact, so it cannot be produced by a passive collision of two solids: energy must be supplied at each step. Therefore, sounds contain a signature of whether they are produced by a continuous source of energy. But a river is also a continuous source of energy (and the same would apply to all auditory textures). Another specificity is that sounds produced by the non-living environment are governed by the laws of physics, and therefore they are lawful in a sense, i.e., they are predictable. A compound sound with an unpredictable pattern (even in a statistical sense) is most likely produced by a living being. In a sense, unpredictability is a signature of decision making. This remark is not specific to hearing.
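
To make the bouncing example concrete: for a ball dropped from height h0 with restitution coefficient e, both the intervals between impacts and the impact energies decay geometrically (by factors e and e² respectively), whereas footsteps deliver roughly the same energy at roughly regular intervals. Here is a toy calculation with illustrative numbers.

```python
# Impact times and energies of a bouncing ball: both decay geometrically,
# unlike footsteps, which inject roughly constant energy at regular intervals.
import numpy as np

g = 9.81        # gravity (m/s^2)
h0 = 1.0        # initial drop height (m)
e = 0.8         # coefficient of restitution (assumption)

v = np.sqrt(2 * g * h0)              # speed at the first impact
t = np.sqrt(2 * h0 / g)              # time of the first impact
impacts = []
for k in range(6):
    impacts.append((t, 0.5 * v ** 2))    # (impact time, kinetic energy per unit mass)
    v *= e                               # speed just after the bounce
    t += 2 * v / g                       # flight time until the next impact

print(np.diff([ti for ti, _ in impacts]))        # each interval is e times the previous
print([energy for _, energy in impacts])         # energies decay by a factor e**2
```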

These are specificities of sounds produced by living beings, as heard by another observer. But one can also hear self-produced sounds. There are two new specificities about these types of sounds. First, they also make the body vibrate, for example when a foot hits the ground. This produces sound waves with a specific structure. But more importantly, self-produced sounds have a sensorimotor structure. Scraping corresponds to a particular way in which one interacts with an object. The time of impact corresponds to the onset of the sound. The intensity of the sound is directly related to the energy with which an object is hit. Finally, the periodicity of vocalizations (i.e., the pitch) corresponds to the periodicity of self-generated air pulses through the vocal folds, and the formant frequencies correspond to the shape of the vocal tract. Self-generated sounds also have a multimodal structure. For example, they produce vibrations in the body that can be perceived by tactile receptors. In the next post, I will look at the structure of pitch.

Perceptual invariants: representational vs. structural theories

In his book on vision, David Marr acknowledges that a major computational issue for sensory systems is to extract relevant information in a way that is invariant to a number of changes in the world, for example to recognize a face independently of its orientation and distance. Here we hit a major difference between representational theories and what I shall call structural theories, such as Gibson’s ecological theory (see my post on the difference between these two theories). In a representational theory, invariant processing is obtained by building a representation that is itself invariant to a number of transformations (e.g. translations, rotations). How can this representation be built? There are two ways: either it is hard-wired (innate) or it is acquired, learned by associating many transformed instances of the same object with the same “percept”. So in a representational theory, dealing with invariance is a tedious learning process requiring supervision. In a structural theory, the problem does not really exist, because the basis of perception is precisely the invariants.

I will give an example from hearing. There are two theories of pitch perception. Pitch is the percept associated with how low or high a musical note is. It mostly corresponds to the periodicity of the sound wave. Two periodic sounds with the same repetition rate will generally have the same pitch, but they may have different timbres, i.e., different spectral contents. In the spectral or template theory, there is an initial representation of sounds consisting of a spectral pattern. It is then compared with the spectral patterns of reference periodic sounds with various pitches, the templates. These templates need to be learned, and the task is not entirely trivial because periodic sounds with the same pitch can have non-overlapping spectra (for example a pure tone, and a complex tone without the first harmonic). The spectral theory of pitch is a representational theory of pitch. In this account, there is nothing special about pitch; it is just a category of sound spectra.

The temporal theory of pitch, on the other hand, postulates that the period of a sound is detected. I call it a structural theory because pitch corresponds to a structural property of sounds, their periodicity. One can observe that the same pattern in the sound wave is repeated, at a particular rate, and this observation does not require learning. Now this means that if two sounds with the same period are presented, I can immediately recognize that they share the same structural property, i.e., they have the same pitch. Learning, in a structural theory, only means associating a particular structure with a label (say, the name of a musical note). The invariance problem disappears in a structural theory, because the basis of the percept is an invariant: the periodicity does not depend on the sound’s spectrum. This also means that sounds that elicit a pitch percept are special because they have a particular structure. In particular, periodic sounds are predictable. White noise, on the other hand, has no structure and does not elicit a pitch percept.

David Marr vs. James Gibson

In his book “Vision”, David Marr briefly comments on James Gibson’s ecological approach, and rejects it. He makes a couple of criticisms that I think are fair, for example that Gibson seemed to believe that extracting meaningful invariants from sensory signals is somehow trivial, whereas it is in fact a difficult computational problem. But David Marr seems to have missed the important philosophical points in James Gibson’s work. These points have also been made by others, for example Kevin O’Regan and Alva Noë, but also by Merleau-Ponty and many others. I will try to summarize a few of these points here.

I quote from David Marr: “Vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information”. There are two philosophical errors in this sentence. First, that perception is the production of a representation. This is a classical philosophical mistake, the homunculus fallacy. Who then sees this representation? Marr even explicitly mentions a “viewer” of this representation. One would have to explain the perception of this viewer, and this reasoning leads to an infinite regress.

The second philosophical mistake is more subtle. It is to postulate that there is an external source of information, the images on the retina, that the sensory system interprets. This is made explicit later in the book: “(...) the initial representation is in no doubt – it consists of arrays of image intensity values as detected by the photoreceptors in the retina”. This is precisely what Gibson doubts at the very beginning of his book, The Ecological Approach to Visual Perception. Although it is convenient to speak of information in sensory signals, it can be misleading. It suggests a parallel with Shannon’s theory of communication, but the environment does not communicate with the observer. Surfaces reflect light waves in all directions; there is no message in these waves. So the analogy between a sensory system and a communication channel is misleading. The fallacy of this view is fully revealed when one considers the voluntary movements of the observer. The observer can decide to move and capture different sensory signals. In Gibson’s terminology, the observer samples the ambient optic array. So what is primary is not the image, it is the environment. Gibson insists that a sensory system cannot be reduced to the sensory organ (say, the eyes and the visual cortex): it must include active movements, embedded in the environment. This is related to theories of embodiment.

We tend to feel that what we see is like the image of a high-resolution camera. This is a mistake due to the immediate availability of visual information (through eye movements). In reality, only a very small part of the visual field has high resolution, and one part of the retina has no photoreceptors at all (the blind spot). We do not notice this because when we need the information, we can immediately direct our eyes towards the relevant target in the visual field. There is no need to postulate an internal high-resolution representation in which we can move our “inner eye”. Rodney Brooks, a successful researcher in artificial intelligence and robotics, once stated that “the world is its own best model”. The fact that we do not actually have a high-resolution mental representation of the visual world (an image in the mind) has been demonstrated spectacularly through the phenomena of change blindness and inattentional blindness, in which a major change in an image or movie goes unnoticed (see for example this movie).