A note on computing with neural synchrony

In a recent paper, I explained how to compute with neural synchrony by relating synchrony to the Gibsonian notion of sensory invariants. Here I will briefly recapitulate the arguments and try to explain what can and cannot be done with this approach.

First of all, neural synchrony, like any other notion of neural code, should be defined from the observer’s point of view, that is, from the postsynaptic point of view. Detecting synchrony is detecting coincidences: a neural observer of synchrony is a coincidence detector. Now coincidences are observed when the spikes arrive at the postsynaptic neuron, not when they are produced by the presynaptic neurons. Spikes travel along axons and therefore generally arrive after some delay, which we may consider fixed. This means that, in fact, coincidence detectors do not detect synchrony but rather specific time differences between spike trains.

I will write these spike trains as Ti(t), where i is the index of the presynaptic neuron. Detecting coincidences then means detecting relationships of the form Ti(t) = Tj(t-d) for all t, where d is a delay (the difference between the two conduction delays). Of course, this relationship may be interpreted in a probabilistic (approximate) way. Now if one assumes that the neuron is a somewhat deterministic device that transforms a time-varying signal S(t) into a spike train T(t), then detecting coincidences amounts to detecting relationships Si(t) = Sj(t-d) between analog input signals.
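As a minimal sketch (the function, tolerance window and spike times below are my own toy choices, not taken from the paper), the relationship Ti(t) = Tj(t-d) can be tested approximately by shifting one train by the candidate delay and counting near-coincidences:

```python
import numpy as np

def coincidence_count(t_i, t_j, d, window=1e-3):
    """Number of spikes of train i that fall within `window` seconds of a
    spike of train j shifted by the candidate delay d, i.e. an approximate
    test of the relationship Ti(t) = Tj(t - d)."""
    t_i = np.sort(np.asarray(t_i))
    t_j = np.sort(np.asarray(t_j)) + d          # shift train j by the delay
    idx = np.clip(np.searchsorted(t_j, t_i), 1, len(t_j) - 1)
    nearest = np.minimum(np.abs(t_i - t_j[idx - 1]), np.abs(t_i - t_j[idx]))
    return int(np.sum(nearest <= window))

# Toy example: train i is train j delayed by 5 ms, plus a little jitter
rng = np.random.default_rng(0)
t_j = np.sort(rng.uniform(0.0, 1.0, 50))
t_i = t_j + 5e-3 + rng.normal(0.0, 2e-4, 50)
print(coincidence_count(t_i, t_j, d=5e-3))   # many coincidences at the "right" delay
print(coincidence_count(t_i, t_j, d=0.0))    # few coincidences at zero delay
```

This count is roughly what a coincidence detector reports when the two conduction delays differ by d.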

To make the connection with perception, I then assume that the input signals are determined by the sensory input X(t) (which could be a vector of inputs), so that Si(t) = Fi(X)(t), where Fi is a linear or nonlinear filter. Computing with neural synchrony then means detecting relationships Fi(X)(t) = Fj(X)(t-d), that is, specific properties of the stimulus X. You can see such a relationship as a sensory law that the stimulus X(t) follows or, in Gibson’s terminology, a sensory invariant (some property of the sensory inputs that does not change with time).
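To make this concrete with a toy example of my own (identity filters and a made-up stimulus): if Fi and Fj are both the identity, the law Fi(X)(t) = Fj(X)(t-d) just says that X is periodic with period d, and one can ask which of these candidate laws a given stimulus satisfies:

```python
import numpy as np

fs = 10000                                      # sampling rate (Hz), arbitrary
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)   # periodic stimulus

# Test the candidate laws X(t) = X(t - d): the delay that fits reveals the period (5 ms)
candidate_d = np.arange(1, 80)                  # delays in samples
residual = [np.mean((x[d:] - x[:-d]) ** 2) for d in candidate_d]
print(candidate_d[np.argmin(residual)] / fs)    # -> 0.005 s
```

The point is that the delay d that satisfies the law is a property of the stimulus, not of the filters.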

So this theory describes computing with synchrony as the extraction of sensory invariants. The first question is: can we extract all sensory invariants in this way? The answer is no; only those relationships that can be written as Fi(X)(t) = Fj(X)(t-d) can be detected. But then isn’t the computation already done by the primary neurons themselves, through the filters Fi? This would imply that synchrony achieves nothing, computationally speaking. But this is not true. The set of relationships between the signals Fi(X)(t) is not the same thing as the set of signals themselves. For one thing, there are more relationships than signals: if there are N encoding neurons, then there are N^2 relationships, times the number of allowed delays. More importantly, a relationship between signals does not have the same nature as a signal.

To see this, consider just two auditory neurons, one that responds to sounds from the left ear only and one that responds to sounds from the right ear only (neglecting sound diffraction by the head to simplify things). Neither of these neurons is sensitive at all to the location of the sound source. But the relationship between the input signals of these two neurons is informative of sound location. Relationships and signals are two different things: a signal is a stream of numbers, while a relationship is a universal statement about these numbers (a.k.a. an “invariant”). So to summarize: synchrony represents sensory invariants, which are not represented in the individual neurons, but only a limited number of sensory invariants. For example, if the filters Fi are linear, then only linear properties of the sensory input can be detected. Thus, sensory laws are not produced but rather detected, among a set of possible laws.
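Here is a toy version of the binaural example (my own numbers and variable names): the source reaches the right ear slightly earlier than the left; neither monaural signal alone says anything about location, but scanning candidate delays for the relationship S_left(t) = S_right(t-d) recovers the interaural delay, i.e. the invariant:

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 44100
x = rng.standard_normal(fs)          # 1 s of broadband "source" noise
true_itd = 40                        # interaural delay in samples (~0.9 ms), made up

right = x[true_itd:]                 # the right ear receives the sound first
left = x[:-true_itd]                 # the left ear receives the same waveform, delayed

# Scan candidate delays d for the relationship S_left(t) = S_right(t - d)
delays = np.arange(0, 100)
mismatch = [np.mean((left[d:d + 40000] - right[:40000]) ** 2) for d in delays]
print(delays[np.argmin(mismatch)])   # -> 40: an invariant that neither signal carries alone
```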

Now the second question: is computing with synchrony only about extracting sensory invariants? The answer is also no, because the theory is based on the assumption that the input signals to the neurons, and hence their synchrony, are mostly determined by the sensory inputs. But they could also depend on “top-down” signals. Synchrony could be generated by recurrent connections; that is, synchrony could be the result of a computation rather than (or in addition to) the basis of a computation. Thus, to be more precise, this theory describes what can be computed with stimulus-induced synchrony. In Gibson’s terminology, this corresponds to the “pick-up” of information: the information is present in the primary input, preexisting in the form of relationships between the transformed sensory signals Fi(X), and one just needs to observe these relationships.

But there is an entire part of the field that is concerned, for example, with the computational role of neural oscillations. If the oscillations are spatially homogeneous, they do not affect the theory; they may in fact simply be a way to transform the similarity of slowly varying signals into synchrony (this mechanism is the basis of Hopfield and Brody’s olfactory model). If they are not homogeneous, in particular if they result from interactions between neurons, then this is a different matter.
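To illustrate the first case, here is a drastic simplification of my own (not Hopfield and Brody’s actual network, and all parameters are made up): add a common subthreshold oscillation to slowly varying inputs, and neurons whose inputs are nearly equal cross threshold at nearly the same phase of the cycle, i.e. they fire in synchrony:

```python
import numpy as np

# Toy parameters (assumed, not from the original model)
f = 40.0        # common oscillation frequency (Hz)
amp = 0.5       # oscillation amplitude
theta = 1.0     # spike threshold

def spike_time(a):
    """Time within one cycle at which a + amp*sin(2*pi*f*t) first crosses
    the threshold (None if it never does). Nearly equal inputs cross at
    nearly equal times, hence fire in near-synchrony."""
    if a >= theta:
        return 0.0                          # already above threshold
    x = (theta - a) / amp
    if x > 1:
        return None                         # the oscillation never lifts it over threshold
    return float(np.arcsin(x) / (2 * np.pi * f))

inputs = [0.70, 0.70, 0.72, 0.50]           # two equal inputs, one close, one clearly smaller
print([spike_time(a) for a in inputs])      # first three nearly equal; the last clearly later
```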

What is sound? (VI) Sounds inside the head

When one hears music or speech through earphones, it usually feels like the sound comes from “inside the head”. Yet, one also feels that the sound may come from the left or from the right, and even from the front or back when using head-related transfer functions or binaural recordings. This is why, when subjects report the left-right quality of sounds with artificially introduced interaural level or time differences, one speaks of lateralization rather than localization.
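For concreteness, here is a minimal sketch of how such stimuli are typically constructed (the function and parameter names are mine, and the numbers are arbitrary): a mono signal is duplicated into two earphone channels with an artificial interaural time difference (a shift) and level difference (a gain):

```python
import numpy as np

def lateralize(mono, fs, itd=0.0, ild_db=0.0):
    """Return (left, right) earphone channels from a mono signal, with an
    artificial interaural time difference `itd` (s, positive = right ear
    leads) and level difference `ild_db` (dB, positive = right ear louder)."""
    shift = int(round(abs(itd) * fs))
    pad = np.zeros(shift)
    if itd >= 0:                             # right leads: delay the left channel
        left, right = np.concatenate([pad, mono]), np.concatenate([mono, pad])
    else:                                    # left leads: delay the right channel
        left, right = np.concatenate([mono, pad]), np.concatenate([pad, mono])
    gain = 10.0 ** (ild_db / 20.0)
    return left, right * gain

# A 500 Hz tone lateralized to the right by a 300 microsecond ITD alone
fs = 44100
tone = np.sin(2 * np.pi * 500 * np.arange(0, 0.5, 1 / fs))
left, right = lateralize(tone, fs, itd=300e-6, ild_db=0.0)
```

Played over earphones, such a stimulus is typically reported as shifted toward the leading (or louder) ear, but still inside the head – hence “lateralization”.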

But why is this so? The first answer is that sounds heard through earphones generally don’t reproduce the spatial features of sounds heard in a natural environment. For example, in musical recordings, sources are lateralized using only interaural level differences, not time differences. Earphones also don’t reproduce the diffraction of sound by the head, which can be captured using individually measured head-related transfer functions (HRTFs). However, even with individual HRTFs, sounds usually don’t feel as “external” as in the real world. How can this be, if the sound waves arriving at the eardrums are exactly the same as in real life? Well, maybe they are not: maybe reproducing reverberation is important, or maybe some features of the reproduced waves are very sensitive to the precise placement of the earphones.

This could be the reason, but even if it is true, it still leaves an open question: why should sounds feel “inside the head” when the spatial cues are not natural? One may argue that, if a sound is judged as not coming from a known external direction, then “by default” it has to come from inside. But we continuously experience new acoustical environments, which modify the spatial cues, and I don’t think we experience sounds as coming from inside our head at first. We might also imagine other “default places” where there are usually no sound sources, for example other places inside the body, but we feel these sounds inside the head, not just inside the body. And finally, is it actually true that no sounds come from inside the head? Not quite: think about chewing, for example – although arguably, these sounds come from the inner surface of the mouth.

The “default place” idea also doesn’t explain why such sounds should feel like they have a spatial location rather than no location at all. An alternative strategy is the sensorimotor approach, according to which the distinct quality of sounds that feel inside the head has to do with the relationship between one’s movements and the sensory signals. Indeed, with earphones, the sound waves are unaffected by head movements. This is characteristic of sound sources that are rigidly attached to the ears – that is, of the head itself, from the top of the neck up, excluding the jaw. This is an appealing explanation, but it doesn’t come without difficulties. First, even though it may explain why we have a specific spatial feel for sounds heard through earphones, it is not obvious why we should experience this feel as sounds being produced inside the head. Perhaps this difficulty can be resolved by considering that one can produce sounds with such a feel, for example by touching one’s head or chewing. But these sound sources are localized on the surface of the head, or on the inner surface of the mouth, not strictly inside the head. Another way of producing sounds with the same quality is to speak, but it comes with the same difficulty.

I will come back to speech later, but first let me finish with a few more remarks about the sensorimotor approach. It seems that experiencing the feel of sounds produced inside the head requires turning one’s head. So one would expect that if sound is realistically rendered through earphones with individual HRTFs while the subject’s head is held fixed, it should sound externalized; or that natural sounds should feel inside the head until one turns one’s head. But maybe this is a naive understanding of the sensorimotor approach: the feel is associated with the expectation of a particular sensorimotor relationship, and this expectation can be based on inference rather than on a direct test. That is, sounds heard through earphones, with their particular features (e.g. no interaural time differences, constant interaural intensity differences), produce a feel of coming from inside the head because whenever one has tried to test this perceptual hypothesis by moving one’s head, the hypothesis has been confirmed (i.e., ITDs and IIDs have remained unchanged). So when sounds with such features are presented, it is inferred that ITDs and IIDs will be unaffected by movements, which is to say that the sounds come from inside the head. One objection, perhaps, is that sounds lateralized using only ITDs and no IIDs also immediately feel inside the head, even though they do not correspond at all to the kind of binaural sounds usually rendered through earphones (in musical recordings).

The remarks above imply the following predictions:

  • When sounds are rendered through earphones with only IIDs, they initially feel inside the head.
  • When sounds are realistically rendered through earphones with individual HRTFs (assuming we can actually reproduce the true sound waves very accurately, maybe using the transaural technique), perhaps using natural reverberation, they initially feel outside the head.
  • When the subject is allowed to move but the rendering does not follow these movements, sounds should feel (perhaps after a while) inside the head.
  • When the subject is allowed to move and the spatial rendering follows these movements (using a head tracker), the sounds should feel outside the head. Critically, this should also be true when sounds are not realistically rendered, as long as the sensorimotor relationship is accurate enough.

To end this post, I will come back to the example of speech. Why do we feel that speech comes from our mouth, or perhaps nose or throat? We cannot resolve the location of speech with touch. However, we can change the sound of speech by moving well-localized parts of our body: the jaws, the lips, the tongue, etc. This could be one explanation. But another possibility, which I find interesting, is that speech also produces tactile vibrations, in particular on the throat but also on the nose. These parts of the body have tactile sensors that can also be activated by touch. So speech should actually produce well-localized vibratory sensations at the places where we feel speech is coming from.

What I find intriguing about this remark is that it raises the possibility that the localization of sound might also involve tactile signals. So the question is: what are the tactile signals produced by natural sounds? And what tactile signals do earphones produce – do they stimulate tactile receptors on the outer ears, for example? This idea might be less crazy than it sounds. Decades ago, von Békésy used the human skin to test our sensitivity to vibrations, and he showed that we can actually perceive the ITD of binaural sounds delivered to the skin of the two arms rather than to the two eardrums. The question, of course, is whether natural sounds produce such distinguishable mechanical vibrations on the skin. Perhaps studies of profoundly deaf subjects could provide an answer. I should also note that, given the properties of the skin and of tactile receptors, I believe these tactile signals should be limited to low frequencies (say, below 300 Hz).

I now summarize this post by listing a number of questions I have raised:

  • What are the spatial auditory cues of natural sounds produced inside the head? (chewing, touching one’s head, speaking)
  • Is it possible to externalize sounds without tracking head movements? (e.g. with the transaural technique)
  • Is it possible to externalize sounds by tracking head movements, but without reproducing realistic natural spatial cues (HRTFs)?
  • What is the tactile experience of sound, and are there tactile cues for sound location? Can profoundly deaf people localize sound sources?

Update. Following a discussion with Kevin O’Regan, I realize I must qualify one of my statements. I wrote that sound waves are unaffected by head movements when the source is rigidly attached to the head. This is in fact only true in an anechoic environment. As soon as there is a reflecting surface, which does not move with the head, moving the head does affect the sound waves (specifically, the echoes). In other words, the fact that echoic cues are affected in a lawful way by movements is characteristic of sound sources outside the head, whether or not they are rigidly attached to the head. To be more precise, monaural echoic cues change with head movements for an external source attached to the head, while binaural echoic cues change with head movements for an external source that is not attached to the head.