In a previous post, I emphasized the differences between vision and hearing from an ecological point of view. Here I want to comment on the sensorimotor theory of perception (O’Regan & Noë, 2001), or the enactive approach, applied to sounds. According to this theory, perception is the implicit knowledge of the effects of self-generated movements on sensory signals. Henri Poincaré made this point a long time ago: "To localize an object simply means to represent to oneself the movements that would be necessary to reach it". For example, perceiving the spatial location of an object is knowing the movements one would have to make to move to that object, or to grasp it, or to direct one's fovea to it.
There are two implicit assumptions here: 1) that there is some persistence in the sensory signals; 2) that the relevant information is spatial in nature. I will start with the issue of persistence. As I previously argued, a defining characteristic of sounds is that they are not persistent: they happen. For example, the sound of someone else hitting an object is transient. One cannot interact with it, so there cannot be any sensorimotor contingency in this experience. It could be argued that one relies on the memory of previous sensorimotor contingencies, that is, the memory of oneself producing an impact sound. This is a fair remark, I think, but it overestimates the amount of information there is in this contingency. When an impact sound is produced, the only relationships between motor commands and the acoustic signal are the timing of the impact and the sound level (related to the strength of the impact). But there is much more information in the acoustic signal of an impact sound, because the structure of this signal is related to properties of the sounding object, in particular its material and shape (Gaver, 1993). For example, the resonant modes are informative of the shape, and the decay rate of these modes indicates the nature of the material (wood, metal, etc.), properties that we can very easily identify. So there is informative sensory structure independent of sensorimotor contingencies.
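To make this concrete, impact sounds are often modeled as sums of exponentially damped sinusoids (modal synthesis, in the spirit of Gaver 1993). In the toy sketch below, all mode frequencies, decay rates and amplitudes are made up for illustration; the point is that the strike strength, the only parameter under motor control besides timing, merely scales the signal, while material and shape determine its structure:

```python
import numpy as np

def impact_sound(modes, decays, amps, strike=1.0, fs=44100, dur=0.5):
    """Synthesize an impact sound as a sum of damped sinusoids.
    modes: resonant frequencies in Hz (related to shape/size)
    decays: decay rates in 1/s (related to material)
    strike: strength of the impact (the motor-controlled parameter)."""
    t = np.arange(int(fs * dur)) / fs
    sig = np.zeros_like(t)
    for f, d, a in zip(modes, decays, amps):
        sig += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return strike * sig

# Same (hypothetical) modes, different materials: "wood" damps its
# modes quickly, "metal" lets them ring.
wood  = impact_sound([400, 900, 1600], decays=[60, 80, 120], amps=[1, .5, .3])
metal = impact_sound([400, 900, 1600], decays=[3, 4, 6],     amps=[1, .5, .3])

# Hitting harder only scales the signal; its structure is unchanged.
soft = impact_sound([400], [10], [1], strike=1.0)
hard = impact_sound([400], [10], [1], strike=2.0)
assert np.allclose(hard, 2 * soft)
```

The "metallic" version retains far more energy late in the sound than the "wooden" one, which is the kind of signal structure that specifies material independently of how the object was struck.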
Now I think we are hitting an interesting point. Even though the resonant modes are informative of the shape (the size, essentially) of an object, they cannot provide any perceptual spatial content by themselves. That is, the frequency of a resonant mode is just a number, and a number has no meaning without context. Compare with the notion of object size for the tactile system: the size of a (small) object is the extent to which one must stretch the hand to grasp it. There is no such thing in hearing; there is nothing intrinsically spatial in auditory size, it seems. If one moves and the sound is repeated, the same resonant modes will be excited. Therefore, it seems that auditory shape can only be a derived property. That is, the specific sensory structure of sounds that corresponds to shape acquires perceptual content by association with another sense that has intrinsic spatial content, i.e., vision or touch. Now we get to Gibson’s notion of invariant structure: auditory size is the structure in the auditory signals that remains the same when aspects other than size change (where the notion of size is not auditory). Here I am imagining that one hears sounds produced by various sources whose size is known, and one can identify that some auditory structure is the same for all sources that have the same size. Note the important point here: what persists is not the sensory signals, it is not the relationship between movements and sensory signals, it is not even the relationship between size and sensory signals; it is the relationship between size and the structure of auditory signals, which is itself a kind of relationship. That is, one cannot predict the auditory signals from the size: one can only predict some aspect of the structure of these signals from the size.
Here I have highlighted the fact that the auditory shape of an object is a structure of auditory signals, not a kind of sensorimotor structure. The spatial notion of shape is a secondary property of sounds that can only be acquired through other sensory modalities. But there can also be intrinsic spatial content in sounds, and in my next post, I will discuss spatial hearing.
Here you say that the shape and size of an object affect the sound it makes, but, in contrast to vision and touch, there is nothing "intrinsic" about the relation between size and the resulting sound. For vision, you say, there is such an intrinsic relation... I wonder if this is really true... I suppose if you move your body around a visual object, the movements you make are intrinsically related to the object's size... (???)
I don't think that vision so obviously and easily provides intrinsic spatial information!
One thing I would say in the case of sound is the analogy between the sounds YOU make and the sounds other things make. If I make a "small" sound with my mouth or my body, then I can find a relation between that and other sounds I hear.
I suspect that this gives rise to natural correspondences between shape and sounds, like the fact that the vowel "i" will be associated with spiky things, whereas open vowels like "o" will be associated with large round things. There is a lot of literature on such "natural" correspondences.
Certainly, I mention it at the end of my 4th post of the series. For speech, I guess this is related to the motor theory of speech perception. A simple example is pitch, in which the glottal pulse rate corresponds to the repetition rate of the acoustic wave.
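This pulse-rate/repetition-rate correspondence can be sketched with a toy source-filter model (all parameter values below are hypothetical): a glottal pulse train filtered by a single vocal-tract resonance yields an acoustic wave whose period equals the pulse period:

```python
import numpy as np

fs = 8000                        # sample rate (Hz); toy values throughout
f0 = 100                         # glottal pulse rate (Hz)
period = fs // f0                # 80 samples per glottal cycle

pulses = np.zeros(fs)            # one second of glottal pulses
pulses[::period] = 1.0

# A toy vocal-tract filter: one damped resonance at 500 Hz
t = np.arange(200) / fs
h = np.exp(-200 * t) * np.sin(2 * np.pi * 500 * t)
speech = np.convolve(pulses, h)[:fs]

# After the initial transient, the acoustic wave repeats at the pulse rate
assert np.allclose(speech[400:1200], speech[400 + period:1200 + period])
```

Whatever the resonance (the "filter"), the repetition rate of the output is set by the pulse rate of the source, which is what the motor system controls.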
About vision, what I mean is that movements (including eye movements) have a direct impact on the visual field, in a way that is related to the shape of the object. I am not saying that it is obvious to extract this information, but there is a sensorimotor structure that is highly constrained by the shape of the object. In hearing, by contrast, moving has no impact on the sound produced by a source, except on its intensity (leaving localization information aside).