In his book on vision, David Marr acknowledges that a major computational issue for sensory systems is to extract relevant information in a way that is invariant to a number of changes in the world, for example to recognize a face independently of its orientation and distance. Here we hit a major difference between representational theories and what I shall call structural theories, such as Gibson’s ecological theory (see my post on the difference between these two theories). In a representational theory, invariant processing is obtained by building a representation that is itself invariant to a number of transformations (e.g. translations, rotations). How can this representation be built? There are two ways: either it is wired (innate) or it is acquired, learned by associating many transformed instances of the same object with the same “percept”. So in a representational theory, dealing with invariance is a tedious learning process requiring supervision. In a structural theory, the problem does not actually arise, because invariants are precisely the basis of perception.
I will give an example in hearing. There are two theories of pitch perception. Pitch is the percept associated with how low or high a musical note is. It mostly corresponds to the periodicity of the sound wave. Two periodic sounds with the same repetition rate will generally have the same pitch. But they may have different timbres, i.e., different spectral contents. In the spectral or template theory, there is an initial representation of sounds consisting of a spectral pattern. This pattern is then compared with the spectral patterns of reference periodic sounds with various pitches, the templates. These templates need to be learned, and the task is not entirely trivial because periodic sounds with the same pitch can have non-overlapping spectra (for example a pure tone, and a complex tone without the first harmonic). The spectral theory of pitch is a representational theory of pitch. In this account, there is nothing special about pitch: it is just a category of sound spectra.
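To make the non-overlapping spectra concrete, here is a minimal sketch in Python. The sampling rate, the 200 Hz repetition rate, the choice of harmonics 2 to 4 for the complex tone, and all function names are assumptions chosen for illustration. It measures the energy of each sound at the first few harmonics and checks that the two sets of active components have nothing in common, even though both sounds repeat every 5 ms.

```python
import math

FS = 8000   # sampling rate in Hz (illustrative assumption)
F0 = 200    # common repetition rate in Hz (illustrative assumption)
N = 800     # 0.1 s of signal, a whole number of periods

def tone(harmonics):
    """Sum of equal-amplitude sine components at the given harmonics of F0."""
    return [sum(math.sin(2 * math.pi * h * F0 * n / FS) for h in harmonics)
            for n in range(N)]

def magnitude_at(signal, freq):
    """Amplitude of the signal's Fourier component at a single frequency."""
    re = sum(x * math.cos(2 * math.pi * freq * n / FS) for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * n / FS) for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

def active_harmonics(signal, max_harmonic=5, threshold=0.5):
    """Harmonic numbers of F0 at which the signal has substantial energy."""
    return {h for h in range(1, max_harmonic + 1)
            if magnitude_at(signal, h * F0) > threshold}

pure = tone([1])                        # pure tone at 200 Hz
missing_fundamental = tone([2, 3, 4])   # complex tone without the first harmonic

pure_set = active_harmonics(pure)
complex_set = active_harmonics(missing_fundamental)
print("pure tone components:   ", sorted(pure_set))
print("complex tone components:", sorted(complex_set))
print("shared components:      ", sorted(pure_set & complex_set))
```

The pure tone only has energy at the first harmonic and the complex tone only at harmonics 2 to 4, so a template matcher working on these spectral patterns has no shared component to rely on. It has to be told that the two patterns belong to the same pitch.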
The temporal theory of pitch, on the other hand, postulates that the period of a sound is detected. I call it a structural theory because pitch corresponds to a structural property of sounds, their periodicity. One can observe that the same pattern in the sound wave is repeated, at a particular rate, and this observation does not require learning. Now this means that if two sounds with the same period are presented, I can immediately recognize that they share the same structural property, i.e., they have the same pitch. Learning, in a structural theory, only means associating a particular structure with a label (say, the name of a musical note). The invariance problem disappears in a structural theory, because the basis of the percept is an invariant: the periodicity does not depend on the sound’s spectrum. This also means that sounds that elicit a pitch percept are special because they have a particular structure. In particular, periodic sounds are predictable. White noise, on the other hand, has no structure and does not elicit a pitch percept.
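As a rough illustration of how periodicity can be observed without any learned template, here is a minimal Python sketch. It uses a normalized autocorrelation, which is only one possible way of detecting a period and is my assumption here, not a mechanism the temporal theory is committed to. The sampling rate, window length, lag range, and function names are likewise illustrative. The sketch compares a window of the sound with a shifted copy of itself and reports the shift at which the two match best.

```python
import math
import random

FS = 8000      # sampling rate in Hz (illustrative assumption)
F0 = 200       # repetition rate in Hz, i.e. a period of 40 samples
WINDOW = 400   # number of samples compared at each lag

def tone(harmonics, n_samples=800):
    """Sum of equal-amplitude sine components at the given harmonics of F0."""
    return [sum(math.sin(2 * math.pi * h * F0 * n / FS) for h in harmonics)
            for n in range(n_samples)]

def similarity(signal, lag):
    """Normalized correlation between a window and its copy shifted by lag."""
    a = signal[:WINDOW]
    b = signal[lag:lag + WINDOW]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm

def best_period(signal, min_lag=10, max_lag=60):
    """Return (lag, similarity) for the lag at which the signal best repeats."""
    return max(((lag, similarity(signal, lag)) for lag in range(min_lag, max_lag + 1)),
               key=lambda pair: pair[1])

rng = random.Random(0)
pure = tone([1])                          # pure tone
missing_fundamental = tone([2, 3, 4])     # complex tone without the first harmonic
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]   # white noise

for name, sound in [("pure tone", pure),
                    ("complex tone", missing_fundamental),
                    ("white noise", noise)]:
    lag, score = best_period(sound)
    print(f"{name}: best lag = {lag} samples, similarity = {score:.2f}")
```

Both periodic sounds match themselves almost perfectly at a shift of 40 samples (5 ms), although their spectra do not overlap. The white noise has no shift at which it matches itself well, which fits the observation that it does not elicit a pitch percept.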