Computational Auditory Scene Analysis ("CASA")

In human hearing, when two or more sounds occur at the same time, the frequency components of each sound reach the ear almost simultaneously. The auditory system must therefore determine which components belong together and should be identified as the same sound. To do this, it evaluates acoustic energy against an array of characteristics, or “cues” (such as pitch, frequency, intensity, onset, spatial location, and duration), which are encoded as a rapid series of electrical signals that the brain can interpret. The auditory areas of the brain then sort and group this acoustic information so that we perceive a sound as coming from one particular source or event. The same cues also allow acoustic elements to be linked together over time, producing an auditory stream that is interpreted as arising from a single sound-generating event. This is how we perceive a dog barking, a person talking, or a piano playing.
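As a rough illustration of cue-based grouping, the sketch below groups frequency components by a single cue, common onset time. It is only a toy model, not a description of how the ear or any product works: the `Component` type, the `group_by_onset` function, the 20 ms tolerance, and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One frequency component of a mixture (hypothetical type for this sketch)."""
    freq_hz: float   # frequency of the component
    onset_s: float   # time at which the component starts


def group_by_onset(components, tolerance_s=0.02):
    """Group components that start within tolerance_s of the first onset in a group."""
    groups = []
    for comp in sorted(components, key=lambda c: c.onset_s):
        if groups and comp.onset_s - groups[-1][0].onset_s <= tolerance_s:
            groups[-1].append(comp)   # starts together with the current group
        else:
            groups.append([comp])     # a later onset opens a new group
    return groups


# Two made-up sound events: one starting near 0 s, one near 0.25 s.
mixture = [
    Component(220.0, 0.000), Component(440.0, 0.005), Component(660.0, 0.003),
    Component(310.0, 0.250), Component(620.0, 0.252), Component(930.0, 0.256),
]

groups = group_by_onset(mixture)
for index, group in enumerate(groups):
    print(index, sorted(c.freq_hz for c in group))
```

With these example values, the components fall into two groups: 220, 440, and 660 Hz in the first, and 310, 620, and 930 Hz in the second. A fuller model would combine several cues, such as pitch and spatial location, rather than relying on onset alone.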

The processes that allow the human auditory system to perceive and organize sound are known as “Auditory Scene Analysis” (ASA), a term coined by psychologist Albert Bregman to describe the principles the auditory system uses to organize acoustic inputs into perceptually meaningful elements. Through ASA, we can accurately group sounds, even when they are composed of multiple frequencies, as in music, or are heard simultaneously, and avoid blending components that should be perceived as coming from separate sources or events. As a result, ASA allows us to distinguish and identify a sound of interest, such as a voice, among other noise sources. For example, at a party we can tune out the other voices, music, and noise in order to listen selectively to a single conversation.

This is a well-known illustration of ASA, often referred to as the “cocktail party effect”: our ability to focus attention on one person talking, or on a single sound source, and block out the surrounding noise. The same ability enables us to hear and converse in noisy places, such as a crowded café or a busy street.

Audience is a pioneer in developing commercial products based on the principles of ASA, employing the science of “Computational Auditory Scene Analysis” (CASA), the field of study that attempts to recreate in machines the sound source separation performed by human hearing. Using CASA, Audience’s earSmart Advanced Voice processor mimics the processes of human hearing to characterize and group complex mixtures of sound into “sources”, based on a diverse list of cues such as pitch, onset/offset time, spatial location, and harmonicity. The processor evaluates these groupings to identify and isolate the primary voice signal or conversation.

Dozens of grouping principles underlie CASA, but they can be broadly categorized into those that operate across time and those that operate across frequency. Learned patterns also play an important role. The job of CASA is to group incoming sensory information to form an accurate representation of the sounds in the environment. A CASA-based system handles a sound source effectively whether it is steady and constant or transitory and moving.

Before these processes can take place, the sound signals collected by the two microphones and arriving at the earSmart voice processor must be digitized and then transformed to the frequency domain. We accomplish this transformation through the Fast Cochlea Transform™.
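To make the idea of grouping across frequency concrete, the sketch below transforms one frame of a two-source mixture to the frequency domain and then groups the spectral peaks by harmonicity. It is a minimal illustration under stated assumptions, not the earSmart implementation: a plain discrete Fourier transform stands in for the Fast Cochlea Transform, whose details are not described here, and the sample rate, frame length, thresholds, and all function names are hypothetical choices made for this example.

```python
import cmath
import math

SAMPLE_RATE = 4000   # Hz, illustrative value
FRAME_LEN = 400      # samples per frame, giving 10 Hz bin spacing


def dft_magnitudes(frame):
    """Magnitude spectrum from a plain DFT (stand-in transform), bins 0 .. N/2 - 1."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def spectral_peaks(mags, bin_hz, rel_threshold=0.5):
    """Frequencies of bins at or above a fraction of the largest magnitude."""
    limit = rel_threshold * max(mags)
    return [k * bin_hz for k, m in enumerate(mags) if m >= limit]


def group_by_harmonicity(peaks_hz, tolerance=0.03):
    """Greedy grouping: the lowest unassigned peak is treated as a fundamental,
    and every remaining peak near an integer multiple of it joins its group."""
    groups = {}
    remaining = sorted(peaks_hz)
    while remaining:
        f0 = remaining[0]
        members = [f for f in remaining if abs(f / f0 - round(f / f0)) <= tolerance]
        groups[f0] = members
        remaining = [f for f in remaining if f not in members]
    return groups


def harmonic_tone(f0, n_harmonics=3):
    """One frame of a synthetic tone made of the first n_harmonics of f0."""
    return [
        sum(math.sin(2 * math.pi * f0 * h * i / SAMPLE_RATE)
            for h in range(1, n_harmonics + 1))
        for i in range(FRAME_LEN)
    ]


# Mix two synthetic harmonic sources, as if two sounds occurred simultaneously.
frame = [a + b for a, b in zip(harmonic_tone(200.0), harmonic_tone(310.0))]
peaks = spectral_peaks(dft_magnitudes(frame), bin_hz=SAMPLE_RATE / FRAME_LEN)
sources = group_by_harmonicity(peaks)
print(sources)
```

For this synthetic mixture, the six peaks separate into two groups, one built on 200 Hz and one on 310 Hz. The tones were chosen to fall exactly on DFT bins so that simple thresholding is enough. Real signals would need windowing and proper peak picking, and a practical system would combine harmonicity with the other cues listed above.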