Processing & Interpreting Sounds with CASA

In the characterization process, the “cues” or characteristics of each sound component are computed: the acoustic information is coded according to measures such as pitch, spatial location, and onset time, so that these cues can later inform grouping and interpretation.
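One way to picture the output of characterization is as a set of components, each tagged with its cue values. The sketch below is purely illustrative; the field names are hypothetical and not Audience's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """One detected sound component, tagged with the cues the text
    describes. All field names here are illustrative assumptions."""
    freq_hz: float       # centre frequency of the component
    pitch_hz: float      # estimated fundamental, if the source is pitched
    azimuth_deg: float   # spatial-location estimate (negative = left)
    onset_ms: float      # time at which the component started

# A harmonic of a talker at 220 Hz, arriving from 30 degrees left:
c = Component(freq_hz=880.0, pitch_hz=220.0, azimuth_deg=-30.0, onset_ms=12.5)
print(c.pitch_hz)
```

Downstream grouping can then compare these tags across components, as the following sections describe.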

One of the most powerful simultaneous grouping cues is “Pitch”. The harmonics generated by a pitched sound source fall at integer multiples of its fundamental frequency, forming a distinctive frequency pattern. This pattern is a reliable way to group components and identify them as belonging to one sound rather than another. For example, because a male voice and a female voice typically have different fundamental frequencies, their signals are easy to separate using Pitch.
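A minimal sketch of the idea: estimate each frame's fundamental frequency and use it to tell two talkers apart. Real CASA systems use far more robust harmonic-pattern matching; the simple autocorrelation method below is an assumption chosen for illustration.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame via
    autocorrelation (a simplification of harmonic-pattern grouping)."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)          # shortest period considered
    lag_max = int(sample_rate / fmin)          # longest period considered
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag

# Two synthetic "voices": harmonic tones at 120 Hz and 220 Hz.
sr = 16000
t = np.arange(0, 0.05, 1 / sr)
male = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in (1, 2, 3))
female = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in (1, 2, 3))
print(estimate_pitch(male, sr), estimate_pitch(female, sr))
```

Because the two estimated fundamentals are far apart, components carrying the 120 Hz tag and those carrying the 220 Hz tag can be assigned to different streams.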

Another important grouping cue is “Spatial Location”, enabled by using two microphones that function like ears, collecting the sounds surrounding your mobile phone. Spatial Location allows the voice processor to determine both the direction a sound is coming from and its distance from each of the microphones, so the components of a sound can be grouped by their location relative to the two microphones.

Characterization also relies on another powerful grouping cue, “Common Onset/Offset Time”. Frequency components from a single sound source often start and/or stop at the same time; these moments are the components’ onset and offset times. When a group of frequency components arrives at the ear simultaneously, they have usually come from the same source.

These cues are combined with the raw frequency data from the Fast Cochlea Transform™ to provide the tags that allow the acoustic information to be grouped, and this Grouping process is what allows sound components to be interpreted as a particular sound.

Dozens of grouping principles are involved when CASA determines which sound or frequency components should be grouped together and treated as one sound. The grouping process clusters sound components with common or similar attributes so they can be interpreted as a single sound source. Components with similar attributes are linked to form a single auditory stream, while components with sufficiently dissimilar attributes are assigned to different auditory streams. These streams can then be tracked over time and associated with persistent or recurring sound sources in the auditory environment. For each stream it identifies, the grouping process combines the raw frequency data from the Fast Cochlea Transform with the corresponding acoustic tags.
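The clustering step above can be sketched as follows. Each component carries a cue vector (pitch, onset time, location), and components whose cues lie close together are merged into one stream. The greedy threshold clustering and the cue scaling are hypothetical simplifications; a real system weighs dozens of principles.

```python
def group_components(components, max_distance=1.0):
    """Greedy clustering of cue vectors into auditory streams
    (an illustrative stand-in for CASA's grouping principles)."""
    streams = []  # each stream is a list of cue vectors
    for comp in components:
        for stream in streams:
            centroid = [sum(c[i] for c in stream) / len(stream)
                        for i in range(len(comp))]
            dist = sum((a - b) ** 2 for a, b in zip(comp, centroid)) ** 0.5
            if dist <= max_distance:
                stream.append(comp)   # similar cues: same stream
                break
        else:
            streams.append([comp])    # dissimilar cues: new stream
    return streams

# Cue vectors (arbitrarily scaled pitch, onset, azimuth) for two talkers:
components = [(12.0, 0.0, -1.0), (12.1, 0.1, -1.0),   # talker one
              (22.0, 5.0, 1.0),  (21.9, 5.1, 1.0)]    # talker two
print(len(group_components(components)))
```

The four components collapse into two streams, one per talker, which can then be tracked over time.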

Voice Isolation
After all the sound components are grouped, they can be interpreted and identified as individual sound sources, which can then be prioritized to select particular sounds. For your mobile phone, the voice processor can accurately identify and isolate your voice, separating it from all the other auditory streams so that those streams can be suppressed or filtered out.
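Selecting one stream and suppressing the rest can be sketched as masking in the transform domain: cells tagged as belonging to the voice stream are kept, the rest are attenuated to zero. The binary mask and toy spectrogram below are illustrative assumptions, standing in for the proprietary Fast Cochlea Transform data.

```python
import numpy as np

def isolate(mixture_spec, voice_mask):
    """Suppress all time-frequency cells not tagged as belonging to
    the selected voice stream (a binary-mask simplification)."""
    return mixture_spec * voice_mask

# Toy 4-band x 3-frame magnitude spectrogram: voice energy in bands 0-1,
# interfering noise in bands 2-3.
mixture = np.array([[1.0, 1.0, 1.0],
                    [0.8, 0.9, 0.7],
                    [0.5, 0.5, 0.5],
                    [0.4, 0.4, 0.4]])
mask = np.array([[1], [1], [0], [0]])   # tags produced by grouping
clean = isolate(mixture, mask)
print(clean)
```

After masking, the noise bands carry no energy while the voice bands are untouched; the cleaned transform data is what the final stage converts back to audio.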

Inverse Fast Cochlea Transform
This last process is the inverse of the Fast Cochlea Transform: it converts the Fast Cochlea Transform data back into clear, high-quality digitized audio, ready for transmission over the wireless network.
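The Fast Cochlea Transform and its inverse are proprietary, so as a stand-in the sketch below shows the same analyze-process-resynthesize pattern with a generic windowed FFT: overlapping frames are transformed, and an overlap-add inverse reconstructs the waveform.

```python
import numpy as np

def analyze(x, n=256, hop=128):
    """Transform overlapping windowed frames to the frequency domain."""
    win = np.hanning(n)
    return np.array([np.fft.rfft(win * x[i:i + n])
                     for i in range(0, len(x) - n + 1, hop)])

def resynthesize(frames, n=256, hop=128):
    """Overlap-add inverse: convert frames back to a waveform."""
    win = np.hanning(n)
    out = np.zeros(hop * (len(frames) - 1) + n)
    norm = np.zeros_like(out)
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + n] += win * np.fft.irfft(frame, n)
        norm[k * hop:k * hop + n] += win ** 2
    return out / np.maximum(norm, 1e-8)   # undo the window weighting

sr = 8000
t = np.arange(0, 0.25, 1 / sr)
x = np.sin(2 * np.pi * 440 * t)
y = resynthesize(analyze(x))
# Away from the edges, y reconstructs x essentially exactly.
```

If no processing is applied between analysis and resynthesis, the round trip is transparent; in the real pipeline, the masked, voice-only transform data is what gets resynthesized and handed to the network.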