Audience

The Audience Voice Processor is the first integrated circuit modeled after the most efficient and accurate auditory system, the human hearing system. Drawing on a thorough understanding of the entire auditory pathway, from the cochlea to the brainstem to the thalamus and cortex, Audience is the first company to deliver a commercial product based on the science of Auditory Scene Analysis (ASA), the grouping and processing of complex mixtures of sound. Because the Audience Voice Processor handles signals the way people actually perceive specific sounds, it is able to identify and suppress noise sources efficiently and accurately.

The Audience Voice Processor receives a complex mixture of sounds that often overlap at any given frequency, and organizes it into individual sources, in the same way people actually hear sounds. Regardless of whether the noise is local to the caller, or remote over the mobile network, the Audience Voice Processor uses several grouping cues to group the mixture of sound by source instantaneously, suppressing the noise and delivering the voice of interest clearly.

Cues and the Cocktail Party Effect

A well-known illustration of ASA is the so-called cocktail party effect; at a busy party, one is able to follow a conversation even though other voices and background music are present. Almost all sounds, such as the human voice, musical instruments, or cars passing in the street, are made up of many frequencies, which contribute to the perceived quality (or timbre) of the sounds. When two or more natural sounds occur at once, all the components of the simultaneously active sounds are received at the same time by the ears of listeners. This presents their auditory systems with a problem: Which frequency components should be grouped together and treated as parts of the same sound? Grouping them incorrectly can cause the listener to hear non-existent sounds built from the wrong combinations of the original components.

Dozens of grouping principles underlie ASA. These can be broadly categorized into sequential grouping cues (those that operate across time) and simultaneous grouping cues (those that operate across frequency). In addition, schemas (learned patterns) play an important role. The job of ASA is to group incoming sensory information to form an accurate mental representation of the environmental sounds. Regardless of whether a sound source is steady and constant, or transitory and moving, ASA handles each effectively.


Reverse Engineering the Auditory Pathway


Figure 1: Audience Voice Processor

Fast Cochlea Transform™

Just as the cochlea is central to the human auditory system, the Fast Cochlea Transform (FCT) is the heart of the Audience Voice Processor. Sound enters the Voice Processor through microphones, is digitized, and is passed to the Fast Cochlea Transform. The FCT transforms the digital audio stream into a three-dimensional, high-quality spectral representation of the sound mixture, as shown in Figure 2. Time is on the x axis, frequency is on the y axis, and loudness in decibels (dB) is on the z axis, represented in color from red (loudest) to blue (softest). The transformation provides optimum time-frequency resolution on a logarithmic frequency axis, without introducing frame artifacts, to allow the various components of the multiple sound sources to be characterized and separated from each other.

The FCT's transformation into the spectral domain is essential for Audience's high-performance noise suppression because it permits regions of the frequency spectrum to be separately identified with different sound sources, even when they are present simultaneously. A simple time-domain solution, operating directly on the sound waveform, would not be able to identify and separate the large number of simultaneous sources present in the example of Figure 2. Between 4 and 8 kHz, there is a high-pitched sound that appears as four parallel lines from 3 to 4 seconds on the x axis. Without the frequency transformation, this noise would be inseparable from the voice signal, represented as wavy lines, that occurs at the same time.


Figure 2: Spectral Representation using the Fast Cochlea Transform™.

The Fast Cochlea Transform is similar to the Fast Fourier Transform (FFT) that is commonly used in Digital Signal Processing. Both transform the signal into the frequency domain for audio processing, but the FCT is much better suited to mapping audio signals in several critical ways.

  • Log-Frequency Scale: The FFT transforms the audio signal into the frequency domain on a linear scale, while the FCT performs its transformation on a log-frequency scale. A log-frequency scale improves the efficiency of the transformation by concentrating resolution and computational resources where the listener can actually hear the difference.
  • Direct Computation of Critical Bandwidths: The uniform bandwidth of the FFT contrasts with the well-known Critical Bandwidths of the human ear. The FCT computes its spectral transformation with the Critical Bandwidths built directly into the computation.
  • Optimal Time-Frequency Tradeoff: The FCT provides greater accuracy in representing the audio signal at both low and high frequencies. At low frequencies, the FCT provides greater spectral resolution, which allows harmonics to be detected and sound to be re-created more accurately. At high frequencies, the FCT provides a faster response rate, which captures dynamic changes more accurately.
  • Continuous Signal Processing: The FFT transforms the audio signal by reading blocks or frames of data that are taken at a particular frame rate. When the audio signal spans these data frames, spurious artifacts can be introduced and wreak havoc with the audio stream by introducing extra "noise" into the signal. This "noise" can take the form of additional audio signals at additional frequencies, as well as inaccurate frequency representation of the real signal. The FCT takes a completely different approach. Instead of operating on blocks of data, the FCT continuously streams the incoming signal into the transformation. The result is that it can transform the audio signal into its appropriate representation without introducing frame artifacts, or additional "noise" that would have to be removed from the system.
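
The difference between a linear and a log-frequency axis can be illustrated with a small sketch. This is not the Fast Cochlea Transform itself, whose internals are not described here; it only compares where analysis channels fall on each kind of axis. The function names, the 16 kHz sample rate, the 256-point FFT, and the choice of 12 channels per octave are all assumptions made for illustration.

```python
import math

def linear_centers(sample_rate_hz, n_fft):
    """Center frequencies of FFT bins: uniformly spaced from 0 Hz to Nyquist."""
    return [k * sample_rate_hz / n_fft for k in range(n_fft // 2 + 1)]

def log_centers(f_min_hz, f_max_hz, channels_per_octave):
    """Center frequencies spaced uniformly on a logarithmic axis (illustrative)."""
    n_octaves = math.log2(f_max_hz / f_min_hz)
    n_channels = int(round(n_octaves * channels_per_octave)) + 1
    return [f_min_hz * 2 ** (i / channels_per_octave) for i in range(n_channels)]

def count_in_band(centers_hz, low_hz, high_hz):
    """Number of channels whose center lies in [low_hz, high_hz)."""
    return sum(1 for f in centers_hz if low_hz <= f < high_hz)

# Assumed example parameters, not taken from the Audience design.
fft_bins = linear_centers(sample_rate_hz=16000, n_fft=256)
log_bins = log_centers(f_min_hz=62.5, f_max_hz=8000.0, channels_per_octave=12)

low_band = (100.0, 200.0)     # one octave, where voice harmonics sit close together
high_band = (4000.0, 8000.0)  # one octave near the top of the range

print("channels in 100-200 Hz:", count_in_band(fft_bins, *low_band),
      "linear vs", count_in_band(log_bins, *low_band), "log")
print("channels in 4-8 kHz:", count_in_band(fft_bins, *high_band),
      "linear vs", count_in_band(log_bins, *high_band), "log")
```

With these assumed settings, the linear axis places only a couple of channels in the low octave and dozens in the top octave, while the log axis gives every octave the same number of channels.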


Characterization

In the Characterization process, the cues of sound components are computed. These cues are used by human beings for grouping and stream separation, and include Pitch, Spatial Location, and Onset Time, among others.

One of the most powerful simultaneous grouping cues is Pitch. The harmonics generated by a pitched sound source form distinct frequency patterns, which makes them a useful way to distinguish one sound from another. For example, a male voice and a female voice can be easily separated using Pitch. In Figure 2, the harmonics of a single voice are readily visible as the parallel wavy lines at multiple frequencies.
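
As a rough illustration of pitch-based grouping, the sketch below assigns each frequency component to the candidate fundamental whose nearest harmonic it matches most closely. It is a minimal sketch, not the Voice Processor's method: the function name, the 3% tolerance, and the example fundamentals of 110 Hz and 210 Hz are assumptions, and the fundamentals are taken as already known rather than estimated.

```python
def group_by_harmonicity(components_hz, fundamentals_hz, tolerance=0.03):
    """Assign each component to the fundamental it is most nearly a harmonic of.

    A component joins a group only if its relative distance to the nearest
    integer multiple of that fundamental is below `tolerance` (assumed value).
    """
    groups = {f0: [] for f0 in fundamentals_hz}
    ungrouped = []
    for f in components_hz:
        best_f0, best_err = None, tolerance
        for f0 in fundamentals_hz:
            harmonic = max(1, round(f / f0))          # nearest harmonic number
            target = harmonic * f0
            err = abs(f - target) / target            # relative mismatch
            if err < best_err:
                best_f0, best_err = f0, err
        if best_f0 is None:
            ungrouped.append(f)
        else:
            groups[best_f0].append(f)
    return groups, ungrouped

# Hypothetical mixture: a lower voice (110 Hz), a higher voice (210 Hz),
# and one component (160 Hz) that fits neither harmonic series.
mixture = [110, 220, 330, 210, 420, 630, 160]
groups, ungrouped = group_by_harmonicity(mixture, fundamentals_hz=[110, 210])
print(groups)
print(ungrouped)
```

At higher frequencies the harmonics of different fundamentals crowd together, so a fixed tolerance like this one becomes ambiguous; that is one reason pitch is combined with other cues.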

When two microphones are available in the system, one of the most powerful grouping cues is Spatial Location. Spatial Location enables the Voice Processor to determine the direction from which a sound is coming and its distance to each of the microphones. Once a source's location is determined, the source can be identified as noise based on its displaced location relative to the two microphones. The use of spatial positioning can be understood by snapping one's fingers off to the side of one's head. The listener can instantly identify where the sound is coming from as well as approximately how far away it is.
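
One common way to obtain a direction cue from two microphones is to estimate the time difference of arrival by cross-correlation and convert it to an angle. The sketch below shows that general technique only; it is not claimed to be what the Voice Processor does, and it covers direction but not distance. The function names, the 16 kHz sample rate, the 0.1 m microphone spacing, and the far-field assumption are all illustrative.

```python
import math
import random

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature

def estimate_delay_samples(mic_a, mic_b, max_lag):
    """Lag (in samples) by which mic_b trails mic_a, found by brute-force
    cross-correlation over lags in [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(mic_a):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += a * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Far-field arrival angle, measured from the direction perpendicular
    to the line joining the two microphones."""
    path_difference_m = SPEED_OF_SOUND_M_S * delay_samples / sample_rate_hz
    ratio = max(-1.0, min(1.0, path_difference_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))

# Synthetic test signal: the same noise burst reaches mic_b 3 samples later.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(200)]
mic_a = source
mic_b = [0.0] * 3 + source[:-3]

lag = estimate_delay_samples(mic_a, mic_b, max_lag=8)
angle = delay_to_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
print(lag, round(angle, 1))
```

A source directly in front of the microphone pair produces a delay near zero, so components with a clearly non-zero delay can be tagged as coming from off to one side.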

Another powerful grouping cue is Common Onset/Offset Time. Frequency components from a single sound source often start and/or stop at the same time. When a group of frequency components arrives at the ear at the same time, it is usually an indication that they have come from the same source.
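
The common-onset cue can be sketched as a simple one-dimensional clustering: components whose onsets fall within a short window of each other are treated as one source. This is a minimal sketch under assumptions; the function name, the 20 ms window, and the example onset times are hypothetical, and onset detection itself is taken as already done.

```python
def group_by_onset(onset_times_s, window_s=0.02):
    """Group frequency components whose onsets start within `window_s`
    of the first onset in the group.

    onset_times_s maps a component's frequency (Hz) to its onset time (s).
    Returns a list of groups, each a sorted list of frequencies.
    """
    ordered = sorted(onset_times_s.items(), key=lambda item: item[1])
    groups = []
    for freq_hz, onset in ordered:
        if groups and onset - groups[-1]["start"] <= window_s:
            groups[-1]["freqs"].append(freq_hz)
        else:
            groups.append({"start": onset, "freqs": [freq_hz]})
    return [sorted(group["freqs"]) for group in groups]

# Hypothetical onsets: three harmonics starting together, then two
# unrelated components that start together about 350 ms later.
onsets = {200: 1.000, 400: 1.004, 600: 1.008, 1500: 1.350, 3000: 1.355}
print(group_by_onset(onsets))
```

The same approach applies to offsets by passing stop times instead of start times.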

These cues are then associated with the raw Fast Cochlea Transform frequency data as acoustic tags, which are used in the subsequent Grouping process.


Grouping

The Grouping process performs a type of clustering operation such that sound components with common or similar attributes may be mutually associated into a single auditory stream, and sound components with sufficiently dissimilar attributes are associated with different auditory streams. Ultimately, the streams are tracked through time and associated with persistent or recurring sound sources in the auditory environment. The output of the Grouping block is the raw Fast Cochlea Transform frequency data associated with each stream, and the corresponding acoustic tags.
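
The clustering idea can be sketched as a greedy pass over tagged components: each component joins the first stream whose tags are close enough, or starts a new stream. This is an illustrative sketch only. The `Component` fields, the two tags used, the scale factors, and the threshold are assumptions, and the tracking of streams through time is not shown.

```python
from dataclasses import dataclass

@dataclass
class Component:
    freq_hz: float    # where the component sits in the spectrum
    pitch_hz: float   # hypothetical acoustic tag: associated pitch
    angle_deg: float  # hypothetical acoustic tag: arrival direction

def tag_distance(a, b, pitch_scale_hz=20.0, angle_scale_deg=15.0):
    """Dissimilarity of two components' tags; values of 1.0 or less count
    as similar. The scales are assumed, not taken from the source."""
    return max(abs(a.pitch_hz - b.pitch_hz) / pitch_scale_hz,
               abs(a.angle_deg - b.angle_deg) / angle_scale_deg)

def group_into_streams(components, threshold=1.0):
    """Greedy clustering: compare each component with the first member
    of every existing stream and join the first sufficiently similar one."""
    streams = []
    for comp in components:
        for stream in streams:
            if tag_distance(comp, stream[0]) <= threshold:
                stream.append(comp)
                break
        else:
            streams.append([comp])  # nothing similar enough: start a new stream
    return streams

# Hypothetical tagged components from two talkers in different directions.
components = [
    Component(220, 110, 2), Component(420, 210, 38),
    Component(330, 110, -1), Component(630, 210, 41),
    Component(110, 110, 0),
]
streams = group_into_streams(components)
print([[c.freq_hz for c in stream] for stream in streams])
```

Each resulting stream keeps its components' frequency data together with their tags, which matches the kind of output described for the Grouping block.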

Selector

The Selector process allows the separated auditory sound sources to be prioritized and selected as appropriate for the given application. In telephony applications, the primary voice of interest is selected, and the other auditory sources are eliminated or suppressed.
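
A selection step of this kind can be sketched as choosing the highest-priority stream and assigning a reduced gain to the rest. How the Voice Processor actually prioritizes streams is not described here, so the priority scores, the function name, and the 20 dB suppression figure below are assumptions for illustration.

```python
def select_stream(stream_scores, suppression_db=-20.0):
    """Pick the highest-scoring stream and return per-stream linear gains.

    stream_scores maps a stream identifier to a priority score (assumed to
    come from an application-specific rule). The selected stream passes at
    unity gain; the others are attenuated by `suppression_db`.
    """
    selected = max(stream_scores, key=stream_scores.get)
    suppressed_gain = 10.0 ** (suppression_db / 20.0)  # dB to linear amplitude
    gains = {
        stream_id: 1.0 if stream_id == selected else suppressed_gain
        for stream_id in stream_scores
    }
    return selected, gains

# Hypothetical scores for three separated streams in a phone call.
scores = {"talker": 0.9, "traffic": 0.4, "music": 0.2}
selected, gains = select_stream(scores)
print(selected, gains)
```

Setting the suppression to a very large negative value approximates eliminating the other sources rather than merely attenuating them.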

Inverse Fast Cochlea Transform

The Inverse Fast Cochlea Transform process converts the Fast Cochlea Transform data back into reconstructed, cleaned-up, high-quality digital audio, which is then converted back to an analog signal and made available for transmission.
