Bixby, Samsung’s voice assistant, keeps getting better and better! We are excited to announce a set of new sound enhancement technologies, developed at the Samsung AI Center Cambridge together with the Language and Voice Team at Samsung Research, which help on-device Bixby hear human voices more clearly than before in noisy environments. We developed a personalized sound enhancement (PSE) solution that runs on-device in real time and, together with ASR, produces high-quality transcripts. Our personalized approach removes background noise such as traffic and dog barking; because it recognizes speech from the owner of the device, it can also remove interfering human voices commonly picked up by smartphone microphones in crowded streets or busy indoor spaces. Our solution is available on the Galaxy S22-23, Fold3-4 and Flip3-4 for users who activate on-device ASR and register their voice, and is now available on a larger number of Samsung Galaxy devices including the Fold5 and Flip5, boosting transcription quality for millions of Galaxy users as illustrated in Figure 1.
Figure 1. On-device real-time ASR systems excel at producing high-quality transcripts in clean environments without noise and in noisy ambient environments (e.g., background traffic or fan noise), but often fall short in noisy babble environments (such as those with interfering speech). Our PSE solution improves Korean ASR transcript quality, with relative WER reductions of up to 80%.
The performance of industry-scale ASR has improved considerably over the past years, and most modern Samsung mobile and IoT devices now support some form of voice interface, using either a cloud-based or an on-device real-time ASR. Despite this progress, real-world performance of on-device real-time ASR remains very challenging due to a number of factors, including the presence of background noise such as traffic, voices from interfering speakers, variations in room structure, and increased distance of the user from the microphone. Modern industry-scale ASR systems predominantly use deep neural networks as the backbone of the underlying recognition system and employ various data augmentation schemes during training to make the ASR robust to various noise types. For reduced latency and computation, rather than operating directly on the audio signal, ASR systems often operate on the norm of the Fourier transform, as depicted in Figure 2. However, the performance of ASR systems often suffers when the microphone recording captures two or more human voices simultaneously. For illustration, we consider the performance of an on-device real-time ASR [1] under three different conditions. Under no noise, i.e., clean conditions, ASR systems usually perform quite well. When ambient noise is present, the WER increases. However, ASR performance worsens significantly in the presence of babble or interfering voices; in fact, the WER often becomes too high for the system to be usable. Sound enhancement technologies have therefore become popular for making ASR robust, especially in noisy environments.
Figure 2. On-device real-time ASR often operates on the norm of the Fourier transform rather than on the raw audio signal for reduced latency and computation, yielding high-quality transcripts in clean and ambient scenarios.
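To make this front-end concrete, the sketch below computes such magnitude features with PyTorch. The FFT size, hop length and window are illustrative assumptions, not the values shipped on-device.

```python
import torch

def asr_features(waveform: torch.Tensor, n_fft: int = 512,
                 hop_length: int = 160) -> torch.Tensor:
    """Illustrative ASR front-end: compute the short-time Fourier
    transform and keep only its norm, discarding the phase."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()  # magnitude spectrogram of shape (freq_bins, frames)
```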
To improve the robustness of Bixby’s on-device real-time ASR in complex real-world situations, we developed a personalized, lightweight, cheap and fast audio cleaning technology that handles interfering-speaker situations well [2]. Our solution runs in real time entirely on Samsung devices and does not require any cloud connectivity, thereby safeguarding every Galaxy user’s privacy. This PSE solution uses “Hi, Bixby” enrolment utterances from the owner of the device for effective ASR personalization. Figure 1 depicts the ASR robustness gains enabled by PSE. Specifically, when PSE is used as a pre-processing module to the ASR system, cleaned features from PSE are passed to the ASR engine, which can then generate high-quality transcripts even in noisy environments, as shown in Figure 3. The transcript quality improvements enabled by PSE pre-processing are impressive, with relative WER reductions of up to 80% in the targeted babble noise interference case. Equally impressive, PSE improves ASR even in traditional, already optimized deployment scenarios without noise, i.e., in the clean and ambient noise interference cases.
Figure 3. On-device real-time ASR with PSE pre-processing yields robust transcription, with significant improvements in babble noise scenarios as well as in clean and ambient noise scenarios, as demonstrated in Figure 1.
With this high-level overview of the problem and solution, we can now turn to a more technical discussion of the use cases and requirements for on-device real-time PSE and ASR. A new technology needs to satisfy a large number of requirements before it is ready to be shipped in a product. For the commercialization of our PSE cleaning technology for downstream ASR on smartphones, the key performance requirements included: (i) the quality gain from enhancement should be very high; (ii) the enhancement model should be very small in terms of overall parameters (no more than 10% of the ASR size), and its inference time or latency on a mobile CPU should be very limited (no impact on ASR emission latency); (iii) the solution should be ASR- and language-agnostic; (iv) there should be no degradation whatsoever of transcription quality when clean audio is presented to the enhancement model; and of course (v) the model must support streaming and causal inference.
➔Enhancement quality
Our primary goal was to improve the robustness of the Bixby on-device real-time ASR engine in noisy environments. As the main performance indicator, we therefore used the reduction in the ASR system’s WER measured with and without PSE pre-processing.
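For reference, both quantities can be sketched in a few lines of Python; `wer` is the standard word-level edit distance, and the numbers in the usage example are illustrative rather than measured figures.

```python
def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate: word-level edit distance between reference and
    hypothesis transcripts, normalized by the reference length."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (free on a match)
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

def relative_wer_reduction(wer_without_pse: float, wer_with_pse: float) -> float:
    """Relative WER improvement from adding PSE pre-processing."""
    return (wer_without_pse - wer_with_pse) / wer_without_pse

# Illustrative numbers only: dropping from 50% to 10% WER is an 80% relative reduction.
print(relative_wer_reduction(0.50, 0.10))  # 0.8
```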
➔Light-weight and low-latency
As the PSE model runs continuously on the device whenever the ASR is active, its overall computation and memory footprint needs to be small.
➔ASR and language agnostic
Ideally, an enhancement solution should be ASR and language agnostic to allow deployment across a large number of systems and languages.
➔Clean audio performance
While the PSE model helps to improve the quality of the transcript in noisy conditions, it should not have any adverse effect, in terms of ASR’s WER, when a clean signal is recorded by the smartphone microphone.
➔Causal inference, limited look-back, zero look-ahead
The PSE model should be able to perform inference without requiring any future measurements and with only limited history. This requirement is essential for streaming-based inference, as many downstream applications, in our case ASR, need to start generating output within a limited amount of time.
The key technological challenge is thus building an ASR- and language-agnostic, efficient PSE model, with minimal impact on the size, compute and latency of the overall system, that can be effectively personalized from a limited amount of “Hi, Bixby” enrolment audio to yield high-quality ASR transcripts under clean, ambient and babble noise scenarios.
➔Spectrogram masking
As shown in Figure 2, ASR systems typically consume spectrograms, produced by applying a short-time Fourier transform feature extractor to the audio signal, keeping the norm and discarding the phase information. This provides an opportunity to design PSE as a spectrogram masking technique, which is both efficient and effective with minimal pre- and post-processing. The underlying modeling assumption is that different voices and noises occupy different frequency elements of the spectrogram representation, so interfering speech can be removed by multiplying the spectrogram by a learned mask taking values between zero and one. The PSE system thus takes the role of predicting masks from noisy spectrograms, as depicted in Figure 3. To do so, it leverages not only the noisy spectrograms but also a pre-computed voice profile obtained from the “Hi, Bixby” enrolment audio via a voice profile model, often also called a speaker embedding model.
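A minimal sketch of this masking step follows; `mask_model` is a hypothetical stand-in for the PSE network, and bounding the mask with a sigmoid is one common way to keep its values between zero and one.

```python
import torch
import torch.nn as nn

def apply_pse_mask(noisy_mag: torch.Tensor,
                   voice_profile: torch.Tensor,
                   mask_model: nn.Module) -> torch.Tensor:
    """Predict a time-frequency mask in (0, 1) from the noisy magnitude
    spectrogram and the enrolled voice profile, then apply it
    element-wise. `mask_model` is a placeholder for the PSE network."""
    mask = torch.sigmoid(mask_model(noisy_mag, voice_profile))
    return mask * noisy_mag  # cleaned magnitude, handed to the ASR engine
```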
➔Norm and phase modeling
While developing a PSE architecture and personalization mechanism suitable for integration with the ASR system at minimal cost, we carved out a small but important unused model size budget which could potentially allow for improved ASR. We chose to spend this budget by keeping the PSE output size the same but tripling the PSE input size with additional input information for improved output quality. As mentioned earlier, spectrogram masking approaches traditionally enhance only the norm of the frequency domain, which effectively discards the per-frequency phase information. We chose to include the phase information as well, in a specific format based on the magnitude, the cosine of the phase and the sine of the phase, with significant improvements in ASR quality.
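The three-fold input can be sketched as follows; as before, the STFT parameters are illustrative assumptions.

```python
import torch

def pse_input_features(waveform: torch.Tensor, n_fft: int = 512,
                       hop_length: int = 160) -> torch.Tensor:
    """Three-fold PSE input: magnitude, cos(phase) and sin(phase) of the
    STFT, stacked along a channel axis."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag, phase = spec.abs(), torch.angle(spec)
    # Shape (3, freq_bins, frames); the predicted mask stays magnitude-sized
    return torch.stack([mag, torch.cos(phase), torch.sin(phase)], dim=0)
```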
➔Fully convolutional model architecture
Among causal architectures with limited look-back and zero look-ahead, we found time-depth separable dilated causal convolutions particularly effective within our size, compute and latency budget, with excellent hardware performance.
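One plausible reading of such a block, sketched in PyTorch under our own assumptions about layer sizes: a depthwise dilated convolution over time with left-only padding (hence zero look-ahead and a bounded look-back), followed by a pointwise convolution that mixes channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTDSBlock(nn.Module):
    """Sketch of a time-depth separable dilated causal convolution block:
    a depthwise convolution over time (left padding only, so zero
    look-ahead), followed by a pointwise channel-mixing convolution."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # bounded look-back
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad the past only, never the future
        x = F.pad(x, (self.left_pad, 0))
        return torch.relu(self.pointwise(self.depthwise(x)))
```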
➔Personalization through activation learning
To conclude our solution overview, we describe how we personalized the PSE model so that only the voice of the device owner is kept, enabling ASR to work in crowded streets or busy indoor spaces no matter the noise type. In large, high-latency, non-causal PSE models, personalization can easily be achieved by concatenating or modulating the voice profile with latent vectors of the PSE model. Unfortunately, in our on-device setting, characterized by extremely small model size, latency, CPU and battery budgets, such techniques yield low-quality ASR WER improvements. To solve this, we developed a new personalization technique based on learned activations [2], which delivers significant ASR transcription quality improvements at only minimal additional cost. The key idea is to first convert the voice profile into a personalized probability vector, which then multiplies each of a set of basic activations to create unique per-user, per-layer activations, as depicted in Figure 4. At a high level, learned activations condition the non-linearities of the model, whereas previous techniques condition the linear parts of the model. Given that deep neural networks have a large number of activations, the compounded effect of this technique is very powerful, especially in low-resource deployment settings.
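A minimal sketch of this idea, assuming a single enrolled voice profile vector; the bank of four basic activations and the linear projection below are illustrative stand-ins (the example in Figure 4 uses eleven basic activations).

```python
import torch
import torch.nn as nn

# Illustrative bank; Figure 4's example uses eleven basic activations
BASIC_ACTS = [torch.relu, torch.tanh, torch.sigmoid, nn.functional.gelu]

class LearnedActivation(nn.Module):
    """Sketch of personalization via learned activations: project the
    enrolled voice profile to a probability vector and use it to mix a
    fixed bank of basic activations into a per-user, per-layer
    non-linearity."""
    def __init__(self, profile_dim: int):
        super().__init__()
        self.proj = nn.Linear(profile_dim, len(BASIC_ACTS))  # one per layer

    def forward(self, x: torch.Tensor, profile: torch.Tensor) -> torch.Tensor:
        # profile: (profile_dim,) voice embedding of the device owner
        p = torch.softmax(self.proj(profile), dim=-1)  # probability vector
        return sum(p[i] * f(x) for i, f in enumerate(BASIC_ACTS))
```

Because a separate `LearnedActivation` instance replaces each non-linearity, the same voice profile yields a different activation mixture at every layer, matching the per-layer profiles shown in Figure 4.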
Figure 4. Personalization via learned activations for PSE, for voice profile vectors of randomly selected users. The plots highlight that learned activations exhibit different profiles across users and layers. In this example there are eleven basic activations, each weighted by one entry of a probability vector derived from the user’s voice profile vector.
Tightly integrated, these modeling strategies allow the on-device real-time Bixby ASR and PSE to significantly improve users’ transcription experience at the same latency, CPU and battery cost, across previously supported clean and ambient scenarios as well as new babble noise scenarios.
• Chanwoo Kim and Chang Woo Han - Samsung Research
• Malcolm Chadwick - Samsung AI Center - Cambridge
[1] "Conformer-Based on-Device Streaming Speech Recognition with KD Compression and Two-Pass Architecture" Park, Jin, Park, Kim, Sandhyana, Lee, Han, Lee, Jung, Han and Kim, IEEE Spoken Language Technology Workshop 2022, https://ieeexplore.ieee.org/abstract/document/10023291
[2] "Conditioning Sequence-to-sequence Networks with Learned Activations" Ramos, Mehrotra, Lane and Bhattacharya, ICLR 2022, https://openreview.net/forum?id=t5s-hd1bqLk