
Globally Normalizing the Transducer for Streaming Speech Recognition

By Rogier van Dalen, AI Center - Cambridge

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting the latest developments in research and technology in the field, and it attracts thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2025. Here is the list:

#1. Evaluation of Wearable Head BCG for PTT Measurement in Blood Pressure Intervention (Samsung Research America)

#2. Better Exploiting Spatial Separability in Multichannel Speech Enhancement with an Align-and-Filter Network (AI Center - Mountain View)

#3. Vision-Language Model Guided Semi-supervised Learning for No-Reference Video Quality Assessment (Samsung R&D Institute India-Bangalore)

#4. Text-aware adapter for few-shot keyword spotting (AI Center - Seoul)

#5. Single-Channel Distance-Based Source Separation for Mobile GPU in Outdoor and Indoor Environments (AI Center - Seoul)

#6. Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning (Samsung R&D Institute United Kingdom)

#7. Find Details in Long Videos: Tower-of-Thoughts and Self-Retrieval Augmented Generation for Video Understanding (Samsung R&D Institute China-Beijing)

#8. Globally Normalizing the Transducer for Streaming Speech Recognition (AI Center - Cambridge)

Introduction

At the AI Center in Cambridge, UK, we do research into the fundamentals of the technology that we use every day, and this work is an example. Many people talk to their devices, and speech recognition often takes place before the sentence has finished. This is called “streaming” speech recognition, and it is subtly different from the “off-line” speech recognition that is common in academic research.

The type of model normally used for streaming is the Transducer [1]. Even in an off-line setting, it explicitly steps through time while emitting symbols. This makes it obvious how to adapt it to streaming: feed the Transducer the audio as it steps through time. However, it turns out that in streaming mode the Transducer has a mathematical flaw which, simply put, restricts the model’s ability to change its mind.

Who cares, you might ask, if it works? However, at the AI Center in Cambridge, we like doing things properly. And it turns out that the mathematical flaw (called “label bias” for historical reasons) increases the error rate of the speech recogniser. This work therefore proposes a fix, which reduces the error rate.

The Transducer in streaming mode

Figure 1. The Transducer in streaming mode at t=3 and u=2. In the standard Transducer, the output f is a probability distribution P, i.e. its entries add up to 1. f predicts the next label given the current token history z1:2 and, in streaming mode, only the first part x1:3 of the input.

Figure 1 shows the general architecture of the Transducer. The “prediction network” receives the current token history z. The “transcription network” receives the audio x. When used off-line, x is the complete audio, but in the figure it is used in streaming mode: it can see only the first three frames, x1:3. The “joiner” takes the outputs of the two networks and outputs a vector of weights over the choices for the next label y. Conceptually, at each point the model chooses either to emit a blank and go to the next time step, or to emit a non-blank token z. The standard Transducer outputs a normalized vector f, i.e. its entries add up to 1.
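To make the picture concrete, here is a minimal PyTorch-style sketch of a joiner. The names, shapes, and the way the two inputs are combined are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class Joiner(nn.Module):
    """Combines the prediction network and transcription network outputs."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # vocab_size includes the special blank label.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pred_state, trans_state, locally_normalized=True):
        # pred_state:  prediction network output for the token history z_{1:u}
        # trans_state: transcription network output for the audio seen so far
        logits = self.output(torch.tanh(pred_state + trans_state))
        if locally_normalized:
            # Standard Transducer: the output f is a proper distribution at
            # every step (its entries sum to 1 after exponentiation).
            return torch.log_softmax(logits, dim=-1)
        # Globally normalized variant (discussed below): raw log-weights.
        return logits
```

The locally_normalized flag foreshadows the change discussed in the rest of this post: the globally normalized variant simply skips the per-step softmax.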

Figure 2. The Transducer stepping through time (horizontal) and through label space (vertical) for a token sequence “a d d”.

The Transducer explicitly moves to the next state by emitting not only tokens but also special “blank” labels _. The output label sequence is thus a mixture of tokens z and blanks _. Figure 2 shows how this corresponds to a path in a state diagram. A label sequence y that corresponds to the word “add” is

y = _ _ a d _ d _, z = a d d

Since the Transducer explicitly steps through time, it seems obvious how to use it for streaming. However, there is a problem with the mathematics. Denote by t(v) the time step being considered when generating label v, i.e. the number of blanks in the label sequence up to that point. The probability of the label sequence y given audio x factorizes as in this equation, which is then adapted to streaming:
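In LaTeX notation (reconstructed here from the surrounding description; the product runs over the positions v of the label sequence, and the arrow indicates the adaptation to streaming):

$$
P(y \mid x) \;=\; \prod_{v} P\!\left(y_v \mid y_{1:v-1},\, x\right)
\quad\longrightarrow\quad
\prod_{v} P\!\left(y_v \mid y_{1:v-1},\, x_{1:t(v)}\right)
$$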

The only thing that changes for streaming recognition is that the dependency on x is replaced by one on x1:t(v). But the value x1:t(v) changes depending on the loop variable v, so the right-hand side is not a correct factorization of P(y | x). This form of model could be seen as an approximation, but it is probably better described as a mistake.

The mathematical problem is known as “label bias” [3], though in this work there is no bias towards any particular label. But consider a simple example: a recognizer with only two possible output sentences, “mail order” and “nail polish”. When streaming, the hope is that after observing the first word, the system will output either “mail” or “nail”. These first words are not acoustically very distinguishable. However, the probabilities of the complete sequences are fixed after emitting the first word, since P(z2 | z1, x) = 1. In reality there will be more than one possible next token, but the total probability is still normalized to 1, and the probability of previous hypotheses cannot be changed. It is therefore local normalization of the emission probabilities that prevents the model from changing its mind.
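A toy numerical illustration of this effect (the probabilities and scores below are made up for illustration, not taken from the paper):

```python
import math

# Locally normalized streaming model: after the first word, the output is a
# proper distribution over {"mail", "nail"} given only the audio so far.
p_first = {"mail": 0.6, "nail": 0.4}
# In this two-sentence world each first word determines the second word,
# so P(z2 | z1, x) = 1 and the sequence probabilities are already fixed.
p_local = {"mail order": p_first["mail"] * 1.0,
           "nail polish": p_first["nail"] * 1.0}
print(p_local)  # {'mail order': 0.6, 'nail polish': 0.4} -- frozen forever

# Globally normalized model: unnormalized scores for complete sequences,
# normalized only once, so later audio ("polish") can still re-rank the prefix.
scores = {"mail order": 1.2, "nail polish": 2.5}
z = sum(math.exp(s) for s in scores.values())
p_global = {seq: math.exp(s) / z for seq, s in scores.items()}
print(p_global)  # "nail polish" now takes most of the probability mass
```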

Training a globally normalised Transducer

The well-known method for dealing with label bias is to remove local normalisation. In the normal Transducer, at each step the output is explicitly normalised, usually with a softmax, so that it is a proper distribution. Instead, the distribution over the complete sequence must be normalised, which requires a global normalisation constant. However, computing the global normalisation constant requires summing over all possible word sequences. That is an infinite number of sequences, and with this type of model there are no tricks that make the sum tractable.
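In general terms (this is standard globally normalised sequence modelling; the notation here is mine, not the paper's), the probability of a complete label sequence becomes

$$
P(y \mid x) \;=\; \frac{\exp s(y, x)}{\sum_{y'} \exp s(y', x)},
$$

where s(y, x) is the total unnormalised log-weight the model assigns along the path for y, and the denominator sums over all possible label sequences y'.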

Instead, this work considers only a subset of the word sequences (the paper gives more detail about the method). To find a good subset, a “search algorithm” is run on the speech recogniser for each utterance, and it comes up with its best 10 hypotheses. These 10 hypotheses need to stand in for the infinitely many other hypotheses.
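A minimal sketch of this approximation (PyTorch-style; the variable names are mine, and the paper's exact loss may differ, for example in whether the reference is included in the approximate sum):

```python
import torch

def approximate_global_loss(ref_score, nbest_scores):
    """Negative log-probability of the reference under an N-best
    approximation of the global normalisation constant.

    ref_score:    unnormalised log-score of the reference sequence, shape ()
    nbest_scores: unnormalised log-scores of the N best hypotheses, shape (N,)
    """
    # The N-best scores (plus the reference) stand in for the sum over
    # the infinitely many possible label sequences.
    all_scores = torch.cat([ref_score.reshape(1), nbest_scores])
    log_z = torch.logsumexp(all_scores, dim=0)
    return log_z - ref_score
```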

In initial experiments run at Samsung’s AI Center in Cambridge, this went terribly wrong: the recogniser learned to trick the search algorithm into accepting hypotheses which later turned out to be bad ones. To keep the recogniser in line, this work therefore proposes two approaches, to be used in parallel; a short sketch of both follows their descriptions below.

The first method is to start training from a partially trained, locally normalised model, and then slowly interpolate towards a fully globally normalised model. This work proposes a log-linear interpolation. The paper gives the mathematical and practical details, but because some terms cancel out, this is surprisingly feasible.

The second method, which gives the search algorithm a better sense of which hypotheses are viable, is to make the weights roughly normalised. This is done by adding a regularisation term: the log-sum of the unnormalised outputs should be roughly 0, so it is squared and added to the loss function.
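Here is the combined sketch of the two methods (PyTorch-style; the exact form of the interpolation and of the regularisation term are simplified assumptions, and the paper gives the precise formulation):

```python
import torch

def normalization_regularizer(logits):
    # The log-sum of the unnormalised outputs should be roughly 0,
    # so square the log-sum-exp of each output vector and average.
    log_sums = torch.logsumexp(logits, dim=-1)
    return (log_sums ** 2).mean()

def interpolated_log_weights(local_log_probs, global_logits, alpha):
    # Log-linear interpolation: alpha = 1 recovers the locally normalised
    # model; alpha is lowered slowly towards 0.3 during training.
    return alpha * local_log_probs + (1.0 - alpha) * global_logits

# In training, something like (with a small weight of 0.01, see below):
# loss = sequence_loss + 0.01 * normalization_regularizer(global_logits)
```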


Figure 3. Left: regularizing the square of the log-sums of output vectors encourages the model to become roughly normalized. The graph starts after 10 epochs of training a locally normalized model. Right: Interpolating between a locally normalized model and a globally normalized model. This happens over a longer time span: the shaded area matches the one in the left panel.

First, consider regularizing the log-sum of the output vectors. The left panel of Figure 3 shows how this works in practice. The moment the regularization weight is introduced, the squared log-sum drops. The loss is not adversely affected. No hyperparameter optimization was performed: since the model loss was around 0.05, the hypothesis was that 0.01 would be a reasonable weight that would not overpower the model loss.

Second, consider the interpolation weight between the locally normalized and the globally normalized model. The model is initialized from a locally normalized model, and the interpolation weight α is then slowly changed from 1 to 0.3. From initial experiments, it is important to reduce the interpolation weight slowly, here by about 0.25 per epoch. The process is illustrated in the right panel of Figure 3. As the interpolation weight is reduced over the course of a few epochs, the loss goes down, though it is not strictly possible to compare the loss between different settings of α.
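As a concrete illustration (a hypothetical helper; the starting epoch and exact schedule are only described qualitatively above), the schedule might look like this:

```python
def interpolation_weight(epoch, start_epoch, rate=0.25, floor=0.3):
    # Start at alpha = 1 and reduce by roughly 0.25 per epoch until 0.3.
    return max(floor, 1.0 - rate * max(0, epoch - start_epoch))
```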

Results

Experiments are run on LibriSpeech [4], a well-known, freely available dataset of 1000 hours of speech from audiobooks. None of the results use an external language model. Results are reported on the four LibriSpeech evaluation sets: “dev-clean”, “dev-other”, “test-clean”, and “test-other”. For the systems in this work, the tables report the results for each system at the epoch (out of 40) with the best word error rate on “dev-clean”.

Table 1. Word error rates and latencies for different systems. This work's globally normalized system with α=0.3 outperforms the locally normalized baseline by 9-11% relative.

Figure 4. Word error rates on dev-clean for training with different numbers of competitor sequences at α=0.3.

Table 1 shows the word error rates when introducing the global normalization proposed in this paper. As the mathematics suggest, the accuracy of streaming systems improves by going from local to global normalization. The globally normalized system at α=0.3 performs best, at 3.16% on “test-clean” compared to 3.55% for the baseline. This is an 11% relative improvement. For the other sets, the globally normalized model gives a 9-11% relative improvement. This closes almost half of the gap to the non-streaming system, which is at 2.67%. Figure 4 shows the effect of training with a smaller N-best list.

Next, let us have a look at latencies. This paper has argued that the problem with using a locally normalized Transducer for streaming is that after emitting a hypothesized label, the system cannot change its mind. This should cause delayed label emissions, which have indeed been observed in practice in previous papers.

The last column of Table 1 contains latencies. Since there are no human-produced alignments for LibriSpeech, only relative measurements can be made. To be independent of recognition errors, the posterior over alignments of the reference label sequence is computed and the average emission time of the tokens is calculated. The numbers show that most globally normalized models have lower average latencies on “test-clean”. These latency improvements are entirely unforced, which demonstrates that global normalization lessens the pressure to delay outputs until enough future context has been seen.
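One way to write this measurement (my notation, reusing t(v) from earlier; the paper's exact definition may differ): the reported latency is the expected emission time of the reference tokens, averaged over tokens, under the posterior over alignment paths,

$$
\bar{t} \;=\; \frac{1}{|z|} \sum_{v=1}^{|z|} \mathbb{E}_{P(\text{path} \mid z,\, x)}\!\left[\, t(v) \,\right].
$$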

Link to the paper

https://ieeexplore.ieee.org/abstract/document/10890301
A more extensive version is available at https://arxiv.org/abs/2307.10975

References

[1] A. Graves, “Sequence transduction with recurrent neural networks,” in International Conference on Machine Learning, Representation Learning Workshop, 2012.

[2] E. Variani, K. Wu, M. D. Riley, D. Rybach, M. Shannon, and C. Allauzen, “Global normalization for streaming speech recognition in a modular framework,” in Advances in Neural Information Processing Systems, 2022.

[3] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of International Conference on Machine Learning, 2001.

[4] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2015.