| Download | View final version: Exploring cross-utterance speech contexts for conformer-transducer speech recognition systems (PDF, 4.9 MiB) |
|---|
| DOI | https://doi.org/10.1109/TASLPRO.2025.3606235 |
|---|
| Author | Cui, Mingyu (ORCID: https://orcid.org/0009-0000-9906-946X); Geng, Mengzhe (ORCID: https://orcid.org/0000-0002-7886-439X); Deng, Jiajun (ORCID: https://orcid.org/0000-0001-8874-4167); Deng, Chengxi; Kang, Jiawen; Hu, Shujie (ORCID: https://orcid.org/0000-0002-8475-4912); Li, Guinan (ORCID: https://orcid.org/0000-0002-2206-0237); Wang, Tianzi (ORCID: https://orcid.org/0009-0005-5823-3039); Li, Zhaoqing (ORCID: https://orcid.org/0000-0001-8649-4934); Chen, Xie (ORCID: https://orcid.org/0000-0001-7423-617X); Liu, Xunying (ORCID: https://orcid.org/0000-0001-6725-1160) |
|---|
| Affiliation | National Research Council of Canada. Digital Technologies |
|---|
| Funder | Hong Kong RGC GRF; Innovation Technology Fund; National Natural Science Foundation of China |
|---|
| Format | Text, Article |
|---|
| Subject | speech recognition; conformer-transducer; cross-utterance speech contexts; elderly speech; context modeling; training; speech processing; older adults; data models; transformers; foundation models; transducers; error analysis |
|---|
| Abstract | This paper investigates four cross-utterance speech context modeling approaches for streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; and iv) a novel chunk-based approach applied to C-T models for the first time. An efficient batch-training scheme is proposed for contextual C-Ts that uses spliced speech utterances within each minibatch to minimize the synchronization overhead while preserving the sequential order of cross-utterance speech contexts. Experiments are conducted on four benchmark speech datasets across three languages: the English GigaSpeech and Mandarin WenetSpeech corpora, used for contextual C-T model pre-training; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets, used for domain fine-tuning. The best performing contextual C-T systems consistently outperform their respective baselines that use no cross-utterance speech contexts in the pre-training and fine-tuning stages, with statistically significant average word error rate (WER) or character error rate (CER) reductions of up to 0.9%, 1.1%, 0.51%, and 0.98% absolute (6.0%, 5.4%, 2.0%, and 3.4% relative) on the four tasks respectively. Their competitive performance against Wav2vec2.0-Conformer, XLSR-128, and Whisper models highlights the potential benefit of incorporating cross-utterance speech contexts into current speech foundation models. |
|---|
| Publication date | 2025-09-04 |
|---|
| Publisher | Institute of Electrical and Electronics Engineers |
|---|
| Language | English |
|---|
| Peer reviewed | Yes |
|---|
| Record identifier | 136ad8da-3106-4795-bdf0-eebf08796b5f |
|---|
| Record created | 2025-10-29 |
|---|
| Record modified | 2025-10-29 |
|---|
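The abstract's approaches ii) and iii) can be pictured with a toy NumPy sketch. This is a minimal illustration only: the embedding dimensions, random projection matrix, and mean pooling below are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder embeddings, one row per frame: (num_frames, dim).
prev_emb = rng.standard_normal((40, 8))   # previous utterance
curr_emb = rng.standard_normal((60, 8))   # current utterance

# ii) Embedding concatenation: prepend the previous utterance's
# encoder embeddings to the current utterance along the time axis.
concat_ctx = np.concatenate([prev_emb, curr_emb], axis=0)

# iii) Pooling projection: compress the previous utterance into a
# single mean-pooled vector, apply a (toy) linear projection, and
# prepend that one context frame instead of all 40 frames.
W = rng.standard_normal((8, 8)) * 0.1     # illustrative projection
pooled = prev_emb.mean(axis=0) @ W        # shape: (dim,)
pooled_ctx = np.concatenate([pooled[None, :], curr_emb], axis=0)

print(concat_ctx.shape)  # (100, 8)
print(pooled_ctx.shape)  # (61, 8)
```

The trade-off the sketch makes visible: concatenation preserves the full context at the cost of a longer input sequence, while pooling projection adds only a single frame of context overhead.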