This article walks through six papers to trace the development of end-to-end speech models from several angles, including modeling approaches, latency optimization, and data augmentation, and discusses the strengths and weaknesses of the different end-to-end speech recognition models.
Seq2Seq
Reference paper: Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. ICASSP 2016 (William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals)
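The "Attend" step of LAS scores every encoder frame against the current decoder state and forms a weighted context vector. Below is a minimal numpy sketch of that step; the dot-product scoring function, the function name `attend`, and the toy shapes are illustrative assumptions, not the exact formulation in the paper (which uses a learned MLP energy function).

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Sketch of the LAS attention step (dot-product scoring assumed):
    score each encoder frame against the decoder state, softmax over
    time, then return the attention-weighted context vector."""
    scores = encoder_states @ decoder_state        # (T,) one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over time steps
    context = weights @ encoder_states             # (d,) weighted sum of frames
    return context, weights

# toy example: 5 encoder frames ("listen" outputs), hidden size 4
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))    # encoder states
s = rng.normal(size=4)         # current decoder state
context, w = attend(s, h)
```

At each output step the decoder ("speller") consumes this context vector together with the previously emitted character, which is what lets the model align input frames to output symbols without explicit segmentation.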
CTC
Reference paper: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006 (Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber)
The CTC objective sums over all valid alignments: P(y|x) = Σ_{A} Π_t p(a_t|x), where A is a valid alignment path between x and y, and a_t denotes the output aligned to x at time step t.
For more derivation details, see https://distill.pub/2017/ctc/
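The definition above can be checked directly on a tiny example by brute force: enumerate every length-T path, collapse it with the CTC rule B (merge repeats, then drop blanks), and sum the probabilities of the paths that collapse to the target. This is a sketch for intuition only; the function names and the toy per-frame distribution are made up, and real implementations use the dynamic-programming forward algorithm instead of enumeration.

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """CTC collapsing function B: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for a in path:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return tuple(out)

def ctc_prob(probs, target, blank=0):
    """P(y|x) = sum over all paths A with B(A) = y of prod_t p(a_t|x).
    Brute force over all V^T paths -- only viable for tiny examples."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, a in enumerate(path):
                p *= probs[t, a]       # paths assumed conditionally independent per frame
            total += p
    return total

# toy posterior: 3 frames, vocabulary {blank=0, 'a'=1, 'b'=2}
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.7, 0.2, 0.1]])
p = ctc_prob(probs, [1])   # probability of the label sequence "a"
```

Because every path collapses to exactly one label sequence, the probabilities of all collapsed sequences sum to 1, which is a quick sanity check on any CTC implementation.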
Transducer
Reference paper: Sequence Transduction with Recurrent Neural Networks. arXiv 2012 (Alex Graves)
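The transducer (RNN-T) extends CTC by adding a prediction network over previous labels; a joint network then combines encoder output f_t and prediction output g_u into a distribution over labels plus blank at every point of a T×U lattice. Below is a minimal numpy sketch of that joint computation; the additive tanh combination and all shapes here are illustrative assumptions (one common parameterization), not a faithful reproduction of the paper's exact network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint(f_t, g_u, W):
    """Sketch of an RNN-T joint network: combine one encoder frame f_t
    with one prediction-network state g_u, then project to a
    distribution over vocabulary + blank (additive combination assumed)."""
    return softmax(W @ np.tanh(f_t + g_u))

rng = np.random.default_rng(0)
T, U, d, V = 4, 3, 8, 5                 # frames, label steps, hidden dim, vocab+blank
f = rng.normal(size=(T, d))             # encoder ("transcription network") outputs
g = rng.normal(size=(U, d))             # prediction-network outputs
W = rng.normal(size=(V, d))             # joint projection (hypothetical weights)
lattice = np.stack([[joint(f[t], g[u], W) for u in range(U)] for t in range(T)])
```

Training marginalizes over all monotonic paths through this T×U lattice with a forward-backward recursion, which is what lets the transducer stream frame by frame while still conditioning on label history.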
Data Augmentation
Reference paper: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. INTERSPEECH 2019 (Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le)
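SpecAugment operates directly on the log-mel spectrogram rather than on the raw waveform: it warps the time axis and zeroes out random frequency bands and time spans. A minimal numpy sketch of the two masking policies follows (time warping is omitted for brevity); the function name, default mask widths, and masking to zero rather than the mean are illustrative assumptions.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, F=10, num_time_masks=1, T=20, rng=None):
    """Sketch of SpecAugment's masking: zero out up to `num_freq_masks`
    random frequency bands (width <= F mel bins) and `num_time_masks`
    random time spans (width <= T frames) of a (n_mels, n_frames) array."""
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)               # mask width, may be 0
        f0 = rng.integers(0, n_mels - f + 1)     # mask start bin
        out[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out

# toy log-mel spectrogram: 80 mel bins x 200 frames
spec = np.abs(np.random.default_rng(1).normal(size=(80, 200)))
aug = spec_augment(spec, rng=np.random.default_rng(2))
```

Because the augmentation is applied to features on the fly, it needs no extra audio data or alignment, which is why it transfers so easily across LAS, CTC, and transducer models.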
Latency Optimization
Reference paper: Towards Fast and Accurate Streaming End-to-End ASR. ICASSP 2020 (Bo Li, Shuo-yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu)
Reference paper: On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition. INTERSPEECH 2020 (Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu)