Speech Synthesis from Articulatory Movements Recorded by Real-time MRI

Yuto Otani, Shun Sawada, Hidefumi Ohmura, Kouichi Katsurada

Research output: Conference article › Peer-reviewed

5 Citations (Scopus)

Abstract

Previous speech synthesis models based on articulatory movements recorded using real-time MRI (rtMRI) predicted only vocal tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model composed of a CNN-BiLSTM network that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes a speech waveform. We evaluated our model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. The experimental results on the ATR 503 sentences rtMRI database show that the PESQ score and the RMSE of F0 are 1.64 and 26.7 Hz, respectively. This demonstrates that all acoustic parameters, including the fundamental frequency, can be estimated from the rtMRI videos. In the experiment on the USC-TIMIT database, we obtained a good PESQ score and a good RMSE for F0. However, the synthesized speech is unclear, indicating that the quality of the dataset affects the intelligibility of the synthesized speech.
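The abstract describes the first stage only at a high level (rtMRI frames → mel-spectrogram via a CNN-BiLSTM). The following is a minimal sketch of what such a stage could look like in PyTorch; all layer sizes, names, and hyperparameters are illustrative assumptions and are not taken from the paper. The predicted mel-spectrogram would then be passed to a separately trained HiFi-GAN vocoder for waveform synthesis, as the abstract states.

```python
# Hedged sketch of stage 1: rtMRI video frames -> mel-spectrogram.
# All architectural details below are assumptions for illustration only.
import torch
import torch.nn as nn

class CNNBiLSTMMelPredictor(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Per-frame CNN encoder: each grayscale MRI frame -> feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),   # -> 32*4*4 = 512
        )
        # BiLSTM models temporal context across video frames.
        self.bilstm = nn.LSTM(512, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Project each time step onto the mel bins.
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 1, height, width)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.bilstm(feats)
        return self.proj(out)   # (batch, time, n_mels)

# Usage with dummy data: two clips of 50 rtMRI frames at 64x64 resolution.
model = CNNBiLSTMMelPredictor()
mel = model(torch.randn(2, 50, 1, 64, 64))   # -> (2, 50, 80)
```

In practice the predicted mel-spectrogram would be interpolated to the frame rate expected by the vocoder before synthesis; this resampling step is omitted here.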

Original language: English
Pages (from-to): 127-131
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
DOI
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 – 24 Aug 2023
