TY - JOUR
T1 - Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
AU - Otani, Yuto
AU - Sawada, Shun
AU - Ohmura, Hidefumi
AU - Katsurada, Kouichi
N1 - Publisher Copyright: © 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
AB - Previous speech synthesis models from articulatory movements recorded using real-time MRI (rtMRI) only predicted vocal tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model composed of a CNN-BiLSTM network that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes a speech waveform from that mel-spectrogram. We evaluated our model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. The experimental results on the ATR 503 sentences rtMRI database show that the PESQ score and the RMSE of F0 are 1.64 and 26.7 Hz, respectively. This demonstrates that all acoustic parameters, including fundamental frequency, can be estimated from the rtMRI videos. In the experiment on the USC-TIMIT database, we obtained a good PESQ score and RMSE for F0. However, the synthesized speech is unclear, indicating that the quality of the datasets affects the intelligibility of the synthesized speech.
KW - articulatory movement
KW - real-time MRI
KW - speech synthesis
KW - speech waveform generation
UR - http://www.scopus.com/inward/record.url?scp=85171571022&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-286
DO - 10.21437/Interspeech.2023-286
M3 - Conference article
AN - SCOPUS:85171571022
SN - 2308-457X
VL - 2023-August
SP - 127
EP - 131
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -