SBBD Paper Registration


Authors
1. Aline Gassenn (aline.gassenn@usp.br)
2. Luís de Andrade (gustavo.modelli@unesp.br)
3. Douglas Teodoro (douglas.teodoro@unige.ch)
4. José F. Rodrigues-Jr (junio@icmc.usp.br)


References
1 Ali, S. N. and Shuvo, S. B. (2021). Hospital ambient noise dataset.
2 Ali, S. N., Shuvo, S. B., Al-Manzo, M. I. S., Hasan, A., and Hasan, T. (2023). An end-to-end deep learning framework for real-time denoising of heart sounds for cardiac disease detection in unseen noise. IEEE Access, 11:87887–87901.
3 Arora, R. K., Wei, J., Hicks, R. S., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., and Singhal, K. (2025). HealthBench: Evaluating large language models towards improved human health.
4 Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pages 12449–12460. Curran Associates, Inc.
5 Banerjee, S., Agarwal, A., and Ghosh, P. (2024). High-precision medical speech recognition through synthetic data and semantic correction: United-MedASR. arXiv preprint arXiv:2412.00055.
6 Canopy AI (2025). canopyai/Orpheus-TTS: Towards human-sounding speech. GitHub repository.
7 Devatine, N. and Abraham, L. (2024). Assessing human editing effort on llm-generated texts via compression-based edit distance. arXiv preprint arXiv:2412.17321.
8 Gonçalves, Y. T., Alves, J. V. B., Sá, B. A. D., da Silva, L. N., de Macedo, J. A. F., and da Silva, T. L. C. (2024). Speech recognition models in assisting medical history. In Proceedings of the 39th Brazilian Symposium on Databases (SBBD), pages 485–497, Florianópolis, SC, Brazil.
9 Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
10 Le-Duc, K. (2024). VietMed: A dataset and benchmark for automatic speech recognition of Vietnamese in the medical domain. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N., editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17365–17370, Torino, Italia. ELRA and ICCL.
11 Le-Duc, K., Phan, P., Pham, T.-H., Tat, B. P., Ngo, M.-H., Nguyen-Tang, T., and Hy, T.-S. (2025). MultiMed: Multilingual medical speech recognition via attention encoder decoder. In Rehm, G. and Li, Y., editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 1113–1150, Vienna, Austria. Association for Computational Linguistics.
12 Lee, S.-H., Park, J., Yang, K., Min, J., and Choi, J. (2022). Accuracy of cloud-based speech recognition open application programming interface for medical terms of Korean. Journal of Korean Medical Science, 37(18).
13 Norvig, P. (2025). pyspellchecker: Pure Python spell checking library.
14 Nurfadhilah, E., Jarin, A., Ruslana Aini, L., Pebiana, S., Santosa, A., Teduh Uliniansyah, M., Butarbutar, E., Desiani, and Gunarso (2021). Evaluating the BPPT medical speech corpus for an ASR medical record transcription system. In 2021 9th International Conference on Information and Communication Technology (ICoICT), pages 657–661.
15 Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 28492–28518. PMLR.
16 Tang, C., Zhang, H., Loakman, T., Lin, C., and Guerin, F. (2023). Terminology-aware medical dialogue generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
17 Zeng, G., Yang, W., Ju, Z., Yang, Y., Wang, S., Zhang, R., Zhou, M., Zeng, J., Dong, X., Zhang, R., Fang, H., Zhu, P., Chen, S., and Xie, P. (2020). MedDialog: Large-scale medical dialogue datasets. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9241–9250, Online. Association for Computational Linguistics.