Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. Newsl., 6(1):20–29.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357.

Cunha, W., França, C., Fonseca, G., Rocha, L., & Gonçalves, M. A. (2023). An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 665–674, New York, NY, USA. Association for Computing Machinery.

Cunha, W., Moreo Fernández, A., Esuli, A., Sebastiani, F., Rocha, L., & Gonçalves, M. A. (2025). A noise-oriented and redundancy-aware instance selection framework. ACM Trans. Inf. Syst., 43(2).

Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Huang, D.-S., Zhang, X.-P., & Huang, G.-B., editors, Advances in Intelligent Computing, pages 878–887, Berlin, Heidelberg. Springer Berlin Heidelberg.

He, H. & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.

Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proc. of Machine Learning Research, pages 4411–4421. PMLR.

Last, F., Douzas, G., & Bação, F. (2017). Oversampling for imbalanced learning based on k-means and SMOTE. CoRR, abs/1711.00837.

McClure, J., Shimmei, M., Matsuda, N., & Jiang, S. (2024). Leveraging prompts in LLMs to overcome imbalances in complex educational text data.

Nguyen, H. M., Cooper, E. W., & Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigm., 3(1):4–21.

Souza, F., Nogueira, R., & Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23 (to appear).

Tabar, V. R., Eskandari, F., Salimi, S., & Zareifard, H. (2018). Finding a set of candidate parents using dependency criterion for the K2 algorithm. Pattern Recognition Letters, 111:23–29.

Taskiran, S. F., Turkoglu, B., Kaya, E., & Asuroglu, T. (2025). A comprehensive evaluation of oversampling techniques for enhancing text classification performance. Scientific Reports, 15:21631.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Yadav, V., Tang, Z., & Srinivasan, V. (2024). PAG-LLM: Paraphrase and aggregate with large language models for minimizing intent classification errors. In Proc. of the International ACM SIGIR Conference, SIGIR ’24, page 2569–2573.