From Classical to Colloquial: Leveraging LLMs for Sadhu–Cholit Register Identification in Bangla
- DOI
- 10.2991/978-94-6239-664-7_33
- Keywords
- Bangla NLP; Sadhu Bhasha; Cholit Bhasha; Register Classification; Text Classification; Deep Learning; LSTM; BiLSTM; BanglaBERT; Multilingual BERT (mBERT)
- Abstract
Bangla is a diglossic, morphologically rich language with two major registers: Sadhu Bhasha, the classical form, and Cholit Bhasha, the colloquial form. Identifying the register of a text can benefit downstream NLP applications such as translation, OCR, and speech synthesis. This study developed a balanced dataset of 7350 Sadhu and Cholit sentences. The dataset was preprocessed by tokenization, normalization, and padding, then split 80–20 for training and testing. Four deep learning models (LSTM, Bi-LSTM, BanglaBERT, and mBERT) were trained under identical settings, using the Adam optimizer with a batch size of 32 for 10 epochs. Experimental results indicated that while the sequential models performed reasonably well, the transformer models outperformed them substantially, with BanglaBERT attaining the highest accuracy of 95%. These results establish a benchmark for Sadhu–Cholit classification and underscore the importance of register sensitivity in Bangla NLP.
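The preprocessing and split described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' code. Whitespace tokenization, Unicode NFC normalization, the per-label (stratified) split, the random seed, the two toy sentences, and all function and variable names are assumptions. Only the 80–20 ratio, the Adam optimizer, the batch size of 32, and the 10 epochs come from the abstract.

```python
import random
import unicodedata

# Training settings stated in the abstract; the dictionary itself is illustrative.
TRAIN_CONFIG = {"optimizer": "adam", "batch_size": 32, "epochs": 10}

PAD_ID, UNK_ID = 0, 1


def normalize(text):
    """Apply Unicode NFC normalization and collapse whitespace (assumed steps)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def tokenize(text):
    """Whitespace tokenization, assumed here; the paper may use a different tokenizer."""
    return normalize(text).split()


def build_vocab(token_lists):
    """Map each distinct token to an integer id, reserving ids for padding and unknowns."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(token, UNK_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def stratified_split(samples, test_ratio=0.2, seed=42):
    """Split (text, label) pairs per label so both registers keep the same ratio."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = round(len(items) * test_ratio)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test


if __name__ == "__main__":
    # Toy data only: one Sadhu and one Cholit rendering of "they are going".
    samples = [("তাহারা যাইতেছে", "sadhu"), ("তারা যাচ্ছে", "cholit")] * 5
    train, test = stratified_split(samples)
    vocab = build_vocab(tokenize(text) for text, _ in train)
    padded = [encode_and_pad(tokenize(text), vocab, max_len=8) for text, _ in train]
    print(len(train), len(test), padded[0])
```

Splitting per label keeps the Sadhu and Cholit proportions equal in both partitions, which seems a natural fit for a balanced dataset, although the abstract does not say how the split was drawn.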
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Rasel Parvez
AU  - Md Anwar Hossain
AU  - Showrov Azam
AU  - A. K. M. Bahalul Haque
AU  - Sadman Sadik Khan
AU  - Sadekur Rahman
PY  - 2026
DA  - 2026/06/08
TI  - From Classical to Colloquial: Leveraging LLMs for Sadhu–Cholit Register Identification in Bangla
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 473
EP  - 486
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_33
DO  - 10.2991/978-94-6239-664-7_33
ID  - Parvez2026
ER  -