Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)

From Classical to Colloquial: Leveraging LLMs for Sadhu–Cholit Register Identification in Bangla

Authors
Rasel Parvez¹, Md Anwar Hossain², Showrov Azam¹, A. K. M. Bahalul Haque³, Sadman Sadik Khan¹,*, Sadekur Rahman¹
¹ Daffodil International University, Dhaka, Bangladesh
² Maharishi International University, Fairfield, Iowa, USA
³ Åbo Akademi University, Turku, Finland
*Corresponding author. Email: sadman15-13696@diu.edu.bd
Corresponding Author
Sadman Sadik Khan
Available Online 8 June 2026.
DOI
10.2991/978-94-6239-664-7_33
Keywords
Bangla NLP; Sadhu Bhasha; Cholit Bhasha; Register Classification; Text Classification; Deep Learning; LSTM; BiLSTM; BanglaBERT; Multilingual BERT (mBERT)
Abstract

Bangla is a diglossic, morphologically rich language with two major registers: Sadhu Bhasha, the classical form, and Cholit Bhasha, the colloquial form. Identifying the register of a text can benefit downstream NLP applications such as translation, OCR, and speech synthesis. This study developed a balanced dataset of 7350 Sadhu and Cholit sentences. The dataset was preprocessed by tokenization, normalization, and padding, then split 80–20 for training and testing. Four deep learning models (LSTM, Bi-LSTM, BanglaBERT, and mBERT) were trained under identical settings, using the Adam optimizer with a batch size of 32 for 10 epochs. The sequential models performed reasonably well, but the transformer models outperformed them substantially, with BanglaBERT attaining the highest accuracy of 95%. These results establish a benchmark for Sadhu–Cholit classification and underscore the importance of register sensitivity in Bangla NLP.
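
The preprocessing and split described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. Only the 80–20 split, the Adam optimizer, the batch size of 32, and the 10 epochs come from the abstract. The function names, the NFC normalization, the whitespace tokenizer, the stratified split, and the two toy sentences are hypothetical choices made for the example. The sketch applies to inputs for the sequential models; BanglaBERT and mBERT would normally use their own subword tokenizers.

```python
import random
import unicodedata

# Settings taken from the abstract; every other choice below is an assumption.
TRAIN_CONFIG = {"optimizer": "adam", "batch_size": 32, "epochs": 10, "test_fraction": 0.2}

PAD_ID, UNK_ID = 0, 1


def normalize(text):
    """NFC-normalize and collapse whitespace (an assumed normalization step)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def tokenize(text):
    """Whitespace tokenization; the paper's actual tokenizer is not specified."""
    return normalize(text).split()


def build_vocab(sentences):
    """Map each token seen in the given sentences to an integer id."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokenize(sentence)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def stratified_split(examples, test_fraction, seed=0):
    """Split (text, label) pairs so each label keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy data for illustration only: one Sadhu and one Cholit rendering of
# "they are going", repeated to give five sentences per register.
examples = [("তাহারা যাইতেছে", "sadhu"), ("তারা যাচ্ছে", "cholit")] * 5

train, test = stratified_split(examples, TRAIN_CONFIG["test_fraction"])
vocab = build_vocab(text for text, _ in train)  # vocabulary from training data only
padded = encode("তারা যাচ্ছে", vocab, max_len=4)
print(len(train), len(test), padded)
```

Building the vocabulary from the training portion only keeps test tokens from leaking into the model's input space; unseen test tokens fall back to the `<unk>` id.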

Copyright
© 2026 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


Volume Title
Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
Series
Advances in Intelligent Systems Research
Publication Date
8 June 2026
ISBN
978-94-6239-664-7
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Rasel Parvez
AU  - Md Anwar Hossain
AU  - Showrov Azam
AU  - A. K. M. Bahalul Haque
AU  - Sadman Sadik Khan
AU  - Sadekur Rahman
PY  - 2026
DA  - 2026/06/08
TI  - From Classical to Colloquial: Leveraging LLMs for Sadhu–Cholit Register Identification in Bangla
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 473
EP  - 486
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_33
DO  - 10.2991/978-94-6239-664-7_33
ID  - Parvez2026
ER  -