Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)

Shadhu-Cholito Detection Across Scripts: A Comprehensive Approach to Banglish and Bengali Register Classification

Authors
Rafsan Hasan Pronay1, *, Anupam Singha1, 2, *, Kingkar Prosad Ghosh1, 3, *
1Department of Computer Science and Engineering, R. P. Shaha University, Narayanganj, 1400, Bangladesh
2Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India
3Department of Computer Science and Engineering, Volgograd State Technical University, Volgograd, Russia
*Corresponding author. Email: rafsanhasanpronoy00@gmail.com
*Corresponding author. Email: anupumeos@gmail.com
*Corresponding author. Email: kingkar@rpsu.edu.bd
Corresponding Authors
Rafsan Hasan Pronay, Anupam Singha, Kingkar Prosad Ghosh
Available Online 8 June 2026.
DOI
10.2991/978-94-6239-664-7_34
Keywords
Banglish; Bengali language registers; Shadhu-Cholito classification; Cross-script NLP; Ensemble methods; Low-resource languages
Abstract

The increasing use of Banglish, a code-mixed variety of Bangla written in Roman script, presents significant challenges for NLP. This paper presents the first cross-script framework for identifying Bengali’s two main language registers, Shadhu and Cholito, across both Bangla and Banglish text. A balanced benchmark dataset is developed, and a wide range of models is evaluated, including transformer architectures (MuRIL, XLM-RoBERTa, mBERT, and DistilBERT), a Bi-LSTM network, and traditional machine learning classifiers. Experiments show that MuRIL achieves the best performance on Bangla, with 95.92% accuracy and a macro F1-score of 0.9591, while mBERT performs best on Banglish, with 85.73% accuracy and a macro F1-score of 0.8573. On the combined four-class corpus, XLM-RoBERTa obtains the highest overall accuracy of 90.08% and a macro F1-score of 0.9001. Two ensemble methods, weighted soft voting and hard voting, further improve robustness on transliterated and code-mixed data, reaching 89.87% and 89.16% accuracy, respectively. These results set a strong benchmark for cross-script Shadhu-Cholito classification and form a basis for future applications, including sentiment analysis and machine translation in low-resource and mixed-script environments.
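
The two ensemble strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the model weights or any per-model outputs, so the function names, the probabilities, and the weights below are all hypothetical and chosen only to show how the two rules can disagree.

```python
from collections import Counter

# Hypothetical label order for a two-class register task.
LABELS = ["shadhu", "cholito"]


def weighted_soft_vote(probs, weights):
    """Average per-model class probabilities using per-model weights,
    then return the index of the class with the highest average."""
    total = sum(weights)
    n_classes = len(probs[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probs, weights)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averaged.__getitem__)


def hard_vote(probs):
    """Let each model cast one vote for its own top class and return the
    majority class (ties go to the class that was voted for first)."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return Counter(votes).most_common(1)[0][0]


# Made-up outputs of three models for one sentence (not from the paper).
example_probs = [
    [0.90, 0.10],  # model A is confident in "shadhu"
    [0.40, 0.60],  # model B leans towards "cholito"
    [0.45, 0.55],  # model C leans towards "cholito"
]
equal_weights = [1.0, 1.0, 1.0]  # assumed; the paper's weights are not stated here

print(LABELS[weighted_soft_vote(example_probs, equal_weights)])
print(LABELS[hard_vote(example_probs)])
```

In this made-up case soft voting selects "shadhu", because one confident model outweighs two uncertain ones, while hard voting selects "cholito" by two votes to one. Differences of this kind are one plausible reason the two ensembles report different accuracies.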

Copyright
© 2026 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
Series
Advances in Intelligent Systems Research
Publication Date
8 June 2026
ISBN
978-94-6239-664-7
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Rafsan Hasan Pronay
AU  - Anupam Singha
AU  - Kingkar Prosad Ghosh
PY  - 2026
DA  - 2026/06/08
TI  - Shadhu-Cholito Detection Across Scripts: A Comprehensive Approach to Banglish and Bengali Register Classification
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 487
EP  - 503
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_34
DO  - 10.2991/978-94-6239-664-7_34
ID  - Pronay2026
ER  -