Shadhu-Cholito Detection Across Scripts: A Comprehensive Approach to Banglish and Bengali Register Classification
- DOI
- 10.2991/978-94-6239-664-7_34
- Keywords
- Banglish; Bengali language registers; Shadhu-Cholito classification; Cross-script NLP; Ensemble methods; Low-resource languages
- Abstract
The increasing use of Banglish, a code-mixed variety of Bangla written in Roman script, presents significant challenges for NLP. This paper presents the first cross-script framework for identifying Bengali’s two main language registers, Shadhu and Cholito, across both Bangla and Banglish text. A balanced benchmark dataset is developed, and a wide range of models is evaluated: the transformer architectures MuRIL, XLM-RoBERTa, mBERT, and DistilBERT; a Bi-LSTM network; and traditional machine learning classifiers. Experiments show that MuRIL performs best on Bangla, with 95.92% accuracy and a macro F1-score of 0.9591, while mBERT performs best on Banglish, with 85.73% accuracy and a macro F1-score of 0.8573. On the combined four-class corpus, XLM-RoBERTa obtains the highest overall accuracy, 90.08%, with a macro F1-score of 0.9001. Two ensemble methods, weighted soft voting and hard voting, further improve robustness on transliterated and code-mixed data, reaching 89.87% and 89.16% accuracy, respectively. These results set a strong benchmark for cross-script Shadhu-Cholito classification and form a basis for future applications, including sentiment analysis and machine translation in low-resource and mixed-script environments.
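The two ensemble strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label names, the per-model probabilities, and the weights below are all hypothetical values chosen only to show how hard voting and weighted soft voting can disagree on the same input.

```python
from collections import Counter

# Hypothetical names for the four classes of the combined corpus.
LABELS = ["bangla_shadhu", "bangla_cholito", "banglish_shadhu", "banglish_cholito"]


def hard_vote(model_probs):
    """Each model votes for its most probable class; the majority wins.

    Ties are broken in favour of the lowest class index (an arbitrary choice).
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in model_probs]
    counts = Counter(votes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


def weighted_soft_vote(model_probs, weights):
    """Average the class probabilities with per-model weights, then take the argmax."""
    total = sum(weights)
    n_classes = len(model_probs[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg


# Made-up class probabilities from three models for one sentence.
probs = [
    [0.10, 0.20, 0.30, 0.40],  # model A leans towards class 3
    [0.05, 0.15, 0.35, 0.45],  # model B leans towards class 3
    [0.02, 0.03, 0.90, 0.05],  # model C is very confident in class 2
]
weights = [1.0, 1.0, 2.0]  # assumed weights, e.g. derived from validation accuracy

hard_label = LABELS[hard_vote(probs)]
soft_index, soft_probs = weighted_soft_vote(probs, weights)
soft_label = LABELS[soft_index]
print(hard_label, soft_label)
```

In this toy case hard voting follows the two-model majority, while weighted soft voting follows the single confident, more heavily weighted model. That difference is the usual reason for reporting both strategies.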
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Rafsan Hasan Pronay
AU  - Anupam Singha
AU  - Kingkar Prosad Ghosh
PY  - 2026
DA  - 2026/06/08
TI  - Shadhu-Cholito Detection Across Scripts: A Comprehensive Approach to Banglish and Bengali Register Classification
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 487
EP  - 503
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_34
DO  - 10.2991/978-94-6239-664-7_34
ID  - Pronay2026
ER  -