Benchmarking Data Science Prowess in LLMs: A Holistic Evaluation Framework
- DOI
- 10.2991/978-94-6463-940-7_13
- Keywords
- Large Language Models; Data Science Evaluation; Benchmarking; Task-Function-Code; Frequency Analysis
- Abstract
This paper introduces a new benchmarking framework, “DataBench360,” designed to evaluate the abilities of Large Language Models (LLMs) to solve practical data science problems. Unlike earlier benchmarks that rely on narrow or simplified measures, DataBench360 uses a structured process to build reliable ground truths, check outputs against explicit validation rules, and measure performance across six key data science areas. The framework applies a simple Task-Function-Code (TFC) breakdown that makes evaluations more transparent and reproducible. Testing 23 models, both API-based and open-source, reveals clear differences in performance, with API-based systems handling complex tasks more effectively. The novelty of this work lies in providing a broader, multi-dimensional evaluation than existing single-metric benchmarks, making DataBench360 a valuable tool for advancing AI in real-world data science applications.
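To make the Task-Function-Code idea concrete, the sketch below shows one way such a decomposition and rule-based validation could be represented. It is a hypothetical illustration only, not the paper's implementation: the names TFCRecord and validate, and the sample frequency-analysis task, are all assumptions.

```python
# Hypothetical sketch of a Task-Function-Code (TFC) evaluation record.
# NOT the paper's implementation: TFCRecord, validate(), and the sample
# validation rule are illustrative assumptions only.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TFCRecord:
    task: str       # natural-language data science task
    function: str   # canonical operation the task maps to
    code: str       # code the model produced for the task
    expected: Any   # ground-truth result used for validation

def validate(record: TFCRecord, rule: Callable[[Any, Any], bool]) -> bool:
    """Run the model's code and check its output against the ground truth."""
    scope: dict[str, Any] = {}
    exec(record.code, scope)  # execute the model's code in an isolated scope
    return rule(scope.get("result"), record.expected)

# Example: a frequency-analysis task scored by exact match of value counts.
record = TFCRecord(
    task="Count how often each label appears in the column.",
    function="frequency_analysis",
    code="result = {x: ['a', 'b', 'a'].count(x) for x in {'a', 'b'}}",
    expected={"a": 2, "b": 1},
)
print(validate(record, rule=lambda got, want: got == want))  # True
```

Separating the task, the canonical function, and the generated code in this way lets each validation rule target a specific operation, which is what would make pass/fail judgments transparent and reproducible across models.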
- Copyright
- © 2025 The Author(s)
- Open Access
This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Maddi, Mithilesh Reddy
PY  - 2025
DA  - 2025/12/31
TI  - Benchmarking Data Science Prowess in LLMs: A Holistic Evaluation Framework
BT  - Proceedings of the Conference on Social and Sustainable Innovation in Technology & Engineering (SASI-ITE 2025)
PB  - Atlantis Press
SP  - 174
EP  - 181
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6463-940-7_13
DO  - 10.2991/978-94-6463-940-7_13
ID  - Maddi2025
ER  -