# Alias for Data Science
An autonomous agent that runs your entire data science workflow.
## Overview
**Alias-DataScience** is an **autonomous**, **ready-to-use** intelligent assistant for real-world **data science workflows**. It turns high-level analytical questions into executable plans that cover data acquisition, cleaning, modeling, visualization, and narrative reporting with minimal human intervention.
## ✨ Key Features
### 🔍 Scalable File Filtering
To handle the massive data files commonly found in enterprise data lakes, Alias-DataScience combines parallelized grep operations with Retrieval-Augmented Generation (RAG) to build a low-latency, high-throughput file filtering pipeline. This preprocessing step identifies the files relevant to a task before analysis begins, which extends the agent to datasets too large to load or inspect in full.
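As a rough illustration of the grep-like stage only, the sketch below scans files in parallel and keeps those with at least one keyword hit. It is not the project's implementation: the function names `count_matches` and `filter_files`, the `top_k` and `workers` parameters, and ranking by raw match count are all assumptions, and the RAG stage is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import re


def count_matches(path: Path, pattern: re.Pattern) -> int:
    """Count the lines in one file that match the pattern (a grep-like scan)."""
    hits = 0
    with path.open("r", encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            if pattern.search(line):
                hits += 1
    return hits


def filter_files(paths, keywords, top_k=3, workers=8):
    """Scan files in parallel and return up to top_k paths, best match first.

    Ranking by match count is a stand-in for a real retrieval step.
    """
    paths = list(paths)
    # One case-insensitive alternation over all (escaped) keywords.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(lambda p: count_matches(p, pattern), paths))
    ranked = sorted(zip(paths, counts), key=lambda pc: pc[1], reverse=True)
    # Drop files with no hits at all.
    return [path for path, count in ranked[:top_k] if count > 0]
```

Threads suit this sketch because the work is dominated by file I/O; a CPU-bound scan over very large files would more likely use processes or the system `grep` itself.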
### 🧠 Context-Aware Prompt Engineering
Rather than relying on generic instructions, Alias-DataScience employs three specialized prompt templates, each tailored to a dominant data science workflow:
- **Exploratory Data Analysis (EDA)**: Surfaces trends, anomalies, and relationships to answer "what's happening?" and "why?"
- **Predictive Modeling**: Automates feature engineering, model selection, and optimization.
- **Exact Data Computation**: Delivers precise, auditable answers to quantitative queries (e.g., "What was the YoY revenue growth in Q3?").
An intelligent **prompt selector** routes tasks to the best template based on user intent.
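The routing idea can be sketched with a simple keyword scorer. This is only an illustration under stated assumptions: the `Workflow` enum, the `CUES` table, and `select_workflow` are hypothetical names, and the actual selector may well rely on an LLM or a trained classifier rather than cue words.

```python
from enum import Enum


class Workflow(Enum):
    EDA = "exploratory_data_analysis"
    MODELING = "predictive_modeling"
    COMPUTATION = "exact_data_computation"


# Hypothetical cue phrases per workflow; chosen for illustration only.
CUES = {
    Workflow.MODELING: ("predict", "forecast", "classify", "train"),
    Workflow.COMPUTATION: ("how many", "what was", "total", "average", "growth"),
    Workflow.EDA: ("why", "trend", "explore", "anomal", "relationship"),
}


def select_workflow(question: str) -> Workflow:
    """Pick the workflow with the most cue phrases present; default to EDA."""
    text = question.lower()
    scores = {wf: sum(cue in text for cue in cues) for wf, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else Workflow.EDA


if __name__ == "__main__":
    print(select_workflow("What was the YoY revenue growth in Q3?"))
```

The selected workflow would then pick which of the three prompt templates is sent to the model.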
### 📊 Handling of Messy Tabular Data
Alias-DataScience parses irregular spreadsheets (merged cells, embedded notes, multi-level headers) and converts them into structured tables. For large files, it outputs a semantic-preserving JSON representation, enabling reliable analysis of human-crafted inputs.
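A minimal sketch of this kind of normalization, assuming the sheet has already been read into a list of rows where merged or empty cells appear as `None`. The helpers `flatten_headers` and `to_records`, the `" / "` separator, and the rule that near-empty rows are embedded notes are illustrative assumptions, not the project's actual parser.

```python
import json


def flatten_headers(header_rows):
    """Forward-fill merged cells on upper header levels, then join the levels."""
    filled = []
    for level, row in enumerate(header_rows):
        is_last = level == len(header_rows) - 1
        last, out = None, []
        for cell in row:
            if cell not in (None, ""):
                last = cell
            # Only upper levels span several columns; keep the last level as-is.
            out.append(cell if is_last else last)
        filled.append(out)
    return [
        " / ".join(str(part) for part in parts if part not in (None, ""))
        for parts in zip(*filled)
    ]


def to_records(rows, n_header_rows=2):
    """Convert raw rows into a list of dicts keyed by flattened headers."""
    headers = flatten_headers(rows[:n_header_rows])
    records = []
    for row in rows[n_header_rows:]:
        non_empty = [cell for cell in row if cell not in (None, "")]
        if len(non_empty) <= 1:
            continue  # Assumption: a near-empty row is an embedded note.
        records.append(dict(zip(headers, row)))
    return records


if __name__ == "__main__":
    sheet = [
        ["Region", "2023", None, "2024", None],
        [None, "Q1", "Q2", "Q1", "Q2"],
        ["North", 10, 12, 14, 15],
        ["Note: figures unaudited", None, None, None, None],
        ["South", 7, 9, 8, 11],
    ]
    print(json.dumps(to_records(sheet), indent=2))
```

Joining header levels into a single key keeps each value's full context (for example year and quarter) in the JSON output, which is one way to preserve the meaning of a multi-level header.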
### 👁️ Multimodal Understanding of Visual Content
- **Image Understanding**: Interprets charts, diagrams, and general images to extract numerical data, trends, and domain-specific entities.
- **Visual QA**: Answers natural-language questions about visual elements (e.g., "What was the peak value in Q3?").
### 📑 Automated Reporting
For EDA tasks, Alias-DataScience generates an interactive HTML report featuring:
- Actionable insights backed by statistics and visuals,
- Executable code snippets for transparency and reuse.
This bridges the gap between data scientists and stakeholders like business users or auditors.
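To make the report structure concrete, here is a small sketch that renders insights, their supporting statistics, and the code behind them into one HTML page. The `render_report` function and the `text`, `stat`, and `code` keys are hypothetical; the real report format is not described here, and the collapsible `<details>` element is just one simple way to add interactivity.

```python
from html import escape


def render_report(title, insights):
    """Build a minimal HTML report from dicts with 'text', 'stat', 'code' keys."""
    sections = []
    for item in insights:
        sections.append(
            "<section>"
            + "<h2>" + escape(item["text"]) + "</h2>"
            + "<p>" + escape(item["stat"]) + "</p>"
            # The code snippet is collapsed by default and expands on click.
            + "<details><summary>Code</summary><pre><code>"
            + escape(item["code"])
            + "</code></pre></details>"
            + "</section>"
        )
    return (
        "<!DOCTYPE html>\n"
        + '<html><head><meta charset="utf-8"><title>'
        + escape(title)
        + "</title></head>\n"
        + "<body><h1>" + escape(title) + "</h1>\n"
        + "\n".join(sections)
        + "\n</body></html>"
    )


if __name__ == "__main__":
    page = render_report(
        "Sales EDA",
        [{"text": "Sales peak in Q3", "stat": "mean < 5", "code": "df.describe()"}],
    )
    print(page)
```

Escaping every field with `html.escape` keeps statistics or code that contain `<` or `&` from breaking the page.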
## 📈 Benchmark Performance
Alias-DataScience achieves **state-of-the-art (SOTA)** results across major data science agent benchmarks.
### [DSBench](https://github.com/LiqiangJing/DSBench)
*Realistic tasks from ModelOff & Kaggle; includes multimodal inputs, multi-source data, and large-scale modeling.*

| Task Category | Framework | Model | Score |
|---|---|---|---|
| Data Analysis | Alias-DataScience | Qwen3-max-Preview | 55.58% 🏆 |
| | AutoGen | GPT-4 | 30.69% |
| | AutoGen | GPT-4o | 34.12% |
| | CodeInterpreter | GPT-4 | 26.39% |
| | CodeInterpreter | GPT-4o | 23.82% |
| Data Modeling | Alias-DataScience | Qwen3-max-Preview | 49.70% 🏆 |
| | AutoGen | GPT-4 | 45.52% |
| | AutoGen | GPT-4o | 34.74% |
| | CodeInterpreter | GPT-4 | 26.14% |
| | CodeInterpreter | GPT-4o | 16.90% |

---
### [InsightBench](https://insightbench.github.io/)
*Open-ended comprehensive analytical tasks.*

| Framework | Model | Score |
|---|---|---|
| Alias-DataScience | Qwen3-max-Preview | 43.29% 🏆 |
| AgentPoirot | Qwen3-max-Preview | 39.30% |

---
### [DABench](https://github.com/InfiAgent/InfiAgent)
*End-to-end data analysis from real-world CSVs.*

| Framework | Model | Score |
|---|---|---|
| Alias-DataScience | Qwen3-max-Preview | 95.20% 🏆 |
| AutoGen | GPT-4 | 71.49% |
| Data Interpreter | GPT-4 | 73.55% |
| Data Interpreter | GPT-4o | 94.93% |

Some figures in these tables come from published sources and are cited in good faith, with gratitude to the original authors. For accuracy, please refer to the original publications.
## 🎯 Use Cases
### 1. Machine Learning
### 2. Exact Data Computation
### 3. Exploratory Data Analysis