# Alias for Data Science

An autonomous agent that runs your entire data science workflow.

## Overview

**Alias-DataScience** is an **autonomous**, **ready-to-use** intelligent assistant for real-world **data science workflows**. It transforms high-level analytical questions into executable plans that handle data acquisition, cleaning, modeling, visualization, and narrative reporting with minimal human intervention.

## ✨ Key Features

### 🔍 Scalable File Filtering

To handle the massive data files commonly found in enterprise data lakes, Alias-DataScience combines parallelized grep operations with Retrieval-Augmented Generation (RAG) to build a low-latency, high-throughput file filtering pipeline. This preprocessing step enables accurate identification of relevant files, which significantly broadens the range of datasets the agent can work with.

### 🧠 Context-Aware Prompt Engineering

Rather than relying on generic instructions, Alias-DataScience employs three specialized prompt templates, each tailored to a dominant data science workflow:

- **Exploratory Data Analysis (EDA)**: Surfaces trends, anomalies, and relationships to answer "what's happening?" and "why?"
- **Predictive Modeling**: Automates feature engineering, model selection, and optimization.
- **Exact Data Computation**: Delivers precise, auditable answers to quantitative queries (e.g., "What was the YoY revenue growth in Q3?").

An intelligent **prompt selector** routes each task to the best template based on user intent.

### 📊 Handling of Messy Tabular Data

Alias-DataScience parses irregular spreadsheets (merged cells, embedded notes, multi-level headers) and converts them into structured tables. For large files, it outputs a semantics-preserving JSON representation, enabling reliable analysis of human-crafted inputs.
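To make the spreadsheet normalization above concrete, here is a minimal standard-library sketch. It is not Alias-DataScience's actual parser. It assumes the sheet has already been read into a grid of rows in which merged cells appear as `None` after their first cell, and that a row with at most one populated cell is an embedded note. The names `flatten_headers`, `to_records`, and `GRID`, along with the sample figures, are hypothetical and exist only for illustration.

```python
import json


def flatten_headers(header_rows):
    """Combine multi-level header rows into one name per column.

    Assumption: a horizontally merged header cell shows up as None after
    its first cell, so every level except the last is forward-filled.
    """
    filled = []
    for level, row in enumerate(header_rows):
        is_last_level = level == len(header_rows) - 1
        last = None
        out = []
        for cell in row:
            if cell is not None:
                last = cell
            out.append(cell if is_last_level else last)
        filled.append(out)

    names = []
    for parts in zip(*filled):
        # Skip empty levels and repeated labels, then join the rest.
        labels = []
        for part in parts:
            if part is not None and str(part) not in labels:
                labels.append(str(part))
        names.append("_".join(labels))
    return names


def to_records(grid, n_header_rows=2):
    """Turn a raw grid into a list of dicts keyed by flattened headers."""
    names = flatten_headers(grid[:n_header_rows])
    records = []
    for row in grid[n_header_rows:]:
        populated = [cell for cell in row if cell is not None]
        if len(populated) <= 1:
            continue  # heuristic: a near-empty row is an embedded note
        records.append(dict(zip(names, row)))
    return records


# Hypothetical sheet: two header levels, one embedded note row.
GRID = [
    ["Region", "2023", None, "2024", None],
    [None, "Q1", "Q2", "Q1", "Q2"],
    ["North", 10, 12, 14, 15],
    ["Note: figures in thousands", None, None, None, None],
    ["South", 7, 9, 8, 11],
]

records = to_records(GRID)
print(json.dumps(records, indent=2))
```

Real spreadsheets need more than this (vertical merges inside the data area, notes placed in side columns, headers of varying depth), so treat the heuristics here as placeholders for whatever rules fit your files.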
### 👁️ Multimodal Understanding of Visual Content

- **Image Understanding**: Interprets charts, diagrams, and general images to extract numerical data, trends, and domain-specific entities.
- **Visual QA**: Answers natural-language questions about visual elements (e.g., "What was the peak value in Q3?").

### 📑 Automated Reporting

For EDA tasks, Alias-DataScience generates an interactive HTML report featuring:

- Actionable insights backed by statistics and visuals.
- Executable code snippets for transparency and reuse.

This bridges the gap between data scientists and stakeholders such as business users or auditors.

## 📈 Benchmark Performance

Alias-DataScience achieves **state-of-the-art (SOTA)** results across major data science agent benchmarks.

### [DSBench](https://github.com/LiqiangJing/DSBench)

*Realistic tasks from ModelOff & Kaggle; includes multimodal inputs, multi-source data, and large-scale modeling.*

| Task Category | Framework | Model | Score |
| --- | --- | --- | --- |
| Data Analysis | Alias-DataScience | Qwen3-max-Preview | 55.58% 🏆 |
| | AutoGen | GPT-4 | 30.69% |
| | AutoGen | GPT-4o | 34.12% |
| | CodeInterpreter | GPT-4 | 26.39% |
| | CodeInterpreter | GPT-4o | 23.82% |
| Data Modeling | Alias-DataScience | Qwen3-max-Preview | 49.70% 🏆 |
| | AutoGen | GPT-4 | 45.52% |
| | AutoGen | GPT-4o | 34.74% |
| | CodeInterpreter | GPT-4 | 26.14% |
| | CodeInterpreter | GPT-4o | 16.90% |

---

### [InsightBench](https://insightbench.github.io/)

*Open-ended comprehensive analytical tasks.*

| Framework | Model | Score |
| --- | --- | --- |
| Alias-DataScience | Qwen3-max-Preview | 43.29% 🏆 |
| AgentPoirot | Qwen3-max-Preview | 39.30% |

---

### [DABench](https://github.com/InfiAgent/InfiAgent)

*End-to-end data analysis from real-world CSVs.*

| Framework | Model | Score |
| --- | --- | --- |
| Alias-DataScience | Qwen3-max-Preview | 95.20% 🏆 |
| AutoGen | GPT-4 | 71.49% |
| Data Interpreter | GPT-4 | 73.55% |
| Data Interpreter | GPT-4o | 94.93% |

Some tables include data from published sources, used with gratitude to the original authors and cited in good faith. For accuracy, please refer to the original publications.

## 🎯 Use Cases

### 1. Machine Learning

### 2. Exact Data Computation
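
As a small illustration of the kind of query this use case covers, the sketch below computes year-over-year growth for a single quarter. It does not show Alias-DataScience's own output. The revenue figures are invented, the names `yoy_growth` and `revenue` are hypothetical, and `Decimal` is used here only as one way to keep the arithmetic exact and easy to audit.

```python
from decimal import Decimal


def yoy_growth(current, previous):
    """Return year-over-year growth as a fraction of the previous value."""
    current, previous = Decimal(current), Decimal(previous)
    if previous == 0:
        raise ValueError("previous-period value must be non-zero")
    return (current - previous) / previous


# Hypothetical Q3 revenue figures, invented for this example.
revenue = {
    ("2023", "Q3"): 1_200_000,
    ("2024", "Q3"): 1_380_000,
}

growth = yoy_growth(revenue[("2024", "Q3")], revenue[("2023", "Q3")])
print(f"Q3 YoY revenue growth: {growth:.2%}")
```

Keeping the formula in a named function with explicit inputs makes each reported number traceable to the values it was computed from.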

### 3. Exploratory Data Analysis