Add Math Agent (Quick Start for AgentScope Tuner) (#102)

This commit is contained in:
Xuchen Pan
2026-01-15 17:41:26 +08:00
committed by GitHub
parent 2bdefc9126
commit d896703580
10 changed files with 995 additions and 0 deletions


@@ -67,6 +67,13 @@ This is a repository that **brings together a variety of ready-to-run Python age
│ └── ace_bench/ # Benchmarks and evaluation tools
├── data_juicer_agent/ # Data processing multi-agent system
├── tuner/ # Tune AgentScope applications using AgentScope Tuner
│ ├── math_agent/ # A quick start example for tuning
│ ├── frozen_lake/ # Teach an agent to play a game requiring multiple steps
│   ├── learn_to_ask/ # Use LLM-as-a-judge to facilitate agent tuning
│ ├── email_search/ # Enhance the tool use ability of your agent
│ ├── werewolf_game/ # Enhance a multi-agent application
│ └── data_augment/ # Data augmentation for tuning
├── sample_template/ # Template for new sample contributions
└── README.md
```


@@ -67,6 +67,13 @@
│   └── ace_bench/ # Benchmarks and evaluation tools
├── data_juicer_agent/ # Data processing multi-agent system
├── tuner/ # Tune AgentScope applications using AgentScope Tuner
│   ├── math_agent/ # A quick-start tuning example
│   ├── frozen_lake/ # Teach an agent to play a game requiring multiple steps
│   ├── learn_to_ask/ # Use LLM-as-a-judge to facilitate agent tuning
│   ├── email_search/ # Enhance the tool-use ability of your agent
│   ├── werewolf_game/ # Enhance a multi-agent application
│   └── data_augment/ # Data augmentation for tuning
├── sample_template/ # Template for new sample contributions
└── README.md
```

27
tuner/README.md Normal file

@@ -0,0 +1,27 @@
# AgentScope Tuner
This directory contains several examples of how to use the AgentScope Tuner for tuning AgentScope applications. The table below summarizes the available examples:
| Example Name | Description | Example Path | Multi-step Interaction | LLM-as-a-Judge | Tool-use | Multi-Agent | Data Augmentation |
|-------------------|------------------------------------------------------------------------------------|---------------------------------|-------------------------|-----------------|----------|-------------|-------------------|
| Math Agent | A quick start example for tuning a math-solving agent to enhance its capabilities. | [math_agent](./math_agent) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Frozen Lake | Train an agent to navigate the Frozen Lake environment through multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Learn to Ask | Use an LLM as a judge to provide feedback that facilitates agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
| Email Search | Enhance the tool use ability of your agent on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |
Each example contains a README file with detailed instructions on how to set up and run the tuning process for that specific scenario. Feel free to explore and modify the examples to suit your needs!
## Prerequisites
AgentScope Tuner requires:
- Python 3.10 or higher
- `agentscope>=1.0.12`
- `trinity-rft>=0.4.1`
AgentScope Tuner is built on top of [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
Please refer to the [Trinity-RFT installation guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)
for detailed instructions on how to set up the environment.
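For a quick setup, the dependencies can also be installed from the pinned requirements in this directory — a minimal sketch, assuming a fresh Python ≥3.10 environment (the installation guide above recommends installing Trinity-RFT from source for the latest features):
```bash
# install the two required packages directly...
pip install "agentscope[full]>=1.0.12" "trinity-rft>=0.4.1"
# ...or, equivalently, from the pinned requirements file in this directory
pip install -r requirements.txt
```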

26
tuner/README_zh.md Normal file

@@ -0,0 +1,26 @@
# AgentScope Tuner (Chinese documentation)
This directory contains several examples of using AgentScope Tuner to tune AgentScope applications. The table below summarizes the available examples:
| Example Name | Description | Example Path | Multi-step Interaction | LLM-as-a-Judge | Tool-use | Multi-Agent | Data Augmentation |
|------------------|-------------------------------------------|---------------------------------|----------|----------|----------|----------|----------|
| Math Agent | A quick-start example that tunes a math agent to enhance its capabilities. | [math_agent](./math_agent) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Frozen Lake | Train an agent to navigate the Frozen Lake environment through multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Learn to Ask | Use an LLM as a judge to provide feedback for agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
| Email Search | Enhance the agent's tool-use ability on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |
Each example directory contains a detailed README describing the tuning procedure and usage for that scenario. Feel free to explore and adapt the examples to your needs!
## Prerequisites
AgentScope Tuner requires:
- Python 3.10 or higher
- `agentscope>=1.0.12`
- `trinity-rft>=0.4.1`
AgentScope Tuner is built on top of [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
Please refer to the [Trinity-RFT installation guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_installation.html)
for detailed installation instructions.

375
tuner/math_agent/README.md Normal file

@@ -0,0 +1,375 @@
# Math Agent (Quick Start for AgentScope Tuner)
AgentScope provides a `tuner` sub-module to train agent workflows using reinforcement learning (RL).
This guide walks you through the steps to implement and train an agent workflow using RL with AgentScope Tuner.
## Overview
To train your agent workflow using RL, you need to understand three components:
1. **Workflow function**: Refactor your agent application into a workflow function that follows the specified input/output signature.
2. **Judge function**: Implement a judge function that computes rewards based on the agent's responses.
3. **Task dataset**: Prepare a dataset containing training samples for the agent to learn from.
The following diagram illustrates the relationship between these components:
```mermaid
flowchart TD
Model[Model] --> WorkflowFunction[Workflow Function]
WorkflowFunction --> JudgeFunction[Judge Function]
Task[Task] --> WorkflowFunction
Task[Task] --> JudgeFunction
JudgeFunction --> Reward[Reward]
classDef wfcolor fill:#e67e22,stroke:#333,color:#111;
classDef judgecolor fill:#1abc9c,stroke:#333,color:#111,stroke-dasharray: 5 5;
classDef taskcolor fill:#3498db,stroke:#333,color:#111;
class WorkflowFunction wfcolor;
class JudgeFunction judgecolor;
class Task taskcolor;
```
The workflow function takes a chat model and a task from the dataset as input and produces the agent's response.
The judge function takes the same task and the agent's response as input, and computes a scalar reward.
The judge function is optional; if it is not provided, the workflow function can output the reward directly.
## How to implement
Here we use a math problem solving scenario as an example to illustrate how to implement the above three components.
Suppose you have an agent workflow that solves math problems using the `ReActAgent`.
```python
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg


async def run_react_agent(query: str):
    model = OpenAIChatModel(
        # your model config here...
    )
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem solving agent.",
        model=model,
        enable_meta_tool=True,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", query, role="user"),
    )
    print(response)
```
### Step 1: Prepare task dataset
To train the agent to solve math problems, you need a training dataset that contains math problems and their corresponding ground-truth answers.
The dataset should be organized in the Hugging Face [datasets](https://huggingface.co/docs/datasets/quickstart) format and be loadable with the `datasets.load_dataset` function. For example:
```
my_dataset/
├── train.jsonl # samples for training
└── test.jsonl # samples for evaluation
```
Suppose your `train.jsonl` contains samples like:
```json
{"question": "What is 2 + 2?", "answer": "4"}
{"question": "What is 4 + 4?", "answer": "8"}
```
Note that the task sample format can vary based on your specific scenario. The key point is that each sample should contain the necessary information for the agent to complete the task and for judging the quality of the response.
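As a quick sanity check, such a layout can also be loaded directly with the `datasets` library — a minimal sketch, assuming the `my_dataset/` layout shown above:
```python
from datasets import load_dataset

# load the JSONL splits laid out as my_dataset/{train,test}.jsonl
ds = load_dataset(
    "json",
    data_files={"train": "my_dataset/train.jsonl", "test": "my_dataset/test.jsonl"},
)
print(ds["train"][0])  # {'question': 'What is 2 + 2?', 'answer': '4'}
```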
You can preview your dataset using the following code:
```python
from agentscope.tuner import DatasetConfig

DatasetConfig(path="my_dataset", split="train").preview()
# Output:
# [
#     {
#         "question": "What is 2 + 2?",
#         "answer": "4"
#     },
#     {
#         "question": "What is 4 + 4?",
#         "answer": "8"
#     }
# ]
```
### Step 2: Define a workflow function
To train an agent workflow using RL, you need to refactor your agent application into a function with the following signature.
```python
async def workflow_function(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Optional[Dict[str, OpenAIChatModel]] = None,
) -> WorkflowOutput:
    """Run the agent workflow on a single task and return a scalar reward."""
```
- Inputs:
  - `task`: A dictionary representing a single training task, converted from a sample in the training dataset. For example, if you use the dataset prepared in Step 1, `task` contains `question` and `answer` fields.
  - `model`: A `ChatModelBase` instance with the same interface as `OpenAIChatModel`, which additionally supports converting its invocation history into trainable data automatically.
  - `auxiliary_models`: A dictionary of auxiliary models available to the workflow, keyed by model name, with `ChatModelBase` instances as values. Unlike the main `model`, these models are not trained directly; they can assist the main model in completing the task (e.g., acting as a judge). An empty dict is passed if no auxiliary models are needed.
- Outputs:
  - `WorkflowOutput`: An object containing the output of the workflow function, which includes:
    - `reward`: A scalar float representing the reward obtained from the workflow function. Fill this field if you want the workflow function to output the reward directly; otherwise, leave it as `None` and implement the reward calculation in the judge function.
    - `response`: The output of the workflow function, which can be the agent's response or another type depending on your implementation; it is passed to the judge function for reward calculation. Leave it as `None` if you don't need to calculate the reward in the judge function.
    - `metrics`: A dictionary of additional metrics to log during training. Leave it as `None` if no additional metrics are needed.
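For instance, a workflow that scores itself can fill `reward` directly and omit the judge function entirely. Below is a minimal sketch (the `self_scoring_workflow` name and the simple containment check are illustrative, not part of the tuner API):
```python
from typing import Dict, Optional

from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import OpenAIChatModel
from agentscope.tuner import WorkflowOutput


async def self_scoring_workflow(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Optional[Dict[str, OpenAIChatModel]] = None,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem solving agent.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(msg=Msg("user", task["question"], role="user"))
    answer_text = response.get_text_content() or ""
    return WorkflowOutput(
        # reward is filled directly, so no judge function is needed
        reward=1.0 if task["answer"] in answer_text else 0.0,
    )
```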
Below is a refactored version of the original `run_react_agent` function to fit the workflow function signature.
**There are only 3 minor changes from the original function**:
1. Use the input `model` to initialize the agent.
2. Use the `question` field from the `task` dictionary as the user query.
3. Return a `WorkflowOutput` object containing the agent's response.
```python
from typing import Dict

from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.tuner import WorkflowOutput
from agentscope.message import Msg


async def run_react_agent(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem solving agent.",
        model=model,  # directly use the trainable model here
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", task["question"], role="user"),  # extract the question from the task
    )
    return WorkflowOutput(  # put the response into WorkflowOutput
        response=response,
    )
```
### Step 3: Implement the judge function
To train the agent using RL, you need to define a judge function that computes a reward following the signature below.
```python
async def judge_function(
    task: Dict,
    response: Any,
    auxiliary_models: Dict[str, ChatModelBase],
) -> JudgeOutput:
    """Calculate the reward based on the input task and the agent's response."""
```
- Inputs:
  - `task`: A dictionary representing a single training task, identical to the workflow function's input.
  - `response`: The output of the workflow function, which can be the agent's response or another type depending on your workflow implementation.
  - `auxiliary_models`: A dictionary of auxiliary models available for reward calculation, keyed by model name, with `ChatModelBase` instances as values. Unlike the main model, these models are not trained directly; they can assist in calculating the reward (e.g., acting as a judge). An empty dict is passed if no auxiliary models are needed.
- Outputs:
  - `JudgeOutput`: An object containing the output of the judge function. It includes:
    - `reward`: A scalar float reward calculated from the input task and the agent's response. This field must be filled.
    - `metrics`: A dictionary of additional metrics to log during training. Leave it as `None` if no additional metrics are needed.
Here is an example implementation of a simple reward mechanism that gives a reward of `1.0` when the ground-truth answer appears in the agent's response, and `0.0` otherwise.
> Note: This is a toy reward function; in practice, you should parse the agent's response to extract the final answer before comparing it with the ground truth, and you may want to use a more robust metric for reward calculation.
```python
from typing import Dict

from agentscope.message import Msg
from agentscope.model import ChatModelBase
from agentscope.tuner import JudgeOutput


async def judge_function(
    task: Dict, response: Msg, auxiliary_models: Dict[str, ChatModelBase]
) -> JudgeOutput:
    """Simple reward: 1.0 if the response contains the ground truth, else 0.0."""
    ground_truth = task["answer"]
    reward = 1.0 if ground_truth in (response.get_text_content() or "") else 0.0
    return JudgeOutput(reward=reward)
```
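If your agent is prompted to wrap its final answer in `\boxed{}`, as the GSM8K example in [main.py](./main.py) does, the judge can extract that answer before comparing. Below is a sketch assuming a simple regex-based parse (the `boxed_judge_function` name and parsing logic are illustrative; [main.py](./main.py) uses Trinity-RFT's `MathBoxedRewardFn` instead):
```python
import re
from typing import Dict

from agentscope.message import Msg
from agentscope.model import ChatModelBase
from agentscope.tuner import JudgeOutput


async def boxed_judge_function(
    task: Dict, response: Msg, auxiliary_models: Dict[str, ChatModelBase]
) -> JudgeOutput:
    """Reward 1.0 only when the \\boxed{...} answer matches the ground truth."""
    text = response.get_text_content() or ""
    match = re.search(r"\\boxed\{([^}]*)\}", text)
    predicted = match.group(1).strip() if match else ""
    reward = 1.0 if predicted == task["answer"].strip() else 0.0
    # also log whether an answer could be parsed at all
    return JudgeOutput(reward=reward, metrics={"parsed": 1.0 if match else 0.0})
```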
> Tip: You can leverage existing [`MetricBase`](https://github.com/agentscope-ai/agentscope/blob/main/src/agentscope/evaluate/_metric_base.py) implementations in your judge function to compute more sophisticated metrics and combine them into a composite reward.
### Step 4: Start tuning
Finally, you can use the `tune` interface to train the defined workflow function with a few configuration objects.
```python
from agentscope.tuner import tune, AlgorithmConfig, DatasetConfig, TunerModelConfig

# your workflow / judge function here...

if __name__ == "__main__":
    dataset = DatasetConfig(path="my_dataset", split="train")
    model = TunerModelConfig(model_path="Qwen/Qwen3-0.6B", max_model_len=16384)
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=8,
        batch_size=32,
        learning_rate=1e-6,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=judge_function,
        model=model,
        train_dataset=dataset,
        algorithm=algorithm,
    )
    # for advanced users, you can pass config_path to load the config from a YAML file
    # and ignore the other arguments
    # tune(
    #     workflow_func=run_react_agent,
    #     judge_func=judge_function,
    #     config_path="config.yaml",
    # )
```
Here, we use `DatasetConfig` to load the training dataset, `TunerModelConfig` to initialize the trainable model, and `AlgorithmConfig` to specify the RL algorithm and its hyperparameters.
> Note:
> The `tune` function is built on [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) and converts its input parameters into a YAML configuration internally.
> Advanced users can omit the `model`, `train_dataset`, and `algorithm` arguments and instead point the `config_path` argument at a YAML configuration file (see [config.yaml](./config.yaml) for an example).
> We recommend the configuration-file approach for fine-grained control over the training process and for leveraging the advanced features provided by Trinity-RFT.
> Refer to the Trinity-RFT [Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html) for more details on the configuration options.
Checkpoints and logs are automatically saved to the `checkpoints/AgentScope` directory under the current working directory, and each run is saved in a sub-directory suffixed with its launch timestamp.
You can find the TensorBoard logs inside `monitor/tensorboard` of the checkpoint directory.
```
math_agent/
└── checkpoints/
    └── AgentScope/
        └── Experiment-20260104185355/   # each run is saved in a timestamped sub-directory
            ├── monitor/
            │   └── tensorboard/         # tensorboard logs
            └── global_step_x/           # model checkpoint saved at step x
```
---
### Complete example
```python
from typing import Dict

from agentscope.tuner import (
    tune,
    WorkflowOutput,
    JudgeOutput,
    DatasetConfig,
    TunerModelConfig,
    AlgorithmConfig,
)
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg


async def run_react_agent(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem solving agent.",
        model=model,  # directly use the trainable model here
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", task["question"], role="user"),  # extract the question from the task
    )
    return WorkflowOutput(
        response=response,
    )


async def judge_function(
    task: Dict, response: Msg, auxiliary_models: Dict[str, OpenAIChatModel]
) -> JudgeOutput:
    """Simple reward: 1.0 if the response contains the ground truth, else 0.0."""
    ground_truth = task["answer"]
    reward = 1.0 if ground_truth in (response.get_text_content() or "") else 0.0
    return JudgeOutput(reward=reward)


if __name__ == "__main__":
    dataset = DatasetConfig(path="my_dataset", split="train")
    model = TunerModelConfig(model_path="Qwen/Qwen3-0.6B", max_model_len=16384)
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",  # a GRPO algorithm for agentic scenarios
        group_size=8,
        batch_size=32,
        learning_rate=1e-6,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=judge_function,
        model=model,
        train_dataset=dataset,
        algorithm=algorithm,
    )
```
> Note:
> The code above is a simplified example for illustration only.
> For a complete implementation, please refer to [main.py](./main.py), which trains a ReAct agent to solve math problems on the GSM8K dataset.
---
## How to run
After implementing the workflow function, follow these steps to run the training:
1. Prerequisites
- At least 2 NVIDIA GPUs with CUDA 12.8 or newer.
- Adjust the configuration file ([config.yaml](./config.yaml)) to match your hardware (see the illustrative tweak at the end of this step).
- Follow the Trinity-RFT [installation guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source.
- Download the GSM8K dataset and Qwen/Qwen3-0.6B model checkpoints (example):
```bash
huggingface-cli download openai/gsm8k --repo-type dataset
huggingface-cli download Qwen/Qwen3-0.6B
```
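For example, on a machine with only 2 GPUs, you might shrink the cluster and rollout-engine settings in [config.yaml](./config.yaml) along these lines (an illustrative tweak, not a tested configuration):
```yaml
cluster:
  node_num: 1
  gpu_per_node: 2   # match your actual GPU count
explorer:
  rollout_model:
    engine_num: 1   # fewer vLLM instances leave GPUs free for the trainer
    tensor_parallel_size: 1
```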
2. Set up a [Ray](https://github.com/ray-project/ray) cluster
```bash
ray start --head
# for multi-node setup, run the following command on worker nodes
# ray start --address=<master_address>
```
3. Run the training script
```bash
python main.py
```
4. The reward curve and other training metrics can be monitored using TensorBoard:
```bash
tensorboard --logdir ./checkpoints/AgentScope/Experiment-xxxxxx/monitor/tensorboard
```
An example reward curve is shown below:
![reward_curve](./reward_curve.png)


@@ -0,0 +1,368 @@
# Math Agent (Quick Start for AgentScope Tuner)
AgentScope provides a `tuner` sub-module for training agent workflows with reinforcement learning (RL).
This example shows how to use AgentScope Tuner to tune a math problem-solving agent and improve its problem-solving ability.
## Overview
To train your agent workflow with AgentScope Tuner, you need to understand three components:
1. **Workflow function**: Refactor your agent application into a workflow function that follows the specified input/output signature.
2. **Judge function**: Implement a judge function that computes a reward from the agent's response.
3. **Task dataset**: Prepare a dataset of training samples for the agent to learn from.
The following diagram illustrates the relationship between these components:
```mermaid
flowchart TD
    Model[Model] --> WorkflowFunction[Workflow Function]
    WorkflowFunction --> JudgeFunction[Judge Function]
    Task[Task] --> WorkflowFunction
    Task[Task] --> JudgeFunction
    JudgeFunction --> Reward[Reward]
    classDef wfcolor fill:#e67e22,stroke:#333,color:#111;
    classDef judgecolor fill:#1abc9c,stroke:#333,color:#111,stroke-dasharray: 5 5;
    classDef taskcolor fill:#3498db,stroke:#333,color:#111;
    class WorkflowFunction wfcolor;
    class JudgeFunction judgecolor;
    class Task taskcolor;
```
The workflow function takes a model and a task from the task dataset as input and outputs the agent's raw response.
The judge function takes the same task and the agent's response as input, and computes a scalar reward.
The judge function is optional; the workflow function can output the reward directly to skip it.
## How to implement
Here we use a math problem-solving scenario as an example to illustrate how to implement the three components above.
Suppose you have an agent workflow that solves math problems with `ReActAgent`.
```python
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg


async def run_react_agent(query: str):
    model = OpenAIChatModel(
        # your model config here...
    )
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem-solving agent.",
        model=model,
        enable_meta_tool=True,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", query, role="user"),
    )
    print(response)
```
### Step 1: Prepare the task dataset
To train the agent to solve math problems, you need a training dataset containing math problems and their ground-truth answers.
The dataset should use the Hugging Face [datasets](https://huggingface.co/docs/datasets/quickstart) format and be loadable via `datasets.load_dataset`. For example:
```
my_dataset/
├── train.jsonl # training samples
└── test.jsonl # test samples
```
Suppose your `train.jsonl` contains samples like:
```json
{"question": "What is 2 + 2?", "answer": "4"}
{"question": "What is 4 + 4?", "answer": "8"}
```
Note that the task sample format can vary with your scenario. The key point is that each sample contains the information the agent needs to complete the task and the information needed to judge how well it was completed.
You can preview the dataset with the following code:
```python
from agentscope.tuner import DatasetConfig

DatasetConfig(path="my_dataset", split="train").preview()
# Output:
# [
#     {
#         "question": "What is 2 + 2?",
#         "answer": "4"
#     },
#     {
#         "question": "What is 4 + 4?",
#         "answer": "8"
#     }
# ]
```
### Step 2: Define the workflow function
To train an agent workflow with AgentScope Tuner, implement a function with the following signature:
```python
async def workflow_function(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Optional[Dict[str, OpenAIChatModel]] = None,
) -> WorkflowOutput:
    """Run the agent workflow on a single task and return a scalar reward."""
```
- Inputs:
  - `task`: A dictionary representing a single training task, converted from a sample of the training dataset. For example, with the dataset prepared in the previous step, `task` contains `question` and `answer` fields.
  - `model`: A `ChatModelBase` instance with the same interface as `OpenAIChatModel`, which additionally supports converting its invocation history into trainable data automatically.
  - `auxiliary_models`: A dictionary of auxiliary models, keyed by model name, with `ChatModelBase` instances as values. These models are not trained directly but can assist the main model (e.g., acting as a judge). An empty dict is passed if no auxiliary models are needed.
- Outputs:
  - `WorkflowOutput`: An object containing the output of the workflow function, including:
    - `reward`: A scalar float reward obtained by the workflow function. Fill this field if you want the workflow function to output the reward directly; otherwise leave it empty and compute the reward in the judge function.
    - `response`: The output of the workflow function, such as the agent's response or another type, used by the judge function to compute the reward. Leave it empty if no judge function is needed.
    - `metrics`: A dictionary of additional metrics to log during training. Leave it empty if not needed.
The example below shows how to turn the original `run_react_agent` function into a workflow function.
**There are only 3 minor changes**:
1. Use the input `model` to initialize the agent.
2. Use the `question` field of the `task` dictionary as the user query.
3. Wrap the original return value in a `WorkflowOutput` object.
```python
from typing import Dict

from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.tuner import WorkflowOutput
from agentscope.message import Msg


async def run_react_agent(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem-solving agent.",
        model=model,  # directly use the trainable model
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", task["question"], role="user"),  # extract the question from the task
    )
    return WorkflowOutput(  # put the response into WorkflowOutput
        response=response,
    )
```
### Step 3: Implement the judge function
The judge function computes a reward from the workflow function's output. Implement it with the following signature:
```python
async def judge_function(
    task: Dict,
    response: Any,
    auxiliary_models: Dict[str, ChatModelBase],
) -> JudgeOutput:
    """Compute the reward from the input task and the workflow function's return value."""
```
- Inputs:
  - `task`: The dictionary for a single training task, identical to the workflow function's input.
  - `response`: The `response` field output by the workflow function; its type depends on your specific workflow implementation.
  - `auxiliary_models`: A dictionary of auxiliary models, keyed by model name, with `ChatModelBase` instances as values. These models are not trained directly but can act as judges to assist reward calculation. An empty dict is passed if no auxiliary models are needed.
- Outputs:
  - `JudgeOutput`: An object containing the output of the judge function, including:
    - `reward`: A scalar float reward computed from the input task and the agent's response. This field must be filled.
    - `metrics`: A dictionary of additional metrics to log during training. Leave it empty if not needed.
Below is an example implementation of a simple reward mechanism: the reward is `1.0` when the ground-truth answer appears in the agent's response, and `0.0` otherwise.
> Note: This function is only an example; in practice you should parse the agent's response to extract the final answer before comparing it with the ground truth, or adopt a more sophisticated reward calculation.
```python
from typing import Dict

from agentscope.message import Msg
from agentscope.model import ChatModelBase
from agentscope.tuner import JudgeOutput


async def judge_function(
    task: Dict, response: Msg, auxiliary_models: Dict[str, ChatModelBase]
) -> JudgeOutput:
    """Simple reward: 1.0 if the response contains the ground truth, else 0.0."""
    ground_truth = task["answer"]
    reward = 1.0 if ground_truth in (response.get_text_content() or "") else 0.0
    return JudgeOutput(reward=reward)
```
> Tip: You can build your judge function on existing [`MetricBase`](https://github.com/agentscope-ai/agentscope/blob/main/src/agentscope/evaluate/_metric_base.py) implementations, combining multiple metrics into a more sophisticated composite reward.
### Step 4: Start tuning
Finally, use the `tune` interface together with a few configuration objects to train the workflow function defined above.
```python
from agentscope.tuner import tune, AlgorithmConfig, DatasetConfig, TunerModelConfig

# your workflow / judge functions here...

if __name__ == "__main__":
    dataset = DatasetConfig(path="my_dataset", split="train")
    model = TunerModelConfig(model_path="Qwen/Qwen3-0.6B", max_model_len=16384)
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=8,
        batch_size=32,
        learning_rate=1e-6,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=judge_function,
        model=model,
        train_dataset=dataset,
        algorithm=algorithm,
    )
    # advanced usage: pass config_path to load the configuration from a YAML file
    # and ignore the other arguments
    # tune(
    #     workflow_func=run_react_agent,
    #     judge_func=judge_function,
    #     config_path="config.yaml",
    # )
```
Here, `DatasetConfig` loads the training dataset, `TunerModelConfig` initializes the trainable model, and `AlgorithmConfig` specifies the RL algorithm and its hyperparameters.
> Note:
> The `tune` function is built on [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) and converts its input parameters into a YAML configuration internally.
> Advanced users can omit the `model`, `train_dataset`, and `algorithm` arguments and instead point `config_path` at a YAML configuration file (see [config.yaml](./config.yaml) for an example).
> We recommend the configuration-file approach for fine-grained control over training and for leveraging the advanced features of Trinity-RFT.
> See the Trinity-RFT [Configuration Guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html) for detailed configuration options.
Checkpoints and logs are automatically saved to `checkpoints/AgentScope` under the current working directory, and each run creates a sub-directory suffixed with its timestamp.
The TensorBoard logs are in `monitor/tensorboard` under the checkpoint directory.
```
math_agent/
└── checkpoints/
    └── AgentScope/
        └── Experiment-20260104185355/   # each run creates a timestamped sub-directory
            ├── monitor/
            │   └── tensorboard/         # tensorboard logs
            └── global_step_x/           # model checkpoint saved at step x
```
---
### Complete example
```python
from typing import Dict

from agentscope.tuner import (
    tune,
    WorkflowOutput,
    JudgeOutput,
    DatasetConfig,
    TunerModelConfig,
    AlgorithmConfig,
)
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg


async def run_react_agent(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math problem-solving agent.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", task["question"], role="user"),  # extract the question from the task
    )
    return WorkflowOutput(
        response=response,
    )


async def judge_function(
    task: Dict, response: Msg, auxiliary_models: Dict[str, OpenAIChatModel]
) -> JudgeOutput:
    """Simple reward: 1.0 if the response contains the ground truth, else 0.0."""
    ground_truth = task["answer"]
    reward = 1.0 if ground_truth in (response.get_text_content() or "") else 0.0
    return JudgeOutput(reward=reward)


if __name__ == "__main__":
    dataset = DatasetConfig(path="my_dataset", split="train")
    model = TunerModelConfig(model_path="Qwen/Qwen3-0.6B", max_model_len=16384)
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=8,
        batch_size=32,
        learning_rate=1e-6,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=judge_function,
        model=model,
        train_dataset=dataset,
        algorithm=algorithm,
    )
```
> Note:
> The code above is a simplified example; for a complete implementation see [main.py](./main.py), which trains a ReAct agent on the GSM8K dataset to solve math problems.
---
## How to run
After implementing the workflow function, follow these steps to run the training:
1. Prerequisites
- At least 2 NVIDIA GPUs with CUDA 12.8 or newer.
- Adjust the configuration file ([config.yaml](./config.yaml)) to match your hardware.
- Follow the Trinity-RFT [installation guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_installation.html) to install the latest version from source.
- Download the GSM8K dataset and the Qwen/Qwen3-0.6B model weights (example):
```bash
huggingface-cli download openai/gsm8k --repo-type dataset
huggingface-cli download Qwen/Qwen3-0.6B
```
2. Start a [Ray](https://github.com/ray-project/ray) cluster
```bash
ray start --head
# for a multi-node setup, run the following command on the worker nodes
# ray start --address=<master_address>
```
3. Run the training script
```bash
python main.py
```
4. Monitor the reward curve and other training metrics with TensorBoard:
```bash
tensorboard --logdir ./checkpoints/AgentScope/Experiment-xxxxxx/monitor/tensorboard
```
An example reward curve is shown below:
![reward_curve](./reward_curve.png)


@@ -0,0 +1,53 @@
# Please refer to https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html for a detailed explanation of each field.
project: AgentScope
name: GSM8K-Qwen3-0.6B
# directory to save checkpoints, defaults to ./checkpoints if TRINITY_CHECKPOINT_ROOT_DIR is not set
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
  algorithm_type: multi_step_grpo  # a GRPO-based algorithm for multi-step reasoning
  repeat_times: 8  # repeat each training sample 8 times
model:
  # path to the pre-trained model, defaults to Qwen/Qwen3-0.6B if TRINITY_MODEL_PATH is not set
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-0.6B}
  # maximum number of tokens generated in a response
  max_response_tokens: 16384
  # maximum token length for input and output combined
  # if you hit OOM, try reducing max_model_len and max_response_tokens
  max_model_len: 24576
  temperature: 1.0
cluster:
  node_num: 1  # cluster with 1 node
  gpu_per_node: 8  # each node has 8 GPUs
buffer:
  total_epochs: 1  # run the taskset for 1 epoch
  batch_size: 32  # each step samples 32 tasks from the taskset
  train_batch_size: 256  # trainer batch size is 256 (multi-step reasoning generates more training samples)
  explorer_input:
    taskset:  # define the taskset for rollout
      name: gsm8k
      path: 'openai/gsm8k'
      subset_name: 'main'
      split: 'train'
explorer:
  runner_per_model: 16  # each model has 16 runners for parallel rollout
  max_timeout: 600  # max timeout for each rollout is 600 seconds
  rollout_model:
    engine_num: 4  # set up 4 vLLM inference model instances
    tensor_parallel_size: 1  # each model instance uses a tensor parallel size of 1
    enable_openai_api: true  # the following parameters expose an OpenAI-style API; don't change them
    enable_history: true
    enable_auto_tool_choice: true
    # tool_call_parser and reasoning_parser for the Qwen3 series; adjust accordingly if you use other models
    tool_call_parser: hermes
    reasoning_parser: deepseek_r1
synchronizer:
  sync_style: dynamic_by_explorer
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 1800  # wait for up to 30 minutes
trainer:
  save_interval: 100  # save a checkpoint every 100 steps
  use_dynamic_bsz: true
  ulysses_sequence_parallel_size: 1  # sequence parallel size (increase to reduce per-GPU memory usage)
monitor:
  monitor_type: tensorboard  # here we use tensorboard; you can also use wandb, mlflow, or swanlab

130
tuner/math_agent/main.py Normal file

@@ -0,0 +1,130 @@
# -*- coding: utf-8 -*-
"""Example of training a ReAct agent on GSM8K with Trinity-RFT."""
from typing import Dict

from agentscope.tuner import (
    tune,
    DatasetConfig,
    WorkflowOutput,
    JudgeOutput,
    TunerModelConfig,
    AlgorithmConfig,
)
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg


async def run_react_agent(
    task: Dict,
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    """A simple workflow function using the ReAct agent to solve tasks.

    Args:
        task (`Dict`): The task to be solved.
        model (`OpenAIChatModel`): The language model to use.
        auxiliary_models (`Dict[str, OpenAIChatModel]`):
            A dictionary of additional chat models available for
            LLM-as-a-Judge. Not used in this workflow.

    Returns:
        `WorkflowOutput`: The workflow output containing the agent's response.
    """
    assert (
        auxiliary_models is None or len(auxiliary_models) == 0
    ), "No auxiliary models are used in this workflow."
    sys_prompt = (
        "You are an agent specialized in solving math problems with tools. "
        "Please solve the math problem given to you. You can write and "
        "execute Python code to perform calculation or verify your answer. "
        "You should return your final answer within \\boxed{{}}."
    )
    agent = ReActAgent(
        name="react_agent",
        sys_prompt=sys_prompt,
        model=model,
        enable_meta_tool=True,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent.reply(
        msg=Msg("user", task["question"], role="user"),
    )
    return WorkflowOutput(
        response=response,
    )


async def gsm8k_judge(
    task: Dict,
    response: Msg,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> JudgeOutput:
    """A simple judge function to calculate the reward from the agent's response.

    Args:
        task (`Dict`): The task information for the corresponding workflow.
        response (`Msg`): The response generated by the corresponding workflow.
        auxiliary_models (`Dict[str, OpenAIChatModel]`):
            A dictionary of additional chat models available for LLM-as-a-Judge
            usage. The keys are model names, and the values are the
            corresponding OpenAIChatModel instances.

    Returns:
        `JudgeOutput`: The reward value assigned by the judge function.
    """
    # lazy import so the dependency stays local to the judge
    from trinity.common.rewards.math_reward import MathBoxedRewardFn

    assert (
        auxiliary_models is None or len(auxiliary_models) == 0
    ), "No auxiliary models are used in this workflow."
    reward_fn = MathBoxedRewardFn()
    # parse the ground truth from the GSM8K raw answer text
    truth = task["answer"]
    if isinstance(truth, str) and "####" in truth:
        truth = truth.split("####")[1].strip()
    else:
        truth = str(truth)
    # parse the answer from the response message
    result = response.get_text_content()
    reward_dict = reward_fn(
        response=result,
        truth=truth,
    )
    return JudgeOutput(
        reward=sum(reward_dict.values()),
        metrics=reward_dict,
    )


if __name__ == "__main__":
    dataset = DatasetConfig(
        path="openai/gsm8k",
        name="main",
        split="train",
    )
    tuner_model = TunerModelConfig(
        model_path="Qwen/Qwen3-0.6B",
        max_model_len=24576,
        max_tokens=16384,
        temperature=1.0,
        inference_engine_num=4,
        tensor_parallel_size=1,
    )
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=8,
        learning_rate=1e-6,
        batch_size=32,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=gsm8k_judge,
        train_dataset=dataset,
        model=tuner_model,
        algorithm=algorithm,
    )

Binary file not shown (279 KiB).

2
tuner/requirements.txt Normal file

@@ -0,0 +1,2 @@
agentscope[full]>=1.0.12
trinity-rft>=0.4.1