Add examples for frozenlake and emailsearch (#94)
271
tuner/frozen_lake/README.md
Normal file
@@ -0,0 +1,271 @@
# Training FrozenLake Agent with RL using AgentScope-Tuner

## Summary

This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task using [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT). The agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes through multi-step interactions with the environment.

## Task Setting

### Agent Goal
The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:
- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps

### Agent Type
The agent is implemented as a **ReActAgent** (Reasoning and Acting Agent) that:
- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode

### Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:
- **Grid-based navigation**: Randomly generated maps with configurable size (2x2 to 6x6)
- **Tile types**:
  - `S`: Start position
  - `F`: Frozen tile (safe to walk on)
  - `H`: Hole (terminates episode with reward 0)
  - `G`: Goal (terminates episode with reward +1.0)
- **Action space**: Discrete actions (Up, Down, Left, Right)
- **Reward structure**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal
- **Observations**: Text-based grid representation showing current player position

The agent does not use external tools. It interacts directly with the environment through:
- `env.reset(task)`: Initialize environment with task parameters
- `env.step(action)`: Execute action and receive observation, reward, and done flag
- `env.render()`: Get text representation of current state
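For concreteness, here is a minimal interaction sketch with the wrapper (action strings are lowercase, matching the implementation in `_frozenlake_env.py`; the parameter values are illustrative):

```python
from _frozenlake_env import FrozenLakeEnv

# Build a small deterministic environment and take one step.
env = FrozenLakeEnv(max_steps=8, size=4, p=0.8, seed=42)
observation, _ = env.reset({"seed": 42, "size": 4, "p": 0.8})
observation, reward, done, info = env.step("left")  # "up" / "down" / "left" / "right"
print(env.render())  # text grid: P (player), _ (frozen), O (hole), G (goal)
```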
## Dataset Preparation

The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:
- `seed`: Random seed for reproducible map generation
- `size`: Grid size (randomly sampled from 2 to `map_max_size`, e.g., 4x4, 6x6)
- `p`: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- `index`: Sample index
- `uid`: Unique identifier combining seed, size, and p

Run the data preparation script to generate training and test datasets:

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

This will create parquet files in the specified directory:

```
/path/to/frozenlake_dataset/
├── train.parquet  # 10000 training samples
└── test.parquet   # 100 test samples
```

Each sample looks like:

```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (`env_max_steps=8`), filtering out unsolvable tasks.
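To sanity-check the generated files, you can load a split with pandas (a minimal sketch; `pandas` with parquet support is assumed, and the path is illustrative):

```python
import pandas as pd

# Load the generated training split and inspect one task's parameters.
df = pd.read_parquet("/path/to/frozenlake_dataset/train.parquet")
print(len(df))     # 10000
print(df.iloc[0])  # seed, size, p, index, uid
```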
## Code Implementation

This section provides a high-level overview of the code implementation. For details, please refer to the source code.

### High-level Overview

The implementation consists of three main components:

1. **Agent** (`FrozenLakeAgent`): Extends `ReActAgent` to handle multi-step navigation
2. **Environment** (`FrozenLakeEnv`): Wraps Gymnasium's FrozenLake environment
3. **Workflow** (`run_frozen_lake`): Orchestrates the agent-environment interaction loop

### Agent Workflow

The workflow function `run_frozen_lake` implements the agent-environment interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )
```

**Key characteristics:**
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully

### Reward Function

No separate judge function is needed. The reward comes directly from the environment:
- 1.0: Agent successfully reaches the goal (G)
- 0.0: Agent falls into a hole (H) or fails to reach the goal within the maximum steps

The reward is computed as the sum of step rewards throughout the episode. The workflow returns:
- `reward`: Final cumulative reward
- `response`: Final response message containing observation, total reward, steps taken, and termination reason
- `metrics`: Additional metrics including `env_steps` (number of steps taken) and `env_done` (whether the episode terminated)

### Implementation Details

The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: Initialize the environment with task parameters
- `step(action)`: Execute an action and return (observation, reward, done, info)
- `render()`: Return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: Reply to a message and return a response message (inherited from AgentScope)
- `get_prompt(observation)`: Generate a prompt from the current observation
- `get_action(response)`: Parse the model's response to extract an action (Up/Down/Left/Right), as sketched after this list
- `update_state(action, observation)`: Update internal state after each step

See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for implementation details.
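Action parsing follows the ``` ``` convention required by the system prompt. A minimal sketch of the logic (the full version lives in `_frozenlake_agent.py`):

```python
import re

VALID_ACTIONS = {"up", "down", "left", "right"}

def parse_action(response_text: str) -> str:
    """Take the last ``` ``` block as the action; fall back to "still"."""
    matches = re.findall(r"```(.*?)```", response_text, re.DOTALL)
    action = matches[-1].strip().lower() if matches else "still"
    return action if action in VALID_ACTIONS else "still"

assert parse_action("I will go up. ```Up```") == "up"
```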
### Step 4: Use `tune` to train the workflow

```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run

### Prerequisites

- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source
- Install gymnasium for the FrozenLake environment:

```bash
pip install gymnasium[toy_text]
```

- Download the model checkpoint (example):

```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```

### Step 1: Prepare the Dataset

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Update the dataset path in `main.py` to point to your generated dataset directory.
### Step 2: Configure the Training

Key configuration options can be set in the code, including the following (the corresponding objects from `main.py` are shown after this list):

**Algorithm Configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: Number of rollouts per task used for group advantage estimation (default: 16)
- `batch_size`: Batch size for training (default: 32)
- `learning_rate`: Learning rate (default: 1e-6)

**Model Configuration** (`TunerModelConfig`):
- `model_path`: Path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: Maximum model context length (default: 25600)
- `max_tokens`: Maximum tokens for response generation (default: 2048)
- `inference_engine_num`: Number of inference engines (default: 6, using 6 GPUs for inference)

**Dataset Configuration** (`DatasetConfig`):
- `path`: Path to the dataset (default: `/path/to/frozenlake`)
- `split`: Split of the dataset (default: `train`)

Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in [config.yaml](./config.yaml).
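For reference, these options are constructed as the following objects in `main.py`:

```python
from agentscope.tuner import AlgorithmConfig, TunerModelConfig

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",  # GRPO variant for multi-step tasks
    group_size=16,
    batch_size=32,
    learning_rate=1e-6,
)
tuner_model = TunerModelConfig(
    model_path="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=25600,
    max_tokens=2048,
    inference_engine_num=6,
)
```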
### Step 3: Set Up Ray Cluster

Set up a [Ray](https://github.com/ray-project/ray) cluster:

```bash
ray start --head
# for a multi-node setup, run the following command on each worker node
# ray start --address=<master_address>
```
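Optionally, verify the cluster is up before launching training (a small sanity check using Ray's standard Python API):

```python
import ray

# Connect to the running cluster and print its aggregate resources.
ray.init(address="auto")
print(ray.cluster_resources())  # expect your GPU count under "GPU"
ray.shutdown()
```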
### Step 4: Run the Training Script

```bash
python main.py
```

Training will start, and you can monitor progress through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results

### Training Reward Curve

The reward curve during training shows the agent's learning progress:

![Training Reward Curve](./critic_rewards_mean.png)

The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output

An example of agent output is given below:

````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````
250
tuner/frozen_lake/README_zh.md
Normal file
@@ -0,0 +1,250 @@
# Training a FrozenLake Agent with AgentScope-Tuner

## Summary

This example shows how to use AgentScope-Tuner with [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT) to apply reinforcement fine-tuning to the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task. The agent must walk from the start to the goal on a frozen lake grid, avoid the holes, and finish within a limited number of steps.

## Task Setting

### Agent Goal
The agent must travel from the start (S) to the goal (G) on the frozen lake grid, while it:
- Plans a path across frozen tiles (F) to reach the goal
- Avoids holes (H), which end the episode with reward 0
- Finishes the task within the step limit

### Agent Type
The agent is implemented as a **ReActAgent** whose behavior includes:
- Observing the current state of the frozen lake grid
- Reasoning about the best next action
- Executing actions (Up, Down, Left, Right) to move through the environment
- Maintaining internal state across the multi-step interaction

### Environment
The environment is based on Gymnasium's FrozenLake and provides:
- **Grid navigation**: randomly generated maps from 2x2 to 6x6
- **Tile types**:
  - `S`: start
  - `F`: frozen tile (walkable)
  - `H`: hole (reward 0, ends the episode)
  - `G`: goal (reward +1.0, ends the episode)
- **Action space**: discrete actions (Up, Down, Left, Right)
- **Reward design**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal within the step limit
- **Observations**: a text grid representation showing the current player position

The agent uses no external tools and interacts with the environment directly through:
- `env.reset(task)`: initialize the environment from task parameters
- `env.step(action)`: execute an action and return the observation, reward, and done flag
- `env.render()`: return a text representation of the current state

## Dataset Preparation

The dataset contains task parameters for generating FrozenLake environments. Each sample includes:
- `seed`: random seed for reproducible maps
- `size`: grid size (random between 2 and `map_max_size`, e.g., 4x4, 6x6)
- `p`: probability that a tile is frozen (sampled uniformly from 0.6 to 0.85); the remaining tiles are holes
- `index`: sample index
- `uid`: unique ID combining seed, size, and p

Run the data preparation script to generate the training and test sets:

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Example of the generated directory layout:
```
/path/to/frozenlake_dataset/
├── train.parquet  # 10000 training samples
└── test.parquet   # 100 test samples
```

Example sample:
```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The script filters out unsolvable maps, ensuring a feasible path from start to goal within the maximum number of steps (`env_max_steps=8`).

## Code Implementation

This section gives a high-level overview of the implementation. For details, please refer to the source code.

### High-level Overview
The implementation consists of three parts:
1. **Agent** (`FrozenLakeAgent`): extends `ReActAgent` and handles the multi-step interaction
2. **Environment** (`FrozenLakeEnv`): wraps Gymnasium's FrozenLake
3. **Workflow** (`run_frozen_lake`): orchestrates the agent-environment interaction

### Workflow
`run_frozen_lake` implements the multi-step interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create the agent and the environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={"env_steps": float(step_count), "env_done": float(done)},
    )
```

**Key characteristics:**
- Multi-step interaction: multiple actions within a single episode, not single-turn QA
- State tracking: records the current step and the last action and observation
- Error handling: invalid actions and exceptions are caught and handled

### Reward Function
No separate judge is needed; the reward comes directly from the environment:
- 1.0: the goal is reached
- 0.0: the agent falls into a hole or fails to reach the goal within the step limit

The workflow returns:
- `reward`: the cumulative reward
- `response`: the final message containing the observation, total reward, step count, and termination reason
- `metrics`: `env_steps` (number of steps) and `env_done` (whether the episode terminated)

### Implementation Details

The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: initialize the environment from task parameters
- `step(action)`: execute an action and return (observation, reward, done, info)
- `render()`: return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: reply to a message and return a response message (inherited from AgentScope)
- `get_prompt(observation)`: build a prompt from the current observation
- `get_action(response)`: parse the model response to extract an action (Up/Down/Left/Right)
- `update_state(action, observation)`: update internal state after each step

See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for details.

### Step 4: Train the workflow with `tune`

```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run

### Prerequisites
- At least 2 NVIDIA GPUs with CUDA ≥ 12.8
- Install the latest version from source following the [Trinity-RFT installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)
- Install gymnasium for the FrozenLake environment:
```bash
pip install gymnasium[toy_text]
```
- Download the model weights (example):
```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```

### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```
Change the dataset path in `main.py` to your generated directory.

### Step 2: Configure the Training

Key configuration options can be set in the code, including:

**Algorithm configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: number of rollouts per task for group advantage estimation (default 16)
- `batch_size`: batch size (default 32)
- `learning_rate`: learning rate (default 1e-6)

**Model configuration** (`TunerModelConfig`):
- `model_path`: base model path (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: maximum context length (default 25600)
- `max_tokens`: maximum response length (default 2048)
- `inference_engine_num`: number of inference engines (default 6, i.e., 6 GPUs for inference)

**Dataset configuration** (`DatasetConfig`):
- `path`: dataset path (default `/path/to/frozenlake`)
- `split`: dataset split (default `train`)

Adjust these parameters according to your hardware resources and training needs. Other parameters can be specified in [config.yaml](./config.yaml).

### Step 3: Set Up a Ray Cluster

Set up a [Ray](https://github.com/ray-project/ray) cluster:
```bash
ray start --head
# for a multi-node setup, run the following command on each worker node
# ray start --address=<master_address>
```

### Step 4: Run the Training Script
```bash
python main.py
```
Training will start, and progress can be monitored through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results

### Training Reward Curve

The reward curve during training shows the agent's learning progress:

![Training Reward Curve](./critic_rewards_mean.png)

The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output

An example of agent output:
````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````
102
tuner/frozen_lake/_frozenlake_agent.py
Normal file
@@ -0,0 +1,102 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import re

from _utils import SYSTEM_PROMPT, FrozenLakeAction  # pylint: disable=E0611

from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import OpenAIChatModel


INVALID_ACTION = "still"
VALID_ACTIONS = {
    "left": 1,
    "down": 2,
    "right": 3,
    "up": 4,
}


class FrozenLakeAgent(ReActAgent):
    """Agent for FrozenLake environment."""

    def __init__(self, model: OpenAIChatModel, max_steps: int = 20):
        super().__init__(
            name="frozenlake_agent",
            model=model,
            sys_prompt=SYSTEM_PROMPT,
            formatter=OpenAIChatFormatter(),
            max_iters=1,
        )
        self.response_structure = FrozenLakeAction
        self.current_step = 0
        self.last_action = None
        self.last_observation = None
        self.max_steps = max_steps

    def get_prompt(self, observation: str) -> str:
        """Get prompt for the agent based on current observation."""
        prompt = (
            f"Current Observation ({self.current_step}): \n"
            + observation
            + "\n"
            + (
                "You have not achieved the goal, P has not reached G yet. "
                "Please give the next action."
            )
        )
        if self.current_step > 0 and self.last_action is not None:
            if self.last_observation == observation:
                prompt += (
                    "\nYour last response is invalid. "
                    "Your position didn't change at all. "
                    "You may need to recheck your thinking process, "
                    "action outputted, and the format of response. "
                    "Remember, you should only output the NEXT ACTION "
                    "at each iteration in the ``` ```. "
                    "For example, if you want to move up, "
                    "you should output ```Up```."
                )

        if (
            self.max_steps is not None
            and self.max_steps - self.current_step > 0
        ):
            remaining = self.max_steps - self.current_step
            prompt += (
                f"\nThe maximum number of steps remaining is {remaining}."
            )

        return prompt

    def get_action(self, msg: Msg) -> str:
        """Extract action from agent response message."""
        response: str = (
            msg.content
            if isinstance(msg.content, str)
            else msg.content[0].get("text")
        )
        action = INVALID_ACTION

        matches = re.findall(r"```(.*?)```", response, re.DOTALL)

        if matches:
            last_match_content = matches[-1].strip()
            action = last_match_content.lower()
            if action not in VALID_ACTIONS:
                action = INVALID_ACTION

        return action

    def update_state(self, action: str, observation: str) -> None:
        """Update agent state with action and observation."""
        self.last_action = action
        self.last_observation = observation
        self.current_step += 1

    async def reset(self) -> None:
        """Reset agent state for a new episode."""
        self.current_step = 0
        self.last_action = None
        self.last_observation = None
        await self.memory.clear()
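
# Usage sketch (mirrors how main.py drives the agent; assumes a configured
# chat `model` and an observation string `obs`):
#
#     agent = FrozenLakeAgent(model=model, max_steps=10)
#     msg = Msg("user", agent.get_prompt(obs), role="user")
#     response = await agent.reply(msg=msg)
#     action = agent.get_action(response)  # e.g. "```Up```" -> "up"
#     agent.update_state(action=action, observation=obs)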
316
tuner/frozen_lake/_frozenlake_env.py
Normal file
@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import copy
from typing import Dict, Optional, Tuple, Union

import numpy as np

try:
    from gymnasium.envs.toy_text.frozen_lake import (
        FrozenLakeEnv as GymFrozenLakeEnv,
    )
except ImportError:
    GymFrozenLakeEnv = object
from _utils import (  # pylint: disable=E0611
    generate_random_map,
    get_goal_position,
)


class FrozenLakeEnv(GymFrozenLakeEnv):
    """FrozenLake environment wrapper."""

    # Map gym state in integer
    MAP_LOOKUP = {
        b"P": 0,
        b"F": 1,
        b"H": 2,
        b"G": 3,
    }

    # Define rules to transform to rendered text observation of the environment
    GRID_LOOKUP = {
        0: " P \t",  # player
        1: " _ \t",  # frozen
        2: " O \t",  # hole
        3: " G \t",  # goal
        4: " X \t",  # player fall into hole
        5: " √ \t",  # player on goal
    }

    ACTION_LOOKUP = {
        "still": 0,
        "left": 1,
        "down": 2,
        "right": 3,
        "up": 4,
    }

    INVALID_ACTION = 0
    PENALTY_FOR_INVALID = -1

    def __init__(
        self,
        max_steps: int = 8,
        desc: Optional[str] = None,
        is_slippery: bool = False,
        size: int = 8,
        p: float = 0.8,
        seed: int = 42,
    ):
        self.max_steps = max_steps or 8
        self.desc: Union[str, np.ndarray, None] = desc
        self.is_slippery = is_slippery
        self.size = size
        self.p = p
        self.seed = seed
        self.render_mode: Optional[str] = None
        try:
            import gymnasium as gym
        except ImportError as e:
            error_message = (
                "Gymnasium is not installed. "
                "Please install gymnasium first before "
                "running the frozen_lake workflow. "
                f"Error: {str(e)}"
            )
            raise ImportError(error_message) from e

        if self.desc is None:
            random_map, goal_position = generate_random_map(
                size=self.size,
                p=self.p,
                seed=self.seed,
                max_steps=self.max_steps,
            )
        else:
            random_map = np.asarray(copy.deepcopy(self.desc), dtype="c")
            goal_position = get_goal_position(random_map)

        self.goal_position = goal_position

        super().__init__(
            desc=random_map[:],
            is_slippery=self.is_slippery,
        )
        assert isinstance(self.desc, np.ndarray)
        self.action_space = gym.spaces.Discrete(4, start=1)

        self.map_kwargs = {
            "size": size,
            "p": p,
        }
        self.env_kwargs = {
            "is_slippery": is_slippery,
            "desc": copy.deepcopy(desc),
            "seed": seed,
        }

        self.action_map = {
            1: 0,  # left
            2: 1,  # down
            3: 2,  # right
            4: 3,  # up
        }

    def _get_player_position(self) -> Tuple[int, int]:
        return (self.s // self.ncol, self.s % self.ncol)  # (row, col)

    def step(self, action: str) -> Tuple[str, float, bool, Dict]:
        """Execute a step in the environment.

        Maps custom action to gymnasium FrozenLakeEnv action and
        takes the step. Checks if the action is effective (whether
        player moves in the env).

        Args:
            action: The action to take.

        Returns:
            Tuple of (observation, reward, done, info).
        """
        if self.success():
            obs = self.render(mode="tiny_rgb_array")
            assert isinstance(obs, str)
            return obs, 1.0, True, {"action_is_effective": False}

        action_id: int = self.ACTION_LOOKUP.get(action.lower(), 0)

        if not action_id:
            action_id = self.INVALID_ACTION

        if (
            action_id == self.INVALID_ACTION
            or action_id not in self.action_map
        ):
            obs = self.render(mode="tiny_rgb_array")
            assert isinstance(obs, str)
            return obs, 0.0, False, {"action_is_effective": False}

        prev_player_position = int(self.s)

        # Call parent class step method
        # Note: GymFrozenLakeEnv is imported at module level
        player_pos, reward, done, _, _ = super().step(
            self.action_map[action_id],
        )

        obs = self.render(mode="tiny_rgb_array")
        assert isinstance(obs, str)
        return (
            obs,
            float(reward),
            bool(done),
            {"action_is_effective": prev_player_position != int(player_pos)},
        )

    def render(
        self,
        mode: str = "tiny_rgb_array",
    ) -> str | list[str] | np.ndarray:
        """Render the environment.

        Args:
            mode: Rendering mode. Options: "tiny_rgb_array", "list",
                "state", "rgb_array", "ansi".

        Returns:
            Rendered observation based on the mode.
        """
        assert mode in [
            "tiny_rgb_array",
            "list",
            "state",
            "rgb_array",
            "ansi",
        ]
        if mode in ["rgb_array", "ansi"]:
            prev_render_mode = self.render_mode
            self.render_mode = mode
            obs = super().render()
            self.render_mode = prev_render_mode
            return obs
        assert isinstance(self.desc, np.ndarray)
        room_state = copy.deepcopy(self.desc)

        # replace the position of start 'S' with 'F'
        position_S = np.where(room_state == b"S")
        room_state[position_S] = b"F"

        # replace the position of the player with 'P'
        position_P = self._get_player_position()
        room_state[position_P] = b"P"

        if mode == "state":
            # transform 'S', 'F', 'H', 'G' to numpy integer array
            room_state = np.vectorize(lambda x: self.MAP_LOOKUP[x])(room_state)
            # add player in hole or player on goal
            if self.desc[position_P] == b"H":
                room_state[position_P] = 4
            elif self.desc[position_P] == b"G":
                room_state[position_P] = 5
            return room_state

        room_state = self.render(mode="state").tolist()
        assert isinstance(room_state, list)

        if mode == "list":

            def lookup_list(cell: int) -> str:
                return self.GRID_LOOKUP.get(cell, "?").strip("\t").strip()

            return [
                " ".join(lookup_list(cell) for cell in row)
                for row in room_state
            ]

        if mode == "tiny_rgb_array":

            def lookup_tiny(cell: int) -> str:
                return self.GRID_LOOKUP.get(cell, "?")

            result = "\n".join(
                "".join(lookup_tiny(cell) for cell in row)
                for row in room_state
            )
            return result

        # Default return for other modes
        return ""

    def reset(
        self,
        task: Optional[Dict] = None,
    ) -> tuple[str, Dict]:
        """Reset the environment with optional task parameters."""
        task = task or {}
        # Update parameters from task if provided
        size = task.get("size", self.map_kwargs["size"])
        p = task.get("p", self.map_kwargs["p"])
        seed = task.get("seed", self.env_kwargs["seed"])
        is_slippery = task.get(
            "is_slippery",
            self.env_kwargs["is_slippery"],
        )
        desc = task.get("desc", self.env_kwargs.get("desc"))

        # Update instance variables
        self.size = size
        self.p = p
        self.seed = seed
        self.is_slippery = is_slippery
        self.map_kwargs["size"] = size
        self.map_kwargs["p"] = p
        self.env_kwargs["seed"] = seed
        self.env_kwargs["is_slippery"] = is_slippery
        if desc is not None:
            self.env_kwargs["desc"] = copy.deepcopy(desc)

        if desc is None:
            random_map, goal_position = generate_random_map(
                size=size,
                p=p,
                seed=seed,
                max_steps=self.max_steps,
            )
        else:
            random_map = np.asarray(copy.deepcopy(desc), dtype="c")
            goal_position = get_goal_position(random_map)

        self.goal_position = goal_position
        self.desc = random_map[:]

        # Reinitialize parent class with new map
        try:
            import gymnasium as gym

            super().__init__(
                desc=random_map[:],
                is_slippery=self.is_slippery,
            )
            assert isinstance(self.desc, np.ndarray)
            self.action_space = gym.spaces.Discrete(4, start=1)
        except ImportError as e:
            error_message = (
                "Gymnasium is not installed. "
                "Please install gymnasium first before "
                "running the frozen_lake workflow. "
                f"Error: {str(e)}"
            )
            raise ImportError(error_message) from e

        super().reset(seed=self.seed)
        obs = self.render(mode="tiny_rgb_array")
        assert isinstance(obs, str)
        return obs, {}

    def finished(self) -> bool:
        """Check if the episode is finished (goal or hole)."""
        player_pos = self._get_player_position()
        assert isinstance(self.desc, np.ndarray)
        return self.desc[player_pos] in b"GH"  # type: ignore

    def success(self) -> bool:
        """Check if the agent has reached the goal (G)."""
        player_pos = self._get_player_position()
        assert isinstance(self.desc, np.ndarray)
        return self.desc[player_pos] in b"G"
209
tuner/frozen_lake/_utils.py
Normal file
@@ -0,0 +1,209 @@
# -*- coding: utf-8 -*-
"""
Utils for the FrozenLake environment.
Modified from rllm
"""

from typing import Literal, Optional, Tuple

import numpy as np
from pydantic import BaseModel, Field

# Map gym state in integer
MAP_LOOKUP = {
    b"P": 0,
    b"F": 1,
    b"H": 2,
    b"G": 3,
}

# Define rules to transform to rendered text observation of the environment
GRID_LOOKUP = {
    0: " P \t",  # player
    1: " _ \t",  # frozen
    2: " O \t",  # hole
    3: " G \t",  # goal
    4: " X \t",  # player fall into hole
    5: " √ \t",  # player on goal
}

ACTION_LOOKUP = {
    0: "None",
    1: "Left",
    2: "Down",
    3: "Right",
    4: "Up",
}

# Prompting format inspired by the RAGEN project
SYSTEM_PROMPT = """You are Qwen, created by Alibaba Cloud. \
You are a helpful assistant. You are walking on a frozen lake.

FrozenLake Quick Guide
Goal: Reach the goal (G). Player (P) and Goal (G) must overlap.

Symbols:
_ Frozen | O Hole | G Goal | P Player

Rules:
1. Avoid falling into holes (O).
2. Frozen tiles are slippery, you may move perpendicular to
your intended direction.

Valid Action (separated by | ):
Up | Down | Left | Right

Rewards:
Fall into hole: 0
Reach goal: +1.0

You will be provided the current observation, please decide on
the next Action.
You should show your thought process and then input the final
action in ``` ```.
You should only output the NEXT ACTION at each iteration in
the ``` ```. For example, if you want to move up, you should
output ```Up```.
You should plan ahead and need to achieve it in minimum number
of steps.
You should be aware that frozen tiles can be slippery, but the
chance is small and you should not overthink it.

Please show your thinking process and put the final action in
``` ```. In every turn, the final action MUST be one of Up,
Down, Left, Right.
"""


class FrozenLakeAction(BaseModel):
    """Action model for FrozenLake environment."""

    action: Literal["Up", "Down", "Left", "Right"] = Field(
        description=(
            "The action to take in the FrozenLake environment, "
            "must be one of Up, Down, Left, Right"
        ),
    )


def is_valid(board: list[list[str]], max_size: int, max_steps: int) -> bool:
    """DFS to check that there is a valid path.

    Args:
        board: The board representation as a list of lists.
        max_size: Maximum size of the board.
        max_steps: Maximum number of steps allowed.

    Returns:
        True if there's a valid path from start to goal within max_steps,
        False otherwise.
    """
    frontier, discovered = [], set()
    # find the start point
    start_r, start_c = np.where(np.array(board) == "S")
    frontier.append((start_r[0], start_c[0], 0))  # (row, col, steps)
    # DFS to check if there is a path from start to goal
    while frontier:
        r, c, steps = frontier.pop()
        if steps > max_steps:
            continue

        if (r, c) not in discovered:
            discovered.add((r, c))
            directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
            for x, y in directions:
                r_new = r + x
                c_new = c + y
                if (
                    r_new < 0
                    or r_new >= max_size
                    or c_new < 0
                    or c_new >= max_size
                ):  # noqa: PLR2004
                    continue
                if board[r_new][c_new] == "G":
                    return True
                if board[r_new][c_new] != "H":
                    frontier.append((r_new, c_new, steps + 1))
    return False

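# Illustrative check (not part of the module): on the 2x2 board below, the
# goal is adjacent to the start, so the board is solvable within one step:
#     is_valid([["S", "G"], ["F", "H"]], max_size=2, max_steps=1)  # -> True
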
def generate_random_map(
    size: int = 8,
    p: float = 0.8,
    seed: int = 0,
    max_steps: int = 5,
) -> Tuple[list[str], Tuple[int, int]]:
    """Generates a random valid map (one that has a path from start to goal).

    Args:
        size: Size of each side of the grid.
        p: Probability that a tile is frozen.
        seed: Seed to ensure the generation of reproducible maps.
        max_steps: Maximum number of steps allowed.

    Returns:
        A tuple containing a random valid map and the goal position (row, col).
    """
    valid = False
    board: list[list[str]] = []  # initialize to make pyright happy

    try:
        from gymnasium.utils import seeding

        np_random, _ = seeding.np_random(seed)
    except ImportError as exc:
        raise ImportError(
            "Gymnasium is not installed. "
            "Please install gymnasium first before "
            "running the frozen_lake workflow.",
        ) from exc

    # generate random start and end points
    while not valid:
        p = min(1, p)
        board = np_random.choice(
            ["F", "H"],
            (size, size),
            p=[p, 1 - p],
        ).tolist()

        while True:
            start_r = int(np_random.integers(0, size))
            start_c = int(np_random.integers(0, size))
            goal_r = int(np_random.integers(0, size))
            goal_c = int(np_random.integers(0, size))

            # Ensure start and goal are different positions
            if (start_r, start_c) != (goal_r, goal_c):
                break

        board[start_r][start_c] = "S"
        board[goal_r][goal_c] = "G"

        valid = is_valid(board, size, max_steps)
    return ["".join(x) for x in board], (goal_r, goal_c)


def get_goal_position(
    random_map: np.ndarray,
) -> Optional[Tuple[int, int]]:
    """Get the goal position from a random map.

    Args:
        random_map: The map as a numpy array.

    Returns:
        Tuple of (row, col) if goal found, None otherwise.
    """
    positions = np.argwhere(random_map == b"G")
    if positions.size == 0:
        return None  # G not found
    return tuple(positions[0])  # returns (row, col)


__all__ = [
    "SYSTEM_PROMPT",
    "FrozenLakeAction",
    "generate_random_map",
    "get_goal_position",
]
53
tuner/frozen_lake/config.yaml
Normal file
@@ -0,0 +1,53 @@
project: "AgentScope"  # Project name
name: "FrozenLake"  # Experiment name
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}  # Directory to save model checkpoints
algorithm:
  algorithm_type: multi_step_grpo  # GRPO series for multi-step scenario
  repeat_times: 16  # Number of rollouts per prompt for advantage estimation
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    kl_coef: 0  # KL divergence coefficient
  advantage_fn_args:
    epsilon: 1e-6  # Small value for numerical stability
    std_threshold: 0.0001  # Threshold for standard deviation
  optimizer:
    lr: 1e-6  # Learning rate
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-3B-Instruct}  # Base model path
  max_prompt_tokens: 23552  # Max tokens for prompt
  max_response_tokens: 2048  # Max tokens per response
  max_model_len: 25600  # Max context length
  temperature: 1.0  # Sampling temperature
buffer:
  total_epochs: 5  # Total training epochs
  batch_size: 32  # Batch size per explore step
  train_batch_size: 1024  # Total experiences per training step
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
      max_read_timeout: 7200  # Max timeout for reading from buffer (seconds)
    replay_buffer:
      enable: true  # Enable experience replay
      priority_fn: linear_decay  # Priority function for replay buffer
      priority_fn_args:
        decay: 0.1  # Decay rate for priority function
explorer:
  runner_per_model: 16  # Number of runners per model
  rollout_model:
    engine_num: 6  # Number of vLLM engines for rollout model
    tensor_parallel_size: 1  # TP size per engine for rollout model
    enable_openai_api: true  # Enable OpenAI-compatible API
    enable_history: true  # Enable conversation history
    enable_auto_tool_choice: true  # Enable automatic tool selection
    tool_call_parser: hermes  # Parser for tool calls
trainer:
  save_interval: 100  # Save checkpoint every N steps
  use_dynamic_bsz: true  # Use dynamic batch size
  grad_clip: 1.0  # Gradient clipping value
  max_token_len_per_gpu: 25600  # Max token length per GPU
  ulysses_sequence_parallel_size: 2  # Sequence parallel size for Ulysses
synchronizer:
  sync_style: dynamic_by_explorer  # Sync triggered dynamically by explorer
  sync_interval: 1  # Sync every N steps
  sync_timeout: 1200  # Timeout for synchronization (seconds)
BIN
tuner/frozen_lake/critic_rewards_mean.png
Normal file
Binary file not shown. (62 KiB)
131
tuner/frozen_lake/get_frozenlake_data.py
Normal file
@@ -0,0 +1,131 @@
# -*- coding: utf-8 -*-
"""
Modified from rllm
"""
import argparse
import os

import numpy as np
import pandas as pd


DEFAULT_DATA_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "..",
    "data",
    "frozenlake",
)


def save_dataset_to_local(
    data_path: str,
    data: list[dict],
    split: str = "default",
) -> str:
    """Save dataset directly to local data_path.

    Args:
        data_path: Path to save the dataset
        data: List of dictionaries containing the dataset examples
        split: Split name (e.g., 'train', 'test', 'default')

    Returns:
        str: Path to the saved parquet file
    """
    os.makedirs(data_path, exist_ok=True)

    # Convert to DataFrame and save
    data_df = pd.DataFrame(data)
    dataset_path = os.path.join(data_path, f"{split}.parquet")
    data_df.to_parquet(dataset_path)

    print(
        f"Saved dataset frozenlake split '{split}' "
        f"with {len(data)} examples at {dataset_path}. "
        f"Make sure to set the environment variable "
        f"<TRINITY_TASKSET_PATH> to {data_path}.",
    )

    return dataset_path


def prepare_frozenlake_data(
    data_path: str,
    train_size: int = 10000,
    test_size: int = 100,
    map_max_size: int = 6,
) -> tuple[list[dict], list[dict]]:
    """
    Prepare and save FrozenLake datasets for training and testing.

    Args:
        data_path (str): Path to save the dataset
        train_size (int): Number of training examples to generate
        test_size (int): Number of test examples to generate
        map_max_size (int): Maximum size of the map

    Returns:
        tuple: (train_data, test_data) - Lists of data dictionaries
    """
    # Set random seed for reproducibility
    np.random.seed(42)

    # Generate random parameters for train and test sets
    train_seeds = np.random.randint(0, 100000, size=train_size)
    test_seeds = np.random.randint(0, 100000, size=test_size)
    train_sizes = np.random.randint(2, map_max_size, size=train_size)
    test_sizes = np.random.randint(2, map_max_size, size=test_size)
    train_ps = np.random.uniform(0.6, 0.85, size=train_size)
    test_ps = np.random.uniform(0.6, 0.85, size=test_size)

    def frozenlake_process_fn(
        seed: int,
        size: int,
        p: float,
        idx: int,
    ) -> dict:
        """Process function to create FrozenLake task instances."""
        return {
            "seed": seed,
            "size": size,
            "p": p,
            "index": idx,
            "uid": f"{seed}_{size}_{p}",
        }

    # Create train and test data
    train_data_list = [
        frozenlake_process_fn(seed, train_sizes[idx], train_ps[idx], idx)
        for idx, seed in enumerate(train_seeds)
    ]
    test_data_list = [
        frozenlake_process_fn(seed, test_sizes[idx], test_ps[idx], idx)
        for idx, seed in enumerate(test_seeds)
    ]

    # Save datasets directly to local DATA_PATH
    save_dataset_to_local(data_path, train_data_list, "train")
    save_dataset_to_local(data_path, test_data_list, "test")

    return train_data_list, test_data_list


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=DEFAULT_DATA_PATH)
    parser.add_argument("--train_size", type=int, default=10000)
    parser.add_argument("--test_size", type=int, default=100)
    parser.add_argument("--map_max_size", type=int, default=6)
    args = parser.parse_args()

    train_data, test_data = prepare_frozenlake_data(
        data_path=args.local_dir,
        train_size=args.train_size,
        test_size=args.test_size,
        map_max_size=args.map_max_size,
    )

    print(f"Train dataset: {len(train_data)} examples")
    print(f"Test dataset: {len(test_data)} examples")
    print("Sample train example:", train_data[0])
    print("Sample test example:", test_data[0])
151
tuner/frozen_lake/main.py
Normal file
@@ -0,0 +1,151 @@
# -*- coding: utf-8 -*-
"""Example of training a FrozenLake agent with Trinity-RFT."""
import os
from typing import Dict

from _frozenlake_agent import FrozenLakeAgent
from _frozenlake_env import FrozenLakeEnv

from agentscope.message import Msg
from agentscope.tuner import (
    tune,
    WorkflowOutput,
    DatasetConfig,
    TunerModelConfig,
    AlgorithmConfig,
)
from agentscope.model import ChatModelBase


async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    """A workflow function using the FrozenLake agent to solve tasks.

    Args:
        task (Dict): The task to be solved, containing environment parameters
            like size, p, seed, is_slippery, etc.
        model (ChatModelBase): The language model to use.

    Returns:
        WorkflowOutput: The workflow output containing the reward, response
            and metrics.
    """

    assert len(auxiliary_models) == 0, "No auxiliary models are needed"

    # Extract workflow arguments from task or use defaults
    workflow_args = task.get("workflow_args", {})
    if not workflow_args:
        workflow_args = task

    env_max_steps = workflow_args.get("env_max_steps", 8)
    agent_max_steps = workflow_args.get("agent_max_steps", 10)
    is_slippery = workflow_args.get("is_slippery", False)
    desc = workflow_args.get("desc", None)

    # Extract task-specific arguments (for environment generation)
    size = task.get("size", 8)
    p = task.get("p", 0.8)
    seed = task.get("seed", 42)

    # Initialize agent and environment
    agent = FrozenLakeAgent(model=model, max_steps=agent_max_steps)
    env = FrozenLakeEnv(
        max_steps=env_max_steps,
        desc=desc,
        is_slippery=is_slippery,
        size=size,
        p=p,
        seed=seed,
    )

    # Reset environment with task parameters
    observation, _ = env.reset(task)
    observation_str = str(observation)
    rewards = []
    step_count = 0
    done = False
    terminate_reason = None

    # Run agent-environment interaction loop
    for _ in range(agent_max_steps):
        step_count += 1
        try:
            # get prompt
            prompt = agent.get_prompt(observation_str)

            response = await agent.reply(msg=Msg("user", prompt, role="user"))

            # record action and observation
            action = agent.get_action(response)
            agent.update_state(action=action, observation=observation_str)

        except Exception as e:
            terminate_reason = f"agent_error: {str(e)}"
            break

        # environment step
        observation, reward, done, _ = env.step(action)
        observation_str = str(observation)
        rewards.append(reward)

        if done:
            terminate_reason = "success" if env.success() else "hole"
            break

    if terminate_reason is None:
        terminate_reason = "max_steps_reached"

    final_reward = sum(rewards)
    final_observation = observation_str

    # Create response message with environment information
    response_content = (
        f"Final observation:\n{final_observation}\n"
        f"Total reward: {final_reward}\n"
        f"Steps taken: {step_count}\n"
        f"Terminate reason: {terminate_reason}"
    )

    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )


if __name__ == "__main__":
    dataset = DatasetConfig(
        path="/path/to/frozenlake",
        split="train",
    )
    tuner_model = TunerModelConfig(
        model_path="Qwen/Qwen2.5-3B-Instruct",
        max_model_len=25600,
        max_tokens=2048,
        inference_engine_num=6,
        reasoning_parser=None,
    )
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=16,
        batch_size=32,
        learning_rate=1e-6,
    )
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )  # define some default parameters
    tune(
        workflow_func=run_frozen_lake,
        model=tuner_model,
        train_dataset=dataset,
        algorithm=algorithm,
        config_path=config_path,
    )