Add examples for frozenlake and emailsearch (#94)

Yuchang Sun
2026-01-19 12:25:13 +08:00
committed by GitHub
parent 3821fb04ac
commit 654c35127a
26 changed files with 3370 additions and 14 deletions

271
tuner/frozen_lake/README.md Normal file

@@ -0,0 +1,271 @@
# Training FrozenLake Agent with RL using AgentScope-Tuner
## Summary
This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task using [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT). The agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes through multi-step interactions with the environment.
## Task Setting
### Agent Goal
The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:
- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps
### Agent Type
The agent is implemented as a **ReActAgent** (Reasoning and Acting Agent) that:
- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode
### Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:
- **Grid-based navigation**: Randomly generated maps with configurable size (2x2 to 6x6)
- **Tile types**:
- `S`: Start position
- `F`: Frozen tile (safe to walk on)
- `H`: Hole (terminates episode with reward 0)
- `G`: Goal (terminates episode with reward +1.0)
- **Action space**: Discrete actions (Up, Down, Left, Right)
- **Reward structure**:
- +1.0 for reaching the goal
- 0.0 for falling into a hole or failing to reach the goal
- **Observations**: Text-based grid representation showing current player position
The agent does not use external tools. It interacts directly with the environment through the following interface (a minimal usage sketch follows the list):
- `env.reset(task)`: Initialize environment with task parameters
- `env.step(action)`: Execute action and receive observation, reward, and done flag
- `env.render()`: Get text representation of current state
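A minimal sketch of this interface (hypothetical standalone usage; it assumes the `FrozenLakeEnv` wrapper from `_frozenlake_env.py` in this directory and uses a random policy purely for illustration):
```python
import random

from _frozenlake_env import FrozenLakeEnv

env = FrozenLakeEnv(max_steps=8)
observation, _ = env.reset({"seed": 42, "size": 4, "p": 0.8})
for _ in range(8):  # cap the episode length, as the training workflow does
    action = random.choice(["Up", "Down", "Left", "Right"])
    observation, reward, done, info = env.step(action)
    print(env.render())  # text grid; the player is marked "P"
    if done:
        break
```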
## Dataset Preparation
The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:
- `seed`: Random seed for reproducible map generation
- `size`: Grid size (randomly sampled from 2 to `map_max_size`, e.g., 4x4, 6x6)
- `p`: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- `index`: Sample index
- `uid`: Unique identifier combining seed, size, and p
Run the data preparation script to generate training and test datasets:
```bash
python get_frozenlake_data.py --local_dir /path/to/frozenlake_dataset --map_max_size 6 --train_size 10000 --test_size 100
```
This will create parquet files in the specified directory:
```
/path/to/frozenlake_dataset/
├── train.parquet # 10000 training samples
└── test.parquet # 100 test samples
```
Each sample looks like:
```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```
**Note**: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (`env_max_steps=8`), filtering out unsolvable tasks.
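To sanity-check the generated files, the splits can be loaded with pandas (already a dependency of the preparation script):
```python
import pandas as pd

train = pd.read_parquet("/path/to/frozenlake_dataset/train.parquet")
print(len(train))               # 10000
print(train.iloc[0].to_dict())  # e.g. {"seed": 12345, "size": 5, "p": 0.75, ...}
```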
## Code Implementation
This section provides a high-level overview of the code implementation. For detailed implementation, please refer to the source code.
### High-level Overview
The implementation consists of three main components:
1. **Agent** (`FrozenLakeAgent`): Extends `ReActAgent` to handle multi-step navigation
2. **Environment** (`FrozenLakeEnv`): Wraps Gymnasium's FrozenLake environment
3. **Workflow** (`run_frozen_lake`): Orchestrates the agent-environment interaction loop
### Agent Workflow
The workflow function `run_frozen_lake` implements the agent-environment interaction loop:
```python
async def run_frozen_lake(
task: Dict,
model: ChatModelBase,
auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
# ...
# Create agent and environment
agent = FrozenLakeAgent(model=model, ...)
env = FrozenLakeEnv(...)
observation, _ = env.reset(task)
rewards = []
# ...
# Agent-environment interaction loop
for _ in range(max_steps):
response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
action = agent.get_action(response)
observation, reward, done, _ = env.step(action)
rewards.append(reward)
if done:
break
# ...
final_reward = sum(rewards)
final_response = Msg("assistant", response_content, role="assistant")
return WorkflowOutput(
reward=final_reward,
response=final_response,
metrics={
"env_steps": float(step_count),
"env_done": float(done),
},
)
```
**Key characteristics:**
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully
### Reward Function
No separate judge function is needed. The reward comes directly from the environment:
- 1.0: Agent successfully reaches the goal (G)
- 0.0: Agent falls into a hole (H) or fails to reach the goal within the maximum steps
The reward is computed as the sum of step rewards throughout the episode. The workflow returns:
- `reward`: Final cumulative reward
- `response`: Final response message containing observation, total reward, steps taken, and termination reason
- `metrics`: Additional metrics including `env_steps` (number of steps taken) and `env_done` (whether episode completed)
### Implementation Details
The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: Initialize the environment with task parameters
- `step(action)`: Execute an action and return (observation, reward, done, info)
- `render()`: Return a text representation of the current state
The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: Reply to a message and return an action (inherited from AgentScope)
- `get_prompt(observation)`: Generate a prompt from the current observation
- `get_action(response)`: Parse the model's response to extract an action (Up/Down/Left/Right)
- `update_state(action, observation)`: Update internal state after each step
See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for implementation details.
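For a quick local check outside training, the same loop can be driven by any AgentScope chat model. A sketch, assuming an OpenAI-compatible endpoint (the model name and API key below are placeholders):
```python
import asyncio

from _frozenlake_agent import FrozenLakeAgent
from _frozenlake_env import FrozenLakeEnv
from agentscope.message import Msg
from agentscope.model import OpenAIChatModel


async def smoke_test() -> None:
    model = OpenAIChatModel(model_name="gpt-4o-mini", api_key="sk-...")  # placeholders
    agent = FrozenLakeAgent(model=model, max_steps=10)
    env = FrozenLakeEnv(max_steps=8)
    observation, _ = env.reset({"seed": 42, "size": 4, "p": 0.8})
    for _ in range(10):
        prompt = agent.get_prompt(str(observation))
        response = await agent.reply(msg=Msg("user", prompt, role="user"))
        action = agent.get_action(response)
        agent.update_state(action=action, observation=str(observation))
        observation, reward, done, _ = env.step(action)
        if done:
            print("reward:", reward)
            break


asyncio.run(smoke_test())
```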
### Train the Workflow with `tune`
```python
from agentscope.tuner import tune, DatasetConfig
if __name__ == "__main__":
config_path = os.path.join(
os.path.dirname(__file__),
"config.yaml",
)
dataset = DatasetConfig(
path="/path/to/frozenlake_dataset",
name="default",
split="train",
)
tune(
workflow_func=run_frozen_lake,
train_dataset=dataset,
config_path=config_path,
)
```
See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).
---
## How to Run
### Prerequisites
- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source code
- Install gymnasium for the FrozenLake environment:
```bash
pip install gymnasium[toy_text]
```
- Download the model checkpoint (example):
```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```
### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --local_dir /path/to/frozenlake_dataset --map_max_size 6 --train_size 10000 --test_size 100
```
Update the dataset path in `main.py` to point to your generated dataset directory.
### Step 2: Configure the Training
Key configuration options are set in the code, including:
**Algorithm Configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: Number of rollouts sampled per task for group advantage estimation (default: 16)
- `batch_size`: Batch size for training (default: 32)
- `learning_rate`: Learning rate (default: 1e-6)
**Model Configuration** (`TunerModelConfig`):
- `model_path`: Path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: Maximum model context length (default: 25600)
- `max_tokens`: Maximum tokens for response generation (default: 2048)
- `inference_engine_num`: Number of inference engines (default: 6, using 6 GPUs for inference)
**Dataset Configuration** (`DatasetConfig`):
- `path`: Path to the dataset (default: `/path/to/frozenlake`)
- `split`: Split of the dataset (default: `train`)
Adjust these parameters based on your hardware resources and training requirements; the corresponding objects from `main.py` are shown below. Other parameters can be specified in [config.yaml](./config.yaml).
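For reference, these correspond to the configuration objects constructed in `main.py`:
```python
from agentscope.tuner import AlgorithmConfig, DatasetConfig, TunerModelConfig

dataset = DatasetConfig(path="/path/to/frozenlake", split="train")
tuner_model = TunerModelConfig(
    model_path="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=25600,
    max_tokens=2048,
    inference_engine_num=6,
    reasoning_parser=None,
)
algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=16,
    batch_size=32,
    learning_rate=1e-6,
)
```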
### Step 3: Set Up Ray Cluster
Set up a [Ray](https://github.com/ray-project/ray) cluster:
```bash
ray start --head
# for multi-node setup, run the following command on worker nodes
# ray start --address=<master_address>
```
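To confirm the cluster is up before launching training:
```bash
ray status
```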
### Step 4: Run the Training Script
```bash
python main.py
```
Training will start, and you can monitor progress through the logs. A checkpoint is saved every `trainer.save_interval` steps.
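The corresponding setting in [config.yaml](./config.yaml):
```yaml
trainer:
  save_interval: 100  # Save checkpoint every N steps
```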
## Experimental Results
### Training Reward Curve
The reward curve during training shows the agent's learning progress:
![reward](./critic_rewards_mean.png)
The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.
### Example Agent Output
An example of agent output is given below:
```
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.
The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.
Let's take the action ```Left```.
```
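The action is recovered from the last ``` ``` span of the response (see `get_action` in [_frozenlake_agent.py](./_frozenlake_agent.py)); a minimal reproduction of that parsing step:
```python
import re

response = "Let's take the action ```Left```."
matches = re.findall(r"```(.*?)```", response, re.DOTALL)
action = matches[-1].strip().lower()
print(action)  # left
```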


@@ -0,0 +1,250 @@
# Training a FrozenLake Agent with AgentScope-Tuner
## Summary
This example shows how to use AgentScope-Tuner together with [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT) to apply reinforcement fine-tuning to the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task. The agent must walk from the start to the goal on a frozen-lake grid, avoid holes, and finish within a limited number of steps.
## Task Setting
### Agent Goal
The agent must reach the goal (G) from the start (S) on the frozen-lake grid, while it:
- Plans a path across frozen tiles (F) to the goal
- Avoids holes (H), which end the episode with reward 0
- Finishes within the step limit
### Agent Type
The agent is implemented as a **ReActAgent** that:
- Observes the current state of the frozen-lake grid
- Reasons about the best next action
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across the multi-step interaction
### Environment
The environment is based on Gymnasium's FrozenLake and provides:
- **Grid navigation**: randomly generated maps from 2x2 to 6x6
- **Tile types**:
  - `S`: start
  - `F`: frozen tile (walkable)
  - `H`: hole (reward 0, ends the episode)
  - `G`: goal (reward +1.0, ends the episode)
- **Action space**: discrete actions (Up, Down, Left, Right)
- **Reward design**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal within the step limit
- **Observations**: a text grid representation showing the current player position
The agent uses no external tools and interacts with the environment directly through:
- `env.reset(task)`: initialize the environment from task parameters
- `env.step(action)`: execute an action and return the observation, reward, and done flag
- `env.render()`: return a text representation of the current state
## Dataset Preparation
The dataset contains the task parameters used to generate FrozenLake environments. Each sample includes:
- `seed`: random seed for reproducible map generation
- `size`: grid size (sampled between 2 and `map_max_size`, e.g., 4x4, 6x6)
- `p`: probability that a tile is frozen (sampled from 0.6 to 0.85); the remaining tiles are holes
- `index`: sample index
- `uid`: unique ID combining seed, size, and p
Run the data preparation script to generate the training and test sets:
```bash
python get_frozenlake_data.py --local_dir /path/to/frozenlake_dataset --map_max_size 6 --train_size 10000 --test_size 100
```
The resulting directory layout:
```
/path/to/frozenlake_dataset/
├── train.parquet # 10000 training samples
└── test.parquet # 100 test samples
```
A sample looks like:
```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```
**Note**: the script filters out unsolvable maps, ensuring every generated map has a feasible path from start to goal within the maximum allowed steps (`env_max_steps=8`).
## Code Implementation
This section gives a high-level overview of the implementation; see the source code for details.
### High-level Overview
The implementation consists of three parts:
1. **Agent** (`FrozenLakeAgent`): extends `ReActAgent` and handles the multi-step interaction
2. **Environment** (`FrozenLakeEnv`): wraps Gymnasium's FrozenLake
3. **Workflow** (`run_frozen_lake`): orchestrates the agent-environment interaction
### Workflow
`run_frozen_lake` implements the multi-step interaction loop:
```python
async def run_frozen_lake(
task: Dict,
model: ChatModelBase,
auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
# ...
    # Create the agent and environment
agent = FrozenLakeAgent(model=model, ...)
env = FrozenLakeEnv(...)
observation, _ = env.reset(task)
rewards = []
# ...
    # Agent-environment interaction loop
for _ in range(max_steps):
response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
action = agent.get_action(response)
observation, reward, done, _ = env.step(action)
rewards.append(reward)
if done:
break
# ...
final_reward = sum(rewards)
final_response = Msg("assistant", response_content, role="assistant")
return WorkflowOutput(
reward=final_reward,
response=final_response,
metrics={"env_steps": float(step_count), "env_done": float(done)},
)
```
**Key characteristics:**
- Multi-step interaction: the agent takes several actions within one episode, unlike single-turn QA
- State tracking: the agent records the current step and the last action and observation
- Error handling: invalid actions and exceptions are caught and handled
### Reward Function
No separate judge is needed; the reward comes directly from the environment:
- 1.0: the agent reaches the goal
- 0.0: the agent falls into a hole or exceeds the step limit without reaching the goal
The workflow returns:
- `reward`: the cumulative reward
- `response`: a final message containing the observation, total reward, step count, and termination reason
- `metrics`: `env_steps` (steps taken) and `env_done` (whether the episode finished)
### Implementation Details
The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: initialize the environment from task parameters
- `step(action)`: execute an action and return (observation, reward, done, info)
- `render()`: return a text representation of the current state
The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: reply to a message and return an action (inherited from AgentScope)
- `get_prompt(observation)`: build a prompt from the current observation
- `get_action(response)`: parse the model response into an action (Up/Down/Left/Right)
- `update_state(action, observation)`: update internal state after each step
See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for details.
### Train the Workflow with `tune`
```python
from agentscope.tuner import tune, DatasetConfig
if __name__ == "__main__":
config_path = os.path.join(
os.path.dirname(__file__),
"config.yaml",
)
dataset = DatasetConfig(
path="/path/to/frozenlake_dataset",
name="default",
split="train",
)
tune(
workflow_func=run_frozen_lake,
train_dataset=dataset,
config_path=config_path,
)
```
See [config.yaml](./config.yaml) for the training configuration, and the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html) for full configuration details.
---
## How to Run
### Prerequisites
- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Install the latest Trinity-RFT from source following the [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)
- Install gymnasium for the FrozenLake environment:
```bash
pip install gymnasium[toy_text]
```
- Download the model checkpoint (example):
```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```
### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --local_dir /path/to/frozenlake_dataset --map_max_size 6 --train_size 10000 --test_size 100
```
Update the dataset path in `main.py` to point to your generated dataset directory.
### Step 2: Configure the Training
Key configuration options are set in the code, including:
**Algorithm Configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: number of rollouts sampled per task for group advantage estimation (default: 16)
- `batch_size`: batch size (default: 32)
- `learning_rate`: learning rate (default: 1e-6)
**Model Configuration** (`TunerModelConfig`):
- `model_path`: path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: maximum context length (default: 25600)
- `max_tokens`: maximum response length (default: 2048)
- `inference_engine_num`: number of inference engines (default: 6, i.e., 6 GPUs for inference)
**Dataset Configuration** (`DatasetConfig`):
- `path`: dataset path (default: `/path/to/frozenlake`)
- `split`: dataset split (default: `train`)
Adjust these parameters to your hardware and training needs. Other parameters can be specified in [config.yaml](./config.yaml).
### Step 3: Set Up a Ray Cluster
Set up a [Ray](https://github.com/ray-project/ray) cluster:
```bash
ray start --head
# for a multi-node setup, run the following command on worker nodes
# ray start --address=<master_address>
```
### Step 4: Run the Training Script
```bash
python main.py
```
Training will start, and you can monitor progress through the logs. A checkpoint is saved every `trainer.save_interval` steps.
## Experimental Results
### Training Reward Curve
The reward curve during training shows the agent's learning progress:
![reward](./critic_rewards_mean.png)
The training reward typically rises over training as the agent learns to navigate the lake more effectively.
### Example Agent Output
An example of the agent's output:
```
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.
The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.
Let's take the action ```Left```.
```

102
tuner/frozen_lake/_frozenlake_agent.py Normal file

@@ -0,0 +1,102 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import re
from _utils import SYSTEM_PROMPT, FrozenLakeAction # pylint: disable=E0611
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import OpenAIChatModel
INVALID_ACTION = "still"
VALID_ACTIONS = {
"left": 1,
"down": 2,
"right": 3,
"up": 4,
}
class FrozenLakeAgent(ReActAgent):
"""Agent for FrozenLake environment."""
def __init__(self, model: OpenAIChatModel, max_steps: int = 20):
super().__init__(
name="frozenlake_agent",
model=model,
sys_prompt=SYSTEM_PROMPT,
formatter=OpenAIChatFormatter(),
max_iters=1,
)
self.response_structure = FrozenLakeAction
self.current_step = 0
self.last_action = None
self.last_observation = None
self.max_steps = max_steps
def get_prompt(self, observation: str) -> str:
"""Get prompt for the agent based on current observation."""
prompt = (
f"Current Observation ({self.current_step}): \n"
+ observation
+ "\n"
+ (
"You have not achieved the goal, P has not reached G yet. "
"Please give the next action."
)
)
if self.current_step > 0 and self.last_action is not None:
if self.last_observation == observation:
prompt += (
"\nYour last response is invalid. "
"Your position didn't change at all. "
"You may need to recheck your thinking process, "
"action outputted, and the format of response. "
"Remember, you should only output the NEXT ACTION "
"at each iteration in the ``` ```. "
"For example, if you want to move up, "
"you should output ```Up```."
)
if (
self.max_steps is not None
and self.max_steps - self.current_step > 0
):
remaining = self.max_steps - self.current_step
prompt += (
f"\nThe maximum number of steps remaining is {remaining}."
)
return prompt
def get_action(self, msg: Msg) -> str:
"""Extract action from agent response message."""
response: str = (
msg.content
if isinstance(msg.content, str)
else msg.content[0].get("text")
)
action = INVALID_ACTION
matches = re.findall(r"```(.*?)```", response, re.DOTALL)
if matches:
last_match_content = matches[-1].strip()
action = last_match_content.lower()
if action not in VALID_ACTIONS:
action = INVALID_ACTION
return action
def update_state(self, action: str, observation: str) -> None:
"""Update agent state with action and observation."""
self.last_action = action
self.last_observation = observation
self.current_step += 1
async def reset(self) -> None:
"""Reset agent state for a new episode."""
self.current_step = 0
self.last_action = None
self.last_observation = None
await self.memory.clear()

316
tuner/frozen_lake/_frozenlake_env.py Normal file

@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import copy
from typing import Dict, Optional, Tuple, Union
import numpy as np
try:
from gymnasium.envs.toy_text.frozen_lake import (
FrozenLakeEnv as GymFrozenLakeEnv,
)
except ImportError:
GymFrozenLakeEnv = object
from _utils import ( # pylint: disable=E0611
generate_random_map,
get_goal_position,
)
class FrozenLakeEnv(GymFrozenLakeEnv):
"""FrozenLake environment wrapper."""
# Map gym state in integer
MAP_LOOKUP = {
b"P": 0,
b"F": 1,
b"H": 2,
b"G": 3,
}
# Define rules to transform to rendered text observation of the environment
GRID_LOOKUP = {
0: " P \t", # player
1: " _ \t", # frozen
2: " O \t", # hole
3: " G \t", # goal
4: " X \t", # player fall into hole
5: "\t", # player on goal
}
ACTION_LOOKUP = {
"still": 0,
"left": 1,
"down": 2,
"right": 3,
"up": 4,
}
INVALID_ACTION = 0
PENALTY_FOR_INVALID = -1
def __init__(
self,
max_steps: int = 8,
desc: Optional[str] = None,
is_slippery: bool = False,
size: int = 8,
p: float = 0.8,
seed: int = 42,
):
self.max_steps = max_steps or 8
self.desc: Union[str, np.ndarray, None] = desc
self.is_slippery = is_slippery
self.size = size
self.p = p
self.seed = seed
self.render_mode: Optional[str] = None
try:
import gymnasium as gym
except ImportError as e:
error_message = (
"Gymnasium is not installed. "
"Please install gymnasium first before "
"running the frozen_lake workflow. "
f"Error: {str(e)}"
)
raise ImportError(error_message) from e
if self.desc is None:
random_map, goal_position = generate_random_map(
size=self.size,
p=self.p,
seed=self.seed,
max_steps=self.max_steps,
)
else:
random_map = np.asarray(copy.deepcopy(self.desc), dtype="c")
goal_position = get_goal_position(random_map)
self.goal_position = goal_position
super().__init__(
desc=random_map[:],
is_slippery=self.is_slippery,
)
assert isinstance(self.desc, np.ndarray)
self.action_space = gym.spaces.Discrete(4, start=1)
self.map_kwargs = {
"size": size,
"p": p,
}
self.env_kwargs = {
"is_slippery": is_slippery,
"desc": copy.deepcopy(desc),
"seed": seed,
}
self.action_map = {
1: 0, # left
2: 1, # down
3: 2, # right
4: 3, # up
}
def _get_player_position(self) -> Tuple[int, int]:
return (self.s // self.ncol, self.s % self.ncol) # (row, col)
def step(self, action: str) -> Tuple[str, float, bool, Dict]:
"""Execute a step in the environment.
Maps custom action to gymnasium FrozenLakeEnv action and
takes the step. Checks if the action is effective (whether
player moves in the env).
Args:
action: The action to take.
Returns:
Tuple of (observation, reward, done, info).
"""
if self.success():
obs = self.render(mode="tiny_rgb_array")
assert isinstance(obs, str)
return obs, 1.0, True, {"action_is_effective": False}
        action_id: int = self.ACTION_LOOKUP.get(
            action.lower(),
            self.INVALID_ACTION,
        )
if (
action_id == self.INVALID_ACTION
or action_id not in self.action_map
):
obs = self.render(mode="tiny_rgb_array")
assert isinstance(obs, str)
return obs, 0.0, False, {"action_is_effective": False}
prev_player_position = int(self.s)
# Call parent class step method
# Note: GymFrozenLakeEnv is imported at module level
player_pos, reward, done, _, _ = super().step(
self.action_map[action_id],
)
obs = self.render(mode="tiny_rgb_array")
assert isinstance(obs, str)
return (
obs,
float(reward),
bool(done),
{"action_is_effective": prev_player_position != int(player_pos)},
)
def render(
self,
mode: str = "tiny_rgb_array",
) -> str | list[str] | np.ndarray:
"""Render the environment.
Args:
mode: Rendering mode. Options: "tiny_rgb_array", "list",
"state", "rgb_array", "ansi".
Returns:
Rendered observation based on the mode.
"""
assert mode in [
"tiny_rgb_array",
"list",
"state",
"rgb_array",
"ansi",
]
if mode in ["rgb_array", "ansi"]:
prev_render_mode = self.render_mode
self.render_mode = mode
obs = super().render()
self.render_mode = prev_render_mode
return obs
assert isinstance(self.desc, np.ndarray)
room_state = copy.deepcopy(self.desc)
# replace the position of start 'S' with 'F'
position_S = np.where(room_state == b"S")
room_state[position_S] = b"F"
# replace the position of the player with 'P'
position_P = self._get_player_position()
room_state[position_P] = b"P"
if mode == "state":
# transform 'S', 'F', 'H', 'G' to numpy integer array
room_state = np.vectorize(lambda x: self.MAP_LOOKUP[x])(room_state)
# add player in hole or player on goal
if self.desc[position_P] == b"H":
room_state[position_P] = 4
elif self.desc[position_P] == b"G":
room_state[position_P] = 5
return room_state
room_state = self.render(mode="state").tolist()
assert isinstance(room_state, list)
if mode == "list":
def lookup_list(cell: int) -> str:
return self.GRID_LOOKUP.get(cell, "?").strip("\t").strip()
return [
" ".join(lookup_list(cell) for cell in row)
for row in room_state
]
if mode == "tiny_rgb_array":
def lookup_tiny(cell: int) -> str:
return self.GRID_LOOKUP.get(cell, "?")
result = "\n".join(
"".join(lookup_tiny(cell) for cell in row)
for row in room_state
)
return result
# Default return for other modes
return ""
def reset(
self,
task: Optional[Dict] = None,
) -> tuple[str, Dict]:
"""Reset the environment with optional task parameters."""
task = task or {}
# Update parameters from task if provided
size = task.get("size", self.map_kwargs["size"])
p = task.get("p", self.map_kwargs["p"])
seed = task.get("seed", self.env_kwargs["seed"])
is_slippery = task.get(
"is_slippery",
self.env_kwargs["is_slippery"],
)
desc = task.get("desc", self.env_kwargs.get("desc"))
# Update instance variables
self.size = size
self.p = p
self.seed = seed
self.is_slippery = is_slippery
self.map_kwargs["size"] = size
self.map_kwargs["p"] = p
self.env_kwargs["seed"] = seed
self.env_kwargs["is_slippery"] = is_slippery
if desc is not None:
self.env_kwargs["desc"] = copy.deepcopy(desc)
if desc is None:
random_map, goal_position = generate_random_map(
size=size,
p=p,
seed=seed,
max_steps=self.max_steps,
)
else:
random_map = np.asarray(copy.deepcopy(desc), dtype="c")
goal_position = get_goal_position(random_map)
self.goal_position = goal_position
self.desc = random_map[:]
# Reinitialize parent class with new map
try:
import gymnasium as gym
super().__init__(
desc=random_map[:],
is_slippery=self.is_slippery,
)
assert isinstance(self.desc, np.ndarray)
self.action_space = gym.spaces.Discrete(4, start=1)
except ImportError as e:
error_message = (
"Gymnasium is not installed. "
"Please install gymnasium first before "
"running the frozen_lake workflow. "
f"Error: {str(e)}"
)
raise ImportError(error_message) from e
super().reset(seed=self.seed)
obs = self.render(mode="tiny_rgb_array")
assert isinstance(obs, str)
return obs, {}
def finished(self) -> bool:
"""Check if the episode is finished (goal or hole)."""
player_pos = self._get_player_position()
assert isinstance(self.desc, np.ndarray)
return self.desc[player_pos] in b"GH" # type: ignore
def success(self) -> bool:
"""Check if the agent has reached the goal (G)."""
player_pos = self._get_player_position()
assert isinstance(self.desc, np.ndarray)
return self.desc[player_pos] in b"G"

209
tuner/frozen_lake/_utils.py Normal file

@@ -0,0 +1,209 @@
# -*- coding: utf-8 -*-
"""
Utils for the FrozenLake environment.
Modified from rllm
"""
from collections import deque
from typing import Literal, Optional, Tuple
import numpy as np
from pydantic import BaseModel, Field
# Map gym state in integer
MAP_LOOKUP = {
b"P": 0,
b"F": 1,
b"H": 2,
b"G": 3,
}
# Define rules to transform to rendered text observation of the environment
GRID_LOOKUP = {
0: " P \t", # player
1: " _ \t", # frozen
2: " O \t", # hole
3: " G \t", # goal
4: " X \t", # player fall into hole
5: "\t", # player on goal
}
ACTION_LOOKUP = {
0: "None",
1: "Left",
2: "Down",
3: "Right",
4: "Up",
}
# Prompting format inspired by the RAGEN project
SYSTEM_PROMPT = """You are Qwen, created by Alibaba Cloud. \
You are a helpful assistant. You are walking on a frozen lake.
FrozenLake Quick Guide
Goal: Reach the goal (G). Player (P) and Goal (G) must overlap.
Symbols:
_ Frozen | O Hole | G Goal | P Player
Rules:
1. Avoid falling into holes (O).
2. Frozen tiles are slippery, you may move perpendicular to
your intended direction.
Valid Action (separated by | ):
Up | Down | Left | Right
Rewards:
Fall into hole: 0
Reach goal: +1.0
You will be provided the current observation, please decide on
the next Action.
You should show your thought process and then input the final
action in ``` ```.
You should only output the NEXT ACTION at each iteration in
the ``` ```. For example, if you want to move up, you should
output ```Up```.
You should plan ahead and aim to reach the goal in the minimum
number of steps.
You should be aware that frozen tiles can be slippery, but the
chance is small and you should not overthink it.
Please show your thinking process and put the final action in
``` ```. In every turn, the final action MUST be one of Up,
Down, Left, Right.
"""
class FrozenLakeAction(BaseModel):
"""Action model for FrozenLake environment."""
action: Literal["Up", "Down", "Left", "Right"] = Field(
description=(
"The action to take in the FrozenLake environment, "
"must be one of Up, Down, Left, Right"
),
)
def is_valid(board: list[list[str]], max_size: int, max_steps: int) -> bool:
    """BFS to check that a valid path exists.

    BFS reaches each cell via a shortest path first, so the step bound
    is checked correctly (a DFS could discover a cell via a long detour
    and then skip a shorter route to it).

    Args:
        board: The board representation as a list of lists.
        max_size: Maximum size of the board.
        max_steps: Maximum number of steps allowed.

    Returns:
        True if there's a valid path from start to goal within max_steps,
        False otherwise.
    """
    frontier: deque = deque()
    discovered = set()
    # find the start point
    start_r, start_c = np.where(np.array(board) == "S")
    frontier.append((start_r[0], start_c[0], 0))  # (row, col, steps)
    # BFS to check if there is a path from start to goal
    while frontier:
        r, c, steps = frontier.popleft()
        # reaching the goal from here takes steps + 1 moves
        if steps >= max_steps or (r, c) in discovered:
            continue
        discovered.add((r, c))
        for x, y in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            r_new, c_new = r + x, c + y
            if not (0 <= r_new < max_size and 0 <= c_new < max_size):
                continue
            if board[r_new][c_new] == "G":
                return True
            if board[r_new][c_new] != "H":
                frontier.append((r_new, c_new, steps + 1))
    return False
def generate_random_map(
size: int = 8,
p: float = 0.8,
seed: int = 0,
max_steps: int = 5,
) -> Tuple[list[str], Tuple[int, int]]:
"""Generates a random valid map (one that has a path from start to goal).
Args:
size: Size of each side of the grid.
p: Probability that a tile is frozen.
seed: Seed to ensure the generation of reproducible maps.
max_steps: Maximum number of steps allowed.
Returns:
A tuple containing a random valid map and the goal position (row, col).
"""
valid = False
board: list[list[str]] = [] # initialize to make pyright happy
try:
from gymnasium.utils import seeding
np_random, _ = seeding.np_random(seed)
except ImportError as exc:
raise ImportError(
"Gymnasium is not installed. "
"Please install gymnasium first before "
"running the frozen_lake workflow.",
) from exc
# generate random start and end points
while not valid:
p = min(1, p)
board = np_random.choice(
["F", "H"],
(size, size),
p=[p, 1 - p],
).tolist()
while True:
start_r = int(np_random.integers(0, size))
start_c = int(np_random.integers(0, size))
goal_r = int(np_random.integers(0, size))
goal_c = int(np_random.integers(0, size))
# Ensure start and goal are different positions
if (start_r, start_c) != (goal_r, goal_c):
break
board[start_r][start_c] = "S"
board[goal_r][goal_c] = "G"
valid = is_valid(board, size, max_steps)
return ["".join(x) for x in board], (goal_r, goal_c)
def get_goal_position(
random_map: np.ndarray,
) -> Optional[Tuple[int, int]]:
"""Get the goal position from a random map.
Args:
random_map: The map as a numpy array.
Returns:
Tuple of (row, col) if goal found, None otherwise.
"""
positions = np.argwhere(random_map == b"G")
if positions.size == 0:
return None # G not found
return tuple(positions[0]) # returns (row, col)
__all__ = [
"SYSTEM_PROMPT",
"FrozenLakeAction",
"generate_random_map",
"get_goal_position",
]

53
tuner/frozen_lake/config.yaml Normal file

@@ -0,0 +1,53 @@
project: "AgentScope" # Project name
name: "FrozenLake" # Experiment name
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} # Directory to save model checkpoints
algorithm:
algorithm_type: multi_step_grpo # GRPO series for multi-step scenario
repeat_times: 16 # Number of rollouts per prompt for advantage estimation
kl_loss_fn: "low_var_kl"
kl_loss_fn_args:
kl_coef: 0 # KL divergence coefficient
advantage_fn_args:
epsilon: 1e-6 # Small value for numerical stability
std_threshold: 0.0001 # Threshold for standard deviation
optimizer:
lr: 1e-6 # Learning rate
model:
model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-3B-Instruct} # Base model path
max_prompt_tokens: 23552 # Max tokens for prompt
max_response_tokens: 2048 # Max tokens per response
max_model_len: 25600 # Max context length
temperature: 1.0 # Sampling temperature
buffer:
total_epochs: 5 # Total training epochs
batch_size: 32 # Batch size per explore step
train_batch_size: 1024 # Total experiences per training step
trainer_input:
experience_buffer:
name: experience_buffer
storage_type: queue
max_read_timeout: 7200 # Max timeout for reading from buffer (seconds)
replay_buffer:
enable: true # Enable experience replay
priority_fn: linear_decay # Priority function for replay buffer
priority_fn_args:
decay: 0.1 # Decay rate for priority function
explorer:
runner_per_model: 16 # Number of runners per model
rollout_model:
engine_num: 6 # Number of vLLM engines for rollout model
tensor_parallel_size: 1 # TP size per engine for rollout model
enable_openai_api: true # Enable OpenAI-compatible API
enable_history: true # Enable conversation history
enable_auto_tool_choice: true # Enable automatic tool selection
tool_call_parser: hermes # Parser for tool calls
trainer:
save_interval: 100 # Save checkpoint every N steps
use_dynamic_bsz: true # Use dynamic batch size
grad_clip: 1.0 # Gradient clipping value
max_token_len_per_gpu: 25600 # Max token length per GPU
ulysses_sequence_parallel_size: 2 # Sequence parallel size for Ulysses
synchronizer:
sync_style: dynamic_by_explorer # Sync triggered dynamically by explorer
sync_interval: 1 # Sync every N steps
sync_timeout: 1200 # Timeout for synchronization (seconds)

tuner/frozen_lake/critic_rewards_mean.png (binary image, 62 KiB) not shown

131
tuner/frozen_lake/get_frozenlake_data.py Normal file

@@ -0,0 +1,131 @@
# -*- coding: utf-8 -*-
"""
Modified from rllm
"""
import argparse
import os
import numpy as np
import pandas as pd
DEFAULT_DATA_PATH = os.path.join(
os.path.dirname(os.path.abspath(__file__)),
"..",
"data",
"frozenlake",
)
def save_dataset_to_local(
data_path: str,
data: list[dict],
split: str = "default",
) -> str:
"""Save dataset directly to local data_path.
Args:
data_path: Path to save the dataset
data: List of dictionaries containing the dataset examples
split: Split name (e.g., 'train', 'test', 'default')
Returns:
str: Path to the saved parquet file
"""
os.makedirs(data_path, exist_ok=True)
# Convert to DataFrame and save
data_df = pd.DataFrame(data)
dataset_path = os.path.join(data_path, f"{split}.parquet")
data_df.to_parquet(dataset_path)
print(
f"Saved dataset frozenlake split '{split}' "
f"with {len(data)} examples at {dataset_path}. "
f"Make sure to set the environment variable "
f"<TRINITY_TASKSET_PATH> to {data_path}.",
)
return dataset_path
def prepare_frozenlake_data(
data_path: str,
train_size: int = 10000,
test_size: int = 100,
map_max_size: int = 6,
) -> tuple[list[dict], list[dict]]:
"""
Prepare and save FrozenLake datasets for training and testing.
Args:
data_path (str): Path to save the dataset
train_size (int): Number of training examples to generate
test_size (int): Number of test examples to generate
map_max_size (int): Maximum size of the map
Returns:
tuple: (train_data, test_data) - Lists of data dictionaries
"""
# Set random seed for reproducibility
np.random.seed(42)
# Generate random parameters for train and test sets
train_seeds = np.random.randint(0, 100000, size=train_size)
test_seeds = np.random.randint(0, 100000, size=test_size)
    # +1 because randint's upper bound is exclusive; keeps map_max_size reachable
    train_sizes = np.random.randint(2, map_max_size + 1, size=train_size)
    test_sizes = np.random.randint(2, map_max_size + 1, size=test_size)
train_ps = np.random.uniform(0.6, 0.85, size=train_size)
test_ps = np.random.uniform(0.6, 0.85, size=test_size)
def frozenlake_process_fn(
seed: int,
size: int,
p: float,
idx: int,
) -> dict:
"""Process function to create FrozenLake task instances."""
return {
"seed": seed,
"size": size,
"p": p,
"index": idx,
"uid": f"{seed}_{size}_{p}",
}
# Create train and test data
train_data_list = [
frozenlake_process_fn(seed, train_sizes[idx], train_ps[idx], idx)
for idx, seed in enumerate(train_seeds)
]
test_data_list = [
frozenlake_process_fn(seed, test_sizes[idx], test_ps[idx], idx)
for idx, seed in enumerate(test_seeds)
]
# Save datasets directly to local DATA_PATH
save_dataset_to_local(data_path, train_data_list, "train")
save_dataset_to_local(data_path, test_data_list, "test")
return train_data_list, test_data_list
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--local_dir", default=DEFAULT_DATA_PATH)
parser.add_argument("--train_size", type=int, default=10000)
parser.add_argument("--test_size", type=int, default=100)
parser.add_argument("--map_max_size", type=int, default=6)
args = parser.parse_args()
train_data, test_data = prepare_frozenlake_data(
data_path=args.local_dir,
train_size=args.train_size,
test_size=args.test_size,
map_max_size=args.map_max_size,
)
print(f"Train dataset: {len(train_data)} examples")
print(f"Test dataset: {len(test_data)} examples")
print("Sample train example:", train_data[0])
print("Sample test example:", test_data[0])

151
tuner/frozen_lake/main.py Normal file

@@ -0,0 +1,151 @@
# -*- coding: utf-8 -*-
"""Example of training a FrozenLake agent with Trinity-RFT."""
import os
from typing import Dict
from _frozenlake_agent import FrozenLakeAgent
from _frozenlake_env import FrozenLakeEnv
from agentscope.message import Msg
from agentscope.tuner import (
tune,
WorkflowOutput,
DatasetConfig,
TunerModelConfig,
AlgorithmConfig,
)
from agentscope.model import ChatModelBase
async def run_frozen_lake(
task: Dict,
model: ChatModelBase,
auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
"""A workflow function using the FrozenLake agent to solve tasks.
Args:
task (Dict): The task to be solved, containing environment parameters
like size, p, seed, is_slippery, etc.
model (ChatModelBase): The language model to use.
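        auxiliary_models (Dict[str, ChatModelBase]): Auxiliary models;
            this workflow expects none.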
Returns:
WorkflowOutput: The workflow output containing the reward, response and
metrics.
"""
assert len(auxiliary_models) == 0, "No auxiliary models are needed"
# Extract workflow arguments from task or use defaults
workflow_args = task.get("workflow_args", {})
if not workflow_args:
workflow_args = task
env_max_steps = workflow_args.get("env_max_steps", 8)
agent_max_steps = workflow_args.get("agent_max_steps", 10)
is_slippery = workflow_args.get("is_slippery", False)
desc = workflow_args.get("desc", None)
# Extract task-specific arguments (for environment generation)
size = task.get("size", 8)
p = task.get("p", 0.8)
seed = task.get("seed", 42)
# Initialize agent and environment
agent = FrozenLakeAgent(model=model, max_steps=agent_max_steps)
env = FrozenLakeEnv(
max_steps=env_max_steps,
desc=desc,
is_slippery=is_slippery,
size=size,
p=p,
seed=seed,
)
# Reset environment with task parameters
observation, _ = env.reset(task)
observation_str = str(observation)
rewards = []
step_count = 0
done = False
terminate_reason = None
# Run agent-environment interaction loop
for _ in range(agent_max_steps):
step_count += 1
try:
# get prompt
prompt = agent.get_prompt(observation_str)
response = await agent.reply(msg=Msg("user", prompt, role="user"))
# record action and observation
action = agent.get_action(response)
agent.update_state(action=action, observation=observation_str)
except Exception as e:
terminate_reason = f"agent_error: {str(e)}"
break
# environment step
observation, reward, done, _ = env.step(action)
observation_str = str(observation)
rewards.append(reward)
if done:
terminate_reason = "success" if env.success() else "hole"
break
if terminate_reason is None:
terminate_reason = "max_steps_reached"
final_reward = sum(rewards)
final_observation = observation_str
# Create response message with environment information
response_content = (
f"Final observation:\n{final_observation}\n"
f"Total reward: {final_reward}\n"
f"Steps taken: {step_count}\n"
f"Terminate reason: {terminate_reason}"
)
final_response = Msg("assistant", response_content, role="assistant")
return WorkflowOutput(
reward=final_reward,
response=final_response,
metrics={
"env_steps": float(step_count),
"env_done": float(done),
},
)
if __name__ == "__main__":
dataset = DatasetConfig(
path="/path/to/frozenlake",
split="train",
)
tuner_model = TunerModelConfig(
model_path="Qwen/Qwen2.5-3B-Instruct",
max_model_len=25600,
max_tokens=2048,
inference_engine_num=6,
reasoning_parser=None,
)
algorithm = AlgorithmConfig(
algorithm_type="multi_step_grpo",
group_size=16,
batch_size=32,
learning_rate=1e-6,
)
config_path = os.path.join(
os.path.dirname(__file__),
"config.yaml",
) # define some default parameters
tune(
workflow_func=run_frozen_lake,
model=tuner_model,
train_dataset=dataset,
algorithm=algorithm,
config_path=config_path,
)