Add examples for frozenlake and emailsearch (#94)
271
tuner/frozen_lake/README.md
Normal file
@@ -0,0 +1,271 @@
# Training FrozenLake Agent with RL using AgentScope-Tuner

## Summary

This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task using [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT). The agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes through multi-step interactions with the environment.

## Task Setting

### Agent Goal
The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:
- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps

### Agent Type
The agent is implemented as a **ReActAgent** (Reasoning and Acting Agent) that:
- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode

### Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:
- **Grid-based navigation**: Randomly generated maps with configurable size (2x2 to 6x6)
- **Tile types**:
  - `S`: Start position
  - `F`: Frozen tile (safe to walk on)
  - `H`: Hole (terminates episode with reward 0)
  - `G`: Goal (terminates episode with reward +1.0)
- **Action space**: Discrete actions (Up, Down, Left, Right)
- **Reward structure**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal
- **Observations**: Text-based grid representation showing current player position

The agent does not use external tools. It interacts directly with the environment through:
- `env.reset(task)`: Initialize environment with task parameters
- `env.step(action)`: Execute action and receive observation, reward, and done flag
- `env.render()`: Get text representation of current state
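For concreteness, here is a minimal interaction sketch with the wrapper (action strings are lowercase, matching the implementation in `_frozenlake_env.py`; the parameter values are illustrative):

```python
from _frozenlake_env import FrozenLakeEnv

# Build a small deterministic environment and take one step.
env = FrozenLakeEnv(max_steps=8, size=4, p=0.8, seed=42)
observation, _ = env.reset({"seed": 42, "size": 4, "p": 0.8})
observation, reward, done, info = env.step("left")  # "up" / "down" / "left" / "right"
print(env.render())  # text grid: P (player), _ (frozen), O (hole), G (goal)
```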
## Dataset Preparation

The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:
- `seed`: Random seed for reproducible map generation
- `size`: Grid size (randomly sampled from 2 to `map_max_size`, e.g., 4x4, 6x6)
- `p`: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- `index`: Sample index
- `uid`: Unique identifier combining seed, size, and p

Run the data preparation script to generate training and test datasets:

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

This will create parquet files in the specified directory:

```
/path/to/frozenlake_dataset/
├── train.parquet  # 10000 training samples
└── test.parquet   # 100 test samples
```

Each sample looks like:

```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (`env_max_steps=8`), filtering out unsolvable tasks.
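To sanity-check the generated files, you can load a split with pandas (a minimal sketch; `pandas` with parquet support is assumed, and the path is illustrative):

```python
import pandas as pd

# Load the generated training split and inspect one task's parameters.
df = pd.read_parquet("/path/to/frozenlake_dataset/train.parquet")
print(len(df))     # 10000
print(df.iloc[0])  # seed, size, p, index, uid
```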
## Code Implementation

This section provides a high-level overview of the code implementation. For details, please refer to the source code.

### High-level Overview

The implementation consists of three main components:

1. **Agent** (`FrozenLakeAgent`): Extends `ReActAgent` to handle multi-step navigation
2. **Environment** (`FrozenLakeEnv`): Wraps Gymnasium's FrozenLake environment
3. **Workflow** (`run_frozen_lake`): Orchestrates the agent-environment interaction loop

### Agent Workflow

The workflow function `run_frozen_lake` implements the agent-environment interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )
```

**Key characteristics:**
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully

### Reward Function

No separate judge function is needed. The reward comes directly from the environment:
- 1.0: Agent successfully reaches the goal (G)
- 0.0: Agent falls into a hole (H) or fails to reach the goal within the maximum steps

The reward is computed as the sum of step rewards throughout the episode. The workflow returns:
- `reward`: Final cumulative reward
- `response`: Final response message containing observation, total reward, steps taken, and termination reason
- `metrics`: Additional metrics including `env_steps` (number of steps taken) and `env_done` (whether the episode terminated)

### Implementation Details

The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: Initialize the environment with task parameters
- `step(action)`: Execute an action and return (observation, reward, done, info)
- `render()`: Return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: Reply to a message and return a response message (inherited from AgentScope)
- `get_prompt(observation)`: Generate a prompt from the current observation
- `get_action(response)`: Parse the model's response to extract an action (Up/Down/Left/Right), as sketched after this list
- `update_state(action, observation)`: Update internal state after each step

See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for implementation details.
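Action parsing follows the ``` ``` convention required by the system prompt. A minimal sketch of the logic (the full version lives in `_frozenlake_agent.py`):

```python
import re

VALID_ACTIONS = {"up", "down", "left", "right"}

def parse_action(response_text: str) -> str:
    """Take the last ``` ``` block as the action; fall back to "still"."""
    matches = re.findall(r"```(.*?)```", response_text, re.DOTALL)
    action = matches[-1].strip().lower() if matches else "still"
    return action if action in VALID_ACTIONS else "still"

assert parse_action("I will go up. ```Up```") == "up"
```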
### Step 4: Use `tune` to train the workflow

```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run

### Prerequisites

- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source
- Install gymnasium for the FrozenLake environment:

```bash
pip install gymnasium[toy_text]
```

- Download the model checkpoint (example):

```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```

### Step 1: Prepare the Dataset

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Update the dataset path in `main.py` to point to your generated dataset directory.
### Step 2: Configure the Training

Key configuration options can be set in the code, including the following (the corresponding objects from `main.py` are shown after this list):

**Algorithm Configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: Number of rollouts per task used for group advantage estimation (default: 16)
- `batch_size`: Batch size for training (default: 32)
- `learning_rate`: Learning rate (default: 1e-6)

**Model Configuration** (`TunerModelConfig`):
- `model_path`: Path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: Maximum model context length (default: 25600)
- `max_tokens`: Maximum tokens for response generation (default: 2048)
- `inference_engine_num`: Number of inference engines (default: 6, using 6 GPUs for inference)

**Dataset Configuration** (`DatasetConfig`):
- `path`: Path to the dataset (default: `/path/to/frozenlake`)
- `split`: Split of the dataset (default: `train`)

Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in [config.yaml](./config.yaml).
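For reference, these options are constructed as the following objects in `main.py`:

```python
from agentscope.tuner import AlgorithmConfig, TunerModelConfig

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",  # GRPO variant for multi-step tasks
    group_size=16,
    batch_size=32,
    learning_rate=1e-6,
)
tuner_model = TunerModelConfig(
    model_path="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=25600,
    max_tokens=2048,
    inference_engine_num=6,
)
```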
### Step 3: Set Up Ray Cluster

Set up a [Ray](https://github.com/ray-project/ray) cluster:

```bash
ray start --head
# for a multi-node setup, run the following command on each worker node
# ray start --address=<master_address>
```
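Optionally, verify the cluster is up before launching training (a small sanity check using Ray's standard Python API):

```python
import ray

# Connect to the running cluster and print its aggregate resources.
ray.init(address="auto")
print(ray.cluster_resources())  # expect your GPU count under "GPU"
ray.shutdown()
```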
### Step 4: Run the Training Script

```bash
python main.py
```

Training will start, and you can monitor progress through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results

### Training Reward Curve

The reward curve during training shows the agent's learning progress:

![Training Reward Curve](./critic_rewards_mean.png)

The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output

An example of agent output is given below:

````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````
250
tuner/frozen_lake/README_zh.md
Normal file
@@ -0,0 +1,250 @@
# Training a FrozenLake Agent with AgentScope-Tuner

## Summary

This example shows how to use AgentScope-Tuner with [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT) to apply reinforcement fine-tuning to the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task. The agent must walk from the start to the goal on a frozen lake grid, avoid the holes, and finish within a limited number of steps.

## Task Setting

### Agent Goal
The agent must travel from the start (S) to the goal (G) on the frozen lake grid, while it:
- Plans a path across frozen tiles (F) to reach the goal
- Avoids holes (H), which end the episode with reward 0
- Finishes the task within the step limit

### Agent Type
The agent is implemented as a **ReActAgent** whose behavior includes:
- Observing the current state of the frozen lake grid
- Reasoning about the best next action
- Executing actions (Up, Down, Left, Right) to move through the environment
- Maintaining internal state across the multi-step interaction

### Environment
The environment is based on Gymnasium's FrozenLake and provides:
- **Grid navigation**: randomly generated maps from 2x2 to 6x6
- **Tile types**:
  - `S`: start
  - `F`: frozen tile (walkable)
  - `H`: hole (reward 0, ends the episode)
  - `G`: goal (reward +1.0, ends the episode)
- **Action space**: discrete actions (Up, Down, Left, Right)
- **Reward design**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal within the step limit
- **Observations**: a text grid representation showing the current player position

The agent uses no external tools and interacts with the environment directly through:
- `env.reset(task)`: initialize the environment from task parameters
- `env.step(action)`: execute an action and return the observation, reward, and done flag
- `env.render()`: return a text representation of the current state

## Dataset Preparation

The dataset contains task parameters for generating FrozenLake environments. Each sample includes:
- `seed`: random seed for reproducible maps
- `size`: grid size (random between 2 and `map_max_size`, e.g., 4x4, 6x6)
- `p`: probability that a tile is frozen (sampled uniformly from 0.6 to 0.85); the remaining tiles are holes
- `index`: sample index
- `uid`: unique ID combining seed, size, and p

Run the data preparation script to generate the training and test sets:

```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Example of the generated directory layout:
```
/path/to/frozenlake_dataset/
├── train.parquet  # 10000 training samples
└── test.parquet   # 100 test samples
```

Example sample:
```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The script filters out unsolvable maps, ensuring a feasible path from start to goal within the maximum number of steps (`env_max_steps=8`).

## Code Implementation

This section gives a high-level overview of the implementation. For details, please refer to the source code.

### High-level Overview
The implementation consists of three parts:
1. **Agent** (`FrozenLakeAgent`): extends `ReActAgent` and handles the multi-step interaction
2. **Environment** (`FrozenLakeEnv`): wraps Gymnasium's FrozenLake
3. **Workflow** (`run_frozen_lake`): orchestrates the agent-environment interaction

### Workflow
`run_frozen_lake` implements the multi-step interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create the agent and the environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={"env_steps": float(step_count), "env_done": float(done)},
    )
```

**Key characteristics:**
- Multi-step interaction: multiple actions within a single episode, not single-turn QA
- State tracking: records the current step and the last action and observation
- Error handling: invalid actions and exceptions are caught and handled

### Reward Function
No separate judge is needed; the reward comes directly from the environment:
- 1.0: the goal is reached
- 0.0: the agent falls into a hole or fails to reach the goal within the step limit

The workflow returns:
- `reward`: the cumulative reward
- `response`: the final message containing the observation, total reward, step count, and termination reason
- `metrics`: `env_steps` (number of steps) and `env_done` (whether the episode terminated)

### Implementation Details

The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:
- `reset(task)`: initialize the environment from task parameters
- `step(action)`: execute an action and return (observation, reward, done, info)
- `render()`: return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:
- `reply(msg)`: reply to a message and return a response message (inherited from AgentScope)
- `get_prompt(observation)`: build a prompt from the current observation
- `get_action(response)`: parse the model response to extract an action (Up/Down/Left/Right)
- `update_state(action, observation)`: update internal state after each step

See [_frozenlake_env.py](./_frozenlake_env.py) and [_frozenlake_agent.py](./_frozenlake_agent.py) for details.

### Step 4: Train the workflow with `tune`

```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run

### Prerequisites
- At least 2 NVIDIA GPUs with CUDA ≥ 12.8
- Install the latest version from source following the [Trinity-RFT installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)
- Install gymnasium for the FrozenLake environment:
```bash
pip install gymnasium[toy_text]
```
- Download the model weights (example):
```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
```

### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```
Change the dataset path in `main.py` to your generated directory.

### Step 2: Configure the Training

Key configuration options can be set in the code, including:

**Algorithm configuration** (`AlgorithmConfig`):
- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: number of rollouts per task for group advantage estimation (default 16)
- `batch_size`: batch size (default 32)
- `learning_rate`: learning rate (default 1e-6)

**Model configuration** (`TunerModelConfig`):
- `model_path`: base model path (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: maximum context length (default 25600)
- `max_tokens`: maximum response length (default 2048)
- `inference_engine_num`: number of inference engines (default 6, i.e., 6 GPUs for inference)

**Dataset configuration** (`DatasetConfig`):
- `path`: dataset path (default `/path/to/frozenlake`)
- `split`: dataset split (default `train`)

Adjust these parameters according to your hardware resources and training needs. Other parameters can be specified in [config.yaml](./config.yaml).

### Step 3: Set Up a Ray Cluster

Set up a [Ray](https://github.com/ray-project/ray) cluster:
```bash
ray start --head
# for a multi-node setup, run the following command on each worker node
# ray start --address=<master_address>
```

### Step 4: Run the Training Script
```bash
python main.py
```
Training will start, and progress can be monitored through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results

### Training Reward Curve

The reward curve during training shows the agent's learning progress:

![Training Reward Curve](./critic_rewards_mean.png)

The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output

An example of agent output:
````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````
102
tuner/frozen_lake/_frozenlake_agent.py
Normal file
@@ -0,0 +1,102 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import re

from _utils import SYSTEM_PROMPT, FrozenLakeAction  # pylint: disable=E0611

from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import OpenAIChatModel


INVALID_ACTION = "still"
VALID_ACTIONS = {
    "left": 1,
    "down": 2,
    "right": 3,
    "up": 4,
}


class FrozenLakeAgent(ReActAgent):
    """Agent for FrozenLake environment."""

    def __init__(self, model: OpenAIChatModel, max_steps: int = 20):
        super().__init__(
            name="frozenlake_agent",
            model=model,
            sys_prompt=SYSTEM_PROMPT,
            formatter=OpenAIChatFormatter(),
            max_iters=1,
        )
        self.response_structure = FrozenLakeAction
        self.current_step = 0
        self.last_action = None
        self.last_observation = None
        self.max_steps = max_steps

    def get_prompt(self, observation: str) -> str:
        """Get prompt for the agent based on current observation."""
        prompt = (
            f"Current Observation ({self.current_step}): \n"
            + observation
            + "\n"
            + (
                "You have not achieved the goal, P has not reached G yet. "
                "Please give the next action."
            )
        )
        if self.current_step > 0 and self.last_action is not None:
            if self.last_observation == observation:
                prompt += (
                    "\nYour last response is invalid. "
                    "Your position didn't change at all. "
                    "You may need to recheck your thinking process, "
                    "action outputted, and the format of response. "
                    "Remember, you should only output the NEXT ACTION "
                    "at each iteration in the ``` ```. "
                    "For example, if you want to move up, "
                    "you should output ```Up```."
                )

        if (
            self.max_steps is not None
            and self.max_steps - self.current_step > 0
        ):
            remaining = self.max_steps - self.current_step
            prompt += (
                f"\nThe maximum number of steps remaining is {remaining}."
            )

        return prompt

    def get_action(self, msg: Msg) -> str:
        """Extract action from agent response message."""
        response: str = (
            msg.content
            if isinstance(msg.content, str)
            else msg.content[0].get("text")
        )
        action = INVALID_ACTION

        matches = re.findall(r"```(.*?)```", response, re.DOTALL)

        if matches:
            last_match_content = matches[-1].strip()
            action = last_match_content.lower()
            if action not in VALID_ACTIONS:
                action = INVALID_ACTION

        return action

    def update_state(self, action: str, observation: str) -> None:
        """Update agent state with action and observation."""
        self.last_action = action
        self.last_observation = observation
        self.current_step += 1

    async def reset(self) -> None:
        """Reset agent state for a new episode."""
        self.current_step = 0
        self.last_action = None
        self.last_observation = None
        await self.memory.clear()
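
# Usage sketch (mirrors how main.py drives the agent; assumes a configured
# chat `model` and an observation string `obs`):
#
#     agent = FrozenLakeAgent(model=model, max_steps=10)
#     msg = Msg("user", agent.get_prompt(obs), role="user")
#     response = await agent.reply(msg=msg)
#     action = agent.get_action(response)  # e.g. "```Up```" -> "up"
#     agent.update_state(action=action, observation=obs)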
316
tuner/frozen_lake/_frozenlake_env.py
Normal file
@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
"""Adapted from Trinity-RFT"""
import copy
from typing import Dict, Optional, Tuple, Union

import numpy as np

try:
    from gymnasium.envs.toy_text.frozen_lake import (
        FrozenLakeEnv as GymFrozenLakeEnv,
    )
except ImportError:
    GymFrozenLakeEnv = object
from _utils import (  # pylint: disable=E0611
    generate_random_map,
    get_goal_position,
)


class FrozenLakeEnv(GymFrozenLakeEnv):
    """FrozenLake environment wrapper."""

    # Map gym state in integer
    MAP_LOOKUP = {
        b"P": 0,
        b"F": 1,
        b"H": 2,
        b"G": 3,
    }

    # Define rules to transform to rendered text observation of the environment
    GRID_LOOKUP = {
        0: " P \t",  # player
        1: " _ \t",  # frozen
        2: " O \t",  # hole
        3: " G \t",  # goal
        4: " X \t",  # player fall into hole
        5: " √ \t",  # player on goal
    }

    ACTION_LOOKUP = {
        "still": 0,
        "left": 1,
        "down": 2,
        "right": 3,
        "up": 4,
    }

    INVALID_ACTION = 0
    PENALTY_FOR_INVALID = -1

    def __init__(
        self,
        max_steps: int = 8,
        desc: Optional[str] = None,
        is_slippery: bool = False,
        size: int = 8,
        p: float = 0.8,
        seed: int = 42,
    ):
        self.max_steps = max_steps or 8
        self.desc: Union[str, np.ndarray, None] = desc
        self.is_slippery = is_slippery
        self.size = size
        self.p = p
        self.seed = seed
        self.render_mode: Optional[str] = None
        try:
            import gymnasium as gym
        except ImportError as e:
            error_message = (
                "Gymnasium is not installed. "
                "Please install gymnasium first before "
                "running the frozen_lake workflow. "
                f"Error: {str(e)}"
            )
            raise ImportError(error_message) from e

        if self.desc is None:
            random_map, goal_position = generate_random_map(
                size=self.size,
                p=self.p,
                seed=self.seed,
                max_steps=self.max_steps,
            )
        else:
            random_map = np.asarray(copy.deepcopy(self.desc), dtype="c")
            goal_position = get_goal_position(random_map)

        self.goal_position = goal_position

        super().__init__(
            desc=random_map[:],
            is_slippery=self.is_slippery,
        )
        assert isinstance(self.desc, np.ndarray)
        self.action_space = gym.spaces.Discrete(4, start=1)

        self.map_kwargs = {
            "size": size,
            "p": p,
        }
        self.env_kwargs = {
            "is_slippery": is_slippery,
            "desc": copy.deepcopy(desc),
            "seed": seed,
        }

        self.action_map = {
            1: 0,  # left
            2: 1,  # down
            3: 2,  # right
            4: 3,  # up
        }

    def _get_player_position(self) -> Tuple[int, int]:
        return (self.s // self.ncol, self.s % self.ncol)  # (row, col)

    def step(self, action: str) -> Tuple[str, float, bool, Dict]:
        """Execute a step in the environment.

        Maps custom action to gymnasium FrozenLakeEnv action and
        takes the step. Checks if the action is effective (whether
        player moves in the env).

        Args:
            action: The action to take.

        Returns:
            Tuple of (observation, reward, done, info).
        """
        if self.success():
            obs = self.render(mode="tiny_rgb_array")
            assert isinstance(obs, str)
            return obs, 1.0, True, {"action_is_effective": False}

        action_id: int = self.ACTION_LOOKUP.get(action.lower(), 0)

        if not action_id:
            action_id = self.INVALID_ACTION

        if (
            action_id == self.INVALID_ACTION
            or action_id not in self.action_map
        ):
            obs = self.render(mode="tiny_rgb_array")
            assert isinstance(obs, str)
            return obs, 0.0, False, {"action_is_effective": False}

        prev_player_position = int(self.s)

        # Call parent class step method
        # Note: GymFrozenLakeEnv is imported at module level
        player_pos, reward, done, _, _ = super().step(
            self.action_map[action_id],
        )

        obs = self.render(mode="tiny_rgb_array")
        assert isinstance(obs, str)
        return (
            obs,
            float(reward),
            bool(done),
            {"action_is_effective": prev_player_position != int(player_pos)},
        )

    def render(
        self,
        mode: str = "tiny_rgb_array",
    ) -> str | list[str] | np.ndarray:
        """Render the environment.

        Args:
            mode: Rendering mode. Options: "tiny_rgb_array", "list",
                "state", "rgb_array", "ansi".

        Returns:
            Rendered observation based on the mode.
        """
        assert mode in [
            "tiny_rgb_array",
            "list",
            "state",
            "rgb_array",
            "ansi",
        ]
        if mode in ["rgb_array", "ansi"]:
            prev_render_mode = self.render_mode
            self.render_mode = mode
            obs = super().render()
            self.render_mode = prev_render_mode
            return obs
        assert isinstance(self.desc, np.ndarray)
        room_state = copy.deepcopy(self.desc)

        # replace the position of start 'S' with 'F'
        position_S = np.where(room_state == b"S")
        room_state[position_S] = b"F"

        # replace the position of the player with 'P'
        position_P = self._get_player_position()
        room_state[position_P] = b"P"

        if mode == "state":
            # transform 'S', 'F', 'H', 'G' to numpy integer array
            room_state = np.vectorize(lambda x: self.MAP_LOOKUP[x])(room_state)
            # add player in hole or player on goal
            if self.desc[position_P] == b"H":
                room_state[position_P] = 4
            elif self.desc[position_P] == b"G":
                room_state[position_P] = 5
            return room_state

        room_state = self.render(mode="state").tolist()
        assert isinstance(room_state, list)

        if mode == "list":

            def lookup_list(cell: int) -> str:
                return self.GRID_LOOKUP.get(cell, "?").strip("\t").strip()

            return [
                " ".join(lookup_list(cell) for cell in row)
                for row in room_state
            ]

        if mode == "tiny_rgb_array":

            def lookup_tiny(cell: int) -> str:
                return self.GRID_LOOKUP.get(cell, "?")

            result = "\n".join(
                "".join(lookup_tiny(cell) for cell in row)
                for row in room_state
            )
            return result

        # Default return for other modes
        return ""

    def reset(
        self,
        task: Optional[Dict] = None,
    ) -> tuple[str, Dict]:
        """Reset the environment with optional task parameters."""
        task = task or {}
        # Update parameters from task if provided
        size = task.get("size", self.map_kwargs["size"])
        p = task.get("p", self.map_kwargs["p"])
        seed = task.get("seed", self.env_kwargs["seed"])
        is_slippery = task.get(
            "is_slippery",
            self.env_kwargs["is_slippery"],
        )
        desc = task.get("desc", self.env_kwargs.get("desc"))

        # Update instance variables
        self.size = size
        self.p = p
        self.seed = seed
        self.is_slippery = is_slippery
        self.map_kwargs["size"] = size
        self.map_kwargs["p"] = p
        self.env_kwargs["seed"] = seed
        self.env_kwargs["is_slippery"] = is_slippery
        if desc is not None:
            self.env_kwargs["desc"] = copy.deepcopy(desc)

        if desc is None:
            random_map, goal_position = generate_random_map(
                size=size,
                p=p,
                seed=seed,
                max_steps=self.max_steps,
            )
        else:
            random_map = np.asarray(copy.deepcopy(desc), dtype="c")
            goal_position = get_goal_position(random_map)

        self.goal_position = goal_position
        self.desc = random_map[:]

        # Reinitialize parent class with new map
        try:
            import gymnasium as gym

            super().__init__(
                desc=random_map[:],
                is_slippery=self.is_slippery,
            )
            assert isinstance(self.desc, np.ndarray)
            self.action_space = gym.spaces.Discrete(4, start=1)
        except ImportError as e:
            error_message = (
                "Gymnasium is not installed. "
                "Please install gymnasium first before "
                "running the frozen_lake workflow. "
                f"Error: {str(e)}"
            )
            raise ImportError(error_message) from e

        super().reset(seed=self.seed)
        obs = self.render(mode="tiny_rgb_array")
        assert isinstance(obs, str)
        return obs, {}

    def finished(self) -> bool:
        """Check if the episode is finished (goal or hole)."""
        player_pos = self._get_player_position()
        assert isinstance(self.desc, np.ndarray)
        return self.desc[player_pos] in b"GH"  # type: ignore

    def success(self) -> bool:
        """Check if the agent has reached the goal (G)."""
        player_pos = self._get_player_position()
        assert isinstance(self.desc, np.ndarray)
        return self.desc[player_pos] in b"G"
209
tuner/frozen_lake/_utils.py
Normal file
@@ -0,0 +1,209 @@
# -*- coding: utf-8 -*-
"""
Utils for the FrozenLake environment.
Modified from rllm
"""

from typing import Literal, Optional, Tuple

import numpy as np
from pydantic import BaseModel, Field

# Map gym state in integer
MAP_LOOKUP = {
    b"P": 0,
    b"F": 1,
    b"H": 2,
    b"G": 3,
}

# Define rules to transform to rendered text observation of the environment
GRID_LOOKUP = {
    0: " P \t",  # player
    1: " _ \t",  # frozen
    2: " O \t",  # hole
    3: " G \t",  # goal
    4: " X \t",  # player fall into hole
    5: " √ \t",  # player on goal
}

ACTION_LOOKUP = {
    0: "None",
    1: "Left",
    2: "Down",
    3: "Right",
    4: "Up",
}

# Prompting format inspired by the RAGEN project
SYSTEM_PROMPT = """You are Qwen, created by Alibaba Cloud. \
You are a helpful assistant. You are walking on a frozen lake.

FrozenLake Quick Guide
Goal: Reach the goal (G). Player (P) and Goal (G) must overlap.

Symbols:
_ Frozen | O Hole | G Goal | P Player

Rules:
1. Avoid falling into holes (O).
2. Frozen tiles are slippery, you may move perpendicular to
your intended direction.

Valid Action (separated by | ):
Up | Down | Left | Right

Rewards:
Fall into hole: 0
Reach goal: +1.0

You will be provided the current observation, please decide on
the next Action.
You should show your thought process and then input the final
action in ``` ```.
You should only output the NEXT ACTION at each iteration in
the ``` ```. For example, if you want to move up, you should
output ```Up```.
You should plan ahead and need to achieve it in minimum number
of steps.
You should be aware that frozen tiles can be slippery, but the
chance is small and you should not overthink it.

Please show your thinking process and put the final action in
``` ```. In every turn, the final action MUST be one of Up,
Down, Left, Right.
"""


class FrozenLakeAction(BaseModel):
    """Action model for FrozenLake environment."""

    action: Literal["Up", "Down", "Left", "Right"] = Field(
        description=(
            "The action to take in the FrozenLake environment, "
            "must be one of Up, Down, Left, Right"
        ),
    )


def is_valid(board: list[list[str]], max_size: int, max_steps: int) -> bool:
    """DFS to check that there is a valid path.

    Args:
        board: The board representation as a list of lists.
        max_size: Maximum size of the board.
        max_steps: Maximum number of steps allowed.

    Returns:
        True if there's a valid path from start to goal within max_steps,
        False otherwise.
    """
    frontier, discovered = [], set()
    # find the start point
    start_r, start_c = np.where(np.array(board) == "S")
    frontier.append((start_r[0], start_c[0], 0))  # (row, col, steps)
    # DFS to check if there is a path from start to goal
    while frontier:
        r, c, steps = frontier.pop()
        if steps > max_steps:
            continue

        if (r, c) not in discovered:
            discovered.add((r, c))
            directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
            for x, y in directions:
                r_new = r + x
                c_new = c + y
                if (
                    r_new < 0
                    or r_new >= max_size
                    or c_new < 0
                    or c_new >= max_size
                ):  # noqa: PLR2004
                    continue
                if board[r_new][c_new] == "G":
                    return True
                if board[r_new][c_new] != "H":
                    frontier.append((r_new, c_new, steps + 1))
    return False

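# Illustrative check (not part of the module): on the 2x2 board below, the
# goal is adjacent to the start, so the board is solvable within one step:
#     is_valid([["S", "G"], ["F", "H"]], max_size=2, max_steps=1)  # -> True
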
def generate_random_map(
    size: int = 8,
    p: float = 0.8,
    seed: int = 0,
    max_steps: int = 5,
) -> Tuple[list[str], Tuple[int, int]]:
    """Generates a random valid map (one that has a path from start to goal).

    Args:
        size: Size of each side of the grid.
        p: Probability that a tile is frozen.
        seed: Seed to ensure the generation of reproducible maps.
        max_steps: Maximum number of steps allowed.

    Returns:
        A tuple containing a random valid map and the goal position (row, col).
    """
    valid = False
    board: list[list[str]] = []  # initialize to make pyright happy

    try:
        from gymnasium.utils import seeding

        np_random, _ = seeding.np_random(seed)
    except ImportError as exc:
        raise ImportError(
            "Gymnasium is not installed. "
            "Please install gymnasium first before "
            "running the frozen_lake workflow.",
        ) from exc

    # generate random start and end points
    while not valid:
        p = min(1, p)
        board = np_random.choice(
            ["F", "H"],
            (size, size),
            p=[p, 1 - p],
        ).tolist()

        while True:
            start_r = int(np_random.integers(0, size))
            start_c = int(np_random.integers(0, size))
            goal_r = int(np_random.integers(0, size))
            goal_c = int(np_random.integers(0, size))

            # Ensure start and goal are different positions
            if (start_r, start_c) != (goal_r, goal_c):
                break

        board[start_r][start_c] = "S"
        board[goal_r][goal_c] = "G"

        valid = is_valid(board, size, max_steps)
    return ["".join(x) for x in board], (goal_r, goal_c)


def get_goal_position(
    random_map: np.ndarray,
) -> Optional[Tuple[int, int]]:
    """Get the goal position from a random map.

    Args:
        random_map: The map as a numpy array.

    Returns:
        Tuple of (row, col) if goal found, None otherwise.
    """
    positions = np.argwhere(random_map == b"G")
    if positions.size == 0:
        return None  # G not found
    return tuple(positions[0])  # returns (row, col)


__all__ = [
    "SYSTEM_PROMPT",
    "FrozenLakeAction",
    "generate_random_map",
    "get_goal_position",
]
53
tuner/frozen_lake/config.yaml
Normal file
@@ -0,0 +1,53 @@
project: "AgentScope"  # Project name
name: "FrozenLake"  # Experiment name
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}  # Directory to save model checkpoints
algorithm:
  algorithm_type: multi_step_grpo  # GRPO series for multi-step scenario
  repeat_times: 16  # Number of rollouts per prompt for advantage estimation
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    kl_coef: 0  # KL divergence coefficient
  advantage_fn_args:
    epsilon: 1e-6  # Small value for numerical stability
    std_threshold: 0.0001  # Threshold for standard deviation
  optimizer:
    lr: 1e-6  # Learning rate
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-3B-Instruct}  # Base model path
  max_prompt_tokens: 23552  # Max tokens for prompt
  max_response_tokens: 2048  # Max tokens per response
  max_model_len: 25600  # Max context length
  temperature: 1.0  # Sampling temperature
buffer:
  total_epochs: 5  # Total training epochs
  batch_size: 32  # Batch size per explore step
  train_batch_size: 1024  # Total experiences per training step
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
      max_read_timeout: 7200  # Max timeout for reading from buffer (seconds)
    replay_buffer:
      enable: true  # Enable experience replay
      priority_fn: linear_decay  # Priority function for replay buffer
      priority_fn_args:
        decay: 0.1  # Decay rate for priority function
explorer:
  runner_per_model: 16  # Number of runners per model
  rollout_model:
    engine_num: 6  # Number of vLLM engines for rollout model
    tensor_parallel_size: 1  # TP size per engine for rollout model
    enable_openai_api: true  # Enable OpenAI-compatible API
    enable_history: true  # Enable conversation history
    enable_auto_tool_choice: true  # Enable automatic tool selection
    tool_call_parser: hermes  # Parser for tool calls
trainer:
  save_interval: 100  # Save checkpoint every N steps
  use_dynamic_bsz: true  # Use dynamic batch size
  grad_clip: 1.0  # Gradient clipping value
  max_token_len_per_gpu: 25600  # Max token length per GPU
  ulysses_sequence_parallel_size: 2  # Sequence parallel size for Ulysses
synchronizer:
  sync_style: dynamic_by_explorer  # Sync triggered dynamically by explorer
  sync_interval: 1  # Sync every N steps
  sync_timeout: 1200  # Timeout for synchronization (seconds)
BIN
tuner/frozen_lake/critic_rewards_mean.png
Normal file
Binary file not shown. (62 KiB)
131
tuner/frozen_lake/get_frozenlake_data.py
Normal file
@@ -0,0 +1,131 @@
# -*- coding: utf-8 -*-
"""
Modified from rllm
"""
import argparse
import os

import numpy as np
import pandas as pd


DEFAULT_DATA_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "..",
    "data",
    "frozenlake",
)


def save_dataset_to_local(
    data_path: str,
    data: list[dict],
    split: str = "default",
) -> str:
    """Save dataset directly to local data_path.

    Args:
        data_path: Path to save the dataset
        data: List of dictionaries containing the dataset examples
        split: Split name (e.g., 'train', 'test', 'default')

    Returns:
        str: Path to the saved parquet file
    """
    os.makedirs(data_path, exist_ok=True)

    # Convert to DataFrame and save
    data_df = pd.DataFrame(data)
    dataset_path = os.path.join(data_path, f"{split}.parquet")
    data_df.to_parquet(dataset_path)

    print(
        f"Saved dataset frozenlake split '{split}' "
        f"with {len(data)} examples at {dataset_path}. "
        f"Make sure to set the environment variable "
        f"<TRINITY_TASKSET_PATH> to {data_path}.",
    )

    return dataset_path


def prepare_frozenlake_data(
    data_path: str,
    train_size: int = 10000,
    test_size: int = 100,
    map_max_size: int = 6,
) -> tuple[list[dict], list[dict]]:
    """
    Prepare and save FrozenLake datasets for training and testing.

    Args:
        data_path (str): Path to save the dataset
        train_size (int): Number of training examples to generate
        test_size (int): Number of test examples to generate
        map_max_size (int): Maximum size of the map

    Returns:
        tuple: (train_data, test_data) - Lists of data dictionaries
    """
    # Set random seed for reproducibility
    np.random.seed(42)

    # Generate random parameters for train and test sets
    train_seeds = np.random.randint(0, 100000, size=train_size)
    test_seeds = np.random.randint(0, 100000, size=test_size)
    train_sizes = np.random.randint(2, map_max_size, size=train_size)
    test_sizes = np.random.randint(2, map_max_size, size=test_size)
    train_ps = np.random.uniform(0.6, 0.85, size=train_size)
    test_ps = np.random.uniform(0.6, 0.85, size=test_size)

    def frozenlake_process_fn(
        seed: int,
        size: int,
        p: float,
        idx: int,
    ) -> dict:
        """Process function to create FrozenLake task instances."""
        return {
            "seed": seed,
            "size": size,
            "p": p,
            "index": idx,
            "uid": f"{seed}_{size}_{p}",
        }

    # Create train and test data
    train_data_list = [
        frozenlake_process_fn(seed, train_sizes[idx], train_ps[idx], idx)
        for idx, seed in enumerate(train_seeds)
    ]
    test_data_list = [
        frozenlake_process_fn(seed, test_sizes[idx], test_ps[idx], idx)
        for idx, seed in enumerate(test_seeds)
    ]

    # Save datasets directly to local DATA_PATH
    save_dataset_to_local(data_path, train_data_list, "train")
    save_dataset_to_local(data_path, test_data_list, "test")

    return train_data_list, test_data_list


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=DEFAULT_DATA_PATH)
    parser.add_argument("--train_size", type=int, default=10000)
    parser.add_argument("--test_size", type=int, default=100)
    parser.add_argument("--map_max_size", type=int, default=6)
    args = parser.parse_args()

    train_data, test_data = prepare_frozenlake_data(
        data_path=args.local_dir,
        train_size=args.train_size,
        test_size=args.test_size,
        map_max_size=args.map_max_size,
    )

    print(f"Train dataset: {len(train_data)} examples")
    print(f"Test dataset: {len(test_data)} examples")
    print("Sample train example:", train_data[0])
    print("Sample test example:", test_data[0])
151
tuner/frozen_lake/main.py
Normal file
@@ -0,0 +1,151 @@
# -*- coding: utf-8 -*-
"""Example of training a FrozenLake agent with Trinity-RFT."""
import os
from typing import Dict

from _frozenlake_agent import FrozenLakeAgent
from _frozenlake_env import FrozenLakeEnv

from agentscope.message import Msg
from agentscope.tuner import (
    tune,
    WorkflowOutput,
    DatasetConfig,
    TunerModelConfig,
    AlgorithmConfig,
)
from agentscope.model import ChatModelBase


async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    """A workflow function using the FrozenLake agent to solve tasks.

    Args:
        task (Dict): The task to be solved, containing environment parameters
            like size, p, seed, is_slippery, etc.
        model (ChatModelBase): The language model to use.

    Returns:
        WorkflowOutput: The workflow output containing the reward, response
            and metrics.
    """

    assert len(auxiliary_models) == 0, "No auxiliary models are needed"

    # Extract workflow arguments from task or use defaults
    workflow_args = task.get("workflow_args", {})
    if not workflow_args:
        workflow_args = task

    env_max_steps = workflow_args.get("env_max_steps", 8)
    agent_max_steps = workflow_args.get("agent_max_steps", 10)
    is_slippery = workflow_args.get("is_slippery", False)
    desc = workflow_args.get("desc", None)

    # Extract task-specific arguments (for environment generation)
    size = task.get("size", 8)
    p = task.get("p", 0.8)
    seed = task.get("seed", 42)

    # Initialize agent and environment
    agent = FrozenLakeAgent(model=model, max_steps=agent_max_steps)
    env = FrozenLakeEnv(
        max_steps=env_max_steps,
        desc=desc,
        is_slippery=is_slippery,
        size=size,
        p=p,
        seed=seed,
    )

    # Reset environment with task parameters
    observation, _ = env.reset(task)
    observation_str = str(observation)
    rewards = []
    step_count = 0
    done = False
    terminate_reason = None

    # Run agent-environment interaction loop
    for _ in range(agent_max_steps):
        step_count += 1
        try:
            # get prompt
            prompt = agent.get_prompt(observation_str)

            response = await agent.reply(msg=Msg("user", prompt, role="user"))

            # record action and observation
            action = agent.get_action(response)
            agent.update_state(action=action, observation=observation_str)

        except Exception as e:
            terminate_reason = f"agent_error: {str(e)}"
            break

        # environment step
        observation, reward, done, _ = env.step(action)
        observation_str = str(observation)
        rewards.append(reward)

        if done:
            terminate_reason = "success" if env.success() else "hole"
            break

    if terminate_reason is None:
        terminate_reason = "max_steps_reached"

    final_reward = sum(rewards)
    final_observation = observation_str

    # Create response message with environment information
    response_content = (
        f"Final observation:\n{final_observation}\n"
        f"Total reward: {final_reward}\n"
        f"Steps taken: {step_count}\n"
        f"Terminate reason: {terminate_reason}"
    )

    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )


if __name__ == "__main__":
    dataset = DatasetConfig(
        path="/path/to/frozenlake",
        split="train",
    )
    tuner_model = TunerModelConfig(
        model_path="Qwen/Qwen2.5-3B-Instruct",
        max_model_len=25600,
        max_tokens=2048,
        inference_engine_num=6,
        reasoning_parser=None,
    )
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=16,
        batch_size=32,
        learning_rate=1e-6,
    )
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )  # define some default parameters
    tune(
        workflow_func=run_frozen_lake,
        model=tuner_model,
        train_dataset=dataset,
        algorithm=algorithm,
        config_path=config_path,
    )