# Training FrozenLake Agent with RL using AgentScope-Tuner
## Summary

This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task using [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT). Through multi-step interactions with the environment, the agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes.

## Task Setting
### Agent Goal

The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:

- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps

### Agent Type
The agent is implemented as a **ReActAgent** (Reasoning and Acting Agent) that:

- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode

### Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:

- **Grid-based navigation**: Randomly generated maps with configurable size (2x2 to 6x6)
- **Tile types**:
  - `S`: Start position
  - `F`: Frozen tile (safe to walk on)
  - `H`: Hole (terminates the episode with reward 0)
  - `G`: Goal (terminates the episode with reward +1.0)
- **Action space**: Discrete actions (Up, Down, Left, Right)
- **Reward structure**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal
- **Observations**: Text-based grid representation showing the current player position

The agent does not use external tools. It interacts directly with the environment through:

- `env.reset(task)`: Initialize the environment with task parameters
- `env.step(action)`: Execute an action and receive the observation, reward, and done flag
- `env.render()`: Get a text representation of the current state

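The sketch below shows how these three calls fit together in a single episode. It is illustrative only: it assumes the `FrozenLakeEnv` wrapper from `frozenlake_env.py` exposes exactly this interface, and the constructor arguments and the random stand-in policy are placeholders rather than the trained agent.

```python
# Illustrative sketch only; constructor arguments and the random policy are
# placeholders. See frozenlake_env.py for the actual wrapper.
import random

from frozenlake_env import FrozenLakeEnv

task = {"seed": 12345, "size": 4, "p": 0.75}  # one dataset sample
env = FrozenLakeEnv()
observation, _ = env.reset(task)

done = False
while not done:
    # A real run queries the agent here; random actions keep the sketch
    # self-contained.
    action = random.choice(["Up", "Down", "Left", "Right"])
    observation, reward, done, _ = env.step(action)
    print(env.render())  # text grid showing the current player position
```
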
## Dataset Preparation
The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:

- `seed`: Random seed for reproducible map generation
- `size`: Grid size (randomly sampled from 2 to `map_max_size`, e.g., 4x4, 6x6)
- `p`: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- `index`: Sample index
- `uid`: Unique identifier combining seed, size, and p

Run the data preparation script to generate training and test datasets:
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

This will create parquet files in the specified directory:

```
/path/to/frozenlake_dataset/
├── train.parquet   # 10000 training samples
└── test.parquet    # 100 test samples
```

Each sample looks like:

```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (`env_max_steps=8`), filtering out unsolvable tasks.

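To verify the generated files, you can load them with pandas. This is a minimal check and assumes pandas (with a parquet engine such as pyarrow) is installed; it is not part of the example scripts.

```python
# Minimal sanity check of the generated dataset (not part of the example code).
import pandas as pd

df = pd.read_parquet("/path/to/frozenlake_dataset/train.parquet")
print(len(df))      # expected: 10000 training samples
print(df.iloc[0])   # fields: seed, size, p, index, uid
```
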
## Code Implementation
This section provides a high-level overview of the code implementation. For the detailed implementation, please refer to the source code.

### High-level Overview
The implementation consists of three main components:

1. **Agent** (`FrozenLakeAgent`): Extends `ReActAgent` to handle multi-step navigation
2. **Environment** (`FrozenLakeEnv`): Wraps Gymnasium's FrozenLake environment
3. **Workflow** (`run_frozen_lake`): Orchestrates the agent-environment interaction loop

### Agent Workflow
The workflow function `run_frozen_lake` implements the agent-environment interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(
            msg=Msg("user", agent.get_prompt(observation), role="user"),
        )
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )
```

**Key characteristics:**
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully

### Reward Function
No separate judge function is needed. The reward comes directly from the environment:

- 1.0: The agent successfully reaches the goal (G)
- 0.0: The agent falls into a hole (H) or fails to reach the goal within the maximum number of steps

The reward is computed as the sum of step rewards throughout the episode. The workflow returns:

- `reward`: Final cumulative reward
- `response`: Final response message containing the observation, total reward, steps taken, and termination reason
- `metrics`: Additional metrics including `env_steps` (number of steps taken) and `env_done` (whether the episode terminated)

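As a toy illustration of this bookkeeping, an episode that reaches the goal on its fourth step would produce the following values (the numbers are made up for illustration):

```python
# Toy illustration of the reward bookkeeping: an episode that reaches the goal
# on its fourth step.
rewards = [0.0, 0.0, 0.0, 1.0]   # one entry per env.step(...) call
final_reward = sum(rewards)      # 1.0 -> the goal was reached
metrics = {
    "env_steps": float(len(rewards)),  # 4 steps taken
    "env_done": float(True),           # the episode terminated
}
print(final_reward, metrics)
```
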
### Implementation Details
The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:

- `reset(task)`: Initialize the environment with task parameters
- `step(action)`: Execute an action and return `(observation, reward, done, info)`
- `render()`: Return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:

- `reply(msg)`: Reply to a message and return an action (inherited from AgentScope)
- `get_prompt(observation)`: Generate a prompt from the current observation
- `get_action(response)`: Parse the model's response to extract an action (Up/Down/Left/Right)
- `update_state(action, observation)`: Update internal state after each step

See [frozenlake_env.py](./frozenlake_env.py) and [frozenlake_agent.py](./frozenlake_agent.py) for implementation details.

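For orientation, a skeleton of such a subclass might look like the sketch below. The import path, class name, and method bodies are illustrative assumptions; the actual logic lives in [frozenlake_agent.py](./frozenlake_agent.py).

```python
# Hypothetical skeleton only; the import path and parsing logic are assumptions.
import re
from typing import Optional

from agentscope.agent import ReActAgent


class SketchFrozenLakeAgent(ReActAgent):
    """Minimal sketch mirroring the interface described above."""

    ACTIONS = ("Up", "Down", "Left", "Right")

    def get_prompt(self, observation: str) -> str:
        # Turn the text grid into an instruction for the model.
        return (
            "You are navigating a frozen lake grid:\n"
            f"{observation}\n"
            "Reply with exactly one action: Up, Down, Left, or Right."
        )

    def get_action(self, response) -> Optional[str]:
        # Extract the last action keyword mentioned in the model's reply.
        matches = re.findall(r"\b(Up|Down|Left|Right)\b", str(response))
        return matches[-1] if matches else None

    def update_state(self, action: str, observation: str) -> None:
        # Remember the last move so the next prompt can reference it.
        self.last_action = action
        self.last_observation = observation
```
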
### Use `tune` to Train the Workflow
```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    # run_frozen_lake is the workflow function defined above
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run
### Prerequisites

- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source
- Install Gymnasium for the FrozenLake environment:

  ```bash
  pip install "gymnasium[toy_text]"
  ```

- Download the model checkpoint (example):

  ```bash
  huggingface-cli download Qwen/Qwen2.5-3B-Instruct
  ```

### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Update the dataset path in `main.py` to point to your generated dataset directory.

### Step 2: Configure the Training
Key configuration options, set in the code, include:

**Algorithm Configuration** (`AlgorithmConfig`):

- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: Number of rollouts sampled per task for group-relative advantage estimation (default: 16)
- `batch_size`: Batch size for training (default: 32)
- `learning_rate`: Learning rate (default: 1e-6)

**Model Configuration** (`TunerModelConfig`):

- `model_path`: Path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: Maximum model context length (default: 25600)
- `max_tokens`: Maximum number of tokens for response generation (default: 2048)
- `inference_engine_num`: Number of inference engines (default: 6, i.e., 6 GPUs used for inference)

**Dataset Configuration** (`DatasetConfig`):

- `path`: Path to the dataset (default: `/path/to/frozenlake`)
- `split`: Dataset split to use (default: `train`)

Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in [config.yaml](./config.yaml).

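As a rough sketch of how these options map onto the config objects named above, the snippet below constructs them with the listed field names. The import location and exact constructor signatures are assumptions; consult `main.py` and [config.yaml](./config.yaml) for the actual wiring.

```python
# Sketch only: field names follow the lists above, but the import location and
# constructor signatures are assumptions.
from agentscope.tuner import AlgorithmConfig, TunerModelConfig, DatasetConfig

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=16,
    batch_size=32,
    learning_rate=1e-6,
)
model = TunerModelConfig(
    model_path="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=25600,
    max_tokens=2048,
    inference_engine_num=6,
)
dataset = DatasetConfig(
    path="/path/to/frozenlake_dataset",
    split="train",
)
```
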
### Step 3: Set Up Ray Cluster
Set up a [Ray](https://github.com/ray-project/ray) cluster:

```bash
ray start --head
# For a multi-node setup, run the following command on each worker node:
# ray start --address=<master_address>
```

### Step 4: Run the Training Script
```bash
python main.py
```

Training will start, and you can monitor its progress through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results
### Training Reward Curve

The reward curve during training shows the agent's learning progress:



The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output
An example of agent output is given below:

````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````