Training FrozenLake Agent with RL using AgentScope-Tuner
Summary
This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the Frozen Lake task using Trinity-RFT. The agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes through multi-step interactions with the environment.
Task Setting
Agent Goal
The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:
- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps
Agent Type
The agent is implemented as a ReActAgent (Reasoning and Acting Agent) that:
- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode
Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:
- Grid-based navigation: Randomly generated maps with configurable size (2x2 to 6x6)
- Tile types:
  - S: Start position
  - F: Frozen tile (safe to walk on)
  - H: Hole (terminates episode with reward 0)
  - G: Goal (terminates episode with reward +1.0)
- Action space: Discrete actions (Up, Down, Left, Right)
- Reward structure:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal
- Observations: Text-based grid representation showing current player position
The agent does not use external tools. It interacts directly with the environment through:
- env.reset(task): Initialize environment with task parameters
- env.step(action): Execute action and receive observation, reward, and done flag
- env.render(): Get text representation of current state
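The three calls listed above can be illustrated with a minimal, self-contained sketch. MiniLakeEnv and MOVES are hypothetical names, not the repository's FrozenLakeEnv. The sketch takes a fixed grid instead of a task dict, assumes deterministic (non-slippery) movement, and marks the player with P, which is an assumption based on the example agent output at the end of this document.

```python
from typing import Dict, List, Tuple

# Hypothetical names for illustration only; not the repository's FrozenLakeEnv.
MOVES: Dict[str, Tuple[int, int]] = {
    "Up": (-1, 0),
    "Down": (1, 0),
    "Left": (0, -1),
    "Right": (0, 1),
}


class MiniLakeEnv:
    """Deterministic text gridworld exposing reset/step/render."""

    def __init__(self, grid: List[str], max_steps: int = 8) -> None:
        self.grid = grid
        self.max_steps = max_steps
        self.pos = (0, 0)
        self.steps = 0

    def reset(self) -> str:
        # Place the player on the start tile and clear the step counter.
        for r, row in enumerate(self.grid):
            if "S" in row:
                self.pos = (r, row.index("S"))
        self.steps = 0
        return self.render()

    def step(self, action: str) -> Tuple[str, float, bool, dict]:
        dr, dc = MOVES[action]
        # Clamp to the grid so a move into the border leaves the player in place.
        r = min(max(self.pos[0] + dr, 0), len(self.grid) - 1)
        c = min(max(self.pos[1] + dc, 0), len(self.grid[0]) - 1)
        self.pos = (r, c)
        self.steps += 1
        tile = self.grid[r][c]
        reward = 1.0 if tile == "G" else 0.0
        # The episode ends on goal, on hole, or when the step budget runs out.
        done = tile in ("G", "H") or self.steps >= self.max_steps
        return self.render(), reward, done, {"tile": tile}

    def render(self) -> str:
        # Show the player's cell as "P"; every other tile is printed as-is.
        rows = []
        for r, row in enumerate(self.grid):
            cells = ["P" if (r, c) == self.pos else t for c, t in enumerate(row)]
            rows.append(" ".join(cells))
        return "\n".join(rows)
```

For example, on the grid ["SF", "HG"], calling step("Right") and then step("Down") ends the episode at the goal with reward 1.0.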
Dataset Preparation
The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:
- seed: Random seed for reproducible map generation
- size: Grid size (randomly sampled from 2 to map_max_size, e.g., 4x4, 6x6)
- p: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- index: Sample index
- uid: Unique identifier combining seed, size, and p
Run the data preparation script to generate training and test datasets:
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
This will create parquet files in the specified directory:
/path/to/frozenlake_dataset/
├── train.parquet # 10000 training samples
└── test.parquet # 100 test samples
Each sample looks like:
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
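As a rough illustration of how such samples could be produced, the sketch below builds task dicts with the same five fields. make_samples and base_seed are hypothetical names, and rounding p to two decimals is an assumption made to match the sample shown above. The actual get_frozenlake_data.py may work differently and also writes parquet files, which this sketch omits.

```python
import random
from typing import Dict, List


def make_samples(n: int, map_max_size: int = 6, base_seed: int = 0) -> List[Dict]:
    """Build n task dicts with seed, size, p, index, and uid (hypothetical helper)."""
    rng = random.Random(base_seed)
    samples = []
    for index in range(n):
        seed = rng.randint(0, 2**31 - 1)
        size = rng.randint(2, map_max_size)  # inclusive on both ends
        p = round(rng.uniform(0.6, 0.85), 2)  # assumption: two-decimal precision
        samples.append(
            {
                "seed": seed,
                "size": size,
                "p": p,
                "index": index,
                "uid": f"{seed}_{size}_{p}",
            }
        )
    return samples
```

Seeding the generator makes the dataset reproducible: two calls with the same base_seed return identical samples.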
Note: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (env_max_steps=8), filtering out unsolvable tasks.
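One common way to implement such a solvability filter is a breadth-first search over the grid. The sketch below is an assumption about how the check could work, not the script's actual code. shortest_path_len and is_solvable are hypothetical names, and movement is assumed to be deterministic.

```python
from collections import deque
from typing import List, Optional


def shortest_path_len(grid: List[str]) -> Optional[int]:
    """Return the fewest moves from S to G that avoid H, or None if unreachable."""
    n_rows, n_cols = len(grid), len(grid[0])
    start = next(
        (r, c) for r in range(n_rows) for c in range(n_cols) if grid[r][c] == "S"
    )
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == "G":
            return dist
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < n_rows and 0 <= nc < n_cols
            if inside and (nr, nc) not in seen and grid[nr][nc] != "H":
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def is_solvable(grid: List[str], env_max_steps: int = 8) -> bool:
    """Keep a map only if the goal is reachable within the step budget."""
    dist = shortest_path_len(grid)
    return dist is not None and dist <= env_max_steps
```

Because breadth-first search visits tiles in order of distance, the first time it reaches G gives the minimum number of moves, which can then be compared against env_max_steps.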
Code Implementation
This section provides a high-level overview of the code implementation. For detailed implementation, please refer to the source code.
High-level Overview
The implementation consists of three main components:
- Agent (FrozenLakeAgent): Extends ReActAgent to handle multi-step navigation
- Environment (FrozenLakeEnv): Wraps Gymnasium's FrozenLake environment
- Workflow (run_frozen_lake): Orchestrates the agent-environment interaction loop
Agent Workflow
The workflow function run_frozen_lake implements the agent-environment interaction loop:
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...
    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...
    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")
    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )
Key characteristics:
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully
Reward Function
No separate judge function is needed. The reward comes directly from the environment:
- 1.0: Agent successfully reaches the goal (G)
- 0.0: Agent falls into a hole (H) or fails to reach the goal within the maximum steps
The reward is computed as the sum of step rewards throughout the episode. The workflow returns:
- reward: Final cumulative reward
- response: Final response message containing observation, total reward, steps taken, and termination reason
- metrics: Additional metrics including env_steps (number of steps taken) and env_done (whether episode completed)
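The reward and metrics described above can be sketched as a small pure function. summarize_episode is a hypothetical helper, not part of the example's source. It assumes one reward is recorded per environment step.

```python
from typing import Dict, List, Tuple


def summarize_episode(
    step_rewards: List[float], done: bool
) -> Tuple[float, Dict[str, float]]:
    """Sum per-step rewards and collect the env_steps and env_done metrics."""
    final_reward = sum(step_rewards)
    metrics = {
        "env_steps": float(len(step_rewards)),
        "env_done": float(done),
    }
    return final_reward, metrics
```

Since only the step that reaches the goal yields 1.0 and that step also ends the episode, the summed reward is either 0.0 or 1.0.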
Implementation Details
The environment (FrozenLakeEnv) wraps Gymnasium's FrozenLake and provides:
- reset(task): Initialize the environment with task parameters
- step(action): Execute an action and return (observation, reward, done, info)
- render(): Return a text representation of the current state
The agent (FrozenLakeAgent) extends ReActAgent and provides:
- reply(msg): Reply to a message and return an action (inherited from AgentScope)
- get_prompt(observation): Generate a prompt from the current observation
- get_action(response): Parse the model's response to extract an action (Up/Down/Left/Right)
- update_state(action, observation): Update internal state after each step
See frozenlake_env.py and frozenlake_agent.py for implementation details.
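To make the action-parsing step concrete, here is a minimal sketch of how an action could be extracted from free-form model text. It assumes the model wraps its chosen action in triple backticks, as in the example agent output at the end of this document. parse_action, FENCE, and ACTION_RE are hypothetical names, and the real get_action in frozenlake_agent.py may use a different format or fallback.

```python
import re
from typing import Optional

FENCE = "`" * 3  # triple backtick, built programmatically for easy embedding
VALID_ACTIONS = ("Up", "Down", "Left", "Right")
ACTION_RE = re.compile(
    re.escape(FENCE) + r"\s*(" + "|".join(VALID_ACTIONS) + r")\s*" + re.escape(FENCE),
    re.IGNORECASE,
)


def parse_action(text: str) -> Optional[str]:
    """Return the last fenced action found in text, or None if there is none."""
    matches = ACTION_RE.findall(text)
    if not matches:
        return None
    # Normalize case so that "left" and "LEFT" both become "Left".
    return matches[-1].capitalize()
```

Taking the last match is a design choice: a reasoning model may mention several candidate moves before committing to one. Returning None lets the caller decide how to handle an invalid response.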
Training the Workflow with tune
import os

from agentscope.tuner import tune, DatasetConfig

# run_frozen_lake is the workflow function shown above.
if __name__ == "__main__":
    # Load the training configuration that sits next to this script.
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
See config.yaml for the training configuration. For full configuration details, see Trinity-RFT Configuration Guide.
How to Run
Prerequisites
- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT installation guide to install the latest version from source code
- Install gymnasium for the FrozenLake environment:
pip install gymnasium[toy_text]
- Download the model checkpoint (example):
huggingface-cli download Qwen/Qwen2.5-3B-Instruct
Step 1: Prepare the Dataset
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
Update the dataset path in main.py to point to your generated dataset directory.
Step 2: Configure the Training
The key configuration options are set in the code and include:
Algorithm Configuration (AlgorithmConfig):
- algorithm_type: multi_step_grpo (Group Relative Policy Optimization for multi-step tasks)
- group_size: Number of rollouts sampled per task; in GRPO these form the group within which rewards are compared (default: 16)
- batch_size: Batch size for training (default: 32)
- learning_rate: Learning rate (default: 1e-6)
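In GRPO-style algorithms, several rollouts of the same task form a group, and each rollout's reward is compared with the group's statistics to produce an advantage. The sketch below shows the commonly used mean and standard deviation normalization. group_relative_advantages and eps are hypothetical names, and Trinity-RFT's multi_step_grpo may differ in details, such as how the advantage is assigned across the steps of an episode.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the other rollouts of the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With the binary rewards used here, a group in which every rollout succeeds or every rollout fails yields all-zero advantages. Only tasks with mixed outcomes contribute a learning signal under this formulation.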
Model Configuration (TunerModelConfig):
- model_path: Path to the base model (e.g., Qwen/Qwen2.5-3B-Instruct)
- max_model_len: Maximum model context length (default: 25600)
- max_tokens: Maximum tokens for response generation (default: 2048)
- inference_engine_num: Number of inference engines (default: 6, using 6 GPUs for inference)
Dataset Configuration (DatasetConfig):
- path: Path to the dataset (default: /path/to/frozenlake)
- split: Split of the dataset (default: train)
Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in config.yaml.
Step 3: Set Up Ray Cluster
Set up a Ray cluster:
ray start --head
# for multi-node setup, run the following command on worker nodes
# ray start --address=<master_address>
Step 4: Run the Training Script
python main.py
The training will start and you can monitor the progress through the logs. Checkpoints will be saved once every trainer.save_interval steps.
Experimental Results
Training Reward Curve
The reward curve during training shows the agent's learning progress:
The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.
Example Agent Output
An example of agent output is given below:
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.
The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.
Let's take the action ```Left```.
