
Training FrozenLake Agent with RL using AgentScope-Tuner

Summary

This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the Frozen Lake task using Trinity-RFT. The agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes through multi-step interactions with the environment.

Task Setting

Agent Goal

The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:

  • Plan a path through frozen tiles (F) to reach the goal
  • Avoid holes that terminate the episode with zero reward
  • Complete the task within a limited number of steps

Agent Type

The agent is implemented as a ReActAgent (Reasoning and Acting Agent) that:

  • Observes the current state of the frozen lake grid
  • Reasons about the best action to take
  • Executes actions (Up, Down, Left, Right) to move through the environment
  • Maintains internal state across multiple steps in an episode

Environment

The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:

  • Grid-based navigation: Randomly generated maps with configurable size (2x2 to 6x6)
  • Tile types:
    • S: Start position
    • F: Frozen tile (safe to walk on)
    • H: Hole (terminates episode with reward 0)
    • G: Goal (terminates episode with reward +1.0)
  • Action space: Discrete actions (Up, Down, Left, Right)
  • Reward structure:
    • +1.0 for reaching the goal
    • 0.0 for falling into a hole or failing to reach the goal
  • Observations: Text-based grid representation showing current player position

The agent does not use external tools. It interacts directly with the environment through:

  • env.reset(task): Initialize environment with task parameters
  • env.step(action): Execute action and receive observation, reward, and done flag
  • env.render(): Get text representation of current state
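
To make that interface concrete, here is a minimal, self-contained stand-in that mirrors the `reset`/`step`/`render` contract described above. This is an illustrative sketch only, not the actual `FrozenLakeEnv` from this repository: the real wrapper builds on Gymnasium and handles seeding, map generation, and rendering details.

```python
class MiniFrozenLake:
    """Tiny illustrative stand-in for the environment interface (not the real
    FrozenLakeEnv, which wraps Gymnasium's FrozenLake)."""

    MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

    def __init__(self, grid):
        # grid is a list of equal-length row strings, e.g. ["SFF", "FHF", "FFG"]
        self.grid = grid
        self.n = len(grid)

    def reset(self, task=None):
        # Place the player on the start tile 'S' and return the first observation.
        for r in range(self.n):
            for c in range(self.n):
                if self.grid[r][c] == "S":
                    self.pos = (r, c)
        return self.render(), {}

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.n - 1)  # clamp to the grid
        c = min(max(self.pos[1] + dc, 0), self.n - 1)
        self.pos = (r, c)
        tile = self.grid[r][c]
        done = tile in "HG"                    # hole or goal ends the episode
        reward = 1.0 if tile == "G" else 0.0   # +1.0 only for reaching the goal
        return self.render(), reward, done, {}

    def render(self):
        # Text grid with the player's position marked as 'P'.
        return "\n".join(
            "".join("P" if (r, c) == self.pos else self.grid[r][c]
                    for c in range(self.n))
            for r in range(self.n)
        )

env = MiniFrozenLake(["SFF", "FHF", "FFG"])
obs, _ = env.reset(None)
for a in ["Down", "Down", "Right", "Right"]:
    obs, reward, done, _ = env.step(a)
print(reward, done)  # 1.0 True
```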

Dataset Preparation

The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:

  • seed: Random seed for reproducible map generation
  • size: Grid size (randomly sampled from 2 to map_max_size, e.g., 4x4, 6x6)
  • p: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
  • index: Sample index
  • uid: Unique identifier combining seed, size, and p
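
The sampling logic can be sketched as follows. This is an illustrative guess at what `get_frozenlake_data.py` does, with parameter ranges taken from the list above; the function name and exact sampling details are assumptions.

```python
import random

def make_sample(index, map_max_size=6, rng=None):
    # Sketch of one dataset sample; ranges follow the field descriptions above.
    rng = rng or random.Random()
    seed = rng.randrange(10**6)                # seed for reproducible maps
    size = rng.randint(2, map_max_size)        # grid size, 2 .. map_max_size
    p = round(rng.uniform(0.6, 0.85), 2)       # probability a tile is frozen
    return {"seed": seed, "size": size, "p": p,
            "index": index, "uid": f"{seed}_{size}_{p}"}

print(make_sample(0, rng=random.Random(42)))
```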

Run the data preparation script to generate training and test datasets:

python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100

This will create parquet files in the specified directory:

/path/to/frozenlake_dataset/
    ├── train.parquet  # 10000 training samples
    └── test.parquet   # 100 test samples

Each sample looks like:

{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}

Note: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (env_max_steps=8), filtering out unsolvable tasks.
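
A solvability filter like the one described can be sketched with a breadth-first search over the map. This is illustrative only; the function name and the exact cutoff handling in the real script are assumptions.

```python
from collections import deque

def solvable_within(grid, max_steps=8):
    """Return True if 'G' is reachable from 'S' in at most max_steps moves,
    stepping only on non-hole tiles (sketch of the dataset filter)."""
    n = len(grid)  # assumes a square grid of row strings
    start = next((r, c) for r in range(n) for c in range(n) if grid[r][c] == "S")
    seen = {start}
    frontier = deque([(start, 0)])  # (position, distance from start)
    while frontier:
        (r, c), d = frontier.popleft()
        if grid[r][c] == "G":
            return True
        if d == max_steps:
            continue  # step budget exhausted along this path
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] != "H" \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), d + 1))
    return False

print(solvable_within(["SFF", "FHF", "FFG"]))  # True: goal reachable in 4 steps
print(solvable_within(["SH", "HG"]))           # False: goal walled off by holes
```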

Code Implementation

This section provides a high-level overview of the code implementation. For detailed implementation, please refer to the source code.

High-level Overview

The implementation consists of three main components:

  1. Agent (FrozenLakeAgent): Extends ReActAgent to handle multi-step navigation
  2. Environment (FrozenLakeEnv): Wraps Gymnasium's FrozenLake environment
  3. Workflow (run_frozen_lake): Orchestrates the agent-environment interaction loop

Agent Workflow

The workflow function run_frozen_lake implements the agent-environment interaction loop:

async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop; track steps for the metrics below
    step_count, done = 0, False
    for _ in range(max_steps):
        response = await agent.reply(msg=Msg("user", agent.get_prompt(observation), role="user"))
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        step_count += 1
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )

Key characteristics:

  • Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
  • State tracking: The agent maintains internal state (current step, last action, last observation) across steps
  • Error handling: Invalid actions or agent errors are caught and handled gracefully

Reward Function

No separate judge function is needed. The reward comes directly from the environment:

  • 1.0: Agent successfully reaches the goal (G)
  • 0.0: Agent falls into a hole (H) or fails to reach the goal within the maximum steps

The reward is computed as the sum of step rewards throughout the episode. The workflow returns:

  • reward: Final cumulative reward
  • response: Final response message containing observation, total reward, steps taken, and termination reason
  • metrics: Additional metrics including env_steps (number of steps taken) and env_done (whether episode completed)

Implementation Details

The environment (FrozenLakeEnv) wraps Gymnasium's FrozenLake and provides:

  • reset(task): Initialize the environment with task parameters
  • step(action): Execute an action and return (observation, reward, done, info)
  • render(): Return a text representation of the current state

The agent (FrozenLakeAgent) extends ReActAgent and provides:

  • reply(msg): Reply to a message and return an action (inherited from AgentScope)
  • get_prompt(observation): Generate a prompt from the current observation
  • get_action(response): Parse the model's response to extract an action (Up/Down/Left/Right)
  • update_state(action, observation): Update internal state after each step
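
Action parsing can be sketched as below. This is illustrative, not the actual `get_action` from `frozenlake_agent.py`: it takes the last valid direction word mentioned in the response and falls back to a default action when nothing parses, which is one way to realize the graceful error handling mentioned earlier.

```python
import re

VALID_ACTIONS = ("Up", "Down", "Left", "Right")
ACTION_RE = re.compile(r"\b(" + "|".join(VALID_ACTIONS) + r")\b")

def get_action(response_text, default="Up"):
    # Take the last valid direction mentioned in the response; fall back to a
    # default action when the model output contains no parsable action.
    matches = ACTION_RE.findall(response_text)
    return matches[-1] if matches else default

print(get_action("The shortest path means moving Left first. Action: Left"))  # Left
print(get_action("I am unsure what to do"))  # Up (fallback)
```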

See frozenlake_env.py and frozenlake_agent.py for implementation details.

Use tune to train the workflow

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )

See config.yaml for the training configuration. For full configuration details, see Trinity-RFT Configuration Guide.


How to Run

Prerequisites

  • At least 2 NVIDIA GPUs with CUDA 12.8 or newer

  • Follow the Trinity-RFT installation guide to install the latest version from source code

  • Install gymnasium for the FrozenLake environment:

    pip install gymnasium[toy_text]
    
  • Download the model checkpoint (example):

    huggingface-cli download Qwen/Qwen2.5-3B-Instruct
    

Step 1: Prepare the Dataset

python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100

Update the dataset path in main.py to point to your generated dataset directory.

Step 2: Configure the Training

Key configuration options are defined in the code and in config.yaml, including:

Algorithm Configuration (AlgorithmConfig):

  • algorithm_type: multi_step_grpo (Group Relative Policy Optimization for multi-step tasks)
  • group_size: Number of rollouts sampled per task for group-relative advantage estimation (default: 16)
  • batch_size: Batch size for training (default: 32)
  • learning_rate: Learning rate (default: 1e-6)

Model Configuration (TunerModelConfig):

  • model_path: Path to the base model (e.g., Qwen/Qwen2.5-3B-Instruct)
  • max_model_len: Maximum model context length (default: 25600)
  • max_tokens: Maximum tokens for response generation (default: 2048)
  • inference_engine_num: Number of inference engines (default: 6, using 6 GPUs for inference)

Dataset Configuration (DatasetConfig):

  • path: Path to the dataset (default: /path/to/frozenlake)
  • split: Split of the dataset (default: train)

Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in config.yaml.
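
For orientation, a config.yaml fragment covering the options above might look like the following. The values come from the defaults listed in this section, but the key names and nesting are assumptions; consult config.yaml and the Trinity-RFT Configuration Guide for the authoritative schema.

```yaml
# Illustrative fragment only; field names and nesting are assumed, not verified.
algorithm:
  algorithm_type: multi_step_grpo
  group_size: 16
  batch_size: 32
  learning_rate: 1e-6
model:
  model_path: Qwen/Qwen2.5-3B-Instruct
  max_model_len: 25600
  max_tokens: 2048
  inference_engine_num: 6
dataset:
  path: /path/to/frozenlake
  split: train
```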

Step 3: Set Up Ray Cluster

Set up a Ray cluster:

ray start --head
# for multi-node setup, run the following command on worker nodes
# ray start --address=<master_address>

Step 4: Run the Training Script

python main.py

The training will start, and you can monitor progress through the logs. Checkpoints are saved every trainer.save_interval steps.

Experimental Results

Training Reward Curve

The reward curve during training shows the agent's learning progress:

(figure: training reward curve)

The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

Example Agent Output

An example of agent output is given below:

From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.