# Training FrozenLake Agent with RL using AgentScope-Tuner
## Summary

This example demonstrates how to use AgentScope-Tuner to implement reinforcement fine-tuning for the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) task using [Trinity-RFT](https://github.com/agentscope-ai/Trinity-RFT). Through multi-step interactions with the environment, the agent learns to navigate a frozen lake grid from a starting position to a goal while avoiding holes.

## Task Setting
### Agent Goal

The agent's objective is to navigate from the starting position (S) to the goal position (G) on a frozen lake grid without falling into holes (H). The agent must:

- Plan a path through frozen tiles (F) to reach the goal
- Avoid holes that terminate the episode with zero reward
- Complete the task within a limited number of steps

### Agent Type
The agent is implemented as a **ReActAgent** (Reasoning and Acting Agent) that:

- Observes the current state of the frozen lake grid
- Reasons about the best action to take
- Executes actions (Up, Down, Left, Right) to move through the environment
- Maintains internal state across multiple steps in an episode

### Environment
The environment is based on Gymnasium's FrozenLake environment, wrapped to provide:

- **Grid-based navigation**: Randomly generated maps with configurable size (2x2 to 6x6)
- **Tile types**:
  - `S`: Start position
  - `F`: Frozen tile (safe to walk on)
  - `H`: Hole (terminates the episode with reward 0)
  - `G`: Goal (terminates the episode with reward +1.0)
- **Action space**: Discrete actions (Up, Down, Left, Right)
- **Reward structure**:
  - +1.0 for reaching the goal
  - 0.0 for falling into a hole or failing to reach the goal
- **Observations**: Text-based grid representation showing the current player position

The agent does not use external tools. It interacts directly with the environment through:

- `env.reset(task)`: Initialize the environment with task parameters
- `env.step(action)`: Execute an action and receive the observation, reward, and done flag
- `env.render()`: Get a text representation of the current state

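The sketch below shows how these three calls fit together in a single episode. It is illustrative only: it assumes the `FrozenLakeEnv` wrapper from `frozenlake_env.py` exposes exactly this interface, and the constructor arguments and the random stand-in policy are placeholders rather than the trained agent.

```python
# Illustrative sketch only; constructor arguments and the random policy are
# placeholders. See frozenlake_env.py for the actual wrapper.
import random

from frozenlake_env import FrozenLakeEnv

task = {"seed": 12345, "size": 4, "p": 0.75}  # one dataset sample
env = FrozenLakeEnv()
observation, _ = env.reset(task)

done = False
while not done:
    # A real run queries the agent here; random actions keep the sketch
    # self-contained.
    action = random.choice(["Up", "Down", "Left", "Right"])
    observation, reward, done, _ = env.step(action)
    print(env.render())  # text grid showing the current player position
```
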
## Dataset Preparation
The dataset contains task parameters for generating FrozenLake environments. Each sample specifies:

- `seed`: Random seed for reproducible map generation
- `size`: Grid size (randomly sampled from 2 to `map_max_size`, e.g., 4x4, 6x6)
- `p`: Probability that a tile is frozen (vs. being a hole), randomly sampled from 0.6 to 0.85
- `index`: Sample index
- `uid`: Unique identifier combining seed, size, and p

Run the data preparation script to generate training and test datasets:
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

This will create parquet files in the specified directory:

```
/path/to/frozenlake_dataset/
├── train.parquet   # 10000 training samples
└── test.parquet    # 100 test samples
```

Each sample looks like:

```json
{"seed": 12345, "size": 5, "p": 0.75, "index": 0, "uid": "12345_5_0.75"}
```

**Note**: The data preparation script ensures that all generated maps have a valid path from start to goal within the maximum allowed steps (`env_max_steps=8`), filtering out unsolvable tasks.

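To verify the generated files, you can load them with pandas. This is a minimal check and assumes pandas (with a parquet engine such as pyarrow) is installed; it is not part of the example scripts.

```python
# Minimal sanity check of the generated dataset (not part of the example code).
import pandas as pd

df = pd.read_parquet("/path/to/frozenlake_dataset/train.parquet")
print(len(df))      # expected: 10000 training samples
print(df.iloc[0])   # fields: seed, size, p, index, uid
```
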
## Code Implementation
This section provides a high-level overview of the code implementation. For the detailed implementation, please refer to the source code.

### High-level Overview
The implementation consists of three main components:

1. **Agent** (`FrozenLakeAgent`): Extends `ReActAgent` to handle multi-step navigation
2. **Environment** (`FrozenLakeEnv`): Wraps Gymnasium's FrozenLake environment
3. **Workflow** (`run_frozen_lake`): Orchestrates the agent-environment interaction loop

### Agent Workflow
The workflow function `run_frozen_lake` implements the agent-environment interaction loop:

```python
async def run_frozen_lake(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Dict[str, ChatModelBase],
) -> WorkflowOutput:
    # ...

    # Create agent and environment
    agent = FrozenLakeAgent(model=model, ...)
    env = FrozenLakeEnv(...)
    observation, _ = env.reset(task)
    rewards = []
    # ...

    # Agent-environment interaction loop
    for _ in range(max_steps):
        response = await agent.reply(
            msg=Msg("user", agent.get_prompt(observation), role="user"),
        )
        action = agent.get_action(response)
        observation, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

    # ...
    final_reward = sum(rewards)
    final_response = Msg("assistant", response_content, role="assistant")

    return WorkflowOutput(
        reward=final_reward,
        response=final_response,
        metrics={
            "env_steps": float(step_count),
            "env_done": float(done),
        },
    )
```

**Key characteristics:**
- Multi-step interaction: The agent takes multiple actions in a single episode, unlike single-turn QA tasks
- State tracking: The agent maintains internal state (current step, last action, last observation) across steps
- Error handling: Invalid actions or agent errors are caught and handled gracefully

### Reward Function
No separate judge function is needed. The reward comes directly from the environment:

- 1.0: The agent successfully reaches the goal (G)
- 0.0: The agent falls into a hole (H) or fails to reach the goal within the maximum number of steps

The reward is computed as the sum of step rewards throughout the episode. The workflow returns:

- `reward`: Final cumulative reward
- `response`: Final response message containing the observation, total reward, steps taken, and termination reason
- `metrics`: Additional metrics including `env_steps` (number of steps taken) and `env_done` (whether the episode terminated)

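As a toy illustration of this bookkeeping, an episode that reaches the goal on its fourth step would produce the following values (the numbers are made up for illustration):

```python
# Toy illustration of the reward bookkeeping: an episode that reaches the goal
# on its fourth step.
rewards = [0.0, 0.0, 0.0, 1.0]   # one entry per env.step(...) call
final_reward = sum(rewards)      # 1.0 -> the goal was reached
metrics = {
    "env_steps": float(len(rewards)),  # 4 steps taken
    "env_done": float(True),           # the episode terminated
}
print(final_reward, metrics)
```
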
### Implementation Details
The environment (`FrozenLakeEnv`) wraps Gymnasium's FrozenLake and provides:

- `reset(task)`: Initialize the environment with task parameters
- `step(action)`: Execute an action and return `(observation, reward, done, info)`
- `render()`: Return a text representation of the current state

The agent (`FrozenLakeAgent`) extends `ReActAgent` and provides:

- `reply(msg)`: Reply to a message and return an action (inherited from AgentScope)
- `get_prompt(observation)`: Generate a prompt from the current observation
- `get_action(response)`: Parse the model's response to extract an action (Up/Down/Left/Right)
- `update_state(action, observation)`: Update internal state after each step

See [frozenlake_env.py](./frozenlake_env.py) and [frozenlake_agent.py](./frozenlake_agent.py) for implementation details.

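For orientation, a skeleton of such a subclass might look like the sketch below. The import path, class name, and method bodies are illustrative assumptions; the actual logic lives in [frozenlake_agent.py](./frozenlake_agent.py).

```python
# Hypothetical skeleton only; the import path and parsing logic are assumptions.
import re
from typing import Optional

from agentscope.agent import ReActAgent


class SketchFrozenLakeAgent(ReActAgent):
    """Minimal sketch mirroring the interface described above."""

    ACTIONS = ("Up", "Down", "Left", "Right")

    def get_prompt(self, observation: str) -> str:
        # Turn the text grid into an instruction for the model.
        return (
            "You are navigating a frozen lake grid:\n"
            f"{observation}\n"
            "Reply with exactly one action: Up, Down, Left, or Right."
        )

    def get_action(self, response) -> Optional[str]:
        # Extract the last action keyword mentioned in the model's reply.
        matches = re.findall(r"\b(Up|Down|Left|Right)\b", str(response))
        return matches[-1] if matches else None

    def update_state(self, action: str, observation: str) -> None:
        # Remember the last move so the next prompt can reference it.
        self.last_action = action
        self.last_observation = observation
```
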
### Use `tune` to Train the Workflow
```python
import os

from agentscope.tuner import tune, DatasetConfig

if __name__ == "__main__":
    config_path = os.path.join(
        os.path.dirname(__file__),
        "config.yaml",
    )
    dataset = DatasetConfig(
        path="/path/to/frozenlake_dataset",
        name="default",
        split="train",
    )
    # run_frozen_lake is the workflow function defined above
    tune(
        workflow_func=run_frozen_lake,
        train_dataset=dataset,
        config_path=config_path,
    )
```

See [config.yaml](./config.yaml) for the training configuration. For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

---

## How to Run
### Prerequisites

- At least 2 NVIDIA GPUs with CUDA 12.8 or newer
- Follow the Trinity-RFT [installation guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source
- Install Gymnasium for the FrozenLake environment:

  ```bash
  pip install "gymnasium[toy_text]"
  ```

- Download the model checkpoint (example):

  ```bash
  huggingface-cli download Qwen/Qwen2.5-3B-Instruct
  ```

### Step 1: Prepare the Dataset
```bash
python get_frozenlake_data.py --map_max_size 6 --train_size 10000 --test_size 100
```

Update the dataset path in `main.py` to point to your generated dataset directory.

### Step 2: Configure the Training
Key configuration options, set in the code, include:

**Algorithm Configuration** (`AlgorithmConfig`):

- `algorithm_type`: `multi_step_grpo` (Group Relative Policy Optimization for multi-step tasks)
- `group_size`: Number of rollouts sampled per task for group-relative advantage estimation (default: 16)
- `batch_size`: Batch size for training (default: 32)
- `learning_rate`: Learning rate (default: 1e-6)

**Model Configuration** (`TunerModelConfig`):

- `model_path`: Path to the base model (e.g., `Qwen/Qwen2.5-3B-Instruct`)
- `max_model_len`: Maximum model context length (default: 25600)
- `max_tokens`: Maximum number of tokens for response generation (default: 2048)
- `inference_engine_num`: Number of inference engines (default: 6, i.e., 6 GPUs used for inference)

**Dataset Configuration** (`DatasetConfig`):

- `path`: Path to the dataset (default: `/path/to/frozenlake`)
- `split`: Dataset split to use (default: `train`)

Adjust these parameters based on your hardware resources and training requirements. Other parameters can be specified in [config.yaml](./config.yaml).

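As a rough sketch of how these options map onto the config objects named above, the snippet below constructs them with the listed field names. The import location and exact constructor signatures are assumptions; consult `main.py` and [config.yaml](./config.yaml) for the actual wiring.

```python
# Sketch only: field names follow the lists above, but the import location and
# constructor signatures are assumptions.
from agentscope.tuner import AlgorithmConfig, TunerModelConfig, DatasetConfig

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=16,
    batch_size=32,
    learning_rate=1e-6,
)
model = TunerModelConfig(
    model_path="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=25600,
    max_tokens=2048,
    inference_engine_num=6,
)
dataset = DatasetConfig(
    path="/path/to/frozenlake_dataset",
    split="train",
)
```
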
### Step 3: Set Up Ray Cluster
Set up a [Ray](https://github.com/ray-project/ray) cluster:

```bash
ray start --head
# For a multi-node setup, run the following command on each worker node:
# ray start --address=<master_address>
```

### Step 4: Run the Training Script
```bash
python main.py
```

Training will start, and you can monitor its progress through the logs. Checkpoints are saved every `trainer.save_interval` steps.

## Experimental Results
### Training Reward Curve

The reward curve during training shows the agent's learning progress:



The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively.

### Example Agent Output
An example of agent output is given below:

````
From the current observation, let's analyze the situation. The player (P) is at: (4, 0), and the goal (G) is at: (2, 3). There is also a hole (O) at (4, 4). Given this, I can move towards the goal without worrying about slippery tiles right now.

The shortest path from P to G involves moving left (4 steps) followed by moving down (1 step), since going directly would bypass the hole or move us further from the goal. Let's move left first.

Let's take the action ```Left```.
````