
Yuchang Sun 400c1e77bf Add README (Chinese) for tuner (#106 )

2026-01-20 19:46:50 +08:00


Training Werewolf Game with RL using AgentScope-Tuner

This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.

Overview

The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:

Avoid revealing their identity in public discussions
Coordinate with teammates effectively
Develop advanced strategies like "deep cover" tactics
Deceive villagers and mislead investigations

Task Setting

Training Objective

The goal is to train werewolf players to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward is rule-based and depends only on the game outcome:

Reward = +1.0: if werewolves win (werewolves equal or outnumber the remaining players)
Reward = 0.0: if villagers win (all werewolves eliminated)
Reward = -0.1: for game execution errors (penalty to discourage invalid behaviors)
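The three rules above can be sketched as one small function. This is a minimal illustration rather than the project's code: the name `compute_reward` and the `had_error` flag are assumptions, while `good_guy_win` mirrors the variable used in the workflow later in this README.

```python
def compute_reward(good_guy_win: bool, had_error: bool = False) -> float:
    """Rule-based reward from the werewolves' perspective (illustrative sketch)."""
    if had_error:
        # Assumed flag: the game could not be executed correctly,
        # so a small penalty is returned instead of a win/loss reward.
        return -0.1
    # The werewolves win exactly when the good-guy side does not.
    return 0.0 if good_guy_win else 1.0


print(compute_reward(good_guy_win=False))                  # werewolves win
print(compute_reward(good_guy_win=True))                   # villagers win
print(compute_reward(good_guy_win=True, had_error=True))   # execution error
```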

Game Configuration

This implementation is based on the games/game_werewolves example but with several key modifications:

Original 9-Player Setup:

3 Werewolves, 3 Villagers, 1 Seer, 1 Witch, 1 Hunter
Witch cannot self-rescue (use healing potion on herself)

Modified 7-Player Setup (This Project):

2 Werewolves: Kill one player each night, must hide identity during the day
3 Villagers: Ordinary players without special abilities
1 Seer: Can check one player's identity each night
1 Witch: Has two one-time-use potions:
- Healing potion: Save a player from being killed at night (can self-rescue)
- Poison potion: Eliminate one player at night

We also slightly modify the prompt and ask the players to reason privately before they speak publicly.

Models

Trainable Model (Werewolf Players): Qwen/Qwen2.5-7B-Instruct
Auxiliary Model (Other Roles): Qwen/Qwen3-30B-A3B-Instruct-2507

Algorithm

Multi-Step GRPO (Group Relative Policy Optimization)

Group size: 32 rollouts per task
Batch size: 24
Learning rate: 1e-6
Advantage normalization by episode length
Clipping range: [0.2, 0.28]
No KL penalty (kl_coef: 0)
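As a rough illustration of what "group relative" means, the sketch below normalizes rewards within one group of rollouts and applies a clipped surrogate objective. It is a simplified, standard-library-only sketch and not the tuner's implementation: the function names are hypothetical, reading [0.2, 0.28] as a lower and an upper clip epsilon is an assumption, and the normalization by episode length listed above is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same task/seed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with an asymmetric clip range (assumed meaning)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A toy group of 4 rollouts for one seed: two werewolf wins, two losses.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
print(clipped_objective(ratio=1.5, advantage=1.0))
```

With binary win/loss rewards, a group in which every rollout has the same outcome yields zero advantages, which is one reason a large group size helps.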

Dataset Preparation

The dataset for this task is minimal and consists only of random seeds for role shuffling. Each training episode uses a different seed to randomize player role assignments, ensuring diverse training scenarios.

Generate Dataset

Run the prepare_data.py script to generate the dataset:

# Generate default dataset (300 seeds for training)
python prepare_data.py

# Or customize the number of seeds
python prepare_data.py --num_seeds 500

This will create data/train.jsonl (or data/eval.jsonl) with the following format:

{"seed": 0}
{"seed": 1}
{"seed": 2}
...

During training, these seeds are used to shuffle role assignments via np.random.shuffle(), creating varied game configurations.
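For readers who want to see the file format end to end, here is a minimal sketch that writes and reads such a seed file. It is not the actual `prepare_data.py`; the function names `write_seed_dataset` and `read_seed_dataset` are hypothetical.

```python
import json
from pathlib import Path


def write_seed_dataset(path: str, num_seeds: int = 300) -> None:
    """Write one JSON object per line, each holding a single integer seed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for seed in range(num_seeds):
            f.write(json.dumps({"seed": seed}) + "\n")


def read_seed_dataset(path: str) -> list:
    """Load the JSONL file back into a list of task dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_seed_dataset("data/train.jsonl", 300)` would produce a file in the format shown above.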

Code Implementation

High-Level Workflow

The training workflow consists of the following key components:

1. Agent Workflow (`run_werewolves_workflow`)

async def run_werewolves_workflow(task, model, auxiliary_models):
    # 1. Initialize roles
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

    # 2. Shuffle based on task seed
    np.random.seed(task["seed"])
    np.random.shuffle(roles)

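    # 3. Create agents: werewolves use the trainable model, others use the auxiliary model
    #    (participant_model is assumed to be taken from auxiliary_models; the lookup is elided here)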
    players = [
        ReActAgent(
            name=f"Player{i+1}",
            model=model if role == "werewolf" else participant_model,
            ...
        ) for i, role in enumerate(roles)
    ]

    # 4. Run the game
    good_guy_win = await werewolves_game(players, roles)

    # 5. Compute reward
    reward = 1.0 if not good_guy_win else 0.0

    return WorkflowOutput(reward=reward, metrics={...})
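The seeding step is what makes each task reproducible: the same seed always yields the same seating. The runnable sketch below shows this in isolation. It uses the standard library's `random.Random` as a stand-in for `np.random`, and the helper names and the `"trainable"` / `"auxiliary"` labels are placeholders, not project identifiers.

```python
import random

ROLES = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]


def assign_roles(seed: int) -> list:
    """Return a seat-ordered role list that depends only on the seed."""
    roles = list(ROLES)
    random.Random(seed).shuffle(roles)  # local RNG, no global state touched
    return roles


def assign_models(roles: list) -> list:
    """Label which model each seat would use when training werewolves."""
    return ["trainable" if role == "werewolf" else "auxiliary" for role in roles]


roles = assign_roles(seed=0)
for i, (role, model) in enumerate(zip(roles, assign_models(roles))):
    print(f"Player{i + 1}: {role} ({model})")
```

Because all rollouts in a group share one seed, they share one role assignment, so differences in reward within the group come from how the game was played rather than from the seating.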

2. Game Loop (`werewolves_game`)

Each game consists of alternating night and day phases:

Night Phase:

Werewolves' Turn: Discuss privately and vote to kill a player
Witch's Turn: Decide whether to use healing/poison potions
Seer's Turn: Check one player's identity
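How the night actions combine can be sketched as follows. This is an illustrative model only: `WitchPotions` and `resolve_night` are hypothetical names, and letting the witch use both potions in the same night is an assumption that this README does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WitchPotions:
    """Each potion can be used once per game."""
    healing: bool = True
    poison: bool = True


def resolve_night(
    wolf_target: str,
    potions: WitchPotions,
    heal: bool = False,
    poison_target: Optional[str] = None,
) -> List[str]:
    """Return the players who die this night, consuming potions as they are used."""
    deaths = []
    if heal and potions.healing:
        potions.healing = False  # the werewolves' target is saved
    else:
        deaths.append(wolf_target)
    if poison_target is not None and potions.poison:
        potions.poison = False
        if poison_target not in deaths:
            deaths.append(poison_target)
    return deaths


potions = WitchPotions()
print(resolve_night("Player1", potions, heal=True))              # nobody dies
print(resolve_night("Player3", potions, poison_target="Player6"))
```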

Day Phase:

Announcement: Moderator announces who died during the night
Discussion: All alive players discuss with reasoning/statement separation
Voting: All players vote to eliminate one suspected werewolf
Last Words: Eliminated player gives final statement

The game continues until:

All werewolves are eliminated (villagers win), or
Werewolves equal or outnumber other players (werewolves win)
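The two termination conditions translate directly into a check over the roles of the players still alive. The sketch below is a minimal version of that check; the function name `check_winner` and its return labels are assumptions.

```python
from typing import List, Optional


def check_winner(alive_roles: List[str]) -> Optional[str]:
    """Return the winning side, or None if the game should continue."""
    wolves = sum(1 for role in alive_roles if role == "werewolf")
    others = len(alive_roles) - wolves
    if wolves == 0:
        return "good_guys"
    if wolves >= others:
        return "werewolves"
    return None


print(check_winner(["werewolf", "villager"]))                  # werewolves
print(check_winner(["villager", "seer", "witch"]))             # good_guys
print(check_winner(["werewolf", "villager", "witch", "seer"]))  # None
```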

3. Reward Calculation

The reward is computed based on the game outcome from the perspective of werewolves:

if not good_guy_win:  # Werewolves win
    reward = 1.0
else:                 # Villagers win
    reward = 0.0

How to Run

Prerequisites

Install AgentScope with tuner support:

pip install "agentscope[full]"

Set up environment variables (optional, can be configured in code):

export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507"
export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
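One way a script could pick up these variables with fallbacks is sketched below. How `main.py` actually reads them is not shown in this README, so the helper `load_paths` and the choice of defaults (copied from the exports above) are assumptions.

```python
import os
from typing import Dict, Mapping

# Defaults copied from the export commands above.
DEFAULTS = {
    "TRINITY_MODEL_PATH": "Qwen/Qwen2.5-7B-Instruct",
    "TRINITY_AUXILIARY_MODEL_PATH": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "TRINITY_CHECKPOINT_ROOT_DIR": "./checkpoints",
}


def load_paths(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read each variable from the environment, falling back to its default."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


for key, value in load_paths().items():
    print(f"{key}={value}")
```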

Configuration

The project uses a hybrid configuration approach:

Basic parameters in main.py:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)

Detailed settings in config.yaml:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB, TensorBoard or MLFlow)

Key parameters to adjust:

# In main.py
trained_model_path = "Qwen/Qwen2.5-7B-Instruct"
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507"

dataset = DatasetConfig(
    path="data",
    split="train",
    total_steps=400,  # Total training steps
)

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,    # Rollouts per task
    batch_size=24,    # Batch size per step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
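To get a feel for the compute these settings imply, the arithmetic below estimates the number of games played. It assumes that `batch_size` counts tasks (seeds) per training step and that each task is rolled out `group_size` times; if the tuner defines batch size differently, the numbers change accordingly. The helper `rollout_budget` is hypothetical.

```python
import math


def rollout_budget(group_size: int, batch_size: int, total_steps: int, num_seeds: int) -> dict:
    """Estimate how many full games are played under the stated assumptions."""
    games_per_step = batch_size * group_size
    return {
        "games_per_step": games_per_step,
        "total_games": games_per_step * total_steps,
        # Steps needed to see every seed once.
        "steps_per_dataset_pass": math.ceil(num_seeds / batch_size),
    }


budget = rollout_budget(group_size=32, batch_size=24, total_steps=400, num_seeds=300)
print(budget)
```

Under these assumptions each step plays 768 games, and the 300 seeds are revisited many times over 400 steps.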

Training Command

Step 1: Prepare the dataset

cd /path/to/agentscope-samples/training/werewolf_game
python prepare_data.py --num_seeds 300

Step 2: Start Ray cluster

Start your Ray cluster:

# For single node
ray start --head

# For multi-node cluster (e.g., 4 nodes with 8 GPUs each):
# On the head node:
ray start --head --port=6379

# On each worker node:
ray start --address='<head_node_ip>:6379'
# Replace <head_node_ip> with the actual IP address of your head node

Step 3: Run training

Run the training script on the head node:

python main.py

For distributed training, make sure the cluster section of config.yaml matches your Ray cluster, in particular the number of nodes and the number of GPUs per node:

cluster:
  node_num: 4
  gpu_per_node: 8

Results

Quantitative Results

Training on the 7-player werewolf game for 400 steps demonstrates significant improvement:

Metric	Before Training	After Training (Step 200+)
Werewolf Win Rate	~50%	~85%
Success Rate	0.50	0.85

Reward Curve:

The reward curve shows the werewolf win rate rising steadily during training:

Steps 0-50: Win rate fluctuates around 50-60% as the model explores strategies
Steps 50-200: Win rate climbs to 70-80% as effective strategies emerge
Steps 200-250: Win rate stabilizes around 85%
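Because the reward is 1.0 for a werewolf win and 0.0 for a loss, the win rate at a step is simply the share of winning games. The sketch below computes it from a list of rewards and smooths a per-step series with a moving average. Excluding errored games (reward -0.1) from the denominator is an assumption; this README does not say how they are counted. Both function names are hypothetical.

```python
from typing import List


def win_rate(rewards: List[float]) -> float:
    """Share of werewolf wins among games that finished without an error."""
    finished = [r for r in rewards if r >= 0.0]  # drop the -0.1 error penalties
    if not finished:
        return 0.0
    return sum(1 for r in finished if r == 1.0) / len(finished)


def moving_average(values: List[float], window: int) -> List[float]:
    """Simple moving average over consecutive windows of the given size."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


print(win_rate([1.0, 0.0, 1.0, -0.1]))
print(moving_average([0.0, 1.0, 1.0, 0.0], window=2))
```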

Qualitative Results

Before Training: Critical Mistakes

Untrained models (Qwen2.5-7B-Instruct) make fundamental errors:

Self-exposure: Werewolves accidentally reveal identity in public ("as a werewolf, I think...")
Exposing teammates: When voted out, werewolves point to teammates in last words
Random strategy: No coordination, no threat prioritization (e.g., ignoring the Seer)

After Training: "Deep Cover" Werewolf Strategy

After 200+ steps, the trained model converges to a sophisticated "deep cover" tactic. Below is an example where werewolves sacrifice teammates to gain trust, then dominate late game.

Concrete Example:

Setup (Day 1): Seer publicly identifies Player5 as werewolf. Witch and villagers join the accusation. Player5 (werewolf) is doomed.

Player4 (werewolf teammate) responds strategically:

In short: Player4 sacrifices teammate to gain trust, pretending to be a logical villager.

Detailed reasoning and statement:

Private Reasoning (only Player4 sees):

"I'm a werewolf, and I know Player2 was targeted last night but survived.
My teammate Player5 is doomed - Player3 and Player7 are attacking him together.
I need to deflect suspicion from MYSELF, not save Player5.

Strategy: Sacrifice Player5 to gain trust. If I vote against my teammate,
villagers will think I'm trustworthy. This creates confusion and positions
me as a "confirmed good guy" for later rounds."

Public Statement (all players see):

✅ "I think Player5 is acting very strangely. He brought up Player2 being targeted,
but didn't explain why he's so focused on them. It feels like a distraction tactic.
I also noticed that Player3 and Player7 are both accusing Player5, which makes sense -
their logic is solid. I think we should seriously consider voting out Player5 -
he's the most suspicious person right now."

Result: Player5 eliminated, but Player4 gains complete trust from Seer and all villagers.

Why it works:

Seer trusts Player4 as strong villager ally → won't check him
Villagers follow Player4's "logical" analysis
Player4 systematically misleads discussions in later rounds
Survives to final 2 players → werewolves win

This demonstrates the essence of trained behavior: sacrifice pieces strategically to secure ultimate victory. The model learns that short-term teammate loss is worthwhile for establishing deep cover and long-term dominance.

Bonus: Training Good Guys

In addition to training werewolves, we also provide a configuration for training the good guy side (villagers, seer, and witch). This is a more challenging task as good guys need to:

Perform complex reasoning to identify werewolves from subtle behavioral cues
Coordinate effectively without explicit team communication
Resist manipulation and deception from werewolves
Train multiple roles simultaneously: Unlike werewolves (a single role), the good guys include villager, seer, and witch, each with different abilities. The model must master all of these strategies in one training run and make good use of the special abilities (Seer's checks, Witch's potions)

Configuration

Use config_train_goodguy.yaml or set trainable_target: good_guy in workflow_args:

workflow_args:
  trainable_target: good_guy  # Train villager, seer, and witch
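Conceptually, this option decides which seats are driven by the trainable model. A minimal sketch of that selection follows. The helper `uses_trainable_model` is hypothetical, and `"werewolf"` as the other accepted value is an assumption, since only `good_guy` appears in this README.

```python
GOOD_GUY_ROLES = {"villager", "seer", "witch"}


def uses_trainable_model(role: str, trainable_target: str) -> bool:
    """Decide whether a seat with this role is played by the model being trained."""
    if trainable_target == "good_guy":
        return role in GOOD_GUY_ROLES
    if trainable_target == "werewolf":  # assumed name for the werewolf setting
        return role == "werewolf"
    raise ValueError(f"unknown trainable_target: {trainable_target!r}")


roles = ["werewolf", "werewolf", "villager", "villager", "villager", "seer", "witch"]
print([uses_trainable_model(r, "good_guy") for r in roles])
```

When training good guys the reward would also need to be taken from the good-guy perspective, that is, 1.0 when `good_guy_win` is true.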

Quantitative Results

We trained Qwen3-4B-Instruct as good guys against Qwen3-30B-A3B-Instruct werewolves:

Metric	Before Training	After ~200 Steps	After ~400 Steps
Good Guy Win Rate	~18%	~60%	~80%

Training Curve:

The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

Qualitative Results

Before Training: Mob Mentality & Critical Errors

Untrained models make fundamental reasoning errors that doom the good guy team:

Mob voting: Villagers follow accusations without logical analysis, accidentally eliminating teammates
Poor resource management: Witch wastes potions or fails to rescue a targeted teammate
No evidence integration: Players ignore key information, act on vague suspicions

After Training: Strategic Coordination & Witch Carry

After training, the good guy models exhibit advanced reasoning and team coordination.

Summary of Improvements

Aspect	Before Training	After Training
Voting Logic	Mob mentality, blind following	Evidence-chain analysis, logical deduction
Potion Usage	Wasted or unused	Self-rescue at critical moments, precise poisoning
Team Coordination	Mutual suspicion	Trust verified information, follow logical leads

Below is a concrete example comparing a similar scenario before and after training.

Concrete Example

Similar Scenario (Day 1): Seer is killed on Night 1. In their final statement, Seer reveals verified information about another player.

❌ Before Training: Villagers Vote Out Their Own Witch

Game State: Seer (Player7) dies Night 1, but in final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)

What happens:

Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
Other villagers blindly follow without questioning the logic
Player4 (Witch) defends herself but fails to reveal her role clearly
Result: Player4 (Witch) is voted out 5:1 — Good guys lose their strongest asset

Private Reasoning from Player1 (Villager):

"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."

Public Statement:

❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."

Why it fails:

Villager acts as "mob leader" without evidence analysis
Ignores that Seer's warning could have multiple interpretations
Doesn't consider Player4 might be a special role worth protecting
Final Result: Werewolves win — Witch eliminated Day 1, no potions used

✅ After Training: Witch Carries with Strategic Play

Game State: Seer (Player2) dies Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."

What happens (multi-round strategic play):

Night 2 — Witch Self-Rescue: When werewolves target Player1 (Witch), she makes a critical decision:

Private Reasoning from Player1 (Witch):

"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.

Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."

Result: Witch survives, announces her identity on Day 2, gains villagers' trust.

Day 2 — Villagers Follow Logic: With Witch confirmed, villagers analyze behavior patterns:

Private Reasoning from Player4 (Villager):

"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.

Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."

Public Statement:

✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."

Night 3 — Decisive Poison: Private Reasoning from Player1 (Witch):

"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.

I'm confident Player6 is the last werewolf. Using poison now."

Result: Witch poisons Player6 (werewolf). Good guys win.

Why it works:

Witch preserves healing potion for self-rescue at critical moment
Villagers trust verified information (Witch's self-rescue proof)
Team builds consensus through logical deduction, not mob voting
Witch uses poison decisively based on behavioral analysis
Final Result: Good guys win. Player5 is voted out on Day 2 with the Witch leading the analysis, and the Witch poisons Player6 on Night 3

This demonstrates the essence of trained good guy behavior: strategic resource management, evidence-based reasoning, and team coordination. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.

Role-Specific Advanced Patterns:

Seer: Strategic target selection, information concealment in public statements, evidence integration
Witch: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
Villager: Evidence-chain analysis, trust building with special roles, consensus formation for team coordination

Conclusion

This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.

Ready to try it yourself? Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!
