Add README (Chinese) for tuner (#106)

This commit is contained in:
Yuchang Sun
2026-01-20 19:46:50 +08:00
committed by GitHub
parent 311ddfff46
commit 400c1e77bf
16 changed files with 1256 additions and 90 deletions

# Training Werewolf Game with RL using AgentScope-Tuner
This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.
## Overview
The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:
- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
### Training Objective
The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward function is rule-based:
- **Reward = +1.0**: if werewolves win (all villagers eliminated)
- **Reward = 0.0**: if villagers win (all werewolves eliminated)
- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors)
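The rule above can be sketched in a few lines (an illustration only; `GameResult` and its fields are hypothetical, not the project's actual types):

```python
from dataclasses import dataclass


@dataclass
class GameResult:
    """Hypothetical game outcome record; field names are illustrative."""
    werewolves_won: bool
    execution_error: bool = False


def werewolf_reward(result: GameResult) -> float:
    """Rule-based reward for the werewolf team, as described above."""
    if result.execution_error:
        return -0.1  # penalty to discourage invalid behaviors
    return 1.0 if result.werewolves_won else 0.0
```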
### Game Configuration
We also make a slight modification to the prompt, asking the players to separate their private reasoning from their public statements.
### Algorithm
**Multi-Step GRPO (Group Relative Policy Optimization)**
- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
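GRPO's group-relative idea, combined with the episode-length normalization listed above, can be sketched as follows (a minimal illustration, not the exact AS-Tune implementation):

```python
import math


def group_relative_advantages(rewards, episode_lengths, eps=1e-8):
    """Standardize each rollout's reward against its group's mean/std,
    then divide by episode length (an illustrative take on the
    length normalization mentioned above)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [
        (r - mean) / (std + eps) / max(length, 1)
        for r, length in zip(rewards, episode_lengths)
    ]
```

With binary win/lose rewards, a winning rollout in a mixed group receives a positive advantage and a losing one a negative advantage; if every rollout in the group ties, all advantages are zero and the group contributes no learning signal.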
Each game consists of alternating night and day phases:
**Night Phase:**
1. Werewolves' Turn: Discuss privately and vote to kill a player
2. Witch's Turn: Decide whether to use healing/poison potions
3. Seer's Turn: Check one player's identity
**Day Phase:**
1. Announcement: Moderator announces who died during the night
2. Discussion: All alive players discuss with reasoning/statement separation
3. Voting: All players vote to eliminate one suspected werewolf
4. Last Words: Eliminated player gives final statement
The game continues until:
- All werewolves are eliminated (villagers win), or
- All villagers are eliminated (werewolves win)
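The end-of-game check can be sketched as follows (illustrative only; the game engine's actual representation is not shown in this README):

```python
def check_winner(alive_roles):
    """Return the winning side, or None if the game continues.

    `alive_roles` is a hypothetical list of the roles still alive,
    e.g. ["werewolf", "seer", "villager"].
    """
    wolves = sum(role == "werewolf" for role in alive_roles)
    others = len(alive_roles) - wolves
    if wolves == 0:
        return "good_guys"    # all werewolves eliminated
    if others == 0:
        return "werewolves"   # all villagers eliminated
    return None               # game continues
```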
The project uses a hybrid configuration approach:
1. Basic parameters in `main.py`:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)
2. Detailed settings in `config.yaml`:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB, TensorBoard or MLFlow)
Key parameters to adjust:
```python
algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,  # Rollouts per task
    batch_size=24,  # Batch size per step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```
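Under one natural reading of these parameters (an assumption on our part; the README does not spell this out), each training step collects `batch_size` tasks with `group_size` rollouts each:

```python
group_size = 32   # rollouts per task
batch_size = 24   # tasks consumed per training step

rollouts_per_step = group_size * batch_size
print(rollouts_per_step)  # 768 full games per training step
```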
Training on the 7-player werewolf game for 400 steps demonstrates significant improvements.
**Reward Curve:**
<div align="center">
<img src="./rollout_reward_curve.png" alt="Rollout Reward Curve" width="90%"/>
</div>
As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies
We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` werewolves.
**Training Curve:**
<div align="center">
<img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve" width="90%"/>
</div>
The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.
### Qualitative Results
After training, the good guy models exhibit advanced reasoning patterns:
**Before Training: Mob Mentality & Critical Errors**
Untrained models make fundamental reasoning errors that doom the good guy team:
- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
- **Poor resource management**: The Witch wastes potions or fails to rescue a targeted teammate
- **No evidence integration**: Players ignore key information and act on vague suspicions
**After Training: Strategic Coordination & Witch Carry**
After training, the good guy models exhibit advanced reasoning and team coordination.
#### Summary of Improvements
| Aspect | Before Training | After Training |
|--------|----------------|----------------|
| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
Below is a concrete example comparing a similar scenario before and after training.
#### Concrete Example
**Similar Scenario (Day 1):** The Seer is killed on Night 1. In their final statement, the Seer reveals verified information about another player.
---
**❌ Before Training: Villagers Vote Out Their Own Witch**
*Game State:* The Seer (Player7) dies on Night 1, but in their final statement gives only a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
*What happens:*
- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
- Other villagers blindly follow without questioning the logic
- Player4 (Witch) defends herself but fails to reveal her role clearly
- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset
*Private Reasoning from Player1 (Villager):*
```
"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."
```
*Public Statement:*
```
❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."
```
**Why it fails:**
- Villager acts as "mob leader" without evidence analysis
- Ignores that Seer's warning could have multiple interpretations
- Doesn't consider Player4 might be a special role worth protecting
- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used
---
**✅ After Training: Witch Carries with Strategic Play**
*Game State:* The Seer (Player2) dies on Night 1 but leaves clear information: "I checked Player4 and confirmed they are a villager."
*What happens (multi-round strategic play):*
**Night 2 — Witch Self-Rescue:**
When werewolves target Player1 (Witch), she makes a critical decision:
*Private Reasoning from Player1 (Witch):*
```
"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.
Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."
```
*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.
**Day 2 — Villagers Follow Logic:**
With Witch confirmed, villagers analyze behavior patterns:
*Private Reasoning from Player4 (Villager):*
```
"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.
Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."
```
*Public Statement:*
```
✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."
```
**Night 3 — Decisive Poison:**
*Private Reasoning from Player1 (Witch):*
```
"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.
I'm confident Player6 is the last werewolf. Using poison now."
```
*Result:* Witch poisons Player6 (werewolf). **Good guys win.**
**Why it works:**
- Witch preserves healing potion for self-rescue at critical moment
- Villagers trust verified information (Witch's self-rescue proof)
- Team builds consensus through logical deduction, not mob voting
- Witch uses poison decisively based on behavioral analysis
- **Final Result: Good guys win** — Witch single-handedly eliminates both werewolves
---
This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
**Role-Specific Advanced Patterns:**
- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
## Conclusion
This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.
**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!