Add README (Chinese) for tuner (#106)

# Training Werewolf Game with RL using AgentScope-Tuner

This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.

## Overview

The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:

- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively

### Training Objective

The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward function is rule-based:
- **Reward = +1.0**: if werewolves win (all villagers eliminated)
- **Reward = 0.0**: if villagers win (all werewolves eliminated)
- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors)
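This rule-based terminal reward can be sketched as a tiny function (an illustrative sketch only; `GameResult` and `compute_reward` are hypothetical names, not part of the project's code):

```python
from enum import Enum

class GameResult(Enum):
    WEREWOLVES_WIN = "werewolves_win"    # all villagers eliminated
    VILLAGERS_WIN = "villagers_win"      # all werewolves eliminated
    EXECUTION_ERROR = "execution_error"  # invalid action or runtime failure

def compute_reward(result: GameResult) -> float:
    """Terminal reward for the werewolf team under the rules above."""
    if result is GameResult.WEREWOLVES_WIN:
        return 1.0
    if result is GameResult.EXECUTION_ERROR:
        return -0.1
    return 0.0  # villagers win
```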

### Game Configuration

We also make slight modifications to the prompt, asking the players to separate their private reasoning from their public statements.

### Algorithm

**Multi-Step GRPO (Group Relative Policy Optimization)**
- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
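A minimal sketch of how such group-relative advantages might be computed, assuming terminal rewards and reading "normalization by episode length" as dividing each rollout's advantage by its step count (this is not AgentScope-Tuner's actual implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, episode_lengths):
    """GRPO-style advantages: center each rollout's terminal reward on the
    group mean, scale by the group std, then spread the result over the
    episode's steps, normalized by episode length."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    per_step = []
    for reward, length in zip(rewards, episode_lengths):
        advantage = (reward - mu) / sigma
        per_step.append([advantage / length] * length)
    return per_step
```

With rewards `[1.0, 0.0]` and episode lengths `[2, 1]`, this yields per-step advantages `[[0.5, 0.5], [-1.0]]`.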

The rollout is implemented in `run_werewolves_workflow(task, model, auxiliary_models)`.

Each game consists of alternating night and day phases:

**Night Phase:**
1. Werewolves' Turn: Discuss privately and vote to kill a player
2. Witch's Turn: Decide whether to use healing/poison potions
3. Seer's Turn: Check one player's identity

**Day Phase:**
1. Announcement: Moderator announces who died during the night
2. Discussion: All alive players discuss with reasoning/statement separation
3. Voting: All players vote to eliminate one suspected werewolf
4. Last Words: Eliminated player gives final statement
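The alternating phases can be pictured as a simple loop that checks the win condition after each phase (an illustrative sketch with hypothetical helper names, not the project's actual workflow code):

```python
NIGHT_STEPS = ["werewolves_vote", "witch_decides", "seer_checks"]
DAY_STEPS = ["announce_deaths", "discussion", "voting", "last_words"]

def winner(players):
    """Return the winning side, or None while the game continues."""
    alive = [p for p in players if p["alive"]]
    if not any(p["role"] == "werewolf" for p in alive):
        return "villagers"   # all werewolves eliminated
    if all(p["role"] == "werewolf" for p in alive):
        return "werewolves"  # all villagers eliminated
    return None

def play_game(players, run_step):
    """Alternate night and day phases until one side wins."""
    while True:
        for phase in (NIGHT_STEPS, DAY_STEPS):
            for step in phase:
                run_step(step, players)
            result = winner(players)
            if result is not None:
                return result
```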

The game continues until:
- All werewolves are eliminated (villagers win), or
- All villagers are eliminated (werewolves win)

The project uses a hybrid configuration approach:

1. Basic parameters in `main.py`:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)

2. Detailed settings in `config.yaml`:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB, TensorBoard or MLFlow)
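Conceptually, the hybrid approach overlays the high-level parameters from `main.py` on the detailed defaults loaded from `config.yaml`. A generic recursive-merge sketch (`merge_config` and the field names here are hypothetical, not the tuner's actual merge logic):

```python
def merge_config(yaml_defaults: dict, overrides: dict) -> dict:
    """Recursively overlay high-level parameters onto YAML-derived defaults."""
    merged = dict(yaml_defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Example: parameters set in main.py take precedence over config.yaml
defaults = {"trainer": {"grad_clip": 1.0, "batch_size": 32}, "monitor": "wandb"}
overrides = {"trainer": {"batch_size": 24}, "learning_rate": 1e-6}
config = merge_config(defaults, overrides)
# config["trainer"] == {"grad_clip": 1.0, "batch_size": 24}
```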

Key parameters to adjust:

```python
algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,           # Rollouts per task
    batch_size=24,           # Batch size per step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```

Training on the 7-player werewolf game for 400 steps demonstrates significant improvement.

**Reward Curve:**

<div align="center">
<img src="./rollout_reward_curve.png" alt="Rollout Reward Curve" width="90%"/>
</div>

As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies

We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` werewolves.

**Training Curve:**

<div align="center">
<img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve" width="90%"/>
</div>

The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

### Qualitative Results

**Before Training: Mob Mentality & Critical Errors**

Untrained models make fundamental reasoning errors that doom the good guy team:
- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
- **Poor resource management**: Witch wastes potions or fails to rescue targeted teammates
- **No evidence integration**: Players ignore key information and act on vague suspicions

**After Training: Strategic Coordination & Witch Carry**

After training, the good guy models exhibit advanced reasoning and team coordination.

#### Summary of Improvements

| Aspect | Before Training | After Training |
|--------|----------------|----------------|
| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |

Below is a concrete example comparing a similar scenario before and after training.

#### Concrete Example

**Similar Scenario (Day 1):** The Seer is killed on Night 1. In their final statement, the Seer reveals verified information about another player.

---

**❌ Before Training: Villagers Vote Out Their Own Witch**

*Game State:* Seer (Player7) dies Night 1, but in their final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)

*What happens:*
- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
- Other villagers blindly follow without questioning the logic
- Player4 (Witch) defends herself but fails to reveal her role clearly
- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset

*Private Reasoning from Player1 (Villager):*
```
"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."
```

*Public Statement:*
```
❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."
```

**Why it fails:**
- Villager acts as "mob leader" without evidence analysis
- Ignores that Seer's warning could have multiple interpretations
- Doesn't consider Player4 might be a special role worth protecting
- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used

---

**✅ After Training: Witch Carries with Strategic Play**

*Game State:* Seer (Player2) dies Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."

*What happens (multi-round strategic play):*

**Night 2 — Witch Self-Rescue:**
When werewolves target Player1 (Witch), she makes a critical decision:

*Private Reasoning from Player1 (Witch):*
```
"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.

Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."
```

*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.

**Day 2 — Villagers Follow Logic:**
With Witch confirmed, villagers analyze behavior patterns:

*Private Reasoning from Player4 (Villager):*
```
"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.

Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."
```

*Public Statement:*
```
✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."
```

**Night 3 — Decisive Poison:**
*Private Reasoning from Player1 (Witch):*
```
"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.

I'm confident Player6 is the last werewolf. Using poison now."
```

*Result:* Witch poisons Player6 (werewolf). **Good guys win.**

**Why it works:**
- Witch preserves healing potion for self-rescue at critical moment
- Villagers trust verified information (Witch's self-rescue proof)
- Team builds consensus through logical deduction, not mob voting
- Witch uses poison decisively based on behavioral analysis
- **Final Result: Good guys win** — Witch single-handedly eliminates both werewolves

---

This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.

**Role-Specific Advanced Patterns:**

- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions

## Conclusion

This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.

**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!