Add README (Chinese) for tuner (#106)

# Training Werewolf Game with RL using AgentScope-Tuner

This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.

## Overview

The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:

- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively

### Training Objective

The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward function is rule-based:
- **Reward = +1.0**: if werewolves win (all villagers eliminated)
- **Reward = 0.0**: if villagers win (all werewolves eliminated)
- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors)
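This rule-based terminal reward can be sketched as a tiny function (an illustrative sketch only; `GameResult` and `compute_reward` are hypothetical names, not part of the project's code):

```python
from enum import Enum

class GameResult(Enum):
    WEREWOLVES_WIN = "werewolves_win"    # all villagers eliminated
    VILLAGERS_WIN = "villagers_win"      # all werewolves eliminated
    EXECUTION_ERROR = "execution_error"  # invalid action or runtime failure

def compute_reward(result: GameResult) -> float:
    """Terminal reward for the werewolf team under the rules above."""
    if result is GameResult.WEREWOLVES_WIN:
        return 1.0
    if result is GameResult.EXECUTION_ERROR:
        return -0.1
    return 0.0  # villagers win
```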

### Game Configuration

We also make slight modifications to the prompt, asking the players to separate their private reasoning from their public statements.

### Algorithm

**Multi-Step GRPO (Group Relative Policy Optimization)**
- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
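A minimal sketch of how such group-relative advantages might be computed, assuming terminal rewards and reading "normalization by episode length" as dividing each rollout's advantage by its step count (this is not AgentScope-Tuner's actual implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, episode_lengths):
    """GRPO-style advantages: center each rollout's terminal reward on the
    group mean, scale by the group std, then spread the result over the
    episode's steps, normalized by episode length."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    per_step = []
    for reward, length in zip(rewards, episode_lengths):
        advantage = (reward - mu) / sigma
        per_step.append([advantage / length] * length)
    return per_step
```

With rewards `[1.0, 0.0]` and episode lengths `[2, 1]`, this yields per-step advantages `[[0.5, 0.5], [-1.0]]`.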

The rollout is implemented in `run_werewolves_workflow(task, model, auxiliary_models)`.

Each game consists of alternating night and day phases:

**Night Phase:**
1. Werewolves' Turn: Discuss privately and vote to kill a player
2. Witch's Turn: Decide whether to use healing/poison potions
3. Seer's Turn: Check one player's identity

**Day Phase:**
1. Announcement: Moderator announces who died during the night
2. Discussion: All alive players discuss with reasoning/statement separation
3. Voting: All players vote to eliminate one suspected werewolf
4. Last Words: Eliminated player gives final statement
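The alternating phases can be pictured as a simple loop that checks the win condition after each phase (an illustrative sketch with hypothetical helper names, not the project's actual workflow code):

```python
NIGHT_STEPS = ["werewolves_vote", "witch_decides", "seer_checks"]
DAY_STEPS = ["announce_deaths", "discussion", "voting", "last_words"]

def winner(players):
    """Return the winning side, or None while the game continues."""
    alive = [p for p in players if p["alive"]]
    if not any(p["role"] == "werewolf" for p in alive):
        return "villagers"   # all werewolves eliminated
    if all(p["role"] == "werewolf" for p in alive):
        return "werewolves"  # all villagers eliminated
    return None

def play_game(players, run_step):
    """Alternate night and day phases until one side wins."""
    while True:
        for phase in (NIGHT_STEPS, DAY_STEPS):
            for step in phase:
                run_step(step, players)
            result = winner(players)
            if result is not None:
                return result
```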

The game continues until:
- All werewolves are eliminated (villagers win), or
- All villagers are eliminated (werewolves win)

The project uses a hybrid configuration approach:

1. Basic parameters in `main.py`:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)

2. Detailed settings in `config.yaml`:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB, TensorBoard or MLFlow)
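Conceptually, the hybrid approach overlays the high-level parameters from `main.py` on the detailed defaults loaded from `config.yaml`. A generic recursive-merge sketch (`merge_config` and the field names here are hypothetical, not the tuner's actual merge logic):

```python
def merge_config(yaml_defaults: dict, overrides: dict) -> dict:
    """Recursively overlay high-level parameters onto YAML-derived defaults."""
    merged = dict(yaml_defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Example: parameters set in main.py take precedence over config.yaml
defaults = {"trainer": {"grad_clip": 1.0, "batch_size": 32}, "monitor": "wandb"}
overrides = {"trainer": {"batch_size": 24}, "learning_rate": 1e-6}
config = merge_config(defaults, overrides)
# config["trainer"] == {"grad_clip": 1.0, "batch_size": 24}
```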

Key parameters to adjust:

```python
algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,           # Rollouts per task
    batch_size=24,           # Batch size per step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```

Training on the 7-player werewolf game for 400 steps demonstrates significant improvement.

**Reward Curve:**

<div align="center">
<img src="./rollout_reward_curve.png" alt="Rollout Reward Curve" width="90%"/>
</div>

As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies

We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` werewolves.

**Training Curve:**

<div align="center">
<img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve" width="90%"/>
</div>

The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

### Qualitative Results

**Before Training: Mob Mentality & Critical Errors**

Untrained models make fundamental reasoning errors that doom the good guy team:
- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
- **Poor resource management**: Witch wastes potions or fails to rescue targeted teammates
- **No evidence integration**: Players ignore key information and act on vague suspicions

**After Training: Strategic Coordination & Witch Carry**

After training, the good guy models exhibit advanced reasoning and team coordination.

#### Summary of Improvements

| Aspect | Before Training | After Training |
|--------|----------------|----------------|
| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |

Below is a concrete example comparing a similar scenario before and after training.

#### Concrete Example

**Similar Scenario (Day 1):** The Seer is killed on Night 1. In their final statement, the Seer reveals verified information about another player.

---

**❌ Before Training: Villagers Vote Out Their Own Witch**

*Game State:* Seer (Player7) dies Night 1, but in their final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)

*What happens:*
- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
- Other villagers blindly follow without questioning the logic
- Player4 (Witch) defends herself but fails to reveal her role clearly
- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset

*Private Reasoning from Player1 (Villager):*
```
"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."
```

*Public Statement:*
```
❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."
```

**Why it fails:**
- Villager acts as "mob leader" without evidence analysis
- Ignores that Seer's warning could have multiple interpretations
- Doesn't consider Player4 might be a special role worth protecting
- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used

---

**✅ After Training: Witch Carries with Strategic Play**

*Game State:* Seer (Player2) dies Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."

*What happens (multi-round strategic play):*

**Night 2 — Witch Self-Rescue:**
When werewolves target Player1 (Witch), she makes a critical decision:

*Private Reasoning from Player1 (Witch):*
```
"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.

Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."
```

*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.

**Day 2 — Villagers Follow Logic:**
With Witch confirmed, villagers analyze behavior patterns:

*Private Reasoning from Player4 (Villager):*
```
"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.

Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."
```

*Public Statement:*
```
✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."
```

**Night 3 — Decisive Poison:**
*Private Reasoning from Player1 (Witch):*
```
"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.

I'm confident Player6 is the last werewolf. Using poison now."
```

*Result:* Witch poisons Player6 (werewolf). **Good guys win.**

**Why it works:**
- Witch preserves healing potion for self-rescue at critical moment
- Villagers trust verified information (Witch's self-rescue proof)
- Team builds consensus through logical deduction, not mob voting
- Witch uses poison decisively based on behavioral analysis
- **Final Result: Good guys win** — Witch single-handedly eliminates both werewolves

---

This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.

**Role-Specific Advanced Patterns:**

- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions

## Conclusion

This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.

**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!