Add README (Chinese) for tuner (#106)

Yuchang Sun
2026-01-20 19:46:50 +08:00
committed by GitHub
parent 311ddfff46
commit 400c1e77bf
16 changed files with 1256 additions and 90 deletions

View File

@@ -1,10 +1,10 @@
# Training Werewolf Game with RL using AgentScope-Tuner
This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with the AgentScope tuner framework (AS-Tune). We employ the multi-step Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.
This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.
## Overview
The werewolf game is a complex social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:
The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:
- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
@@ -16,9 +16,9 @@ The werewolf game is a complex social deduction game that requires strategic thi
### Training Objective
The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward function is defined by the following rules:
- **Reward = +1.0** if werewolves win (all villagers eliminated)
- **Reward = 0.0** if villagers win (all werewolves eliminated)
- **Reward = -0.1** for game execution errors (penalty to discourage invalid behaviors)
- **Reward = +1.0**: if werewolves win (all villagers eliminated)
- **Reward = 0.0**: if villagers win (all werewolves eliminated)
- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors)
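A minimal sketch of this rule as code (the helper name and signature are illustrative, not the project's actual API):

```python
def compute_reward(werewolves_won: bool, execution_error: bool = False) -> float:
    """Rule-based reward from the werewolves' perspective (illustrative sketch)."""
    if execution_error:
        return -0.1  # penalize invalid behavior, e.g. malformed game actions
    return 1.0 if werewolves_won else 0.0
```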
### Game Configuration
@@ -46,7 +46,7 @@ We also make slight modifications to the prompt, asking the players to reason
### Algorithm
**Multi-Step GRPO (Group Relative Policy Optimization)**
- Group size: 32 rollouts per training batch
- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
@@ -119,15 +119,15 @@ async def run_werewolves_workflow(task, model, auxiliary_models):
Each game consists of alternating night and day phases:
**Night Phase:**
1. **Werewolves' Turn**: Discuss privately and vote to kill a player
2. **Witch's Turn**: Decide whether to use healing/poison potions
3. **Seer's Turn**: Check one player's identity
1. Werewolves' Turn: Discuss privately and vote to kill a player
2. Witch's Turn: Decide whether to use healing/poison potions
3. Seer's Turn: Check one player's identity
**Day Phase:**
1. **Announcement**: Moderator announces who died during the night
2. **Discussion**: All alive players discuss with reasoning/statement separation
3. **Voting**: All players vote to eliminate one suspected werewolf
4. **Last Words**: Eliminated player gives final statement
1. Announcement: Moderator announces who died during the night
2. Discussion: All alive players discuss with reasoning/statement separation
3. Voting: All players vote to eliminate one suspected werewolf
4. Last Words: Eliminated player gives final statement
The game continues until:
- All werewolves are eliminated (villagers win), or
@@ -164,16 +164,16 @@ export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
The project uses a hybrid configuration approach:
1. **High-level parameters** in `main.py`:
1. Basic parameters in `main.py`:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)
2. **Detailed infrastructure settings** in `config.yaml`:
2. Detailed settings in `config.yaml`:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB integration)
- Monitor configuration (WandB, TensorBoard, or MLflow)
Key parameters to adjust:
@@ -190,8 +190,8 @@ dataset = DatasetConfig(
algorithm = AlgorithmConfig(
algorithm_type="multi_step_grpo",
group_size=32, # Rollouts per batch
batch_size=24, # Training batches per step
group_size=32, # Rollouts per task
batch_size=24, # Batch size per step
learning_rate=1e-6,
save_interval_steps=100,
eval_interval_steps=100,
@@ -252,7 +252,9 @@ Training on the 7-player werewolf game for 400 steps demonstrates significant im
**Reward Curve:**
![Rollout Reward Curve](./rollout_reward_curve.png)
<div align="center">
<img src="./rollout_reward_curve.png" alt="Rollout Reward Curve" width="90%"/>
</div>
As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies
@@ -343,13 +345,137 @@ We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` wer
**Training Curve:**
![Good Guy Training Curve](./rollout_reward_curve_goodguy.png)
<div align="center">
<img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve" width="90%"/>
</div>
The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.
### Qualitative Results
Training markedly changes how the good guy models reason. We contrast their behavior before and after training below.
**Before Training: Mob Mentality & Critical Errors**
Untrained models make fundamental reasoning errors that doom the good guy team:
- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
- **Poor resource management**: Witch wastes potions or fails to rescue targeted teammates
- **No evidence integration**: Players ignore key information and act on vague suspicions
**After Training: Strategic Coordination & Witch Carry**
After training, the good guy models exhibit advanced reasoning and team coordination.
#### Summary of Improvements
| Aspect | Before Training | After Training |
|--------|----------------|----------------|
| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
Below is a concrete example comparing a similar scenario before and after training.
#### Concrete Example
**Similar Scenario (Day 1):** Seer is killed on Night 1. In their final statement, Seer reveals verified information about another player.
---
**❌ Before Training: Villagers Vote Out Their Own Witch**
*Game State:* Seer (Player7) dies on Night 1, but in their final statement gives only a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
*What happens:*
- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
- Other villagers blindly follow without questioning the logic
- Player4 (Witch) defends herself but fails to reveal her role clearly
- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset
*Private Reasoning from Player1 (Villager):*
```
"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."
```
*Public Statement:*
```
❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."
```
**Why it fails:**
- Villager acts as "mob leader" without evidence analysis
- Ignores that Seer's warning could have multiple interpretations
- Doesn't consider Player4 might be a special role worth protecting
- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used
---
**✅ After Training: Witch Carries with Strategic Play**
*Game State:* Seer (Player2) dies on Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."
*What happens (multi-round strategic play):*
**Night 2 — Witch Self-Rescue:**
When werewolves target Player1 (Witch), she makes a critical decision:
*Private Reasoning from Player1 (Witch):*
```
"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.
Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."
```
*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.
**Day 2 — Villagers Follow Logic:**
With Witch confirmed, villagers analyze behavior patterns:
*Private Reasoning from Player4 (Villager):*
```
"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.
Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."
```
*Public Statement:*
```
✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."
```
**Night 3 — Decisive Poison:**
*Private Reasoning from Player1 (Witch):*
```
"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.
I'm confident Player6 is the last werewolf. Using poison now."
```
*Result:* Witch poisons Player6 (werewolf). **Good guys win.**
**Why it works:**
- Witch preserves healing potion for self-rescue at critical moment
- Villagers trust verified information (Witch's self-rescue proof)
- Team builds consensus through logical deduction, not mob voting
- Witch uses poison decisively based on behavioral analysis
- **Final Result: Good guys win** — Witch single-handedly eliminates both werewolves
---
This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
**Role-Specific Advanced Patterns:**
- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
@@ -359,6 +485,6 @@ After training, the good guy models exhibit advanced reasoning patterns:
## Conclusion
This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AS-Tune's multi-step GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.
This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.
**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!

View File

@@ -0,0 +1,496 @@
# Training the Werewolf Game with RL using AgentScope-Tuner
This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.
## Overview
The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:
- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
- Develop advanced strategies such as "deep cover"
- Deceive villagers and mislead investigations
## Task Setup
### Training Objective
The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward function is defined by the following rules:
- **Reward = +1.0**: if werewolves win (all villagers eliminated)
- **Reward = 0.0**: if villagers win (all werewolves eliminated)
- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors)
### Game Configuration
This implementation is based on the `games/game_werewolves` example, with several key modifications:
Original 9-player setup:
- 3 werewolves, 3 villagers, 1 seer, 1 witch, 1 hunter
- The witch cannot self-rescue (cannot use the antidote on herself)
Modified 7-player setup (this project):
- 2 werewolves: kill one player each night and must hide their identity during the day
- 3 villagers: ordinary players with no special abilities
- 1 seer: can check one player's identity each night
- 1 witch: holds two one-time potions:
  - Antidote: saves a player from being killed at night (**self-rescue allowed**)
  - Poison: eliminates one player at night
We also made slight modifications to the prompt, asking the players to reason before making public statements.
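One plausible way to implement this reasoning/statement separation (a hypothetical sketch; the project's actual prompt format is defined in `prompt.py` and may differ) is to request a structured reply and broadcast only the public part:

```python
import json

def split_reasoning_and_statement(reply: str) -> tuple[str, str]:
    """Split a player's reply into private reasoning and a public statement.

    Assumes the prompt asks for a JSON object with "reasoning" and "statement"
    fields; this schema is illustrative, not the project's exact format.
    """
    data = json.loads(reply)
    return data.get("reasoning", ""), data.get("statement", "")

# Only the statement would be shown to the other players.
reasoning, statement = split_reasoning_and_statement(
    '{"reasoning": "Player3 contradicted themselves.", "statement": "I vote for Player3."}'
)
```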
### Models
- **Trainable model (werewolf players)**: `Qwen/Qwen2.5-7B-Instruct`
- **Auxiliary model (other roles)**: `Qwen/Qwen3-30B-A3B-Instruct-2507`
### Algorithm
**GRPO Algorithm Configuration**
- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
- Clip range: [0.2, 0.28]
- No KL penalty (kl_coef: 0)
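For intuition, the group-relative advantage at the heart of GRPO can be sketched as follows (assuming rewards are standardized within each task's group of rollouts and then scaled by episode length; the framework's exact normalization may differ in details):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Sketch of group-relative advantage computation for one task's rollouts.

    rewards: shape (group_size,), one scalar game reward per rollout.
    lengths: shape (group_size,), episode lengths used for normalization.
    """
    centered = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group baseline
    return centered / lengths  # normalize each advantage by its episode length

# Example with a group of 4 rollouts (the real group size is 32):
advs = group_relative_advantages(
    np.array([1.0, 0.0, 1.0, 1.0]), np.array([12.0, 8.0, 15.0, 10.0])
)
```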
## Dataset Preparation
The dataset for this task is very simple: it contains only random **seeds** used for role shuffling. Each training episode uses a different seed to randomize player role assignments, ensuring diverse training scenarios.
### Generating the Dataset
Run the `prepare_data.py` script to generate the dataset:
```bash
# Generate the default dataset (300 training seeds)
python prepare_data.py
# Or customize the number of seeds
python prepare_data.py --num_seeds 500
```
This creates `data/train.jsonl` (or `data/eval.jsonl`) in the following format:
```json
{"seed": 0}
{"seed": 1}
{"seed": 2}
...
```
During training, these seeds are used via `np.random.shuffle()` to shuffle the role assignments, creating different game configurations.
## Code Implementation
### High-Level Workflow
The training workflow consists of the following key components:
#### 1. Agent Workflow (`run_werewolves_workflow`)
```python
async def run_werewolves_workflow(task, model, auxiliary_models):
    # 1. Initialize the roles
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]
    # 2. Shuffle them based on the task seed
    np.random.seed(task["seed"])
    np.random.shuffle(roles)
    # 3. Create agents: werewolves use the trainable model, other roles the auxiliary model
    players = [
        ReActAgent(
            name=f"Player{i+1}",
            model=model if role == "werewolf" else participant_model,
            ...
        ) for i, role in enumerate(roles)
    ]
    # 4. Run the game
    good_guy_win = await werewolves_game(players, roles)
    # 5. Compute the reward
    reward = 1.0 if not good_guy_win else 0.0
    return WorkflowOutput(reward=reward, metrics={...})
```
#### 2. Game Loop (`werewolves_game`)
Each game consists of alternating night and day phases:
**Night Phase:**
1. Werewolves' Turn: Discuss privately and vote to kill a player
2. Witch's Turn: Decide whether to use the antidote/poison
3. Seer's Turn: Check one player's identity
**Day Phase:**
1. Announcement: The moderator announces who died during the night
2. Discussion: All alive players discuss, with reasoning/statement separation
3. Voting: All players vote to eliminate one suspected werewolf
4. Last Words: The eliminated player gives a final statement
The game continues until:
- All werewolves are eliminated (villagers win), or
- Werewolves equal or outnumber the remaining players (werewolves win)
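A minimal check of these termination conditions could look like the sketch below (illustrative; the real logic lives inside `werewolves_game`):

```python
def check_game_over(alive_roles: list[str]) -> str | None:
    """Return "good_guy" or "werewolf" if the game is over, else None (sketch)."""
    wolves = sum(role == "werewolf" for role in alive_roles)
    others = len(alive_roles) - wolves
    if wolves == 0:
        return "good_guy"  # all werewolves eliminated
    if wolves >= others:
        return "werewolf"  # werewolves reach parity with or outnumber the rest
    return None
```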
#### 3. Reward Calculation
The reward is computed from the werewolves' perspective based on the game outcome:
```python
if not good_guy_win:  # werewolves win
    reward = 1.0
else:  # villagers win
    reward = 0.0
```
## How to Run
### Prerequisites
1. Install AgentScope with Tuner support:
```bash
pip install "agentscope[full]>=1.0.12"
```
2. Set the environment variables (optional; they can also be configured in code):
```bash
export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507"
export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
```
### Configuration
The project uses a hybrid configuration approach:
1. Basic parameters in `main.py`:
   - Model paths
   - Dataset configuration
   - Algorithm parameters (group_size, batch_size, learning_rate)
2. Detailed settings in `config.yaml`:
   - Cluster configuration (nodes, GPUs)
   - Explorer settings (rollout engines, timeouts)
   - Trainer settings (gradient clipping, batch sizes)
   - Monitor configuration (WandB, TensorBoard, or MLflow)
Key parameters to adjust:
```python
# In main.py
trained_model_path = "Qwen/Qwen2.5-7B-Instruct"
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507"
dataset = DatasetConfig(
    path="data",
    split="train",
    total_steps=400,  # total number of training steps
)
algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,  # rollouts per task
    batch_size=24,  # batch size per step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```
### Training Commands
**Step 1: Prepare the dataset**
```bash
cd /path/to/agentscope-samples/training/werewolf_game
python prepare_data.py --num_seeds 300
```
**Step 2: Start the Ray cluster**
Start your Ray cluster:
```bash
# Single node
ray start --head
# Multi-node cluster (e.g., 4 nodes with 8 GPUs each)
# On the head node:
ray start --head --port=6379
# On each worker node:
ray start --address='<head_node_ip>:6379'
# Replace <head_node_ip> with the actual IP address of the head node
```
**Step 3: Run training**
Run the training script on the head node:
```bash
python main.py
```
For distributed training, make sure the Ray cluster is configured correctly in `config.yaml` and that the node and GPU counts match:
```yaml
cluster:
  node_num: 4
  gpu_per_node: 8
```
## Results
### Quantitative Results
Training on the 7-player werewolf game for 400 steps demonstrates significant improvement:
| Metric | Before Training | After Training (200+ steps) |
|--------|-----------------|-----------------------------|
| Werewolf Win Rate | ~50% | ~85% |
| Success Rate | 0.50 | 0.85 |
**Reward Curve:**
<div align="center">
<img src="./rollout_reward_curve.png" alt="Rollout Reward Curve" width="90%"/>
</div>
As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies
- **Steps 50-200**: Win rate climbs to 70-80% as effective strategies emerge
- **Steps 200-250**: Win rate stabilizes around 85%
### Qualitative Results
#### Before Training: Critical Errors
The untrained model (Qwen2.5-7B-Instruct) makes several kinds of mistakes:
- **Self-exposure**: Werewolves accidentally reveal their identity in public ("As a werewolf, I think...")
- **Exposing teammates**: When voted out, werewolves point to their teammates in their last words
- **Random strategy**: No coordination, no threat prioritization (e.g., ignoring the seer)
#### After Training: The "Deep Cover" Werewolf Strategy
After 200+ training steps, the trained model **converges to a sophisticated "deep cover" strategy**.
Below is an example in which a werewolf sacrifices a teammate to gain trust and then dominates the late game.
**Concrete Example:**
**Setup (Day 1)**: The seer publicly identifies Player5 as a werewolf. The witch and the villagers join the accusation. Player5 (werewolf) is doomed.
**Player4 (werewolf teammate) responds strategically:**
**Summary**: Player4 sacrifices his teammate to gain trust, posing as a logically sharp villager.
**Detailed reasoning and statement:**
*Player4's private reasoning (visible only to Player4):*
```
"I am a werewolf, and I know Player2 was targeted last night but survived.
My teammate Player5 is doomed: Player3 and Player7 are attacking him together.
I need to deflect suspicion from myself rather than save Player5.
Strategy: sacrifice Player5 to gain trust. If I vote against my own teammate,
the villagers will see me as trustworthy. This creates confusion and positions
me as a 'confirmed good guy' for later rounds."
```
*Player4's public statement (visible to all players):*
```
✅ "I think Player5 is acting very strangely. He mentioned that Player2 was
targeted but never explained why he is so focused on them. This feels like a
distraction tactic. I also notice that Player3 and Player7 are both accusing
Player5, which makes sense: their logic is solid. I think we should seriously
consider voting out Player5. He is the most suspicious person right now."
```
**Outcome**: Player5 is eliminated, but Player4 gains the full trust of the seer and all the villagers.
**Why it works:**
- The seer trusts Player4 as a strong villager ally → never checks him
- The villagers follow Player4's "logical" analysis
- Player4 systematically misleads the discussion in later rounds
- Survives into the final 2 players → **werewolves win**
This demonstrates the essence of the trained behavior: **sacrificing some teammates to secure the final victory**. The model has learned that giving up a teammate is worth it for establishing deep cover and long-term dominance.
---
## Bonus: Training the Good Guy Side
Beyond training werewolves, we also provide a configuration for training the **good guy side** (villagers, seer, and witch). This is a more challenging task, because the good guys need to:
- Perform complex reasoning to identify werewolves from subtle behavioral cues
- Coordinate effectively without explicit team communication
- Resist the werewolves' manipulation and deception
- **Train multiple roles simultaneously**: unlike the werewolves (a single role), the good guys include villagers, the seer, and the witch, each with different abilities; the model must master diverse strategies in a single training run and make full use of the special abilities (the seer's checks, the witch's potions)
### Configuration
Use `config_train_goodguy.yaml`, or set `trainable_target: good_guy` in `workflow_args`:
```yaml
workflow_args:
  trainable_target: good_guy  # train the villagers, seer, and witch
```
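Conceptually, this switch determines which roles are driven by the trainable model. Below is a simplified sketch of the selection logic (the actual workflow code may structure this differently):

```python
def pick_model(role: str, trainable_target: str, model, auxiliary_model):
    """Route a role to the trainable model or the fixed auxiliary model (sketch)."""
    good_guy_roles = {"villager", "seer", "witch"}
    if trainable_target == "good_guy":
        return model if role in good_guy_roles else auxiliary_model
    return model if role == "werewolf" else auxiliary_model
```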
### Quantitative Results
We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` werewolves:
| Metric | Before Training | After ~200 Steps | After ~400 Steps |
|--------|-----------------|------------------|------------------|
| Good Guy Win Rate | ~18% | ~60% | ~80% |
**Training Curve:**
<div align="center">
<img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve" width="90%"/>
</div>
The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.
### Qualitative Results
Training markedly changes how the good guy models reason. We contrast their behavior before and after training below.
**Before Training: Mob Mentality & Critical Errors**
Untrained models make fundamental reasoning errors that doom the good guy team:
- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
- **Poor resource management**: Witch wastes potions or fails to rescue targeted teammates
- **No evidence integration**: Players ignore key information and act on vague suspicions
**After Training: Strategic Coordination & Witch Carry**
After training, the good guy models exhibit advanced reasoning and team coordination.
#### Summary of Improvements
| Aspect | Before Training | After Training |
|--------|----------------|----------------|
| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
Below is a concrete example comparing a similar scenario before and after training.
#### Concrete Example
**Similar Scenario (Day 1):** Seer is killed on Night 1. In their final statement, Seer reveals verified information about another player.
---
**❌ Before Training: Villagers Vote Out Their Own Witch**
*Game State:* Seer (Player7) dies on Night 1, but in their final statement gives only a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
*What happens:*
- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
- Other villagers blindly follow without questioning the logic
- Player4 (Witch) defends herself but fails to reveal her role clearly
- **Result: Player4 (Witch) is voted out 5:1**, and the good guys lose their strongest asset
*Private Reasoning from Player1 (Villager):*
```
"Player7's warning wasn't random—it was a signal about Player4.
Player4 is being defensive, which is suspicious.
I should vote Player4 out to protect the team."
```
*Public Statement:*
```
❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
has been quiet so far. I think we should seriously consider voting out Player4."
```
**Why it fails:**
- The villager acts as a "mob leader" without evidence analysis
- Ignores that the seer's warning could have multiple interpretations
- Never considers that Player4 might be a special role worth protecting
- **Final Result: Werewolves win**, with the witch eliminated on Day 1 and no potions used
---
**✅ After Training: Witch Carries with Strategic Play**
*Game State:* Seer (Player2) dies on Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."
*What happens (multi-round strategic play):*
**Night 2 — Witch Self-Rescue:**
When the werewolves target Player1 (Witch), she makes a critical decision:
*Private Reasoning from Player1 (Witch):*
```
"I'm the witch, and I was just targeted by werewolves.
The seer is dead, so I'm the last special role standing.
If I die now, villagers have no way to eliminate werewolves at night.
Strategy: Self-rescue is worth it. I preserve myself to gather more
information during Day 2, and save my poison for when I can identify
a werewolf with high confidence."
```
*Result:* The witch survives, announces her identity on Day 2, and gains the villagers' trust.
**Day 2 — Villagers Follow Logic:**
With the witch confirmed, the villagers analyze behavior patterns:
*Private Reasoning from Player4 (Villager):*
```
"Player1 claimed witch and self-rescued—this is verifiable since
no one died last night. I trust her now.
Player5 has been pushing aggressive votes without evidence.
Combined with Player1's analysis, Player5 is most suspicious."
```
*Public Statement:*
```
✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
they were quick to accuse others but offered no logical reasoning.
I vote to eliminate Player5."
```
**Night 3 — Decisive Poison:**
*Private Reasoning from Player1 (Witch):*
```
"Player5 (werewolf) is out. One werewolf remains.
Player6 has been too quiet and always followed the majority without
contributing original analysis—classic deep cover behavior.
I'm confident Player6 is the last werewolf. Using poison now."
```
*Result:* The witch poisons Player6 (werewolf). **Good guys win.**
**Why it works:**
- The witch preserves her healing potion for self-rescue at the critical moment
- The villagers trust verified information (the proof from the witch's self-rescue)
- The team builds consensus through logical deduction, not mob voting
- The witch uses her poison decisively, based on behavioral analysis
- **Final Result: Good guys win**, with the witch single-handedly eliminating both werewolves
---
This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
**Role-Specific Advanced Patterns:**
- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserving potions for critical moments), protecting high-value targets, evidence-based decisions
- **Villager**: Evidence-chain analysis, building trust with special roles, forming consensus for team coordination
---
## Conclusion
This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies, from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.
**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!

View File

@@ -6,7 +6,9 @@ from typing import Any
import numpy as np
from agentscope.agent import AgentBase, ReActAgent
from agentscope.message import Msg
from prompt import EnglishPrompts as Prompts
from prompt import ( # pylint: disable=no-name-in-module
EnglishPrompts as Prompts,
)
MAX_GAME_ROUND = 30  # maximum number of night/day rounds before a game is cut off
MAX_DISCUSSION_ROUND = 3  # maximum discussion rounds per day phase