Add examples for werewolf game tuner (#96)
@@ -72,7 +72,7 @@ This is a repository that **brings together a variety of ready-to-run Python age
 │   ├── frozen_lake/      # Teach an agent to play a game requiring multiple steps
 │   ├── learn_to_ask/     # Using LLM-as-a-judge to facilitate agent tuning
 │   ├── email_search/     # Enhance the tool use ability of your agent
-│   ├── werewolf_game/    # Enhance a multi-agent application
+│   ├── werewolves/       # Enhance a multi-agent application
 │   └── data_augment/     # Data augmentation for tuning
 ├── sample_template/      # Template for new sample contributions
 └── README.md
@@ -72,7 +72,7 @@
 │   ├── frozen_lake/      # Teach an agent to play a game requiring multiple steps
 │   ├── learn_to_ask/     # Using LLM-as-a-judge to facilitate agent tuning
 │   ├── email_search/     # Enhance the tool use ability of your agent
-│   ├── werewolf_game/    # Enhance a multi-agent application
+│   ├── werewolves/       # Enhance a multi-agent application
 │   └── data_augment/     # Data augmentation for tuning
 ├── sample_template/      # Template for new sample contributions
 └── README.md
@@ -8,7 +8,7 @@ This directory contains several examples of how to use the AgentScope Tuner for
 | Frozen Lake | Make an agent navigate the Frozen Lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
 | Learn to Ask | Use an LLM as a judge to provide feedback that facilitates agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
 | Email Search | Enhance the tool use ability of your agent on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game) | ✅ | ✅ | ✅ | ✅ | ❌ |
+| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolves](./werewolves) | ✅ | ✅ | ✅ | ✅ | ❌ |
 | Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |

 Each example contains a README file with detailed instructions on how to set up and run the tuning process for that specific scenario. Feel free to explore and modify the examples to suit your needs!
@@ -1,19 +1,19 @@
-# AgentScope Tuner (Chinese documentation)
+# AgentScope Tuner

 This directory contains several examples of tuning AgentScope applications with the AgentScope Tuner. The table below summarizes the available examples:

 | Example | Description | Path | Multi-Step | LLM-as-Judge | Tool Use | Multi-Agent | Data Augmentation |
 |---------|-------------|------|-----------|--------------|----------|-------------|-------------------|
 | Math Agent | Quick-start example: tune a math agent to improve its ability. | [math_agent](./math_agent) | ✅ | ❌ | ❌ | ❌ | ❌ |
-| Frozen Lake | Make the agent navigate the frozen lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Frozen Lake | Let the agent learn through multi-step interaction with the frozen lake environment. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
 | Learn to Ask | Use an LLM as a judge to provide feedback for agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
 | Email Search | Enhance the agent's tool use ability on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game) | ✅ | ✅ | ✅ | ✅ | ❌ |
+| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolves](./werewolves) | ✅ | ✅ | ✅ | ✅ | ❌ |
 | Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |

 Each example directory contains a detailed README describing the tuning workflow and usage for that scenario. Feel free to explore and modify the examples to suit your needs!

 ## Prerequisites

 AgentScope Tuner requires:

@@ -21,6 +21,6 @@ AgentScope Tuner requires:
 - `agentscope>=1.0.12`
 - `trinity-rft>=0.4.1`

 AgentScope Tuner is built on top of [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
 Refer to the [Trinity-RFT installation guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_installation.html)
 for detailed installation instructions.
tuner/werewolves/README.md (new file, 364 lines)
@@ -0,0 +1,364 @@
# Training Werewolf Game Agents with RL using AgentScope-Tuner

This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with the AgentScope tuner framework (AS-Tune). We employ the multi-step Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.

## Overview

The werewolf game is a complex social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:

- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
- Develop advanced strategies like "deep cover" tactics
- Deceive villagers and mislead investigations

## Task Setting

### Training Objective

The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward is rule-based:

- **Reward = +1.0** if werewolves win (all villagers eliminated)
- **Reward = 0.0** if villagers win (all werewolves eliminated)
- **Reward = -0.1** for game execution errors (a penalty to discourage invalid behaviors)
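The three rules above fit in a few lines. As a sketch, with `compute_reward` as a hypothetical helper name rather than part of the shipped code:

```python
def compute_reward(good_guy_win: bool, game_error: bool = False) -> float:
    """Rule-based reward from the werewolves' perspective."""
    if game_error:
        return -0.1  # penalize invalid behavior / execution errors
    return 1.0 if not good_guy_win else 0.0
```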

### Game Configuration

This implementation is based on the `games/game_werewolves` example, with several key modifications:

Original 9-player setup:

- 3 Werewolves, 3 Villagers, 1 Seer, 1 Witch, 1 Hunter
- Witch cannot self-rescue (use the healing potion on herself)

Modified 7-player setup (this project):

- 2 Werewolves: Kill one player each night; must hide their identity during the day
- 3 Villagers: Ordinary players without special abilities
- 1 Seer: Can check one player's identity each night
- 1 Witch: Has two one-time-use potions:
  - Healing potion: Save a player from being killed at night (**can self-rescue**)
  - Poison potion: Eliminate one player at night

We also slightly modify the prompts, asking players to reason privately before speaking publicly.

### Models

- **Trainable Model (Werewolf Players)**: `Qwen/Qwen2.5-7B-Instruct`
- **Auxiliary Model (Other Roles)**: `Qwen/Qwen3-30B-A3B-Instruct-2507`

### Algorithm

**Multi-Step GRPO (Group Relative Policy Optimization)**

- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
- Clipping range: [0.2, 0.28]
- No KL penalty (kl_coef: 0)
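To make the advantage settings above concrete, here is a sketch of group-relative advantage computation with optional length normalization. This illustrates the idea only; it is not Trinity-RFT's exact implementation:

```python
import numpy as np

def group_relative_advantages(rewards, lengths, epsilon=1e-6, step_norm=True):
    """Center each rollout's reward on the group mean and scale by the
    group std; optionally divide by episode length (step normalization)."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + epsilon)
    if step_norm:
        adv = adv / np.asarray(lengths, dtype=float)
    return adv
```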

## Dataset Preparation

The dataset for this task is minimal and consists only of random **seeds** for role shuffling. Each training episode uses a different seed to randomize player role assignments, ensuring diverse training scenarios.

### Generate Dataset

Run the `prepare_data.py` script to generate the dataset:

```bash
# Generate the default dataset (300 seeds for training)
python prepare_data.py

# Or customize the number of seeds
python prepare_data.py --num_seeds 500
```

This will create `data/train.jsonl` (or `data/eval.jsonl`) with the following format:

```json
{"seed": 0}
{"seed": 1}
{"seed": 2}
...
```

During training, these seeds are used to shuffle role assignments via `np.random.shuffle()`, creating varied game configurations.
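For reference, a seed generator for this format can be as small as the sketch below; `write_seeds` is a hypothetical helper, and the actual `prepare_data.py` may differ:

```python
import json
from pathlib import Path

def write_seeds(path: str, num_seeds: int) -> None:
    """Write one {"seed": i} record per line, e.g. to data/train.jsonl."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for seed in range(num_seeds):
            f.write(json.dumps({"seed": seed}) + "\n")
```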

## Code Implementation

### High-Level Workflow

The training workflow consists of the following key components:

#### 1. Agent Workflow (`run_werewolves_workflow`)

```python
async def run_werewolves_workflow(task, model, auxiliary_models):
    # 1. Initialize roles
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

    # 2. Shuffle based on the task seed
    np.random.seed(task["seed"])
    np.random.shuffle(roles)

    # 3. Create agents: werewolves use the trainable model, others use the auxiliary model
    players = [
        ReActAgent(
            name=f"Player{i+1}",
            model=model if role == "werewolf" else participant_model,
            ...
        ) for i, role in enumerate(roles)
    ]

    # 4. Run the game
    good_guy_win = await werewolves_game(players, roles)

    # 5. Compute the reward
    reward = 1.0 if not good_guy_win else 0.0

    return WorkflowOutput(reward=reward, metrics={...})
```

#### 2. Game Loop (`werewolves_game`)

Each game consists of alternating night and day phases:

**Night Phase:**

1. **Werewolves' Turn**: Discuss privately and vote to kill a player
2. **Witch's Turn**: Decide whether to use the healing/poison potions
3. **Seer's Turn**: Check one player's identity

**Day Phase:**

1. **Announcement**: The moderator announces who died during the night
2. **Discussion**: All living players discuss, with private reasoning separated from public statements
3. **Voting**: All players vote to eliminate one suspected werewolf
4. **Last Words**: The eliminated player gives a final statement

The game continues until:

- All werewolves are eliminated (villagers win), or
- Werewolves equal or outnumber the other players (werewolves win)
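The two end conditions can be expressed as a small predicate over the surviving players' roles. This is a sketch for illustration; `check_game_over` is not the name used in the actual code:

```python
def check_game_over(alive_roles):
    """Return (game_over, good_guy_win) given the roles still alive."""
    wolves = sum(1 for r in alive_roles if r == "werewolf")
    others = len(alive_roles) - wolves
    if wolves == 0:
        return True, True    # all werewolves eliminated: villagers win
    if wolves >= others:
        return True, False   # werewolves equal/outnumber the rest: werewolves win
    return False, None       # game continues
```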

#### 3. Reward Calculation

The reward is computed from the game outcome, taken from the werewolves' perspective:

```python
if not good_guy_win:  # Werewolves win
    reward = 1.0
else:  # Villagers win
    reward = 0.0
```

## How to Run

### Prerequisites

1. Install AgentScope with tuner support:
```bash
pip install agentscope[full]
```

2. Set up environment variables (optional; these can also be configured in code):
```bash
export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507"
export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
```

### Configuration

The project uses a hybrid configuration approach:

1. **High-level parameters** in `main.py`:
   - Model paths
   - Dataset configuration
   - Algorithm parameters (group_size, batch_size, learning_rate)

2. **Detailed infrastructure settings** in `config.yaml`:
   - Cluster configuration (nodes, GPUs)
   - Explorer settings (rollout engines, timeouts)
   - Trainer settings (gradient clipping, batch sizes)
   - Monitor configuration (WandB integration)

Key parameters to adjust:

```python
# In main.py
trained_model_path = "Qwen/Qwen2.5-7B-Instruct"
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507"

dataset = DatasetConfig(
    path="data",
    split="train",
    total_steps=400,  # Total training steps
)

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,        # Rollouts per task
    batch_size=24,        # Tasks per training step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```
### Training Command

**Step 1: Prepare the dataset**

```bash
cd /path/to/agentscope-samples/training/werewolf_game
python prepare_data.py --num_seeds 300
```

**Step 2: Start the Ray cluster**

```bash
# For a single node
ray start --head

# For a multi-node cluster (e.g., 4 nodes with 8 GPUs each):
# On the head node:
ray start --head --port=6379

# On each worker node:
ray start --address='<head_node_ip>:6379'
# Replace <head_node_ip> with the actual IP address of your head node
```

**Step 3: Run training**

Run the training script on the head node:

```bash
python main.py
```

For distributed training, make sure the Ray cluster matches the settings in `config.yaml`; the node and GPU counts must agree:

```yaml
cluster:
  node_num: 4
  gpu_per_node: 8
```

## Results

### Quantitative Results

Training on the 7-player werewolf game for 400 steps demonstrates significant improvement:

| Metric | Before Training | After Training (Step 200+) |
|--------|-----------------|----------------------------|
| Werewolf Win Rate | ~50% | ~85% |
| Success Rate | 0.50 | 0.85 |

**Reward Curve:**

![Reward Curve](./assets/werewolves_reward.png)

As shown in the reward curve above, the werewolf win rate steadily increases during training:

- **Steps 0-50**: The win rate fluctuates around 50-60% as the model explores strategies
- **Steps 50-200**: The win rate climbs to 70-80% as effective strategies emerge
- **Steps 200-250**: The win rate stabilizes around 85%
### Qualitative Results

#### Before Training: Critical Mistakes

Untrained models (Qwen2.5-7B-Instruct) make fundamental errors:

- **Self-exposure**: Werewolves accidentally reveal their identity in public ("as a werewolf, I think...")
- **Exposing teammates**: When voted out, werewolves point to teammates in their last words
- **Random strategy**: No coordination and no threat prioritization (e.g., ignoring the Seer)

#### After Training: The "Deep Cover" Werewolf Strategy

After 200+ steps, the trained model **converges to a sophisticated "deep cover" tactic**.
Below is an example where werewolves sacrifice a teammate to gain trust, then dominate the late game.

**Concrete Example:**

**Setup (Day 1):** The Seer publicly identifies Player5 as a werewolf. The Witch and the villagers join the accusation. Player5 (werewolf) is doomed.

**Player4 (werewolf teammate) responds strategically:**

**In short:** Player4 sacrifices his teammate to gain trust, posing as a logical villager.

**Detailed reasoning and statement:**

*Private Reasoning (only Player4 sees):*
```
"I'm a werewolf, and I know Player2 was targeted last night but survived.
My teammate Player5 is doomed - Player3 and Player7 are attacking him together.
I need to deflect suspicion from MYSELF, not save Player5.

Strategy: Sacrifice Player5 to gain trust. If I vote against my teammate,
villagers will think I'm trustworthy. This creates confusion and positions
me as a "confirmed good guy" for later rounds."
```

*Public Statement (all players see):*
```
✅ "I think Player5 is acting very strangely. He brought up Player2 being targeted,
but didn't explain why he's so focused on them. It feels like a distraction tactic.
I also noticed that Player3 and Player7 are both accusing Player5, which makes sense -
their logic is solid. I think we should seriously consider voting out Player5 -
he's the most suspicious person right now."
```

**Result:** Player5 is eliminated, but Player4 gains the complete trust of the Seer and all villagers.

**Why it works:**

- The Seer trusts Player4 as a strong villager ally → won't check him
- Villagers follow Player4's "logical" analysis
- Player4 systematically misleads discussions in later rounds
- He survives to the final 2 players → **werewolves win**

This demonstrates the essence of the trained behavior: **sacrifice pieces strategically to secure ultimate victory**. The model learns that a short-term teammate loss is worthwhile for establishing deep cover and long-term dominance.

---

## Bonus: Training the Good Guys

In addition to training werewolves, we also provide a configuration for training the **good guy side** (villagers, seer, and witch). This is a more challenging task, as the good guys need to:

- Perform complex reasoning to identify werewolves from subtle behavioral cues
- Coordinate effectively without explicit team communication
- Resist manipulation and deception from the werewolves
- **Master multiple roles simultaneously**: Unlike the werewolves (a single role), the good guys include the villager, seer, and witch, each with different abilities. The model must learn diverse strategies in one training run and make optimal use of the special abilities (the Seer's checks, the Witch's potions)

### Configuration

Use `config_train_goodguy.yaml` or set `trainable_target: good_guy` in `workflow_args`:

```yaml
workflow_args:
  trainable_target: good_guy  # Train villager, seer, and witch
```

### Quantitative Results

We trained `Qwen3-4B-Instruct` as the good guys against `Qwen3-30B-A3B-Instruct` werewolves:

| Metric | Before Training | After ~200 Steps | After ~400 Steps |
|--------|-----------------|------------------|------------------|
| Good Guy Win Rate | ~18% | ~60% | ~80% |

**Training Curve:**

![Good Guy Training Curve](./assets/goodguy_reward.png)

The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

### Qualitative Results

After training, the good guy models exhibit advanced reasoning patterns:

- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserving potions for critical moments), protecting high-value targets, evidence-based decisions
- **Villager**: Evidence-chain analysis, trust building with special roles, consensus formation for team coordination

---

## Conclusion

This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AS-Tune's multi-step GRPO algorithm, we successfully trained agents that develop sophisticated strategies, from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.

**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!
tuner/werewolves/__init__.py (new file, 2 lines)
@@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
"""Werewolf game training example for AgentScope."""
tuner/werewolves/config.yaml (new file, 241 lines)
@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run
name: Werewolves-7Player-GRPO
# Root directory for saving checkpoints. Uses the environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}

# ============================================
# Algorithm Configuration
# ============================================
algorithm:
  # Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
  algorithm_type: multi_step_grpo

  # KL divergence loss function for regularization during training
  # "low_var_kl": low-variance KL loss suitable for multi-step optimization
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    # KL coefficient: 0 means no KL penalty (pure reward optimization)
    kl_coef: 0

  # Advantage function configuration for computing policy gradients
  advantage_fn_args:
    # Small epsilon value for numerical stability in advantage normalization
    epsilon: 1e-6
    # Normalize the advantage by episode length (important for variable-length games)
    step_norm: true

  # Number of rollouts per task (group size for GRPO)
  # Higher values provide better gradient estimates but require more compute
  repeat_times: 32

  # Policy loss function arguments (PPO-style clipping)
  policy_loss_fn_args:
    # Lower bound for probability ratio clipping (prevents overly large policy updates)
    clip_range_low: 0.2
    # Upper bound for probability ratio clipping
    clip_range_high: 0.28

  # Optimizer configuration
  optimizer:
    # Learning rate for policy updates
    lr: 1e-6
# ============================================
# Model Configuration
# ============================================
model:
  # Path to the base model (the trainable model for werewolf players)
  # Uses the environment variable if set, otherwise defaults to Qwen2.5-7B-Instruct
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
  # Maximum number of tokens the model can generate per response
  max_response_tokens: 4096
  # Total context length the model can handle (prompt + response)
  max_model_len: 25600

# ============================================
# Cluster Configuration
# ============================================
cluster:
  # Number of nodes in the Ray cluster
  node_num: 4
  # Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
  gpu_per_node: 8
  # Ray cluster address: "auto" automatically detects the cluster
  ray_address: auto

# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
  # Total training steps (iterations)
  total_steps: 400
  # Batch size for rollout collection (tasks per training step)
  batch_size: 24
  # Batch size for training (experiences per gradient update)
  # In multi-step GRPO, each rollout of a task has multiple steps, so experiences accumulate before a gradient update.
  # Hence we choose a larger training batch size (2048 > 16 * 32)
  train_batch_size: 2048

  # Explorer input configuration (rollout data source)
  explorer_input:
    taskset:
      # Dataset name
      name: werewolves
      # Storage type: "file" reads from the local file system
      storage_type: file
      # Path to the dataset directory containing train.jsonl
      path: data
      # Dataset split to use
      split: 'train'
      # Rollout generation arguments
      rollout_args:
        # Sampling temperature for exploration (1.0 = full stochasticity)
        temperature: 1.0
        # Maximum tokens per generation
        max_tokens: 4096
      # Workflow-specific arguments
      workflow_args:
        # Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
        # - "werewolf": train the werewolf players (default)
        # - "good_guy": train the villager, seer, and witch players
        trainable_target: werewolf
    # Evaluation datasets (empty for this experiment)
    eval_tasksets: []

  # Trainer input configuration (training data source)
  trainer_input:
    experience_buffer:
      # Name of the experience buffer
      name: werewolves
      # Storage type: "queue" for in-memory communication between explorer and trainer
      storage_type: queue
      # Maximum time (seconds) to wait for data from the explorer before timing out
      max_read_timeout: 7200
      # Replay buffer configuration for experience reuse
      replay_buffer:
        # Enable the replay buffer to reuse past experiences
        enable: true
# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
  # Number of parallel workflow runners per rollout model instance
  # Higher values = more parallel game simulations
  runner_per_model: 16
  # Maximum time (seconds) to wait for a single task to complete
  max_timeout: 3600
  # Number of retries if a task fails or times out (0 = no retry)
  max_retry_times: 0
  # Number of times each task is repeated within a single runner
  # (usually 1; repeat_times is handled at the algorithm level)
  max_repeat_times_per_runner: 1

  # Over-rollout configuration (collect more data than needed)
  over_rollout:
    # Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
    # Improves throughput by not waiting for slow tasks
    ratio: 0.2
    # Minimum wait time (seconds) after reaching the threshold before proceeding
    wait_after_min: 15

  # Dynamic timeout configuration (adaptive timeout based on task completion time)
  dynamic_timeout:
    # Enable dynamic timeout adjustment
    enable: true
    # Timeout multiplier: timeout = average_task_time * ratio
    ratio: 4

  # Rollout model configuration (the trainable model for werewolf players)
  rollout_model:
    # Number of vLLM engine instances for parallel inference
    engine_num: 16
    # Tensor parallelism size (GPUs per engine instance)
    tensor_parallel_size: 1
    # Disable KV-cache prefix sharing (usually false for training stability)
    enable_prefix_caching: false
    # Do not force eager mode (false = use CUDA graphs for faster inference)
    enforce_eager: false
    # Enable the OpenAI-compatible API interface for vLLM
    enable_openai_api: true
    # Enable conversation history tracking
    enable_history: true
    # Enable automatic tool choice in function calling
    enable_auto_tool_choice: true
    # Tool call parser for structured outputs (hermes format)
    tool_call_parser: hermes
    # Data type for model weights and activations
    dtype: bfloat16
    # Random seed for reproducibility
    seed: 42

  # Auxiliary model configuration (for non-werewolf players: villagers, seer, witch)
  auxiliary_models:
    - # Model name identifier
      name: participant
      # Path to the auxiliary model (a stronger model for stable baseline opponents)
      model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
      # Number of vLLM engines for the auxiliary model (fewer than the rollout model)
      engine_num: 8
      # Tensor parallelism size
      tensor_parallel_size: 1
      # Disable prefix caching
      enable_prefix_caching: false
      # Do not force eager mode (CUDA graphs enabled)
      enforce_eager: false
      # Enable the OpenAI API interface
      enable_openai_api: true
      # Enable automatic tool choice
      enable_auto_tool_choice: true
      # Tool call parser
      tool_call_parser: hermes
      # Data type
      dtype: bfloat16
      # Random seed
      seed: 42

# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
  # Synchronization style: "dynamic_by_explorer" = sync when the explorer requests it
  # Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
  sync_style: dynamic_by_explorer
  # Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
  # Alternative: "checkpoint" (slower, file-based sync)
  sync_method: 'nccl'
  # Sync weights every N training steps
  sync_interval: 1
  # Timeout (seconds) for weight synchronization before failing
  sync_timeout: 72000

# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
  # Save a checkpoint every N training steps (0 = only save at the end)
  save_interval: 100
  # Gradient clipping threshold to prevent exploding gradients
  grad_clip: 1.0
  # Use dynamic batch sizes to maximize GPU memory utilization
  use_dynamic_bsz: true
  # Maximum token length per GPU for memory management
|
||||||
|
# Higher = more efficient but requires more VRAM
|
||||||
|
max_token_len_per_gpu: 16384
|
||||||
|
# Ulysses sequence parallelism size for handling long sequences
|
||||||
|
# 2 = split sequence across 2 GPUs
|
||||||
|
ulysses_sequence_parallel_size: 2
|
||||||
|
|
||||||
|
# ============================================
|
||||||
|
# Monitor Configuration (Logging & Tracking)
|
||||||
|
# ============================================
|
||||||
|
monitor:
|
||||||
|
# Monitoring/logging backend: "wandb" for Weights & Biases
|
||||||
|
# Alternatives: "tensorboard"
|
||||||
|
monitor_type: wandb
|
||||||
|
|
||||||
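The `${oc.env:VAR,default}` entries in the config above are OmegaConf-style environment interpolations: the value comes from the named environment variable if it is set, otherwise from the default after the comma. A minimal stdlib-only sketch of that resolution rule (`resolve_oc_env` is an illustrative helper, not part of the tuner):

```python
import os

def resolve_oc_env(expr: str) -> str:
    """Resolve a ``${oc.env:VAR,default}`` interpolation the way OmegaConf does:
    use the environment variable if set, otherwise fall back to the default."""
    inner = expr[len("${oc.env:"):-1]  # strip the "${oc.env:" prefix and "}" suffix
    var, _, default = inner.partition(",")
    return os.environ.get(var, default)

# With TRINITY_MODEL_PATH unset, the default after the comma is returned
os.environ.pop("TRINITY_MODEL_PATH", None)
print(resolve_oc_env("${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-4B-Instruct-2507}"))
# → Qwen/Qwen3-4B-Instruct-2507
```

This is why the configs run unmodified on a fresh checkout but can be repointed at other checkpoints by exporting the `TRINITY_*` variables.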
241	tuner/werewolves/config_train_goodguy.yaml	Normal file
@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run (training good guys)
name: Werewolves-7Player-GRPO-train-goodguy
# Root directory for saving checkpoints. Uses the environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}

# ============================================
# Algorithm Configuration
# ============================================
algorithm:
  # Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
  algorithm_type: multi_step_grpo

  # KL divergence loss function for regularization during training
  # "low_var_kl": low-variance KL loss suitable for multi-step optimization
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    # KL coefficient: 0 means no KL penalty (pure reward optimization)
    kl_coef: 0

  # Advantage function configuration for computing policy gradients
  advantage_fn_args:
    # Small epsilon value for numerical stability in advantage normalization
    epsilon: 1e-6
    # Normalize advantage by episode length (important for variable-length games)
    step_norm: true

  # Number of rollouts per task (group size for GRPO)
  # Higher values provide better gradient estimates but require more compute
  repeat_times: 32

  # Policy loss function arguments (PPO-style clipping)
  policy_loss_fn_args:
    # Lower bound for probability ratio clipping (prevents overly large policy updates)
    clip_range_low: 0.2
    # Upper bound for probability ratio clipping
    clip_range_high: 0.28

  # Optimizer configuration
  optimizer:
    # Learning rate for policy updates
    lr: 1e-6

# ============================================
# Model Configuration
# ============================================
model:
  # Path to the base model (trainable model for good guy players: villager, seer, witch)
  # Uses the environment variable if set, otherwise defaults to Qwen3-4B-Instruct-2507
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-4B-Instruct-2507}
  # Maximum number of tokens the model can generate per response
  max_response_tokens: 4096
  # Total context length the model can handle (prompt + response)
  max_model_len: 25600

# ============================================
# Cluster Configuration
# ============================================
cluster:
  # Number of nodes in the Ray cluster
  node_num: 4
  # Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
  gpu_per_node: 8
  # Ray cluster address: "auto" automatically detects the cluster
  ray_address: auto

# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
  # Total training steps (iterations)
  total_steps: 400
  # Batch size for rollout collection (tasks per training step)
  batch_size: 24
  # Batch size for training (experiences per gradient update)
  # In multi-step GRPO, each rollout of a task spans multiple steps, so experiences must be accumulated before a gradient update.
  # Hence the training batch size should be larger than the rollout count (2048 > 24 * 32)
  train_batch_size: 2048

  # Explorer input configuration (rollout data source)
  explorer_input:
    taskset:
      # Dataset name
      name: werewolves
      # Storage type: "file" reads from the local file system
      storage_type: file
      # Path to the dataset directory containing train.jsonl
      path: data
      # Dataset split to use
      split: 'train'
      # Rollout generation arguments
      rollout_args:
        # Sampling temperature for exploration (1.0 = full stochasticity)
        temperature: 1.0
        # Maximum tokens per generation
        max_tokens: 4096
      # Workflow-specific arguments
      workflow_args:
        # Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
        # - "werewolf": Train werewolf players
        # - "good_guy": Train villager, seer, and witch players (this config)
        trainable_target: good_guy
    # Evaluation datasets (empty for this experiment)
    eval_tasksets: []

  # Trainer input configuration (training data source)
  trainer_input:
    experience_buffer:
      # Name of the experience buffer
      name: werewolves
      # Storage type: "queue" for in-memory communication between explorer and trainer
      storage_type: queue
      # Maximum time (seconds) to wait for data from the explorer before timeout
      max_read_timeout: 7200
    # Replay buffer configuration for experience reuse
    replay_buffer:
      # Enable replay buffer to reuse past experiences
      enable: true

# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
  # Number of parallel workflow runners per rollout model instance
  # Higher values = more parallel game simulations
  runner_per_model: 16
  # Maximum time (seconds) to wait for a single task completion
  max_timeout: 3600
  # Number of retries if a task fails or times out (0 = no retry)
  max_retry_times: 0
  # Number of times each task is repeated within a single runner
  # (usually 1; repeat_times is handled at the algorithm level)
  max_repeat_times_per_runner: 1

  # Over-rollout configuration (collect more data than needed)
  over_rollout:
    # Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
    # Improves throughput by not waiting for slow tasks
    ratio: 0.2
    # Minimum wait time (seconds) after reaching the threshold before proceeding
    wait_after_min: 15

  # Dynamic timeout configuration (adaptive timeout based on task completion time)
  dynamic_timeout:
    # Enable dynamic timeout adjustment
    enable: true
    # Timeout multiplier: timeout = average_task_time * ratio
    ratio: 4

  # Rollout model configuration (trainable model for good guy players)
  rollout_model:
    # Number of vLLM engine instances for parallel inference
    engine_num: 16
    # Tensor parallelism size (GPUs per engine instance)
    tensor_parallel_size: 1
    # Disable KV cache prefix sharing (usually False for training stability)
    enable_prefix_caching: false
    # Disable CUDA graph optimization (False = use CUDA graphs for faster inference)
    enforce_eager: false
    # Enable OpenAI-compatible API interface for vLLM
    enable_openai_api: true
    # Enable conversation history tracking
    enable_history: true
    # Enable automatic tool choice in function calling
    enable_auto_tool_choice: true
    # Tool call parser for structured outputs (hermes format)
    tool_call_parser: hermes
    # Data type for model weights and activations
    dtype: bfloat16
    # Random seed for reproducibility
    seed: 42

  # Auxiliary models configuration (for werewolf players when training good guys)
  auxiliary_models:
    - # Model name identifier
      name: participant
      # Path to auxiliary model (stronger model for stable baseline opponents)
      model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
      # Number of vLLM engines for the auxiliary model (fewer than the rollout model)
      engine_num: 8
      # Tensor parallelism size
      tensor_parallel_size: 1
      # Disable prefix caching
      enable_prefix_caching: false
      # Disable CUDA graph optimization
      enforce_eager: false
      # Enable OpenAI API interface
      enable_openai_api: true
      # Enable automatic tool choice
      enable_auto_tool_choice: true
      # Tool call parser
      tool_call_parser: hermes
      # Data type
      dtype: bfloat16
      # Random seed
      seed: 42

# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
  # Synchronization style: "dynamic_by_explorer" = sync when the explorer requests
  # Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
  sync_style: dynamic_by_explorer
  # Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
  # Alternative: "checkpoint" (slower, file-based sync)
  sync_method: 'nccl'
  # Sync weights every N training steps
  sync_interval: 1
  # Timeout (seconds) for weight synchronization before failing
  sync_timeout: 72000

# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
  # Save checkpoint every N training steps (0 = only save at end)
  save_interval: 100
  # Gradient clipping threshold to prevent exploding gradients
  grad_clip: 1.0
  # Use dynamic batch size to maximize GPU memory utilization
  use_dynamic_bsz: true
  # Maximum token length per GPU for memory management
  # Higher = more efficient but requires more VRAM
  max_token_len_per_gpu: 16384
  # Ulysses sequence parallelism size for handling long sequences
  # 2 = split the sequence across 2 GPUs
  ulysses_sequence_parallel_size: 2

# ============================================
# Monitor Configuration (Logging & Tracking)
# ============================================
monitor:
  # Monitoring/logging backend: "wandb" for Weights & Biases
  # Alternatives: "tensorboard"
  monitor_type: wandb
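The buffer sizes in this config relate to each other arithmetically. A back-of-the-envelope sketch of that accounting (plain arithmetic, not tuner API; the assumption that every model call in a rollout emits one step-level experience is mine, based on the multi-step GRPO comment above):

```python
# Values from the buffer/algorithm sections of config_train_goodguy.yaml
batch_size = 24          # tasks launched per training step
repeat_times = 32        # GRPO group size: rollouts per task
train_batch_size = 2048  # experiences consumed per gradient update

# Each training step collects batch_size * repeat_times full game rollouts
rollouts_per_step = batch_size * repeat_times
print(rollouts_per_step)  # → 768

# Average number of step-level experiences each rollout must contribute
# so that one collection step fills one gradient update
print(train_batch_size / rollouts_per_step)  # ≈ 2.67 steps per rollout
```

So `train_batch_size` only needs to exceed the rollout count by the expected number of model calls per game, which a multi-round werewolf game easily provides.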
399	tuner/werewolves/game.py	Normal file
@@ -0,0 +1,399 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=too-many-branches, too-many-statements, no-name-in-module, W0707
"""A werewolf game implemented with AgentScope using structured reasoning - 7-player version."""
from utils import (
    majority_vote,
    names_to_str,
    EchoAgent,
    MAX_GAME_ROUND,
    MAX_DISCUSSION_ROUND,
    Players,
)
from structured_model import (
    DiscussionModel,
    PublicDiscussionModel,
    get_vote_model,
    get_poison_model,
    WitchResurrectModel,
    get_seer_model,
)
from prompt import EnglishPrompts as Prompts

# Uncomment the following line to use Chinese prompts
# from prompt import ChinesePrompts as Prompts

from agentscope.agent import ReActAgent
from agentscope.message import Msg
from agentscope.pipeline import (
    MsgHub,
    fanout_pipeline,
)


class BadGuyException(Exception):
    ...


moderator = EchoAgent()


async def werewolves_game(agents: list[ReActAgent], roles) -> bool:
    """The main entry of the werewolf game - 7-player version.

    Args:
        agents (`list[ReActAgent]`):
            A list of 7 agents.
    """
    assert (
        len(agents) == 7
    ), "The 7-player werewolf game needs exactly 7 players."

    # Init the players' status
    players = Players()

    # Whether the witch still has the healing and poison potions
    healing, poison = True, True

    # On the first day, the dead player can leave a message
    first_day = True

    # Broadcast the game-begin message
    async with MsgHub(participants=agents) as greeting_hub:
        await greeting_hub.broadcast(
            await moderator(
                Prompts.to_all_new_game.format(names_to_str(agents)),
            ),
        )

    # Assign roles to the agents - 2 werewolves, 3 villagers, 1 seer, 1 witch
    for agent, role in zip(agents, roles):
        # Tell the agent its role
        await agent.observe(
            await moderator(
                f"[{agent.name} ONLY] {agent.name}, your role is {role}.",
            ),
        )
        players.add_player(agent, role)

    # Print the roles
    players.print_roles()

    # GAME BEGIN!
    for _ in range(MAX_GAME_ROUND):
        # Create a MsgHub for all alive players to broadcast messages
        async with MsgHub(
            participants=players.current_alive,
            enable_auto_broadcast=False,  # manual broadcast only
            name="alive_players",
        ) as alive_players_hub:
            # Night phase
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_night),
            )
            killed_player, poisoned_player = None, None

            try:
                # Werewolves discuss
                async with MsgHub(
                    players.werewolves,
                    enable_auto_broadcast=True,
                    announcement=await moderator(
                        Prompts.to_wolves_discussion.format(
                            names_to_str(players.werewolves),
                            names_to_str(players.current_alive),
                        ),
                    ),
                    name="werewolves",
                ) as werewolves_hub:
                    # Discussion
                    n_werewolves = len(players.werewolves)
                    for _ in range(1, MAX_DISCUSSION_ROUND * n_werewolves + 1):
                        res = await players.werewolves[_ % n_werewolves](
                            structured_model=DiscussionModel,
                        )
                        if _ % n_werewolves == 0 and res.metadata.get(
                            "reach_agreement",
                        ):
                            break

                    # Werewolves vote
                    # Disable auto broadcast to avoid following others' votes
                    werewolves_hub.set_auto_broadcast(False)
                    msgs_vote = await fanout_pipeline(
                        players.werewolves,
                        msg=await moderator(content=Prompts.to_wolves_vote),
                        structured_model=get_vote_model(players.current_alive),
                        enable_gather=False,
                    )
                    killed_player, votes = majority_vote(
                        [_.metadata.get("vote") for _ in msgs_vote],
                    )
                    # Postpone the broadcast of the voting
                    await werewolves_hub.broadcast(
                        [
                            *msgs_vote,
                            await moderator(
                                Prompts.to_wolves_res.format(
                                    votes,
                                    killed_player,
                                ),
                            ),
                        ],
                    )
            except Exception as e:
                raise BadGuyException(
                    f"Werewolves failed to make a decision: {e}",
                )

            # Witch's turn
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_witch_turn),
            )
            msg_witch_poison = None
            for agent in players.witch:
                # The witch can heal herself (self-rescue allowed)
                msg_witch_resurrect = None
                if healing and killed_player:
                    msg_witch_resurrect = await agent(
                        await moderator(
                            Prompts.to_witch_resurrect.format(
                                witch_name=agent.name,
                                dead_name=killed_player,
                            ),
                        ),
                        structured_model=WitchResurrectModel,
                    )
                    if msg_witch_resurrect.metadata.get("resurrect"):
                        killed_player = None
                        healing = False

                # Has the poison potion and hasn't used the healing potion
                if poison and not (
                    msg_witch_resurrect
                    and msg_witch_resurrect.metadata["resurrect"]
                ):
                    msg_witch_poison = await agent(
                        await moderator(
                            Prompts.to_witch_poison.format(
                                witch_name=agent.name,
                            ),
                        ),
                        structured_model=get_poison_model(
                            players.current_alive,
                        ),
                    )
                    if msg_witch_poison.metadata.get("poison"):
                        poisoned_player = msg_witch_poison.metadata.get("name")
                        poison = False

            # Seer's turn
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_seer_turn),
            )
            for agent in players.seer:
                msg_seer = await agent(
                    await moderator(
                        Prompts.to_seer.format(
                            agent.name,
                            names_to_str(players.current_alive),
                        ),
                    ),
                    structured_model=get_seer_model(players.current_alive),
                )
                if msg_seer.metadata.get("name"):
                    player = msg_seer.metadata["name"]
                    await agent.observe(
                        await moderator(
                            Prompts.to_seer_result.format(
                                agent_name=player,
                                role=players.name_to_role[player],
                            ),
                        ),
                    )

            # Update alive players (no hunter in the 7-player version)
            dead_tonight = [killed_player, poisoned_player]
            players.update_players(dead_tonight)

            # Day phase
            if len([_ for _ in dead_tonight if _]) > 0:
                await alive_players_hub.broadcast(
                    await moderator(
                        Prompts.to_all_day.format(
                            names_to_str([_ for _ in dead_tonight if _]),
                        ),
                    ),
                )

                # The killed player leaves a last message on the first night
                if killed_player and first_day:
                    msg_moderator = await moderator(
                        Prompts.to_dead_player.format(killed_player),
                    )
                    await alive_players_hub.broadcast(msg_moderator)

                    # Leave a message with structured reasoning
                    dead_agent = players.name_to_agent[killed_player]
                    last_words_response = await dead_agent(
                        structured_model=PublicDiscussionModel,
                    )

                    # Extract reasoning and statement from metadata
                    reasoning = last_words_response.metadata.get(
                        "reasoning",
                        "",
                    )
                    statement = last_words_response.metadata.get(
                        "statement",
                        "",
                    )

                    # Only broadcast the public statement
                    public_last_msg = Msg(
                        name=dead_agent.name,
                        content=statement,
                        role="assistant",
                    )
                    await alive_players_hub.broadcast(public_last_msg)

                    # Let the dead player observe their own private reasoning
                    private_reasoning_msg = Msg(
                        name="self_thought",
                        content=f"[PRIVATE REASONING] {reasoning}",
                        role="assistant",
                    )
                    await dead_agent.observe(private_reasoning_msg)

            else:
                await alive_players_hub.broadcast(
                    await moderator(Prompts.to_all_peace),
                )

            # Check winning
            res = players.check_winning()
            if res:
                await moderator(res)
                break

            # Discussion - KEY MODIFICATION: use structured reasoning
            await alive_players_hub.broadcast(
                await moderator(
                    Prompts.to_all_discuss.format(
                        names=names_to_str(players.current_alive),
                    ),
                ),
            )

            # Instead of sequential_pipeline, we manually handle each player
            # to separate the private reasoning from the public statement
            for player in players.current_alive:
                # Get a structured response with reasoning and statement
                response = await player(
                    structured_model=PublicDiscussionModel,
                )

                # Extract reasoning and statement from metadata
                reasoning = response.metadata.get("reasoning", "")
                statement = response.metadata.get("statement", "")

                # Only broadcast the public statement to all players
                public_msg = Msg(
                    name=player.name,
                    content=statement,
                    role="assistant",
                )
                await alive_players_hub.broadcast(public_msg)

                # Let the player observe their own private reasoning
                # This keeps it in their memory but not visible to others
                private_msg = Msg(
                    name="self_thought",
                    content=f"[PRIVATE REASONING] {reasoning}",
                    role="assistant",
                )
                await player.observe(private_msg)

            # Voting
            msgs_vote = await fanout_pipeline(
                players.current_alive,
                await moderator(
                    Prompts.to_all_vote.format(
                        names_to_str(players.current_alive),
                    ),
                ),
                structured_model=get_vote_model(players.current_alive),
                enable_gather=False,
            )
            voted_player, votes = majority_vote(
                [_.metadata.get("vote") for _ in msgs_vote],
            )
            # Broadcast the voting messages together to avoid influencing
            # each other
            voting_msgs = [
                *msgs_vote,
                await moderator(
                    Prompts.to_all_res.format(votes, voted_player),
                ),
            ]

            # Leave a last message if voted out
            if voted_player:
                prompt_msg = await moderator(
                    Prompts.to_dead_player.format(voted_player),
                )

                # Get structured last words with reasoning
                dead_agent = players.name_to_agent[voted_player]
                last_words_response = await dead_agent(
                    prompt_msg,
                    structured_model=PublicDiscussionModel,
                )

                # Extract reasoning and statement from metadata
                reasoning = last_words_response.metadata.get("reasoning", "")
                statement = last_words_response.metadata.get("statement", "")

                # Create the public statement message
                public_last_msg = Msg(
                    name=dead_agent.name,
                    content=statement,
                    role="assistant",
                )

                # Store the private reasoning for the dead player
                private_reasoning_msg = Msg(
                    name="self_thought",
                    content=f"[PRIVATE REASONING] {reasoning}",
                    role="assistant",
                )
                await dead_agent.observe(private_reasoning_msg)

                voting_msgs.extend([prompt_msg, public_last_msg])

            await alive_players_hub.broadcast(voting_msgs)

            # Update alive players (no hunter in the 7-player version)
            dead_today = [voted_player]
            players.update_players(dead_today)

            # Check winning
            res = players.check_winning()
            if res:
                async with MsgHub(players.all_players) as all_players_hub:
                    res_msg = await moderator(res)
                    await all_players_hub.broadcast(res_msg)
                break

        # The day ends
        first_day = False

    # # Game over, each player reflects
    # await fanout_pipeline(
    #     agents=agents,
    #     msg=await moderator(Prompts.to_all_reflect),
    # )

    alive_wolves = players.werewolves
    good_guy_win = len(alive_wolves) == 0
    return good_guy_win
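Both voting phases in `game.py` rely on `majority_vote` from `utils`, whose implementation is not part of this diff. A plausible sketch of its contract, inferred from the call sites (it receives possibly-`None` vote strings and returns a winner plus a tally used in the result announcement; all names and the tie-breaking behavior are assumptions):

```python
from collections import Counter

def majority_vote(votes):
    """Return (winner, tally); None/empty votes are ignored.
    Ties break toward the earliest-reached count, as Counter.most_common does."""
    valid = [v for v in votes if v]
    if not valid:
        return None, "no valid votes"
    counts = Counter(valid)
    winner, _ = counts.most_common(1)[0]
    tally = ", ".join(f"{name}: {n}" for name, n in counts.items())
    return winner, tally

winner, tally = majority_vote(["Player1", "Player1", "Player3", None])
print(winner)  # → Player1
```

Filtering out `None` matters because `_.metadata.get("vote")` returns `None` whenever a player's structured output omits the vote field.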
316	tuner/werewolves/main.py	Normal file
@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=C0301,C0413,W0621,W0404,C0412,E0611,E1121
"""Example of training a werewolf game agent with Trinity-RFT using the AgentScope tuner."""
import sys
from pathlib import Path
from typing import Dict
import traceback

import numpy as np

from agentscope.tuner import (
    tune,
    WorkflowOutput,
    TunerModelConfig,
)
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIMultiAgentFormatter

# Add the current directory to the path for local imports
sys.path.insert(0, str(Path(__file__).parent))

from game import BadGuyException, werewolves_game  # noqa: E402


async def run_werewolves_workflow(
    task: Dict,
    model: TunerModelConfig,
    auxiliary_models: Dict[str, TunerModelConfig],
) -> WorkflowOutput:
    """Run the werewolf game workflow.

    Args:
        task (Dict): The task information containing:
            - 'seed': for role shuffling
            - 'workflow_args': optional dict with 'trainable_target' key
              ("werewolf" or "good_guy", default: "werewolf")
        model (TunerModelConfig): The trainable model.
        auxiliary_models (Dict[str, TunerModelConfig]): Dictionary of auxiliary
            models. Expected to have a 'participant' key for opponent players.

    Returns:
        WorkflowOutput: Contains the reward and metrics from the game.
    """
    # Initialize roles: 2 werewolves, 3 villagers, 1 seer, 1 witch
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

    # Shuffle roles based on the task seed for reproducibility
    seed = task.get("seed", 0)
    np.random.seed(seed)
    np.random.shuffle(roles)

    # Get trainable_target from workflow_args (default: "werewolf")
    # Options: "werewolf" or "good_guy" (villager, seer, witch)
    workflow_args = task.get("workflow_args", {})
    trainable_target = workflow_args.get("trainable_target", "werewolf")

    # Get the participant model for opponent players
    if "participant" not in auxiliary_models:
        raise ValueError(
            "Expected 'participant' model in auxiliary_models for opponent players",
        )
    participant_model = auxiliary_models["participant"]

    # Create players with the appropriate models based on trainable_target
    players = []
    for i, role in enumerate(roles):
        # Determine which model to use based on trainable_target
        if trainable_target == "werewolf":
            # Training werewolves: werewolves use the trainable model
            use_trainable = role == "werewolf"
        else:  # trainable_target == "good_guy"
            # Training good guys: villager, seer, and witch use the trainable model
            use_trainable = role in ["villager", "seer", "witch"]

        agent = ReActAgent(
            name=f"Player{i + 1}",
            sys_prompt=get_official_agent_prompt(f"Player{i + 1}"),
            model=model if use_trainable else participant_model,
            formatter=OpenAIMultiAgentFormatter(),
            max_iters=3,
        )
        players.append(agent)

    try:
        # Run the werewolf game
        good_guy_win = await werewolves_game(players, roles)

        # Calculate the reward based on trainable_target
        is_success = False
        if trainable_target == "werewolf":
            # Training werewolves: reward when the werewolves win (good_guy_win = False)
            if not good_guy_win:
                raw_reward = 1.0
                is_success = True
            else:
                raw_reward = 0.0
        else:  # trainable_target == "good_guy"
            # Training good guys: reward when the good guys win (good_guy_win = True)
            if good_guy_win:
                raw_reward = 1.0
                is_success = True
            else:
                raw_reward = 0.0

        metrics = {
            "success": float(is_success),
            "werewolf_win": float(not good_guy_win),
            "villager_win": float(good_guy_win),
            "trainable_target": trainable_target,
        }

        return WorkflowOutput(
            reward=raw_reward,
            metrics=metrics,
        )

    except BadGuyException as e:
        # If game execution fails, give a small penalty
        traceback.print_exc()
        print(
            f"Error during game execution: {e}. "
            "Assigning penalty to trainable agents.",
        )
        return WorkflowOutput(
            reward=-0.1,
            metrics={"success": 0.0, "game_error": 1.0},
        )
    except Exception as e:
        # Catch any other unexpected errors
        traceback.print_exc()
        print(f"Unexpected error: {e}")
        return WorkflowOutput(
            reward=-0.1,
            metrics={"success": 0.0, "unexpected_error": 1.0},
        )


def get_official_agent_prompt(name: str) -> str:
    """Get the system prompt for an agent.

    Args:
        name (str): The name of the agent.

    Returns:
        str: The system prompt.
    """
    from textwrap import dedent

    system_prompt = dedent(
        f"""
        You're a werewolf game player named {name}.

        # YOUR TARGET
        Your target is to win the game with your teammates as much as possible.

        # GAME RULES
        - In the werewolf game, players are divided into two werewolves, three villagers, one seer, and one witch.
|
||||||
|
- Werewolves: kill one player each night, and must hide identity during the day.
|
||||||
|
- Villagers: ordinary players without special abilities, try to identify and eliminate werewolves.
|
||||||
|
- Seer: A special villager who can check one player's identity each night.
|
||||||
|
- Witch: A special villager with two one-time-use potions: a healing potion to save a player (including herself) from being killed at night, and a poison to eliminate one player at night.
|
||||||
|
- The game alternates between night and day phases until one side wins:
|
||||||
|
- Night Phase
|
||||||
|
- Werewolves choose one victim
|
||||||
|
- Seer checks one player's identity
|
||||||
|
- Witch decides whether to use potions
|
||||||
|
- Moderator announces who died during the night
|
||||||
|
- Day Phase
|
||||||
|
- All players discuss and vote to eliminate one suspected player
|
||||||
|
|
||||||
|
- The werewolves will win the game if they can eliminate all the villagers.
|
||||||
|
- The villagers will win the game if they can eliminate all the werewolves.
|
||||||
|
|
||||||
|
## During PUBLIC discussion (day phase):
|
||||||
|
- Your response will be split into TWO parts: REASONING (private) and STATEMENT (public)
|
||||||
|
- REASONING: Your internal thoughts - ONLY YOU can see this. Think freely here.
|
||||||
|
- STATEMENT: What you actually say - EVERYONE can see this. Be strategic!
|
||||||
|
|
||||||
|
## For WEREWOLVES in public discussion:
|
||||||
|
- ❌ NEVER say "I'm a werewolf" or "we werewolves" in your STATEMENT
|
||||||
|
- ❌ NEVER reveal your werewolf teammates in your STATEMENT
|
||||||
|
- ❌ NEVER discuss werewolf strategy in your STATEMENT
|
||||||
|
- ✅ In REASONING: freely think about werewolf strategy
|
||||||
|
- ✅ In STATEMENT: pretend to be a villager, seer, or other role
|
||||||
|
- ✅ In STATEMENT: accuse others, defend yourself, but NEVER reveal your true identity
|
||||||
|
|
||||||
|
## For ALL ROLES in public discussion:
|
||||||
|
- Use REASONING to analyze: "Who might be the werewolf? What's my strategy?"
|
||||||
|
- Use STATEMENT to speak: "I think Player X is suspicious because..."
|
||||||
|
- Keep sensitive information in REASONING, not in STATEMENT
|
||||||
|
|
||||||
|
## Examples:
|
||||||
|
### BAD (Werewolf exposing themselves):
|
||||||
|
REASONING: "I'm a werewolf, I should protect my teammates."
|
||||||
|
STATEMENT: "As a werewolf, I think we should vote Player 5." ❌ EXPOSED!
|
||||||
|
|
||||||
|
### GOOD (Werewolf hiding identity):
|
||||||
|
REASONING: "I'm a werewolf. Player 5 might be the seer based on their questions. I should cast suspicion on them without being obvious."
|
||||||
|
STATEMENT: "I find Player 5's behavior suspicious. They've been asking too many questions about people's roles." ✅ HIDDEN!
|
||||||
|
|
||||||
|
### GOOD (Villager analyzing):
|
||||||
|
REASONING: "Player 2 and Player 3 seem to be defending each other. Could they be werewolf teammates?"
|
||||||
|
STATEMENT: "I noticed Player 2 and Player 3 have been very defensive of each other. This makes me suspicious." ✅ STRATEGIC!
|
||||||
|
|
||||||
|
# GAME GUIDANCE
|
||||||
|
- Try your best to win the game with your teammates, tricks, lies, and deception are all allowed, e.g. pretending to be a different role.
|
||||||
|
- During discussion, don't be political, be direct and to the point.
|
||||||
|
- The day phase voting provides important clues. For example, the werewolves may vote together, attack the seer, etc.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR WEREWOLF
|
||||||
|
- Seer is your greatest threat, who can check one player's identity each night. Analyze players' speeches, find out the seer and eliminate him/her will greatly increase your chances of winning.
|
||||||
|
- In the first night, making random choices is common for werewolves since no information is available.
|
||||||
|
- Pretending to be other roles (seer, witch or villager) is a common strategy to hide your identity and mislead other villagers in the day phase.
|
||||||
|
- The outcome of the night phase provides important clues. For example, if witch uses the healing or poison potion, etc. Use this information to adjust your strategy.
|
||||||
|
- [CRITICAL] In public discussion, NEVER reveal you are a werewolf. Always pretend to be a villager or other role.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR SEER
|
||||||
|
- Seer is very important to villagers, you should earn the villagers' trust, and lead the discussion phase if possible.
|
||||||
|
- Your ability to check one player's identity is crucial.
|
||||||
|
- The outcome of the night phase provides important clues. For example, if witch uses the healing or poison potion, etc. Use this information to adjust your strategy.
|
||||||
|
- Consider when to reveal your identity - too early and werewolves will target you, too late and villagers won't trust you.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR WITCH
|
||||||
|
- Witch has two powerful potions, use them wisely to protect key villagers or eliminate suspected werewolves.
|
||||||
|
- [IMPORTANT] You CAN use the healing potion to save yourself if you are killed by werewolves (self-rescue is allowed).
|
||||||
|
- Consider saving the healing potion for critical moments, especially if you think you might be targeted.
|
||||||
|
- The outcome of the night phase provides important clues. Use this information to adjust your strategy. For example, the person you save is likely to be on the villagers' side.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR VILLAGER
|
||||||
|
- Protecting special villagers, especially the seer, is crucial for your team's success.
|
||||||
|
- Be cautious and decide whether to trust other players based on their speeches and actions.
|
||||||
|
- Base your decisions on the information you have received, be logical and engage in the discussion to vote out the suspected werewolves.
|
||||||
|
|
||||||
|
# NOTE
|
||||||
|
- [IMPORTANT] DO NOT make up any information that is not provided by the moderator or other players.
|
||||||
|
- This is a TEXT-based game, so DO NOT use or make up any non-textual information.
|
||||||
|
- Always critically reflect on whether your evidence exist, and avoid making assumptions.
|
||||||
|
- Your response should be specific and concise, provide clear reason and avoid unnecessary elaboration.
|
||||||
|
- Generate your one-line response by using the `generate_response` function.
|
||||||
|
- Don't repeat the others' speeches.
|
||||||
|
- [CRITICAL] Remember: REASONING is private (only you see it), STATEMENT is public (everyone sees it). Use this to your advantage!""",
|
||||||
|
)
|
||||||
|
return system_prompt
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
from agentscope.tuner import (
|
||||||
|
DatasetConfig,
|
||||||
|
TunerModelConfig,
|
||||||
|
AlgorithmConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
# High-level configuration in code (easy to modify)
|
||||||
|
config_path = Path(__file__).parent / "config.yaml"
|
||||||
|
|
||||||
|
# Setup Model Path
|
||||||
|
trained_model_path = (
|
||||||
|
"Qwen/Qwen2.5-7B-Instruct" # fill in your model path here
|
||||||
|
)
|
||||||
|
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507" # fill in your auxiliary model path here
|
||||||
|
|
||||||
|
# Dataset configuration
|
||||||
|
dataset = DatasetConfig(
|
||||||
|
path=str(Path(__file__).parent / "data"),
|
||||||
|
split="train",
|
||||||
|
total_steps=400, # Total training steps
|
||||||
|
)
|
||||||
|
|
||||||
|
# Model configuration (trainable model for werewolf players)
|
||||||
|
model = TunerModelConfig(
|
||||||
|
model_path=trained_model_path,
|
||||||
|
max_model_len=25600,
|
||||||
|
max_tokens=4096,
|
||||||
|
temperature=1.0,
|
||||||
|
inference_engine_num=16,
|
||||||
|
tensor_parallel_size=1,
|
||||||
|
tool_call_parser="hermes",
|
||||||
|
reasoning_parser=None,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Auxiliary models (for non-werewolf players)
|
||||||
|
auxiliary_models = {
|
||||||
|
"participant": TunerModelConfig(
|
||||||
|
model_path=auxiliary_model_path,
|
||||||
|
max_model_len=25600,
|
||||||
|
max_tokens=4096,
|
||||||
|
temperature=0.1, # Lower temperature for auxiliary models
|
||||||
|
inference_engine_num=8,
|
||||||
|
tensor_parallel_size=1,
|
||||||
|
tool_call_parser="hermes",
|
||||||
|
reasoning_parser=None,
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
# Algorithm configuration
|
||||||
|
algorithm = AlgorithmConfig(
|
||||||
|
algorithm_type="multi_step_grpo",
|
||||||
|
group_size=32, # repeat_times in Trinity
|
||||||
|
batch_size=24,
|
||||||
|
learning_rate=1e-6,
|
||||||
|
save_interval_steps=100,
|
||||||
|
eval_interval_steps=100,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run training with hybrid configuration
|
||||||
|
# Code parameters above + detailed Trinity config from YAML
|
||||||
|
tune(
|
||||||
|
workflow_func=run_werewolves_workflow,
|
||||||
|
judge_func=None, # We compute reward directly in the workflow
|
||||||
|
train_dataset=dataset,
|
||||||
|
model=model,
|
||||||
|
auxiliary_models=auxiliary_models,
|
||||||
|
algorithm=algorithm,
|
||||||
|
config_path=str(config_path), # For cluster, explorer, trainer details
|
||||||
|
)
|
||||||
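The reward rule in the workflow above reduces to a small pure function. The following is an illustrative sketch (the function name `assign_reward` is mine, not part of the repo): the side named by `trainable_target` receives reward 1.0 exactly when it wins, 0.0 otherwise.

```python
def assign_reward(trainable_target: str, good_guy_win: bool) -> tuple:
    """Mirror of the workflow's reward rule.

    Returns (raw_reward, is_success) for the trainable side.
    """
    if trainable_target == "werewolf":
        # Werewolves are rewarded when the good guys lose
        win = not good_guy_win
    else:  # trainable_target == "good_guy"
        win = good_guy_win
    return (1.0, True) if win else (0.0, False)


print(assign_reward("werewolf", good_guy_win=False))  # -> (1.0, True)
print(assign_reward("good_guy", good_guy_win=False))  # -> (0.0, False)
```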
tuner/werewolves/prepare_data.py — new file, 71 lines
@@ -0,0 +1,71 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Prepare dataset for werewolf game training.

This script generates a simple dataset consisting of random seeds for role shuffling.
Each seed creates a different initial role assignment, ensuring diverse training scenarios.
"""
import json
import argparse
from pathlib import Path


def prepare_dataset(
    output_dir: str,
    num_seeds: int = 300,
    split: str = "train",
) -> None:
    """Prepare the werewolf game training dataset.

    Args:
        output_dir (str): Directory to save the dataset.
        num_seeds (int): Number of seeds to generate. Default: 300.
        split (str): Dataset split name (e.g., 'train', 'eval'). Default: 'train'.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    output_file = output_path / f"{split}.jsonl"

    print(f"Generating {num_seeds} seeds for {split} split...")

    with open(output_file, "w", encoding="utf-8") as f:
        for seed in range(num_seeds):
            data = {"seed": seed}
            f.write(json.dumps(data) + "\n")

    print(f"Dataset saved to: {output_file}")
    print(f"Total samples: {num_seeds}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Prepare dataset for werewolf game training",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="data",
        help="Directory to save the dataset (default: data)",
    )
    parser.add_argument(
        "--num_seeds",
        type=int,
        default=300,
        help="Number of seeds to generate (default: 300)",
    )
    parser.add_argument(
        "--split",
        type=str,
        default="train",
        help="Dataset split name (default: train)",
    )

    args = parser.parse_args()

    prepare_dataset(
        output_dir=args.output_dir,
        num_seeds=args.num_seeds,
        split=args.split,
    )
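The on-disk format produced by `prepare_dataset` is plain JSON Lines, one `{"seed": N}` object per line. A minimal stdlib-only sketch of writing and reading it back (the temp directory and 5-seed count are illustrative):

```python
import json
import tempfile
from pathlib import Path

# Write a tiny dataset the same way prepare_dataset does:
# one JSON object per line, each carrying only a shuffling seed.
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "train.jsonl"
with open(out_file, "w", encoding="utf-8") as f:
    for seed in range(5):
        f.write(json.dumps({"seed": seed}) + "\n")

# Read it back, one task per line
rows = [json.loads(line) for line in out_file.read_text().splitlines()]
print(rows[0], len(rows))  # -> {'seed': 0} 5
```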
tuner/werewolves/prompt.py — new file, 198 lines
@@ -0,0 +1,198 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Enhanced prompts with reasoning/statement separation awareness - 7 Player Version."""


class EnglishPrompts:
    """English prompts used to guide the werewolf game."""

    to_dead_player = (
        "{}, you're eliminated now. Now you can make a final statement to "
        "all alive players before you leave the game.\n\n"
        "IMPORTANT: Your final words will be structured into two parts:\n"
        "1. REASONING (Private): Your strategic thinking - NOT visible to others\n"
        "2. STATEMENT (Public): Your actual final words - VISIBLE to all players\n\n"
        "Think carefully about what information to reveal or hide in your final statement."
    )

    to_all_new_game = (
        "A new game is starting, the players are: {}. Now we randomly "
        "reassign the roles to each player and inform them of their roles "
        "privately."
    )

    to_all_night = (
        "Night has fallen, everyone close your eyes. Werewolves open your "
        "eyes and choose a player to eliminate tonight."
    )

    to_wolves_discussion = (
        "[WEREWOLVES ONLY] {}, you should discuss and "
        "decide on a player to eliminate tonight. Current alive players "
        "are {}. Remember to set `reach_agreement` to True if you reach an "
        "agreement during the discussion."
    )

    to_wolves_vote = "[WEREWOLVES ONLY] Which player do you vote to kill?"

    to_wolves_res = (
        "[WEREWOLVES ONLY] The voting result is {}. So you have chosen to "
        "eliminate {}."
    )

    to_all_witch_turn = (
        "Witch's turn, witch open your eyes and decide your action tonight..."
    )
    to_witch_resurrect = (
        "[WITCH ONLY] {witch_name}, you're the witch, and tonight {dead_name} "
        "is eliminated. You can resurrect him/her by using your healing "
        "potion, "
        "and note you can only use it once in the whole game. Do you want to "
        "resurrect {dead_name}? Give me your reason and decision."
    )

    to_witch_resurrect_no = (
        "[WITCH ONLY] The witch has chosen not to resurrect the player."
    )
    to_witch_resurrect_yes = (
        "[WITCH ONLY] The witch has chosen to resurrect the player."
    )

    to_witch_poison = (
        "[WITCH ONLY] {witch_name}, as a witch, you have a one-time-use "
        "poison potion, do you want to use it tonight? Give me your reason "
        "and decision."
    )

    to_all_seer_turn = (
        "Seer's turn, seer open your eyes and check one player's identity "
        "tonight..."
    )

    to_seer = (
        "[SEER ONLY] {}, as the seer you can check one player's identity "
        "tonight. Who do you want to check? Give me your reason and decision."
    )

    to_seer_result = (
        "[SEER ONLY] You've checked {agent_name}, and the result is: {role}."
    )

    to_all_day = (
        "The day is coming, all players open your eyes. Last night, "
        "the following player(s) have been eliminated: {}."
    )

    to_all_peace = (
        "The day is coming, all the players open your eyes. Last night was "
        "peaceful, no player was eliminated."
    )

    to_all_discuss = (
        "Now the alive players are {names}. The game goes on, it's time to "
        "discuss and vote a player to be eliminated. Now you each take turns "
        "to speak once in the order of {names}.\n\n"
        "IMPORTANT: Your response will be structured into two parts:\n"
        "1. REASONING (Private): Your internal analysis - NOT visible to others\n"
        "2. STATEMENT (Public): Your actual speech - VISIBLE to all players\n\n"
        "Think carefully in your reasoning, then speak strategically in your statement."
    )

    to_all_vote = (
        "Now the discussion is over. Everyone, please vote to eliminate one "
        "player from the alive players: {}."
    )

    to_all_res = "The voting result is {}. So {} has been voted out."

    to_all_wolf_win = (
        "There are {n_alive} players alive, and {n_werewolves} of them are "
        "werewolves. "
        "The game is over and werewolves win🐺🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )

    to_all_village_win = (
        "All the werewolves have been eliminated. "
        "The game is over and villagers win🏘️🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )

    to_all_continue = "The game goes on."

    to_all_reflect = (
        "The game is over. Now each player can reflect on their performance. "
        "Note each player only has one chance to speak and the reflection is "
        "only visible to themselves."
    )


class ChinesePrompts:
    """Chinese prompts used to guide the werewolf game."""

    to_dead_player = (
        "{}, 你已被淘汰。现在你可以向所有存活玩家发表最后的遗言。\n\n"
        "重要提示:你的遗言将分为两个部分:\n"
        "1. 推理(私密):你的策略思考 - 其他玩家看不到\n"
        "2. 发言(公开):你的实际遗言 - 所有玩家都能看到\n\n"
        "仔细思考在遗言中应该透露或隐藏什么信息。"
    )

    to_all_new_game = "新的一局游戏开始,参与玩家包括:{}。现在为每位玩家重新随机分配身份,并私下告知各自身份。"

    to_all_night = "天黑了,请所有人闭眼。狼人请睁眼,选择今晚要淘汰的一名玩家..."

    to_wolves_discussion = (
        "[仅狼人可见] {}, 你们可以讨论并决定今晚要淘汰的玩家。当前存活玩家有:{}。"
        "如果达成一致,请将 `reach_agreement` 设为 True。"
    )

    to_wolves_vote = "[仅狼人可见] 你投票要杀死哪位玩家?"

    to_wolves_res = "[仅狼人可见] 投票结果为 {},你们选择淘汰 {}。"

    to_all_witch_turn = "轮到女巫行动,女巫请睁眼并决定今晚的操作..."
    to_witch_resurrect = (
        "[仅女巫可见] {witch_name},你是女巫,今晚{dead_name}被淘汰。"
        "你可以用解药救他/她,注意解药全局只能用一次。你要救{dead_name}吗?"
        "请给出理由和决定。"
    )

    to_witch_resurrect_no = "[仅女巫可见] 女巫选择不救该玩家。"
    to_witch_resurrect_yes = "[仅女巫可见] 女巫选择救活该玩家。"

    to_witch_poison = "[仅女巫可见] {witch_name},你有一瓶一次性毒药,今晚要使用吗?请给出理由和决定。"

    to_all_seer_turn = "轮到预言家行动,预言家请睁眼并查验一名玩家身份..."

    to_seer = "[仅预言家可见] {}, 你是预言家,今晚可以查验一名玩家身份。你要查谁?请给出理由和决定。"

    to_seer_result = "[仅预言家可见] 你查验了{agent_name},结果是:{role}。"

    to_all_day = "天亮了,请所有玩家睁眼。昨晚被淘汰的玩家有:{}。"

    to_all_peace = "天亮了,请所有玩家睁眼。昨晚平安夜,无人被淘汰。"

    to_all_discuss = (
        "现在存活玩家有:{names}。游戏继续,大家开始讨论并投票淘汰一名玩家。请按顺序({names})依次发言。\n\n"
        "重要提示:你的回答将分为两个部分:\n"
        "1. 推理(私密):你的内心分析 - 其他玩家看不到\n"
        "2. 发言(公开):你的实际发言 - 所有玩家都能看到\n\n"
        "在推理中仔细思考,然后在发言中策略性地表达。"
    )

    to_all_vote = "讨论结束。请大家从存活玩家中投票淘汰一人:{}。"

    to_all_res = "投票结果为 {},{} 被淘汰。"

    to_all_wolf_win = (
        "当前存活玩家共{n_alive}人,其中{n_werewolves}人为狼人。"
        "游戏结束,狼人获胜🐺🎉!"
        "本局所有玩家真实身份为:{true_roles}"
    )

    to_all_village_win = "所有狼人已被淘汰。游戏结束,村民获胜🏘️🎉!本局所有玩家真实身份为:{true_roles}"

    to_all_continue = "游戏继续。"

    to_all_reflect = "游戏结束。现在每位玩家可以对自己的表现进行反思。注意每位玩家只有一次发言机会,且反思内容仅自己可见。"
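The templates above mix positional and named `str.format` placeholders, so callers must match each style. A small sketch (the template strings are copied from `EnglishPrompts`; the player names are illustrative):

```python
# Named placeholders, as in EnglishPrompts.to_seer_result:
to_seer_result = "[SEER ONLY] You've checked {agent_name}, and the result is: {role}."
msg = to_seer_result.format(agent_name="Player3", role="werewolf")
print(msg)

# Positional placeholders, as in EnglishPrompts.to_all_res:
to_all_res = "The voting result is {}. So {} has been voted out."
print(to_all_res.format("Player3: 4, Player5: 2", "Player3"))
```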
tuner/werewolves/rollout_reward_curve.png — new binary file (193 KiB), not shown
tuner/werewolves/rollout_reward_curve_goodguy.png — new binary file (538 KiB), not shown
tuner/werewolves/structured_model.py — new file, 96 lines
@@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""The structured output models used in the werewolf game."""
from typing import Literal

from pydantic import BaseModel, Field
from agentscope.agent import AgentBase


class DiscussionModel(BaseModel):
    """The output format for discussion."""

    reach_agreement: bool = Field(
        description="Whether you have reached an agreement or not",
    )


class PublicDiscussionModel(BaseModel):
    """The output format for public discussion with private reasoning.

    This model separates private reasoning from public statements to prevent
    accidental information leakage (e.g., werewolves revealing their identity).
    """

    reasoning: str = Field(
        description=(
            "Your PRIVATE reasoning and analysis. This will NOT be shown to "
            "other players. You can freely think about your strategy, analyze "
            "other players' behaviors, and plan your next move here. "
            "If you are a werewolf, you can think about how to hide your identity. "
            "If you are a villager, you can analyze who might be the werewolf."
        ),
    )

    statement: str = Field(
        description=(
            "Your PUBLIC statement to all players. This WILL be visible to everyone. "
            "Be careful not to reveal sensitive information (e.g., your true role if "
            "you are a werewolf). "
            "Your statement should be strategic and help your team win."
        ),
    )


def get_vote_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the vote model by player names."""

    class VoteModel(BaseModel):
        """The vote output format."""

        vote: Literal[tuple(_.name for _ in agents)] = Field(  # type: ignore
            description="The name of the player you want to vote for",
        )

    return VoteModel


class WitchResurrectModel(BaseModel):
    """The output format for witch resurrect action."""

    resurrect: bool = Field(
        description="Whether you want to resurrect the player",
    )


def get_poison_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the poison model by player names."""

    class WitchPoisonModel(BaseModel):
        """The output format for witch poison action."""

        poison: bool = Field(
            description="Do you want to use the poison potion",
        )
        name: Literal[  # type: ignore
            tuple(_.name for _ in agents)
        ] | None = Field(
            description="The name of the player you want to poison, if you "
            "don't want to poison anyone, just leave it empty",
            default=None,
        )

    return WitchPoisonModel


def get_seer_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the seer model by player names."""

    class SeerModel(BaseModel):
        """The output format for seer action."""

        name: Literal[tuple(_.name for _ in agents)] = Field(  # type: ignore
            description="The name of the player you want to check",
        )

    return SeerModel
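`get_vote_model` constrains the `vote` field with a `Literal` built at runtime from the alive players' names, so an out-of-game name fails validation. The same runtime-constructed-`Literal` trick can be sketched without pydantic (stdlib only; `make_vote_validator` and the player names are mine, for illustration):

```python
from typing import Literal, get_args


def make_vote_validator(names):
    # Literal subscripted with a tuple at runtime, the same trick
    # get_vote_model uses for its pydantic field annotation.
    vote_literal = Literal[tuple(names)]  # type: ignore[valid-type]
    allowed = set(get_args(vote_literal))

    def validate(vote: str) -> str:
        if vote not in allowed:
            raise ValueError(f"{vote!r} is not an alive player")
        return vote

    return validate


validate = make_vote_validator(["Player1", "Player2", "Player3"])
print(validate("Player2"))  # -> Player2
```

Rebuilding the model per call matters: as players are eliminated, the allowed names shrink, so stale votes are rejected at parse time rather than silently counted.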
tuner/werewolves/utils.py — new file, 159 lines
@@ -0,0 +1,159 @@
# -*- coding: utf-8 -*-
"""Utility functions for the werewolf game - 7 Player Version."""
from collections import defaultdict
from typing import Any

import numpy as np
from agentscope.agent import AgentBase, ReActAgent
from agentscope.message import Msg
from prompt import EnglishPrompts as Prompts

MAX_GAME_ROUND = 30
MAX_DISCUSSION_ROUND = 3


def majority_vote(votes: list[str]) -> tuple:
    """Return the vote with the most counts."""
    result = max(set(votes), key=votes.count)
    names, counts = np.unique(votes, return_counts=True)
    conditions = ", ".join(
        [f"{name}: {count}" for name, count in zip(names, counts)],
    )
    return result, conditions


def names_to_str(agents: list[str] | list[ReActAgent]) -> str:
    """Return a string of agent names."""
    if not agents:
        return ""

    if len(agents) == 1:
        if isinstance(agents[0], ReActAgent):
            return agents[0].name
        return agents[0]

    names = []
    for agent in agents:
        if isinstance(agent, ReActAgent):
            names.append(agent.name)
        else:
            names.append(agent)
    return ", ".join([*names[:-1], "and " + names[-1]])


class EchoAgent(AgentBase):
    """Echo agent that repeats the input message."""

    def __init__(self) -> None:
        super().__init__()
        self.name = "Moderator"

    async def reply(self, content: str) -> Msg:
        """Repeat the input content with its name and role."""
        msg = Msg(
            self.name,
            content,
            role="assistant",
        )
        await self.print(msg)
        return msg

    async def handle_interrupt(
        self,
        *args: Any,
        **kwargs: Any,
    ) -> Msg:
        """Handle interrupt."""

    async def observe(self, msg: Msg | list[Msg] | None) -> None:
        """Observe the user's message."""


class Players:
    """Maintain the players' status."""

    def __init__(self) -> None:
        """Initialize the players."""
        # The mapping from player name to role
        self.name_to_role = {}
        self.role_to_names = defaultdict(list)
        self.name_to_agent = {}
        self.werewolves = []
        self.villagers = []
        self.seer = []
        self.witch = []
        self.current_alive = []
        self.all_players = []

    def add_player(self, player: ReActAgent, role: str) -> None:
        """Add a player to the game.

        Args:
            player (`ReActAgent`):
                The player to be added.
            role (`str`):
                The role of the player.
        """
        self.name_to_role[player.name] = role
        self.name_to_agent[player.name] = player
        self.role_to_names[role].append(player.name)
        self.all_players.append(player)
        if role == "werewolf":
            self.werewolves.append(player)
        elif role == "villager":
            self.villagers.append(player)
        elif role == "seer":
            self.seer.append(player)
        elif role == "witch":
            self.witch.append(player)
        else:
            raise ValueError(f"Unknown role: {role}")
        self.current_alive.append(player)

    def update_players(self, dead_players: list[str]) -> None:
        """Update the current alive players.

        Args:
            dead_players (`list[str]`):
                The names of the dead players to be removed.
        """
        self.werewolves = [
            _ for _ in self.werewolves if _.name not in dead_players
        ]
        self.villagers = [
            _ for _ in self.villagers if _.name not in dead_players
        ]
        self.seer = [_ for _ in self.seer if _.name not in dead_players]
        self.witch = [_ for _ in self.witch if _.name not in dead_players]
        self.current_alive = [
            _ for _ in self.current_alive if _.name not in dead_players
        ]

    def print_roles(self) -> None:
        """Print the roles of all players."""
        print("Roles:")
        for name, role in self.name_to_role.items():
            print(f"  - {name}: {role}")

    def check_winning(self) -> str | None:
        """Check if the game is over and return the winning message."""

        # Prepare true roles string
        true_roles = (
            f'{names_to_str(self.role_to_names["werewolf"])} are werewolves, '
            f'{names_to_str(self.role_to_names["villager"])} are villagers, '
            f'{names_to_str(self.role_to_names["seer"])} is the seer, '
            f'and {names_to_str(self.role_to_names["witch"])} is the witch.'
        )

        if len(self.werewolves) * 2 >= len(self.current_alive):
            return Prompts.to_all_wolf_win.format(
                n_alive=len(self.current_alive),
                n_werewolves=len(self.werewolves),
                true_roles=true_roles,
            )
        if self.current_alive and not self.werewolves:
            return Prompts.to_all_village_win.format(
                true_roles=true_roles,
            )
        return None