Add examples for werewolf game tuner (#96)

This commit is contained in:
garyzhang99
2026-01-16 17:25:49 +08:00
committed by GitHub
parent 9503bda45d
commit 5855c5161b
16 changed files with 2095 additions and 8 deletions


@@ -8,7 +8,7 @@ This directory contains several examples of how to use the AgentScope Tuner for
| Frozen Lake | Make an agent to navigate the Frozen Lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Learn to Ask | Using LLM as a judge to provide feedback to facilitate agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
| Email Search | Enhance the tool use ability of your agent on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolves](./werewolves)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |
Each example contains a README file with detailed instructions on how to set up and run the tuning process for that specific scenario. Feel free to explore and modify the examples to suit your needs!


@@ -1,19 +1,19 @@
# AgentScope Tuner (Chinese documentation)
# AgentScope Tuner
This directory contains several examples of using the AgentScope Tuner to tune AgentScope applications. The table below summarizes the available examples:
| Example | Description | Path | Multi-Step | LLM Judge | Tool Use | Multi-Agent | Data Augmentation |
|------------------|-------------------------------------------|---------------------------------|----------|----------|----------|----------|----------|
| Math Agent | Quick-start example: tune a math agent to improve its capability. | [math_agent](./math_agent) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Frozen Lake | Make an agent navigate the Frozen Lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Frozen Lake | Let an agent learn through multi-step interactions with the frozen lake environment. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Learn to Ask | Use an LLM as a judge to provide feedback for agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
| Email Search | Improve the agent's tool-use ability on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Werewolf Game | Improve the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Werewolf Game | Improve the agent's performance in a multi-agent game setting. | [werewolves](./werewolves)| ✅ | ✅ | ✅ | ✅ | ❌ |
| Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |
Each example directory contains a detailed README describing the tuning workflow and usage for that scenario. Feel free to explore and adapt the examples to your needs!
## Prerequisites
## Requirements
AgentScope Tuner requires:
@@ -21,6 +21,6 @@ AgentScope Tuner requires:
- `agentscope>=1.0.12`
- `trinity-rft>=0.4.1`
AgentScope Tuner is built on top of [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
AgentScope Tuner is built upon [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
Please refer to the [Trinity-RFT installation guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_installation.html)
for detailed installation instructions.

tuner/werewolves/README.md

@@ -0,0 +1,364 @@
# Training Werewolf Game with RL using AgentScope-Tuner
This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with the AgentScope tuner framework (AS-Tune). We employ the multi-step Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.
## Overview
The werewolf game is a complex social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:
- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
- Develop advanced strategies like "deep cover" tactics
- Deceive villagers and mislead investigations
## Task Setting
### Training Objective
The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward is rule-based (see the sketch after this list):
- **Reward = +1.0** if werewolves win (all villagers eliminated)
- **Reward = 0.0** if villagers win (all werewolves eliminated)
- **Reward = -0.1** for game execution errors (penalty to discourage invalid behaviors)
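As a minimal sketch (assuming, as in the workflow code below, that the game loop reports a boolean `good_guy_win` and that execution errors are caught separately), the rule can be written as:
```python
def werewolf_reward(good_guy_win: bool, game_error: bool = False) -> float:
    """Rule-based reward from the werewolves' perspective (illustrative only)."""
    if game_error:
        return -0.1  # small penalty to discourage invalid behaviors
    return 1.0 if not good_guy_win else 0.0  # werewolves win -> 1.0, lose -> 0.0
```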
### Game Configuration
This implementation is based on the `games/game_werewolves` example but with several key modifications:
Original 9-Player Setup:
- 3 Werewolves, 3 Villagers, 1 Seer, 1 Witch, 1 Hunter
- Witch cannot self-rescue (use healing potion on herself)
Modified 7-Player Setup (This Project):
- 2 Werewolves: Kill one player each night, must hide identity during the day
- 3 Villagers: Ordinary players without special abilities
- 1 Seer: Can check one player's identity each night
- 1 Witch: Has two one-time-use potions:
- Healing potion: Save a player from being killed at night (**can self-rescue**)
- Poison potion: Eliminate one player at night
We also make slight modifications to the prompts and ask the players to reason before they speak publicly; the reasoning/statement split is sketched below.
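This split is enforced with structured output models defined in `structured_model.py` (not shown in this diff). A plausible sketch of the discussion model, assuming a pydantic-style schema:
```python
from pydantic import BaseModel, Field


class PublicDiscussionModel(BaseModel):
    """Hypothetical shape of the structured response used in public discussion."""

    reasoning: str = Field(
        description="Private strategic thinking; only the speaking player sees it.",
    )
    statement: str = Field(
        description="The public speech that is broadcast to all players.",
    )
```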
### Models
- **Trainable Model (Werewolf Players)**: `Qwen/Qwen2.5-7B-Instruct`
- **Auxiliary Model (Other Roles)**: `Qwen/Qwen3-30B-A3B-Instruct-2507`
### Algorithm
**Multi-Step GRPO (Group Relative Policy Optimization)**
- Group size: 32 rollouts per task (`repeat_times` in the config)
- Batch size: 24 tasks per training step
- Learning rate: 1e-6
- Advantage normalization by episode length (see the sketch after this list)
- Clipping range: [0.2, 0.28]
- No KL penalty (`kl_coef: 0`)
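To make "advantage normalization by episode length" concrete, here is a rough sketch of group-relative advantages with per-step normalization. This is an illustration only; the exact formula Trinity-RFT applies (configured via `advantage_fn_args` in `config.yaml`) may differ.
```python
import numpy as np


def grpo_advantages(
    rewards: np.ndarray,    # one terminal reward per rollout in the group
    n_steps: np.ndarray,    # episode length of each rollout
    epsilon: float = 1e-6,  # numerical stability, as in config.yaml
) -> np.ndarray:
    """Sketch: normalize rewards within the group, then by episode length."""
    adv = (rewards - rewards.mean()) / (rewards.std() + epsilon)
    return adv / n_steps  # `step_norm: true` spreads credit over variable-length games
```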
## Dataset Preparation
The dataset for this task is minimal and consists only of random **seeds** for role shuffling. Each training episode uses a different seed to randomize player role assignments, ensuring diverse training scenarios.
### Generate Dataset
Run the `prepare_data.py` script to generate the dataset:
```bash
# Generate default dataset (300 seeds for training)
python prepare_data.py
# Or customize the number of seeds
python prepare_data.py --num_seeds 500
```
This will create `data/train.jsonl` (or `data/eval.jsonl`) with the following format:
```json
{"seed": 0}
{"seed": 1}
{"seed": 2}
...
```
During training, these seeds are used to shuffle role assignments via `np.random.shuffle()`, creating varied game configurations.
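For example, the snippet below (mirroring the workflow code in the next section) shows how a single seed deterministically fixes one role assignment:
```python
import numpy as np

roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]
np.random.seed(0)         # e.g. the first line of data/train.jsonl: {"seed": 0}
np.random.shuffle(roles)  # same seed -> same assignment; different seeds -> diverse games
print(roles)              # roles[i] is assigned to Player{i+1}
```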
## Code Implementation
### High-Level Workflow
The training workflow consists of the following key components:
#### 1. Agent Workflow (`run_werewolves_workflow`)
```python
async def run_werewolves_workflow(task, model, auxiliary_models):
# 1. Initialize roles
roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]
# 2. Shuffle based on task seed
np.random.seed(task["seed"])
np.random.shuffle(roles)
    # 3. Create agents: werewolves use the trainable model, others the auxiliary model
    participant_model = auxiliary_models["participant"]
players = [
ReActAgent(
name=f"Player{i+1}",
model=model if role == "werewolf" else participant_model,
...
) for i, role in enumerate(roles)
]
# 4. Run the game
good_guy_win = await werewolves_game(players, roles)
# 5. Compute reward
reward = 1.0 if not good_guy_win else 0.0
return WorkflowOutput(reward=reward, metrics={...})
```
#### 2. Game Loop (`werewolves_game`)
Each game consists of alternating night and day phases:
**Night Phase:**
1. **Werewolves' Turn**: Discuss privately and vote to kill a player
2. **Witch's Turn**: Decide whether to use healing/poison potions
3. **Seer's Turn**: Check one player's identity
**Day Phase:**
1. **Announcement**: Moderator announces who died during the night
2. **Discussion**: All alive players speak in turn, with private reasoning separated from public statements
3. **Voting**: All players vote to eliminate one suspected werewolf
4. **Last Words**: Eliminated player gives final statement
The game continues until:
- All werewolves are eliminated (villagers win), or
- Werewolves equal or outnumber the other players (werewolves win), as sketched below
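The actual check lives in `Players.check_winning()` in `game.py`; a hypothetical version of the two conditions above might look like:
```python
def check_winning(n_alive_werewolves: int, n_alive_total: int) -> str | None:
    """Return the winning side if the game is over, else None (sketch only)."""
    if n_alive_werewolves == 0:
        return "villagers"  # all werewolves eliminated
    if 2 * n_alive_werewolves >= n_alive_total:
        return "werewolves"  # werewolves equal or outnumber the rest
    return None  # the game continues
```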
#### 3. Reward Calculation
The reward is computed based on the game outcome from the perspective of werewolves:
```python
if not good_guy_win: # Werewolves win
reward = 1.0
else: # Villagers win
reward = 0.0
```
## How to Run
### Prerequisites
1. Install AgentScope with tuner support:
```bash
pip install agentscope[full]
```
2. Set up environment variables (optional, can be configured in code):
```bash
export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507"
export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
```
### Configuration
The project uses a hybrid configuration approach:
1. **High-level parameters** in `main.py`:
- Model paths
- Dataset configuration
- Algorithm parameters (group_size, batch_size, learning_rate)
2. **Detailed infrastructure settings** in `config.yaml`:
- Cluster configuration (nodes, GPUs)
- Explorer settings (rollout engines, timeouts)
- Trainer settings (gradient clipping, batch sizes)
- Monitor configuration (WandB integration)
Key parameters to adjust:
```python
# In main.py
trained_model_path = "Qwen/Qwen2.5-7B-Instruct"
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507"
dataset = DatasetConfig(
path="data",
split="train",
total_steps=400, # Total training steps
)
algorithm = AlgorithmConfig(
algorithm_type="multi_step_grpo",
group_size=32, # Rollouts per batch
batch_size=24, # Training batches per step
learning_rate=1e-6,
save_interval_steps=100,
eval_interval_steps=100,
)
```
### Training Command
**Step 1: Prepare the dataset**
```bash
cd /path/to/tuner/werewolves
python prepare_data.py --num_seeds 300
```
**Step 2: Start the Ray cluster**
```bash
# For single node
ray start --head
# For multi-node cluster (e.g., 4 nodes with 8 GPUs each):
# On the head node:
ray start --head --port=6379
# On each worker node:
ray start --address='<head_node_ip>:6379'
# Replace <head_node_ip> with the actual IP address of your head node
```
**Step 3: Run training**
Run the training script on the head node:
```bash
python main.py
```
For distributed training, make sure the Ray cluster is properly configured and that `node_num` and `gpu_per_node` in `config.yaml` match your hardware:
```yaml
cluster:
node_num: 4
gpu_per_node: 8
```
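As a quick sanity check on this budget: the default `config.yaml` allocates 16 rollout engines and 8 auxiliary engines at `tensor_parallel_size: 1`, leaving the remaining GPUs for the trainer. The arithmetic below is only an illustration; how Trinity-RFT actually partitions devices is up to the framework.
```python
# Hypothetical GPU accounting for the default config.yaml values.
node_num, gpu_per_node = 4, 8
total_gpus = node_num * gpu_per_node  # 32
rollout_gpus = 16 * 1                 # rollout_model: engine_num * tensor_parallel_size
auxiliary_gpus = 8 * 1                # auxiliary participant engines
trainer_gpus = total_gpus - rollout_gpus - auxiliary_gpus
assert trainer_gpus > 0, "inference engines over-allocate the cluster"
print(f"GPUs left for the trainer: {trainer_gpus}")  # 8
```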
## Results
### Quantitative Results
Training on the 7-player werewolf game for 400 steps demonstrates significant improvement:
| Metric | Before Training | After Training (Step 200+) |
|--------|----------------|---------------------------|
| Werewolf Win Rate | ~50% | ~85% |
| Success Rate | 0.50 | 0.85 |
**Reward Curve:**
![Rollout Reward Curve](./rollout_reward_curve.png)
As shown in the reward curve above, the werewolf win rate steadily increases during training:
- **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies
- **Steps 50-200**: Win rate climbs to 70-80% as effective strategies emerge
- **Steps 200-250**: Win rate stabilizes around 85%
### Qualitative Results
#### Before Training: Critical Mistakes
Untrained models (Qwen2.5-7B-Instruct) make fundamental errors:
- **Self-exposure**: Werewolves accidentally reveal identity in public ("as a werewolf, I think...")
- **Exposing teammates**: When voted out, werewolves point to teammates in last words
- **Random strategy**: No coordination, no threat prioritization (e.g., ignoring the Seer)
#### After Training: "Deep Cover" Werewolf Strategy
After 200+ steps, the trained model **converges to a sophisticated "deep cover" tactic**.
Below is an example where a werewolf sacrifices a teammate to gain trust, then dominates the late game.
**Concrete Example:**
**Setup (Day 1):** The Seer publicly identifies Player5 as a werewolf. The Witch and villagers join the accusation. Player5 (werewolf) is doomed.
**Player4 (werewolf teammate) responds strategically:**
**In short:** Player4 sacrifices his teammate to gain trust, pretending to be a logical villager.
**Detailed reasoning and statement:**
*Private Reasoning (only Player4 sees):*
```
"I'm a werewolf, and I know Player2 was targeted last night but survived.
My teammate Player5 is doomed - Player3 and Player7 are attacking him together.
I need to deflect suspicion from MYSELF, not save Player5.
Strategy: Sacrifice Player5 to gain trust. If I vote against my teammate,
villagers will think I'm trustworthy. This creates confusion and positions
me as a "confirmed good guy" for later rounds."
```
*Public Statement (all players see):*
```
✅ "I think Player5 is acting very strangely. He brought up Player2 being targeted,
but didn't explain why he's so focused on them. It feels like a distraction tactic.
I also noticed that Player3 and Player7 are both accusing Player5, which makes sense -
their logic is solid. I think we should seriously consider voting out Player5 -
he's the most suspicious person right now."
```
**Result:** Player5 eliminated, but Player4 gains complete trust from Seer and all villagers.
**Why it works:**
- Seer trusts Player4 as strong villager ally → won't check him
- Villagers follow Player4's "logical" analysis
- Player4 systematically misleads discussions in later rounds
- Survives to final 2 players → **werewolves win**
This demonstrates the essence of trained behavior: **sacrifice pieces strategically to secure ultimate victory**. The model learns that short-term teammate loss is worthwhile for establishing deep cover and long-term dominance.
---
## Bonus: Training Good Guys
In addition to training werewolves, we also provide a configuration for training the **good guy side** (villagers, seer, and witch). This is a more challenging task as good guys need to:
- Perform complex reasoning to identify werewolves from subtle behavioral cues
- Coordinate effectively without explicit team communication
- Resist manipulation and deception from werewolves
- **Train multiple roles simultaneously**: Unlike werewolves (single role), good guys include villager, seer, and witch with different abilities, requiring the model to master diverse strategies in one training run, and make optimal use of special abilities (Seer's checks, Witch's potions)
### Configuration
Use `config_train_goodguy.yaml` or set `trainable_target: good_guy` in `workflow_args`:
```yaml
workflow_args:
trainable_target: good_guy # Train villager, seer, and witch
```
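Under the hood, `trainable_target` only decides which roles receive the trainable model; the selection logic in `main.py` reduces to:
```python
def use_trainable_model(role: str, trainable_target: str) -> bool:
    """Mirror of the role-to-model selection in run_werewolves_workflow."""
    if trainable_target == "werewolf":
        return role == "werewolf"
    # trainable_target == "good_guy": villager, seer, and witch are trained
    return role in ("villager", "seer", "witch")
```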
### Quantitative Results
We trained `Qwen3-4B-Instruct-2507` as the good guys against `Qwen3-30B-A3B-Instruct-2507` werewolves:
| Metric | Before Training | After ~200 Steps | After ~400 Steps |
|--------|----------------|------------------|------------------|
| Good Guy Win Rate | ~18% | ~60% | ~80% |
**Training Curve:**
![Good Guy Training Curve](./rollout_reward_curve_goodguy.png)
The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.
### Qualitative Results
After training, the good guy models exhibit advanced reasoning patterns:
- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
- **Villager**: Evidence-chain analysis, trust building with special roles, consensus formation for team coordination
---
## Conclusion
This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AS-Tune's multi-step GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.
**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!

tuner/werewolves/__init__.py

@@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
"""Werewolf game training example for AgentScope."""

tuner/werewolves/config.yaml

@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run
name: Werewolves-7Player-GRPO
# Root directory for saving checkpoints. Uses environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
# ============================================
# Algorithm Configuration
# ============================================
algorithm:
# Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
algorithm_type: multi_step_grpo
# KL divergence loss function for regularization during training
# "low_var_kl": low-variance KL loss suitable for multi-step optimization
kl_loss_fn: "low_var_kl"
kl_loss_fn_args:
# KL coefficient: 0 means no KL penalty (pure reward optimization)
kl_coef: 0
# Advantage function configuration for computing policy gradients
advantage_fn_args:
# Small epsilon value for numerical stability in advantage normalization
epsilon: 1e-6
# Normalize advantage by episode length (important for variable-length games)
step_norm: true
# Number of rollouts per task (group size for GRPO)
# Higher values provide better gradient estimates but require more compute
repeat_times: 32
# Policy loss function arguments (PPO-style clipping)
policy_loss_fn_args:
# Lower bound for probability ratio clipping (prevents too large policy updates)
clip_range_low: 0.2
# Upper bound for probability ratio clipping
clip_range_high: 0.28
# Optimizer configuration
optimizer:
# Learning rate for policy updates
lr: 1e-6
# ============================================
# Model Configuration
# ============================================
model:
# Path to the base model (trainable model for werewolf players)
# Uses environment variable if set, otherwise defaults to Qwen2.5-7B-Instruct
model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
# Maximum number of tokens the model can generate per response
max_response_tokens: 4096
# Total context length the model can handle (prompt + response)
max_model_len: 25600
# ============================================
# Cluster Configuration
# ============================================
cluster:
# Number of nodes in the Ray cluster
node_num: 4
# Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
gpu_per_node: 8
# Ray cluster address: "auto" automatically detects the cluster
ray_address: auto
# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
# Total training steps (iterations)
total_steps: 400
# Batch size for rollout collection (tasks per training step)
batch_size: 24
# Batch size for training (experiences per gradient update)
  # For multi-step GRPO, each rollout of a task contains multiple steps, so experiences
  # accumulate across steps before each gradient update. Hence the training batch size
  # should exceed batch_size * repeat_times (2048 > 24 * 32 = 768).
train_batch_size: 2048
# Explorer input configuration (rollout data source)
explorer_input:
taskset:
# Dataset name
name: werewolves
# Storage type: "file" reads from local file system
storage_type: file
# Path to the dataset directory containing train.jsonl
path: data
# Dataset split to use
split: 'train'
# Rollout generation arguments
rollout_args:
# Sampling temperature for exploration (1.0 = full stochasticity)
temperature: 1.0
# Maximum tokens per generation
max_tokens: 4096
# Workflow-specific arguments
workflow_args:
# Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
# - "werewolf": Train werewolf players (default)
# - "good_guy": Train villager, seer, and witch players
trainable_target: werewolf
# Evaluation datasets (empty for this experiment)
eval_tasksets: []
# Trainer input configuration (training data source)
trainer_input:
experience_buffer:
# Name of the experience buffer
name: werewolves
# Storage type: "queue" for in-memory communication between explorer and trainer
storage_type: queue
# Maximum time (seconds) to wait for data from explorer before timeout
max_read_timeout: 7200
# Replay buffer configuration for experience reuse
replay_buffer:
# Enable replay buffer to reuse past experiences
enable: true
# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
# Number of parallel workflow runners per rollout model instance
# Higher values = more parallel game simulations
runner_per_model: 16
# Maximum time (seconds) to wait for a single task completion
max_timeout: 3600
# Number of retries if a task fails or times out (0 = no retry)
max_retry_times: 0
# Number of times each task is repeated within a single runner
# (usually 1; repeat_times is handled at algorithm level)
max_repeat_times_per_runner: 1
# Over-rollout configuration (collect more data than needed)
over_rollout:
# Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
# Improves throughput by not waiting for slow tasks
ratio: 0.2
# Minimum wait time (seconds) after reaching the threshold before proceeding
wait_after_min: 15
# Dynamic timeout configuration (adaptive timeout based on task completion time)
dynamic_timeout:
# Enable dynamic timeout adjustment
enable: true
# Timeout multiplier: timeout = average_task_time * ratio
ratio: 4
# Rollout model configuration (trainable model for werewolf players)
rollout_model:
# Number of vLLM engine instances for parallel inference
engine_num: 16
# Tensor parallelism size (GPUs per engine instance)
tensor_parallel_size: 1
# Disable KV cache prefix sharing (usually False for training stability)
enable_prefix_caching: false
# Disable CUDA graph optimization (False = use CUDA graphs for faster inference)
enforce_eager: false
# Enable OpenAI-compatible API interface for vLLM
enable_openai_api: true
# Enable conversation history tracking
enable_history: true
# Enable automatic tool choice in function calling
enable_auto_tool_choice: true
# Tool call parser for structured outputs (hermes format)
tool_call_parser: hermes
# Data type for model weights and activations
dtype: bfloat16
# Random seed for reproducibility
seed: 42
# Auxiliary models configuration (for non-werewolf players: villagers, seer, witch)
auxiliary_models:
- # Model name identifier
name: participant
# Path to auxiliary model (stronger model for stable baseline opponents)
model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
# Number of vLLM engines for auxiliary model (fewer than rollout model)
engine_num: 8
# Tensor parallelism size
tensor_parallel_size: 1
# Disable prefix caching
enable_prefix_caching: false
# Disable CUDA graph optimization
enforce_eager: false
# Enable OpenAI API interface
enable_openai_api: true
# Enable automatic tool choice
enable_auto_tool_choice: true
# Tool call parser
tool_call_parser: hermes
# Data type
dtype: bfloat16
# Random seed
seed: 42
# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
# Synchronization style: "dynamic_by_explorer" = sync when explorer requests
# Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
sync_style: dynamic_by_explorer
# Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
# Alternatives: "checkpoint" (slower, file-based sync)
sync_method: 'nccl'
# Sync weights every N training steps
sync_interval: 1
# Timeout (seconds) for weight synchronization before failing
sync_timeout: 72000
# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
# Save checkpoint every N training steps (0 = only save at end)
save_interval: 100
# Gradient clipping threshold to prevent exploding gradients
grad_clip: 1.0
# Use dynamic batch size to maximize GPU memory utilization
use_dynamic_bsz: true
# Maximum token length per GPU for memory management
# Higher = more efficient but requires more VRAM
max_token_len_per_gpu: 16384
# Ulysses sequence parallelism size for handling long sequences
# 2 = split sequence across 2 GPUs
ulysses_sequence_parallel_size: 2
# ============================================
# Monitor Configuration (Logging & Tracking)
# ============================================
monitor:
# Monitoring/logging backend: "wandb" for Weights & Biases
# Alternatives: "tensorboard"
monitor_type: wandb

tuner/werewolves/config_train_goodguy.yaml

@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run (training good guys)
name: Werewolves-7Player-GRPO-train-goodguy
# Root directory for saving checkpoints. Uses environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
# ============================================
# Algorithm Configuration
# ============================================
algorithm:
# Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
algorithm_type: multi_step_grpo
# KL divergence loss function for regularization during training
# "low_var_kl": low-variance KL loss suitable for multi-step optimization
kl_loss_fn: "low_var_kl"
kl_loss_fn_args:
# KL coefficient: 0 means no KL penalty (pure reward optimization)
kl_coef: 0
# Advantage function configuration for computing policy gradients
advantage_fn_args:
# Small epsilon value for numerical stability in advantage normalization
epsilon: 1e-6
# Normalize advantage by episode length (important for variable-length games)
step_norm: true
# Number of rollouts per task (group size for GRPO)
# Higher values provide better gradient estimates but require more compute
repeat_times: 32
# Policy loss function arguments (PPO-style clipping)
policy_loss_fn_args:
# Lower bound for probability ratio clipping (prevents too large policy updates)
clip_range_low: 0.2
# Upper bound for probability ratio clipping
clip_range_high: 0.28
# Optimizer configuration
optimizer:
# Learning rate for policy updates
lr: 1e-6
# ============================================
# Model Configuration
# ============================================
model:
# Path to the base model (trainable model for good guy players: villager, seer, witch)
# Uses environment variable if set, otherwise defaults to Qwen3-4B-Instruct-2507
model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-4B-Instruct-2507}
# Maximum number of tokens the model can generate per response
max_response_tokens: 4096
# Total context length the model can handle (prompt + response)
max_model_len: 25600
# ============================================
# Cluster Configuration
# ============================================
cluster:
# Number of nodes in the Ray cluster
node_num: 4
# Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
gpu_per_node: 8
# Ray cluster address: "auto" automatically detects the cluster
ray_address: auto
# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
# Total training steps (iterations)
total_steps: 400
# Batch size for rollout collection (tasks per training step)
batch_size: 24
# Batch size for training (experiences per gradient update)
  # For multi-step GRPO, each rollout of a task contains multiple steps, so experiences
  # accumulate across steps before each gradient update. Hence the training batch size
  # should exceed batch_size * repeat_times (2048 > 24 * 32 = 768).
train_batch_size: 2048
# Explorer input configuration (rollout data source)
explorer_input:
taskset:
# Dataset name
name: werewolves
# Storage type: "file" reads from local file system
storage_type: file
# Path to the dataset directory containing train.jsonl
path: data
# Dataset split to use
split: 'train'
# Rollout generation arguments
rollout_args:
# Sampling temperature for exploration (1.0 = full stochasticity)
temperature: 1.0
# Maximum tokens per generation
max_tokens: 4096
# Workflow-specific arguments
workflow_args:
# Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
# - "werewolf": Train werewolf players
# - "good_guy": Train villager, seer, and witch players (this config)
trainable_target: good_guy
# Evaluation datasets (empty for this experiment)
eval_tasksets: []
# Trainer input configuration (training data source)
trainer_input:
experience_buffer:
# Name of the experience buffer
name: werewolves
# Storage type: "queue" for in-memory communication between explorer and trainer
storage_type: queue
# Maximum time (seconds) to wait for data from explorer before timeout
max_read_timeout: 7200
# Replay buffer configuration for experience reuse
replay_buffer:
# Enable replay buffer to reuse past experiences
enable: true
# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
# Number of parallel workflow runners per rollout model instance
# Higher values = more parallel game simulations
runner_per_model: 16
# Maximum time (seconds) to wait for a single task completion
max_timeout: 3600
# Number of retries if a task fails or times out (0 = no retry)
max_retry_times: 0
# Number of times each task is repeated within a single runner
# (usually 1; repeat_times is handled at algorithm level)
max_repeat_times_per_runner: 1
# Over-rollout configuration (collect more data than needed)
over_rollout:
# Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
# Improves throughput by not waiting for slow tasks
ratio: 0.2
# Minimum wait time (seconds) after reaching the threshold before proceeding
wait_after_min: 15
# Dynamic timeout configuration (adaptive timeout based on task completion time)
dynamic_timeout:
# Enable dynamic timeout adjustment
enable: true
# Timeout multiplier: timeout = average_task_time * ratio
ratio: 4
# Rollout model configuration (trainable model for good guy players)
rollout_model:
# Number of vLLM engine instances for parallel inference
engine_num: 16
# Tensor parallelism size (GPUs per engine instance)
tensor_parallel_size: 1
# Disable KV cache prefix sharing (usually False for training stability)
enable_prefix_caching: false
# Disable CUDA graph optimization (False = use CUDA graphs for faster inference)
enforce_eager: false
# Enable OpenAI-compatible API interface for vLLM
enable_openai_api: true
# Enable conversation history tracking
enable_history: true
# Enable automatic tool choice in function calling
enable_auto_tool_choice: true
# Tool call parser for structured outputs (hermes format)
tool_call_parser: hermes
# Data type for model weights and activations
dtype: bfloat16
# Random seed for reproducibility
seed: 42
# Auxiliary models configuration (for werewolf players when training good guys)
auxiliary_models:
- # Model name identifier
name: participant
# Path to auxiliary model (stronger model for stable baseline opponents)
model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
# Number of vLLM engines for auxiliary model (fewer than rollout model)
engine_num: 8
# Tensor parallelism size
tensor_parallel_size: 1
# Disable prefix caching
enable_prefix_caching: false
# Disable CUDA graph optimization
enforce_eager: false
# Enable OpenAI API interface
enable_openai_api: true
# Enable automatic tool choice
enable_auto_tool_choice: true
# Tool call parser
tool_call_parser: hermes
# Data type
dtype: bfloat16
# Random seed
seed: 42
# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
# Synchronization style: "dynamic_by_explorer" = sync when explorer requests
# Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
sync_style: dynamic_by_explorer
# Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
# Alternatives: "checkpoint" (slower, file-based sync)
sync_method: 'nccl'
# Sync weights every N training steps
sync_interval: 1
# Timeout (seconds) for weight synchronization before failing
sync_timeout: 72000
# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
# Save checkpoint every N training steps (0 = only save at end)
save_interval: 100
# Gradient clipping threshold to prevent exploding gradients
grad_clip: 1.0
# Use dynamic batch size to maximize GPU memory utilization
use_dynamic_bsz: true
# Maximum token length per GPU for memory management
# Higher = more efficient but requires more VRAM
max_token_len_per_gpu: 16384
# Ulysses sequence parallelism size for handling long sequences
# 2 = split sequence across 2 GPUs
ulysses_sequence_parallel_size: 2
# ============================================
# Monitor Configuration (Logging & Tracking)
# ============================================
monitor:
# Monitoring/logging backend: "wandb" for Weights & Biases
# Alternatives: "tensorboard"
monitor_type: wandb

tuner/werewolves/game.py

@@ -0,0 +1,399 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=too-many-branches, too-many-statements, no-name-in-module, W0707
"""A werewolf game implemented by agentscope with structured reasoning - 7 Player Version."""
from utils import (
majority_vote,
names_to_str,
EchoAgent,
MAX_GAME_ROUND,
MAX_DISCUSSION_ROUND,
Players,
)
from structured_model import (
DiscussionModel,
PublicDiscussionModel,
get_vote_model,
get_poison_model,
WitchResurrectModel,
get_seer_model,
)
from prompt import EnglishPrompts as Prompts
# Uncomment the following line to use Chinese prompts
# from prompt import ChinesePrompts as Prompts
from agentscope.agent import ReActAgent
from agentscope.message import Msg
from agentscope.pipeline import (
MsgHub,
fanout_pipeline,
)
class BadGuyException(Exception):
    """Raised when the werewolves fail to reach a valid decision during the game."""
moderator = EchoAgent()
async def werewolves_game(agents: list[ReActAgent], roles: list[str]) -> bool:
    """The main entry of the werewolf game - 7 Player Version.

    Args:
        agents (`list[ReActAgent]`):
            A list of exactly 7 agents.
        roles (`list[str]`):
            The role assigned to each agent, aligned with ``agents``.

    Returns:
        `bool`: ``True`` if the good guys (villagers, seer, and witch) win.
    """
assert (
len(agents) == 7
), "The 7-player werewolf game needs exactly 7 players."
# Init the players' status
players = Players()
# If the witch has healing and poison potion
healing, poison = True, True
# If it's the first day, the dead can leave a message
first_day = True
# Broadcast the game begin message
async with MsgHub(participants=agents) as greeting_hub:
await greeting_hub.broadcast(
await moderator(
Prompts.to_all_new_game.format(names_to_str(agents)),
),
)
# Assign roles to the agents - 2 werewolves, 3 villagers, 1 seer, 1 witch
for agent, role in zip(agents, roles):
# Tell the agent its role
await agent.observe(
await moderator(
f"[{agent.name} ONLY] {agent.name}, your role is {role}.",
),
)
players.add_player(agent, role)
# Printing the roles
players.print_roles()
# GAME BEGIN!
for _ in range(MAX_GAME_ROUND):
# Create a MsgHub for all players to broadcast messages
async with MsgHub(
participants=players.current_alive,
enable_auto_broadcast=False, # manual broadcast only
name="alive_players",
) as alive_players_hub:
# Night phase
await alive_players_hub.broadcast(
await moderator(Prompts.to_all_night),
)
killed_player, poisoned_player = None, None
try:
# Werewolves discuss
async with MsgHub(
players.werewolves,
enable_auto_broadcast=True,
announcement=await moderator(
Prompts.to_wolves_discussion.format(
names_to_str(players.werewolves),
names_to_str(players.current_alive),
),
),
name="werewolves",
) as werewolves_hub:
# Discussion
n_werewolves = len(players.werewolves)
for _ in range(1, MAX_DISCUSSION_ROUND * n_werewolves + 1):
res = await players.werewolves[_ % n_werewolves](
structured_model=DiscussionModel,
)
if _ % n_werewolves == 0 and res.metadata.get(
"reach_agreement",
):
break
# Werewolves vote
# Disable auto broadcast to avoid following other's votes
werewolves_hub.set_auto_broadcast(False)
msgs_vote = await fanout_pipeline(
players.werewolves,
msg=await moderator(content=Prompts.to_wolves_vote),
structured_model=get_vote_model(players.current_alive),
enable_gather=False,
)
killed_player, votes = majority_vote(
[_.metadata.get("vote") for _ in msgs_vote],
)
# Postpone the broadcast of voting
await werewolves_hub.broadcast(
[
*msgs_vote,
await moderator(
Prompts.to_wolves_res.format(
votes,
killed_player,
),
),
],
)
except Exception as e:
raise BadGuyException(
f"Werewolves failed to make a decision: {e}",
)
# Witch's turn
await alive_players_hub.broadcast(
await moderator(Prompts.to_all_witch_turn),
)
msg_witch_poison = None
for agent in players.witch:
# Witch can heal herself (self-rescue allowed)
msg_witch_resurrect = None
if healing and killed_player:
msg_witch_resurrect = await agent(
await moderator(
Prompts.to_witch_resurrect.format(
witch_name=agent.name,
dead_name=killed_player,
),
),
structured_model=WitchResurrectModel,
)
if msg_witch_resurrect.metadata.get("resurrect"):
killed_player = None
healing = False
# Has poison potion and hasn't used the healing potion
if poison and not (
msg_witch_resurrect
and msg_witch_resurrect.metadata["resurrect"]
):
msg_witch_poison = await agent(
await moderator(
Prompts.to_witch_poison.format(
witch_name=agent.name,
),
),
structured_model=get_poison_model(
players.current_alive,
),
)
if msg_witch_poison.metadata.get("poison"):
poisoned_player = msg_witch_poison.metadata.get("name")
poison = False
# Seer's turn
await alive_players_hub.broadcast(
await moderator(Prompts.to_all_seer_turn),
)
for agent in players.seer:
msg_seer = await agent(
await moderator(
Prompts.to_seer.format(
agent.name,
names_to_str(players.current_alive),
),
),
structured_model=get_seer_model(players.current_alive),
)
if msg_seer.metadata.get("name"):
player = msg_seer.metadata["name"]
await agent.observe(
await moderator(
Prompts.to_seer_result.format(
agent_name=player,
role=players.name_to_role[player],
),
),
)
# Update alive players (no hunter in 7-player version)
dead_tonight = [killed_player, poisoned_player]
players.update_players(dead_tonight)
# Day phase
if len([_ for _ in dead_tonight if _]) > 0:
await alive_players_hub.broadcast(
await moderator(
Prompts.to_all_day.format(
names_to_str([_ for _ in dead_tonight if _]),
),
),
)
            # The killed player leaves a last message on the first night
if killed_player and first_day:
msg_moderator = await moderator(
Prompts.to_dead_player.format(killed_player),
)
await alive_players_hub.broadcast(msg_moderator)
# Leave a message with structured reasoning
dead_agent = players.name_to_agent[killed_player]
last_words_response = await dead_agent(
structured_model=PublicDiscussionModel,
)
# Extract reasoning and statement from metadata
reasoning = last_words_response.metadata.get(
"reasoning",
"",
)
statement = last_words_response.metadata.get(
"statement",
"",
)
# Only broadcast the public statement
public_last_msg = Msg(
name=dead_agent.name,
content=statement,
role="assistant",
)
await alive_players_hub.broadcast(public_last_msg)
# Let the dead player observe their own private reasoning
private_reasoning_msg = Msg(
name="self_thought",
content=f"[PRIVATE REASONING] {reasoning}",
role="assistant",
)
await dead_agent.observe(private_reasoning_msg)
else:
await alive_players_hub.broadcast(
await moderator(Prompts.to_all_peace),
)
# Check winning
res = players.check_winning()
if res:
await moderator(res)
break
# Discussion - KEY MODIFICATION: Use structured reasoning
await alive_players_hub.broadcast(
await moderator(
Prompts.to_all_discuss.format(
names=names_to_str(players.current_alive),
),
),
)
# Instead of sequential_pipeline, we manually handle each player
# to separate reasoning from public statement
for player in players.current_alive:
# Get structured response with reasoning and statement
response = await player(
structured_model=PublicDiscussionModel,
)
# Extract reasoning and statement from metadata
reasoning = response.metadata.get("reasoning", "")
statement = response.metadata.get("statement", "")
# Only broadcast the public statement to all players
public_msg = Msg(
name=player.name,
content=statement,
role="assistant",
)
await alive_players_hub.broadcast(public_msg)
# Let the player observe their own private reasoning
# This keeps it in their memory but not visible to others
private_msg = Msg(
name="self_thought",
content=f"[PRIVATE REASONING] {reasoning}",
role="assistant",
)
await player.observe(private_msg)
# Voting
msgs_vote = await fanout_pipeline(
players.current_alive,
await moderator(
Prompts.to_all_vote.format(
names_to_str(players.current_alive),
),
),
structured_model=get_vote_model(players.current_alive),
enable_gather=False,
)
voted_player, votes = majority_vote(
[_.metadata.get("vote") for _ in msgs_vote],
)
# Broadcast the voting messages together to avoid influencing
# each other
voting_msgs = [
*msgs_vote,
await moderator(
Prompts.to_all_res.format(votes, voted_player),
),
]
# Leave a message if voted
if voted_player:
prompt_msg = await moderator(
Prompts.to_dead_player.format(voted_player),
)
# Get structured last words with reasoning
dead_agent = players.name_to_agent[voted_player]
last_words_response = await dead_agent(
prompt_msg,
structured_model=PublicDiscussionModel,
)
# Extract reasoning and statement from metadata
reasoning = last_words_response.metadata.get("reasoning", "")
statement = last_words_response.metadata.get("statement", "")
# Create public statement message
public_last_msg = Msg(
name=dead_agent.name,
content=statement,
role="assistant",
)
# Store private reasoning for the dead player
private_reasoning_msg = Msg(
name="self_thought",
content=f"[PRIVATE REASONING] {reasoning}",
role="assistant",
)
await dead_agent.observe(private_reasoning_msg)
voting_msgs.extend([prompt_msg, public_last_msg])
await alive_players_hub.broadcast(voting_msgs)
# Update alive players (no hunter in 7-player version)
dead_today = [voted_player]
players.update_players(dead_today)
# Check winning
res = players.check_winning()
if res:
async with MsgHub(players.all_players) as all_players_hub:
res_msg = await moderator(res)
await all_players_hub.broadcast(res_msg)
break
# The day ends
first_day = False
# # Game over, each player reflects
# await fanout_pipeline(
# agents=agents,
# msg=await moderator(Prompts.to_all_reflect),
# )
alive_wolves = players.werewolves
good_guy_win = len(alive_wolves) == 0
return good_guy_win

tuner/werewolves/main.py

@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=C0301,C0413,W0621,W0404,C0412,E0611,E1121
"""Example of training a werewolf game agent with Trinity-RFT using AgentScope tuner."""
import sys
from pathlib import Path
from typing import Dict
import traceback
import numpy as np
from agentscope.tuner import (
tune,
WorkflowOutput,
TunerModelConfig,
)
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIMultiAgentFormatter
# Add current directory to path for local imports
sys.path.insert(0, str(Path(__file__).parent))
from game import BadGuyException, werewolves_game # noqa: E402
async def run_werewolves_workflow(
task: Dict,
model: TunerModelConfig,
auxiliary_models: Dict[str, TunerModelConfig],
) -> WorkflowOutput:
"""Run the werewolf game workflow.
Args:
task (Dict): The task information containing:
- 'seed': for role shuffling
- 'workflow_args': optional dict with 'trainable_target' key
("werewolf" or "good_guy", default: "werewolf")
model (TunerModelConfig): The trainable model.
auxiliary_models (Dict[str, TunerModelConfig]): Dictionary of auxiliary
models. Expected to have 'participant' key for opponent players.
Returns:
WorkflowOutput: Contains reward and metrics from the game.
"""
# Initialize roles: 2 werewolves, 3 villagers, 1 seer, 1 witch
roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]
# Shuffle roles based on task seed for reproducibility
seed = task.get("seed", 0)
np.random.seed(seed)
np.random.shuffle(roles)
# Get trainable_target from workflow_args (default: "werewolf")
# Options: "werewolf" or "good_guy" (villager, seer, witch)
workflow_args = task.get("workflow_args", {})
trainable_target = workflow_args.get("trainable_target", "werewolf")
# Get the participant model for opponent players
if "participant" not in auxiliary_models:
raise ValueError(
"Expected 'participant' model in auxiliary_models for opponent players",
)
participant_model = auxiliary_models["participant"]
# Create players with appropriate models based on trainable_target
players = []
for i, role in enumerate(roles):
# Determine which model to use based on trainable_target
if trainable_target == "werewolf":
# Training werewolves: werewolves use trainable model
use_trainable = role == "werewolf"
else: # trainable_target == "good_guy"
# Training good guys: villager, seer, witch use trainable model
use_trainable = role in ["villager", "seer", "witch"]
agent = ReActAgent(
name=f"Player{i + 1}",
sys_prompt=get_official_agent_prompt(f"Player{i + 1}"),
model=model if use_trainable else participant_model,
formatter=OpenAIMultiAgentFormatter(),
max_iters=3,
)
players.append(agent)
try:
# Run the werewolf game
good_guy_win = await werewolves_game(players, roles)
# Calculate reward based on trainable_target
is_success = False
if trainable_target == "werewolf":
# Training werewolves: reward when werewolves win (good_guy_win = False)
if not good_guy_win:
raw_reward = 1.0
is_success = True
else:
raw_reward = 0.0
else: # trainable_target == "good_guy"
# Training good guys: reward when good guys win (good_guy_win = True)
if good_guy_win:
raw_reward = 1.0
is_success = True
else:
raw_reward = 0.0
metrics = {
"success": float(is_success),
"werewolf_win": float(not good_guy_win),
"villager_win": float(good_guy_win),
"trainable_target": trainable_target,
}
return WorkflowOutput(
reward=raw_reward,
metrics=metrics,
)
except BadGuyException as e:
# If game execution fails, give a small penalty
traceback.print_exc()
print(
f"Error during game execution: {e}. "
"Assigning penalty to trainable agents.",
)
return WorkflowOutput(
reward=-0.1,
metrics={"success": 0.0, "game_error": 1.0},
)
except Exception as e:
# Catch any other unexpected errors
traceback.print_exc()
print(f"Unexpected error: {e}")
return WorkflowOutput(
reward=-0.1,
metrics={"success": 0.0, "unexpected_error": 1.0},
)
def get_official_agent_prompt(name: str) -> str:
"""Get the system prompt for an agent.
Args:
name (str): The name of the agent.
Returns:
str: The system prompt.
"""
from textwrap import dedent
system_prompt = dedent(
f"""
You're a werewolf game player named {name}.
# YOUR TARGET
Your target is to win the game together with your teammates.
# GAME RULES
- In werewolf game, players are divided into two werewolves, three villagers, one seer, and one witch.
- Werewolves: kill one player each night, and must hide identity during the day.
- Villagers: ordinary players without special abilities, try to identify and eliminate werewolves.
- Seer: A special villager who can check one player's identity each night.
- Witch: A special villager with two one-time-use potions: a healing potion to save a player (including herself) from being killed at night, and a poison to eliminate one player at night.
- The game alternates between night and day phases until one side wins:
- Night Phase
- Werewolves choose one victim
- Seer checks one player's identity
- Witch decides whether to use potions
- Moderator announces who died during the night
- Day Phase
- All players discuss and vote to eliminate one suspected player
- The werewolves will win the game if they can eliminate all the villagers.
- The villagers will win the game if they can eliminate all the werewolves.
## During PUBLIC discussion (day phase):
- Your response will be split into TWO parts: REASONING (private) and STATEMENT (public)
- REASONING: Your internal thoughts - ONLY YOU can see this. Think freely here.
- STATEMENT: What you actually say - EVERYONE can see this. Be strategic!
## For WEREWOLVES in public discussion:
- ❌ NEVER say "I'm a werewolf" or "we werewolves" in your STATEMENT
- ❌ NEVER reveal your werewolf teammates in your STATEMENT
- ❌ NEVER discuss werewolf strategy in your STATEMENT
- ✅ In REASONING: freely think about werewolf strategy
- ✅ In STATEMENT: pretend to be a villager, seer, or other role
- ✅ In STATEMENT: accuse others, defend yourself, but NEVER reveal your true identity
## For ALL ROLES in public discussion:
- Use REASONING to analyze: "Who might be the werewolf? What's my strategy?"
- Use STATEMENT to speak: "I think Player X is suspicious because..."
- Keep sensitive information in REASONING, not in STATEMENT
## Examples:
### BAD (Werewolf exposing themselves):
REASONING: "I'm a werewolf, I should protect my teammates."
STATEMENT: "As a werewolf, I think we should vote Player 5." ❌ EXPOSED!
### GOOD (Werewolf hiding identity):
REASONING: "I'm a werewolf. Player 5 might be the seer based on their questions. I should cast suspicion on them without being obvious."
STATEMENT: "I find Player 5's behavior suspicious. They've been asking too many questions about people's roles." ✅ HIDDEN!
### GOOD (Villager analyzing):
REASONING: "Player 2 and Player 3 seem to be defending each other. Could they be werewolf teammates?"
STATEMENT: "I noticed Player 2 and Player 3 have been very defensive of each other. This makes me suspicious." ✅ STRATEGIC!
# GAME GUIDANCE
- Try your best to win the game with your teammates, tricks, lies, and deception are all allowed, e.g. pretending to be a different role.
- During discussion, don't be political, be direct and to the point.
- The day phase voting provides important clues. For example, the werewolves may vote together, attack the seer, etc.
## GAME GUIDANCE FOR WEREWOLF
- The seer is your greatest threat: they can check one player's identity each night. Analyzing players' speeches to find and eliminate the seer will greatly increase your chances of winning.
- In the first night, making random choices is common for werewolves since no information is available.
- Pretending to be other roles (seer, witch or villager) is a common strategy to hide your identity and mislead other villagers in the day phase.
- The outcome of the night phase provides important clues, e.g., whether the witch used the healing or poison potion. Use this information to adjust your strategy.
- [CRITICAL] In public discussion, NEVER reveal you are a werewolf. Always pretend to be a villager or other role.
## GAME GUIDANCE FOR SEER
- The seer is very important to the villagers; earn their trust and lead the discussion phase if possible.
- Your ability to check one player's identity is crucial.
- The outcome of the night phase provides important clues, e.g., whether the witch used the healing or poison potion. Use this information to adjust your strategy.
- Consider when to reveal your identity - too early and werewolves will target you, too late and villagers won't trust you.
## GAME GUIDANCE FOR WITCH
- Witch has two powerful potions, use them wisely to protect key villagers or eliminate suspected werewolves.
- [IMPORTANT] You CAN use the healing potion to save yourself if you are killed by werewolves (self-rescue is allowed).
- Consider saving the healing potion for critical moments, especially if you think you might be targeted.
- The outcome of the night phase provides important clues. Use this information to adjust your strategy. For example, the person you save is likely to be on the villagers' side.
## GAME GUIDANCE FOR VILLAGER
- Protecting special villagers, especially the seer, is crucial for your team's success.
- Be cautious and decide whether to trust other players based on their speeches and actions.
- Base your decisions on the information you have received, be logical and engage in the discussion to vote out the suspected werewolves.
# NOTE
- [IMPORTANT] DO NOT make up any information that is not provided by the moderator or other players.
- This is a TEXT-based game, so DO NOT use or make up any non-textual information.
- Always critically reflect on whether your evidence exists, and avoid making assumptions.
- Your response should be specific and concise, provide clear reason and avoid unnecessary elaboration.
- Generate your one-line response by using the `generate_response` function.
- Don't repeat the others' speeches.
- [CRITICAL] Remember: REASONING is private (only you see it), STATEMENT is public (everyone sees it). Use this to your advantage!""",
)
return system_prompt
if __name__ == "__main__":
from agentscope.tuner import (
DatasetConfig,
TunerModelConfig,
AlgorithmConfig,
)
# High-level configuration in code (easy to modify)
config_path = Path(__file__).parent / "config.yaml"
# Setup Model Path
trained_model_path = (
"Qwen/Qwen2.5-7B-Instruct" # fill in your model path here
)
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507" # fill in your auxiliary model path here
# Dataset configuration
dataset = DatasetConfig(
path=str(Path(__file__).parent / "data"),
split="train",
total_steps=400, # Total training steps
)
# Model configuration (trainable model for werewolf players)
model = TunerModelConfig(
model_path=trained_model_path,
max_model_len=25600,
max_tokens=4096,
temperature=1.0,
inference_engine_num=16,
tensor_parallel_size=1,
tool_call_parser="hermes",
reasoning_parser=None,
)
# Auxiliary models (for non-werewolf players)
auxiliary_models = {
"participant": TunerModelConfig(
model_path=auxiliary_model_path,
max_model_len=25600,
max_tokens=4096,
temperature=0.1, # Lower temperature for auxiliary models
inference_engine_num=8,
tensor_parallel_size=1,
tool_call_parser="hermes",
reasoning_parser=None,
),
}
# Algorithm configuration
algorithm = AlgorithmConfig(
algorithm_type="multi_step_grpo",
group_size=32, # repeat_times in Trinity
batch_size=24,
learning_rate=1e-6,
save_interval_steps=100,
eval_interval_steps=100,
)
# Run training with hybrid configuration
# Code parameters above + detailed Trinity config from YAML
tune(
workflow_func=run_werewolves_workflow,
judge_func=None, # We compute reward directly in the workflow
train_dataset=dataset,
model=model,
auxiliary_models=auxiliary_models,
algorithm=algorithm,
config_path=str(config_path), # For cluster, explorer, trainer details
)

tuner/werewolves/prepare_data.py

@@ -0,0 +1,71 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Prepare dataset for werewolf game training.
This script generates a simple dataset consisting of random seeds for role shuffling.
Each seed creates a different initial role assignment, ensuring diverse training scenarios.
"""
import json
import argparse
from pathlib import Path
def prepare_dataset(
output_dir: str,
num_seeds: int = 300,
split: str = "train",
) -> None:
"""Prepare the werewolf game training dataset.
Args:
output_dir (str): Directory to save the dataset.
num_seeds (int): Number of seeds to generate. Default: 300.
split (str): Dataset split name (e.g., 'train', 'eval'). Default: 'train'.
"""
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
output_file = output_path / f"{split}.jsonl"
print(f"Generating {num_seeds} seeds for {split} split...")
with open(output_file, "w", encoding="utf-8") as f:
for seed in range(num_seeds):
data = {"seed": seed}
f.write(json.dumps(data) + "\n")
print(f"Dataset saved to: {output_file}")
print(f"Total samples: {num_seeds}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Prepare dataset for werewolf game training",
)
parser.add_argument(
"--output_dir",
type=str,
default="data",
help="Directory to save the dataset (default: data)",
)
parser.add_argument(
"--num_seeds",
type=int,
default=300,
help="Number of seeds to generate (default: 300)",
)
parser.add_argument(
"--split",
type=str,
default="train",
help="Dataset split name (default: train)",
)
args = parser.parse_args()
prepare_dataset(
output_dir=args.output_dir,
num_seeds=args.num_seeds,
split=args.split,
)

tuner/werewolves/prompt.py

@@ -0,0 +1,198 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Enhanced prompts with reasoning/statement separation awareness - 7 Player Version."""
class EnglishPrompts:
"""English prompts used to guide the werewolf game."""
to_dead_player = (
"{}, you're eliminated now. Now you can make a final statement to "
"all alive players before you leave the game.\n\n"
"IMPORTANT: Your final words will be structured into two parts:\n"
"1. REASONING (Private): Your strategic thinking - NOT visible to others\n"
"2. STATEMENT (Public): Your actual final words - VISIBLE to all players\n\n"
"Think carefully about what information to reveal or hide in your final statement."
)
to_all_new_game = (
"A new game is starting, the players are: {}. Now we randomly "
"reassign the roles to each player and inform them of their roles "
"privately."
)
to_all_night = (
"Night has fallen, everyone close your eyes. Werewolves open your "
"eyes and choose a player to eliminate tonight."
)
to_wolves_discussion = (
"[WEREWOLVES ONLY] {}, you should discuss and "
"decide on a player to eliminate tonight. Current alive players "
"are {}. Remember to set `reach_agreement` to True if you reach an "
"agreement during the discussion."
)
to_wolves_vote = "[WEREWOLVES ONLY] Which player do you vote to kill?"
to_wolves_res = (
"[WEREWOLVES ONLY] The voting result is {}. So you have chosen to "
"eliminate {}."
)
to_all_witch_turn = (
"Witch's turn, witch open your eyes and decide your action tonight..."
)
    to_witch_resurrect = (
        "[WITCH ONLY] {witch_name}, you're the witch, and tonight {dead_name} "
        "is eliminated. You can resurrect him/her with your healing potion; "
        "note that you can use it only once in the whole game. Do you want to "
        "resurrect {dead_name}? Give me your reason and decision."
    )
to_witch_resurrect_no = (
"[WITCH ONLY] The witch has chosen not to resurrect the player."
)
to_witch_resurrect_yes = (
"[WITCH ONLY] The witch has chosen to resurrect the player."
)
to_witch_poison = (
"[WITCH ONLY] {witch_name}, as a witch, you have a one-time-use "
"poison potion, do you want to use it tonight? Give me your reason "
"and decision."
)
to_all_seer_turn = (
"Seer's turn, seer open your eyes and check one player's identity "
"tonight..."
)
to_seer = (
"[SEER ONLY] {}, as the seer you can check one player's identity "
"tonight. Who do you want to check? Give me your reason and decision."
)
to_seer_result = (
"[SEER ONLY] You've checked {agent_name}, and the result is: {role}."
)
    to_all_day = (
        "The day is coming, all players open your eyes. Last night, "
        "the following player(s) were eliminated: {}."
    )
    to_all_peace = (
        "The day is coming, all players open your eyes. Last night was "
        "peaceful; no player was eliminated."
    )
    to_all_discuss = (
        "Now the alive players are {names}. The game goes on; it's time to "
        "discuss and vote to eliminate one player. You each take turns to "
        "speak once, in the order of {names}.\n\n"
"IMPORTANT: Your response will be structured into two parts:\n"
"1. REASONING (Private): Your internal analysis - NOT visible to others\n"
"2. STATEMENT (Public): Your actual speech - VISIBLE to all players\n\n"
"Think carefully in your reasoning, then speak strategically in your statement."
)
to_all_vote = (
"Now the discussion is over. Everyone, please vote to eliminate one "
"player from the alive players: {}."
)
to_all_res = "The voting result is {}. So {} has been voted out."
    to_all_wolf_win = (
        "There are {n_alive} players alive, and {n_werewolves} of them are "
        "werewolves. The game is over and werewolves win🐺🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )
    to_all_village_win = (
        "All the werewolves have been eliminated. "
        "The game is over and villagers win🏘️🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )
to_all_continue = "The game goes on."
    to_all_reflect = (
        "The game is over. Now each player can reflect on their performance. "
        "Note that each player has only one chance to speak, and the "
        "reflection is visible only to themselves."
    )
class ChinesePrompts:
"""Chinese prompts used to guide the werewolf game."""
to_dead_player = (
"{}, 你已被淘汰。现在你可以向所有存活玩家发表最后的遗言。\n\n"
"重要提示:你的遗言将分为两个部分:\n"
"1. 推理(私密):你的策略思考 - 其他玩家看不到\n"
"2. 发言(公开):你的实际遗言 - 所有玩家都能看到\n\n"
"仔细思考在遗言中应该透露或隐藏什么信息。"
)
to_all_new_game = "新的一局游戏开始,参与玩家包括:{}。现在为每位玩家重新随机分配身份,并私下告知各自身份。"
to_all_night = "天黑了,请所有人闭眼。狼人请睁眼,选择今晚要淘汰的一名玩家..."
to_wolves_discussion = (
"[仅狼人可见] {}, 你们可以讨论并决定今晚要淘汰的玩家。当前存活玩家有:{}"
"如果达成一致,请将 `reach_agreement` 设为 True。"
)
to_wolves_vote = "[仅狼人可见] 你投票要杀死哪位玩家?"
to_wolves_res = "[仅狼人可见] 投票结果为 {},你们选择淘汰 {}"
to_all_witch_turn = "轮到女巫行动,女巫请睁眼并决定今晚的操作..."
to_witch_resurrect = (
"[仅女巫可见] {witch_name},你是女巫,今晚{dead_name}被淘汰。"
"你可以用解药救他/她,注意解药全局只能用一次。你要救{dead_name}吗?"
"请给出理由和决定。"
)
to_witch_resurrect_no = "[仅女巫可见] 女巫选择不救该玩家。"
to_witch_resurrect_yes = "[仅女巫可见] 女巫选择救活该玩家。"
to_witch_poison = "[仅女巫可见] {witch_name},你有一瓶一次性毒药,今晚要使用吗?请给出理由和决定。"
to_all_seer_turn = "轮到预言家行动,预言家请睁眼并查验一名玩家身份..."
to_seer = "[仅预言家可见] {}, 你是预言家,今晚可以查验一名玩家身份。你要查谁?请给出理由和决定。"
to_seer_result = "[仅预言家可见] 你查验了{agent_name},结果是:{role}"
to_all_day = "天亮了,请所有玩家睁眼。昨晚被淘汰的玩家有:{}"
to_all_peace = "天亮了,请所有玩家睁眼。昨晚平安夜,无人被淘汰。"
to_all_discuss = (
"现在存活玩家有:{names}。游戏继续,大家开始讨论并投票淘汰一名玩家。请按顺序({names})依次发言。\n\n"
"重要提示:你的回答将分为两个部分:\n"
"1. 推理(私密):你的内心分析 - 其他玩家看不到\n"
"2. 发言(公开):你的实际发言 - 所有玩家都能看到\n\n"
"在推理中仔细思考,然后在发言中策略性地表达。"
)
to_all_vote = "讨论结束。请大家从存活玩家中投票淘汰一人:{}"
to_all_res = "投票结果为 {}{} 被淘汰。"
to_all_wolf_win = (
"当前存活玩家共{n_alive}人,其中{n_werewolves}人为狼人。"
"游戏结束,狼人获胜🐺🎉!"
"本局所有玩家真实身份为:{true_roles}"
)
to_all_village_win = "所有狼人已被淘汰。游戏结束,村民获胜🏘️🎉!本局所有玩家真实身份为:{true_roles}"
to_all_continue = "游戏继续。"
to_all_reflect = "游戏结束。现在每位玩家可以对自己的表现进行反思。注意每位玩家只有一次发言机会,且反思内容仅自己可见。"

Binary file not shown (image added, 193 KiB).

Binary file not shown (image added, 538 KiB).
View File

@@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""The structured output models used in the werewolf game."""
from typing import Literal
from pydantic import BaseModel, Field
from agentscope.agent import AgentBase
class DiscussionModel(BaseModel):
"""The output format for discussion."""
reach_agreement: bool = Field(
description="Whether you have reached an agreement or not",
)
class PublicDiscussionModel(BaseModel):
"""The output format for public discussion with private reasoning.
This model separates private reasoning from public statements to prevent
accidental information leakage (e.g., werewolves revealing their identity).
"""
reasoning: str = Field(
description=(
"Your PRIVATE reasoning and analysis. This will NOT be shown to "
"other players. You can freely think about your strategy, analyze "
"other players' behaviors, and plan your next move here. "
"If you are a werewolf, you can think about how to hide your identity. "
"If you are a villager, you can analyze who might be the werewolf."
),
)
statement: str = Field(
description=(
"Your PUBLIC statement to all players. This WILL be visible to everyone. "
"Be careful not to reveal sensitive information (e.g., your true role if "
"you are a werewolf). "
"Your statement should be strategic and help your team win."
),
)
def get_vote_model(agents: list[AgentBase]) -> type[BaseModel]:
"""Get the vote model by player names."""
class VoteModel(BaseModel):
"""The vote output format."""
vote: Literal[tuple(_.name for _ in agents)] = Field( # type: ignore
description="The name of the player you want to vote for",
)
return VoteModel
class WitchResurrectModel(BaseModel):
"""The output format for witch resurrect action."""
resurrect: bool = Field(
description="Whether you want to resurrect the player",
)
def get_poison_model(agents: list[AgentBase]) -> type[BaseModel]:
"""Get the poison model by player names."""
class WitchPoisonModel(BaseModel):
"""The output format for witch poison action."""
poison: bool = Field(
description="Do you want to use the poison potion",
)
name: Literal[ # type: ignore
tuple(_.name for _ in agents)
] | None = Field(
description="The name of the player you want to poison, if you "
"don't want to poison anyone, just leave it empty",
default=None,
)
return WitchPoisonModel
def get_seer_model(agents: list[AgentBase]) -> type[BaseModel]:
"""Get the seer model by player names."""
class SeerModel(BaseModel):
"""The output format for seer action."""
name: Literal[tuple(_.name for _ in agents)] = Field( # type: ignore
description="The name of the player you want to check",
)
return SeerModel
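
Because the `Literal` choices are generated from the live player list, any vote for a name outside that list fails pydantic validation. A minimal sketch, where `SimpleNamespace` stands in for `AgentBase` (only the `.name` attribute is read) and the player names are made up:

```python
# Sketch: votes are validated against the current players.
from types import SimpleNamespace

players = [SimpleNamespace(name="Alice"), SimpleNamespace(name="Bob")]
VoteModel = get_vote_model(players)  # type: ignore[arg-type]

print(VoteModel(vote="Alice"))  # vote='Alice'
# VoteModel(vote="Carol")       # raises a pydantic ValidationError
```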

159
tuner/werewolves/utils.py Normal file
View File

@@ -0,0 +1,159 @@
# -*- coding: utf-8 -*-
"""Utility functions for the werewolf game - 7 Player Version."""
from collections import defaultdict
from typing import Any
import numpy as np
from agentscope.agent import AgentBase, ReActAgent
from agentscope.message import Msg
from prompt import EnglishPrompts as Prompts
MAX_GAME_ROUND = 30
MAX_DISCUSSION_ROUND = 3
def majority_vote(votes: list[str]) -> tuple[str, str]:
    """Return the winning vote together with a human-readable tally."""
    result = max(set(votes), key=votes.count)
    names, counts = np.unique(votes, return_counts=True)
    conditions = ", ".join(
        [f"{name}: {count}" for name, count in zip(names, counts)],
    )
    return result, conditions
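# For example, majority_vote(["Alice", "Bob", "Alice"]) returns
# ("Alice", "Alice: 2, Bob: 1"); np.unique reports names in sorted order.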
def names_to_str(agents: list[str] | list[ReActAgent]) -> str:
"""Return a string of agent names."""
if not agents:
return ""
if len(agents) == 1:
if isinstance(agents[0], ReActAgent):
return agents[0].name
return agents[0]
names = []
for agent in agents:
if isinstance(agent, ReActAgent):
names.append(agent.name)
else:
names.append(agent)
return ", ".join([*names[:-1], "and " + names[-1]])
class EchoAgent(AgentBase):
"""Echo agent that repeats the input message."""
def __init__(self) -> None:
super().__init__()
self.name = "Moderator"
async def reply(self, content: str) -> Msg:
"""Repeat the input content with its name and role."""
msg = Msg(
self.name,
content,
role="assistant",
)
await self.print(msg)
return msg
    async def handle_interrupt(
        self,
        *args: Any,
        **kwargs: Any,
    ) -> Msg:
        """Handle an interrupt by returning an empty moderator message."""
        # Return an empty message so the declared `Msg` return type holds.
        return Msg(self.name, "", role="assistant")

    async def observe(self, msg: Msg | list[Msg] | None) -> None:
        """No-op: the moderator does not need to track observed messages."""
class Players:
"""Maintain the players' status."""
def __init__(self) -> None:
"""Initialize the players."""
# The mapping from player name to role
self.name_to_role = {}
self.role_to_names = defaultdict(list)
self.name_to_agent = {}
self.werewolves = []
self.villagers = []
self.seer = []
self.witch = []
self.current_alive = []
self.all_players = []
def add_player(self, player: ReActAgent, role: str) -> None:
"""Add a player to the game.
Args:
player (`ReActAgent`):
The player to be added.
role (`str`):
The role of the player.
"""
self.name_to_role[player.name] = role
self.name_to_agent[player.name] = player
self.role_to_names[role].append(player.name)
self.all_players.append(player)
if role == "werewolf":
self.werewolves.append(player)
elif role == "villager":
self.villagers.append(player)
elif role == "seer":
self.seer.append(player)
elif role == "witch":
self.witch.append(player)
else:
raise ValueError(f"Unknown role: {role}")
self.current_alive.append(player)
    def update_players(self, dead_players: list[str]) -> None:
        """Update the current alive players.

        Args:
            dead_players (`list[str]`):
                The names of the dead players to be removed; the filters
                below match against each agent's `.name`.
        """
self.werewolves = [
_ for _ in self.werewolves if _.name not in dead_players
]
self.villagers = [
_ for _ in self.villagers if _.name not in dead_players
]
self.seer = [_ for _ in self.seer if _.name not in dead_players]
self.witch = [_ for _ in self.witch if _.name not in dead_players]
self.current_alive = [
_ for _ in self.current_alive if _.name not in dead_players
]
def print_roles(self) -> None:
"""Print the roles of all players."""
print("Roles:")
for name, role in self.name_to_role.items():
print(f" - {name}: {role}")
def check_winning(self) -> str | None:
"""Check if the game is over and return the winning message."""
# Prepare true roles string
true_roles = (
f'{names_to_str(self.role_to_names["werewolf"])} are werewolves, '
f'{names_to_str(self.role_to_names["villager"])} are villagers, '
f'{names_to_str(self.role_to_names["seer"])} is the seer, '
f'and {names_to_str(self.role_to_names["witch"])} is the witch.'
)
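        # Werewolves win once they make up at least half of the alive players.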
if len(self.werewolves) * 2 >= len(self.current_alive):
return Prompts.to_all_wolf_win.format(
n_alive=len(self.current_alive),
n_werewolves=len(self.werewolves),
true_roles=true_roles,
)
if self.current_alive and not self.werewolves:
return Prompts.to_all_village_win.format(
true_roles=true_roles,
)
return None
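
For intuition on the parity check above: in the 7-player setup with, say, 2 werewolves (the exact role split is configured elsewhere in this example), the wolves win as soon as the alive count drops to 4 with both wolves still alive, since 2 * 2 >= 4. A one-line sanity check:

```python
# Hypothetical counts: 2 wolves alive among 4 alive players in total.
n_werewolves, n_alive = 2, 4
assert n_werewolves * 2 >= n_alive  # parity reached, werewolves win
```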