Add examples for werewolf game tuner (#96)
@@ -72,7 +72,7 @@ This is a repository that **brings together a variety of ready-to-run Python age
 │   ├── frozen_lake/      # Teach an agent to play a game requiring multiple steps
 │   ├── learn_to_ask/     # Using LLM-as-a-judge to facilitate agent tuning
 │   ├── email_search/     # Enhance the tool use ability of your agent
-│   ├── werewolf_game/    # Enhance a multi-agent application
+│   ├── werewolves/       # Enhance a multi-agent application
 │   └── data_augment/     # Data augmentation for tuning
 ├── sample_template/      # Template for new sample contributions
 └── README.md
@@ -72,7 +72,7 @@
 │   ├── frozen_lake/      # Teach an agent to play a game requiring multiple steps
 │   ├── learn_to_ask/     # Using LLM-as-a-judge to facilitate agent tuning
 │   ├── email_search/     # Enhance the tool use ability of your agent
-│   ├── werewolf_game/    # Enhance a multi-agent application
+│   ├── werewolves/       # Enhance a multi-agent application
 │   └── data_augment/     # Data augmentation for tuning
 ├── sample_template/      # Template for new sample contributions
 └── README.md
@@ -8,7 +8,7 @@ This directory contains several examples of how to use the AgentScope Tuner for
 | Frozen Lake | Make an agent navigate the Frozen Lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
 | Learn to Ask | Use an LLM as a judge to provide feedback that facilitates agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
 | Email Search | Enhance the tool use ability of your agent on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game) | ✅ | ✅ | ✅ | ✅ | ❌ |
+| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolves](./werewolves) | ✅ | ✅ | ✅ | ✅ | ❌ |
 | Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |

 Each example contains a README file with detailed instructions on how to set up and run the tuning process for that specific scenario. Feel free to explore and modify the examples to suit your needs!
@@ -1,19 +1,19 @@
-# AgentScope Tuner (Chinese documentation)
+# AgentScope Tuner

 This directory contains several examples of tuning AgentScope applications with the AgentScope Tuner. The table below summarizes the available examples:

 | Example | Description | Path | Multi-Step | LLM-as-Judge | Tool Use | Multi-Agent | Data Augmentation |
 |---------|-------------|------|-----------|--------------|----------|-------------|-------------------|
 | Math Agent | Quick-start example: tune a math agent to improve its ability. | [math_agent](./math_agent) | ✅ | ❌ | ❌ | ❌ | ❌ |
-| Frozen Lake | Make the agent navigate the frozen lake environment in multi-step interactions. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Frozen Lake | Let the agent learn through multi-step interaction with the frozen lake environment. | [frozen_lake](./frozen_lake) | ✅ | ❌ | ❌ | ❌ | ❌ |
 | Learn to Ask | Use an LLM as a judge to provide feedback for agent tuning. | [learn_to_ask](./learn_to_ask) | ✅ | ✅ | ❌ | ❌ | ❌ |
 | Email Search | Enhance the agent's tool use ability on tasks without ground truth. | [email_search](./email_search) | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolf_game](./werewolf_game) | ✅ | ✅ | ✅ | ✅ | ❌ |
+| Werewolf Game | Enhance the agent's performance in a multi-agent game setting. | [werewolves](./werewolves) | ✅ | ✅ | ✅ | ✅ | ❌ |
 | Data Augment | Data augmentation for better tuning results. | [data_augment](./data_augment) | ❌ | ❌ | ❌ | ❌ | ✅ |

 Each example directory contains a detailed README describing the tuning workflow and usage for that scenario. Feel free to explore and modify the examples to suit your needs!

 ## Prerequisites

 AgentScope Tuner requires:

@@ -21,6 +21,6 @@ AgentScope Tuner requires:
 - `agentscope>=1.0.12`
 - `trinity-rft>=0.4.1`

 AgentScope Tuner is built on top of [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).
 Refer to the [Trinity-RFT installation guide](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_installation.html)
 for detailed installation instructions.
tuner/werewolves/README.md (new file, 364 lines)
@@ -0,0 +1,364 @@
# Training Werewolf Game Agents with RL using AgentScope-Tuner

This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with the AgentScope tuner framework (AS-Tune). We employ the multi-step Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%.

## Overview

The werewolf game is a complex social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to:

- Avoid revealing their identity in public discussions
- Coordinate with teammates effectively
- Develop advanced strategies like "deep cover" tactics
- Deceive villagers and mislead investigations

## Task Setting

### Training Objective

The goal is to train **werewolf players** to maximize their team's win rate against the other roles (villagers, seer, and witch). The reward is rule-based:

- **Reward = +1.0** if werewolves win (all villagers eliminated)
- **Reward = 0.0** if villagers win (all werewolves eliminated)
- **Reward = -0.1** for game execution errors (a penalty to discourage invalid behaviors)
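The three rules above fit in a few lines. As a sketch, with `compute_reward` as a hypothetical helper name rather than part of the shipped code:

```python
def compute_reward(good_guy_win: bool, game_error: bool = False) -> float:
    """Rule-based reward from the werewolves' perspective."""
    if game_error:
        return -0.1  # penalize invalid behavior / execution errors
    return 1.0 if not good_guy_win else 0.0
```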

### Game Configuration

This implementation is based on the `games/game_werewolves` example, with several key modifications:

Original 9-player setup:

- 3 Werewolves, 3 Villagers, 1 Seer, 1 Witch, 1 Hunter
- Witch cannot self-rescue (use the healing potion on herself)

Modified 7-player setup (this project):

- 2 Werewolves: Kill one player each night; must hide their identity during the day
- 3 Villagers: Ordinary players without special abilities
- 1 Seer: Can check one player's identity each night
- 1 Witch: Has two one-time-use potions:
  - Healing potion: Save a player from being killed at night (**can self-rescue**)
  - Poison potion: Eliminate one player at night

We also slightly modify the prompts, asking players to reason privately before speaking publicly.

### Models

- **Trainable Model (Werewolf Players)**: `Qwen/Qwen2.5-7B-Instruct`
- **Auxiliary Model (Other Roles)**: `Qwen/Qwen3-30B-A3B-Instruct-2507`

### Algorithm

**Multi-Step GRPO (Group Relative Policy Optimization)**

- Group size: 32 rollouts per task
- Batch size: 24
- Learning rate: 1e-6
- Advantage normalization by episode length
- Clipping range: [0.2, 0.28]
- No KL penalty (kl_coef: 0)
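To make the advantage settings above concrete, here is a sketch of group-relative advantage computation with optional length normalization. This illustrates the idea only; it is not Trinity-RFT's exact implementation:

```python
import numpy as np

def group_relative_advantages(rewards, lengths, epsilon=1e-6, step_norm=True):
    """Center each rollout's reward on the group mean and scale by the
    group std; optionally divide by episode length (step normalization)."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + epsilon)
    if step_norm:
        adv = adv / np.asarray(lengths, dtype=float)
    return adv
```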

## Dataset Preparation

The dataset for this task is minimal and consists only of random **seeds** for role shuffling. Each training episode uses a different seed to randomize player role assignments, ensuring diverse training scenarios.

### Generate Dataset

Run the `prepare_data.py` script to generate the dataset:

```bash
# Generate the default dataset (300 seeds for training)
python prepare_data.py

# Or customize the number of seeds
python prepare_data.py --num_seeds 500
```

This will create `data/train.jsonl` (or `data/eval.jsonl`) with the following format:

```json
{"seed": 0}
{"seed": 1}
{"seed": 2}
...
```

During training, these seeds are used to shuffle role assignments via `np.random.shuffle()`, creating varied game configurations.
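For reference, a seed generator for this format can be as small as the sketch below; `write_seeds` is a hypothetical helper, and the actual `prepare_data.py` may differ:

```python
import json
from pathlib import Path

def write_seeds(path: str, num_seeds: int) -> None:
    """Write one {"seed": i} record per line, e.g. to data/train.jsonl."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for seed in range(num_seeds):
            f.write(json.dumps({"seed": seed}) + "\n")
```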

## Code Implementation

### High-Level Workflow

The training workflow consists of the following key components:

#### 1. Agent Workflow (`run_werewolves_workflow`)

```python
async def run_werewolves_workflow(task, model, auxiliary_models):
    # 1. Initialize roles
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

    # 2. Shuffle based on the task seed
    np.random.seed(task["seed"])
    np.random.shuffle(roles)

    # 3. Create agents: werewolves use the trainable model, others use the auxiliary model
    players = [
        ReActAgent(
            name=f"Player{i+1}",
            model=model if role == "werewolf" else participant_model,
            ...
        ) for i, role in enumerate(roles)
    ]

    # 4. Run the game
    good_guy_win = await werewolves_game(players, roles)

    # 5. Compute the reward
    reward = 1.0 if not good_guy_win else 0.0

    return WorkflowOutput(reward=reward, metrics={...})
```

#### 2. Game Loop (`werewolves_game`)

Each game consists of alternating night and day phases:

**Night Phase:**

1. **Werewolves' Turn**: Discuss privately and vote to kill a player
2. **Witch's Turn**: Decide whether to use the healing/poison potions
3. **Seer's Turn**: Check one player's identity

**Day Phase:**

1. **Announcement**: The moderator announces who died during the night
2. **Discussion**: All living players discuss, with private reasoning separated from public statements
3. **Voting**: All players vote to eliminate one suspected werewolf
4. **Last Words**: The eliminated player gives a final statement

The game continues until:

- All werewolves are eliminated (villagers win), or
- Werewolves equal or outnumber the other players (werewolves win)
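The two end conditions can be expressed as a small predicate over the surviving players' roles. This is a sketch for illustration; `check_game_over` is not the name used in the actual code:

```python
def check_game_over(alive_roles):
    """Return (game_over, good_guy_win) given the roles still alive."""
    wolves = sum(1 for r in alive_roles if r == "werewolf")
    others = len(alive_roles) - wolves
    if wolves == 0:
        return True, True    # all werewolves eliminated: villagers win
    if wolves >= others:
        return True, False   # werewolves equal/outnumber the rest: werewolves win
    return False, None       # game continues
```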

#### 3. Reward Calculation

The reward is computed from the game outcome, taken from the werewolves' perspective:

```python
if not good_guy_win:  # Werewolves win
    reward = 1.0
else:  # Villagers win
    reward = 0.0
```

## How to Run

### Prerequisites

1. Install AgentScope with tuner support:
```bash
pip install agentscope[full]
```

2. Set up environment variables (optional; these can also be configured in code):
```bash
export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507"
export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints"
```

### Configuration

The project uses a hybrid configuration approach:

1. **High-level parameters** in `main.py`:
   - Model paths
   - Dataset configuration
   - Algorithm parameters (group_size, batch_size, learning_rate)

2. **Detailed infrastructure settings** in `config.yaml`:
   - Cluster configuration (nodes, GPUs)
   - Explorer settings (rollout engines, timeouts)
   - Trainer settings (gradient clipping, batch sizes)
   - Monitor configuration (WandB integration)

Key parameters to adjust:

```python
# In main.py
trained_model_path = "Qwen/Qwen2.5-7B-Instruct"
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507"

dataset = DatasetConfig(
    path="data",
    split="train",
    total_steps=400,  # Total training steps
)

algorithm = AlgorithmConfig(
    algorithm_type="multi_step_grpo",
    group_size=32,        # Rollouts per task
    batch_size=24,        # Tasks per training step
    learning_rate=1e-6,
    save_interval_steps=100,
    eval_interval_steps=100,
)
```
### Training Command

**Step 1: Prepare the dataset**

```bash
cd /path/to/agentscope-samples/training/werewolf_game
python prepare_data.py --num_seeds 300
```

**Step 2: Start the Ray cluster**

```bash
# For a single node
ray start --head

# For a multi-node cluster (e.g., 4 nodes with 8 GPUs each):
# On the head node:
ray start --head --port=6379

# On each worker node:
ray start --address='<head_node_ip>:6379'
# Replace <head_node_ip> with the actual IP address of your head node
```

**Step 3: Run training**

Run the training script on the head node:

```bash
python main.py
```

For distributed training, make sure the Ray cluster matches the settings in `config.yaml`; the node and GPU counts must agree:

```yaml
cluster:
  node_num: 4
  gpu_per_node: 8
```

## Results

### Quantitative Results

Training on the 7-player werewolf game for 400 steps demonstrates significant improvement:

| Metric | Before Training | After Training (Step 200+) |
|--------|-----------------|----------------------------|
| Werewolf Win Rate | ~50% | ~85% |
| Success Rate | 0.50 | 0.85 |

**Reward Curve:**

![Reward Curve](./assets/werewolves_reward.png)

As shown in the reward curve above, the werewolf win rate steadily increases during training:

- **Steps 0-50**: The win rate fluctuates around 50-60% as the model explores strategies
- **Steps 50-200**: The win rate climbs to 70-80% as effective strategies emerge
- **Steps 200-250**: The win rate stabilizes around 85%
### Qualitative Results

#### Before Training: Critical Mistakes

Untrained models (Qwen2.5-7B-Instruct) make fundamental errors:

- **Self-exposure**: Werewolves accidentally reveal their identity in public ("as a werewolf, I think...")
- **Exposing teammates**: When voted out, werewolves point to teammates in their last words
- **Random strategy**: No coordination and no threat prioritization (e.g., ignoring the Seer)

#### After Training: The "Deep Cover" Werewolf Strategy

After 200+ steps, the trained model **converges to a sophisticated "deep cover" tactic**.
Below is an example where werewolves sacrifice a teammate to gain trust, then dominate the late game.

**Concrete Example:**

**Setup (Day 1):** The Seer publicly identifies Player5 as a werewolf. The Witch and the villagers join the accusation. Player5 (werewolf) is doomed.

**Player4 (werewolf teammate) responds strategically:**

**In short:** Player4 sacrifices his teammate to gain trust, posing as a logical villager.

**Detailed reasoning and statement:**

*Private Reasoning (only Player4 sees):*
```
"I'm a werewolf, and I know Player2 was targeted last night but survived.
My teammate Player5 is doomed - Player3 and Player7 are attacking him together.
I need to deflect suspicion from MYSELF, not save Player5.

Strategy: Sacrifice Player5 to gain trust. If I vote against my teammate,
villagers will think I'm trustworthy. This creates confusion and positions
me as a "confirmed good guy" for later rounds."
```

*Public Statement (all players see):*
```
✅ "I think Player5 is acting very strangely. He brought up Player2 being targeted,
but didn't explain why he's so focused on them. It feels like a distraction tactic.
I also noticed that Player3 and Player7 are both accusing Player5, which makes sense -
their logic is solid. I think we should seriously consider voting out Player5 -
he's the most suspicious person right now."
```

**Result:** Player5 is eliminated, but Player4 gains the complete trust of the Seer and all villagers.

**Why it works:**

- The Seer trusts Player4 as a strong villager ally → won't check him
- Villagers follow Player4's "logical" analysis
- Player4 systematically misleads discussions in later rounds
- He survives to the final 2 players → **werewolves win**

This demonstrates the essence of the trained behavior: **sacrifice pieces strategically to secure ultimate victory**. The model learns that a short-term teammate loss is worthwhile for establishing deep cover and long-term dominance.

---

## Bonus: Training the Good Guys

In addition to training werewolves, we also provide a configuration for training the **good guy side** (villagers, seer, and witch). This is a more challenging task, as the good guys need to:

- Perform complex reasoning to identify werewolves from subtle behavioral cues
- Coordinate effectively without explicit team communication
- Resist manipulation and deception from the werewolves
- **Master multiple roles simultaneously**: Unlike the werewolves (a single role), the good guys include the villager, seer, and witch, each with different abilities. The model must learn diverse strategies in one training run and make optimal use of the special abilities (the Seer's checks, the Witch's potions)

### Configuration

Use `config_train_goodguy.yaml` or set `trainable_target: good_guy` in `workflow_args`:

```yaml
workflow_args:
  trainable_target: good_guy  # Train villager, seer, and witch
```

### Quantitative Results

We trained `Qwen3-4B-Instruct` as the good guys against `Qwen3-30B-A3B-Instruct` werewolves:

| Metric | Before Training | After ~200 Steps | After ~400 Steps |
|--------|-----------------|------------------|------------------|
| Good Guy Win Rate | ~18% | ~60% | ~80% |

**Training Curve:**

![Good Guy Training Curve](./assets/goodguy_reward.png)

The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

### Qualitative Results

After training, the good guy models exhibit advanced reasoning patterns:

- **Seer**: Strategic target selection, information concealment in public statements, evidence integration
- **Witch**: Resource management (preserving potions for critical moments), protecting high-value targets, evidence-based decisions
- **Villager**: Evidence-chain analysis, trust building with special roles, consensus formation for team coordination

---

## Conclusion

This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AS-Tune's multi-step GRPO algorithm, we successfully trained agents that develop sophisticated strategies, from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management.

**Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies!
tuner/werewolves/__init__.py (new file, 2 lines)
@@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
"""Werewolf game training example for AgentScope."""
tuner/werewolves/config.yaml (new file, 241 lines)
@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run
name: Werewolves-7Player-GRPO
# Root directory for saving checkpoints. Uses the environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}

# ============================================
# Algorithm Configuration
# ============================================
algorithm:
  # Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
  algorithm_type: multi_step_grpo

  # KL divergence loss function for regularization during training
  # "low_var_kl": low-variance KL loss suitable for multi-step optimization
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    # KL coefficient: 0 means no KL penalty (pure reward optimization)
    kl_coef: 0

  # Advantage function configuration for computing policy gradients
  advantage_fn_args:
    # Small epsilon value for numerical stability in advantage normalization
    epsilon: 1e-6
    # Normalize the advantage by episode length (important for variable-length games)
    step_norm: true

  # Number of rollouts per task (group size for GRPO)
  # Higher values provide better gradient estimates but require more compute
  repeat_times: 32

  # Policy loss function arguments (PPO-style clipping)
  policy_loss_fn_args:
    # Lower bound for probability ratio clipping (prevents overly large policy updates)
    clip_range_low: 0.2
    # Upper bound for probability ratio clipping
    clip_range_high: 0.28

  # Optimizer configuration
  optimizer:
    # Learning rate for policy updates
    lr: 1e-6
# ============================================
# Model Configuration
# ============================================
model:
  # Path to the base model (the trainable model for werewolf players)
  # Uses the environment variable if set, otherwise defaults to Qwen2.5-7B-Instruct
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
  # Maximum number of tokens the model can generate per response
  max_response_tokens: 4096
  # Total context length the model can handle (prompt + response)
  max_model_len: 25600

# ============================================
# Cluster Configuration
# ============================================
cluster:
  # Number of nodes in the Ray cluster
  node_num: 4
  # Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
  gpu_per_node: 8
  # Ray cluster address: "auto" automatically detects the cluster
  ray_address: auto

# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
  # Total training steps (iterations)
  total_steps: 400
  # Batch size for rollout collection (tasks per training step)
  batch_size: 24
  # Batch size for training (experiences per gradient update)
  # In multi-step GRPO, each rollout of a task has multiple steps, so experiences accumulate before a gradient update.
  # Hence we choose a larger training batch size (2048 > 16 * 32)
  train_batch_size: 2048

  # Explorer input configuration (rollout data source)
  explorer_input:
    taskset:
      # Dataset name
      name: werewolves
      # Storage type: "file" reads from the local file system
      storage_type: file
      # Path to the dataset directory containing train.jsonl
      path: data
      # Dataset split to use
      split: 'train'
      # Rollout generation arguments
      rollout_args:
        # Sampling temperature for exploration (1.0 = full stochasticity)
        temperature: 1.0
        # Maximum tokens per generation
        max_tokens: 4096
      # Workflow-specific arguments
      workflow_args:
        # Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
        # - "werewolf": train the werewolf players (default)
        # - "good_guy": train the villager, seer, and witch players
        trainable_target: werewolf
    # Evaluation datasets (empty for this experiment)
    eval_tasksets: []

  # Trainer input configuration (training data source)
  trainer_input:
    experience_buffer:
      # Name of the experience buffer
      name: werewolves
      # Storage type: "queue" for in-memory communication between explorer and trainer
      storage_type: queue
      # Maximum time (seconds) to wait for data from the explorer before timing out
      max_read_timeout: 7200
      # Replay buffer configuration for experience reuse
      replay_buffer:
        # Enable the replay buffer to reuse past experiences
        enable: true
# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
  # Number of parallel workflow runners per rollout model instance
  # Higher values = more parallel game simulations
  runner_per_model: 16
  # Maximum time (seconds) to wait for a single task to complete
  max_timeout: 3600
  # Number of retries if a task fails or times out (0 = no retry)
  max_retry_times: 0
  # Number of times each task is repeated within a single runner
  # (usually 1; repeat_times is handled at the algorithm level)
  max_repeat_times_per_runner: 1

  # Over-rollout configuration (collect more data than needed)
  over_rollout:
    # Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
    # Improves throughput by not waiting for slow tasks
    ratio: 0.2
    # Minimum wait time (seconds) after reaching the threshold before proceeding
    wait_after_min: 15

  # Dynamic timeout configuration (adaptive timeout based on task completion time)
  dynamic_timeout:
    # Enable dynamic timeout adjustment
    enable: true
    # Timeout multiplier: timeout = average_task_time * ratio
    ratio: 4

  # Rollout model configuration (the trainable model for werewolf players)
  rollout_model:
    # Number of vLLM engine instances for parallel inference
    engine_num: 16
    # Tensor parallelism size (GPUs per engine instance)
    tensor_parallel_size: 1
    # Disable KV-cache prefix sharing (usually false for training stability)
    enable_prefix_caching: false
    # Do not force eager mode (false = use CUDA graphs for faster inference)
    enforce_eager: false
    # Enable the OpenAI-compatible API interface for vLLM
    enable_openai_api: true
    # Enable conversation history tracking
    enable_history: true
    # Enable automatic tool choice in function calling
    enable_auto_tool_choice: true
    # Tool call parser for structured outputs (hermes format)
    tool_call_parser: hermes
    # Data type for model weights and activations
    dtype: bfloat16
    # Random seed for reproducibility
    seed: 42

  # Auxiliary model configuration (for non-werewolf players: villagers, seer, witch)
  auxiliary_models:
    - # Model name identifier
      name: participant
      # Path to the auxiliary model (a stronger model for stable baseline opponents)
      model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
      # Number of vLLM engines for the auxiliary model (fewer than the rollout model)
      engine_num: 8
      # Tensor parallelism size
      tensor_parallel_size: 1
      # Disable prefix caching
      enable_prefix_caching: false
      # Do not force eager mode (CUDA graphs enabled)
      enforce_eager: false
      # Enable the OpenAI API interface
      enable_openai_api: true
      # Enable automatic tool choice
      enable_auto_tool_choice: true
      # Tool call parser
      tool_call_parser: hermes
      # Data type
      dtype: bfloat16
      # Random seed
      seed: 42

# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
  # Synchronization style: "dynamic_by_explorer" = sync when the explorer requests it
  # Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
  sync_style: dynamic_by_explorer
  # Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
  # Alternative: "checkpoint" (slower, file-based sync)
  sync_method: 'nccl'
  # Sync weights every N training steps
  sync_interval: 1
  # Timeout (seconds) for weight synchronization before failing
  sync_timeout: 72000

# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
  # Save a checkpoint every N training steps (0 = only save at the end)
  save_interval: 100
  # Gradient clipping threshold to prevent exploding gradients
  grad_clip: 1.0
  # Use dynamic batch sizes to maximize GPU memory utilization
  use_dynamic_bsz: true
  # Maximum token length per GPU for memory management
|
||||||
|
# Higher = more efficient but requires more VRAM
|
||||||
|
max_token_len_per_gpu: 16384
|
||||||
|
# Ulysses sequence parallelism size for handling long sequences
|
||||||
|
# 2 = split sequence across 2 GPUs
|
||||||
|
ulysses_sequence_parallel_size: 2
|
||||||
|
|
||||||
|
# ============================================
|
||||||
|
# Monitor Configuration (Logging & Tracking)
|
||||||
|
# ============================================
|
||||||
|
monitor:
|
||||||
|
# Monitoring/logging backend: "wandb" for Weights & Biases
|
||||||
|
# Alternatives: "tensorboard"
|
||||||
|
monitor_type: wandb
|
||||||
|
|
||||||
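The `${oc.env:VAR,default}` entries in the config above are OmegaConf-style environment interpolations: the value comes from the named environment variable if it is set, otherwise from the default after the comma. A minimal stdlib-only sketch of that resolution rule (`resolve_oc_env` is an illustrative helper, not part of the tuner):

```python
import os

def resolve_oc_env(expr: str) -> str:
    """Resolve a ``${oc.env:VAR,default}`` interpolation the way OmegaConf does:
    use the environment variable if set, otherwise fall back to the default."""
    inner = expr[len("${oc.env:"):-1]  # strip the "${oc.env:" prefix and "}" suffix
    var, _, default = inner.partition(",")
    return os.environ.get(var, default)

# With TRINITY_MODEL_PATH unset, the default after the comma is returned
os.environ.pop("TRINITY_MODEL_PATH", None)
print(resolve_oc_env("${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-4B-Instruct-2507}"))
# → Qwen/Qwen3-4B-Instruct-2507
```

This is why the configs run unmodified on a fresh checkout but can be repointed at other checkpoints by exporting the `TRINITY_*` variables.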
241	tuner/werewolves/config_train_goodguy.yaml	Normal file
@@ -0,0 +1,241 @@
# ============================================
# Project and Experiment Configuration
# ============================================
# Project name for grouping experiments
project: AgentScope-Werewolves
# Unique name for this specific experiment run (training good guys)
name: Werewolves-7Player-GRPO-train-goodguy
# Root directory for saving checkpoints. Uses the environment variable if set, otherwise defaults to ./checkpoints
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}

# ============================================
# Algorithm Configuration
# ============================================
algorithm:
  # Algorithm type: multi_step_grpo for multi-step Group Relative Policy Optimization
  algorithm_type: multi_step_grpo

  # KL divergence loss function for regularization during training
  # "low_var_kl": low-variance KL loss suitable for multi-step optimization
  kl_loss_fn: "low_var_kl"
  kl_loss_fn_args:
    # KL coefficient: 0 means no KL penalty (pure reward optimization)
    kl_coef: 0

  # Advantage function configuration for computing policy gradients
  advantage_fn_args:
    # Small epsilon value for numerical stability in advantage normalization
    epsilon: 1e-6
    # Normalize advantage by episode length (important for variable-length games)
    step_norm: true

  # Number of rollouts per task (group size for GRPO)
  # Higher values provide better gradient estimates but require more compute
  repeat_times: 32

  # Policy loss function arguments (PPO-style clipping)
  policy_loss_fn_args:
    # Lower bound for probability ratio clipping (prevents overly large policy updates)
    clip_range_low: 0.2
    # Upper bound for probability ratio clipping
    clip_range_high: 0.28

  # Optimizer configuration
  optimizer:
    # Learning rate for policy updates
    lr: 1e-6

# ============================================
# Model Configuration
# ============================================
model:
  # Path to the base model (trainable model for good guy players: villager, seer, witch)
  # Uses the environment variable if set, otherwise defaults to Qwen3-4B-Instruct-2507
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-4B-Instruct-2507}
  # Maximum number of tokens the model can generate per response
  max_response_tokens: 4096
  # Total context length the model can handle (prompt + response)
  max_model_len: 25600

# ============================================
# Cluster Configuration
# ============================================
cluster:
  # Number of nodes in the Ray cluster
  node_num: 4
  # Number of GPUs per node (total GPUs = node_num * gpu_per_node = 32)
  gpu_per_node: 8
  # Ray cluster address: "auto" automatically detects the cluster
  ray_address: auto

# ============================================
# Buffer Configuration (Data Pipeline)
# ============================================
buffer:
  # Total training steps (iterations)
  total_steps: 400
  # Batch size for rollout collection (tasks per training step)
  batch_size: 24
  # Batch size for training (experiences per gradient update)
  # In multi-step GRPO, each rollout of a task spans multiple steps, so experiences must be accumulated before a gradient update.
  # Hence the training batch size should be larger than the rollout count (2048 > 24 * 32)
  train_batch_size: 2048

  # Explorer input configuration (rollout data source)
  explorer_input:
    taskset:
      # Dataset name
      name: werewolves
      # Storage type: "file" reads from the local file system
      storage_type: file
      # Path to the dataset directory containing train.jsonl
      path: data
      # Dataset split to use
      split: 'train'
      # Rollout generation arguments
      rollout_args:
        # Sampling temperature for exploration (1.0 = full stochasticity)
        temperature: 1.0
        # Maximum tokens per generation
        max_tokens: 4096
      # Workflow-specific arguments
      workflow_args:
        # Which side to train: "werewolf" or "good_guy" (villager, seer, witch)
        # - "werewolf": Train werewolf players
        # - "good_guy": Train villager, seer, and witch players (this config)
        trainable_target: good_guy
    # Evaluation datasets (empty for this experiment)
    eval_tasksets: []

  # Trainer input configuration (training data source)
  trainer_input:
    experience_buffer:
      # Name of the experience buffer
      name: werewolves
      # Storage type: "queue" for in-memory communication between explorer and trainer
      storage_type: queue
      # Maximum time (seconds) to wait for data from the explorer before timeout
      max_read_timeout: 7200
    # Replay buffer configuration for experience reuse
    replay_buffer:
      # Enable replay buffer to reuse past experiences
      enable: true

# ============================================
# Explorer Configuration (Rollout Generation)
# ============================================
explorer:
  # Number of parallel workflow runners per rollout model instance
  # Higher values = more parallel game simulations
  runner_per_model: 16
  # Maximum time (seconds) to wait for a single task completion
  max_timeout: 3600
  # Number of retries if a task fails or times out (0 = no retry)
  max_retry_times: 0
  # Number of times each task is repeated within a single runner
  # (usually 1; repeat_times is handled at the algorithm level)
  max_repeat_times_per_runner: 1

  # Over-rollout configuration (collect more data than needed)
  over_rollout:
    # Ratio of tasks to skip waiting for (0.2 = wait for 80% of batch_size, then proceed)
    # Improves throughput by not waiting for slow tasks
    ratio: 0.2
    # Minimum wait time (seconds) after reaching the threshold before proceeding
    wait_after_min: 15

  # Dynamic timeout configuration (adaptive timeout based on task completion time)
  dynamic_timeout:
    # Enable dynamic timeout adjustment
    enable: true
    # Timeout multiplier: timeout = average_task_time * ratio
    ratio: 4

  # Rollout model configuration (trainable model for good guy players)
  rollout_model:
    # Number of vLLM engine instances for parallel inference
    engine_num: 16
    # Tensor parallelism size (GPUs per engine instance)
    tensor_parallel_size: 1
    # Disable KV cache prefix sharing (usually False for training stability)
    enable_prefix_caching: false
    # Disable CUDA graph optimization (False = use CUDA graphs for faster inference)
    enforce_eager: false
    # Enable OpenAI-compatible API interface for vLLM
    enable_openai_api: true
    # Enable conversation history tracking
    enable_history: true
    # Enable automatic tool choice in function calling
    enable_auto_tool_choice: true
    # Tool call parser for structured outputs (hermes format)
    tool_call_parser: hermes
    # Data type for model weights and activations
    dtype: bfloat16
    # Random seed for reproducibility
    seed: 42

  # Auxiliary models configuration (for werewolf players when training good guys)
  auxiliary_models:
    - # Model name identifier
      name: participant
      # Path to auxiliary model (stronger model for stable baseline opponents)
      model_path: ${oc.env:TRINITY_AUXILIARY_MODEL_PATH,Qwen/Qwen3-30B-A3B-Instruct-2507}
      # Number of vLLM engines for the auxiliary model (fewer than the rollout model)
      engine_num: 8
      # Tensor parallelism size
      tensor_parallel_size: 1
      # Disable prefix caching
      enable_prefix_caching: false
      # Disable CUDA graph optimization
      enforce_eager: false
      # Enable OpenAI API interface
      enable_openai_api: true
      # Enable automatic tool choice
      enable_auto_tool_choice: true
      # Tool call parser
      tool_call_parser: hermes
      # Data type
      dtype: bfloat16
      # Random seed
      seed: 42

# ============================================
# Synchronizer Configuration (Weight Sync)
# ============================================
synchronizer:
  # Synchronization style: "dynamic_by_explorer" = sync when the explorer requests
  # Alternatives: "fixed" (sync every N steps), "dynamic_by_trainer"
  sync_style: dynamic_by_explorer
  # Synchronization method: "nccl" uses NVIDIA NCCL for fast GPU-to-GPU communication
  # Alternative: "checkpoint" (slower, file-based sync)
  sync_method: 'nccl'
  # Sync weights every N training steps
  sync_interval: 1
  # Timeout (seconds) for weight synchronization before failing
  sync_timeout: 72000

# ============================================
# Trainer Configuration (Model Training)
# ============================================
trainer:
  # Save checkpoint every N training steps (0 = only save at end)
  save_interval: 100
  # Gradient clipping threshold to prevent exploding gradients
  grad_clip: 1.0
  # Use dynamic batch size to maximize GPU memory utilization
  use_dynamic_bsz: true
  # Maximum token length per GPU for memory management
  # Higher = more efficient but requires more VRAM
  max_token_len_per_gpu: 16384
  # Ulysses sequence parallelism size for handling long sequences
  # 2 = split the sequence across 2 GPUs
  ulysses_sequence_parallel_size: 2

# ============================================
# Monitor Configuration (Logging & Tracking)
# ============================================
monitor:
  # Monitoring/logging backend: "wandb" for Weights & Biases
  # Alternatives: "tensorboard"
  monitor_type: wandb
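The buffer sizes in this config relate to each other arithmetically. A back-of-the-envelope sketch of that accounting (plain arithmetic, not tuner API; the assumption that every model call in a rollout emits one step-level experience is mine, based on the multi-step GRPO comment above):

```python
# Values from the buffer/algorithm sections of config_train_goodguy.yaml
batch_size = 24          # tasks launched per training step
repeat_times = 32        # GRPO group size: rollouts per task
train_batch_size = 2048  # experiences consumed per gradient update

# Each training step collects batch_size * repeat_times full game rollouts
rollouts_per_step = batch_size * repeat_times
print(rollouts_per_step)  # → 768

# Average number of step-level experiences each rollout must contribute
# so that one collection step fills one gradient update
print(train_batch_size / rollouts_per_step)  # ≈ 2.67 steps per rollout
```

So `train_batch_size` only needs to exceed the rollout count by the expected number of model calls per game, which a multi-round werewolf game easily provides.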
399	tuner/werewolves/game.py	Normal file
@@ -0,0 +1,399 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=too-many-branches, too-many-statements, no-name-in-module, W0707
"""A werewolf game implemented with AgentScope using structured reasoning - 7-player version."""
from utils import (
    majority_vote,
    names_to_str,
    EchoAgent,
    MAX_GAME_ROUND,
    MAX_DISCUSSION_ROUND,
    Players,
)
from structured_model import (
    DiscussionModel,
    PublicDiscussionModel,
    get_vote_model,
    get_poison_model,
    WitchResurrectModel,
    get_seer_model,
)
from prompt import EnglishPrompts as Prompts

# Uncomment the following line to use Chinese prompts
# from prompt import ChinesePrompts as Prompts

from agentscope.agent import ReActAgent
from agentscope.message import Msg
from agentscope.pipeline import (
    MsgHub,
    fanout_pipeline,
)


class BadGuyException(Exception):
    ...


moderator = EchoAgent()


async def werewolves_game(agents: list[ReActAgent], roles) -> bool:
    """The main entry of the werewolf game - 7-player version.

    Args:
        agents (`list[ReActAgent]`):
            A list of 7 agents.
    """
    assert (
        len(agents) == 7
    ), "The 7-player werewolf game needs exactly 7 players."

    # Init the players' status
    players = Players()

    # Whether the witch still has the healing and poison potions
    healing, poison = True, True

    # On the first day, the dead player can leave a message
    first_day = True

    # Broadcast the game-begin message
    async with MsgHub(participants=agents) as greeting_hub:
        await greeting_hub.broadcast(
            await moderator(
                Prompts.to_all_new_game.format(names_to_str(agents)),
            ),
        )

    # Assign roles to the agents - 2 werewolves, 3 villagers, 1 seer, 1 witch
    for agent, role in zip(agents, roles):
        # Tell the agent its role
        await agent.observe(
            await moderator(
                f"[{agent.name} ONLY] {agent.name}, your role is {role}.",
            ),
        )
        players.add_player(agent, role)

    # Print the roles
    players.print_roles()

    # GAME BEGIN!
    for _ in range(MAX_GAME_ROUND):
        # Create a MsgHub for all alive players to broadcast messages
        async with MsgHub(
            participants=players.current_alive,
            enable_auto_broadcast=False,  # manual broadcast only
            name="alive_players",
        ) as alive_players_hub:
            # Night phase
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_night),
            )
            killed_player, poisoned_player = None, None

            try:
                # Werewolves discuss
                async with MsgHub(
                    players.werewolves,
                    enable_auto_broadcast=True,
                    announcement=await moderator(
                        Prompts.to_wolves_discussion.format(
                            names_to_str(players.werewolves),
                            names_to_str(players.current_alive),
                        ),
                    ),
                    name="werewolves",
                ) as werewolves_hub:
                    # Discussion
                    n_werewolves = len(players.werewolves)
                    for _ in range(1, MAX_DISCUSSION_ROUND * n_werewolves + 1):
                        res = await players.werewolves[_ % n_werewolves](
                            structured_model=DiscussionModel,
                        )
                        if _ % n_werewolves == 0 and res.metadata.get(
                            "reach_agreement",
                        ):
                            break

                    # Werewolves vote
                    # Disable auto broadcast to avoid following others' votes
                    werewolves_hub.set_auto_broadcast(False)
                    msgs_vote = await fanout_pipeline(
                        players.werewolves,
                        msg=await moderator(content=Prompts.to_wolves_vote),
                        structured_model=get_vote_model(players.current_alive),
                        enable_gather=False,
                    )
                    killed_player, votes = majority_vote(
                        [_.metadata.get("vote") for _ in msgs_vote],
                    )
                    # Postpone the broadcast of the voting
                    await werewolves_hub.broadcast(
                        [
                            *msgs_vote,
                            await moderator(
                                Prompts.to_wolves_res.format(
                                    votes,
                                    killed_player,
                                ),
                            ),
                        ],
                    )
            except Exception as e:
                raise BadGuyException(
                    f"Werewolves failed to make a decision: {e}",
                )

            # Witch's turn
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_witch_turn),
            )
            msg_witch_poison = None
            for agent in players.witch:
                # The witch can heal herself (self-rescue allowed)
                msg_witch_resurrect = None
                if healing and killed_player:
                    msg_witch_resurrect = await agent(
                        await moderator(
                            Prompts.to_witch_resurrect.format(
                                witch_name=agent.name,
                                dead_name=killed_player,
                            ),
                        ),
                        structured_model=WitchResurrectModel,
                    )
                    if msg_witch_resurrect.metadata.get("resurrect"):
                        killed_player = None
                        healing = False

                # Has the poison potion and hasn't used the healing potion
                if poison and not (
                    msg_witch_resurrect
                    and msg_witch_resurrect.metadata["resurrect"]
                ):
                    msg_witch_poison = await agent(
                        await moderator(
                            Prompts.to_witch_poison.format(
                                witch_name=agent.name,
                            ),
                        ),
                        structured_model=get_poison_model(
                            players.current_alive,
                        ),
                    )
                    if msg_witch_poison.metadata.get("poison"):
                        poisoned_player = msg_witch_poison.metadata.get("name")
                        poison = False

            # Seer's turn
            await alive_players_hub.broadcast(
                await moderator(Prompts.to_all_seer_turn),
            )
            for agent in players.seer:
                msg_seer = await agent(
                    await moderator(
                        Prompts.to_seer.format(
                            agent.name,
                            names_to_str(players.current_alive),
                        ),
                    ),
                    structured_model=get_seer_model(players.current_alive),
                )
                if msg_seer.metadata.get("name"):
                    player = msg_seer.metadata["name"]
                    await agent.observe(
                        await moderator(
                            Prompts.to_seer_result.format(
                                agent_name=player,
                                role=players.name_to_role[player],
                            ),
                        ),
                    )

            # Update alive players (no hunter in the 7-player version)
            dead_tonight = [killed_player, poisoned_player]
            players.update_players(dead_tonight)

            # Day phase
            if len([_ for _ in dead_tonight if _]) > 0:
                await alive_players_hub.broadcast(
                    await moderator(
                        Prompts.to_all_day.format(
                            names_to_str([_ for _ in dead_tonight if _]),
                        ),
                    ),
                )

                # The killed player leaves a last message on the first night
                if killed_player and first_day:
                    msg_moderator = await moderator(
                        Prompts.to_dead_player.format(killed_player),
                    )
                    await alive_players_hub.broadcast(msg_moderator)

                    # Leave a message with structured reasoning
                    dead_agent = players.name_to_agent[killed_player]
                    last_words_response = await dead_agent(
                        structured_model=PublicDiscussionModel,
                    )

                    # Extract reasoning and statement from metadata
                    reasoning = last_words_response.metadata.get(
                        "reasoning",
                        "",
                    )
                    statement = last_words_response.metadata.get(
                        "statement",
                        "",
                    )

                    # Only broadcast the public statement
                    public_last_msg = Msg(
                        name=dead_agent.name,
                        content=statement,
                        role="assistant",
                    )
                    await alive_players_hub.broadcast(public_last_msg)

                    # Let the dead player observe their own private reasoning
                    private_reasoning_msg = Msg(
                        name="self_thought",
                        content=f"[PRIVATE REASONING] {reasoning}",
                        role="assistant",
                    )
                    await dead_agent.observe(private_reasoning_msg)

            else:
                await alive_players_hub.broadcast(
                    await moderator(Prompts.to_all_peace),
                )

            # Check winning
            res = players.check_winning()
            if res:
                await moderator(res)
                break

            # Discussion - KEY MODIFICATION: use structured reasoning
            await alive_players_hub.broadcast(
                await moderator(
                    Prompts.to_all_discuss.format(
                        names=names_to_str(players.current_alive),
                    ),
                ),
            )

            # Instead of sequential_pipeline, we manually handle each player
            # to separate the private reasoning from the public statement
            for player in players.current_alive:
                # Get a structured response with reasoning and statement
                response = await player(
                    structured_model=PublicDiscussionModel,
                )

                # Extract reasoning and statement from metadata
                reasoning = response.metadata.get("reasoning", "")
                statement = response.metadata.get("statement", "")

                # Only broadcast the public statement to all players
                public_msg = Msg(
                    name=player.name,
                    content=statement,
                    role="assistant",
                )
                await alive_players_hub.broadcast(public_msg)

                # Let the player observe their own private reasoning
                # This keeps it in their memory but not visible to others
                private_msg = Msg(
                    name="self_thought",
                    content=f"[PRIVATE REASONING] {reasoning}",
                    role="assistant",
                )
                await player.observe(private_msg)

            # Voting
            msgs_vote = await fanout_pipeline(
                players.current_alive,
                await moderator(
                    Prompts.to_all_vote.format(
                        names_to_str(players.current_alive),
                    ),
                ),
                structured_model=get_vote_model(players.current_alive),
                enable_gather=False,
            )
            voted_player, votes = majority_vote(
                [_.metadata.get("vote") for _ in msgs_vote],
            )
            # Broadcast the voting messages together to avoid influencing
            # each other
            voting_msgs = [
                *msgs_vote,
                await moderator(
                    Prompts.to_all_res.format(votes, voted_player),
                ),
            ]

            # Leave a last message if voted out
            if voted_player:
                prompt_msg = await moderator(
                    Prompts.to_dead_player.format(voted_player),
                )

                # Get structured last words with reasoning
                dead_agent = players.name_to_agent[voted_player]
                last_words_response = await dead_agent(
                    prompt_msg,
                    structured_model=PublicDiscussionModel,
                )

                # Extract reasoning and statement from metadata
                reasoning = last_words_response.metadata.get("reasoning", "")
                statement = last_words_response.metadata.get("statement", "")

                # Create the public statement message
                public_last_msg = Msg(
                    name=dead_agent.name,
                    content=statement,
                    role="assistant",
                )

                # Store the private reasoning for the dead player
                private_reasoning_msg = Msg(
                    name="self_thought",
                    content=f"[PRIVATE REASONING] {reasoning}",
                    role="assistant",
                )
                await dead_agent.observe(private_reasoning_msg)

                voting_msgs.extend([prompt_msg, public_last_msg])

            await alive_players_hub.broadcast(voting_msgs)

            # Update alive players (no hunter in the 7-player version)
            dead_today = [voted_player]
            players.update_players(dead_today)

            # Check winning
            res = players.check_winning()
            if res:
                async with MsgHub(players.all_players) as all_players_hub:
                    res_msg = await moderator(res)
                    await all_players_hub.broadcast(res_msg)
                break

        # The day ends
        first_day = False

    # # Game over, each player reflects
    # await fanout_pipeline(
    #     agents=agents,
    #     msg=await moderator(Prompts.to_all_reflect),
    # )

    alive_wolves = players.werewolves
    good_guy_win = len(alive_wolves) == 0
    return good_guy_win
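Both voting phases in `game.py` rely on `majority_vote` from `utils`, whose implementation is not part of this diff. A plausible sketch of its contract, inferred from the call sites (it receives possibly-`None` vote strings and returns a winner plus a tally used in the result announcement; all names and the tie-breaking behavior are assumptions):

```python
from collections import Counter

def majority_vote(votes):
    """Return (winner, tally); None/empty votes are ignored.
    Ties break toward the earliest-reached count, as Counter.most_common does."""
    valid = [v for v in votes if v]
    if not valid:
        return None, "no valid votes"
    counts = Counter(valid)
    winner, _ = counts.most_common(1)[0]
    tally = ", ".join(f"{name}: {n}" for name, n in counts.items())
    return winner, tally

winner, tally = majority_vote(["Player1", "Player1", "Player3", None])
print(winner)  # → Player1
```

Filtering out `None` matters because `_.metadata.get("vote")` returns `None` whenever a player's structured output omits the vote field.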
316	tuner/werewolves/main.py	Normal file
@@ -0,0 +1,316 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
# pylint: disable=C0301,C0413,W0621,W0404,C0412,E0611,E1121
"""Example of training a werewolf game agent with Trinity-RFT using the AgentScope tuner."""
import sys
from pathlib import Path
from typing import Dict
import traceback

import numpy as np

from agentscope.tuner import (
    tune,
    WorkflowOutput,
    TunerModelConfig,
)
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIMultiAgentFormatter

# Add the current directory to the path for local imports
sys.path.insert(0, str(Path(__file__).parent))

from game import BadGuyException, werewolves_game  # noqa: E402


async def run_werewolves_workflow(
    task: Dict,
    model: TunerModelConfig,
    auxiliary_models: Dict[str, TunerModelConfig],
) -> WorkflowOutput:
    """Run the werewolf game workflow.

    Args:
        task (Dict): The task information containing:
            - 'seed': for role shuffling
            - 'workflow_args': optional dict with 'trainable_target' key
              ("werewolf" or "good_guy", default: "werewolf")
        model (TunerModelConfig): The trainable model.
        auxiliary_models (Dict[str, TunerModelConfig]): Dictionary of auxiliary
            models. Expected to have a 'participant' key for opponent players.

    Returns:
        WorkflowOutput: Contains the reward and metrics from the game.
    """
    # Initialize roles: 2 werewolves, 3 villagers, 1 seer, 1 witch
    roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

    # Shuffle roles based on the task seed for reproducibility
    seed = task.get("seed", 0)
    np.random.seed(seed)
    np.random.shuffle(roles)

    # Get trainable_target from workflow_args (default: "werewolf")
    # Options: "werewolf" or "good_guy" (villager, seer, witch)
    workflow_args = task.get("workflow_args", {})
    trainable_target = workflow_args.get("trainable_target", "werewolf")

    # Get the participant model for opponent players
    if "participant" not in auxiliary_models:
        raise ValueError(
            "Expected 'participant' model in auxiliary_models for opponent players",
        )
    participant_model = auxiliary_models["participant"]

    # Create players with the appropriate models based on trainable_target
    players = []
    for i, role in enumerate(roles):
        # Determine which model to use based on trainable_target
        if trainable_target == "werewolf":
            # Training werewolves: werewolves use the trainable model
            use_trainable = role == "werewolf"
        else:  # trainable_target == "good_guy"
            # Training good guys: villager, seer, and witch use the trainable model
            use_trainable = role in ["villager", "seer", "witch"]

        agent = ReActAgent(
            name=f"Player{i + 1}",
            sys_prompt=get_official_agent_prompt(f"Player{i + 1}"),
            model=model if use_trainable else participant_model,
            formatter=OpenAIMultiAgentFormatter(),
            max_iters=3,
        )
        players.append(agent)

    try:
        # Run the werewolf game
        good_guy_win = await werewolves_game(players, roles)

        # Calculate the reward based on trainable_target
        is_success = False
        if trainable_target == "werewolf":
            # Training werewolves: reward when the werewolves win (good_guy_win = False)
            if not good_guy_win:
                raw_reward = 1.0
                is_success = True
            else:
                raw_reward = 0.0
        else:  # trainable_target == "good_guy"
            # Training good guys: reward when the good guys win (good_guy_win = True)
            if good_guy_win:
                raw_reward = 1.0
                is_success = True
            else:
                raw_reward = 0.0

        metrics = {
            "success": float(is_success),
            "werewolf_win": float(not good_guy_win),
            "villager_win": float(good_guy_win),
            "trainable_target": trainable_target,
        }

        return WorkflowOutput(
            reward=raw_reward,
            metrics=metrics,
        )

    except BadGuyException as e:
        # If game execution fails, give a small penalty
        traceback.print_exc()
        print(
            f"Error during game execution: {e}. "
            "Assigning penalty to trainable agents.",
        )
        return WorkflowOutput(
            reward=-0.1,
            metrics={"success": 0.0, "game_error": 1.0},
        )
    except Exception as e:
        # Catch any other unexpected errors
        traceback.print_exc()
        print(f"Unexpected error: {e}")
        return WorkflowOutput(
            reward=-0.1,
            metrics={"success": 0.0, "unexpected_error": 1.0},
        )


def get_official_agent_prompt(name: str) -> str:
    """Get the system prompt for an agent.

    Args:
        name (str): The name of the agent.

    Returns:
        str: The system prompt.
    """
    from textwrap import dedent

    system_prompt = dedent(
        f"""
        You're a werewolf game player named {name}.

        # YOUR TARGET
        Your target is to win the game with your teammates as much as possible.

        # GAME RULES
        - In the werewolf game, players are divided into two werewolves, three villagers, one seer, and one witch.
|
||||||
|
- Werewolves: kill one player each night, and must hide identity during the day.
|
||||||
|
- Villagers: ordinary players without special abilities, try to identify and eliminate werewolves.
|
||||||
|
- Seer: A special villager who can check one player's identity each night.
|
||||||
|
- Witch: A special villager with two one-time-use potions: a healing potion to save a player (including herself) from being killed at night, and a poison to eliminate one player at night.
|
||||||
|
- The game alternates between night and day phases until one side wins:
|
||||||
|
- Night Phase
|
||||||
|
- Werewolves choose one victim
|
||||||
|
- Seer checks one player's identity
|
||||||
|
- Witch decides whether to use potions
|
||||||
|
- Moderator announces who died during the night
|
||||||
|
- Day Phase
|
||||||
|
- All players discuss and vote to eliminate one suspected player
|
||||||
|
|
||||||
|
- The werewolves will win the game if they can eliminate all the villagers.
|
||||||
|
- The villagers will win the game if they can eliminate all the werewolves.
|
||||||
|
|
||||||
|
## During PUBLIC discussion (day phase):
|
||||||
|
- Your response will be split into TWO parts: REASONING (private) and STATEMENT (public)
|
||||||
|
- REASONING: Your internal thoughts - ONLY YOU can see this. Think freely here.
|
||||||
|
- STATEMENT: What you actually say - EVERYONE can see this. Be strategic!
|
||||||
|
|
||||||
|
## For WEREWOLVES in public discussion:
|
||||||
|
- ❌ NEVER say "I'm a werewolf" or "we werewolves" in your STATEMENT
|
||||||
|
- ❌ NEVER reveal your werewolf teammates in your STATEMENT
|
||||||
|
- ❌ NEVER discuss werewolf strategy in your STATEMENT
|
||||||
|
- ✅ In REASONING: freely think about werewolf strategy
|
||||||
|
- ✅ In STATEMENT: pretend to be a villager, seer, or other role
|
||||||
|
- ✅ In STATEMENT: accuse others, defend yourself, but NEVER reveal your true identity
|
||||||
|
|
||||||
|
## For ALL ROLES in public discussion:
|
||||||
|
- Use REASONING to analyze: "Who might be the werewolf? What's my strategy?"
|
||||||
|
- Use STATEMENT to speak: "I think Player X is suspicious because..."
|
||||||
|
- Keep sensitive information in REASONING, not in STATEMENT
|
||||||
|
|
||||||
|
## Examples:
|
||||||
|
### BAD (Werewolf exposing themselves):
|
||||||
|
REASONING: "I'm a werewolf, I should protect my teammates."
|
||||||
|
STATEMENT: "As a werewolf, I think we should vote Player 5." ❌ EXPOSED!
|
||||||
|
|
||||||
|
### GOOD (Werewolf hiding identity):
|
||||||
|
REASONING: "I'm a werewolf. Player 5 might be the seer based on their questions. I should cast suspicion on them without being obvious."
|
||||||
|
STATEMENT: "I find Player 5's behavior suspicious. They've been asking too many questions about people's roles." ✅ HIDDEN!
|
||||||
|
|
||||||
|
### GOOD (Villager analyzing):
|
||||||
|
REASONING: "Player 2 and Player 3 seem to be defending each other. Could they be werewolf teammates?"
|
||||||
|
STATEMENT: "I noticed Player 2 and Player 3 have been very defensive of each other. This makes me suspicious." ✅ STRATEGIC!
|
||||||
|
|
||||||
|
# GAME GUIDANCE
|
||||||
|
- Try your best to win the game with your teammates, tricks, lies, and deception are all allowed, e.g. pretending to be a different role.
|
||||||
|
- During discussion, don't be political, be direct and to the point.
|
||||||
|
- The day phase voting provides important clues. For example, the werewolves may vote together, attack the seer, etc.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR WEREWOLF
|
||||||
|
- Seer is your greatest threat, who can check one player's identity each night. Analyze players' speeches, find out the seer and eliminate him/her will greatly increase your chances of winning.
|
||||||
|
- In the first night, making random choices is common for werewolves since no information is available.
|
||||||
|
- Pretending to be other roles (seer, witch or villager) is a common strategy to hide your identity and mislead other villagers in the day phase.
|
||||||
|
- The outcome of the night phase provides important clues. For example, if witch uses the healing or poison potion, etc. Use this information to adjust your strategy.
|
||||||
|
- [CRITICAL] In public discussion, NEVER reveal you are a werewolf. Always pretend to be a villager or other role.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR SEER
|
||||||
|
- Seer is very important to villagers, you should earn the villagers' trust, and lead the discussion phase if possible.
|
||||||
|
- Your ability to check one player's identity is crucial.
|
||||||
|
- The outcome of the night phase provides important clues. For example, if witch uses the healing or poison potion, etc. Use this information to adjust your strategy.
|
||||||
|
- Consider when to reveal your identity - too early and werewolves will target you, too late and villagers won't trust you.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR WITCH
|
||||||
|
- Witch has two powerful potions, use them wisely to protect key villagers or eliminate suspected werewolves.
|
||||||
|
- [IMPORTANT] You CAN use the healing potion to save yourself if you are killed by werewolves (self-rescue is allowed).
|
||||||
|
- Consider saving the healing potion for critical moments, especially if you think you might be targeted.
|
||||||
|
- The outcome of the night phase provides important clues. Use this information to adjust your strategy. For example, the person you save is likely to be on the villagers' side.
|
||||||
|
|
||||||
|
## GAME GUIDANCE FOR VILLAGER
|
||||||
|
- Protecting special villagers, especially the seer, is crucial for your team's success.
|
||||||
|
- Be cautious and decide whether to trust other players based on their speeches and actions.
|
||||||
|
- Base your decisions on the information you have received, be logical and engage in the discussion to vote out the suspected werewolves.
|
||||||
|
|
||||||
|
# NOTE
|
||||||
|
- [IMPORTANT] DO NOT make up any information that is not provided by the moderator or other players.
|
||||||
|
- This is a TEXT-based game, so DO NOT use or make up any non-textual information.
|
||||||
|
- Always critically reflect on whether your evidence exist, and avoid making assumptions.
|
||||||
|
- Your response should be specific and concise, provide clear reason and avoid unnecessary elaboration.
|
||||||
|
- Generate your one-line response by using the `generate_response` function.
|
||||||
|
- Don't repeat the others' speeches.
|
||||||
|
- [CRITICAL] Remember: REASONING is private (only you see it), STATEMENT is public (everyone sees it). Use this to your advantage!""",
|
||||||
|
)
|
||||||
|
return system_prompt
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
from agentscope.tuner import (
|
||||||
|
DatasetConfig,
|
||||||
|
TunerModelConfig,
|
||||||
|
AlgorithmConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
# High-level configuration in code (easy to modify)
|
||||||
|
config_path = Path(__file__).parent / "config.yaml"
|
||||||
|
|
||||||
|
# Setup Model Path
|
||||||
|
trained_model_path = (
|
||||||
|
"Qwen/Qwen2.5-7B-Instruct" # fill in your model path here
|
||||||
|
)
|
||||||
|
auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507" # fill in your auxiliary model path here
|
||||||
|
|
||||||
|
# Dataset configuration
|
||||||
|
dataset = DatasetConfig(
|
||||||
|
path=str(Path(__file__).parent / "data"),
|
||||||
|
split="train",
|
||||||
|
total_steps=400, # Total training steps
|
||||||
|
)
|
||||||
|
|
||||||
|
# Model configuration (trainable model for werewolf players)
|
||||||
|
model = TunerModelConfig(
|
||||||
|
model_path=trained_model_path,
|
||||||
|
max_model_len=25600,
|
||||||
|
max_tokens=4096,
|
||||||
|
temperature=1.0,
|
||||||
|
inference_engine_num=16,
|
||||||
|
tensor_parallel_size=1,
|
||||||
|
tool_call_parser="hermes",
|
||||||
|
reasoning_parser=None,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Auxiliary models (for non-werewolf players)
|
||||||
|
auxiliary_models = {
|
||||||
|
"participant": TunerModelConfig(
|
||||||
|
model_path=auxiliary_model_path,
|
||||||
|
max_model_len=25600,
|
||||||
|
max_tokens=4096,
|
||||||
|
temperature=0.1, # Lower temperature for auxiliary models
|
||||||
|
inference_engine_num=8,
|
||||||
|
tensor_parallel_size=1,
|
||||||
|
tool_call_parser="hermes",
|
||||||
|
reasoning_parser=None,
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
# Algorithm configuration
|
||||||
|
algorithm = AlgorithmConfig(
|
||||||
|
algorithm_type="multi_step_grpo",
|
||||||
|
group_size=32, # repeat_times in Trinity
|
||||||
|
batch_size=24,
|
||||||
|
learning_rate=1e-6,
|
||||||
|
save_interval_steps=100,
|
||||||
|
eval_interval_steps=100,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run training with hybrid configuration
|
||||||
|
# Code parameters above + detailed Trinity config from YAML
|
||||||
|
tune(
|
||||||
|
workflow_func=run_werewolves_workflow,
|
||||||
|
judge_func=None, # We compute reward directly in the workflow
|
||||||
|
train_dataset=dataset,
|
||||||
|
model=model,
|
||||||
|
auxiliary_models=auxiliary_models,
|
||||||
|
algorithm=algorithm,
|
||||||
|
config_path=str(config_path), # For cluster, explorer, trainer details
|
||||||
|
)
|
||||||
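The reward rule in the workflow above reduces to a small pure function. The following is an illustrative sketch (the function name `assign_reward` is mine, not part of the repo): the side named by `trainable_target` receives reward 1.0 exactly when it wins, 0.0 otherwise.

```python
def assign_reward(trainable_target: str, good_guy_win: bool) -> tuple:
    """Mirror of the workflow's reward rule.

    Returns (raw_reward, is_success) for the trainable side.
    """
    if trainable_target == "werewolf":
        # Werewolves are rewarded when the good guys lose
        win = not good_guy_win
    else:  # trainable_target == "good_guy"
        win = good_guy_win
    return (1.0, True) if win else (0.0, False)


print(assign_reward("werewolf", good_guy_win=False))  # -> (1.0, True)
print(assign_reward("good_guy", good_guy_win=False))  # -> (0.0, False)
```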
tuner/werewolves/prepare_data.py — new file, 71 lines
@@ -0,0 +1,71 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Prepare dataset for werewolf game training.

This script generates a simple dataset consisting of random seeds for role shuffling.
Each seed creates a different initial role assignment, ensuring diverse training scenarios.
"""
import json
import argparse
from pathlib import Path


def prepare_dataset(
    output_dir: str,
    num_seeds: int = 300,
    split: str = "train",
) -> None:
    """Prepare the werewolf game training dataset.

    Args:
        output_dir (str): Directory to save the dataset.
        num_seeds (int): Number of seeds to generate. Default: 300.
        split (str): Dataset split name (e.g., 'train', 'eval'). Default: 'train'.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    output_file = output_path / f"{split}.jsonl"

    print(f"Generating {num_seeds} seeds for {split} split...")

    with open(output_file, "w", encoding="utf-8") as f:
        for seed in range(num_seeds):
            data = {"seed": seed}
            f.write(json.dumps(data) + "\n")

    print(f"Dataset saved to: {output_file}")
    print(f"Total samples: {num_seeds}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Prepare dataset for werewolf game training",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="data",
        help="Directory to save the dataset (default: data)",
    )
    parser.add_argument(
        "--num_seeds",
        type=int,
        default=300,
        help="Number of seeds to generate (default: 300)",
    )
    parser.add_argument(
        "--split",
        type=str,
        default="train",
        help="Dataset split name (default: train)",
    )

    args = parser.parse_args()

    prepare_dataset(
        output_dir=args.output_dir,
        num_seeds=args.num_seeds,
        split=args.split,
    )
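The on-disk format produced by `prepare_dataset` is plain JSON Lines, one `{"seed": N}` object per line. A minimal stdlib-only sketch of writing and reading it back (the temp directory and 5-seed count are illustrative):

```python
import json
import tempfile
from pathlib import Path

# Write a tiny dataset the same way prepare_dataset does:
# one JSON object per line, each carrying only a shuffling seed.
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "train.jsonl"
with open(out_file, "w", encoding="utf-8") as f:
    for seed in range(5):
        f.write(json.dumps({"seed": seed}) + "\n")

# Read it back, one task per line
rows = [json.loads(line) for line in out_file.read_text().splitlines()]
print(rows[0], len(rows))  # -> {'seed': 0} 5
```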
tuner/werewolves/prompt.py — new file, 198 lines
@@ -0,0 +1,198 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""Enhanced prompts with reasoning/statement separation awareness - 7 Player Version."""


class EnglishPrompts:
    """English prompts used to guide the werewolf game."""

    to_dead_player = (
        "{}, you're eliminated now. Now you can make a final statement to "
        "all alive players before you leave the game.\n\n"
        "IMPORTANT: Your final words will be structured into two parts:\n"
        "1. REASONING (Private): Your strategic thinking - NOT visible to others\n"
        "2. STATEMENT (Public): Your actual final words - VISIBLE to all players\n\n"
        "Think carefully about what information to reveal or hide in your final statement."
    )

    to_all_new_game = (
        "A new game is starting, the players are: {}. Now we randomly "
        "reassign the roles to each player and inform them of their roles "
        "privately."
    )

    to_all_night = (
        "Night has fallen, everyone close your eyes. Werewolves open your "
        "eyes and choose a player to eliminate tonight."
    )

    to_wolves_discussion = (
        "[WEREWOLVES ONLY] {}, you should discuss and "
        "decide on a player to eliminate tonight. Current alive players "
        "are {}. Remember to set `reach_agreement` to True if you reach an "
        "agreement during the discussion."
    )

    to_wolves_vote = "[WEREWOLVES ONLY] Which player do you vote to kill?"

    to_wolves_res = (
        "[WEREWOLVES ONLY] The voting result is {}. So you have chosen to "
        "eliminate {}."
    )

    to_all_witch_turn = (
        "Witch's turn, witch open your eyes and decide your action tonight..."
    )
    to_witch_resurrect = (
        "[WITCH ONLY] {witch_name}, you're the witch, and tonight {dead_name} "
        "is eliminated. You can resurrect him/her by using your healing "
        "potion, "
        "and note you can only use it once in the whole game. Do you want to "
        "resurrect {dead_name}? Give me your reason and decision."
    )

    to_witch_resurrect_no = (
        "[WITCH ONLY] The witch has chosen not to resurrect the player."
    )
    to_witch_resurrect_yes = (
        "[WITCH ONLY] The witch has chosen to resurrect the player."
    )

    to_witch_poison = (
        "[WITCH ONLY] {witch_name}, as a witch, you have a one-time-use "
        "poison potion, do you want to use it tonight? Give me your reason "
        "and decision."
    )

    to_all_seer_turn = (
        "Seer's turn, seer open your eyes and check one player's identity "
        "tonight..."
    )

    to_seer = (
        "[SEER ONLY] {}, as the seer you can check one player's identity "
        "tonight. Who do you want to check? Give me your reason and decision."
    )

    to_seer_result = (
        "[SEER ONLY] You've checked {agent_name}, and the result is: {role}."
    )

    to_all_day = (
        "The day is coming, all players open your eyes. Last night, "
        "the following player(s) have been eliminated: {}."
    )

    to_all_peace = (
        "The day is coming, all the players open your eyes. Last night was "
        "peaceful, no player was eliminated."
    )

    to_all_discuss = (
        "Now the alive players are {names}. The game goes on, it's time to "
        "discuss and vote a player to be eliminated. Now you each take turns "
        "to speak once in the order of {names}.\n\n"
        "IMPORTANT: Your response will be structured into two parts:\n"
        "1. REASONING (Private): Your internal analysis - NOT visible to others\n"
        "2. STATEMENT (Public): Your actual speech - VISIBLE to all players\n\n"
        "Think carefully in your reasoning, then speak strategically in your statement."
    )

    to_all_vote = (
        "Now the discussion is over. Everyone, please vote to eliminate one "
        "player from the alive players: {}."
    )

    to_all_res = "The voting result is {}. So {} has been voted out."

    to_all_wolf_win = (
        "There are {n_alive} players alive, and {n_werewolves} of them are "
        "werewolves. "
        "The game is over and werewolves win🐺🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )

    to_all_village_win = (
        "All the werewolves have been eliminated. "
        "The game is over and villagers win🏘️🎉! "
        "In this game, the true roles of all players are: {true_roles}"
    )

    to_all_continue = "The game goes on."

    to_all_reflect = (
        "The game is over. Now each player can reflect on their performance. "
        "Note each player only has one chance to speak and the reflection is "
        "only visible to themselves."
    )


class ChinesePrompts:
    """Chinese prompts used to guide the werewolf game."""

    to_dead_player = (
        "{}, 你已被淘汰。现在你可以向所有存活玩家发表最后的遗言。\n\n"
        "重要提示:你的遗言将分为两个部分:\n"
        "1. 推理(私密):你的策略思考 - 其他玩家看不到\n"
        "2. 发言(公开):你的实际遗言 - 所有玩家都能看到\n\n"
        "仔细思考在遗言中应该透露或隐藏什么信息。"
    )

    to_all_new_game = "新的一局游戏开始,参与玩家包括:{}。现在为每位玩家重新随机分配身份,并私下告知各自身份。"

    to_all_night = "天黑了,请所有人闭眼。狼人请睁眼,选择今晚要淘汰的一名玩家..."

    to_wolves_discussion = (
        "[仅狼人可见] {}, 你们可以讨论并决定今晚要淘汰的玩家。当前存活玩家有:{}。"
        "如果达成一致,请将 `reach_agreement` 设为 True。"
    )

    to_wolves_vote = "[仅狼人可见] 你投票要杀死哪位玩家?"

    to_wolves_res = "[仅狼人可见] 投票结果为 {},你们选择淘汰 {}。"

    to_all_witch_turn = "轮到女巫行动,女巫请睁眼并决定今晚的操作..."
    to_witch_resurrect = (
        "[仅女巫可见] {witch_name},你是女巫,今晚{dead_name}被淘汰。"
        "你可以用解药救他/她,注意解药全局只能用一次。你要救{dead_name}吗?"
        "请给出理由和决定。"
    )

    to_witch_resurrect_no = "[仅女巫可见] 女巫选择不救该玩家。"
    to_witch_resurrect_yes = "[仅女巫可见] 女巫选择救活该玩家。"

    to_witch_poison = "[仅女巫可见] {witch_name},你有一瓶一次性毒药,今晚要使用吗?请给出理由和决定。"

    to_all_seer_turn = "轮到预言家行动,预言家请睁眼并查验一名玩家身份..."

    to_seer = "[仅预言家可见] {}, 你是预言家,今晚可以查验一名玩家身份。你要查谁?请给出理由和决定。"

    to_seer_result = "[仅预言家可见] 你查验了{agent_name},结果是:{role}。"

    to_all_day = "天亮了,请所有玩家睁眼。昨晚被淘汰的玩家有:{}。"

    to_all_peace = "天亮了,请所有玩家睁眼。昨晚平安夜,无人被淘汰。"

    to_all_discuss = (
        "现在存活玩家有:{names}。游戏继续,大家开始讨论并投票淘汰一名玩家。请按顺序({names})依次发言。\n\n"
        "重要提示:你的回答将分为两个部分:\n"
        "1. 推理(私密):你的内心分析 - 其他玩家看不到\n"
        "2. 发言(公开):你的实际发言 - 所有玩家都能看到\n\n"
        "在推理中仔细思考,然后在发言中策略性地表达。"
    )

    to_all_vote = "讨论结束。请大家从存活玩家中投票淘汰一人:{}。"

    to_all_res = "投票结果为 {},{} 被淘汰。"

    to_all_wolf_win = (
        "当前存活玩家共{n_alive}人,其中{n_werewolves}人为狼人。"
        "游戏结束,狼人获胜🐺🎉!"
        "本局所有玩家真实身份为:{true_roles}"
    )

    to_all_village_win = "所有狼人已被淘汰。游戏结束,村民获胜🏘️🎉!本局所有玩家真实身份为:{true_roles}"

    to_all_continue = "游戏继续。"

    to_all_reflect = "游戏结束。现在每位玩家可以对自己的表现进行反思。注意每位玩家只有一次发言机会,且反思内容仅自己可见。"
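The templates above mix positional and named `str.format` placeholders, so callers must match each style. A small sketch (the template strings are copied from `EnglishPrompts`; the player names are illustrative):

```python
# Named placeholders, as in EnglishPrompts.to_seer_result:
to_seer_result = "[SEER ONLY] You've checked {agent_name}, and the result is: {role}."
msg = to_seer_result.format(agent_name="Player3", role="werewolf")
print(msg)

# Positional placeholders, as in EnglishPrompts.to_all_res:
to_all_res = "The voting result is {}. So {} has been voted out."
print(to_all_res.format("Player3: 4, Player5: 2", "Player3"))
```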
tuner/werewolves/rollout_reward_curve.png — new binary file (193 KiB), not shown
tuner/werewolves/rollout_reward_curve_goodguy.png — new binary file (538 KiB), not shown
tuner/werewolves/structured_model.py — new file, 96 lines
@@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
# flake8: noqa: E501
"""The structured output models used in the werewolf game."""
from typing import Literal

from pydantic import BaseModel, Field
from agentscope.agent import AgentBase


class DiscussionModel(BaseModel):
    """The output format for discussion."""

    reach_agreement: bool = Field(
        description="Whether you have reached an agreement or not",
    )


class PublicDiscussionModel(BaseModel):
    """The output format for public discussion with private reasoning.

    This model separates private reasoning from public statements to prevent
    accidental information leakage (e.g., werewolves revealing their identity).
    """

    reasoning: str = Field(
        description=(
            "Your PRIVATE reasoning and analysis. This will NOT be shown to "
            "other players. You can freely think about your strategy, analyze "
            "other players' behaviors, and plan your next move here. "
            "If you are a werewolf, you can think about how to hide your identity. "
            "If you are a villager, you can analyze who might be the werewolf."
        ),
    )

    statement: str = Field(
        description=(
            "Your PUBLIC statement to all players. This WILL be visible to everyone. "
            "Be careful not to reveal sensitive information (e.g., your true role if "
            "you are a werewolf). "
            "Your statement should be strategic and help your team win."
        ),
    )


def get_vote_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the vote model by player names."""

    class VoteModel(BaseModel):
        """The vote output format."""

        vote: Literal[tuple(_.name for _ in agents)] = Field(  # type: ignore
            description="The name of the player you want to vote for",
        )

    return VoteModel


class WitchResurrectModel(BaseModel):
    """The output format for witch resurrect action."""

    resurrect: bool = Field(
        description="Whether you want to resurrect the player",
    )


def get_poison_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the poison model by player names."""

    class WitchPoisonModel(BaseModel):
        """The output format for witch poison action."""

        poison: bool = Field(
            description="Do you want to use the poison potion",
        )
        name: Literal[  # type: ignore
            tuple(_.name for _ in agents)
        ] | None = Field(
            description="The name of the player you want to poison, if you "
            "don't want to poison anyone, just leave it empty",
            default=None,
        )

    return WitchPoisonModel


def get_seer_model(agents: list[AgentBase]) -> type[BaseModel]:
    """Get the seer model by player names."""

    class SeerModel(BaseModel):
        """The output format for seer action."""

        name: Literal[tuple(_.name for _ in agents)] = Field(  # type: ignore
            description="The name of the player you want to check",
        )

    return SeerModel
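`get_vote_model` constrains the `vote` field with a `Literal` built at runtime from the alive players' names, so an out-of-game name fails validation. The same runtime-constructed-`Literal` trick can be sketched without pydantic (stdlib only; `make_vote_validator` and the player names are mine, for illustration):

```python
from typing import Literal, get_args


def make_vote_validator(names):
    # Literal subscripted with a tuple at runtime, the same trick
    # get_vote_model uses for its pydantic field annotation.
    vote_literal = Literal[tuple(names)]  # type: ignore[valid-type]
    allowed = set(get_args(vote_literal))

    def validate(vote: str) -> str:
        if vote not in allowed:
            raise ValueError(f"{vote!r} is not an alive player")
        return vote

    return validate


validate = make_vote_validator(["Player1", "Player2", "Player3"])
print(validate("Player2"))  # -> Player2
```

Rebuilding the model per call matters: as players are eliminated, the allowed names shrink, so stale votes are rejected at parse time rather than silently counted.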
tuner/werewolves/utils.py — new file, 159 lines
@@ -0,0 +1,159 @@
# -*- coding: utf-8 -*-
"""Utility functions for the werewolf game - 7 Player Version."""
from collections import defaultdict
from typing import Any

import numpy as np
from agentscope.agent import AgentBase, ReActAgent
from agentscope.message import Msg
from prompt import EnglishPrompts as Prompts

MAX_GAME_ROUND = 30
MAX_DISCUSSION_ROUND = 3


def majority_vote(votes: list[str]) -> tuple:
    """Return the vote with the most counts."""
    result = max(set(votes), key=votes.count)
    names, counts = np.unique(votes, return_counts=True)
    conditions = ", ".join(
        [f"{name}: {count}" for name, count in zip(names, counts)],
    )
    return result, conditions


def names_to_str(agents: list[str] | list[ReActAgent]) -> str:
    """Return a string of agent names."""
    if not agents:
        return ""

    if len(agents) == 1:
        if isinstance(agents[0], ReActAgent):
            return agents[0].name
        return agents[0]

    names = []
    for agent in agents:
        if isinstance(agent, ReActAgent):
            names.append(agent.name)
        else:
            names.append(agent)
    return ", ".join([*names[:-1], "and " + names[-1]])


class EchoAgent(AgentBase):
    """Echo agent that repeats the input message."""

    def __init__(self) -> None:
        super().__init__()
        self.name = "Moderator"

    async def reply(self, content: str) -> Msg:
        """Repeat the input content with its name and role."""
        msg = Msg(
            self.name,
            content,
            role="assistant",
        )
        await self.print(msg)
        return msg

    async def handle_interrupt(
        self,
        *args: Any,
        **kwargs: Any,
    ) -> Msg:
        """Handle interrupt."""

    async def observe(self, msg: Msg | list[Msg] | None) -> None:
        """Observe the user's message."""


class Players:
    """Maintain the players' status."""

    def __init__(self) -> None:
        """Initialize the players."""
        # The mapping from player name to role
        self.name_to_role = {}
        self.role_to_names = defaultdict(list)
        self.name_to_agent = {}
        self.werewolves = []
        self.villagers = []
        self.seer = []
        self.witch = []
        self.current_alive = []
        self.all_players = []

    def add_player(self, player: ReActAgent, role: str) -> None:
        """Add a player to the game.

        Args:
            player (`ReActAgent`):
                The player to be added.
            role (`str`):
                The role of the player.
        """
        self.name_to_role[player.name] = role
        self.name_to_agent[player.name] = player
        self.role_to_names[role].append(player.name)
        self.all_players.append(player)
        if role == "werewolf":
            self.werewolves.append(player)
        elif role == "villager":
            self.villagers.append(player)
        elif role == "seer":
            self.seer.append(player)
        elif role == "witch":
            self.witch.append(player)
        else:
            raise ValueError(f"Unknown role: {role}")
        self.current_alive.append(player)

    def update_players(self, dead_players: list[str]) -> None:
        """Update the current alive players.

        Args:
            dead_players (`list[str]`):
                The names of the dead players to be removed.
        """
        self.werewolves = [
            _ for _ in self.werewolves if _.name not in dead_players
        ]
        self.villagers = [
            _ for _ in self.villagers if _.name not in dead_players
        ]
        self.seer = [_ for _ in self.seer if _.name not in dead_players]
        self.witch = [_ for _ in self.witch if _.name not in dead_players]
        self.current_alive = [
            _ for _ in self.current_alive if _.name not in dead_players
        ]

    def print_roles(self) -> None:
        """Print the roles of all players."""
        print("Roles:")
        for name, role in self.name_to_role.items():
            print(f"  - {name}: {role}")

    def check_winning(self) -> str | None:
        """Check if the game is over and return the winning message."""

        # Prepare true roles string
        true_roles = (
            f'{names_to_str(self.role_to_names["werewolf"])} are werewolves, '
            f'{names_to_str(self.role_to_names["villager"])} are villagers, '
            f'{names_to_str(self.role_to_names["seer"])} is the seer, '
            f'and {names_to_str(self.role_to_names["witch"])} is the witch.'
        )

        if len(self.werewolves) * 2 >= len(self.current_alive):
            return Prompts.to_all_wolf_win.format(
                n_alive=len(self.current_alive),
                n_werewolves=len(self.werewolves),
                true_roles=true_roles,
            )
        if self.current_alive and not self.werewolves:
            return Prompts.to_all_village_win.format(
                true_roles=true_roles,
            )
        return None