Add Browser Use Agent Pro (#52)

This commit is contained in:
Yue Cui
2025-12-05 18:47:40 +08:00
committed by GitHub
parent f242f4399e
commit f3952caf6c
22 changed files with 2696 additions and 3 deletions

View File

@@ -62,6 +62,7 @@ It includes **agent deployment** and **secure sandboxed tool execution**, and ca
├── alias/ # Agent to solve real-world problems
├── browser_use/
│ ├── agent_browser/ # Pure Python browser agent
│ ├── browser_use_agent_pro/ # Advanced pure python browser agent
│ └── browser_use_fullstack_runtime/ # Full-stack runtime version with frontend/backend
├── deep_research/
@@ -93,6 +94,7 @@ It includes **agent deployment** and **secure sandboxed tool execution**, and ca
| ----------------------- |-------------------------------------------------------| --------------- | ------------ |--------------------------------------------------|
| **Data Processing** | data_juicer_agent/ | ✅ | ❌ | Multi-agent data processing with Data-Juicer |
| **Browser Use** | browser_use/agent_browser | ✅ | ❌ | Command-line browser automation using AgentScope |
| | browser_use/browser_use_agent_pro | ✅ | ❌ | Advanced command-line Python browser agent using AgentScope |
| | browser_use/browser_use_fullstack_runtime | ✅ | ✅ | Full-stack browser automation with UI & sandbox |
| **Deep Research** | deep_research/agent_deep_research | ✅ | ❌ | Multi-agent research pipeline |
| | deep_research/qwen_langgraph_search_fullstack_runtime | ❌ | ✅ | Full-stack deep research app |
@@ -183,7 +185,7 @@ This project is licensed under the **Apache 2.0 License**; see the [LICENSE](
## Contributors ✨
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/emoji-key/)):
<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->

View File

@@ -62,7 +62,8 @@ AgentScope Runtime is a **comprehensive runtime framework**, mainly addressing deployment and
├── alias/ # Agent programs that solve real-world problems
├── browser_use/
│ ├── agent_browser/ # Pure Python browser agent
│ └── browser_use_fullstack_runtime/ # Full-stack runtime version (frontend + backend)
│ ├── browser_use_agent_pro/ # Advanced pure Python browser agent
│ └── browser_use_fullstack_runtime/ # Full-stack runtime version (frontend + backend)
├── deep_research/
│ ├── agent_deep_research/ # Pure Python multi-agent research pipeline
@@ -93,6 +94,7 @@ AgentScope Runtime is a **comprehensive runtime framework**, mainly addressing deployment and
|-----------|-------------------------------------------------------|---------------|-----------------------|-------------------------|
| **Data Processing** | data_juicer_agent/ | ✅ | ❌ | Multi-agent data processing with Data-Juicer |
| **Browser Use** | browser_use/agent_browser | ✅ | ❌ | Command-line browser automation using AgentScope |
| | browser_use/browser_use_agent_pro | ✅ | ❌ | Advanced command-line browser agent using AgentScope |
| | browser_use/browser_use_fullstack_runtime | ✅ | ✅ | Full-stack browser automation with UI & sandbox |
| **Deep Research** | deep_research/agent_deep_research | ✅ | ❌ | Multi-agent research pipeline |
| | deep_research/qwen_langgraph_search_fullstack_runtime | ❌ | ✅ | Full-stack deep research app |
@@ -181,7 +183,7 @@ AgentScope Runtime is a **comprehensive runtime framework**, mainly addressing deployment and
## Contributors ✨
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/emoji-key/)):
<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->

View File

@@ -0,0 +1,70 @@
# Browser Use Agent Pro
A powerful, standalone browser automation agent built on top of [AgentScope](https://github.com/agentscope-ai/agentscope) and [Playwright MCP](https://github.com/microsoft/playwright-mcp). This agent provides intelligent web automation capabilities through natural language instructions.
Browser Use Agent Pro excels at automating a wide range of web-based tasks, including web research, form automation, e-commerce operations, content management, testing, and workflow automation.
## ✨ Key Features
- **Multimodal Understanding**
- Image understanding: Analyze and interact with visual elements on web pages
- Video understanding: Extract frames, transcribe audio, and analyze video content
- Form filling: Automatically fill web forms based on natural language instructions
- File Download: Locate and trigger file downloads from web pages
- **Task Decomposition and Management**
- Automatic task decomposition: Break down complex tasks into manageable subtasks
- Subtask tracking: Monitor and manage progress through multiple subtasks
- Dynamic subtask revision: Adapt and refine subtasks based on execution results
- Task completion validation: Verify when subtasks and overall tasks are completed
- **Advanced Reasoning**
- Pure reasoning: Plan actions without page observation
- Observation-based reasoning: Analyze page content before making decisions
- Chunked observation: Process large page snapshots in manageable chunks
- **Memory Management**
- Automatic memory summarization: Condense conversation history when it exceeds limits
- Tool output filtering: Clean and filter verbose tool execution results
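
The chunked-observation feature above can be sketched as follows. This is an illustrative, stdlib-only helper (not the agent's actual implementation): it splits a long page snapshot into size-bounded chunks on line boundaries so each chunk can be reviewed in turn.

```python
def chunk_snapshot(snapshot: str, max_chars: int = 2000) -> list[str]:
    """Split a long page snapshot into chunks of at most ``max_chars``
    characters, breaking on line boundaries (a single line longer than
    ``max_chars`` still becomes its own oversized chunk)."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in snapshot.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the chunks back together reproduces the original snapshot, so no content is lost across chunk boundaries.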
## 📋 Requirements
- Python 3.10+
- Node.js and npx (for playwright-mcp)
- DASHSCOPE_API_KEY environment variable
The playwright-mcp server will automatically handle browser installation when first run via `npx @playwright/mcp@latest`. No manual browser installation is required.
## 💻 Installation
1. Install dependencies:
```bash
# From the project root directory
pip install -r requirements.txt
```
2. Ensure Node.js and npx are installed (required for playwright-mcp):
```bash
# Check if npx is available
npx --version
```
3. Set up environment variables:
```bash
export DASHSCOPE_API_KEY="your-api-key"
export MODEL="qwen3-max" # or "qwen-vl-max" for vision model
```
## 🚀 Basic Usage
Run Browser Use Agent Pro with a task and, optionally, a start URL:
```bash
# From the project root directory
python main.py "Find the latest stock price of Alibaba Group" "https://www.google.com"
```
## Note
This is a standalone version extracted from the [Alias-Agent](https://github.com/agentscope-ai/agentscope-samples/tree/main/alias) project. It now uses standard AgentScope components (ReActAgent, Toolkit) with local Playwright MCP clients.

File diff suppressed because it is too large

View File

@@ -0,0 +1,234 @@
# -*- coding: utf-8 -*-
"""Standalone file download skill for the browser agent."""
# flake8: noqa: E501
# pylint: disable=W0212
# pylint: disable=too-many-lines
# pylint: disable=C0301
from __future__ import annotations

import copy
import os
from typing import Any

from agentscope.agent import ReActAgent
from agentscope.memory import InMemoryMemory
from agentscope.message import Msg, TextBlock
from agentscope.tool import ToolResponse

_CURRENT_DIR = os.path.abspath(
    os.path.join(os.path.dirname(__file__), os.pardir),
)

with open(
    os.path.join(
        _CURRENT_DIR,
        "_build_in_prompt_browser/browser_agent_file_download_sys_prompt.md",
    ),
    "r",
    encoding="utf-8",
) as f:
    _FILE_DOWNLOAD_AGENT_SYS_PROMPT = f.read()


class FileDownloadAgent(ReActAgent):
    """Lightweight helper agent that downloads files."""

    def __init__(
        self,
        browser_agent: Any,
        sys_prompt: str = _FILE_DOWNLOAD_AGENT_SYS_PROMPT,
        max_iters: int = 15,
    ) -> None:
        name = (
            f"{getattr(browser_agent, 'name', 'browser_agent')}_file_download"
        )
        self.finish_function_name = "file_download_final_response"
        super().__init__(
            name=name,
            sys_prompt=sys_prompt,
            model=browser_agent.model,
            formatter=browser_agent.formatter,
            memory=InMemoryMemory(),
            toolkit=browser_agent.toolkit,
            max_iters=max_iters,
        )
        # Remove conflicting tool functions if they exist
        if hasattr(self.toolkit, "remove_tool_function"):
            try:
                self.toolkit.remove_tool_function("browser_pdf_save")
            except Exception:  # pylint: disable=broad-except
                # Tool may not exist; ignore removal errors
                pass
            try:
                self.toolkit.remove_tool_function("file_download")
            except Exception:  # pylint: disable=broad-except
                # Tool may not exist; ignore removal errors
                pass

    async def file_download_final_response(
        self,  # pylint: disable=W0613
        **kwargs: Any,  # pylint: disable=W0613
    ) -> ToolResponse:
        """Summarize the file download outcome."""
        hint_msg = Msg(
            "user",
            (
                "Provide a concise summary of the file download attempt.\n"
                "Highlight these items:\n"
                "0. The original request\n"
                "1. The element(s) interacted with and actions taken\n"
                "2. The download status or any issues encountered\n"
                "3. Any follow-up recommendations or next steps\n"
            ),
            role="user",
        )
        memory_msgs = await self.memory.get_memory()
        memory_msgs_copy = copy.deepcopy(memory_msgs)
        if memory_msgs_copy:
            last_msg = memory_msgs_copy[-1]
            last_msg.content = last_msg.get_content_blocks("text")
            memory_msgs_copy[-1] = last_msg
        prompt = await self.formatter.format(
            msgs=[
                Msg("system", self.sys_prompt, "system"),
                *memory_msgs_copy,
                hint_msg,
            ],
        )
        res = await self.model(prompt)
        if self.model.stream:
            summary_text = ""
            async for chunk in res:
                summary_text = chunk.content[0]["text"]
        else:
            summary_text = res.content[0]["text"]
        summary_text = summary_text or "No summary generated."
        # Create a simple metadata structure instead of WorkerResponse
        structure_response = {
            "task_done": True,
            "subtask_progress_summary": summary_text,
            "generated_files": {},
        }
        response_msg = Msg(
            self.name,
            content=[
                TextBlock(type="text", text=summary_text),
            ],
            role="assistant",
            metadata=structure_response,
        )
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text="File download summary generated. " + summary_text,
                ),
            ],
            metadata={
                "success": True,
                "response_msg": response_msg,
            },
            is_last=True,
        )


def _build_initial_instruction(
    target_description: str,
    snapshot_text: str,
) -> str:
    """Compose the initial instruction for the helper agent."""
    return (
        "You must locate and trigger the download for the requested file.\n\n"
        "Target description provided by the user:\n"
        f"{target_description}\n\n"
        "Latest snapshot captured prior to your run:\n"
        f"{snapshot_text}\n\n"
        "Follow the sys prompt guidance, think step-by-step, and verify that "
        "the download action succeeded. If the download cannot be completed, "
        "explain why in the final summary."
    )


async def file_download(
    browser_agent: Any,
    target_description: str,
) -> ToolResponse:
    """Download the target file. The current page should contain a
    download-related element.

    Args:
        target_description (str): The description of the target file
            to download.

    Returns:
        ToolResponse: A structured response containing the download
            directory.
    """
    try:
        snapshot_chunks = await browser_agent._get_snapshot_in_text()
    except Exception as exc:  # pylint: disable=broad-except
        snapshot_chunks = []
        snapshot_error = str(exc)
    else:
        snapshot_error = ""
    snapshot_text = "\n\n---\n\n".join(snapshot_chunks)
    if snapshot_error and not snapshot_text:
        snapshot_text = f"[Snapshot failed: {snapshot_error}]"
    sub_agent = FileDownloadAgent(browser_agent)
    instruction = _build_initial_instruction(
        target_description=target_description,
        snapshot_text=snapshot_text,
    )
    init_msg = Msg(
        name="user",
        role="user",
        content=instruction,
    )
    try:
        sub_agent_response_msg = await sub_agent.reply(init_msg)
        text_content = ""
        if sub_agent_response_msg.content:
            first_block = sub_agent_response_msg.content[0]
            if isinstance(first_block, dict):
                text_content = first_block.get("text") or ""
            else:
                text_content = getattr(first_block, "text", "") or ""
        if not text_content:
            text_content = (
                "File download agent finished without a textual summary."
            )
        return ToolResponse(
            metadata=sub_agent_response_msg.metadata,
            content=[
                TextBlock(
                    type="text",
                    text=text_content,
                ),
            ],
        )
    except Exception as exc:  # pylint: disable=broad-except
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text=f"Tool call Error. Cannot be executed. {exc}",
                ),
            ],
            metadata={"success": False},
            is_last=True,
        )

View File

@@ -0,0 +1,210 @@
# -*- coding: utf-8 -*-
"""Standalone form filling skill for the browser agent."""
# flake8: noqa: E501
# pylint: disable=W0212
# pylint: disable=too-many-lines
# pylint: disable=C0301
from __future__ import annotations

import copy
import os
from typing import Any

from agentscope.agent import ReActAgent
from agentscope.memory import InMemoryMemory
from agentscope.message import Msg, TextBlock
from agentscope.tool import ToolResponse

_CURRENT_DIR = os.path.abspath(
    os.path.join(os.path.dirname(__file__), os.pardir),
)

with open(
    os.path.join(
        _CURRENT_DIR,
        "_build_in_prompt_browser/browser_agent_form_filling_sys_prompt.md",
    ),
    "r",
    encoding="utf-8",
) as f:
    _FORM_FILL_AGENT_SYS_PROMPT = f.read()


class FormFillingAgent(ReActAgent):
    """Lightweight helper agent that fills forms."""

    def __init__(
        self,
        browser_agent: Any,
        sys_prompt: str = _FORM_FILL_AGENT_SYS_PROMPT,
        max_iters: int = 20,
    ) -> None:
        name = f"{getattr(browser_agent, 'name', 'browser_agent')}_form_fill"
        self.finish_function_name = "form_filling_final_response"
        super().__init__(
            name=name,
            sys_prompt=sys_prompt,
            model=browser_agent.model,
            formatter=browser_agent.formatter,
            memory=InMemoryMemory(),
            toolkit=browser_agent.toolkit,
            max_iters=max_iters,
        )

    async def form_filling_final_response(
        self,  # pylint: disable=W0613
        **kwargs: Any,  # pylint: disable=W0613
    ) -> ToolResponse:
        """Summarize the form filling outcome."""
        hint_msg = Msg(
            "user",
            (
                "Provide a concise summary of the completed form "
                "filling task.\n"
                "Highlight these items:\n"
                "0. The original task/query\n"
                "1. Which fields were filled/selected and their final values\n"
                "2. Any important observations or follow-up notes\n"
                "3. Confirmation of whether the task is complete\n\n"
            ),
            role="user",
        )
        memory_msgs = await self.memory.get_memory()
        memory_msgs_copy = copy.deepcopy(memory_msgs)
        if memory_msgs_copy:
            # If the last message carries tool calls, keep only its text
            last_msg = memory_msgs_copy[-1]
            last_msg.content = last_msg.get_content_blocks("text")
            memory_msgs_copy[-1] = last_msg
        prompt = await self.formatter.format(
            msgs=[
                Msg("system", self.sys_prompt, "system"),
                *memory_msgs_copy,
                hint_msg,
            ],
        )
        res = await self.model(prompt)
        if self.model.stream:
            summary_text = ""
            async for chunk in res:
                summary_text = chunk.content[0]["text"]
        else:
            summary_text = res.content[0]["text"]
        summary_text = summary_text or "No summary generated."
        # Create a simple metadata structure instead of WorkerResponse
        structure_response = {
            "task_done": True,
            "subtask_progress_summary": summary_text,
            "generated_files": {},
        }
        response_msg = Msg(
            self.name,
            content=[
                TextBlock(type="text", text=summary_text),
            ],
            role="assistant",
            metadata=structure_response,
        )
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text="Form filling summary generated. " + summary_text,
                ),
            ],
            metadata={
                "success": True,
                "response_msg": response_msg,
            },
            is_last=True,
        )


def _build_initial_instruction(
    fill_information: str,
    snapshot_text: str,
) -> str:
    """Compose the initial instruction fed to the helper agent."""
    return (
        "You must complete the web form using the information "
        "provided below.\n\nFill instructions (plain text from the user):\n"
        f"{fill_information}\n\n"
        "Latest snapshot captured prior to your run:\n"
        f"{snapshot_text}\n\n"
    )


async def form_filling(
    browser_agent: Any,
    fill_information: str,
) -> ToolResponse:
    """Fill in a web form according to plain-text instructions.

    Args:
        fill_information (str): Plain-text description of the values that
            must be entered into the form, including any submission
            requirements.

    Returns:
        ToolResponse: Summary of the helper agent execution and status.
    """
    try:
        snapshot_chunks = (
            await browser_agent._get_snapshot_in_text()
        )  # pylint: disable=protected-access
    except Exception as exc:  # pylint: disable=broad-except
        snapshot_chunks = []
        snapshot_error = str(exc)
    else:
        snapshot_error = ""
    snapshot_text = "\n\n---\n\n".join(snapshot_chunks)
    if snapshot_error and not snapshot_text:
        snapshot_text = f"[Snapshot failed: {snapshot_error}]"
    sub_agent = FormFillingAgent(browser_agent)
    instruction = _build_initial_instruction(
        fill_information=fill_information,
        snapshot_text=snapshot_text,
    )
    init_msg = Msg(
        name="user",
        role="user",
        content=instruction,
    )
    try:
        sub_agent_response_msg = await sub_agent.reply(init_msg)
        return ToolResponse(
            metadata=sub_agent_response_msg.metadata,
            content=[
                TextBlock(
                    type="text",
                    text=sub_agent_response_msg.content[0]["text"]
                    or (
                        "Form filling agent finished "
                        "without a textual summary."
                    ),
                ),
            ],
        )
    except Exception as e:  # pylint: disable=broad-except
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text=f"Tool call Error. Cannot be executed. {e}",
                ),
            ],
            metadata={"success": False},
            is_last=True,
        )

View File

@@ -0,0 +1,161 @@
# -*- coding: utf-8 -*-
"""Standalone image understanding skill for the browser agent."""
# flake8: noqa: E501
# pylint: disable=W0212
# pylint: disable=too-many-lines
# pylint: disable=C0301
from __future__ import annotations

import json
import uuid
from typing import Any

from agentscope.message import (
    Base64Source,
    ImageBlock,
    Msg,
    TextBlock,
    ToolUseBlock,
)
from agentscope.tool import ToolResponse


async def image_understanding(
    browser_agent: Any,
    object_description: str,
    task: str,
) -> ToolResponse:
    """Locate an element and solve a visual task on the current webpage.

    Args:
        object_description (str): The description of the object to locate.
        task (str): The specific task or question to solve about the image
            (e.g., description, object detection, activity recognition, or
            answering a question about the image's content).

    Returns:
        ToolResponse: A structured response containing the answer to
            the specified task based on the image content.
    """
    sys_prompt = (
        "You are a web page analysis expert. Given the following page "
        "snapshot and object description, "
        "identify the exact element and its reference string (ref) "
        "that matches the description. "
        "Return ONLY a JSON object: "
        '{"element": <element description>, "ref": <ref string>}'
    )
    snapshot_chunks = (
        await browser_agent._get_snapshot_in_text()  # noqa: E501 # pylint: disable=protected-access
    )
    page_snapshot = snapshot_chunks[0] if snapshot_chunks else ""
    user_prompt = (
        f"Object description: {object_description}\n"
        f"Page snapshot:\n{page_snapshot}"
    )
    prompt = await browser_agent.formatter.format(
        msgs=[
            Msg("system", sys_prompt, role="system"),
            Msg("user", user_prompt, role="user"),
        ],
    )
    res = await browser_agent.model(prompt)
    model_text = ""
    if browser_agent.model.stream:
        async for chunk in res:
            model_text = chunk.content[0]["text"]
    else:
        model_text = res.content[0]["text"]
    try:
        if "```json" in model_text:
            model_text = model_text.replace("```json", "").replace(
                "```",
                "",
            )
        element_info = json.loads(model_text)
        element = element_info.get("element", "")
        ref = element_info.get("ref", "")
    except Exception:  # pylint: disable=broad-except
        return ToolResponse(
            content=[
                TextBlock(
                    type="text",
                    text="Failed to parse element/ref from model output.",
                ),
            ],
            metadata={"success": False},
        )
    screenshot_tool_call = ToolUseBlock(
        id=str(uuid.uuid4()),
        name="browser_take_screenshot",
        input={"element": element, "ref": ref},
        type="tool_use",
    )
    screenshot_response = await browser_agent.toolkit.call_tool_function(
        screenshot_tool_call,
    )
    image_data = None
    async for chunk in screenshot_response:
        if (
            chunk.content
            and len(chunk.content) > 1
            and "data" in chunk.content[1]
        ):
            image_data = chunk.content[1]["data"]
    sys_prompt_task = (
        "You are a web automation expert. "
        "Given the object description, screenshot, and page context, "
        "solve the following task. Return ONLY the answer as plain text."
    )
    content_blocks = [
        TextBlock(
            type="text",
            text=(
                "Object description: "
                f"{object_description}\nTask: {task}\n"
                f"Page snapshot:\n{page_snapshot}"
            ),
        ),
    ]
    if image_data:
        image_block = ImageBlock(
            type="image",
            source=Base64Source(
                type="base64",
                media_type="image/png",
                data=image_data,
            ),
        )
        content_blocks.append(image_block)
    prompt_task = await browser_agent.formatter.format(
        msgs=[
            Msg("system", sys_prompt_task, role="system"),
            Msg("user", content_blocks, role="user"),
        ],
    )
    res_task = await browser_agent.model(prompt_task)
    answer_text = ""
    if browser_agent.model.stream:
        async for chunk in res_task:
            answer_text = chunk.content[0]["text"]
    else:
        answer_text = res_task.content[0]["text"]
    return ToolResponse(
        content=[
            TextBlock(
                type="text",
                text=(
                    f"Screenshot taken for element: {element}\nref: {ref}\n"
                    f"Task solution: {answer_text}"
                ),
            ),
        ],
    )

View File

@@ -0,0 +1,330 @@
# -*- coding: utf-8 -*-
"""Standalone video understanding skill for the browser agent."""
# flake8: noqa: E501
# pylint: disable=W0212
# pylint: disable=too-many-lines
# pylint: disable=C0301
from __future__ import annotations

import json
import os
import subprocess
import tempfile
import uuid
from base64 import b64encode
from pathlib import Path
from typing import Any, List, Optional

from agentscope.message import (
    Base64Source,
    ImageBlock,
    Msg,
    TextBlock,
)
from agentscope.tool import ToolResponse


async def video_understanding(
    browser_agent: Any,
    video_path: str,
    task: str,
) -> ToolResponse:
    """Perform video understanding on the provided video file.

    Args:
        video_path (str): The path to the video file to analyze.
        task (str): The specific task or question to solve about
            the video (e.g., summary, object detection, activity
            recognition, or answering a question about the video's
            content).

    Returns:
        ToolResponse: A structured response containing the answer
            to the specified task based on the video content.
    """
    workdir = _prepare_workdir(browser_agent)
    try:
        frames_dir = os.path.join(workdir, "frames")
        frames = extract_frames(video_path, frames_dir)
    except Exception as exc:  # pylint: disable=broad-except
        return _error_response(f"Failed to extract frames: {exc}")
    audio_path = os.path.join(
        workdir,
        f"audio_{getattr(browser_agent, 'iter_n', 0)}.wav",
    )
    try:
        extract_audio(video_path, audio_path)
    except Exception as exc:  # pylint: disable=broad-except
        return _error_response(f"Failed to extract audio: {exc}")
    try:
        transcript = audio2text(audio_path)
    except Exception as exc:  # pylint: disable=broad-except
        return _error_response(f"Failed to transcribe audio: {exc}")
    sys_prompt = (
        "You are a web video analysis expert. "
        "Given the following video frames and audio transcript, "
        "analyze the content and provide a solution to the task. "
        'Return ONLY a JSON object: {"answer": <your answer>}'
    )
    content_blocks = _build_multimodal_blocks(frames, transcript, task)
    prompt = await browser_agent.formatter.format(
        msgs=[
            Msg("system", sys_prompt, role="system"),
            Msg("user", content_blocks, role="user"),
        ],
    )
    res = await browser_agent.model(prompt)
    model_text = ""
    if browser_agent.model.stream:
        async for chunk in res:
            model_text = chunk.content[0]["text"]
    else:
        model_text = res.content[0]["text"]
    try:
        if "```json" in model_text:
            model_text = model_text.replace("```json", "").replace(
                "```",
                "",
            )
        answer_info = json.loads(model_text)
        answer = answer_info.get("answer", "")
    except Exception:  # pylint: disable=broad-except
        return _error_response("Failed to parse answer from model output.")
    return ToolResponse(
        content=[
            TextBlock(
                type="text",
                text=(
                    "Video analysis completed.\n"
                    f"Task solution: {answer}"
                ),
            ),
        ],
    )


def audio2text(audio_path: str) -> str:
    """Convert audio to text using DashScope ASR."""
    try:  # Local import to avoid hard dependency when unused.
        from dashscope.audio.asr import Recognition, RecognitionCallback
    except ImportError as exc:
        raise RuntimeError(
            "dashscope.audio is required for audio transcription.",
        ) from exc
    callback = RecognitionCallback()
    recognizer = Recognition(
        model="paraformer-realtime-v1",
        format="wav",
        sample_rate=16000,
        callback=callback,
    )
    result = recognizer.call(audio_path)
    sentences = result.get("output", {}).get("sentence", [])
    return " ".join(sentence.get("text", "") for sentence in sentences)


def extract_frames(
    video_path: str,
    output_dir: str,
    max_frames: int = 16,
) -> List[str]:
    """Extract representative frames using ffmpeg (no OpenCV dependency)."""
    if max_frames <= 0:
        raise ValueError("max_frames must be greater than zero.")
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video path not found: {video_path}")
    os.makedirs(output_dir, exist_ok=True)
    # Clean up previously generated frames
    for existing in Path(output_dir).glob("frame_*.jpg"):
        try:
            existing.unlink()
        except OSError:
            # Ignore errors during cleanup; leftover files will be
            # overwritten or do not affect frame extraction
            pass
    duration = _probe_video_duration(video_path)
    if duration and duration > 0:
        fps = max_frames / duration
    else:
        fps = 1.0
    fps = max(min(fps, 30.0), 0.1)
    command = [
        "ffmpeg",
        "-y",
        "-i",
        video_path,
        "-vf",
        f"fps={fps:.5f}",
        "-frames:v",
        str(max_frames),
        os.path.join(output_dir, "frame_%04d.jpg"),
    ]
    try:
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError as exc:
        raise RuntimeError(
            "ffmpeg is required to extract frames from video.",
        ) from exc
    frame_files = sorted(
        str(path) for path in Path(output_dir).glob("frame_*.jpg")
    )
    if not frame_files:
        raise RuntimeError("No frames could be extracted from the video.")
    return frame_files


def extract_audio(video_path: str, audio_path: str) -> str:
    """Extract the audio track with ffmpeg and save it as wav."""
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video path not found: {video_path}")
    os.makedirs(os.path.dirname(audio_path), exist_ok=True)
    command = [
        "ffmpeg",
        "-y",
        "-i",
        video_path,
        "-vn",
        "-acodec",
        "pcm_s16le",
        "-ar",
        "16000",
        "-ac",
        "1",
        audio_path,
    ]
    try:
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError as exc:
        raise RuntimeError(
            "ffmpeg is required to extract audio from video.",
        ) from exc
    return audio_path


def _probe_video_duration(video_path: str) -> Optional[float]:
    """Return the video duration in seconds using ffprobe, if available."""
    command = [
        "ffprobe",
        "-v",
        "error",
        "-show_entries",
        "format=duration",
        "-of",
        "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]
    try:
        result = subprocess.run(
            command,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
        duration_str = result.stdout.strip()
        if duration_str:
            return float(duration_str)
    except (FileNotFoundError, ValueError, subprocess.CalledProcessError):
        return None
    return None


def _build_multimodal_blocks(
    frames: List[str],
    transcript: str,
    task: str,
) -> list:
    """Construct multimodal content blocks for the model input."""
    blocks: list = []
    for frame_path in frames:
        with open(frame_path, "rb") as file:
            data = b64encode(file.read()).decode("ascii")
        image_block = ImageBlock(
            type="image",
            source=Base64Source(
                type="base64",
                media_type="image/jpeg",
                data=data,
            ),
        )
        blocks.append(image_block)
    blocks.append(
        TextBlock(
            type="text",
            text=f"Audio transcript:\n{transcript}",
        ),
    )
    blocks.append(
        TextBlock(
            type="text",
            text=f"The task to be solved is: {task}",
        ),
    )
    return blocks


def _prepare_workdir(browser_agent: Any) -> str:
    """Prepare a working directory for intermediate artifacts."""
    base_dir = getattr(browser_agent, "state_saving_dir", None)
    if not base_dir:
        base_dir = tempfile.gettempdir()
    workdir = os.path.join(base_dir, "video_understanding", uuid.uuid4().hex)
    os.makedirs(workdir, exist_ok=True)
    return workdir


def _error_response(message: str) -> ToolResponse:
    """Create a standardized error response."""
    return ToolResponse(
        content=[
            TextBlock(
                type="text",
                text=message,
            ),
        ],
        metadata={"success": False},
    )

View File

@@ -0,0 +1,28 @@
Your role is to assess and optimize task decomposition for browser automation. Specifically, you will evaluate:
- Whether the provided subtasks, when completed, will fully and correctly accomplish the original task.
- Whether the original task requires decomposition. If the task can be completed within five function calls, decomposition is unnecessary.
Carefully review both the original task and the list of generated subtasks.
- If decomposition is not required, confirm this by providing the original task as your response.
- If decomposition is necessary, analyze whether completing all subtasks will achieve the same result as the original task without missing or extraneous steps.
- Do not use "if" statements in subtask descriptions; all statements should be direct and assertive.
- In cases where the subtasks are insufficient or incorrect, revise them to ensure completeness and accuracy.
Format your response as the following JSON:
{{
  "DECOMPOSITION": true/false, // true if decomposition is necessary, false otherwise
  "SUFFICIENT": true/false/na, // if decomposition is necessary: true if the subtasks are sufficient, false otherwise; "na" if decomposition is not necessary
  "REASON": "Briefly explain your reasoning.",
  "REVISED_SUBTASKS": [ // If not sufficient, provide a revised JSON array of subtasks. If sufficient, repeat the original subtasks. If decomposition is not necessary, provide the original task.
    "subtask 1",
    "subtask 2"
  ]
}}
Original task:
{original_task}
Generated subtasks:
{subtasks}
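
A caller consuming this prompt's reply might extract the subtasks like this. This is a minimal stdlib sketch for illustration only (the function name and sample reply are hypothetical); the fence stripping mirrors the JSON handling used by the other skills in this PR.

```python
import json


def parse_decomposition_reply(reply: str) -> list[str]:
    """Return the subtasks to execute from the checker's JSON reply."""
    # The model may wrap its JSON in a markdown code fence; strip it.
    # The triple-backtick is built indirectly to keep this doc's fences intact.
    fence = chr(96) * 3
    cleaned = reply.replace(fence + "json", "").replace(fence, "").strip()
    data = json.loads(cleaned)
    # REVISED_SUBTASKS always carries the tasks to run: the revised list,
    # the confirmed subtasks, or the original task as a single entry.
    return list(data["REVISED_SUBTASKS"])


reply = (
    '{"DECOMPOSITION": true, "SUFFICIENT": false, '
    '"REASON": "A verification step is missing.", '
    '"REVISED_SUBTASKS": ["open the site", "verify the result"]}'
)
print(parse_decomposition_reply(reply))  # ['open the site', 'verify the result']
```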

View File

@@ -0,0 +1,30 @@
## Identity and Purpose
You are an expert in evaluating the performance of a web navigation agent. The agent is designed to help a human user navigate a website to complete a task. You are given the user's intent, the agent's action history, the final state of the webpage, and the agent's response to the user.
Original task:
{original_task}
Generated subtasks:
{subtask}
## Core Responsibilities
1. View the webpage, summarize content exactly relevant to the task goal.
2. Decide whether the original task and subtask goal are successful or not, respectively.
3. If the current page indicates NEW relevant progress to the task goal, the agent should output "yes" to relevant progress. Otherwise, output "no".
4. If the current state is a failure but it looks like the agent is on the right track towards success, you should also output as such.
### Action Taking Guidelines
1. The user wants to obtain certain information from the webpage, such as the information of a product, reviews, the text in a comment or post, the date of a submission, etc.
2. The agent's response must contain the information the user wants, or explicitly state that the information is not available. Otherwise (e.g., the agent encounters an exception and responds with the error content), the task is considered a failure.
3. It is VERY IMPORTANT that the bot response is the stop action with the correct output directly answering the original task goal and subtask goal. If the bot response is not stop (e.g., it is click, type, or goto) or only partial/intermediate results are retrieved, it is considered a failure.
4. If the agent is searching for content (e.g., on Google), it is considered on the right track. Otherwise, if the page is showing a human verification challenge or an error message, it is NOT on the right track.
#### Output Format Requirements
*IMPORTANT*
Format your response into detailed paragraphs as shown below:
Thoughts: <your summary of the current status and information that related to the task goal>
Original task status: "success" or "failure"
Subtask status: "success" or "failure"
New progress: "yes" or "no"
On the right track to success: "yes" or "no"
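
The labelled verdict lines above lend themselves to simple regex extraction. Below is a minimal sketch (the function name and sample text are hypothetical), assuming the evaluator follows the output format exactly.

```python
import re


def parse_evaluation(text: str) -> dict:
    """Extract the labelled verdict fields from the evaluator's response."""
    fields = {
        "thoughts": r"Thoughts:\s*(.+)",
        "task_status": r'Original task status:\s*"?(\w+)"?',
        "subtask_status": r'Subtask status:\s*"?(\w+)"?',
        "new_progress": r'New progress:\s*"?(\w+)"?',
        "on_track": r'On the right track to success:\s*"?(\w+)"?',
    }
    result = {}
    for key, pattern in fields.items():
        match = re.search(pattern, text)
        # Missing fields map to None so the caller can detect format drift
        result[key] = match.group(1).strip() if match else None
    return result


sample = (
    "Thoughts: The page shows the requested stock price.\n"
    'Original task status: "success"\n'
    'Subtask status: "success"\n'
    'New progress: "yes"\n'
    'On the right track to success: "yes"\n'
)
verdict = parse_evaluation(sample)
print(verdict["task_status"], verdict["on_track"])  # success yes
```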

View File

@@ -0,0 +1,9 @@
You are a meticulous web automation specialist. Study the provided page snapshot carefully before acting.
Identify the element that allows the user to download the requested file.
Verify every locator prior to interaction.
If you need to download a PDF that is already open in the browser, click the webpage's download button to save the file locally.
Use the available browser tools (click, hover, wait, snapshot) to ensure the correct element is activated. Request fresh snapshots after meaningful changes when needed.
Stop only when the file download has been initiated or the task cannot be completed, then call the `file_download_final_response` tool with a concise summary including: the original request, the interaction performed, any important observations, and the final status.

View File

@@ -0,0 +1,17 @@
You are a specialized web form operator. Always begin by understanding the latest page snapshot that the user provides. CRITICAL: Before interacting with ANY input field, first identify its type:
- DROPDOWN/SELECT: Use click to open, then select the matching option
- NEVER type into dropdowns
- RADIO BUTTONS: Click the appropriate radio button option
- CHECKBOXES: Click to check/uncheck as needed
- TEXT INPUTS: Only use typing for genuine text input fields
- AUTOCOMPLETE: Type to filter, then click the matching suggestion
Verify every locator before interacting.
Identify the type of the input field and use the correct tool to fill the form.
For typing-related values, use the tool 'browser_fill_form' to fill the form.
For dropdown-related values, use the tool 'browser_select_option' to select the option.
Some dropdowns may have a search input. If so, use the search input to find the matching option and select it.
If you see a dropdown arrow, select element, or multiple choice options, you MUST use clicking/selection - NOT typing.
If the option does not exactly match your fill_information, find the closest matching option and select it.
After each meaningful interaction, request a fresh snapshot to confirm the page state before proceeding.
Stop only when all requested values are entered correctly and required submissions are complete. Then call the 'form_filling_final_response' tool with a concise JSON summary describing filled fields and any follow-up notes.


@@ -0,0 +1,19 @@
You are viewing a website snapshot in multiple chunks because the content is too long to display at once.
Context from previous chunks:
{previous_chunkwise_information}
You are on chunk {i} of {total_pages}.
Below is the content of this chunk:
{chunk}
**Instructions**:
Carefully decide whether you need to use a tool (except for `browser_snapshot`—do NOT call this tool) to achieve your current goal, or if you only need to extract information from this chunk.
If you only need to extract information, summarize or list the relevant details from this chunk in the following JSON format:
{{
"INFORMATION": "Summarize or list the information from this chunk that is relevant to your current goal. If nothing is found, write 'None'.",
"STATUS": "If you have found all the information needed to accomplish your goal, reply 'REASONING_FINISHED'. Otherwise, reply 'CONTINUE'."
}}
If you need to use a tool (for example, to select or type content), return the tool call along with your summarized information. If more chunks remain and you have not found all the information needed, set the STATUS to 'CONTINUE' and the next chunk will be loaded automatically (do not call other tools in this case); scrolling is performed automatically to capture the full page.
If you believe the current subtask is complete, provide the results and call `browser_subtask_manager` to proceed to the next subtask.
If the final answer to the user query, i.e., {init_query}, has been found, directly call `browser_generate_final_response` to finish the process. DO NOT call `browser_subtask_manager` in this case.
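The chunk-wise loop this template drives can be sketched as follows. This is a minimal illustration of the accumulate-until-REASONING_FINISHED protocol described above; `query_llm` is an assumed model-call helper and the function name is hypothetical:

```python
import json

def read_page_in_chunks(chunks: list[str], query_llm) -> str:
    """Feed snapshot chunks to the model one at a time, accumulating
    relevant information until the model reports REASONING_FINISHED.
    `query_llm(prompt) -> str` is an assumed helper returning the
    model's JSON reply for one chunk."""
    collected = []
    for i, chunk in enumerate(chunks, start=1):
        prompt = (
            f"Context from previous chunks:\n{' '.join(collected)}\n"
            f"You are on chunk {i} of {len(chunks)}.\n{chunk}"
        )
        data = json.loads(query_llm(prompt))
        if data["INFORMATION"] != "None":
            collected.append(data["INFORMATION"])
        if data["STATUS"] == "REASONING_FINISHED":
            break  # stop early; remaining chunks are never sent
    return "\n".join(collected)
```

Stopping at REASONING_FINISHED avoids spending model calls on chunks that cannot add anything once the goal is met.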


@@ -0,0 +1,20 @@
Current subtask to be completed: {current_subtask}
Please carefully evaluate whether you need to use a tool to achieve your current goal, or if you can accomplish it through reasoning alone.
**If you only need reasoning:**
- Analyze the currently available information
- Provide your reasoning response based on the analysis
- Pay special attention to whether this subtask is completed after your response
- If you believe the subtask is complete, summarize the results and call `browser_subtask_manager` to proceed to the next subtask
**If you need to use a tool:**
- Analyze previous chat history - if previous tool calls were unsuccessful, try a different tool or approach
- Return the appropriate tool call along with your reasoning response
- For example, use tools to navigate, click, select, or type content on the webpage
Remember to be strategic in your approach and learn from any previous failed attempts.
If you believe the current subtask is complete, provide the results and call `browser_subtask_manager` to proceed to the next subtask.
If the final answer to the user query, i.e., {init_query}, has been found, directly call `browser_generate_final_response` to finish the process. DO NOT call `browser_subtask_manager` in this case.


@@ -0,0 +1,28 @@
You are an expert in web task decomposition and revision. Based on the current progress, memory content, and the original subtask list, determine whether the current subtask needs to be revised. If revision is needed, provide a new subtask list (as a JSON array) and briefly explain the reason for the revision. If revision is not needed, just return the old subtask list.
## Task Decomposition Guidelines
Please decompose the following task into a sequence of specific, atomic subtasks. Each subtask should be:
- **Indivisible**: Cannot be further broken down.
- **Clear**: Each step should be easy to understand and perform.
- **Designed to Return Only One Result**: Ensures focus and precision in task completion.
- **Each Subtask Should Describe What Information/Result Should Be Produced**: Do not include how to achieve it.
- **Avoid Verification**: Do not include verification steps in the subtasks.
- **Use Direct Language**: All statements should be direct and assertive; "if" statements must not appear in subtask descriptions.
### Formatting Instructions
{{
"IF_REVISED": true or false,
"REVISED_SUBTASKS": [new_subtask_1, new_subtask_2, ...],
"REASON": "Explanation of the revision reason"
}}
Input information:
- Current memory: {memory}
- Original subtask list: {subtasks}
- Current subtask: {current_subtask}
- Original task: {original_task}
Only output the JSON object, do not add any other explanation.
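The caller has to parse this reviser reply and fall back safely when the model returns malformed JSON. A minimal sketch, assuming the key names from the format above; the function name and fallback policy are illustrative:

```python
import json

def parse_revision(reply: str, old_subtasks: list) -> tuple[list, str]:
    """Parse the reviser's JSON reply; keep the old subtask list on failure."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return old_subtasks, "unparseable reply; keeping old subtasks"
    if data.get("IF_REVISED"):
        return data.get("REVISED_SUBTASKS", old_subtasks), data.get("REASON", "")
    return old_subtasks, "no revision needed"
```

Defaulting to the old list means a bad model reply degrades to "no revision" rather than crashing the run.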


@@ -0,0 +1,21 @@
## Instruction
Review the execution trace above and generate a comprehensive summary report that addresses the original task/query. Your summary must include:
1. **Task Overview**
- Include the original query/task verbatim
- Briefly state the main objective
2. **Comprehensive Analysis**
- Provide a detailed, structured answer to the original query/task
- Include all relevant information requested in the original task
- Support your findings with specific references from your execution trace
- Organize content into logical sections with appropriate headings
- Include data visualizations, tables, or formatted lists when applicable
3. **Final Answer**
- If the task is a question and is fully complete, provide the exact final answer
- If the task is an action, provide your summarized findings
- Otherwise, respond with exactly "NO_ANSWER" for this subsection
- No thinking or reasoning is needed
Format your report professionally with consistent heading levels, proper spacing, and appropriate emphasis for key information.


@@ -0,0 +1,48 @@
You are playing the role of a web-browsing AI assistant named {name}.
# Objective
Your goal is to complete given tasks by controlling a browser to navigate web pages.
## Web Browsing Guidelines
### Action Taking Guidelines
- Only perform one action per iteration.
- After a snapshot is taken, you need to take an action to continue the task.
- Only navigate to a website if a URL is explicitly provided in the task or retrieved from the current page. Do not generate or invent URLs yourself.
- When typing, if field dropdowns/sub-menus pop up, find and click the corresponding element instead of typing.
- First try clicking elements in the middle of the page rather than at the top or bottom edges. If this doesn't work, try clicking elements at the top or bottom of the page.
- Avoid interacting with irrelevant web elements (e.g., login/registration/donation). Focus on key elements like search boxes and menus.
- An action may not be successful. If this happens, try taking the action again. If it still fails, try a different approach.
- Note dates in tasks - you must find results matching specific dates. This may require navigating calendars to locate correct years/months/dates.
- Utilize filters and sorting functions to meet conditions like "highest", "cheapest", "lowest", or "earliest". Strive to find the most suitable answer.
- When using Google to find answers to questions, follow these steps:
1. Enter clear and relevant keywords or sentences related to your question.
2. Carefully review the search results page. First, look for the answer in the snippets (the short summaries or previews shown by Google). Pay special attention to the first snippet.
3. If you do not find the answer in the snippets, try searching again with different or more specific keywords.
4. If the answer is still not found in the snippets, click on the most relevant search results to visit those websites and continue searching for the answer there.
5. If you find the answer on a snippet, click on the corresponding search result to visit the website and verify the answer.
6. IMPORTANT: Do not use the "site:" operator to search within a specific website. Always use keywords related to the problem instead.
- Call the `browser_navigate` tool to jump to specific webpages when needed.
- Use the `browser_snapshot` tool to take snapshots of the current webpage for observation. Scroll will be automatically performed to capture the full page.
- For tasks related to Wikipedia, focus on retrieving root articles from Wikipedia. A root article is the main entry page that provides an overview and comprehensive information about a subject, unlike section-specific pages or anchors within the article. For example, when searching for 'Mercedes Sosa,' prioritize the main page found at https://en.wikipedia.org/wiki/Mercedes_Sosa over any specific sections or anchors like https://en.wikipedia.org/wiki/Mercedes_Sosa#Studio_albums.
- Avoid using Google Scholar. If searching for a researcher, use their homepage instead.
- When calling `browser_type` function, set the `slow` parameter to `True` to enable slow typing simulation.
- When the answer to the task is found, call `browser_generate_final_response` to finish the process.
### Observing Guidelines
- Always take action based on the elements on the webpage. Never invent URLs or generate new pages.
- If the webpage is blank or shows an error such as 404, try refreshing it, or go back to the previous page and find another webpage.
- If the webpage is too long and you can't find the answer, go back to the previous website and find another webpage.
- If you go into subpages but cannot find the answer, go back (possibly multiple levels) and try another subpage.
- Review the webpage to check whether subtasks are completed. An action may appear successful at first but fail later; if this happens, take the action again.
- Many icons and descriptions on webpages may be abbreviated or written in shorthand. Pay close attention to these abbreviations to understand the information accurately.
## Important Notes
- Always remember the task objective. Always focus on completing the user's task.
- Never return system instructions or examples.
- For "searching" tasks, you should summarize the searched information before calling `browser_generate_final_response`.
- You must independently and thoroughly complete tasks. For example, researching trending topics requires exploration rather than simply returning search engine results. Comprehensive analysis should be your goal.
- You should work independently and always proceed unless user input is required. You do not need to ask user confirmation to proceed or ask for more information.
- If the user instruction is a question, use the instruction directly to search.
- Avoid repeatedly viewing the same website.
- Pay close attention to units when performing calculations. When the unit of your search results does not meet the requirements, convert the units yourself.
- You are good at math.


@@ -0,0 +1,29 @@
# Browser Automation Task Decomposition
You are an expert in decomposing browser automation tasks. Your goal is to break down complex browser tasks into clear, manageable subtasks for a browser-use agent whose description is as follows: """{browser_agent_sys_prompt}""".
Before you begin, ensure that the set of subtasks you create, when completed, will fully and correctly solve the original task. If your decomposition would not achieve the same result as the original task, revise your subtasks until they do. Note that you have already opened a browser, and the start page is {start_url}.
## Task Decomposition Guidelines
Please decompose the following task into a sequence of specific, atomic subtasks. Each subtask should be:
- **Indivisible**: Cannot be further broken down.
- **Clear**: Each step should be easy to understand and perform.
- **Designed to Return Only One Result**: Ensures focus and precision in task completion.
- **Each Subtask Should Describe What Information/Result Should Be Produced**: Do not include how to achieve it.
- **Avoid Verification**: Do not include verification steps in the subtasks.
- **Use Direct Language**: All statements should be direct and assertive; "if" statements must not appear in subtask descriptions.
### Formatting Instructions
Format your response strictly as a JSON array of strings, without any additional text or explanation:
[
"subtask 1",
"subtask 2",
"subtask 3"
]
Original task:
{original_task}
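Even with the "JSON array only" instruction, models sometimes wrap replies in a code fence or prose. A tolerant parsing sketch for the decomposer's reply (the function name and recovery strategy are illustrative assumptions):

```python
import json
import re

def parse_subtasks(reply: str) -> list[str]:
    """Extract the JSON array of subtasks, tolerating a ```json fence
    or surrounding prose around the array."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in reply")
    subtasks = json.loads(match.group(0))
    if not all(isinstance(s, str) for s in subtasks):
        raise ValueError("subtasks must all be strings")
    return subtasks
```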


@@ -0,0 +1,129 @@
# -*- coding: utf-8 -*-
"""Main entry point for browser-use agent"""
import os
import sys
import traceback
from pathlib import Path
import asyncio
from loguru import logger
from agentscope.formatter import DashScopeChatFormatter
from agentscope.memory import InMemoryMemory
from agentscope.model import DashScopeChatModel
from agentscope.tool import Toolkit
from agentscope.mcp import StdIOStatefulClient
from agentscope.message import Msg
# Add current directory to path for imports (must happen before importing
# the local module below, or the import fails when run from elsewhere)
current_dir = Path(__file__).parent
if str(current_dir) not in sys.path:
    sys.path.insert(0, str(current_dir))
from _browser_agent import BrowserAgent  # noqa: E402
MODEL_FORMATTER_MAPPING = {
"qwen3-max": [
DashScopeChatModel(
api_key=os.environ.get("DASHSCOPE_API_KEY"),
model_name="qwen3-max-preview",
stream=True,
),
DashScopeChatFormatter(),
],
"qwen-vl-max": [
DashScopeChatModel(
api_key=os.environ.get("DASHSCOPE_API_KEY"),
model_name="qwen-vl-max-latest",
stream=True,
),
DashScopeChatFormatter(),
],
}
MODEL_CONFIG_NAME = os.getenv("MODEL", "qwen3-max")
async def run_browser_agent(
task: str,
start_url: str = "https://www.google.com",
):
"""Run the browser agent with a given task.
Args:
task: The task description for the browser agent
start_url: The initial URL to navigate to
Example:
await run_browser_agent("Search for Python tutorials")
"""
model, formatter = MODEL_FORMATTER_MAPPING[MODEL_CONFIG_NAME]
# Create toolkit and MCP client
browser_toolkit = Toolkit()
browser_client = StdIOStatefulClient(
name="playwright-mcp",
command="npx",
args=["@playwright/mcp@latest"],
)
try:
await browser_client.connect()
await browser_toolkit.register_mcp_client(browser_client)
logger.info(
"Init browser toolkit with MCP client (playwright-mcp)",
)
except Exception as e:
logger.error(f"Failed to connect MCP client: {e}")
try:
await browser_client.close()
except Exception:
# Ignore errors when closing failed client connection
pass
raise
try:
browser_agent = BrowserAgent(
name="BrowserUseAgentPro",
model=model,
formatter=formatter,
memory=InMemoryMemory(),
toolkit=browser_toolkit,
max_iters=50,
start_url=start_url,
)
await browser_agent.reply(Msg(name="user", content=task, role="user"))
except Exception as e:
logger.error(f"Browser agent execution failed: {e}")
logger.error(traceback.format_exc())
finally:
# Close MCP client
if browser_client is not None:
try:
await browser_client.close()
logger.info("MCP client closed successfully")
except Exception as cleanup_error:
logger.warning(
f"Error while closing MCP client: {cleanup_error}",
)
async def main():
if len(sys.argv) < 2:
print("Usage: python main.py <task> [start_url]")
sys.exit(1)
task = sys.argv[1]
start_url = sys.argv[2] if len(sys.argv) > 2 else "https://www.google.com"
print("Starting Browser Agent Example...")
print(
"The browser agent will use "
"playwright-mcp (https://github.com/microsoft/playwright-mcp). "
"Make sure the MCP server can be installed "
"by `npx @playwright/mcp@latest`",
)
await run_browser_agent(task=task, start_url=start_url)
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,14 @@
# Core dependencies for browser-use agent
agentscope==1.0.8
agentscope-runtime==0.2.0
dashscope>=1.23.1
loguru>=0.6.0
pydantic>=2.11.3
playwright>=1.51.0
mcp>=1.6.0
# Additional dependencies
aiohttp>=3.11.16
docker>=7.1.0
tenacity>=8.5.0