diff --git a/README.md b/README.md index 5d4a447..d33fbbd 100644 --- a/README.md +++ b/README.md @@ -62,6 +62,7 @@ It includes **agent deployment** and **secure sandboxed tool execution**, and ca ├── alias/ # Agent to solve real-world problems ├── browser_use/ │ ├── agent_browser/ # Pure Python browser agent +│ ├── browser_use_agent_pro/ # Advanced pure python browser agent │ └── browser_use_fullstack_runtime/ # Full-stack runtime version with frontend/backend │ ├── deep_research/ @@ -93,6 +94,7 @@ It includes **agent deployment** and **secure sandboxed tool execution**, and ca | ----------------------- |-------------------------------------------------------| --------------- | ------------ |--------------------------------------------------| | **Data Processing** | data_juicer_agent/ | ✅ | ❌ | Multi-agent data processing with Data-Juicer | | **Browser Use** | browser_use/agent_browser | ✅ | ❌ | Command-line browser automation using AgentScope | +| | browser_use/browser_use_agent_pro | ✅ | ❌ | Advanced command-line Python browser agent using AgentScope | | | browser_use/browser_use_fullstack_runtime | ✅ | ✅ | Full-stack browser automation with UI & sandbox | | **Deep Research** | deep_research/agent_deep_research | ✅ | ❌ | Multi-agent research pipeline | | | deep_research/qwen_langgraph_search_fullstack_runtime | ❌ | ✅ | Full-stack deep research app | @@ -183,7 +185,7 @@ This project is licensed under the **Apache 2.0 License** – see the [LICENSE]( ## Contributors ✨ -Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)): +Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/emoji-key/)): diff --git a/README_zh.md b/README_zh.md index 73a8f9f..4b0957e 100644 --- a/README_zh.md +++ b/README_zh.md @@ -62,7 +62,8 @@ AgentScope Runtime 是一个**全面的运行时框架**,主要解决部署和 ├── alias/ # 解决现实问题的智能体程序 ├── browser_use/ │ ├── agent_browser/ # 纯 Python 浏览器 Agent -│ └── browser_use_fullstack_runtime/ # 全栈运行时版本(前端+后端) +│ 
├── browser_use_agent_pro/ # 高级纯 Python 浏览器 Agent +│ └── browser_use_fullstack_runtime/ # 全栈运行时版本(前端+后端) │ ├── deep_research/ │ ├── agent_deep_research/ # 纯 Python 多 Agent 研究流程 @@ -93,6 +94,7 @@ AgentScope Runtime 是一个**全面的运行时框架**,主要解决部署和 |-----------|-------------------------------------------------------|---------------|-----------------------|-------------------------| | **数据处理** | data_juicer_agent/ | ✅ | ❌ | 基于 Data-Juicer 的多智能体数据处理 | | **浏览器相关** | browser_use/agent_browser | ✅ | ❌ | 基于 AgentScope 的命令行浏览器自动化 | +| | browser_use/browser_use_agent_pro | ✅ | ❌ | 基于 AgentScope 的高级命令行浏览器智能体 | | | browser_use/browser_use_fullstack_runtime | ✅ | ✅ | 带 UI 和沙盒环境的全栈浏览器自动化 | | **深度研究** | deep_research/agent_deep_research | ✅ | ❌ | 多 Agent 研究流程 | | | deep_research/qwen_langgraph_search_fullstack_runtime | ❌ | ✅ | 全栈运行时深度研究应用 | @@ -181,7 +183,7 @@ AgentScope Runtime 是一个**全面的运行时框架**,主要解决部署和 ## 贡献者 ✨ -感谢这些优秀的贡献者们 ([表情符号说明](https://allcontributors.org/docs/en/emoji-key)): +感谢这些优秀的贡献者们 ([表情符号说明](https://allcontributors.org/emoji-key/)): diff --git a/browser_use/browser_use_agent_pro/README.md b/browser_use/browser_use_agent_pro/README.md new file mode 100644 index 0000000..d5d6b5e --- /dev/null +++ b/browser_use/browser_use_agent_pro/README.md @@ -0,0 +1,70 @@ +# Browser Use Agent Pro + +A powerful, standalone browser automation agent built on top of [AgentScope](https://github.com/agentscope-ai/agentscope) and [Playwright MCP](https://github.com/microsoft/playwright-mcp). This agent provides intelligent web automation capabilities through natural language instructions. + +Browser Use Agent Pro excels at automating a wide range of web-based tasks, including web research, form automation, e-commerce operations, content management, testing, and workflow automation. 
+ +## ✨ Key Features + +- **Multimodal Understanding** + - Image understanding: Analyze and interact with visual elements on web pages + - Video understanding: Extract frames, transcribe audio, and analyze video content + - Form filling: Automatically fill web forms based on natural language instructions + - File Download: Locate and trigger file downloads from web pages + +- **Task Decomposition and Management** + - Automatic task decomposition: Break down complex tasks into manageable subtasks + - Subtask tracking: Monitor and manage progress through multiple subtasks + - Dynamic subtask revision: Adapt and refine subtasks based on execution results + - Task completion validation: Verify when subtasks and overall tasks are completed + +- **Advanced Reasoning** + - Pure reasoning: Plan actions without page observation + - Observation-based reasoning: Analyze page content before making decisions + - Chunked observation: Process large page snapshots in manageable chunks + +- **Memory Management** + - Automatic memory summarization: Condense conversation history when it exceeds limits + - Tool output filtering: Clean and filter verbose tool execution results + + +## 📋 Requirements + +- Python 3.10+ +- Node.js and npx (for playwright-mcp) +- DASHSCOPE_API_KEY environment variable + +The playwright-mcp server will automatically handle browser installation when first run via `npx @playwright/mcp@latest`. No manual browser installation is required. + +## 💻 Installation + +1. Install dependencies: +```bash +# From the project root directory +pip install -r requirements.txt +``` + +2. Ensure Node.js and npx are installed (required for playwright-mcp): +```bash +# Check if npx is available +npx --version +``` + +3. 
Set up environment variables: +```bash +export DASHSCOPE_API_KEY="your-api-key" +export MODEL="qwen3-max" # or "qwen-vl-max" for vision model +``` + +## 🚀 Basic Usage + +Run the Browser Use Agent Pro with a task, optionally configure the start URL: + +```bash +# From the project root directory +python main.py "Find the latest stock price of Alibaba Group" "https://www.google.com" +``` + +## ℹ️ Note + +This is a standalone version extracted from the [Alias-Agent](https://github.com/agentscope-ai/agentscope-samples/tree/main/alias) project. It now uses standard agentscope components (ReActAgent, Toolkit) with local Playwright MCP clients. diff --git a/browser_use/browser_use_agent_pro/_browser_agent.py b/browser_use/browser_use_agent_pro/_browser_agent.py new file mode 100644 index 0000000..3cceaed --- /dev/null +++ b/browser_use/browser_use_agent_pro/_browser_agent.py @@ -0,0 +1,1292 @@ +# -*- coding: utf-8 -*- +"""Browser Agent""" +# flake8: noqa: E501 +# pylint: disable=W0212 +# pylint: disable=too-many-lines +# pylint: disable=C0301 +import re +import uuid +import os +import json +import inspect +from functools import wraps +from typing import Type, Optional, Any +import asyncio +import copy +from loguru import logger +from pydantic import BaseModel + +from agentscope.formatter import FormatterBase +from agentscope.memory import MemoryBase +from agentscope.message import ( + Msg, + ToolUseBlock, + TextBlock, + ImageBlock, + Base64Source, +) +from agentscope.agent import ReActAgent +from agentscope.model import ChatModelBase +from agentscope.tool import ( + ToolResponse, + Toolkit, +) +from agentscope.token import TokenCounterBase, OpenAITokenCounter + +from _build_in_helper_browser._image_understanding import ( + image_understanding, +) +from _build_in_helper_browser._video_understanding import ( + video_understanding, +) +from _build_in_helper_browser._file_download import ( + file_download, +) +from _build_in_helper_browser._form_filling import ( + 
form_filling,
+)
+
+
+# Get the directory of the current file
+_CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+
+def _load_prompt(filename: str) -> str:
+    """Read a prompt template from ``_build_in_prompt_browser``."""
+    with open(
+        os.path.join(_CURRENT_DIR, "_build_in_prompt_browser", filename),
+        "r",
+        encoding="utf-8",
+    ) as f:
+        return f.read()
+
+
+_BROWSER_AGENT_DEFAULT_SYS_PROMPT = _load_prompt(
+    "browser_agent_sys_prompt.md",
+)
+_BROWSER_AGENT_DEFAULT_PURE_REASONING_PROMPT = _load_prompt(
+    "browser_agent_pure_reasoning_prompt.md",
+)
+_BROWSER_AGENT_DEFAULT_OBSERVE_REASONING_PROMPT = _load_prompt(
+    "browser_agent_observe_reasoning_prompt.md",
+)
+_BROWSER_AGENT_DEFAULT_TASK_DECOMPOSITION_PROMPT = _load_prompt(
+    "browser_agent_task_decomposition_prompt.md",
+)
+_BROWSER_AGENT_SUMMARIZE_TASK_PROMPT = _load_prompt(
+    "browser_agent_summarize_task.md",
+)
+
+DEFAULT_BROWSER_WORKER_NAME = "browser_agent"
+
+
+async def browser_pre_reply_hook(
+    self,
+    kwargs: dict[str, Any],
+):
+    """Pre-reply hook: initial navigation and task decomposition.
+
+    Expects kwargs["msg"] to be a Msg (falling back to the latest message
+    in memory). The message is rewritten via task decomposition and
+    appended to memory; the hook itself returns nothing.
+ """ + msg = kwargs.get("msg") + # for the case directly using session service + if msg is None: + msg = (await self.memory.get_memory())[-1] + if self.start_url and not self._has_initial_navigated: + await self._navigate_to_start_url() + self._has_initial_navigated = True + msg = await self._task_decomposition_and_reformat(msg) + await self.memory.add(msg) + + +async def browser_post_acting_hook( + self, + kwargs: dict[str, Any], # pylint: disable=W0613 + output: Any, # pylint: disable=W0613 +): + """ + Hook func for cleaning the messy return after action. + Observation will be done before reasoning steps. + """ + mem_msgs = await self.memory.get_memory() + mem_length = await self.memory.size() + if len(mem_msgs) == 0: + return + tool_res_msg = mem_msgs[-1] + for i, b in enumerate(tool_res_msg.content): + if b["type"] == "tool_result": + for j, return_json in enumerate(b.get("output", [])): + if isinstance(return_json, dict) and "text" in return_json: + tool_res_msg.content[i]["output"][j][ + "text" + ] = self._filter_execution_text(return_json["text"]) + await self.print(tool_res_msg) + await self.memory.delete(mem_length - 1) + await self.memory.add(tool_res_msg) + + +class BrowserAgent(ReActAgent): + """ + Browser Agent that extends ReActAgent with browser-specific capabilities. + + The agent leverages MCP (Model Context Protocol) servers to access browser + tools with Playwright, enabling sophisticated web automation tasks. + + Example: + .. 
code-block:: python + + agent = BrowserAgent( + name="web_navigator", + model=my_chat_model, + formatter=my_formatter, + memory=my_memory, + toolkit=browser_toolkit, + start_url="https://example.com" + ) + + response = await agent.reply("Search for Python tutorials") + """ + + def __init__( + self, + name: str = DEFAULT_BROWSER_WORKER_NAME, + model: ChatModelBase | None = None, + formatter: FormatterBase | None = None, + memory: MemoryBase | None = None, + toolkit: Toolkit | None = None, + sys_prompt: str = _BROWSER_AGENT_DEFAULT_SYS_PROMPT, + max_iters: int = 50, + start_url: Optional[str] = "https://www.google.com", + pure_reasoning_prompt: str = _BROWSER_AGENT_DEFAULT_PURE_REASONING_PROMPT, + observe_reasoning_prompt: str = _BROWSER_AGENT_DEFAULT_OBSERVE_REASONING_PROMPT, + task_decomposition_prompt: str = ( + _BROWSER_AGENT_DEFAULT_TASK_DECOMPOSITION_PROMPT + ), + token_counter: TokenCounterBase = OpenAITokenCounter("gpt-4o"), + max_mem_length: int = 20, + ) -> None: + """Initialize the Browser Agent. + + Args: + name (str): + The unique identifier name for the agent instance. + Defaults to DEFAULT_BROWSER_WORKER_NAME. + model (ChatModelBase): + The chat model used for generating responses and reasoning. + formatter (FormatterBase): + The formatter used to convert messages into the required format + for the model API. + memory (MemoryBase): + The memory component used to store and retrieve dialogue + history. + toolkit (Toolkit): + A toolkit object containing the browser tool functions and + utilities. + sys_prompt (str, optional): + The system prompt that defines the agent's behavior and + personality. + Defaults to _BROWSER_AGENT_DEFAULT_SYS_PROMPT. + max_iters (int, optional): + The maximum number of reasoning-acting loop iterations. + Defaults to 50. + start_url (Optional[str], optional): + The initial URL to navigate to when the agent starts. + Defaults to "https://www.google.com". 
+ pure_reasoning_prompt (str, optional): + The prompt used during pure reasoning phase. + observe_reasoning_prompt (str, optional): + The prompt used during observation reasoning phase. + task_decomposition_prompt (str, optional): + The prompt used for task decomposition. + token_counter (TokenCounterBase, optional): + Token counter for estimating token usage. + max_mem_length (int, optional): + Maximum memory length before summarization. + Defaults to 20. + + Returns: + None + """ + if ( + model is None + or formatter is None + or memory is None + or toolkit is None + ): + raise ValueError( + "model, formatter, memory, and toolkit are required parameters", + ) + + self.start_url = start_url + self._has_initial_navigated = False + self.pure_reasoning_prompt = pure_reasoning_prompt + self.observe_reasoning_prompt = observe_reasoning_prompt + self.task_decomposition_prompt = task_decomposition_prompt + self.max_memory_length = max_mem_length + self.token_estimator = token_counter + self.snapshot_chunk_id = 0 + self.chunk_continue_status = False + self.previous_chunkwise_information = "" + self.snapshot_in_chunk = [] + self.subtasks = [] + self.original_task = "" + self.current_subtask_idx = 0 + self.current_subtask = None + self.iter_n = 0 + self.finish_function_name = "browser_generate_final_response" + self.init_query = "" + self._required_structured_model: Type[BaseModel] | None = None + sys_prompt = sys_prompt.format(name=name) + super().__init__( + name=name, + sys_prompt=sys_prompt, + model=model, + formatter=formatter, + memory=memory, + toolkit=toolkit, + max_iters=max_iters, + ) + + self.toolkit.register_tool_function(self.browser_subtask_manager) + if self._supports_multimodal(): + self._register_skill_tool(image_understanding) + self._register_skill_tool(video_understanding) + + self._register_skill_tool(file_download) + self._register_skill_tool(form_filling) + + self.no_screenshot_tool_list = [ + tool + for tool in self.toolkit.get_json_schemas() + if 
tool.get("function", {}).get("name") + not in ["browser_take_screenshot"] + ] + + # Register hooks + self.register_instance_hook( + "pre_reply", + "browser_pre_reply_hook", + browser_pre_reply_hook, + ) + self.register_instance_hook( + "post_acting", + "browser_post_acting_hook", + browser_post_acting_hook, + ) + + def _register_skill_tool( + self, + skill_func: Any, + ) -> None: + """Bind the browser agent to a skill function and register it as a tool.""" + + if asyncio.iscoroutinefunction(skill_func): + + @wraps(skill_func) + async def tool(*args, **kwargs): + return await skill_func( + browser_agent=self, + *args, + **kwargs, + ) + + else: + + @wraps(skill_func) + async def tool(*args, **kwargs): + return skill_func( + browser_agent=self, + *args, + **kwargs, + ) + + original_signature = inspect.signature(skill_func) + parameters = list(original_signature.parameters.values()) + if parameters and parameters[0].name == "browser_agent": + parameters = parameters[1:] + try: + tool.__signature__ = original_signature.replace( + parameters=parameters, + ) + except ValueError: + # Ignore errors during tool signature replacement + pass + self.toolkit.register_tool_function(tool) + + def _supports_multimodal(self) -> bool: + """Check if the model supports multimodal input (images/videos). + + Returns: + bool: True if the model supports multimodal input, False otherwise. + """ + return ( + self.model.model_name.startswith("qvq") + or "-vl" in self.model.model_name + or "4o" in self.model.model_name + or "gpt-5" in self.model.model_name + ) + + async def reply( + self, + msg: Msg | list[Msg] | None = None, + structured_model: Type[BaseModel] | None = None, + ) -> Msg: + """ + Process a message and return a response. + + Args: + msg (`Msg | list[Msg] | None`, optional): + The input message(s) to the agent. + structured_model (`Type[BaseModel] | None`, optional): + The required structured output model. 
If provided, the agent + is expected to generate structured output in the `metadata` + field of the output message. + + Returns: + Msg: The response message. + """ + self.init_query = ( + msg.content + if isinstance(msg, Msg) + else msg[0].content + if isinstance(msg, list) + else "" + ) + + self._required_structured_model = structured_model + # Record structured output model if provided + if structured_model: + self.toolkit.set_extended_model( + self.finish_function_name, + structured_model, + ) + # The reasoning-acting loop + reply_msg = None + for iter_n in range(self.max_iters): + self.iter_n = iter_n + 1 + await self._summarize_mem() + + msg_reasoning = await self._pure_reasoning() + tool_calls = msg_reasoning.get_content_blocks("tool_use") + if tool_calls and tool_calls[0]["name"] == "browser_snapshot": + msg_reasoning = await self._reasoning_with_observation() + + futures = [ + self._acting(tool_call) + for tool_call in msg_reasoning.get_content_blocks( + "tool_use", + ) + ] + + # Parallel tool calls or not + if self.parallel_tool_calls: + acting_responses = await asyncio.gather(*futures) + + else: + # Sequential tool calls + acting_responses = [await _ for _ in futures] + + # Find the first non-None replying message from the acting + for acting_msg in acting_responses: + reply_msg = reply_msg or acting_msg + + if reply_msg: + break + # When the maximum iterations are reached + if not reply_msg: + reply_msg = await self._summarizing() + + await self.memory.add(reply_msg) + return reply_msg + + async def _pure_reasoning( + self, + ) -> Msg: + msg = Msg( + "user", + content=self.pure_reasoning_prompt.format( + current_subtask=self.current_subtask, + init_query=self.original_task, + ), + role="user", + ) + + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *await self.memory.get_memory(), + msg, + ], + ) + + res = await self.model( + prompt, + tools=self.no_screenshot_tool_list, + ) + # handle output from the model + 
msg = None + if self.model.stream: + msg = Msg(self.name, [], "assistant") + async for content_chunk in res: + msg.content = content_chunk.content + await self.print(msg) + else: + msg = Msg(self.name, list(res.content), "assistant") + await self.print(msg) + + await self.memory.add(msg) + return msg + + async def _reasoning_with_observation( + self, + ) -> Msg: + """Perform the reasoning process.""" + self.snapshot_chunk_id = 0 + self.chunk_continue_status = False + self.previous_chunkwise_information = "" + self.snapshot_in_chunk = [] + + mem_len = await self.memory.size() + await self.memory.delete(mem_len - 1) + + self.snapshot_in_chunk = await self._get_snapshot_in_text() + + for _ in self.snapshot_in_chunk: + observe_msg = await self._build_observation() + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *await self.memory.get_memory(), + observe_msg, + ], + ) + + res = await self.model( + prompt, + tools=self.no_screenshot_tool_list, + ) + # handle output from the model + msg = None + if self.model.stream: + msg = Msg(self.name, [], "assistant") + async for content_chunk in res: + msg.content = content_chunk.content + # await self.print(msg) + + else: + msg = Msg(self.name, list(res.content), "assistant") + # await self.print(msg) + logger.info(msg.content) + + await self._update_chunk_observation_status( + output_msg=msg, + ) + if not self.chunk_continue_status: + break + + await self.memory.add(msg) + return msg + + async def _summarize_mem( + self, + ) -> None: + """Summarize memory if too long""" + mem_len = await self.memory.size() + if mem_len > self.max_memory_length: + await self._memory_summarizing() + + async def _build_observation( + self, + ) -> Msg: + """Get a snapshot in text before reasoning""" + image_data: Optional[str] = None + if self._supports_multimodal(): + # If the model supports multimodal input, take a screenshot + # and pass it to the observation message as base64 + image_data = await 
self._get_screenshot() + + observe_msg = self.observe_by_chunk(image_data) + return observe_msg + + async def _update_chunk_observation_status( + self, + output_msg: Msg | None = None, + ) -> None: + """Update the chunk observation status after reasoning.""" + + for _, b in enumerate(output_msg.content): + if b["type"] == "text": + # obtain response content + raw_response = b["text"] + # parse the response content to check if + # it contains "REASONING_FINISHED" + try: + if "```json" in raw_response: + raw_response = raw_response.replace( + "```json", + "", + ).replace("```", "") + data = json.loads(raw_response) + information = data.get("INFORMATION", "") + self.chunk_continue_status = ( + data.get("STATUS") != "REASONING_FINISHED" + ) + except Exception: + # If JSON parsing fails, use raw response as information + information = raw_response + if ( + self.snapshot_chunk_id + < len(self.snapshot_in_chunk) - 1 + ): + self.chunk_continue_status = True + self.snapshot_chunk_id += 1 + else: + self.chunk_continue_status = False + + if not isinstance(information, str): + try: + information = json.dumps( + information, + ensure_ascii=False, + ) + except Exception: + # If JSON serialization fails, convert to string + information = str(information) + + self.previous_chunkwise_information += ( + f"Information in chunk {self.snapshot_chunk_id+1} " + f"of {len(self.snapshot_in_chunk)}:\n" + information + "\n" + ) + + if b["type"] == "tool_use": + self.chunk_continue_status = False + + async def _task_decomposition_and_reformat( # pylint: disable=too-many-statements + self, + original_task: Msg | list[Msg] | None, + ) -> Msg: + """ + Decompose the original task into smaller tasks and reformat it, with reflection. 
+ """ + if isinstance(original_task, list): + original_task = original_task[0] + + prompt = await self.formatter.format( + msgs=[ + Msg( + name="user", + content=self.task_decomposition_prompt.format( + start_url=self.start_url, + browser_agent_sys_prompt=self.sys_prompt, + original_task=original_task.content, + ), + role="user", + ), + ], + ) + res = await self.model(prompt) + decompose_text = "" + print_msg = Msg(name=self.name, content=[], role="assistant") + if self.model.stream: + async for content_chunk in res: + decompose_text = content_chunk.content[0]["text"] + print_msg.content = content_chunk.content + # await self.print(print_msg, False) + else: + decompose_text = res.content[0]["text"] + print_msg.content = [TextBlock(type="text", text=decompose_text)] + + # await self.print(print_msg, True) + logger.info(decompose_text) + + # Use path relative to this file for robustness + reflection_prompt_path = os.path.join( + _CURRENT_DIR, + "_build_in_prompt_browser/browser_agent_decompose_reflection_prompt.md", + ) + with open(reflection_prompt_path, "r", encoding="utf-8") as fj: + decompose_reflection_prompt = fj.read() + + reflection_prompt = await self.formatter.format( + msgs=[ + Msg( + name="user", + content=self.task_decomposition_prompt.format( + start_url=self.start_url, + browser_agent_sys_prompt=self.sys_prompt, + original_task=original_task.content, + ), + role="user", + ), + Msg( + name="system", + content=decompose_text, + role="system", + ), + Msg( + name="user", + content=decompose_reflection_prompt.format( + original_task=original_task.content, + subtasks=decompose_text, + ), + role="user", + ), + ], + ) + reflection_res = await self.model(reflection_prompt) + reflection_text = "" + print_msg = Msg(name=self.name, content=[], role="assistant") + if self.model.stream: + async for content_chunk in reflection_res: + reflection_text = content_chunk.content[0]["text"] + print_msg.content = content_chunk.content + # await self.print(print_msg, 
last=False) + else: + reflection_text = reflection_res.content[0]["text"] + print_msg.content = [TextBlock(type="text", text=reflection_text)] + # await self.print(print_msg, last=True) + logger.info(reflection_text) + + subtasks = [] + try: + if "```json" in reflection_text: + reflection_text = reflection_text.replace("```json", "") + reflection_text = reflection_text.replace("```", "") + subtasks_json = json.loads(reflection_text) + subtasks = subtasks_json.get("REVISED_SUBTASKS", []) + if not isinstance(subtasks, list): + subtasks = [] + except Exception: + # If parsing fails, use original task as single subtask + subtasks = [original_task.content] + + self.subtasks = subtasks + self.current_subtask_idx = 0 + self.current_subtask = self.subtasks[0] if self.subtasks else None + self.original_task = original_task.get_text_content() + + formatted_task = "The original task is: " + self.original_task + "\n" + try: + formatted_task += ( + "The decomposed subtasks are: " + + json.dumps(self.subtasks) + + "\n" + ) + formatted_task += ( + "use the decomposed subtasks to complete the original task.\n" + ) + except Exception as e: + logger.warning(f"Failed to format subtasks: {e}") + formatted_task = Msg( + name=original_task.name, + content=formatted_task, + role=original_task.role, + ) + logger.info(f"The formatted task is: \n{formatted_task.content}") + return formatted_task + + async def _navigate_to_start_url(self) -> None: + """ + Navigate to the specified start URL using the browser_navigate tool. + + This method is automatically called during the first interaction to + navigate to the configured start URL. It executes the browser + navigation tool and processes the response to ensure the + initial page is loaded. 
+
+        Returns:
+            None
+        """
+
+        tool_call = ToolUseBlock(
+            id=str(uuid.uuid4()),  # Generate a unique ID for the tool call
+            name="browser_tabs",
+            input={"action": "list"},
+            type="tool_use",
+        )
+        response = await self.toolkit.call_tool_function(tool_call)
+        response_text = ""
+        async for chunk in response:
+            response_text = chunk.content[0]["text"]
+
+        tab_numbers = re.findall(r"- (\d+):", response_text)
+        # Reduce the browser to a single tab: closing index 0 once per
+        # extra tab repeatedly removes the front tab, so only the last
+        # tab remains open before navigation.
+        for _ in tab_numbers[1:]:
+            tool_call = ToolUseBlock(
+                id=str(uuid.uuid4()),
+                name="browser_tabs",
+                input={"action": "close", "index": 0},
+                type="tool_use",
+            )
+            await self.toolkit.call_tool_function(tool_call)
+        tool_call = ToolUseBlock(
+            id=str(uuid.uuid4()),
+            type="tool_use",
+            name="browser_navigate",
+            input={"url": self.start_url},
+        )
+
+        # Execute the navigation tool
+        await self.toolkit.call_tool_function(tool_call)
+
+    async def _get_snapshot_in_text(self) -> list:
+        """Capture a text-based snapshot of the current webpage content.
+
+        This method uses the browser_snapshot tool to retrieve the current
+        webpage content in text format, which is used during the reasoning
+        phase to provide context about the current browser state.
+
+        Returns:
+            list: A list of text chunks representing the current webpage
+                content, including elements, structure, and visible text.
+
+        Note:
+            This method is called automatically during the reasoning phase
+            and provides essential context for decision-making about next
+            actions.
+ """ + snapshot_tool_call = ToolUseBlock( + type="tool_use", + id=str(uuid.uuid4()), # Generate a unique ID for the tool call + name="browser_snapshot", + input={}, # No parameters required for this tool + ) + snapshot_response = await self.toolkit.call_tool_function( + snapshot_tool_call, + ) + snapshot_str = "" + async for chunk in snapshot_response: + snapshot_str = chunk.content[0]["text"] + snapshot_in_chunk = self._split_snapshot_by_chunk( + snapshot_str, + ) + return snapshot_in_chunk + + async def _memory_summarizing(self) -> None: + """Summarize the current memory content to prevent context overflow. + + This method is called periodically to condense the conversation history + by generating a summary of progress and maintaining only essential + information. It preserves the initial user question and creates a + concise summary of what has been accomplished and what remains to be + done. + + Returns: + None + + Note: + This method is automatically called every 10 iterations to manage + memory usage and maintain context relevance. The summarization + helps prevent token limit issues while preserving important task + context. + """ + # Extract the initial user question + initial_question = None + memory_msgs = await self.memory.get_memory() + for msg in memory_msgs: + if msg.role == "user": + initial_question = msg.content + break + + # Generate a summary of the current progress + hint_msg = Msg( + "user", + ( + "Summarize the current progress and outline the next steps " + "for this task. Your summary should include:\n" + "1. What has been completed so far.\n" + "2. What key information has been found.\n" + "3. What remains to be done.\n" + "Ensure that your summary is clear, concise, and" + "that no tasks are repeated or skipped." 
+ ), + role="user", + ) + + # Format the prompt for the model + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *memory_msgs, + hint_msg, + ], + ) + + # Call the model to generate the summary + res = await self.model(prompt) + + # Handle response + summary_text = "" + print_msg = Msg(name=self.name, content=[], role="assistant") + if self.model.stream: + async for content_chunk in res: + summary_text = content_chunk.content[0]["text"] + print_msg.content = content_chunk.content + await self.print(print_msg, last=False) + else: + summary_text = res.content[0]["text"] + print_msg.content = [TextBlock(type="text", text=summary_text)] + await self.print(print_msg, last=True) + + # Update the memory with the summarized content + summarized_memory = [] + if initial_question: + summarized_memory.append( + Msg("user", initial_question, role="user"), + ) + summarized_memory.append( + Msg(self.name, summary_text, role="assistant"), + ) + + # Clear and reload memory + await self.memory.clear() + for msg in summarized_memory: + await self.memory.add(msg) + + async def _get_screenshot(self) -> Optional[str]: + """ + Optionally take a screenshot of the current web page for multimodal prompts. + Returns base64-encoded PNG data if available, else None. 
+ """ + try: + # Prepare tool call for screenshot + tool_call = ToolUseBlock( + id=str(uuid.uuid4()), + name="browser_take_screenshot", + input={}, + type="tool_use", + ) + # Execute tool call via service toolkit + screenshot_response = await self.toolkit.call_tool_function( + tool_call, + ) + # Extract image base64 from response + async for chunk in screenshot_response: + if ( + chunk.content + and len(chunk.content) > 1 + and "data" in chunk.content[1] + ): + image_data = chunk.content[1]["data"] + else: + image_data = None + + except Exception: + # If screenshot fails, return None to continue without image + image_data = None + return image_data + + @staticmethod + def _filter_execution_text( + text: str, + keep_page_state: bool = False, + ) -> str: + """ + Filter and clean browser tool execution output to remove verbose + content. + + This utility method removes unnecessary verbose content from browser + tool responses, including JavaScript code blocks, console messages, + and YAML content that can overwhelm the context window without + providing useful information. + + Args: + text (str): + The raw execution text from browser tools that + needs to be filtered. + keep_page_state (bool, optional): + Whether to preserve page state information + including URL and YAML content. Defaults to False. + + Returns: + str: The filtered execution text. 
+ """ + if not keep_page_state: + # Remove Page Snapshot and YAML content + text = re.sub(r"- Page URL.*", "", text, flags=re.DOTALL) + text = re.sub(r"```yaml.*?```", "", text, flags=re.DOTALL) + # # Remove JavaScript code blocks + + # Remove console messages section that can be very verbose + # (between "### New console messages" and "### Page state") + text = re.sub( + r"### New console messages.*?(?=### Page state)", + "", + text, + flags=re.DOTALL, + ) + # Trim leading/trailing whitespace + return text.strip() + + def _split_snapshot_by_chunk( + self, + snapshot_str: str, + max_length: int = 80000, + ) -> list[str]: + self.snapshot_chunk_id = 0 + return [ + snapshot_str[i : i + max_length] + for i in range(0, len(snapshot_str), max_length) + ] + + def observe_by_chunk(self, image_data: str | None = "") -> Msg: + """Create an observation message for chunk-based reasoning. + + This method formats the current chunk of the webpage snapshot with + contextual information from previous chunks to create a structured + observation message for the reasoning phase. + + Returns: + Msg: A user message containing the formatted reasoning prompt + with chunk information and context from previous chunks. 
+ """ + reasoning_prompt = self.observe_reasoning_prompt.format( + previous_chunkwise_information=self.previous_chunkwise_information, + current_subtask=self.current_subtask, + i=self.snapshot_chunk_id + 1, + total_pages=len(self.snapshot_in_chunk), + chunk=self.snapshot_in_chunk[self.snapshot_chunk_id], + init_query=self.original_task, + ) + content = [ + TextBlock( + type="text", + text=reasoning_prompt, + ), + ] + if self._supports_multimodal(): + if image_data: + image_block = ImageBlock( + type="image", + source=Base64Source( + type="base64", + media_type="image/png", + data=image_data, + ), + ) + content.append(image_block) + + observe_msg = Msg( + "user", + content=content, + role="user", + ) + return observe_msg + + async def browser_subtask_manager( # pylint: disable=too-many-branches,too-many-statements + self, + ) -> ToolResponse: + """ + Determine whether the current subtask is completed. + This tool should only be used when it is believed that + the current subtask is done. + + Returns: + `ToolResponse`: + If completed, advance current_subtask_idx; + otherwise, leave it unchanged. + """ + if ( + not hasattr(self, "subtasks") + or not self.subtasks + or self.current_subtask is None + ): + self.current_subtask = self.original_task + return ToolResponse( + content=[ + TextBlock( + type="text", + text=( + f"Tool call Error. Cannot be executed. " + f"Current subtask remains: {self.current_subtask}" + ), + ), + ], + ) + + # take memory as context + memory_content = await self.memory.get_memory() + + # LLM prompt for subtask validation + sys_prompt = ( + "You are an expert in subtask validation. \n" + "Given the following subtask and the agent's" + " recent memory, strictly judge if the subtask " + "is FULLY completed. \n" + "If yes, reply ONLY 'SUBTASK_COMPLETED'. " + "If not, reply ONLY 'SUBTASK_NOT_COMPLETED'." 
+ ) + if len(self.snapshot_in_chunk) > 0: + user_prompt = ( + f"Subtask: {self.current_subtask}\n" + f"Recent memory:\n{[str(m) for m in memory_content[-10:]]}\n" + f"Current page:\n{self.snapshot_in_chunk[0]}" + ) + else: + user_prompt = ( + f"Subtask: {self.current_subtask}\n" + f"Recent memory:\n{[str(m) for m in memory_content[-10:]]}\n" + ) + prompt = await self.formatter.format( + msgs=[ + Msg("system", sys_prompt, role="system"), + Msg("user", user_prompt, role="user"), + ], + ) + + response = await self.model(prompt) + response_text = "" + print_msg = Msg(name=self.name, content=[], role="assistant") + if self.model.stream: + # If the model supports streaming, collect chunks + async for chunk in response: + response_text += chunk.content[0]["text"] + print_msg.content = chunk.content + await self.print(print_msg, last=False) + else: + # If not streaming, get the full response at once + response_text = response.content[0]["text"] + + print_msg.content = [TextBlock(type="text", text=response_text)] + await self.print(print_msg, last=True) + + if "SUBTASK_COMPLETED" in response_text.strip().upper(): + self.current_subtask_idx += 1 + if self.current_subtask_idx < len(self.subtasks): + self.current_subtask = str( + self.subtasks[self.current_subtask_idx], + ) + else: + self.current_subtask = None + return ToolResponse( + content=[ + TextBlock( + type="text", + text=( + "Tool call SUCCESS." 
+ " Current subtask updates to: " + f"{self.current_subtask}" + ), + ), + ], + ) + else: + revise_prompt_path = os.path.join( + _CURRENT_DIR, + "_build_in_prompt_browser/browser_agent_subtask_revise_prompt.md", + ) + with open(revise_prompt_path, "r", encoding="utf-8") as fr: + revise_prompt = fr.read() + memory_content = await self.memory.get_memory() + user_prompt = revise_prompt.format( + memory=[str(m) for m in memory_content[-10:]], + subtasks=json.dumps(self.subtasks, ensure_ascii=False), + current_subtask=str(self.current_subtask), + original_task=str(self.original_task), + ) + prompt = await self.formatter.format( + msgs=[ + Msg("user", user_prompt, role="user"), + ], + ) + response = await self.model(prompt) + if self.model.stream: + async for chunk in response: + revise_text = chunk.content[0]["text"] + else: + revise_text = response.content[0]["text"] + try: + if "```json" in revise_text: + revise_text = revise_text.replace("```json", "").replace( + "```", + "", + ) + revise_json = json.loads(revise_text) + if_revised = revise_json.get("IF_REVISED") + if if_revised: + revised_subtasks = revise_json.get("REVISED_SUBTASKS", []) + if isinstance(revised_subtasks, list) and revised_subtasks: + self.subtasks = revised_subtasks + self.current_subtask_idx = 0 + self.current_subtask = self.subtasks[0] + logger.info( + f"Subtasks revised: {self.subtasks}, reason: {revise_json.get('REASON', '')}", + ) + except Exception as e: + logger.warning(f"Failed to revise subtasks: {e}") + return ToolResponse( + content=[ + TextBlock( + type="text", + text=( + "Tool call SUCCESS." 
+ f" Current subtask remains: {self.current_subtask}" + ), + ), + ], + ) + + async def browser_generate_final_response( + self, # pylint: disable=W0613 + **kwargs: Any, # pylint: disable=W0613 + ) -> ToolResponse: + """Generate a response when the agent has completed all subtasks.""" + + hint_msg = Msg( + "user", + _BROWSER_AGENT_SUMMARIZE_TASK_PROMPT, + role="user", + ) + memory_msgs = await self.memory.get_memory() + memory_msgs_copy = copy.deepcopy(memory_msgs) + last_msg = memory_msgs_copy[-1] + # check if the last message has tool call, if so clean the content + + last_msg.content = last_msg.get_content_blocks("text") + memory_msgs_copy[-1] = last_msg + + # Generate a reply by summarizing the current situation + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *memory_msgs_copy, + hint_msg, + ], + ) + try: + res = await self.model(prompt) + res_msg = Msg( + "assistant", + [], + "assistant", + ) + if self.model.stream: + async for content_chunk in res: + summary_text = content_chunk.content[0]["text"] + else: + summary_text = res.content[0]["text"] + + res_msg.content = summary_text + await self.print(res_msg, False) + # logger.info(summary_text) + # Validate finish status + finish_status = await self._validate_finish_status(summary_text) + logger.info(f"Finish status: {finish_status}") + + if "BROWSER_AGENT_TASK_FINISHED" in finish_status: + # Create a simple metadata structure instead of WorkerResponse + structure_response = { + "task_done": True, + "subtask_progress_summary": summary_text, + "generated_files": {}, + } + + response_msg = Msg( + self.name, + content=[ + TextBlock(type="text", text=summary_text), + ], + role="assistant", + metadata=structure_response, + ) + return ToolResponse( + content=[ + TextBlock( + type="text", + text="Successfully generated response.", + ), + ], + metadata={ + "success": True, + "response_msg": response_msg, + }, + is_last=True, + ) + else: + return ToolResponse( + content=[ + 
TextBlock( + type="text", + text=f"Here is a summary of current status:\n{summary_text}\nPlease continue.\n Following steps \n {finish_status}", + ), + ], + metadata={"success": False, "response_msg": None}, + is_last=True, + ) + except Exception as e: + return ToolResponse( + content=[ + TextBlock( + type="text", + text=f"Tool call Error. Cannot be executed. {e}", + ), + ], + metadata={"success": False}, + is_last=True, + ) + + async def _validate_finish_status(self, summary: str) -> str: + """Validate if the agent has completed its task based on the summary.""" + sys_prompt = ( + "You are an expert in task validation. " + "Your job is to determine if the agent has completed its task" + " based on the provided summary. If finished, strictly reply " + '"BROWSER_AGENT_TASK_FINISHED", otherwise return the remaining ' + "tasks or next steps." + ) + # Extract user question from memory + initial_question = None + memory_msgs = await self.memory.get_memory() + for msg in memory_msgs: + if msg.role == "user": + initial_question = msg.content + break + + prompt = await self.formatter.format( + msgs=[ + Msg( + "system", + sys_prompt, + role="system", + ), + Msg( + "user", + content=( + "The initial task is to solve the following question: " + f"{initial_question} \n " + f"Here is a summary of current task " + f"completion process, please evaluate the task finish " + f"status.\n" + summary + ), + role="user", + ), + ], + ) + res = await self.model(prompt) + response_text = "" + if self.model.stream: + async for content_chunk in res: + response_text = content_chunk.content[0]["text"] + else: + response_text = res.content[0]["text"] + return response_text diff --git a/browser_use/browser_use_agent_pro/_build_in_helper_browser/__init__.py b/browser_use/browser_use_agent_pro/_build_in_helper_browser/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/browser_use/browser_use_agent_pro/_build_in_helper_browser/_file_download.py 
b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_file_download.py new file mode 100644 index 0000000..b15abd9 --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_file_download.py @@ -0,0 +1,234 @@ +# -*- coding: utf-8 -*- +"""Standalone file download skill for the browser agent.""" +# flake8: noqa: E501 +# pylint: disable=W0212 +# pylint: disable=too-many-lines +# pylint: disable=C0301 +from __future__ import annotations + +import copy +from typing import Any +import os + +from agentscope.agent import ReActAgent +from agentscope.memory import InMemoryMemory +from agentscope.message import Msg, TextBlock +from agentscope.tool import ToolResponse + +_CURRENT_DIR = os.path.abspath( + os.path.join(os.path.dirname(__file__), os.pardir), +) + +with open( + os.path.join( + _CURRENT_DIR, + "_build_in_prompt_browser/browser_agent_file_download_sys_prompt.md", + ), + "r", + encoding="utf-8", +) as f: + _FILE_DOWNLOAD_AGENT_SYS_PROMPT = f.read() + + +class FileDownloadAgent(ReActAgent): + """Lightweight helper agent that downloads files""" + + def __init__( + self, + browser_agent: Any, + sys_prompt: str = _FILE_DOWNLOAD_AGENT_SYS_PROMPT, + max_iters: int = 15, + ) -> None: + name = ( + f"{getattr(browser_agent, 'name', 'browser_agent')}_file_download" + ) + self.finish_function_name = "file_download_final_response" + super().__init__( + name=name, + sys_prompt=sys_prompt, + model=browser_agent.model, + formatter=browser_agent.formatter, + memory=InMemoryMemory(), + toolkit=browser_agent.toolkit, + max_iters=max_iters, + ) + # Remove conflicting tool functions if they exist + if hasattr(self.toolkit, "remove_tool_function"): + try: + self.toolkit.remove_tool_function("browser_pdf_save") + except Exception: + # Tool may not exist, ignore removal errors + pass + try: + self.toolkit.remove_tool_function("file_download") + except Exception: + # Tool may not exist, ignore removal errors + pass + + async def file_download_final_response( + 
self, # pylint: disable=W0613 + **kwargs: Any, # pylint: disable=W0613 + ) -> ToolResponse: + """Summarize the file download outcome.""" + hint_msg = Msg( + "user", + ( + "Provide a concise summary of the file download attempt.\n" + "Highlight these items:\n" + "0. The original request\n" + "1. The element(s) interacted with and actions taken\n" + "2. The download status or any issues encountered\n" + "3. Any follow-up recommendations or next steps\n" + ), + role="user", + ) + + memory_msgs = await self.memory.get_memory() + memory_msgs_copy = copy.deepcopy(memory_msgs) + if memory_msgs_copy: + last_msg = memory_msgs_copy[-1] + last_msg.content = last_msg.get_content_blocks("text") + memory_msgs_copy[-1] = last_msg + + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *memory_msgs_copy, + hint_msg, + ], + ) + + res = await self.model(prompt) + + if self.model.stream: + summary_text = "" + async for chunk in res: + summary_text = chunk.content[0]["text"] + else: + summary_text = res.content[0]["text"] + + summary_text = summary_text or "No summary generated." + + # Create a simple metadata structure instead of WorkerResponse + structure_response = { + "task_done": True, + "subtask_progress_summary": summary_text, + "generated_files": {}, + } + response_msg = Msg( + self.name, + content=[ + TextBlock(type="text", text=summary_text), + ], + role="assistant", + metadata=structure_response, + ) + + return ToolResponse( + content=[ + TextBlock( + type="text", + text="File download summary generated. 
" + summary_text, + ), + ], + metadata={ + "success": True, + "response_msg": response_msg, + }, + is_last=True, + ) + + +def _build_initial_instruction( + target_description: str, + snapshot_text: str, +) -> str: + """Compose the initial instruction for the helper agent.""" + return ( + "You must locate and trigger the download for the requested file.\n\n" + "Target description provided by the user:\n" + f"{target_description}\n\n" + "Latest snapshot captured prior to your run:\n" + f"{snapshot_text}\n\n" + "Follow the sys prompt guidance, think step-by-step, and verify that " + "the download action succeeded. If the download cannot be completed, " + "explain why in the final summary." + ) + + +async def file_download( + browser_agent: Any, + target_description: str, +) -> ToolResponse: + """ + Download the target file. The current page should + contain download-related element. + + Args: + target_description (str): The description of the + target file to download. + + Returns: + ToolResponse: A structured response containing + the download directory. 
+ """ + try: + snapshot_chunks = await browser_agent._get_snapshot_in_text() + except Exception as exc: # pylint: disable=broad-except + snapshot_chunks = [] + snapshot_error = str(exc) + else: + snapshot_error = "" + + snapshot_text = "\n\n---\n\n".join(snapshot_chunks) + if snapshot_error and not snapshot_text: + snapshot_text = f"[Snapshot failed: {snapshot_error}]" + + sub_agent = FileDownloadAgent(browser_agent) + instruction = _build_initial_instruction( + target_description=target_description, + snapshot_text=snapshot_text, + ) + + init_msg = Msg( + name="user", + role="user", + content=instruction, + ) + + try: + sub_agent_response_msg = await sub_agent.reply(init_msg) + + text_content = "" + if sub_agent_response_msg.content: + first_block = sub_agent_response_msg.content[0] + if isinstance(first_block, dict): + text_content = first_block.get("text") or "" + else: + text_content = getattr(first_block, "text", "") or "" + + if not text_content: + text_content = ( + "File download agent finished without a textual summary." + ) + + return ToolResponse( + metadata=sub_agent_response_msg.metadata, + content=[ + TextBlock( + type="text", + text=text_content, + ), + ], + ) + except Exception as exc: # pylint: disable=broad-except + return ToolResponse( + content=[ + TextBlock( + type="text", + text=f"Tool call Error. Cannot be executed. 
{exc}", + ), + ], + metadata={"success": False}, + is_last=True, + ) diff --git a/browser_use/browser_use_agent_pro/_build_in_helper_browser/_form_filling.py b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_form_filling.py new file mode 100644 index 0000000..3ad5f4f --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_form_filling.py @@ -0,0 +1,210 @@ +# -*- coding: utf-8 -*- +"""Standalone form filling skill for the browser agent.""" +# flake8: noqa: E501 +# pylint: disable=W0212 +# pylint: disable=too-many-lines +# pylint: disable=C0301 +from __future__ import annotations + +import copy +from typing import Any +import os + +from agentscope.agent import ReActAgent +from agentscope.memory import InMemoryMemory +from agentscope.message import Msg, TextBlock +from agentscope.tool import ToolResponse + +_CURRENT_DIR = os.path.abspath( + os.path.join(os.path.dirname(__file__), os.pardir), +) + +with open( + os.path.join( + _CURRENT_DIR, + "_build_in_prompt_browser/browser_agent_form_filling_sys_prompt.md", + ), + "r", + encoding="utf-8", +) as f: + _FORM_FILL_AGENT_SYS_PROMPT = f.read() + + +class FormFillingAgent(ReActAgent): + """Lightweight helper agent that fills forms.""" + + def __init__( + self, + browser_agent: Any, + sys_prompt: str = _FORM_FILL_AGENT_SYS_PROMPT, + max_iters: int = 20, + ) -> None: + name = f"{getattr(browser_agent, 'name', 'browser_agent')}_form_fill" + self.finish_function_name = "form_filling_final_response" + super().__init__( + name=name, + sys_prompt=sys_prompt, + model=browser_agent.model, + formatter=browser_agent.formatter, + memory=InMemoryMemory(), + toolkit=browser_agent.toolkit, + max_iters=max_iters, + ) + + async def form_filling_final_response( + self, # pylint: disable=W0613 + **kwargs: Any, # pylint: disable=W0613 + ) -> ToolResponse: + """Summarize the form filling outcome.""" + hint_msg = Msg( + "user", + ( + "Provide a concise summary of the completed form " + "filling task.\n" + 
"Highlight these items:\n" + "0. The original task/query\n" + "1. Which fields were filled/selected and their final values\n" + "2. Any important observations or follow-up notes\n" + "3. Confirmation that if the task is complete\n\n" + ), + role="user", + ) + + memory_msgs = await self.memory.get_memory() + memory_msgs_copy = copy.deepcopy(memory_msgs) + last_msg = memory_msgs_copy[-1] + # check if the last message has tool call, if so clean the content + + last_msg.content = last_msg.get_content_blocks("text") + memory_msgs_copy[-1] = last_msg + + prompt = await self.formatter.format( + msgs=[ + Msg("system", self.sys_prompt, "system"), + *memory_msgs_copy, + hint_msg, + ], + ) + + res = await self.model(prompt) + + if self.model.stream: + summary_text = "" + async for chunk in res: + summary_text = chunk.content[0]["text"] + else: + summary_text = res.content[0]["text"] + + # Create a simple metadata structure instead of WorkerResponse + structure_response = { + "task_done": True, + "subtask_progress_summary": summary_text, + "generated_files": {}, + } + response_msg = Msg( + self.name, + content=[ + TextBlock(type="text", text=summary_text), + ], + role="assistant", + metadata=structure_response, + ) + + return ToolResponse( + content=[ + TextBlock( + type="text", + text="Form filling summary generated. 
" + summary_text, + ), + ], + metadata={ + "success": True, + "response_msg": response_msg, + }, + is_last=True, + ) + + +def _build_initial_instruction( + fill_information: str, + snapshot_text: str, +) -> str: + """Compose the initial instruction fed to the helper agent.""" + return ( + "You must complete the web form using the information " + "provided below.\n\nFill instructions (plain text from the user):\n" + f"{fill_information}\n\n" + "Latest snapshot captured prior to your run:\n" + f"{snapshot_text}\n\n" + ) + + +async def form_filling( + browser_agent: Any, + fill_information: str, +) -> ToolResponse: + """ + Fill in a web form according to plain-text instructions. + + Args: + fill_information (str): + Plain-text description of the values that + must be entered into the form, + including any submission requirements. + + Returns: + ToolResponse: Summary of the helper agent execution and status. + """ + try: + snapshot_chunks = ( + await browser_agent._get_snapshot_in_text() + ) # pylint: disable=protected-access + except Exception as exc: # pylint: disable=broad-except + snapshot_chunks = [] + snapshot_error = str(exc) + else: + snapshot_error = "" + + snapshot_text = "\n\n---\n\n".join(snapshot_chunks) + if snapshot_error and not snapshot_text: + snapshot_text = f"[Snapshot failed: {snapshot_error}]" + + sub_agent = FormFillingAgent(browser_agent) + instruction = _build_initial_instruction( + fill_information=fill_information, + snapshot_text=snapshot_text, + ) + + init_msg = Msg( + name="user", + role="user", + content=instruction, + ) + + try: + sub_agent_response_msg = await sub_agent.reply(init_msg) + + return ToolResponse( + metadata=sub_agent_response_msg.metadata, + content=[ + TextBlock( + type="text", + text=sub_agent_response_msg.content[0]["text"] + or ( + "Form filling agent finished " + "without a textual summary." 
+ ), + ), + ], + ) + except Exception as e: + return ToolResponse( + content=[ + TextBlock( + type="text", + text=f"Tool call Error. Cannot be executed. {e}", + ), + ], + metadata={"success": False}, + is_last=True, + ) diff --git a/browser_use/browser_use_agent_pro/_build_in_helper_browser/_image_understanding.py b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_image_understanding.py new file mode 100644 index 0000000..e145668 --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_image_understanding.py @@ -0,0 +1,161 @@ +# -*- coding: utf-8 -*- +"""Standalone image understanding skill for the browser agent.""" +# flake8: noqa: E501 +# pylint: disable=W0212 +# pylint: disable=too-many-lines +# pylint: disable=C0301 +from __future__ import annotations + +import json +import uuid +from typing import Any + +from agentscope.message import ( + Base64Source, + ImageBlock, + Msg, + TextBlock, + ToolUseBlock, +) +from agentscope.tool import ToolResponse + + +async def image_understanding( + browser_agent: Any, + object_description: str, + task: str, +) -> ToolResponse: + """ + Locate an element and solve a visual task on the current webpage. + + Args: + object_description (str): The description of the object to locate. + task (str): The specific task or question to solve about the image + (e.g., description, object detection, activity recognition, or + answering a question about the image's content). + + Returns: + ToolResponse: A structured response containing the answer to + the specified task based on the image content. + """ + + sys_prompt = ( + "You are a web page analysis expert. Given the following page " + "snapshot and object description, " + "identify the exact element and its reference string (ref) " + "that matches the description. 
" + "Return ONLY a JSON object: " + '{"element": , "ref": }' + ) + + snapshot_chunks = ( + await browser_agent._get_snapshot_in_text() # noqa: E501 # pylint: disable=protected-access + ) + page_snapshot = snapshot_chunks[0] if snapshot_chunks else "" + user_prompt = ( + f"Object description: {object_description}\n" + f"Page snapshot:\n{page_snapshot}" + ) + + prompt = await browser_agent.formatter.format( + msgs=[ + Msg("system", sys_prompt, role="system"), + Msg("user", user_prompt, role="user"), + ], + ) + res = await browser_agent.model(prompt) + if browser_agent.model.stream: + async for chunk in res: + model_text = chunk.content[0]["text"] + else: + model_text = res.content[0]["text"] + + try: + if "```json" in model_text: + model_text = model_text.replace("```json", "").replace( + "```", + "", + ) + element_info = json.loads(model_text) + element = element_info.get("element", "") + ref = element_info.get("ref", "") + except Exception: + return ToolResponse( + content=[ + TextBlock( + type="text", + text="Failed to parse element/ref from model output.", + ), + ], + metadata={"success": False}, + ) + + screenshot_tool_call = ToolUseBlock( + id=str(uuid.uuid4()), + name="browser_take_screenshot", + input={"element": element, "ref": ref}, + type="tool_use", + ) + screenshot_response = await browser_agent.toolkit.call_tool_function( + screenshot_tool_call, + ) + image_data = None + async for chunk in screenshot_response: + if ( + chunk.content + and len(chunk.content) > 1 + and "data" in chunk.content[1] + ): + image_data = chunk.content[1]["data"] + + sys_prompt_task = ( + "You are a web automation expert. " + "Given the object description, screenshot, and page context, " + "solve the following task. Return ONLY the answer as plain text." 
+ ) + content_blocks = [ + TextBlock( + type="text", + text=( + "Object description: " + f"{object_description}\nTask: {task}\n" + f"Page snapshot:\n{page_snapshot}" + ), + ), + ] + + if image_data: + image_block = ImageBlock( + type="image", + source=Base64Source( + type="base64", + media_type="image/png", + data=image_data, + ), + ) + content_blocks.append(image_block) + + prompt_task = await browser_agent.formatter.format( + msgs=[ + Msg("system", sys_prompt_task, role="system"), + Msg("user", content_blocks, role="user"), + ], + ) + res_task = await browser_agent.model(prompt_task) + if browser_agent.model.stream: + async for chunk in res_task: + answer_text = chunk.content[0]["text"] + else: + answer_text = res_task.content[0]["text"] + + return ToolResponse( + content=[ + TextBlock( + type="text", + text=( + f"Screenshot taken for element: {element}\nref: {ref}\n" + f"Task solution: {answer_text}" + ), + ), + ], + ) diff --git a/browser_use/browser_use_agent_pro/_build_in_helper_browser/_video_understanding.py b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_video_understanding.py new file mode 100644 index 0000000..7a96c0f --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_helper_browser/_video_understanding.py @@ -0,0 +1,330 @@ +# -*- coding: utf-8 -*- +"""Standalone video understanding skill for the browser agent.""" +# flake8: noqa: E501 +# pylint: disable=W0212 +# pylint: disable=too-many-lines +# pylint: disable=C0301 +from __future__ import annotations + +import json +import os +import subprocess +import tempfile +import uuid +from base64 import b64encode +from pathlib import Path +from typing import Any, List, Optional + +from agentscope.message import ( + Base64Source, + ImageBlock, + Msg, + TextBlock, +) +from agentscope.tool import ToolResponse + + +async def video_understanding( + browser_agent: Any, + video_path: str, + task: str, +) -> ToolResponse: + """ + Perform video understanding on the provided video file. 
+ + Args: + video_path (str): The path to the video file to analyze. + task (str): The specific task or question to solve about + the video (e.g., summary, object detection, activity recognition, + or answering a question about the video's content). + + Returns: + ToolResponse: A structured response containing the answer + to the specified task based on the video content. + """ + + workdir = _prepare_workdir(browser_agent) + try: + frames_dir = os.path.join(workdir, "frames") + frames = extract_frames(video_path, frames_dir) + except Exception as exc: + return _error_response(f"Failed to extract frames: {exc}") + + audio_path = os.path.join( + workdir, + f"audio_{getattr(browser_agent, 'iter_n', 0)}.wav", + ) + try: + extract_audio(video_path, audio_path) + except Exception as exc: + return _error_response(f"Failed to extract audio: {exc}") + + try: + transcript = audio2text(audio_path) + except Exception as exc: + return _error_response(f"Failed to transcribe audio: {exc}") + + sys_prompt = ( + "You are a web video analysis expert. " + "Given the following video frames and audio transcript, " + "analyze the content and provide a solution to the task. 
" + 'Return ONLY a JSON object: {"answer": }' + ) + + content_blocks = _build_multimodal_blocks(frames, transcript, task) + + prompt = await browser_agent.formatter.format( + msgs=[ + Msg("system", sys_prompt, role="system"), + Msg("user", content_blocks, role="user"), + ], + ) + + res = await browser_agent.model(prompt) + if browser_agent.model.stream: + async for chunk in res: + model_text = chunk.content[0]["text"] + else: + model_text = res.content[0]["text"] + + try: + if "```json" in model_text: + model_text = model_text.replace("```json", "").replace( + "```", + "", + ) + answer_info = json.loads(model_text) + answer = answer_info.get("answer", "") + except Exception: # pylint: disable=broad-except + return _error_response("Failed to parse answer from model output.") + + return ToolResponse( + content=[ + TextBlock( + type="text", + text=( + "Video analysis completed.\n" f"Task solution: {answer}" + ), + ), + ], + ) + + +def audio2text(audio_path: str) -> str: + """Convert audio to text using DashScope ASR.""" + + try: # Local import to avoid hard dependency when unused. 
+ from dashscope.audio.asr import Recognition, RecognitionCallback + except ImportError as exc: + raise RuntimeError( + "dashscope.audio is required for audio transcription.", + ) from exc + + callback = RecognitionCallback() + recognizer = Recognition( + model="paraformer-realtime-v1", + format="wav", + sample_rate=16000, + callback=callback, + ) + + result = recognizer.call(audio_path) + sentences = result.get("output", {}).get("sentence", []) + return " ".join(sentence.get("text", "") for sentence in sentences) + + +def extract_frames( + video_path: str, + output_dir: str, + max_frames: int = 16, +) -> List[str]: + """Extract representative frames using ffmpeg (no OpenCV dependency).""" + + if max_frames <= 0: + raise ValueError("max_frames must be greater than zero.") + + if not os.path.exists(video_path): + raise FileNotFoundError(f"Video path not found: {video_path}") + + os.makedirs(output_dir, exist_ok=True) + + # Clean up previous generated frames + for existing in Path(output_dir).glob("frame_*.jpg"): + try: + existing.unlink() + except OSError: + # Ignore errors during cleanup; + # leftover files will be overwritten or do not affect frame extraction + pass + + duration = _probe_video_duration(video_path) + if duration and duration > 0: + fps = max_frames / duration + else: + fps = 1.0 + + fps = max(min(fps, 30.0), 0.1) + + command = [ + "ffmpeg", + "-y", + "-i", + video_path, + "-vf", + f"fps={fps:.5f}", + "-frames:v", + str(max_frames), + os.path.join(output_dir, "frame_%04d.jpg"), + ] + + try: + subprocess.run( + command, + check=True, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + ) + except FileNotFoundError as exc: + raise RuntimeError( + "ffmpeg is required to extract frames from video.", + ) from exc + + frame_files = sorted( + str(path) for path in Path(output_dir).glob("frame_*.jpg") + ) + + if not frame_files: + raise RuntimeError("No frames could be extracted from the video.") + + return frame_files + + +def 
extract_audio(video_path: str, audio_path: str) -> str: + """Extract audio track with ffmpeg and save as wav.""" + + if not os.path.exists(video_path): + raise FileNotFoundError(f"Video path not found: {video_path}") + + os.makedirs(os.path.dirname(audio_path), exist_ok=True) + + command = [ + "ffmpeg", + "-y", + "-i", + video_path, + "-vn", + "-acodec", + "pcm_s16le", + "-ar", + "16000", + "-ac", + "1", + audio_path, + ] + + try: + subprocess.run( + command, + check=True, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + ) + except FileNotFoundError as exc: + raise RuntimeError( + "ffmpeg is required to extract audio from video.", + ) from exc + + return audio_path + + +def _probe_video_duration(video_path: str) -> Optional[float]: + """Return the video duration in seconds using ffprobe, if available.""" + + command = [ + "ffprobe", + "-v", + "error", + "-show_entries", + "format=duration", + "-of", + "default=noprint_wrappers=1:nokey=1", + video_path, + ] + + try: + result = subprocess.run( + command, + check=True, + stdout=subprocess.PIPE, + stderr=subprocess.DEVNULL, + text=True, + ) + duration_str = result.stdout.strip() + if duration_str: + return float(duration_str) + except (FileNotFoundError, ValueError, subprocess.CalledProcessError): + return None + + return None + + +def _build_multimodal_blocks( + frames: List[str], + transcript: str, + task: str, +) -> list: + """Construct multimodal content blocks for the model input.""" + + blocks: list = [] + for frame_path in frames: + with open(frame_path, "rb") as file: + data = b64encode(file.read()).decode("ascii") + image_block = ImageBlock( + type="image", + source=Base64Source( + type="base64", + media_type="image/jpeg", + data=data, + ), + ) + blocks.append(image_block) + + blocks.append( + TextBlock( + type="text", + text=f"Audio transcript:\n{transcript}", + ), + ) + blocks.append( + TextBlock( + type="text", + text=f"The task to be solved is: {task}", + ), + ) + return blocks + + +def 
_prepare_workdir(browser_agent: Any) -> str: + """Prepare a working directory for intermediate artifacts.""" + + base_dir = getattr(browser_agent, "state_saving_dir", None) + if not base_dir: + base_dir = tempfile.gettempdir() + + workdir = os.path.join(base_dir, "video_understanding", uuid.uuid4().hex) + os.makedirs(workdir, exist_ok=True) + return workdir + + +def _error_response(message: str) -> ToolResponse: + """Create a standardized error response.""" + + return ToolResponse( + content=[ + TextBlock( + type="text", + text=message, + ), + ], + metadata={"success": False}, + ) diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/__init__.py b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_decompose_reflection_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_decompose_reflection_prompt.md new file mode 100644 index 0000000..4c7a0c8 --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_decompose_reflection_prompt.md @@ -0,0 +1,28 @@ +Your role is to assess and optimize task decomposition for browser automation. Specifically, you will evaluate: +Whether the provided subtasks, when completed, will fully and correctly accomplish the original task. +Whether the original task requires decomposition. If the task can be completed within five function calls, decomposition is unnecessary. + + +Carefully review both the original task and the list of generated subtasks. + +- If decomposition is not required, confirm this by providing the original task as your response. +- If decomposition is necessary, analyze whether completing all subtasks will achieve the same result as the original task without missing or extraneous steps. +- "If" statement should not be used in subtask descriptions. 
All statements should be direct and assertive.
+- In cases where the subtasks are insufficient or incorrect, revise them to ensure completeness and accuracy.
+
+Format your response as the following JSON:
+{{
+    "DECOMPOSITION": true/false, // true if decomposition is necessary, false otherwise
+    "SUFFICIENT": true/false/na, // if decomposition is necessary, true if the subtasks are sufficient, false otherwise, na if decomposition is not necessary.
+    "REASON": "Briefly explain your reasoning.",
+    "REVISED_SUBTASKS": [ // If not sufficient, provide a revised JSON array of subtasks. If sufficient, repeat the original subtasks. If decomposition is not necessary, provide the original task.
+        "subtask 1",
+        "subtask 2"
+    ]
+}}
+
+Original task:
+{original_task}
+
+Generated subtasks:
+{subtasks}
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_evaluate.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_evaluate.md
new file mode 100644
index 0000000..73d40ff
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_evaluate.md
@@ -0,0 +1,30 @@
+## Identity and Purpose
+You are an expert in evaluating the performance of a web navigation agent. The agent is designed to help a human user navigate a website to complete a task. You are given the user's intent, the agent's action history, the final state of the webpage, and the agent's response to the user.
+
+Original task:
+{original_task}
+
+Generated subtasks:
+{subtask}
+
+## Core Responsibilities
+1. View the webpage, summarize content exactly relevant to the task goal.
+2. Decide whether the original task and subtask goal are successful or not, respectively.
+3. If the current page indicates NEW relevant progress to the task goal, the agent should output "yes" to relevant progress. Otherwise, output "no".
+4. 
If the current state is a failure but it looks like the agent is on the right track towards success, you should also state this.
+
+### Action Taking Guidelines
+1. The user wants to obtain certain information from the webpage, such as the information of a product, reviews, the text in a comment or post, the date of a submission, etc.
+2. The agent's response must contain the information the user wants, or explicitly state that the information is not available. Otherwise (e.g., the agent encounters an exception and responds with the error content), the task is considered a failure.
+3. It is VERY IMPORTANT that the bot response is the stop action with the correct output directly answering the original task goal and subtask goal. If the bot response is not stop (e.g., it is click, type, or goto) or only partial/intermediate results are retrieved, it is considered a failure.
+4. If the agent is searching the content (e.g., Google), it is considered on the right track. Otherwise, if the page is showing a human verification prompt or an error message, it is NOT on the right track.
+
+#### Output Format Requirements
+*IMPORTANT*
+Format your response into detailed paragraphs as shown below:
+
+Thoughts:
+Original task status: "success" or "failure"
+Subtask status: "success" or "failure"
+New progress: "yes" or "no"
+On the right track to success: "yes" or "no"
\ No newline at end of file
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_file_download_sys_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_file_download_sys_prompt.md
new file mode 100644
index 0000000..10b6183
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_file_download_sys_prompt.md
@@ -0,0 +1,9 @@
+You are a meticulous web automation specialist. Study the provided page snapshot carefully before acting.
+Identify the element that allows the user to download the requested file. 
+Verify every locator prior to interaction.
+
+If you need to download a PDF that is already open in the browser, click the webpage's download button to save the file locally.
+
+Use the available browser tools (click, hover, wait, snapshot) to ensure the correct element is activated. Request fresh snapshots after meaningful changes when needed.
+
+Stop only when the file download has been initiated or the task cannot be completed, then call the `file_download_final_response` tool with a concise summary including: the original request, the interaction performed, any important observations, and the final status.
\ No newline at end of file
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_form_filling_sys_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_form_filling_sys_prompt.md
new file mode 100644
index 0000000..d2f2baa
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_form_filling_sys_prompt.md
@@ -0,0 +1,17 @@
+You are a specialized web form operator. Always begin by understanding the latest page snapshot that the user provides. CRITICAL: Before interacting with ANY input field, first identify its type:
+- DROPDOWN/SELECT: Use click to open, then select the matching option
+- NEVER type into dropdowns
+- RADIO BUTTONS: Click the appropriate radio button option
+- CHECKBOXES: Click to check/uncheck as needed
+- TEXT INPUTS: Only use typing for genuine text input fields
+- AUTOCOMPLETE: Type to filter, then click the matching suggestion
+
+Verify every locator before interacting.
+Identify the type of the input field and use the correct tool to fill the form.
+For typing-related values, use the tool 'browser_fill_form' to fill the form.
+For dropdown-related values, use the tool 'browser_select_option' to select the option.
+Some dropdowns may have a search input. If so, use the search input to find the matching option and select it. 
+If you see a dropdown arrow, a select element, or multiple-choice options, you MUST use clicking/selection - NOT typing.
+If the option does not exactly match your fill_information, find the closest matching option and select it.
+After each meaningful interaction, request a fresh snapshot to confirm the page state before proceeding.
+Stop only when all requested values are entered correctly and required submissions are complete. Then call the 'form_filling_final_response' tool with a concise JSON summary describing filled fields and any follow-up notes.
\ No newline at end of file
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_observe_reasoning_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_observe_reasoning_prompt.md
new file mode 100644
index 0000000..d99f81f
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_observe_reasoning_prompt.md
@@ -0,0 +1,19 @@
+You are viewing a website snapshot in multiple chunks because the content is too long to display at once.
+Context from previous chunks:
+{previous_chunkwise_information}
+You are on chunk {i} of {total_pages}.
+Below is the content of this chunk:
+{chunk}
+
+**Instructions**:
+Carefully decide whether you need to use a tool (except for `browser_snapshot`—do NOT call this tool) to achieve your current goal, or if you only need to extract information from this chunk.
+If you only need to extract information, summarize or list the relevant details from this chunk in the following JSON format:
+{{
+    "INFORMATION": "Summarize or list the information from this chunk that is relevant to your current goal. If nothing is found, write 'None'.",
+    "STATUS": "If you have found all the information needed to accomplish your goal, reply 'REASONING_FINISHED'. Otherwise, reply 'CONTINUE'." 
+}}
+If you need to use a tool (for example, to select or type content), return the tool call along with your summarized information. If there are more chunks remaining and you have not found all the information needed, you can set the STATUS to 'CONTINUE' and the next chunk will be loaded automatically. (Do not call other tools in this case.) Scrolling will be performed automatically to capture the full page if the STATUS is set to 'CONTINUE'.
+
+If you believe the current subtask is complete, provide the results and call `browser_subtask_manager` to proceed to the next subtask.
+
+If the final answer to the user query, i.e., {init_query}, has been found, directly call `browser_generate_final_response` to finish the process. DO NOT call `browser_subtask_manager` in this case.
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_pure_reasoning_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_pure_reasoning_prompt.md
new file mode 100644
index 0000000..c23e955
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_pure_reasoning_prompt.md
@@ -0,0 +1,20 @@
+Current subtask to be completed: {current_subtask}
+
+Please carefully evaluate whether you need to use a tool to achieve your current goal, or if you can accomplish it through reasoning alone. 
+ +**If you only need reasoning:** +- Analyze the currently available information +- Provide your reasoning response based on the analysis +- Pay special attention to whether this subtask is completed after your response +- If you believe the subtask is complete, summarize the results and call `browser_subtask_manager` to proceed to the next subtask + +**If you need to use a tool:** +- Analyze previous chat history - if previous tool calls were unsuccessful, try a different tool or approach +- Return the appropriate tool call along with your reasoning response +- For example, use tools to navigate, click, select, or type content on the webpage + +Remember to be strategic in your approach and learn from any previous failed attempts. + +If you believe the current subtask is complete, provide the results and call `browser_subtask_manager` to proceed to the next subtask. + +If the final answer to the user query, i.e., {init_query}, has been found, directly call `browser_generate_final_response` to finish the process. DO NOT call `browser_subtask_manager` in this case. diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_subtask_revise_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_subtask_revise_prompt.md new file mode 100644 index 0000000..0697738 --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_subtask_revise_prompt.md @@ -0,0 +1,28 @@ +You are an expert in web task decomposition and revision. Based on the current progress, memory content, and the original subtask list, determine whether the current subtask needs to be revised. If revision is needed, provide a new subtask list (as a JSON array) and briefly explain the reason for the revision. If revision is not needed, just return the old subtask list. + +## Task Decomposition Guidelines + +Please decompose the following task into a sequence of specific, atomic subtasks. 
Each subtask should be: + +- **Indivisible**: Cannot be further broken down. +- **Clear**: Each step should be easy to understand and perform. +- **Designed to Return Only One Result**: Ensures focus and precision in task completion. +- **Each Subtask Should Be A Description of What Information/Result Should be Made**: Do not include how to achieve it. +- **Avoid Verify**: Do not include verification in the subtasks. +- **Use Direct Language**: All statements should be direct and assertive. "If" statement should not be used in subtask descriptions. + +### Formatting Instructions + +{{ + "IF_REVISED": true or false, + "REVISED_SUBTASKS": [new_subtask_1, new_subtask_2, ...], + "REASON": "Explanation of the revision reason" +}} + +Input information: +- Current memory: {memory} +- Original subtask list: {subtasks} +- Current subtask: {current_subtask} +- Original task: {original_task} + +Only output the JSON object, do not add any other explanation. \ No newline at end of file diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_summarize_task.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_summarize_task.md new file mode 100644 index 0000000..c546a69 --- /dev/null +++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_summarize_task.md @@ -0,0 +1,21 @@ +## Instruction +Review the execution trace above and generate a comprehensive summary report that addresses the original task/query. Your summary must include: + +1. **Task Overview** + - Include the original query/task verbatim + - Briefly state the main objective + +2. 
**Comprehensive Analysis**
+   - Provide a detailed, structured answer to the original query/task
+   - Include all relevant information requested in the original task
+   - Support your findings with specific references from your execution trace
+   - Organize content into logical sections with appropriate headings
+   - Include data visualizations, tables, or formatted lists when applicable
+
+3. **Final Answer**
+   - If the task is a question and is fully complete, provide the exact final answer
+   - If the task is an action, provide your summarized findings
+   - Else, respond exactly "NO_ANSWER" for this subsection
+   - No thinking or reasoning is needed
+
+Format your report professionally with consistent heading levels, proper spacing, and appropriate emphasis for key information.
\ No newline at end of file
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_sys_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_sys_prompt.md
new file mode 100644
index 0000000..a7cd51c
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_sys_prompt.md
@@ -0,0 +1,48 @@
+You are playing the role of a web-browsing AI assistant named {name}.
+
+# Objective
+Your goal is to complete given tasks by controlling a browser to navigate web pages.
+
+## Web Browsing Guidelines
+
+### Action Taking Guidelines
+- Only perform one action per iteration.
+- After a snapshot is taken, you need to take an action to continue the task.
+- Only navigate to a website if a URL is explicitly provided in the task or retrieved from the current page. Do not generate or invent URLs yourself.
+- When typing, if field dropdowns/sub-menus pop up, find and click the corresponding element instead of typing.
+- Try clicking elements in the middle of the page first, rather than at the top or bottom edges. If this doesn't work, try clicking elements at the top or bottom of the page. 
+- Avoid interacting with irrelevant web elements (e.g., login/registration/donation). Focus on key elements like search boxes and menus.
+- An action may not be successful. If this happens, try to take the action again. If it still fails, try a different approach.
+- Note dates in tasks - you must find results matching specific dates. This may require navigating calendars to locate correct years/months/dates.
+- Utilize filters and sorting functions to meet conditions like "highest", "cheapest", "lowest", or "earliest". Strive to find the most suitable answer.
+- When using Google to find answers to questions, follow these steps:
+1. Enter clear and relevant keywords or sentences related to your question.
+2. Carefully review the search results page. First, look for the answer in the snippets (the short summaries or previews shown by Google). Pay special attention to the first snippet.
+3. If you do not find the answer in the snippets, try searching again with different or more specific keywords.
+4. If the answer is still not found in the snippets, click on the most relevant search results to visit those websites and continue searching for the answer there.
+5. If you find the answer in a snippet, click on the corresponding search result to visit the website and verify the answer.
+6. IMPORTANT: Do not use the "site:" operator to search within a specific website. Always use keywords related to the problem instead.
+- Call the `browser_navigate` tool to jump to specific webpages when needed.
+- Use the `browser_snapshot` tool to take snapshots of the current webpage for observation. Scrolling will be performed automatically to capture the full page.
+- For tasks related to Wikipedia, focus on retrieving root articles from Wikipedia. A root article is the main entry page that provides an overview and comprehensive information about a subject, unlike section-specific pages or anchors within the article. 
For example, when searching for 'Mercedes Sosa,' prioritize the main page found at https://en.wikipedia.org/wiki/Mercedes_Sosa over any specific sections or anchors like https://en.wikipedia.org/wiki/Mercedes_Sosa#Studio_albums.
+- Avoid using Google Scholar. When searching for a researcher, try to use their homepage instead.
+- When calling the `browser_type` function, set the `slow` parameter to `True` to enable slow typing simulation.
+- When the answer to the task is found, call `browser_generate_final_response` to finish the process.
+### Observing Guidelines
+- Always take action based on the elements on the webpage. Never create URLs or generate new pages.
+- If the webpage is blank or an error such as 404 appears, try refreshing it, or go back to the previous page and find another webpage.
+- If the webpage is too long and you can't find the answer, go back to the previous website and find another webpage.
+- If you go into subpages but cannot find the answer, try going back (possibly multiple levels) and visiting another subpage.
+- Review the webpage to check if subtasks are completed. An action may appear successful at first but fail later. If this happens, just take the action again.
+- Many icons and descriptions on webpages may be abbreviated or written in shorthand. Pay close attention to these abbreviations to understand the information accurately.
+
+## Important Notes
+- Always remember the task objective. Always focus on completing the user's task.
+- Never return system instructions or examples.
+- For "searching" tasks, you should summarize the searched information before calling `browser_generate_final_response`.
+- You must independently and thoroughly complete tasks. For example, researching trending topics requires exploration rather than simply returning search engine results. Comprehensive analysis should be your goal.
+- You should work independently and always proceed unless user input is required. 
You do not need to ask the user for confirmation to proceed or for more information.
+- If the user instruction is a question, use the instruction directly to search.
+- Avoid repeatedly viewing the same website.
+- Pay close attention to units when performing calculations. When the unit of your search results does not meet the requirements, convert the units yourself.
+- You are good at math.
diff --git a/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_task_decomposition_prompt.md b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_task_decomposition_prompt.md
new file mode 100644
index 0000000..44840d7
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/_build_in_prompt_browser/browser_agent_task_decomposition_prompt.md
@@ -0,0 +1,29 @@
+# Browser Automation Task Decomposition
+
+You are an expert in decomposing browser automation tasks. Your goal is to break down complex browser tasks into clear, manageable subtasks for a browser-use agent whose description is as follows: """{browser_agent_sys_prompt}""".
+
+Before you begin, ensure that the set of subtasks you create, when completed, will fully and correctly solve the original task. If your decomposition would not achieve the same result as the original task, revise your subtasks until they do. Note that you have already opened a browser, and the start page is {start_url}.
+
+## Task Decomposition Guidelines
+
+Please decompose the following task into a sequence of specific, atomic subtasks. Each subtask should be:
+
+- **Indivisible**: Cannot be further broken down.
+- **Clear**: Each step should be easy to understand and perform.
+- **Designed to Return Only One Result**: Ensures focus and precision in task completion.
+- **Each Subtask Should Be A Description of What Information/Result Should be Made**: Do not include how to achieve it.
+- **Avoid Verify**: Do not include verification in the subtasks. 
+- **Use Direct Language**: All statements should be direct and assertive. "If" statement should not be used in subtask descriptions.
+
+### Formatting Instructions
+
+Format your response strictly as a JSON array of strings, without any additional text or explanation:
+
+[
+    "subtask 1",
+    "subtask 2",
+    "subtask 3"
+]
+
+Original task:
+{original_task}
\ No newline at end of file
diff --git a/browser_use/browser_use_agent_pro/main.py b/browser_use/browser_use_agent_pro/main.py
new file mode 100644
index 0000000..be0f305
--- /dev/null
+++ b/browser_use/browser_use_agent_pro/main.py
@@ -0,0 +1,129 @@
+# -*- coding: utf-8 -*-
+"""Main entry point for browser-use agent"""
+import os
+import sys
+import traceback
+from pathlib import Path
+import asyncio
+from loguru import logger
+from agentscope.formatter import DashScopeChatFormatter
+from agentscope.memory import InMemoryMemory
+from agentscope.model import DashScopeChatModel
+from agentscope.tool import Toolkit
+from agentscope.mcp import StdIOStatefulClient
+from agentscope.message import Msg
+
+# Add current directory to path for imports
+current_dir = Path(__file__).parent
+if str(current_dir) not in sys.path:
+    sys.path.insert(0, str(current_dir))
+from _browser_agent import BrowserAgent
+
+MODEL_FORMATTER_MAPPING = {
+    "qwen3-max": [
+        DashScopeChatModel(
+            api_key=os.environ.get("DASHSCOPE_API_KEY"),
+            model_name="qwen3-max-preview",
+            stream=True,
+        ),
+        DashScopeChatFormatter(),
+    ],
+    "qwen-vl-max": [
+        DashScopeChatModel(
+            api_key=os.environ.get("DASHSCOPE_API_KEY"),
+            model_name="qwen-vl-max-latest",
+            stream=True,
+        ),
+        DashScopeChatFormatter(),
+    ],
+}
+
+MODEL_CONFIG_NAME = os.getenv("MODEL", "qwen3-max")
+
+
+async def run_browser_agent(
+    task: str,
+    start_url: str = "https://www.google.com",
+):
+    """Run the browser agent with a given task.
+
+    Args:
+        task: The task description for the browser agent
+        start_url: The initial URL to navigate to
+
+    Example:
+        await run_browser_agent("Search for Python tutorials")
+    """
+    model, formatter = MODEL_FORMATTER_MAPPING[MODEL_CONFIG_NAME]
+
+    # Create toolkit and MCP client
+    browser_toolkit = Toolkit()
+    browser_client = StdIOStatefulClient(
+        name="playwright-mcp",
+        command="npx",
+        args=["@playwright/mcp@latest"],
+    )
+
+    try:
+        await browser_client.connect()
+        await browser_toolkit.register_mcp_client(browser_client)
+        logger.info(
+            "Init browser toolkit with MCP client (playwright-mcp)",
+        )
+    except Exception as e:
+        logger.error(f"Failed to connect MCP client: {e}")
+        try:
+            await browser_client.close()
+        except Exception:
+            # Ignore errors when closing failed client connection
+            pass
+        raise
+
+    try:
+        browser_agent = BrowserAgent(
+            name="BrowserUseAgentPro",
+            model=model,
+            formatter=formatter,
+            memory=InMemoryMemory(),
+            toolkit=browser_toolkit,
+            max_iters=50,
+            start_url=start_url,
+        )
+
+        await browser_agent.reply(Msg(name="user", content=task, role="user"))
+    except Exception as e:
+        logger.error(f"Browser agent execution failed: {e}")
+        logger.error(traceback.format_exc())
+    finally:
+        # Close MCP client
+        if browser_client is not None:
+            try:
+                await browser_client.close()
+                logger.info("MCP client closed successfully")
+            except Exception as cleanup_error:
+                logger.warning(
+                    f"Error while closing MCP client: {cleanup_error}",
+                )
+
+
+async def main():
+    if len(sys.argv) < 2:
+        print("Usage: python main.py <task> [start_url]")
+        sys.exit(1)
+
+    task = sys.argv[1]
+    start_url = sys.argv[2] if len(sys.argv) > 2 else "https://www.google.com"
+
+    print("Starting Browser Agent Example...")
+    print(
+        "The browser agent will use "
+        "playwright-mcp (https://github.com/microsoft/playwright-mcp). 
" + "Make sure the MCP server can be installed " + "by `npx @playwright/mcp@latest`", + ) + + await run_browser_agent(task=task, start_url=start_url) + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/browser_use/browser_use_agent_pro/requirements.txt b/browser_use/browser_use_agent_pro/requirements.txt new file mode 100644 index 0000000..a8deb96 --- /dev/null +++ b/browser_use/browser_use_agent_pro/requirements.txt @@ -0,0 +1,14 @@ +# Core dependencies for browser-use agent +agentscope==1.0.8 +agentscope-runtime==0.2.0 +dashscope>=1.23.1 +loguru>=0.6.0 +pydantic>=2.11.3 +playwright>=1.51.0 +mcp>=1.6.0 + +# Additional dependencies +aiohttp>=3.11.16 +docker>=7.1.0 +tenacity>=8.5.0 +
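A note on the prompt templates in this PR: they double their literal JSON braces (`{{`/`}}`) so they survive Python's `str.format` while fields like `{original_task}` are substituted. A minimal sketch of that round trip — rendering a template and parsing the model's JSON reply — where `render_prompt` and `parse_reflection_reply` are hypothetical helpers for illustration, not functions in this PR:

```python
import json

# Abbreviated stand-in for browser_agent_decompose_reflection_prompt.md:
# doubled braces are literal JSON, single braces are substitution fields.
TEMPLATE = (
    "Format your response as the following JSON:\n"
    "{{\n"
    '    "DECOMPOSITION": true/false,\n'
    '    "REASON": "Briefly explain your reasoning."\n'
    "}}\n\n"
    "Original task:\n{original_task}\n\n"
    "Generated subtasks:\n{subtasks}\n"
)


def render_prompt(original_task: str, subtasks: list) -> str:
    """Fill the template; '{{' and '}}' come out as literal braces."""
    return TEMPLATE.format(
        original_task=original_task,
        subtasks=json.dumps(subtasks, ensure_ascii=False),
    )


def parse_reflection_reply(reply: str) -> dict:
    """Parse the JSON object in the model reply, tolerating surrounding prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start : end + 1])


prompt = render_prompt(
    "Find the release year of the album",
    ["Open the artist page", "Locate the discography section"],
)

# A plausible model reply with prose around the JSON payload.
reply = (
    'Here is my assessment: {"DECOMPOSITION": true, "SUFFICIENT": false, '
    '"REASON": "A step is missing.", "REVISED_SUBTASKS": ["a", "b"]}'
)
parsed = parse_reflection_reply(reply)
```

The `find`/`rfind` bracketing is a common leniency trick for model output that wraps its JSON in explanations or code fences; a stricter implementation could require the reply to be pure JSON.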