stock/docs/source/monitoring.rst
2026-02-27 03:17:12 +08:00


Monitoring & Alerts
===================

OpenClaw provides comprehensive monitoring and alerting capabilities to track system health and trading performance.

Overview
--------

Monitoring Components
~~~~~~~~~~~~~~~~~~~~~

* **Metrics Collection**: Performance and system metrics
* **Alerting**: Real-time notifications for critical events
* **Dashboards**: Visual monitoring interface
* **Logging**: Structured logging for debugging
* **Health Checks**: System availability monitoring


Quick Start
-----------

Basic Monitoring
~~~~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.metrics import MetricsCollector

   # Create collector
   collector = MetricsCollector()

   # Record metric
   collector.record("trade.pnl", value=150.0, tags={
       "symbol": "AAPL",
       "strategy": "trend_following"
   })

   # Get statistics
   stats = collector.get_stats("trade.pnl")
   print(f"Avg PnL: {stats.mean:.2f}")


Setting Up Alerts
~~~~~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.alerts import AlertManager, AlertRule

   # Create alert manager
   alerts = AlertManager()

   # Define alert rule
   rule = AlertRule(
       name="high_drawdown",
       condition="drawdown > 0.10",
       severity="critical",
       channels=["email", "slack"]
   )

   # Add rule
   alerts.add_rule(rule)

   # Check conditions
   # (agent_state is assumed to be the current agent state, obtained elsewhere)
   alerts.check_all(agent_state)


Metrics Collection
------------------

Built-in Metrics
~~~~~~~~~~~~~~~~

Trading Metrics:

* Trade count and frequency
* Win/loss ratio
* Average profit/loss
* Sharpe ratio
* Maximum drawdown
* Position sizes

System Metrics:

* API latency
* Error rates
* Decision costs
* Agent survival rates
* Workflow execution time

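
As a rough illustration of how some of the trading metrics listed above are commonly defined, the sketch below computes a win rate, average PnL, maximum drawdown, and an unannualized per-trade Sharpe ratio from a list of per-trade PnL values, using only the standard library. The ``summarize_trades`` function and its ``starting_balance`` parameter are hypothetical and not part of OpenClaw's API; OpenClaw's own calculations may differ, for example in annualization or the risk-free rate.

.. code-block:: python

   from statistics import mean, stdev


   def summarize_trades(pnls, starting_balance=1000.0):
       """Hypothetical helper: summarize a sequence of per-trade PnL values."""
       if not pnls:
           raise ValueError("pnls must contain at least one trade")

       wins = sum(1 for p in pnls if p > 0)

       # Walk the equity curve to find the largest peak-to-trough decline.
       equity = starting_balance
       peak = starting_balance
       max_drawdown = 0.0
       for p in pnls:
           equity += p
           peak = max(peak, equity)
           if peak > 0:
               max_drawdown = max(max_drawdown, (peak - equity) / peak)

       # Per-trade Sharpe ratio: no annualization, zero risk-free rate.
       spread = stdev(pnls) if len(pnls) > 1 else 0.0
       sharpe = mean(pnls) / spread if spread > 0 else 0.0

       return {
           "trade_count": len(pnls),
           "win_rate": wins / len(pnls),
           "avg_pnl": mean(pnls),
           "max_drawdown": max_drawdown,
           "sharpe": sharpe,
       }


   summary = summarize_trades([100.0, -50.0, 150.0, -200.0])

Drawdown is expressed here as a fraction of the running peak, which matches the fractional thresholds (such as ``0.10``) used in the alert examples in this guide.
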

Custom Metrics
~~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.metrics import Metric

   # Create custom metric
   custom_metric = Metric(
       name="custom_factor.performance",
       type="gauge",
       description="Performance of custom trading factor",
       unit="percent"
   )

   # Record value
   custom_metric.record(15.5, tags={
       "factor_name": "my_factor",
       "symbol": "AAPL"
   })


Metric Types
~~~~~~~~~~~~

**Counter**: Cumulative values (e.g., total trades)

.. code-block:: python

   collector.increment("trades.total", tags={"symbol": "AAPL"})

**Gauge**: Point-in-time values (e.g., current balance)

.. code-block:: python

   collector.gauge("agent.balance", value=1500.0, tags={"agent_id": "agent_001"})

**Histogram**: Distribution of values (e.g., trade PnL)

.. code-block:: python

   collector.histogram("trade.pnl", value=100.0)

**Timer**: Duration measurements (e.g., analysis time)

.. code-block:: python

   with collector.timer("analysis.duration"):
       result = agent.analyze("AAPL")

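
For readers curious how a timer of this kind can work, here is a minimal sketch of a duration-measuring context manager built on ``contextlib`` and ``time.perf_counter``. The ``timer`` function and the ``durations`` dictionary are hypothetical stand-ins and do not describe OpenClaw's implementation.

.. code-block:: python

   import time
   from collections import defaultdict
   from contextlib import contextmanager

   # Hypothetical in-memory store: metric name -> list of durations in seconds.
   durations = defaultdict(list)


   @contextmanager
   def timer(name):
       """Record how long the enclosed block takes, even if it raises."""
       start = time.perf_counter()
       try:
           yield
       finally:
           durations[name].append(time.perf_counter() - start)


   with timer("analysis.duration"):
       sum(range(100_000))  # placeholder for real work

Recording in a ``finally`` clause means failed operations are timed too, which is usually what you want when investigating latency.
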

Alerting System
---------------

Alert Rules
~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.alerts import AlertRule, AlertCondition

   # Create rule with multiple conditions
   rule = AlertRule(
       name="agent_distress",
       description="Agent is in critical condition",
       conditions=[
           AlertCondition(
               metric="agent.balance",
               operator="less_than",
               threshold=300.0
           ),
           AlertCondition(
               metric="agent.drawdown",
               operator="greater_than",
               threshold=0.70
           )
       ],
       severity="critical",
       cooldown_minutes=60
   )

   alerts.add_rule(rule)

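
The ``cooldown_minutes`` setting suggests that a rule which has just fired stays silent for a while. One possible way to implement that kind of suppression is sketched below. The ``CooldownTracker`` class is hypothetical and only illustrates the idea; it is not OpenClaw's implementation.

.. code-block:: python

   from datetime import datetime, timedelta


   class CooldownTracker:
       """Hypothetical helper that suppresses repeat alerts per rule name."""

       def __init__(self, cooldown_minutes):
           self.cooldown = timedelta(minutes=cooldown_minutes)
           self.last_fired = {}  # rule name -> time of the last alert sent

       def should_fire(self, rule_name, now):
           last = self.last_fired.get(rule_name)
           if last is not None and now - last < self.cooldown:
               return False  # still cooling down; drop the alert
           self.last_fired[rule_name] = now
           return True


   tracker = CooldownTracker(cooldown_minutes=60)
   t0 = datetime(2026, 1, 1, 9, 0)
   first = tracker.should_fire("agent_distress", t0)
   repeat = tracker.should_fire("agent_distress", t0 + timedelta(minutes=30))
   later = tracker.should_fire("agent_distress", t0 + timedelta(minutes=61))

In this sketch a suppressed alert does not reset the window, so a condition that stays true produces at most one notification per cooldown period.
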

Alert Channels
~~~~~~~~~~~~~~

Email Alerts:

.. code-block:: python

   from openclaw.monitoring.channels import EmailChannel

   # Placeholder credentials; avoid hard-coding real secrets in source files
   email = EmailChannel(
       smtp_server="smtp.gmail.com",
       smtp_port=587,
       username="alerts@example.com",
       password="app_password"
   )

   alerts.register_channel("email", email)

Slack Alerts:

.. code-block:: python

   from openclaw.monitoring.channels import SlackChannel

   slack = SlackChannel(
       webhook_url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
   )

   alerts.register_channel("slack", slack)

Webhook Alerts:

.. code-block:: python

   from openclaw.monitoring.channels import WebhookChannel

   # The bearer token below is a placeholder
   webhook = WebhookChannel(
       url="https://api.example.com/alerts",
       headers={"Authorization": "Bearer token123"}
   )

   alerts.register_channel("webhook", webhook)


Alert Severity Levels
~~~~~~~~~~~~~~~~~~~~~

* **INFO**: General information, no action required
* **WARNING**: Attention needed soon
* **CRITICAL**: Immediate action required
* **EMERGENCY**: System-stopping event

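
The levels above appear to escalate from INFO to EMERGENCY, so they map naturally onto an integer enumeration, which turns a filter such as "notify only at WARNING or above" into a simple comparison. The ``Severity`` enum, its numeric values, and the two helper functions below are a hypothetical sketch, not OpenClaw's own types.

.. code-block:: python

   from enum import IntEnum


   class Severity(IntEnum):
       """Hypothetical ordering of the severity levels listed above."""
       INFO = 10
       WARNING = 20
       CRITICAL = 30
       EMERGENCY = 40


   def parse_severity(name):
       """Accept lowercase names such as "critical", as used in the rule examples."""
       return Severity[name.upper()]


   def should_notify(severity, minimum=Severity.WARNING):
       """Return True when the alert is at or above the minimum level."""
       return severity >= minimum

With these definitions, ``should_notify(parse_severity("critical"))`` evaluates to ``True``, while an INFO alert is filtered out under the default minimum.
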

Dashboard
---------

Web Dashboard
~~~~~~~~~~~~~

Start the monitoring dashboard:

.. code-block:: bash

   openclaw dashboard --port 8080

Then open http://localhost:8080 in a browser.

Dashboard components:

* Real-time P&L chart
* Agent status overview
* System health metrics
* Recent alerts
* Active trades
* Performance statistics


Custom Dashboards
~~~~~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.dashboard.builder import DashboardBuilder

   builder = DashboardBuilder()

   # Add widgets
   builder.add_line_chart(
       title="Portfolio Value",
       metric="portfolio.value",
       time_range="1d"
   )

   builder.add_gauge(
       title="Win Rate",
       metric="performance.win_rate",
       min_value=0,
       max_value=1
   )

   builder.add_table(
       title="Active Agents",
       query="SELECT * FROM agents WHERE status='active'"
   )

   # Build dashboard
   dashboard = builder.build()
   dashboard.serve(port=8080)


Logging
-------

Structured Logging
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.utils.logging import get_logger

   logger = get_logger("my_module")

   # Different log levels
   logger.debug("Debug information")
   logger.info("General information")
   logger.warning("Warning message")
   logger.error("Error occurred")
   logger.critical("Critical failure")

   # Structured logging
   logger.info("Trade executed", extra={
       "symbol": "AAPL",
       "side": "buy",
       "quantity": 100,
       "price": 150.0
   })


Log Configuration
~~~~~~~~~~~~~~~~~

.. code-block:: yaml

   # config/logging.yaml
   logging:
     level: INFO
     format: json
     outputs:
       - type: file
         path: /var/log/openclaw/trading.log
         rotation: "1 day"
         retention: "30 days"
       - type: console
         format: text

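
To show what ``format: json`` output can look like, the following sketch uses only the standard ``logging`` module to emit one JSON object per record, including any fields passed through ``extra``. The ``JsonFormatter`` class and the chosen field names are a hypothetical illustration; OpenClaw's actual formatter and output schema may differ.

.. code-block:: python

   import io
   import json
   import logging

   # Attributes present on every LogRecord; anything else came from ``extra``.
   _STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
   _STANDARD_ATTRS |= {"message", "asctime"}


   class JsonFormatter(logging.Formatter):
       """Hypothetical formatter: serialize each record as one JSON line."""

       def format(self, record):
           payload = {
               "level": record.levelname,
               "logger": record.name,
               "message": record.getMessage(),
           }
           # Copy over the custom fields supplied via ``extra``.
           for key, value in vars(record).items():
               if key not in _STANDARD_ATTRS and key not in payload:
                   payload[key] = value
           return json.dumps(payload, default=str)


   stream = io.StringIO()  # stands in for a file or console output
   handler = logging.StreamHandler(stream)
   handler.setFormatter(JsonFormatter())

   demo_logger = logging.getLogger("json_demo")
   demo_logger.setLevel(logging.INFO)
   demo_logger.propagate = False
   demo_logger.addHandler(handler)

   demo_logger.info("Trade executed", extra={"symbol": "AAPL", "quantity": 100})
   line = stream.getvalue().strip()

Each line can then be parsed independently with ``json.loads``, which is what makes this format convenient for log aggregation tools.
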

Health Checks
-------------

System Health
~~~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.health import HealthChecker

   health = HealthChecker()

   # Register checks
   health.add_check("database", check_database_connection)
   health.add_check("exchange_api", check_exchange_api)
   health.add_check("data_feed", check_data_feed)

   # Run checks
   status = health.check_all()

   if status.healthy:
       print("System healthy")
   else:
       for check, result in status.checks.items():
           if not result.healthy:
               print(f"{check}: FAILED - {result.message}")

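
The check functions registered above are not defined in this guide. Judging from how the results are consumed (``result.healthy`` and ``result.message``), a check might look roughly like the sketch below. ``CheckResult``, ``check_tcp_endpoint``, and ``run_checks`` are all hypothetical names; consult the ``HealthChecker`` API for the return type it actually expects.

.. code-block:: python

   import socket
   from dataclasses import dataclass


   @dataclass
   class CheckResult:
       """Hypothetical result type mirroring the attributes used above."""
       healthy: bool
       message: str = ""


   def check_tcp_endpoint(host, port, timeout=1.0):
       """Report whether a TCP connection to host:port can be opened."""
       try:
           with socket.create_connection((host, port), timeout=timeout):
               return CheckResult(healthy=True, message="reachable")
       except OSError as exc:
           return CheckResult(healthy=False, message=str(exc))


   def run_checks(checks):
       """Run each named check; a check that raises counts as unhealthy."""
       results = {}
       for name, check in checks.items():
           try:
               results[name] = check()
           except Exception as exc:
               results[name] = CheckResult(healthy=False, message=repr(exc))
       return results


   results = run_checks({
       "always_ok": lambda: CheckResult(healthy=True, message="ok"),
       "always_raises": lambda: 1 / 0,
   })

Catching exceptions inside the runner keeps one misbehaving check from hiding the status of the others.
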

Agent Health
~~~~~~~~~~~~

.. code-block:: python

   from openclaw.monitoring.health import AgentHealthMonitor

   monitor = AgentHealthMonitor()

   # Check agent health
   # (agents, alerts, and logger are assumed to exist, as in the earlier examples)
   for agent in agents:
       health = monitor.check_agent(agent)

       if health.status == "critical":
           alerts.send(f"Agent {agent.agent_id} is critical")
       elif health.status == "struggling":
           logger.warning(f"Agent {agent.agent_id} is struggling")


Monitoring Best Practices
-------------------------

1. **Monitor key metrics**: Focus on P&L, drawdown, and survival rates
2. **Set appropriate thresholds**: Avoid alert fatigue
3. **Use cooldown periods**: Prevent alert spam
4. **Regular health checks**: Automated system verification
5. **Centralized logging**: Aggregate logs for analysis
6. **Retention policies**: Manage data storage costs