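MBE Monitoring and Logging Guide

This document describes how to configure and use the monitoring and logging system of MBE (Mises Behavior Engine).

Logging System

MBE uses Loguru as its logging library and supports both structured logs (JSON format) and colored text logs.

Configuration Options

Configure the following in the .env file:

# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL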
LOG_LEVEL=INFO

# Whether to enable JSON-format logs (recommended for production)
ENABLE_JSON_LOGS=false

# Log file path (leave empty to disable file output)
LOG_FILE=/var/log/mbe/app.log

Log Format

Development (colored text)

2026-02-06 10:30:45.123 | INFO     | main:root:156 [rid:a1b2c3d4] - MBE Monorepo started successfully!

[rid:a1b2c3d4]: the request tracing ID (first 8 characters), used to correlate all logs from the same request

Production (JSON format)

{
  "timestamp": "2026-02-06T10:30:45.123456",
  "level": "INFO",
  "message": "MBE Monorepo started successfully!",
  "module": "main",
  "function": "root",
  "line": 156,
  "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

Usage in Code

from loguru import logger

# Basic logging
logger.info("User logged in")
logger.warning("High memory usage detected")
logger.error("Database connection failed")

# Logging with extra fields (via bind)
logger.bind(user_id=123, action="purchase").info("User made a purchase")

# In JSON logs this appears as:
# {"timestamp": "...", "level": "INFO", "message": "User made a purchase",
#  "extra": {"user_id": 123, "action": "purchase"}}

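Log File Rotation

Log files are rotated automatically:

- A file is rotated once it reaches 500 MB
- Rotated logs are retained for 30 days
- Old logs are compressed to zip automatically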
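
The rotation policy above maps onto the `rotation`, `retention`, and `compression` arguments of Loguru's `logger.add`. The following is a minimal sketch, assuming the settings come from the environment variables described earlier. The helper name `file_sink_options` is hypothetical, and MBE's actual setup code is not shown in this guide.

```python
import os
from typing import Any, Mapping, Optional


def file_sink_options(env: Mapping[str, str]) -> Optional[dict[str, Any]]:
    """Build keyword arguments for a Loguru file sink (hypothetical helper).

    Returns None when LOG_FILE is empty, meaning no file sink is added.
    """
    path = env.get("LOG_FILE", "").strip()
    if not path:
        return None
    return {
        "sink": path,
        "level": env.get("LOG_LEVEL", "INFO").upper(),
        "rotation": "500 MB",    # rotate once the file reaches 500 MB
        "retention": "30 days",  # delete rotated files older than 30 days
        "compression": "zip",    # compress rotated files as zip
    }


if __name__ == "__main__":
    options = file_sink_options(os.environ)
    print(options)
    # With Loguru installed, the options could be applied like this:
    #     from loguru import logger
    #     if options is not None:
    #         logger.add(**options)
```

Keeping the option building separate from the `logger.add` call makes the policy easy to unit test without touching the filesystem.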

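Monitoring Metrics

MBE exposes several monitoring endpoints for observing system state in real time.

Configuration

Enable metrics in .env:

ENABLE_METRICS=true

Monitoring Endpoints

1. `/api/metrics` - Metrics in JSON format

Returns application and system metrics as JSON: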

curl http://localhost:8000/api/metrics

Example response:

{
  "timestamp": "2026-02-06T10:30:45.123456",
  "uptime_seconds": 3600,
  "application": {
    "requests_total": 1523,
    "requests_by_status": {
      "200": 1450,
      "404": 50,
      "500": 23
    },
    "avg_response_time_ms": 45.23
  },
  "system": {
    "cpu": {
      "percent": 35.2,
      "count": 8
    },
    "memory": {
      "total_bytes": 17179869184,
      "available_bytes": 8589934592,
      "percent": 50.0
    },
    "disk": {
      "total_bytes": 1099511627776,
      "used_bytes": 549755813888,
      "percent": 50.0
    }
  }
}

2. `/api/metrics/prometheus` - Prometheus format

Serves metrics in the Prometheus-compatible text format, ready to be scraped by Prometheus:

curl http://localhost:8000/api/metrics/prometheus

Example response:

# HELP mbe_requests_total Total number of requests
# TYPE mbe_requests_total counter
mbe_requests_total 1523

# HELP mbe_avg_response_time_ms Average response time in milliseconds
# TYPE mbe_avg_response_time_ms gauge
mbe_avg_response_time_ms 45.23

# HELP mbe_cpu_percent CPU usage percentage
# TYPE mbe_cpu_percent gauge
mbe_cpu_percent 35.2
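
The text format shown above is simple enough to produce by hand: each metric gets a `# HELP` line, a `# TYPE` line, and a sample line. The sketch below illustrates that layout. The `Metric` tuple and `render_prometheus` function are hypothetical names for illustration and are not taken from MBE's code.

```python
from typing import Iterable, NamedTuple


class Metric(NamedTuple):
    name: str
    help_text: str
    kind: str      # "counter" or "gauge"
    value: float


def render_prometheus(metrics: Iterable[Metric]) -> str:
    """Render metrics in the Prometheus text exposition format."""
    blocks = []
    for m in metrics:
        blocks.append(
            f"# HELP {m.name} {m.help_text}\n"
            f"# TYPE {m.name} {m.kind}\n"
            f"{m.name} {m.value}"
        )
    # Blank line between metrics for readability; the output ends with a newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render_prometheus([
        Metric("mbe_requests_total", "Total number of requests", "counter", 1523),
        Metric("mbe_cpu_percent", "CPU usage percentage", "gauge", 35.2),
    ]))
```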

Prometheus Integration

Add the following to the Prometheus configuration file:

scrape_configs:
  - job_name: 'mbe'
    scrape_interval: 15s
    static_configs:
      - targets: ['mbe-api:8000']
    metrics_path: '/api/metrics/prometheus'

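Request Tracing

Every HTTP request is automatically assigned a unique request tracing ID, which correlates all logs produced for that request.

Automatic Generation

The middleware automatically generates a UUID-format tracing ID for each request:

a1b2c3d4-e5f6-7890-abcd-ef1234567890

Client-Specified

Clients can specify the request ID through an HTTP header: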

curl -H "X-Request-ID: my-custom-id-001" http://localhost:8000/api/health

Response Headers

The tracing ID is included in the response headers:

X-Request-ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
X-Response-Time: 12.34ms
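
The behaviour described in this section (reuse a client-supplied ID, otherwise generate a UUID, then echo it back with the elapsed time) can be sketched with the standard library alone. The function names below are hypothetical, and how MBE's middleware is actually implemented is not shown in this guide.

```python
import uuid
from typing import Mapping

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Return the client-supplied ID if present, otherwise a fresh UUID4."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get(REQUEST_ID_HEADER.lower(), "").strip()
    return supplied or str(uuid.uuid4())


def response_headers(request_id: str, elapsed_ms: float) -> dict[str, str]:
    """Build the tracing headers attached to the response."""
    return {
        REQUEST_ID_HEADER: request_id,
        "X-Response-Time": f"{elapsed_ms:.2f}ms",
    }


if __name__ == "__main__":
    rid = resolve_request_id({"X-Request-ID": "my-custom-id-001"})
    print(response_headers(rid, 12.34))
```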

Log Correlation

All logs belonging to the same request carry the same request_id:

{"timestamp": "...", "level": "INFO", "message": "Request started: GET /api/chat", "request_id": "a1b2c3d4-..."}
{"timestamp": "...", "level": "INFO", "message": "User authenticated", "request_id": "a1b2c3d4-..."}
{"timestamp": "...", "level": "INFO", "message": "Request completed: GET /api/chat -> 200", "request_id": "a1b2c3d4-..."}
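
One common way to get the same request_id onto every log line without passing it through each function call is a context variable set by the middleware. The sketch below uses the standard `logging` module and `contextvars` so that it runs without extra dependencies. It is an illustration of the idea, not MBE's implementation. With Loguru, `logger.contextualize(request_id=...)` plays a similar role.

```python
import contextvars
import logging

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record


def handle_request(request_id: str, log: logging.Logger) -> None:
    """Simulate a middleware that scopes the ID to one request."""
    token = request_id_var.set(request_id)
    try:
        log.info("Request started")
        log.info("Request completed")
    finally:
        request_id_var.reset(token)  # restore the previous value


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(levelname)s [rid:%(request_id)s] %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    demo_log = logging.getLogger("mbe.demo")
    demo_log.addHandler(handler)
    demo_log.setLevel(logging.INFO)
    handle_request("a1b2c3d4", demo_log)
```

Because context variables are task-local under asyncio, concurrent requests do not see each other's IDs.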

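Health Checks

MBE provides health check endpoints at several levels of detail.

1. `/api/health` - Basic health check

Intended for load balancer probes; responds quickly:

curl http://localhost:8000/api/health

Response: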

{
  "status": "healthy",
  "service": "mises-behavior-engine",
  "timestamp": 1675680645.123456
}

2. `/api/health/detailed` - Detailed health check

Reports on the database, Redis, system resources, and process information:

curl http://localhost:8000/api/health/detailed

Response:

{
  "status": "healthy",
  "checks": {
    "service": "healthy",
    "timestamp": 1675680645.123456,
    "database": {
      "connected": true,
      "latency_ms": 2.34
    },
    "redis": {
      "connected": true,
      "latency_ms": 1.12
    },
    "system": {
      "cpu_percent": 35.2,
      "memory_percent": 50.0,
      "disk_percent": 45.6
    },
    "process": {
      "pid": 12345,
      "cpu_percent": 5.2,
      "memory_mb": 512.34,
      "threads": 8,
      "open_files": 15
    }
  }
}
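
A script or probe that consumes the detailed response usually reduces it to a list of failing components. The sketch below assumes the payload shape shown above. The function name `failing_checks` and the 90% resource limit are illustrative assumptions, not values defined by MBE.

```python
from typing import Any


def failing_checks(payload: dict[str, Any], max_percent: float = 90.0) -> list[str]:
    """List the components in a detailed health payload that look unhealthy."""
    checks = payload.get("checks", {})
    problems = []
    # Dependencies report a boolean "connected" flag.
    for name in ("database", "redis"):
        if not checks.get(name, {}).get("connected", False):
            problems.append(name)
    # System resources are reported as percentages; flag any above the limit.
    for key, value in checks.get("system", {}).items():
        if value > max_percent:
            problems.append(f"system.{key}")
    return problems


if __name__ == "__main__":
    sample = {
        "status": "healthy",
        "checks": {
            "database": {"connected": True, "latency_ms": 2.34},
            "redis": {"connected": False},
            "system": {"cpu_percent": 95.0, "memory_percent": 50.0, "disk_percent": 45.6},
        },
    }
    print(failing_checks(sample))
```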

3. `/api/health/resilience` - Resilience system status

Shows circuit breaker state, retry statistics, and LLM client status:

curl http://localhost:8000/api/health/resilience

4. `/api/health/critique` - Self-Critique system status

Shows the status of MBE's core validation system:

curl http://localhost:8000/api/health/critique

Production Configuration

1. Enable JSON Logs

Set the following in .env:

ENABLE_JSON_LOGS=true
LOG_FILE=/var/log/mbe/app.log
LOG_LEVEL=INFO

2. Configure Log Collection

Use Filebeat or Fluentd to ship logs to Elasticsearch:

Filebeat configuration example

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/mbe/app.log
    json.keys_under_root: true
    json.add_error_key: true

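output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "mbe-logs-%{+yyyy.MM.dd}"

# A custom index name also requires matching template settings
setup.template.name: "mbe-logs"
setup.template.pattern: "mbe-logs-*"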

3. Set Up Prometheus Monitoring

Scrape MBE metrics with Prometheus and visualize them in Grafana:

scrape_configs:
  - job_name: 'mbe-production'
    scrape_interval: 30s
    static_configs:
      - targets: ['mbe-api:8000']
    metrics_path: '/api/metrics/prometheus'

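4. Configure Alerts

Define alerting rules in a Prometheus rule file (loaded through rule_files); AlertManager then routes the resulting alerts to notification channels: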

groups:
  - name: mbe_alerts
    rules:
      - alert: HighResponseTime
        expr: mbe_avg_response_time_ms > 1000
        for: 5m
        annotations:
          summary: "MBE response time too high"

      # `up` reports whether Prometheus can scrape the MBE target, not database state
      - alert: MBEInstanceDown
        expr: up{job="mbe-production"} == 0
        for: 1m
        annotations:
          summary: "MBE instance is down or cannot be scraped"

5. Nginx Log Integration

Forward X-Request-ID in the Nginx configuration:

# log_format is only valid in the http context, so define it outside server/location
log_format trace '$remote_addr - $request_id - [$time_local] '
                 '"$request" $status $body_bytes_sent '
                 '"$http_referer" "$http_user_agent"';

location / {
    proxy_pass http://mbe-api:8000;

    # Forward the request tracing ID
    proxy_set_header X-Request-ID $request_id;

    # Write to the Nginx access log using the trace format
    access_log /var/log/nginx/mbe_access.log trace;
}

Troubleshooting Examples

1. Trace Slow Requests

Spot slow requests through the X-Response-Time header:

curl -I http://localhost:8000/api/chat
# X-Response-Time: 1234.56ms

Then search the logs for that request's request_id:

grep "a1b2c3d4" /var/log/mbe/app.log | jq .

2. Analyze Error Trends

Query error logs in Elasticsearch:

{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "ERROR"}},
        {"range": {"timestamp": {"gte": "now-1h"}}}
      ]
    }
  }
}

3. Monitor Resource Usage

Watch CPU and memory in real time:

watch -n 2 'curl -s http://localhost:8000/api/metrics | jq .system'

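APM (Application Performance Monitoring)

MBE includes a lightweight built-in APM system that tracks code performance through function decorators and context managers.

Configuration

APM is enabled by default and needs no additional configuration.

Usage

1. Trace Functions with a Decorator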

from utils.apm import trace

@trace("get_user_profile", user_type="premium")
async def get_user_profile(user_id: int):
    # Execution time and errors are tracked automatically
    return await db.query(...)

# Both sync and async functions are supported
@trace("process_data")
def process_data(data):
    return data.transform()
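
For readers curious how one decorator can serve both sync and async functions, the sketch below shows one possible approach: inspect the wrapped function and return a matching wrapper. It is an illustration only. The real `utils.apm.trace` is not shown in this guide, and the in-memory `SAMPLES` store is a hypothetical stand-in for whatever MBE records internally.

```python
import asyncio
import functools
import inspect
import time
from collections import defaultdict

# operation name -> list of (duration_ms, failed) samples; in-memory only.
SAMPLES: dict[str, list[tuple[float, bool]]] = defaultdict(list)


def trace(operation: str, **tags):
    """Record duration and failure for each call (illustrative sketch).

    `tags` are accepted for parity with the usage above but are not stored here.
    """
    def decorator(func):
        def record(start: float, failed: bool) -> None:
            elapsed_ms = (time.perf_counter() - start) * 1000
            SAMPLES[operation].append((elapsed_ms, failed))

        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                start, failed = time.perf_counter(), True
                try:
                    result = await func(*args, **kwargs)
                    failed = False
                    return result
                finally:
                    record(start, failed)  # runs on success and on exception
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            start, failed = time.perf_counter(), True
            try:
                result = func(*args, **kwargs)
                failed = False
                return result
            finally:
                record(start, failed)
        return sync_wrapper
    return decorator


if __name__ == "__main__":
    @trace("sleep_briefly")
    async def sleep_briefly():
        await asyncio.sleep(0.01)

    asyncio.run(sleep_briefly())
    print(SAMPLES["sleep_briefly"])
```

Exceptions are re-raised unchanged; the wrapper only notes that the call failed.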

2. Trace Code Blocks with a Context Manager

from utils.apm import trace_block

async def complex_operation():
    # Trace specific blocks of code
    async with trace_block("fetch_data", source="api"):
        data = await fetch_from_api()
    
    async with trace_block("process_data", item_count=len(data)):
        result = process(data)
    
    return result
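
An async context manager with this shape can be written with `contextlib.asynccontextmanager`. The sketch below is one possible implementation under that assumption. The `BLOCKS` list is a hypothetical in-memory store, and the real `utils.apm.trace_block` may work differently.

```python
import asyncio
import time
from contextlib import asynccontextmanager

# Completed blocks as (operation, duration_ms, tags, failed); in-memory only.
BLOCKS: list[tuple[str, float, dict, bool]] = []


@asynccontextmanager
async def trace_block(operation: str, **tags):
    """Time the enclosed block and note whether it raised (illustrative sketch)."""
    start = time.perf_counter()
    failed = True
    try:
        yield
        failed = False  # only reached when the block finishes without raising
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        BLOCKS.append((operation, duration_ms, tags, failed))


async def demo() -> None:
    async with trace_block("fetch_data", source="api"):
        await asyncio.sleep(0.01)


if __name__ == "__main__":
    asyncio.run(demo())
    print(BLOCKS)
```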

3. View Performance Statistics

Request /api/performance/apm/stats to see performance statistics for all operations:

curl http://localhost:8000/api/performance/apm/stats

Example response:

{
  "get_user_profile": {
    "operation": "get_user_profile",
    "count": 1523,
    "avg_duration_ms": 45.23,
    "min_duration_ms": 12.34,
    "max_duration_ms": 234.56,
    "error_count": 5,
    "error_rate": 0.33
  },
  "process_data": {
    "operation": "process_data",
    "count": 856,
    "avg_duration_ms": 123.45,
    ...
  }
}
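
The per-operation fields in this response can all be derived from a list of call durations and an error counter. The sketch below shows one way to compute them. The class name `OperationStats` is hypothetical, and treating error_rate as a percentage is an assumption based on the sample, where 5 errors in 1523 calls is shown as 0.33.

```python
from dataclasses import dataclass, field


@dataclass
class OperationStats:
    operation: str
    durations_ms: list[float] = field(default_factory=list)
    error_count: int = 0

    def add(self, duration_ms: float, failed: bool = False) -> None:
        """Record one call."""
        self.durations_ms.append(duration_ms)
        if failed:
            self.error_count += 1

    def summary(self) -> dict:
        """Return the aggregate fields, using zeros when nothing was recorded."""
        count = len(self.durations_ms)
        return {
            "operation": self.operation,
            "count": count,
            "avg_duration_ms": round(sum(self.durations_ms) / count, 2) if count else 0.0,
            "min_duration_ms": min(self.durations_ms, default=0.0),
            "max_duration_ms": max(self.durations_ms, default=0.0),
            "error_count": self.error_count,
            # Assumed to be a percentage of calls that raised.
            "error_rate": round(self.error_count / count * 100, 2) if count else 0.0,
        }


if __name__ == "__main__":
    stats = OperationStats("get_user_profile")
    stats.add(12.34)
    stats.add(234.56, failed=True)
    print(stats.summary())
```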

4. Prometheus Metrics

Request /api/performance/apm/metrics to get APM metrics in Prometheus format:

curl http://localhost:8000/api/performance/apm/metrics

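Database Slow Query Monitoring

Database queries whose execution time exceeds a threshold are detected and logged automatically.

Configuration

Slow query monitoring is enabled automatically at application startup:

# Already configured in main.py
enable_slow_query_logging(engine, threshold_ms=100)  # 100 ms threshold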

Viewing Slow Queries

1. List Slow Queries

curl "http://localhost:8000/api/performance/slow-queries?limit=20"

Response:

{
  "stats": {
    "total_count": 156,
    "recent_count": 20,
    "avg_duration_ms": 234.56,
    "max_duration_ms": 1234.56,
    "slowest_query": {
      "query": "SELECT * FROM users WHERE ...",
      "duration_ms": 1234.56,
      "timestamp": "2026-02-06T10:30:45.123456"
    }
  },
  "queries": [
    {
      "query": "SELECT * FROM conversation_history WHERE user_id = ? ORDER BY ...",
      "duration_ms": 234.56,
      "params": "{'user_id': 123}",
      "timestamp": "2026-02-06T10:30:45.123456"
    },
    ...
  ]
}
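
The response separates a running total (total_count) from a bounded list of recent entries (recent_count). A bounded deque is one simple way to get that behaviour. The sketch below mirrors the field names from the response above, but the class `SlowQueryLog`, its defaults, and the UTC timestamps are illustrative assumptions rather than MBE's actual monitor.

```python
from collections import deque
from datetime import datetime, timezone


class SlowQueryLog:
    """Keep the most recent slow queries plus running totals (illustrative sketch)."""

    def __init__(self, threshold_ms: float = 100, max_recent: int = 100):
        self.threshold_ms = threshold_ms
        self.recent = deque(maxlen=max_recent)  # oldest entries fall off automatically
        self.total_count = 0
        self.total_duration_ms = 0.0
        self.slowest = None

    def observe(self, query: str, duration_ms: float) -> bool:
        """Record the query if it exceeds the threshold; return True if recorded."""
        if duration_ms <= self.threshold_ms:
            return False
        entry = {
            "query": query,
            "duration_ms": duration_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.recent.append(entry)
        self.total_count += 1
        self.total_duration_ms += duration_ms
        if self.slowest is None or duration_ms > self.slowest["duration_ms"]:
            self.slowest = entry
        return True

    def stats(self) -> dict:
        """Summarise everything seen so far, not only the retained entries."""
        avg = self.total_duration_ms / self.total_count if self.total_count else 0.0
        return {
            "total_count": self.total_count,
            "recent_count": len(self.recent),
            "avg_duration_ms": round(avg, 2),
            "max_duration_ms": self.slowest["duration_ms"] if self.slowest else 0.0,
            "slowest_query": self.slowest,
        }


if __name__ == "__main__":
    log = SlowQueryLog(threshold_ms=100, max_recent=20)
    log.observe("SELECT * FROM users WHERE ...", 1234.56)
    print(log.stats())
```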

2. View Statistics Only

curl http://localhost:8000/api/performance/slow-queries/stats

Usage in Code

from utils.slow_query_monitor import track_query_time

# Track query time with a decorator
@track_query_time("get_user_conversations", threshold_ms=200)
async def get_user_conversations(db: AsyncSession, user_id: int):
    result = await db.execute(...)
    return result

Adjusting the Threshold

Change the constant in shared/src/utils/slow_query_monitor.py:

SLOW_QUERY_THRESHOLD_MS = 100  # set to the threshold you need (milliseconds)

Grafana + Prometheus Integration

1. Deploy Prometheus

docker-compose.monitoring.yml:

version: '3.9'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: mbe-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/grafana/mbe-alerts.yml:/etc/prometheus/alerts/mbe-alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: mbe-grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_SERVER_ROOT_URL=http://localhost:3000
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    container_name: mbe-alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/grafana/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: always

volumes:
  prometheus-data:
  grafana-data:

2. Prometheus Configuration

monitoring/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alerting rule files
rule_files:
  - /etc/prometheus/alerts/mbe-alerts.yml

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape configuration
scrape_configs:
  - job_name: 'mbe-api'
    scrape_interval: 30s
    static_configs:
      - targets: ['mbe-api:8000']
    metrics_path: '/api/metrics/prometheus'
  
  - job_name: 'mbe-apm'
    scrape_interval: 60s
    static_configs:
      - targets: ['mbe-api:8000']
    metrics_path: '/api/performance/apm/metrics'

3. Start the Monitoring Stack

# Start Prometheus + Grafana + AlertManager
docker-compose -f docker-compose.monitoring.yml up -d

# Grafana: http://localhost:3000 (admin/admin)
# Prometheus: http://localhost:9090
# AlertManager: http://localhost:9093

4. Import the Grafana Dashboard

Log in to Grafana (http://localhost:3000)
Add a Prometheus data source:
- Configuration -> Data Sources -> Add data source
- Select Prometheus
- URL: http://prometheus:9090
- Save & Test
Import the dashboard:
- Create -> Import
- Upload monitoring/grafana/mbe-dashboard.json

5. Configure Alert Notifications

Edit monitoring/grafana/alertmanager.yml:

- Email: configure the SMTP settings
- Slack: add the webhook URL
- DingTalk: add the bot token

Restart AlertManager to apply the changes:

docker-compose -f docker-compose.monitoring.yml restart alertmanager

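Performance Optimization Recommendations

1. Database Query Optimization

- Add indexes to frequently queried columns
- Avoid the N+1 query problem
- Use pagination instead of loading large result sets at once
- Review the slow query log regularly and optimize accordingly

2. Caching Strategy

- Cache hot data in Redis
- Set reasonable TTLs
- Pre-warm the cache to help avoid a cache avalanche

3. Connection Pool Tuning

Current configuration:

# database.py
pool_size=10        # connection pool size
max_overflow=20     # maximum overflow connections

Adjust according to concurrency:

- Low concurrency (< 100 RPS): pool_size=10
- Medium concurrency (100-500 RPS): pool_size=20
- High concurrency (> 500 RPS): pool_size=50+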
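
The tiers above can be captured in a small helper if the pool size is derived from an expected load figure at startup. This is a sketch only: the function name `suggested_pool_size` is hypothetical, and the returned values are starting points to validate against real measurements.

```python
def suggested_pool_size(requests_per_second: float) -> int:
    """Map expected load to a starting pool_size, following the tiers above."""
    if requests_per_second < 0:
        raise ValueError("requests_per_second must be non-negative")
    if requests_per_second < 100:
        return 10
    if requests_per_second <= 500:
        return 20
    return 50  # the guide says "50+", so treat this as a lower bound


if __name__ == "__main__":
    for rps in (50, 250, 800):
        print(rps, suggested_pool_size(rps))
```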

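4. Asynchronous Processing

For time-consuming operations, use Celery asynchronous tasks:

from tasks.celery_tasks import process_heavy_task

# Trigger an asynchronous task from within a synchronous request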
task = process_heavy_task.delay(data)
return {"task_id": task.id, "status": "processing"}

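Summary

The complete monitoring and logging system includes:

- Structured logging - JSON/text formats, suitable for log aggregation
- Request tracing - a unique ID correlates all logs for a request
- System metrics - CPU, memory, disk, and process monitoring
- Application metrics - request count, response time, error rate
- APM - function-level performance tracing
- Slow query monitoring - database performance analysis
- Grafana dashboards - visualization
- Prometheus alerts - automatic alert notifications

Together, these tools give full visibility into the running state of MBE and help locate problems quickly.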
