Priority 3 Completion Summary: Monitoring and Logging System

✅ Completion Date

2026-02-06

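📊 Overview of Completed Work

1. Structured Logging System

File: shared/src/utils/logging_config.py

Features:

✅ JSON-formatted logs (production; easy for the ELK Stack to collect)
✅ Colored text logs (development; easier debugging)
✅ Log level controlled at runtime through the LOG_LEVEL environment variable
✅ Automatic log file rotation (500 MB / 30 days / ZIP compression)
✅ Request tracking ID attached to log entries automatically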

Configuration:

LOG_LEVEL=INFO                 # Log level
ENABLE_JSON_LOGS=false         # Set to true in production
LOG_FILE=/var/log/mbe/app.log  # Optional log file path
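
To illustrate how these three variables might drive the logger setup, here is a minimal sketch built on the standard library `logging` module. It is not the contents of logging_config.py: `JsonFormatter` and `build_handler` are hypothetical names, and rotation and colored output are left out.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)


def build_handler(env=None) -> logging.Handler:
    """Choose the output target, format, and level from environment-style settings."""
    env = os.environ if env is None else env
    log_file = env.get("LOG_FILE")
    # Write to a file when LOG_FILE is set, otherwise to stderr.
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    if env.get("ENABLE_JSON_LOGS", "false").lower() == "true":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    # setLevel accepts level names such as "INFO" or "DEBUG".
    handler.setLevel(env.get("LOG_LEVEL", "INFO").upper())
    return handler
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise in tests.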

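2. Request Tracking System

File: shared/src/utils/request_tracking.py

Features:

✅ Generates a unique UUID for every HTTP request automatically
✅ Lets clients supply their own ID through the X-Request-ID header
✅ Uses contextvars to propagate the ID across async code
✅ Every log entry automatically includes a request_id field
✅ Response headers include X-Request-ID and X-Response-Time

Result: all log entries for a given request carry the same request_id, which makes it easy to trace the full request path.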
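
The contextvars mechanism described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in request_tracking.py: `request_id_var`, `RequestIdFilter`, `begin_request`, and `handle` are hypothetical names, and the HTTP middleware that would read and write the X-Request-ID header is omitted.

```python
import asyncio
import contextvars
import logging
import uuid

# Each asyncio task gets its own copy of the context, so concurrent
# requests never see each other's ID.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def begin_request(client_supplied_id=None) -> str:
    """Use the client's ID if one was sent, otherwise generate a UUID."""
    request_id = client_supplied_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


async def handle(client_supplied_id=None) -> str:
    """Stand-in for a request handler: the ID survives across awaits."""
    begin_request(client_supplied_id)
    await asyncio.sleep(0)
    return request_id_var.get()


async def demo():
    # Two concurrent "requests", one with a client-supplied ID.
    return await asyncio.gather(handle("client-abc"), handle())
```

A formatter that includes `%(request_id)s` can then be combined with `RequestIdFilter` so that each log line shows the ID of the request that produced it.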

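3. Metrics System

File: shared/src/utils/metrics.py

API endpoints:

/api/metrics - metrics in JSON format
/api/metrics/prometheus - Prometheus-compatible format

Metric types:

Application metrics: total requests, requests by status code, average response time, uptime
System metrics: CPU / memory / disk usage, process information

Integration: Prometheus can scrape the /api/metrics/prometheus endpoint directly.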
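
For readers unfamiliar with what a Prometheus-compatible endpoint returns, the sketch below renders a few application metrics in the Prometheus text exposition format. The metric names (`app_requests_total`, `app_responses_total`, `app_uptime_seconds`) and the function `render_prometheus` are assumptions made for illustration; the names actually exported by metrics.py may differ.

```python
def render_prometheus(total_requests: int, by_status: dict, uptime_seconds: float) -> str:
    """Render counters and a gauge as Prometheus text exposition lines."""
    lines = [
        "# HELP app_requests_total Total HTTP requests handled.",
        "# TYPE app_requests_total counter",
        f"app_requests_total {total_requests}",
        "# HELP app_responses_total HTTP responses by status code.",
        "# TYPE app_responses_total counter",
    ]
    # One sample per status code, distinguished by a label.
    for status, count in sorted(by_status.items()):
        lines.append(f'app_responses_total{{status="{status}"}} {count}')
    lines += [
        "# HELP app_uptime_seconds Seconds since the process started.",
        "# TYPE app_uptime_seconds gauge",
        f"app_uptime_seconds {uptime_seconds}",
    ]
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"
```

Counters only ever increase, so request totals are exposed as `counter`, while uptime is a `gauge` because it is a point-in-time reading.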

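4. Enhanced Health Checks

File: shared/src/api/health.py

Endpoint tiers:

/api/health - basic health check (fast response, intended for load balancers)
/api/health/detailed - detailed health check
- Database connection status and latency
- Redis connection status and latency
- System resources (CPU / memory / disk)
- Process information (PID / CPU / memory / thread count / file handles)
/api/health/resilience - resilience system status (circuit breakers, retry statistics)
/api/health/critique - Self-Critique validation system status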
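
As a rough picture of how a detailed check can report per-component status and latency, consider the sketch below. The response shape, the "degraded" status value, and the names `run_check` and `detailed_health` are assumptions, not the actual contract of health.py; real probes would ping the database and Redis instead of calling plain functions.

```python
import time


def run_check(name, probe):
    """Run one probe, timing it and capturing any failure."""
    start = time.perf_counter()
    try:
        probe()
        status, error = "healthy", None
    except Exception as exc:  # a health check should never raise
        status, error = "unhealthy", str(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "name": name,
        "status": status,
        "latency_ms": round(latency_ms, 2),
        "error": error,
    }


def detailed_health(probes: dict) -> dict:
    """Aggregate component checks into one overall status."""
    components = [run_check(name, probe) for name, probe in probes.items()]
    all_ok = all(c["status"] == "healthy" for c in components)
    return {"status": "healthy" if all_ok else "degraded", "components": components}
```

Keeping the basic endpoint free of such probes is what lets it answer quickly enough for load balancer polling.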

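5. APM (Application Performance Monitoring)

File: shared/src/utils/apm.py

Core features:

✅ @trace decorator that tracks function execution time and errors
✅ trace_block context manager that tracks a block of code
✅ Supports both synchronous and asynchronous functions
✅ Performance statistics (average / minimum / maximum execution time, error rate)
✅ Exports metrics in Prometheus format

Usage example: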

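from utils.apm import trace, trace_block

# Decorator form: records execution time and errors for every call.
@trace("get_user_profile", user_type="premium")
async def get_user_profile(user_id: int):
    return await db.query(...)

# Context-manager form: `async with` must run inside a coroutine
# (load_data is only an illustrative wrapper).
async def load_data():
    async with trace_block("fetch_data", source="api"):
        data = await fetch_from_api()
    return data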

Viewing statistics:

/api/performance/apm/stats - performance statistics
/api/performance/apm/metrics - Prometheus format
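
To make the idea of one decorator serving both synchronous and asynchronous functions concrete, here is a simplified sketch. It is not the implementation in apm.py: the decorator is named `traced` to avoid confusion with the real `trace`, it ignores extra tags such as `user_type`, and the in-memory `_stats` dictionary and `summary` helper are hypothetical.

```python
import asyncio
import functools
import inspect
import time

_stats = {}  # operation name -> aggregated timings


def _record(name, elapsed_ms, failed):
    s = _stats.setdefault(
        name,
        {"count": 0, "errors": 0, "total_ms": 0.0, "min_ms": float("inf"), "max_ms": 0.0},
    )
    s["count"] += 1
    s["errors"] += int(failed)
    s["total_ms"] += elapsed_ms
    s["min_ms"] = min(s["min_ms"], elapsed_ms)
    s["max_ms"] = max(s["max_ms"], elapsed_ms)


def traced(name):
    """Time every call and count failures, for sync and async callables."""
    def decorator(func):
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                start, failed = time.perf_counter(), True
                try:
                    result = await func(*args, **kwargs)
                    failed = False
                    return result
                finally:
                    _record(name, (time.perf_counter() - start) * 1000, failed)
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            start, failed = time.perf_counter(), True
            try:
                result = func(*args, **kwargs)
                failed = False
                return result
            finally:
                _record(name, (time.perf_counter() - start) * 1000, failed)
        return sync_wrapper
    return decorator


def summary(name):
    """Return average time and error rate, or None if nothing was recorded."""
    s = _stats.get(name)
    if not s:
        return None
    return {
        "count": s["count"],
        "avg_ms": s["total_ms"] / s["count"],
        "min_ms": s["min_ms"],
        "max_ms": s["max_ms"],
        "error_rate": s["errors"] / s["count"],
    }
```

Recording inside `finally` means a call that raises is still timed and counted as an error before the exception propagates to the caller.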

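6. Database Slow Query Monitoring

File: shared/src/utils/slow_query_monitor.py

Core features:

✅ Automatically detects queries whose execution time exceeds a threshold (default 100 ms)
✅ SQLAlchemy event listeners record slow queries automatically
✅ @track_query_time decorator for manual tracking
✅ Slow query list and statistics
✅ Integrated into the main.py startup flow

API endpoints:

/api/performance/slow-queries?limit=50 - slow query list
/api/performance/slow-queries/stats - statistics

Usage example: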

from sqlalchemy.ext.asyncio import AsyncSession

from utils.slow_query_monitor import track_query_time

# Override the default 100 ms threshold for this query only.
@track_query_time("get_user_conversations", threshold_ms=200)
async def get_user_conversations(db: AsyncSession, user_id: int):
    result = await db.execute(...)
    return result
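
The threshold logic behind the slow query list and statistics can be sketched as a small in-memory recorder. This is an assumption-laden illustration rather than the code in slow_query_monitor.py: `SlowQuery`, `SlowQueryRecorder`, and the `max_entries` cap are hypothetical. With SQLAlchemy, the durations would typically be measured between the `before_cursor_execute` and `after_cursor_execute` engine events; here they are passed in directly.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SlowQuery:
    statement: str
    duration_ms: float


class SlowQueryRecorder:
    """Keep the most recent queries that exceeded the threshold."""

    def __init__(self, threshold_ms: float = 100.0, max_entries: int = 1000):
        self.threshold_ms = threshold_ms
        # A bounded deque drops the oldest entry once it is full.
        self._entries = deque(maxlen=max_entries)

    def observe(self, statement: str, duration_ms: float) -> bool:
        """Record the query if it was slow; return whether it was recorded."""
        if duration_ms <= self.threshold_ms:
            return False
        self._entries.append(SlowQuery(statement, duration_ms))
        return True

    def recent(self, limit: int = 50) -> list:
        """Return up to `limit` slow queries, newest first."""
        return list(reversed(self._entries))[:limit]

    def stats(self) -> dict:
        durations = [q.duration_ms for q in self._entries]
        if not durations:
            return {"count": 0, "avg_ms": 0.0, "max_ms": 0.0}
        return {
            "count": len(durations),
            "avg_ms": sum(durations) / len(durations),
            "max_ms": max(durations),
        }
```

Bounding the number of stored entries keeps memory use flat even when a misbehaving query runs thousands of times.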

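7. Grafana + Prometheus Monitoring Stack

Configuration files:

monitoring/prometheus/prometheus.yml - Prometheus configuration
monitoring/grafana/mbe-alerts.yml - alert rules
monitoring/grafana/alertmanager.yml - alert notification configuration
docker-compose.monitoring.yml - monitoring stack deployment

Alert rule severity levels (the duration in parentheses is how long the condition must persist):

Critical

Service down (1 minute)
Database connection failure (1 minute)
Redis connection failure (1 minute)
Disk usage > 90% (5 minutes)

Warning

Average response time > 1000 ms (5 minutes)
5xx error rate > 5% (5 minutes)
Memory usage > 85% (10 minutes)
CPU usage > 80% (10 minutes)
Database latency > 100 ms (5 minutes)
Redis latency > 50 ms (5 minutes)

Info

Service restart detected
Increase in the number of slow queries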
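
The durations attached to the rules above correspond to the `for` clause of a Prometheus alerting rule: the alert fires only once the condition has held continuously for that long. The sketch below mimics that behavior in plain Python to show the semantics; it is a simplified illustration, not how Prometheus evaluates rules internally, and `should_fire` is a hypothetical name.

```python
def should_fire(samples, threshold: float, hold_seconds: float) -> bool:
    """Return True if value > threshold held continuously for hold_seconds.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # start of the current breach
            if timestamp - breach_started >= hold_seconds:
                return True
        else:
            breach_started = None  # any recovery resets the clock
    return False
```

For the "CPU usage > 80% (10 minutes)" rule, a single sample that dips back under 80% restarts the 10-minute window, which is what keeps short spikes from paging anyone.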

Notification channels:

Email (SMTP)
Slack
DingTalk (Webhook)

8. Performance Monitoring API

File: shared/src/api/performance.py

Endpoint overview:

Endpoint	Method	Purpose
`/api/performance/slow-queries`	GET	List slow queries
`/api/performance/slow-queries/stats`	GET	Slow query statistics
`/api/performance/slow-queries`	DELETE	Clear slow query records
`/api/performance/apm/stats`	GET	APM performance statistics
`/api/performance/apm/metrics`	GET	APM metrics in Prometheus format
`/api/performance/apm`	DELETE	Clear APM data

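📄 Documentation

`docs/MONITORING.md` - complete monitoring guide

It covers:

Logging system configuration and usage
Metrics endpoint reference
Request tracking mechanism
Health check tiers
APM usage and examples
Database slow query monitoring
Full Grafana + Prometheus integration guide
Production configuration
Troubleshooting examples
Performance tuning recommendations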

🚀 Deploying the Monitoring Stack

1. Start the monitoring services

docker-compose -f docker-compose.monitoring.yml up -d

2. Access the services

Grafana: http://localhost:3000 (admin/admin)
Prometheus: http://localhost:9090
AlertManager: http://localhost:9093

3. Configure Grafana

Add a Prometheus data source (URL: http://prometheus:9090)
Import the dashboard (monitoring/grafana/mbe-dashboard.json)

4. Configure alert notifications

Edit monitoring/grafana/alertmanager.yml:

Configure SMTP (Email)
Add the Slack Webhook URL
Add the DingTalk bot token

Restart AlertManager:

docker-compose -f docker-compose.monitoring.yml restart alertmanager

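🎯 Usage Scenarios

Scenario 1: Tracing a slow request

# 1. Spot the slow request (from the response headers)
curl -I http://localhost:8000/api/chat
# X-Response-Time: 1234.56ms
# X-Request-ID: a1b2c3d4-...

# 2. Search the log for every entry belonging to that request
#    (piping to jq assumes JSON logs, i.e. ENABLE_JSON_LOGS=true)
grep "a1b2c3d4" /var/log/mbe/app.log | jq .

Scenario 2: Inspecting slow queries

# Fetch the most recent slow queries
curl "http://localhost:8000/api/performance/slow-queries?limit=20"

# View the statistics
curl http://localhost:8000/api/performance/slow-queries/stats

Scenario 3: Analyzing function performance

# View performance statistics for all operations
curl http://localhost:8000/api/performance/apm/stats

# View a specific operation
curl "http://localhost:8000/api/performance/apm/stats?operation=get_user_profile"

Scenario 4: Monitoring system resources in real time

# Refresh the system metrics every 2 seconds
watch -n 2 'curl -s http://localhost:8000/api/metrics | jq .system'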

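📈 Monitoring Capabilities Summary

A complete monitoring stack

✅ Logging layer - structured logs + request tracking
✅ Metrics layer - Prometheus metrics + system resources
✅ Tracing layer - APM performance tracing
✅ Database layer - slow query monitoring
✅ Visualization layer - Grafana dashboards
✅ Alerting layer - automatic notifications through AlertManager

Key advantages

Non-intrusive - enabled through decorators and middleware
Lightweight - no external APM service (such as New Relic) required
Open source - built on the Prometheus + Grafana ecosystem
Flexible - supports multiple notification channels and custom alert rules
Production-ready - a complete alerting, logging, and tracing setup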

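🔜 Optional Future Enhancements

Log aggregation - integrate the ELK Stack (Elasticsearch + Logstash + Kibana)
Distributed tracing - integrate Jaeger or Zipkin
More metrics - Celery queue length, cache hit rate, and so on
Custom dashboards - tailor Grafana panels to business needs
Mobile alerting - integrate PagerDuty or WeCom (企业微信)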

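🎉 Summary

**Priority 3 (Monitoring and Logging)** is fully complete!

The system now provides:

End-to-end observability
Production-grade monitoring and alerting
Detailed performance analysis
Fast problem diagnosis

Suggested next step: continue with Priority 4 (Security and Optimization), including CORS configuration, rate limiting, and connection pool tuning.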