Xiaozhi Connection Stability Optimization Guide

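1. Problem Diagnosis

Common symptoms of instability

Symptom	Likely cause
"Mises engine temporarily unresponsive"	MBE timeout, Cloudflare outage
"Expert system is processing"	Slow LLM response, timeout
Tool call returns nothing	WebSocket disconnected
Choppy, intermittent responses	Network latency, heartbeat timeout
Frequent disconnects	Unstable Cloudflare Tunnel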

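Latency chain analysis

Xiaozhi device → Xiaozhi cloud server → Cloudflare → local MBE → DeepSeek API
    10 ms            100-200 ms          200-500 ms   request handling   3-15 s

Total latency: about 4-16 seconds (which can exceed Xiaozhi's 12-15 second timeout)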
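
The arithmetic behind that total can be checked with a short sketch. The hop names and ranges are copied from the chain above; the "request handling" step has no figure in the text, so it is left out here, and the function names are illustrative only.

```python
# Per-hop latency ranges in seconds, copied from the latency chain.
# MBE request handling is omitted because no number is given for it.
HOPS = {
    "device -> Xiaozhi cloud": (0.010, 0.010),
    "Xiaozhi cloud -> Cloudflare": (0.100, 0.200),
    "Cloudflare -> local MBE": (0.200, 0.500),
    "DeepSeek API": (3.0, 15.0),
}


def total_latency(hops):
    """Return (best_case, worst_case) end-to-end latency in seconds."""
    best = sum(low for low, _ in hops.values())
    worst = sum(high for _, high in hops.values())
    return best, worst


def exceeds_timeout(hops, timeout_s):
    """True when the worst case can run past the client timeout."""
    return total_latency(hops)[1] > timeout_s


if __name__ == "__main__":
    best, worst = total_latency(HOPS)
    print(f"best case: {best:.2f}s, worst case: {worst:.2f}s")
    print("can exceed a 12 s timeout:", exceeds_timeout(HOPS, 12.0))
```

Nearly all of the budget is the LLM call, which is why the options below focus on bounding it rather than on the network hops.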

2. Quick Checklist

# 1. Check MBE service status
curl http://localhost:8000/health

# 2. Check MCP client logs
docker logs mbe-mcp-client --tail 50

# 3. Check the Cloudflare Tunnel (mbe-tunnel is the tunnel name used in config.yml)
cloudflared tunnel info mbe-tunnel

# 4. Measure DeepSeek API latency
curl -w "Response time: %{time_total}s\n" \
  -X POST https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-chat","messages":[{"role":"user","content":"hello"}]}'

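3. Optimization Options

Option 1: Reduce MBE response time (recommended)

1.1 Enable streaming responses (most important)

Xiaozhi supports streaming responses, so partial output can be returned while generation is still running, which shortens the perceived wait: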

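# src/mcp/client.py - change the tool response

from typing import Any, AsyncIterator, Dict

# An async function that uses yield is an async generator, so it returns an AsyncIterator
async def _handle_ask_expert(self, args: Dict[str, Any]) -> AsyncIterator[Dict[str, Any]]:
    """Streaming expert Q&A: yield a placeholder first, then the full answer."""
    question = args.get("question", "")
    
    # 1. Immediately yield a "thinking" placeholder (within 500 ms)
    yield {
        "content": [{"type": "text", "text": "🤔 Looking things up..."}],
        "isPartial": True
    }
    
    # 2. Do the actual work (router is assumed to be defined elsewhere in this module)
    result = await router.route_and_answer(question)
    
    # 3. Yield the complete result
    yield {
        "content": [{"type": "text", "text": result.answer}],
        "isPartial": False
    }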
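
To show how a caller might consume such a handler, here is a minimal self-contained sketch. `slow_answer`, `ask_expert_stream`, and `collect` are hypothetical stand-ins rather than functions from the MBE codebase, and the `isPartial` flag simply mirrors the snippet above.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def slow_answer(question: str) -> str:
    """Hypothetical stand-in for the real expert call."""
    await asyncio.sleep(0.05)
    return f"Answer to: {question}"


async def ask_expert_stream(question: str) -> AsyncIterator[Dict[str, Any]]:
    """Yield a placeholder chunk at once, then the final chunk."""
    yield {"content": [{"type": "text", "text": "Looking things up..."}], "isPartial": True}
    answer = await slow_answer(question)
    yield {"content": [{"type": "text", "text": answer}], "isPartial": False}


async def collect(question: str) -> List[Dict[str, Any]]:
    """Consume the stream with `async for`, as a dispatcher would."""
    return [chunk async for chunk in ask_expert_stream(question)]


if __name__ == "__main__":
    for chunk in asyncio.run(collect("What is MBE?")):
        print(chunk["isPartial"], chunk["content"][0]["text"])
```

The point of the pattern is that the first chunk is available before the slow call starts, so the dispatcher can forward it immediately.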

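1.2 Shorten the timeout

Edit src/mcp/client.py:

import asyncio

# Shorten the timeout from 10-20 seconds to 8 seconds so MBE replies before Xiaozhi times out
async def _handle_analyze(self, args):
    result = await asyncio.wait_for(
        self.engine.process(...),
        timeout=8.0  # changed from 10 seconds to 8 seconds
    )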

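1.3 Add a fast fallback response

When processing times out, return a short, reassuring reply:

except asyncio.TimeoutError:
    # Return a fast fallback response instead of an error
    return {
        "content": [{"type": "text", "text": "I'm listening, please go on."}]
    }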
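
Sections 1.2 and 1.3 fit together as one pattern: bound the slow call with `asyncio.wait_for` and substitute the fallback text on timeout. The sketch below is a self-contained illustration; `with_fallback` and `fake_engine` are hypothetical names, and the 8-second default is taken from the text.

```python
import asyncio
from typing import Any, Awaitable, Dict

FALLBACK_TEXT = "I'm listening, please go on."


async def with_fallback(work: Awaitable[str], timeout_s: float = 8.0) -> Dict[str, Any]:
    """Await `work`, but answer with FALLBACK_TEXT if it exceeds timeout_s."""
    try:
        text = await asyncio.wait_for(work, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow task at this point
        text = FALLBACK_TEXT
    return {"content": [{"type": "text", "text": text}]}


async def fake_engine(delay_s: float) -> str:
    """Hypothetical stand-in for self.engine.process(...)."""
    await asyncio.sleep(delay_s)
    return "full analysis"


if __name__ == "__main__":
    fast = asyncio.run(with_fallback(fake_engine(0.01), timeout_s=0.5))
    slow = asyncio.run(with_fallback(fake_engine(0.5), timeout_s=0.05))
    print(fast["content"][0]["text"])
    print(slow["content"][0]["text"])
```

One trade-off to keep in mind: the fallback hides the timeout from the user, so it is worth logging the event as well, otherwise slow calls become invisible.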

Option 2: Tune the WebSocket connection

2.1 Adjust heartbeat parameters

Edit src/mcp/client.py:

self.ws = await websockets.connect(
    self.websocket_url,
    ping_interval=20,    # changed from 30 to 20 seconds
    ping_timeout=10,
    close_timeout=5,
    max_size=10 * 1024 * 1024,  # 10 MB
    compression=None  # disable compression to reduce latency
)

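2.2 Add connection keep-alive

async def _keep_alive(self):
    """Actively send pings to keep the connection alive."""
    while self.running:
        try:
            await asyncio.sleep(15)  # ping every 15 seconds
            # .closed is available on the legacy websockets client;
            # the newer asyncio client exposes .state instead
            if self.ws and not self.ws.closed:
                await self.ws.ping()
        except Exception as e:
            logger.warning(f"Keep-alive failed: {e}")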

Option 3: Tune the Cloudflare Tunnel

3.1 Adjust the Tunnel configuration

Create an optimized config.yml:

tunnel: mbe-tunnel
credentials-file: /path/to/credentials.json

# Global settings
originRequest:
  connectTimeout: 10s
  tcpKeepAlive: 15s
  noHappyEyeballs: false
  keepAliveConnections: 100
  keepAliveTimeout: 90s
  httpHostHeader: ""
  originServerName: ""

ingress:
  - hostname: mbe.hi-maker.com
    service: http://localhost:8000
    originRequest:
      connectTimeout: 30s
      noTLSVerify: false
      # WebSocket tuning
      proxyType: ""
  - service: http_status:404

3.2 Use Cloudflare's WebSocket settings

In the Cloudflare Dashboard:

Open the settings for the hi-maker.com domain
Rules → Settings: enable WebSockets
Speed → Optimization: disable Minify (so it does not interfere with WS)
Network: enable WebSocket traffic

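Option 4: Cloud deployment (the most thorough fix)

Deploy MBE to a cloud server to remove the local → Cloudflare hop and its latency:

Xiaozhi device → Xiaozhi cloud → cloud MBE → DeepSeek API
    10 ms          100 ms         direct      3-15 s

Total latency: about 3-15 seconds (200-500 ms less)

See docs/deployment/云端部署方案.md.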

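4. Monitoring and Alerting

4.1 Add latency monitoring

# src/mcp/client.py

import time

async def _handle_tool_call(self, msg_id: str, params: Dict[str, Any]):
    # tool_name, tool_args and handler are assumed to be resolved from params
    # by dispatch code that is not shown in this excerpt
    start_time = time.monotonic()  # monotonic clock, unaffected by wall-clock changes
    
    try:
        result = await handler(tool_args)
        
        # Record latency
        latency = time.monotonic() - start_time
        logger.info(f"Tool {tool_name} completed in {latency:.2f}s")
        
        # Latency alert
        if latency > 8:
            logger.warning(f"⚠️ High latency: {tool_name} took {latency:.2f}s")
        
        await self._send_response(msg_id, result)
    except Exception as e:
        # Note: nothing is sent back on this path, so the caller sees no result
        latency = time.monotonic() - start_time
        logger.error(f"Tool {tool_name} failed after {latency:.2f}s: {e}")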

4.2 Connection status monitoring script

#!/bin/bash
# check_xiaozhi_connection.sh

echo "Checking Xiaozhi connection status..."

# Check that the MCP client container is running
if docker ps | grep -q mbe-mcp-client; then
    echo "✅ MCP client is running"
    
    # Count errors in the recent logs
    ERRORS=$(docker logs mbe-mcp-client --since 5m 2>&1 | grep -c "ERROR\|Connection closed\|TimeoutError")
    if [ "$ERRORS" -gt 5 ]; then
        echo "⚠️ $ERRORS errors in the last 5 minutes"
    else
        echo "✅ Connection is stable"
    fi
else
    echo "❌ MCP client is not running"
fi

# Count recent tool calls
TOOL_CALLS=$(docker logs mbe-mcp-client --since 5m 2>&1 | grep -c "Tool call:")
COMPLETIONS=$(docker logs mbe-mcp-client --since 5m 2>&1 | grep -c "completed\|✅")

echo "📊 Last 5 minutes:"
echo "   Tool calls: $TOOL_CALLS"
echo "   Completed: $COMPLETIONS"
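
If the same counts are needed inside Python (for a test or a metrics endpoint, say), the script's matching rules translate directly. This is only a sketch: `summarize` is a hypothetical helper, and the patterns and the "more than 5 errors" threshold are copied from the shell script above.

```python
import re
from typing import Dict, Iterable, Union

# Same alternation as the grep pattern in the shell script
ERROR_PATTERN = re.compile(r"ERROR|Connection closed|TimeoutError")


def summarize(lines: Iterable[str]) -> Dict[str, Union[int, bool]]:
    """Count errors, tool calls and completions in a batch of log lines."""
    errors = tool_calls = completions = 0
    for line in lines:
        if ERROR_PATTERN.search(line):
            errors += 1
        if "Tool call:" in line:
            tool_calls += 1
        if "completed" in line or "✅" in line:
            completions += 1
    return {
        "errors": errors,
        "tool_calls": tool_calls,
        "completions": completions,
        "stable": errors <= 5,  # the script warns only above 5 errors
    }


if __name__ == "__main__":
    sample = [
        "INFO Tool call: ask_expert",
        "INFO Tool ask_expert completed in 1.20s",
        "ERROR upstream failure",
    ]
    print(summarize(sample))
```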

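5. Recommended Configuration

5.1 Environment variable tuning

# Add the following to .env

# LLM timeout (seconds)
LLM_TIMEOUT=8

# Enable fast fallback
ENABLE_FAST_FALLBACK=true

# WebSocket heartbeat (seconds)
WS_PING_INTERVAL=20
WS_PING_TIMEOUT=10

# Preloading
PRELOAD_MODELS=true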
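
How the client reads these variables is not shown in this guide. One plausible approach, sketched below, is to parse them once into a typed settings object. The `Settings` class and `load_settings` function are hypothetical, and the defaults are assumptions that simply repeat the values recommended above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_timeout: float
    enable_fast_fallback: bool
    ws_ping_interval: float
    ws_ping_timeout: float
    preload_models: bool


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        llm_timeout=float(env.get("LLM_TIMEOUT", "8")),
        enable_fast_fallback=_as_bool(env.get("ENABLE_FAST_FALLBACK", "true")),
        ws_ping_interval=float(env.get("WS_PING_INTERVAL", "20")),
        ws_ping_timeout=float(env.get("WS_PING_TIMEOUT", "10")),
        preload_models=_as_bool(env.get("PRELOAD_MODELS", "true")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Parsing in one place means a malformed value such as `LLM_TIMEOUT=abc` fails at startup with a `ValueError` instead of surfacing later in the middle of a tool call.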

5.2 Complete docker-compose.yml

services:
  mbe-api:
    image: mbe-api:latest
    environment:
      - LLM_TIMEOUT=8
      - ENABLE_FAST_FALLBACK=true
      - PRELOAD_MODELS=true
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: always

  mbe-mcp-client:
    image: mbe-mcp-client:latest
    environment:
      - MCP_ENDPOINT=${XIAOZHI_MCP_URL}
      - WS_PING_INTERVAL=20
    depends_on:
      mbe-api:
        condition: service_healthy
    restart: always

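6. Troubleshooting

Problem 1: "Expert system is processing" appears frequently

Cause: the LLM response time exceeds the Xiaozhi client timeout

Fix:

Check DeepSeek API latency
Shorten the MBE timeout to 8 seconds
Enable the fast fallback response

Problem 2: WebSocket disconnects and reconnects frequently

Cause: heartbeat timeout or an unstable network

Fix:

Adjust the heartbeat parameters
Check the Cloudflare Tunnel status
Consider cloud deployment

Problem 3: The first call is very slow

Cause: model cold start

Fix:

Confirm PRELOAD_MODELS=true
Check the preload logs
Add warm-up requests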
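
A warm-up request can be as simple as issuing one or two throwaway calls at startup and logging how long each took. The sketch below assumes nothing about MBE's internals: `warm_up` accepts any async callable, and `FakeBackend` is a hypothetical stand-in that simulates a slow first call.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def warm_up(call: Callable[[str], Awaitable[object]], attempts: int = 2) -> List[float]:
    """Issue throwaway requests and return the latency of each, in seconds."""
    timings: List[float] = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            await call("warm-up")
        except Exception:
            pass  # a failed warm-up should not block startup
        timings.append(time.perf_counter() - start)
    return timings


class FakeBackend:
    """Hypothetical backend whose first call pays a cold-start penalty."""

    def __init__(self) -> None:
        self.calls = 0

    async def ask(self, question: str) -> str:
        self.calls += 1
        await asyncio.sleep(0.2 if self.calls == 1 else 0.01)
        return "ok"


if __name__ == "__main__":
    backend = FakeBackend()
    print(asyncio.run(warm_up(backend.ask)))
```

Comparing the first and second timings gives a rough measure of how much the cold start costs.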

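Problem 4: Unstable during specific time periods

Cause: network congestion or API rate limiting

Fix:

Check network quality
Monitor DeepSeek API status
Shift usage to off-peak hours or add caching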
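
For the caching suggestion, a small time-to-live cache keyed by the question text is one possible shape. This is a sketch under assumptions: `TTLCache` is a hypothetical class, the 300-second default is arbitrary, and caching only makes sense for questions whose answers do not depend on conversation context.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop the stale entry
            return None
        return value

    def put(self, key: Hashable, value: Any) -> None:
        """Store value under key with a fresh expiry time."""
        self._items[key] = (self._clock() + self._ttl_s, value)


if __name__ == "__main__":
    cache = TTLCache(ttl_s=60.0)
    cache.put("what is MBE?", "cached answer")
    print(cache.get("what is MBE?"))
```

A cache hit skips the DeepSeek call entirely, which helps with both rate limits and peak-hour latency.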

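7. Emergency Recovery

If Xiaozhi cannot connect to MBE at all:

# 1. Restart the MCP client
docker restart mbe-mcp-client

# 2. Inspect the logs
docker logs mbe-mcp-client --tail 100

# 3. Check whether the token has expired
# Regenerate the MCP Token in the Xiaozhi app

# 4. Restart the entire MBE stack
docker compose restart

Last updated: 2026-01-21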