多 LLM 智能体工作流的 GPU 调度：从过度订阅到精准分配

一句话总结

通过分析多 LLM 工作流中各模型执行时间占比的稳定性，Scepsy 实现了比独立优化方案高 2.4x 吞吐量、低 27x 延迟的智能体服务调度。

为什么需要这个？

问题：多 LLM 工作流下 GPU 资源严重浪费

现代智能体系统（如 AutoGPT、LangGraph、CrewAI）通常编排多个 LLM 协作完成任务：

用户请求
  ├── Planner LLM（规划任务）
  │     ├── Executor LLM × N（并行执行）
  │     │     └── Critic LLM（评估结果）
  │     └── Summarizer LLM（汇总）
  └── 返回结果

实际部署时有两个核心痛点：

执行路径不可预测：同一个 workflow，第一次调用 Executor 3 次，第二次可能调用 7 次，总延迟差 3 倍
LLM 数量 > GPU 数量：一个生产系统可能有 8 个不同 LLM，但只有 4 张 GPU，必须分时复用

现有方案的问题：

方案	问题
给每个 LLM 分配固定 GPU	利用率低，LLM 等待各自的 GPU 而相互阻塞
按请求量等比分配	忽略了不同 LLM 计算量的本质差异
用户手工指定	需要专业知识，换个 workflow 就失效

核心原理：稳定的时间占比

直觉：混乱中有秩序

想象一家餐厅（GPU 集群）服务不同菜系（LLM）。虽然每桌客人点什么菜完全随机，但长期统计下来，中餐、日料、西餐各占厨房时间的比例相当稳定。

Scepsy 的关键洞察：

虽然单次工作流的端到端延迟不可预测，但各 LLM 占总计算时间的比例（share）在多次执行中高度稳定。

假设某工作流运行 100 次，统计每个 LLM 的 GPU 计算时间：

Planner LLM:    平均占 15% ± 2%   ← 非常稳定
Executor LLM:   平均占 60% ± 5%   ← 稳定
Critic LLM:     平均占 20% ± 3%   ← 稳定
Summarizer LLM: 平均占  5% ± 1%   ← 稳定

为什么稳定？ 即使执行次数随机变化，分布本身是稳定的（大数定律）。这就是 Scepsy 能够工作的数学基础。

从时间占比到 GPU 分配

有了稳定的时间占比 $s_i$，GPU 分配问题就变成了一个约束优化问题：

\[\text{最小化} \quad \text{E2E\_Latency}(a_1, a_2, ..., a_n)\] \[\text{约束} \quad \sum_i \text{GPU}(a_i) \leq \text{Total\_GPU}\]

其中 $a_i = (f_i, p_i, r_i)$ 是每个 LLM 的分配三元组：

$f_i$：GPU 分数（fractional share，如 0.5 表示半张 GPU）
$p_i$：Tensor Parallelism 度（跨 GPU 拆分权重）
$r_i$：副本数（并行处理多个请求）

代码实现

Baseline：每个 LLM 独立分配，互不相知

# baseline_scheduler.py
# 朴素方案：每个 LLM 按请求量均分 GPU，完全忽略计算特性

class NaiveScheduler:
    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
    
    def allocate(self, llm_names: list[str]) -> dict[str, float]:
        """均分 GPU，不考虑每个 LLM 的实际计算需求"""
        gpu_per_llm = self.total_gpus / len(llm_names)
        return {name: gpu_per_llm for name in llm_names}

# 问题：Planner LLM 只用了分配给它的 5% 时间
# 而 Executor LLM 严重过载，其他 LLM 在排队等待
scheduler = NaiveScheduler(total_gpus=4)
allocation = scheduler.allocate(["planner", "executor", "critic", "summarizer"])
# 结果: {'planner': 1.0, 'executor': 1.0, 'critic': 1.0, 'summarizer': 1.0}
# 每人一张卡，但 executor 的实际工作量是 planner 的 4 倍

性能分析：在真实 benchmark 中，这种方案导致 Executor LLM 队列堆积，平均等待时间占总延迟的 40%+。

第一步：构建时间占比 Profiler

# profiler.py
# 核心：通过采样估计各 LLM 的稳定时间占比

import time
import threading
from collections import defaultdict

class WorkflowProfiler:
    def __init__(self, warmup_runs=10, profile_runs=50):
        self.warmup_runs = warmup_runs
        self.profile_runs = profile_runs
        self._times: dict[str, list[float]] = defaultdict(list)
        self._lock = threading.Lock()
    
    def record(self, llm_name: str, duration_ms: float):
        """记录单次 LLM 调用的 GPU 时间"""
        with self._lock:
            self._times[llm_name].append(duration_ms)
    
    def get_shares(self) -> dict[str, float]:
        """计算各 LLM 的时间占比（归一化）"""
        # 取最近 profile_runs 次的平均值，忽略 warmup
        avg_times = {}
        for name, times in self._times.items():
            recent = times[-self.profile_runs:] if len(times) > self.profile_runs else times
            avg_times[name] = sum(recent) / len(recent) if recent else 0
        
        total = sum(avg_times.values())
        if total == 0:
            return {k: 1.0/len(avg_times) for k in avg_times}
        
        # 关键：返回相对占比，而非绝对时间
        return {name: t / total for name, t in avg_times.items()}
    
    def is_stable(self, threshold=0.05) -> bool:
        """检查占比是否已收敛（标准差 < threshold）"""
        shares_history = self._compute_shares_history(window=10)
        return all(std < threshold for std in shares_history.values())
    
    def _compute_shares_history(self, window=10):
        import statistics
        result = {}
        for name, times in self._times.items():
            if len(times) < window * 2:
                result[name] = float('inf')
                continue
            # 计算最近两个窗口的占比，看波动
            recent_share = sum(times[-window:]) / sum(sum(v[-window:]) for v in self._times.values())
            older_share = sum(times[-2*window:-window]) / sum(sum(v[-2*window:-window]) for v in self._times.values())
            result[name] = abs(recent_share - older_share)
        return result

第二步：Aggregate LLM Pipeline——轻量级延迟预测器

这是 Scepsy 的核心创新。它不预测单次请求延迟，而是预测给定 GPU 分配下的系统级延迟：

# aggregate_pipeline.py
# 基于 roofline 模型的简化延迟预测器

from dataclasses import dataclass
import numpy as np

@dataclass
class LLMProfile:
    name: str
    model_size_b: float    # 参数量（十亿）
    avg_input_tokens: int
    avg_output_tokens: int
    time_share: float      # 由 profiler 得到

@dataclass  
class AllocationConfig:
    gpu_fraction: float    # 分配的 GPU 分数
    tensor_parallel: int   # TP 度
    replicas: int          # 副本数

class AggregateLLMPipeline:
    """
    核心预测器：给定 GPU 分配，预测工作流延迟
    基于排队论（M/D/1 队列）建模
    """
    # 硬件常数（A100 80GB 为例）
    GPU_PEAK_TFLOPS = 312.0     # BF16
    MEM_BW_GB_S = 2000.0        # HBM3
    INTER_GPU_BW_GB_S = 600.0   # NVLink
    
    def predict_latency(
        self, 
        profiles: list[LLMProfile],
        allocations: list[AllocationConfig],
        target_qps: float
    ) -> float:
        """
        预测给定分配方案下的端到端 P99 延迟
        
        核心思路：
        1. 每个 LLM 是一个独立的排队系统
        2. 总延迟由 critical path 决定
        3. 考虑 TP 的通信开销
        """
        total_latency = 0.0
        
        for llm, alloc in zip(profiles, allocations):
            # Step 1: 计算单 LLM 的吞吐能力
            effective_tflops = self.GPU_PEAK_TFLOPS * alloc.gpu_fraction * alloc.tensor_parallel
            # TP 通信开销（all-reduce 代价）
            tp_overhead = 1.0 + 0.1 * (alloc.tensor_parallel - 1)
            effective_tflops /= tp_overhead
            
            # Step 2: 计算单请求延迟（roofline）
            compute_flops = 2 * llm.model_size_b * 1e9 * llm.avg_output_tokens
            compute_latency = compute_flops / (effective_tflops * 1e12)  # 秒
            
            # Step 3: 考虑副本带来的并行度收益
            effective_qps_per_replica = (target_qps * llm.time_share) / alloc.replicas
            # M/D/1 排队延迟近似：ρ/(2(1-ρ)) * service_time
            rho = effective_qps_per_replica * compute_latency  # 利用率
            if rho >= 1.0:
                return float('inf')  # 系统过载
            queue_latency = (rho / (2 * (1 - rho))) * compute_latency
            
            total_latency += compute_latency + queue_latency
        
        return total_latency * 1000  # 转为 ms

第三步：分配搜索——找到最优 GPU 分配

# profiler.py
# 核心：通过采样估计各 LLM 的稳定时间占比

from collections import defaultdict

class WorkflowProfiler:
    def __init__(self, warmup_runs=10, profile_runs=50):
        self.profile_runs = profile_runs
        self._times: dict[str, list[float]] = defaultdict(list)

    def record(self, llm_name: str, duration_ms: float):
        self._times[llm_name].append(duration_ms)

    def get_shares(self) -> dict[str, float]:
        # 取最近 profile_runs 次均值，忽略 warmup
        avg_times = {
            name: sum(times[-self.profile_runs:]) / len(times[-self.profile_runs:])
            for name, times in self._times.items() if times
        }
        total = sum(avg_times.values())
        # 关键：返回相对占比，而非绝对时间
        return {name: t / total for name, t in avg_times.items()} if total else {}

    def is_stable(self, threshold=0.05) -> bool:
        # 比较最近两个窗口的占比波动，判断是否收敛
        # ... (窗口统计代码省略)
        pass

完整使用示例

from dataclasses import dataclass

@dataclass
class LLMProfile:
    model_size_b: float
    avg_output_tokens: int
    time_share: float

@dataclass
class AllocationConfig:
    gpu_fraction: float
    tensor_parallel: int
    replicas: int

class AggregateLLMPipeline:
    GPU_PEAK_TFLOPS = 312.0  # A100 BF16

    def predict_latency(self, profiles, allocations, target_qps) -> float:
        total_latency = 0.0
        for llm, alloc in zip(profiles, allocations):
            # Roofline: 有效算力（含 TP 通信开销）
            effective_tflops = self.GPU_PEAK_TFLOPS * alloc.gpu_fraction * alloc.tensor_parallel
            effective_tflops /= (1.0 + 0.1 * (alloc.tensor_parallel - 1))

            # 单请求计算延迟
            compute_latency = (2 * llm.model_size_b * 1e9 * llm.avg_output_tokens) / (effective_tflops * 1e12)

            # M/D/1 排队延迟：ρ/(2(1-ρ)) * service_time
            rho = (target_qps * llm.time_share / alloc.replicas) * compute_latency
            if rho >= 1.0:
                return float('inf')
            total_latency += compute_latency + (rho / (2 * (1 - rho))) * compute_latency

        return total_latency * 1000  # ms

性能实测（论文数据，A100 集群）

实现版本	吞吐量 (req/s)	P99 延迟 (s)	GPU 利用率
均分分配（Naive）	4.2	87.3	41%
用户手工指定	6.8	52.1	58%
独立优化各 LLM	7.1	49.6	62%
Scepsy	10.1	3.2	81%

测试环境：8× A100 80GB，4 个 LLM 的 RAG 推理工作流，CUDA 12.1

延迟差距尤其显著（27x），原因在于 Naive 方案的排队级联效应：一个 LLM 的阻塞会向下游传播，而 Scepsy 通过精准分配避免了任何节点成为瓶颈。

常见踩坑

坑 1：时间占比随 batch size 变化

class ScepsyAllocator:
    def __init__(self, total_gpus: int, pipeline: AggregateLLMPipeline):
        self.total_gpus = total_gpus
        self.pipeline = pipeline

    def search(self, profiles: list[LLMProfile], target_qps: float) -> list[AllocationConfig]:
        best_latency, best_allocs = float('inf'), None
        base_gpu_fracs = np.array([p.time_share for p in profiles]) * self.total_gpus

        for tp_combo in self._enumerate_tp(len(profiles), [1, 2, 4, 8]):
            allocs, remaining = [], self.total_gpus
            for i, (profile, tp) in enumerate(zip(profiles, tp_combo)):
                # GPU 分数按时间占比初始化，向上取整到 TP 倍数
                gpu_frac = min(max(tp, round(base_gpu_fracs[i] / tp) * tp), remaining)
                replicas = max(1, int(gpu_frac / tp))
                remaining -= replicas * tp
                allocs.append(AllocationConfig(gpu_fraction=gpu_frac / self.total_gpus,
                                               tensor_parallel=tp, replicas=replicas))

            latency = self.pipeline.predict_latency(profiles, allocs, target_qps)
            if latency < best_latency:
                best_latency, best_allocs = latency, allocs

        return best_allocs

    def _enumerate_tp(self, n, options, max_combos=200):
        # 随机采样避免指数爆炸
        combos = list(itertools.product(options, repeat=n))
        return random.sample(combos, max_combos) if len(combos) > max_combos else combos

坑 2：TP 通信开销被低估

不同 GPU 互联拓扑下，TP 扩展效率差异巨大：

NVLink（A100 机内）：TP=4 效率约 92%
PCIe（跨机）：TP=4 效率可能只有 60%

Scepsy 的 tp_overhead 参数需根据实际互联测试调整，不能直接用论文默认值。

坑 3：工作流结构变化导致占比漂移

如果工作流逻辑随时间更新（如增加了一个新的 LLM 节点），需要触发重新 profile，否则旧的 share 数据会导致错误的分配决策。

什么时候用 / 不用？

适用场景	不适用场景
多 LLM 工作流（≥3 个模型）	单 LLM 推理服务
LLM 数量 > GPU 数量	GPU 资源充足（每模型独占）
工作流结构相对稳定	极度动态的工作流（每次结构变化大）
吞吐量敏感场景	超低延迟（< 100ms）场景
离线/批处理任务	实时交互（用户每次等待）

调试技巧

验证时间占比是否稳定：在上线前运行至少 50 次 profile，绘制各 LLM 时间占比的滑动平均曲线，确认收敛后再部署 Scepsy 分配方案。

检查 TP 效率：用 NCCL 的 all_reduce benchmark 测量实际 GPU 间带宽，与理论峰值对比，校准 tp_overhead 参数。

监控排队深度：生产环境中持续监控每个 LLM 的请求队列长度，若某节点持续积压，说明分配还有优化空间，可触发重新 search。