Serverless Inference – High error rates for open-source models (Qwen 3 32B)
Users relying on Serverless Inference and Agents for the open-source Qwen 3 32B model have reported elevated latency and error rates. The issue appears linked to capacity pressure within the underlying infrastructure, with signs that automatic scaling couldn’t keep pace with surging demand.
What’s happening
Initial telemetry and dashboards pointed to a spike in request volume targeting Qwen 3 32B. On the Ray dashboard, multiple workers remained stuck in a pending state, an indicator that there weren't enough available resources to schedule and run new workers. In practical terms, that meant more time spent waiting for capacity and, in some cases, outright request failures.
Root cause at a glance
- Demand surge: Traffic outpaced the platform’s expected baseline for this model, pushing the autoscaling layer to its limits.
- Capacity constraints: Insufficient nodes were available to support the desired number of model replicas and workers, leading to queuing and timeouts.
- Post-scale complication: After expanding the node pool to increase capacity, a new pod-related error surfaced, further hindering recovery.
Mitigation steps taken
- Node pool expansion: The platform increased the node pool size to add headroom for workers and replicas. This alleviated some pressure but did not fully unlock the target replica count.
- Active remediation: Engineering is investigating and addressing the newly observed pod-level error that emerged following the scale-up. Restoring full performance remains the immediate priority.
Current impact
- Intermittent 5xx errors and timeouts when invoking Qwen 3 32B via Serverless Inference and Agents.
- Elevated latency during peak periods as requests queue for limited capacity.
- Potential variability in response times even when requests succeed.
What users can do right now
- Implement robust retries: Use exponential backoff with jitter and a sensible max retry count to ride out capacity fluctuations.
- Reduce client-side concurrency: Throttle parallel requests or batch where possible to lower instantaneous load.
- Consider a temporary fallback: Route overflow traffic to an alternative model or a smaller Qwen variant to maintain service continuity.
- Pin to stable regions/endpoints: If supported, select regions with healthier capacity signals to minimize latency.
- Evaluate dedicated capacity: For latency-sensitive workloads, consider dedicated or provisioned capacity options that decouple you from shared serverless contention.
- Strengthen observability: Track error rates and p95/p99 latency, and tighten circuit breakers to fail fast when upstream conditions degrade.
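The retry guidance above can be sketched in a few lines. This is a minimal illustration, not a specific SDK's API: `call_model` and `TransientError` stand in for whatever invocation function and 5xx/timeout exception your client actually exposes.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for the 5xx/timeout errors surfaced by your client."""


def call_with_retries(call_model, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call using exponential backoff with full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call_model()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay,
            # so many clients retrying at once don't all hit the service in lockstep.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

Full jitter (randomizing over the whole backoff window) tends to spread retry storms more evenly than fixed exponential delays, which matters most during exactly the kind of capacity crunch described here.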
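The concurrency-throttling and fallback bullets can be combined in one small wrapper. Again, this is a hedged sketch: the `invoke(model, prompt)` callable and the model names are placeholders for your real client and whichever smaller variant you choose to fall back to.

```python
import threading


class ThrottledClient:
    """Bound in-flight requests and route failures to a fallback model."""

    def __init__(self, invoke, max_concurrency=4, fallback_model=None):
        self._invoke = invoke                        # callable: (model, prompt) -> str
        self._sem = threading.Semaphore(max_concurrency)
        self._fallback = fallback_model              # e.g. a smaller Qwen variant

    def generate(self, model, prompt):
        # Capping concurrent calls keeps a local traffic spike from amplifying
        # load on an already capacity-constrained upstream.
        with self._sem:
            try:
                return self._invoke(model, prompt)
            except Exception:
                if self._fallback is None:
                    raise
                # Overflow/failure path: serve the request from the fallback model.
                return self._invoke(self._fallback, prompt)
```

In practice you would catch only the transient error types your client raises rather than bare `Exception`, so genuine bugs still surface.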
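The fail-fast advice can be implemented with a minimal failure-rate circuit breaker. The window size, failure threshold, and cooldown below are illustrative defaults to tune against your own traffic, not recommended values.

```python
import time
from collections import deque


class CircuitBreaker:
    """Open after a burst of failures so callers fail fast instead of queuing."""

    def __init__(self, window=20, failure_threshold=0.5, cooldown=30.0):
        self._results = deque(maxlen=window)   # recent True/False outcomes
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._opened_at = None                 # None means the breaker is closed

    def allow(self):
        if self._opened_at is None:
            return True
        if time.monotonic() - self._opened_at >= self._cooldown:
            # Half-open: after the cooldown, let traffic through to probe recovery.
            self._opened_at = None
            self._results.clear()
            return True
        return False                           # still open: fail fast locally

    def record(self, success):
        self._results.append(success)
        if (len(self._results) == self._results.maxlen
                and self._results.count(False) / len(self._results) >= self._threshold):
            self._opened_at = time.monotonic()  # trip the breaker
```

Callers check `allow()` before invoking the model and report each outcome via `record()`; while the breaker is open, requests fail immediately instead of adding to the upstream queue.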
What’s next
Engineering is actively working to resolve the pod-related error discovered after the node pool expansion and to calibrate scaling parameters for Qwen 3 32B under high load. Additional capacity and configuration refinements are expected to stabilize performance and reduce error rates.
Updates will follow as fixes roll out and service health metrics return to normal. In the meantime, employing the mitigations above should help maintain application resilience while the platform restores full capacity.