TL;DR
Hugging Face has completed the migration of its inference engine from vLLM V0 to V1 and fixed the backend discrepancies that surfaced along the way. Rollout logprobs and training metrics are consistent again, which improves RL training fidelity.
After a series of targeted fixes, the V1 engine now produces rollout logprobs consistent with the earlier V0 version, removing a key source of training divergence in reinforcement learning workflows.
The company identified and fixed four primary issues during the migration from vLLM V0 to V1: the semantics of logprobs returned by the backend, runtime default settings such as prefix caching and async scheduling, the inflight weight update process, and the use of an fp32 lm_head for the final projection. These adjustments allowed the V1 engine to replicate the V0 reference behavior closely, as demonstrated by metrics such as clip rate, KL divergence, entropy, and reward, which now align across versions.
The initial V1 attempt showed discrepancies in these metrics, indicating a mismatch in how the inference engine computed and returned logprobs. After enabling the correct logprobs mode and standardizing runtime defaults, the V1 results matched the V0 reference, confirming the fixes’ effectiveness. The updates also included changes to cache management and weight synchronization to prevent stale or inconsistent state during online RL training.
Why It Matters
The fix matters because reinforcement learning training workflows depend on accurate logprobs from the inference engine. Discrepancies in these values can destabilize training dynamics and hurt both model performance and reproducibility. With backend parity restored, RL training becomes more reliable and predictable, which is essential for deploying high-quality language models.

Background
The migration from vLLM V0 to V1 involved a major rewrite of the inference engine, intended to improve performance and flexibility. Early V1 runs, however, showed mismatches in key metrics, prompting an investigation into three potential causes: semantic differences in logprobs, differences in runtime defaults, and misalignment in the RL objective. The resolution focused on the first two, together with the weight update and lm_head fixes listed above; changes to the objective turned out to be unnecessary once backend parity was achieved.
The initial issues stemmed from the fact that V1 returned raw model output logprobs by default, rather than the processed distribution expected by the RL trainer. Additionally, runtime defaults such as prefix caching and async scheduling differed between versions, further complicating the comparison. Through systematic adjustments, the team aligned V1’s behavior with V0, ensuring consistent training signals.
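The gap between raw and processed logprobs can be illustrated with a minimal sketch. It assumes that "processed" means the distribution after sampling temperature is applied, which is only one possible form of processing; the article does not specify the exact transformation. The function names are hypothetical and the code does not call vLLM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_logprobs(logits):
    """Logprobs of the unmodified model output (hypothetical helper)."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    """Logprobs after temperature scaling (assumed form of 'processing')."""
    return log_softmax([x / temperature for x in logits])


# Toy logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
temperature = 0.7  # assumed sampling temperature

raw = raw_logprobs(logits)
proc = processed_logprobs(logits, temperature)

# The token was sampled from the processed distribution, so a trainer that
# receives the raw value sees a different logprob for the same token.
sampled_token = 0
gap = proc[sampled_token] - raw[sampled_token]
print(f"raw={raw[sampled_token]:.4f} processed={proc[sampled_token]:.4f} gap={gap:.4f}")
```

With a temperature of 1.0 the two functions agree, which is why a mismatch of this kind may only show up under specific sampling settings.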
“By fixing the semantics of logprobs, standardizing runtime defaults, and addressing weight update processes, vLLM V1 now accurately replicates V0 behavior, stabilizing RL training.”
— Hugging Face engineering team
“Our focus was on backend parity before changing the RL objectives, ensuring that the core inference behavior remained consistent across versions.”
— Hugging Face project lead
What Remains Unclear
It is still unclear whether these fixes fully address all possible edge cases in diverse RL training setups or if further adjustments may be needed for different model configurations.

What’s Next
Hugging Face plans to monitor the stability of RL training with vLLM V1 in production environments, potentially incorporating further optimizations. They will also evaluate whether additional backend adjustments are necessary for other inference scenarios or model sizes.

Key Questions
What specific issues were fixed in vLLM V1 to match V0?
The team fixed the semantics of logprobs returned by the backend, standardized runtime defaults such as prefix caching and async scheduling, corrected inflight weight update processes, and ensured the use of an fp32 lm_head for the final projection.
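As a rough illustration of why the precision of the final projection can matter, the sketch below compares logprobs from a high-precision projection with logprobs from one whose inputs and output are rounded to bfloat16. Everything here is an assumption for illustration: the article does not say which lower precision was involved, Python doubles stand in for fp32, and the helper names are hypothetical.

```python
import math
import random
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for ties-to-even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def project(hidden, weights, low_precision):
    """Compute logits = weights @ hidden.

    In the low-precision path, inputs and the output are rounded to bf16;
    accumulation stays in higher precision (an assumed simplification).
    """
    logits = []
    for row in weights:
        if low_precision:
            acc = sum(to_bf16(h) * to_bf16(w) for h, w in zip(hidden, row))
            logits.append(to_bf16(acc))
        else:
            logits.append(sum(h * w for h, w in zip(hidden, row)))
    return logits


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(64)]                        # toy hidden state
weights = [[random.gauss(0, 1) for _ in range(64)] for _ in range(8)]   # toy lm_head

ref = log_softmax(project(hidden, weights, low_precision=False))
low = log_softmax(project(hidden, weights, low_precision=True))
max_gap = max(abs(a - b) for a, b in zip(ref, low))
print(f"max |logprob gap| between precisions: {max_gap:.4f}")
```

If the trainer and the inference engine compute this projection at different precisions, a gap of this kind would appear between their logprobs even when the weights are identical.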
Why is matching V0 behavior important for RL training?
Because RL training relies on accurate rollout logprobs for policy updates, divergence from the reference logprobs can destabilize training and reduce model quality. Ensuring backend parity maintains training stability and reproducibility.
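To make the dependence on rollout logprobs concrete, here is a minimal sketch of how a mismatch between trainer and rollout logprobs could surface in metrics such as clip rate and KL divergence. It assumes a PPO-style importance ratio with a clip range of 0.2 and a common sample-based KL estimate; the article does not state which objective or estimator was actually used, and the function name is hypothetical.

```python
import math


def rollout_mismatch_metrics(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare per-token logprobs from the trainer and the inference engine.

    ratio = exp(trainer - rollout) is the importance ratio for tokens
    sampled by the rollout policy.
    """
    ratios = [math.exp(t - r) for t, r in zip(trainer_logprobs, rollout_logprobs)]
    n = len(ratios)
    # Fraction of tokens whose ratio falls outside [1 - eps, 1 + eps].
    clipped = sum(1 for x in ratios if x < 1.0 - clip_eps or x > 1.0 + clip_eps)
    # Non-negative sample estimate of KL(rollout || trainer): (r - 1) - log r.
    approx_kl = sum((x - 1.0) - math.log(x) for x in ratios) / n
    return {"clip_rate": clipped / n, "approx_kl": approx_kl}


# Identical logprobs: no clipping and zero estimated KL.
matched = rollout_mismatch_metrics([-0.5, -1.0, -2.0, -0.1], [-0.5, -1.0, -2.0, -0.1])

# One token differs by 0.3 nats, enough to push its ratio past the clip range.
mismatched = rollout_mismatch_metrics([-0.5, -1.0, -2.0, -0.1], [-0.5, -1.3, -2.0, -0.1])

print(matched)
print(mismatched)
```

When both sides describe the same weights, these metrics should sit at or near zero before any policy update, so a nonzero value at that point suggests a backend mismatch instead of a learning signal.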
Will these fixes affect other uses of vLLM outside RL?
Potentially. The fixes primarily address inference correctness for RL workflows, but they may also improve consistency for other applications that depend on precise logprobs and inference behavior.
Are there remaining differences between V0 and V1?
As of now, the main backend behavior issues have been resolved. However, ongoing testing will determine if any subtle discrepancies remain in different configurations or workloads.