vLLM V0 To V1: Correctness Before Corrections In RL

TL;DR

Hugging Face has migrated its RL rollout backend from vLLM V0 to V1 and fixed the backend discrepancies that surfaced along the way. With the fixes in place, V1 rollout logprobs and training metrics line up with the V0 reference.

Hugging Face reports that, after a series of targeted fixes, its vLLM V1 setup now produces rollout logprobs consistent with the earlier V0 setup, removing a key source of training divergence in reinforcement learning (RL) workflows.

The company identified and fixed four primary issues during the migration from vLLM V0 to V1: the semantics of logprobs returned by the backend, runtime default settings such as prefix caching and async scheduling, the inflight weight update process, and the use of an fp32 lm_head for the final projection. These adjustments allowed the V1 engine to replicate the V0 reference behavior closely, as demonstrated by metrics such as clip rate, KL divergence, entropy, and reward, which now align across versions.
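
To see why the precision of the final projection can matter, here is a minimal sketch that is not taken from the Hugging Face code. It computes a toy lm_head projection twice, once in full Python float precision (a stand-in for fp32) and once with the accumulator rounded to IEEE half precision after every step (a stand-in for a reduced-precision projection, since the standard library has no bf16 type). The hidden state, the weights, and the function names are all hypothetical.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def lm_head(hidden, weight_rows, low_precision):
    """Toy final projection: one dot product per vocabulary row."""
    logits = []
    for row in weight_rows:
        acc = 0.0
        for h, w in zip(hidden, row):
            acc += h * w
            if low_precision:
                acc = to_fp16(acc)  # round the accumulator after every step
        logits.append(acc)
    return logits

# Hypothetical hidden state and a three-token vocabulary.
hidden = [0.1234, -0.5678, 0.9012, 0.3456]
weight_rows = [
    [0.4321, 0.8765, -0.2109, 0.6543],
    [-0.7531, 0.2468, 0.1357, -0.9753],
    [0.3141, -0.2718, 0.5772, 0.1618],
]

high = log_softmax(lm_head(hidden, weight_rows, low_precision=False))
low = log_softmax(lm_head(hidden, weight_rows, low_precision=True))

# Largest per-token logprob difference caused by rounding alone.
drift = max(abs(a - b) for a, b in zip(high, low))
print(f"max logprob drift from reduced precision: {drift:.2e}")
```

The drift in this toy case is small, but in a real model the projection sums over thousands of hidden dimensions, and the trainer compares these logprobs token by token against its own.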

The initial V1 attempt showed discrepancies in these metrics, indicating a mismatch in how the inference engine computed and returned logprobs. After enabling the correct logprobs mode and standardizing runtime defaults, the V1 results matched the V0 reference, confirming the fixes’ effectiveness. The updates also included changes to cache management and weight synchronization to prevent stale or inconsistent state during online RL training.
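
As a rough illustration of pinning these settings explicitly instead of relying on version-specific defaults, the sketch below collects them in one place. The article does not state the exact values used, so every value here is an assumption, and the option names should be checked against the installed vLLM version before use.

```python
# Hypothetical engine settings for a parity run. The article does not give
# the exact values, so treat every value below as an assumption.
engine_kwargs = {
    # Ask for logprobs of the distribution that was actually sampled from,
    # rather than the raw model output.
    "logprobs_mode": "processed_logprobs",
    # Pin runtime behavior explicitly so V0 and V1 runs are comparable.
    "enable_prefix_caching": False,
    "async_scheduling": False,
}

# With vLLM installed, these would be passed to the engine constructor, e.g.
#   from vllm import LLM
#   llm = LLM(model="<model-name>", **engine_kwargs)
for name, value in engine_kwargs.items():
    print(f"{name} = {value!r}")
```

Keeping the settings in a single dictionary makes it easy to log them next to each training run, so a metric mismatch can be traced back to a configuration difference.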

Why It Matters

RL training workflows depend on accurate logprobs from the inference engine. When those values drift from what the trainer expects, training dynamics can become unstable, which hurts both model performance and reproducibility. With backend parity restored, Hugging Face gets more reliable and predictable RL training on vLLM V1.

Background

The migration from vLLM V0 to V1 represented a major rewrite of the inference engine, intended to improve performance and flexibility. However, early V1 runs revealed mismatches in key metrics, prompting an investigation into three potential causes: semantic differences in logprobs, runtime default variations, and objective misalignments. The resolution focused on addressing the first two, as objective corrections were unnecessary once backend parity was achieved.

The initial issues stemmed from the fact that V1 returned raw model output logprobs by default, rather than the processed distribution expected by the RL trainer. Additionally, runtime defaults such as prefix caching and async scheduling differed between versions, further complicating the comparison. Through systematic adjustments, the team aligned V1’s behavior with V0, ensuring consistent training signals.
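
The gap between raw and processed logprobs can be shown with a small sketch. It assumes the only processing step is temperature scaling (no top-k or top-p), and the logits and function names are hypothetical, not taken from vLLM.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def raw_logprobs(logits):
    # Log-probabilities of the untouched model output.
    return log_softmax(logits)

def processed_logprobs(logits, temperature):
    # Log-probabilities of the distribution actually sampled from,
    # here after temperature scaling only (assumption: no top-k/top-p).
    return log_softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
raw = raw_logprobs(logits)
processed = processed_logprobs(logits, temperature=0.7)

# Largest per-token disagreement between the two conventions.
gap = max(abs(r - p) for r, p in zip(raw, processed))
print("raw      :", [round(v, 4) for v in raw])
print("processed:", [round(v, 4) for v in processed])
print("max gap  :", round(gap, 4))
```

At temperature 1.0 the two conventions agree. At any other temperature the trainer and the engine describe different distributions unless both use the same convention.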

“By fixing the semantics of logprobs, standardizing runtime defaults, and addressing weight update processes, vLLM V1 now accurately replicates V0 behavior, stabilizing RL training.”

— Hugging Face engineering team

“Our focus was on backend parity before changing the RL objectives, ensuring that the core inference behavior remained consistent across versions.”

— Hugging Face project lead

What Remains Unclear

It is still unclear whether these fixes fully address all possible edge cases in diverse RL training setups or if further adjustments may be needed for different model configurations.

What’s Next

Hugging Face plans to monitor the stability of RL training with vLLM V1 in production environments, potentially incorporating further optimizations. They will also evaluate whether additional backend adjustments are necessary for other inference scenarios or model sizes.

Key Questions

What specific issues were fixed in vLLM V1 to match V0?

The team fixed the semantics of logprobs returned by the backend, standardized runtime defaults such as prefix caching and async scheduling, corrected inflight weight update processes, and ensured the use of an fp32 lm_head for the final projection.

Why is matching V0 behavior important for RL training?

Because RL training relies on accurate rollout logprobs for policy updates, divergence from the reference logprobs can destabilize training and reduce model quality. Ensuring backend parity maintains training stability and reproducibility.
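
A simplified sketch of how such divergence shows up in the metrics named earlier: given per-token logprobs from the trainer and from the rollout engine, it computes the importance ratio, the fraction of tokens whose ratio falls outside a clipping band, and a sample-based KL estimate. The function name, the clipping threshold of 0.2, and the example values are assumptions for illustration, and real PPO-style clipping also depends on the sign of the advantage.

```python
import math

def rollout_mismatch(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare per-token logprobs from the trainer and the inference engine."""
    pairs = list(zip(trainer_logprobs, rollout_logprobs))
    # Importance ratio p_trainer / p_rollout for each sampled token.
    ratios = [math.exp(t - r) for t, r in pairs]
    # Fraction of tokens whose ratio leaves the band [1 - eps, 1 + eps].
    clip_rate = sum(abs(rho - 1.0) > clip_eps for rho in ratios) / len(ratios)
    # Sample estimate of KL(rollout || trainer): the tokens were drawn from
    # the rollout policy, so average log p_rollout - log p_trainer.
    kl_estimate = sum(r - t for t, r in pairs) / len(pairs)
    return clip_rate, kl_estimate

# Hypothetical logprobs for three sampled tokens.
rollout = [-0.10, -0.50, -1.00]
trainer_matched = [-0.10, -0.50, -1.00]
trainer_drifted = [-0.12, -0.80, -1.05]

print(rollout_mismatch(trainer_matched, rollout))
print(rollout_mismatch(trainer_drifted, rollout))
```

When the backend and the trainer agree, both numbers sit at zero before any policy update. A nonzero clip rate or KL on fresh rollouts signals a backend mismatch, not a learning signal.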

Will these fixes affect other uses of vLLM outside RL?

Potentially. The fixes primarily address inference correctness for RL workflows, but they may also improve consistency for other applications that depend on precise logprobs and inference behavior.

Are there remaining differences between V0 and V1?

As of now, the main backend behavior issues have been resolved. However, ongoing testing will determine if any subtle discrepancies remain in different configurations or workloads.

Author

AI Espionage Team
