Skywork Unveils Reward-V2 Models with 40M Preference Dataset

by PR Newswire | July 8, 2025

Skywork, a leader in AI research, released its second-generation reward models, the Skywork-Reward-V2 series, on July 4, 2025. Building on the success of the first-generation models, which garnered over 750,000 downloads on HuggingFace, the new series comprises eight models ranging from 600 million to 8 billion parameters. Trained on the new Skywork-SynPref-40M dataset, these models have secured top rankings across seven mainstream reward model evaluation benchmarks, marking a significant advancement in Reinforcement Learning from Human Feedback (RLHF).

Quick Intel

  • Skywork-Reward-V2 series includes eight models, from 600M to 8B parameters.

  • Models top seven major reward model evaluation benchmarks.

  • Skywork-SynPref-40M dataset contains 40 million preference pairs.

  • Human-machine collaboration ensures high-quality data screening.

  • Models excel in alignment, safety, and complex instruction understanding.

  • Available on HuggingFace and GitHub for research and application.

Advancing RLHF with Skywork-Reward-V2

The Skywork-Reward-V2 series is designed to enhance the RLHF process, a critical component in aligning AI models with human preferences. These models, based on Qwen3 and LLaMA3 architectures, demonstrate exceptional performance in areas such as general alignment, objective correctness, safety, and resistance to style bias. The series’ flagship model, Skywork-Reward-V2-Llama-3.1-8B, has achieved state-of-the-art results across all major benchmarks, including RewardBench v1/v2, PPE Preference & Correctness, and JudgeBench.
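The article does not state Skywork's training objective, but reward models for RLHF are commonly trained on preference pairs with a Bradley-Terry style pairwise loss; the sketch below shows that standard formulation, as an illustration rather than Skywork's confirmed method:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    It is small when the chosen response scores above the rejected one,
    and grows as the preference is inverted."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the score margin between chosen and rejected grows.
print(round(bradley_terry_loss(2.0, 0.0), 4))  # 0.1269 (correct ordering, clear margin)
print(round(bradley_terry_loss(0.0, 2.0), 4))  # 2.1269 (inverted preference)
```

Benchmarks such as RewardBench effectively measure the flip side of this objective: how often the model assigns the higher score to the response humans preferred.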

Skywork-SynPref-40M: A Game-Changing Dataset

Central to the success of Skywork-Reward-V2 is the Skywork-SynPref-40M dataset, comprising 40 million preference pairs, with 26 million meticulously screened for quality. Skywork’s innovative two-stage human-machine collaboration process ensures both scale and precision. In the first stage, human annotators create a high-quality “gold standard” dataset, which guides Large Language Models (LLMs) to generate a larger “silver standard” dataset. The second stage employs automated filtering using trained reward models, significantly reducing manual annotation while maintaining data integrity. “This human-machine collaborative closed-loop process continues iteratively, effectively improving the reward model’s understanding and discrimination of preferences,” the Skywork team notes.
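The closed loop described above can be sketched schematically. In the real pipeline the screening model is a trained reward model and the data are text preference pairs; here `train_fn` and the numeric "responses" are toy stand-ins, purely for illustration:

```python
import random

def screen_with_model(pairs, score_fn, threshold):
    """Stage 2 (schematic): automated filtering. Keep only pairs where the
    current model agrees with the recorded preference by a clear margin."""
    return [p for p in pairs
            if score_fn(p["chosen"]) - score_fn(p["rejected"]) >= threshold]

def iterative_curation(gold, silver, train_fn, rounds=3, threshold=0.3):
    """Schematic human-machine loop: train on gold plus the surviving
    silver data, re-screen the silver pool with the improved model, repeat."""
    kept = silver
    for _ in range(rounds):
        score_fn = train_fn(gold + kept)          # retrain on curated data
        kept = screen_with_model(silver, score_fn, threshold)
    return kept

# Toy demo: "responses" are numbers, and the true preference favors larger values.
random.seed(0)
silver = [{"chosen": random.random(), "rejected": random.random()} for _ in range(100)]
gold = [{"chosen": 0.9, "rejected": 0.1}]         # human-annotated seed set

def train_fn(data):
    # Stand-in for reward-model training: score a "response" by its value.
    return lambda x: x

kept = iterative_curation(gold, silver, train_fn, rounds=2)
# Only pairs whose recorded preference the model confirms by a margin survive.
```

The point of the loop is that each round's model filters the noisy "silver" pool more reliably than the last, so human effort stays concentrated on the small gold set.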

Overcoming Limitations of Existing Reward Models

Previous open-source reward models often struggled with capturing complex human preferences and exhibited overfitting on specific tasks. Skywork addresses these challenges by prioritizing data quality and diversity. Experiments reveal that even a subset of 290,000 high-quality data points from Skywork-SynPref-40M enables an 8B-scale model to outperform 70B-scale state-of-the-art models, underscoring the dataset’s superior quality. This approach mitigates the fragility of existing models and enhances generalizability across diverse tasks.

Broad Applicability and Future Potential

The Skywork-Reward-V2 series excels in advanced capability evaluations, including Best-of-N tasks, bias resistance, and truthfulness judgment. Its scalable data screening process ensures continuous performance improvements, making it a cornerstone for future AI infrastructure. Skywork envisions reward models evolving into a “compass” for intelligent systems, guiding them toward alignment with human values and complex reasoning tasks. The release of these models and the Skywork-SynPref-40M dataset is poised to accelerate progress in RLHF and foster innovation in the open-source community.
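Best-of-N, mentioned above, uses the reward model to rerank candidate generations. A minimal sketch follows; the generator and reward function here are hypothetical stand-ins (a fixed response pool and a length proxy), not Skywork's models:

```python
from itertools import cycle

def best_of_n(prompt, generate, reward, n=4):
    """Best-of-N: sample n candidate responses for the prompt and return
    the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy stand-ins: a deterministic "generator" cycling through canned replies,
# and a "reward model" that simply prefers longer responses.
pool = cycle(["short answer", "a detailed, well-grounded answer", "off-topic reply"])
generate = lambda prompt: next(pool)
reward = lambda prompt, resp: len(resp)

print(best_of_n("What is RLHF?", generate, reward, n=8))
# prints "a detailed, well-grounded answer"
```

In practice the quality of the reward model is the ceiling on this procedure, which is why stronger reward models translate directly into better Best-of-N results.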

The Skywork-Reward-V2 series and its accompanying dataset represent a milestone in open-source AI research. By providing robust tools for RLHF, Skywork empowers researchers and developers to build more aligned, safe, and capable AI systems, paving the way for future advancements in AI infrastructure.
