Alibaba’s Qwen team has introduced QwQ-32B, a 32-billion-parameter reasoning model that performs comparably to much larger models such as DeepSeek-R1. This result highlights the potential of reinforcement learning (RL) to improve AI models beyond what conventional pretraining and post-training alone achieve. The team also integrated agent capabilities into the model, enabling it to think critically, use tools, and adapt its reasoning based on environmental feedback.
QwQ-32B has been tested across multiple benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, where it consistently performed at levels close to, or above, those of its larger competitors. On AIME24 it scored 79.5, nearly matching DeepSeek-R1’s 79.8 while outperforming smaller models such as OpenAI’s o1-mini. Similarly, on LiveCodeBench it achieved 63.4, closely trailing DeepSeek-R1’s 65.9. The model’s strong showing across these evaluations underscores RL’s ability to enhance reasoning and problem-solving without requiring significantly larger datasets or more compute.
The Qwen team used a multi-stage RL approach, starting from a cold-start checkpoint and first refining mathematical and coding abilities before expanding to general capabilities. The training process incorporated accuracy verifiers and rule-based rewards to improve instruction following, alignment with human preferences, and agent performance. The team found that even a small number of RL training steps improved the model’s general capabilities without compromising its mathematical and coding strengths.
QwQ-32B is released as an open-weight model under the Apache 2.0 license, available on platforms like Hugging Face and ModelScope as well as through Qwen Chat. Alibaba sees this as a significant step toward scaling RL-driven AI and aims to further explore combining agents with reinforcement learning for long-horizon reasoning tasks. With continued advances, the team believes RL-powered models could accelerate progress toward Artificial General Intelligence (AGI).