Paper
Another interesting research paper. RL for Small LLM Reasoning: What Works and What Doesn't.
The paper investigates how reinforcement learning (RL) can improve reasoning capabilities in small language models (LLMs) under strict computational constraints. The researchers experimented with a 1.5-billion-parameter model (DeepSeek-R1-Distill-Qwen-1.5B) on 4 NVIDIA A40 GPUs over 24 hours, adapting the Group Relative Policy Optimization (GRPO) algorithm and creating a curated dataset of mathematical reasoning problems. The performance gains were accomplished using only 7,000 samples at a rough cost of $42.
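The core idea of GRPO is dropping the separate value model and instead normalizing each sampled answer's reward against the other answers drawn for the same prompt. A minimal sketch of that group-relative advantage step, written from my reading of the method rather than from the authors' code (the function name and the epsilon constant are mine):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage for each of G completions sampled from one prompt:
    reward minus the group mean, scaled by the group standard deviation.
    This stands in for a learned critic in GRPO-style training."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one math problem, rewarded 1 if correct, 0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [ 1, -1, -1,  1]
```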
Through three experiments, they discovered that small LLMs can achieve rapid reasoning improvements within 50-100 training steps using limited high-quality data, but performance degrades with prolonged training under strict length constraints. Mixing easy and hard problems improved training stability, while cosine rewards effectively regulated output length. Their Open-RS variants outperformed models requiring thousands of dollars and much larger datasets to train.
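The cosine reward they mention ties the reward to output length: correct answers earn more when they are short, and incorrect answers are penalized less when they are long, so the model keeps reasoning instead of guessing early. A hedged sketch of one common formulation; the constants and function name here are illustrative, not lifted from the paper:

```python
import math

def cosine_length_reward(correct: bool, gen_len: int, max_len: int = 4096) -> float:
    """Length-aware reward shaping with a cosine schedule over the length budget.
    Constants are illustrative; the paper may use different ranges."""
    t = min(gen_len, max_len) / max_len      # 0.0 at empty output, 1.0 at the length cap
    cos_term = math.cos(t * math.pi)         # +1 for very short outputs, -1 at the cap
    if correct:
        r_min, r_max = 0.5, 1.0              # shorter correct answers earn closer to r_max
        return r_min + 0.5 * (r_max - r_min) * (1.0 + cos_term)
    else:
        r_min, r_max = -1.0, -0.5            # longer incorrect answers are penalized less
        return r_max - 0.5 * (r_max - r_min) * (1.0 + cos_term)
```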
The findings demonstrate that RL-based methods can effectively enhance small LLMs' reasoning capabilities with minimal resources, offering a cost-effective alternative to large-scale approaches. However, challenges remain, including optimization instability, length constraints, and multilingual tendencies. The researchers released their code and datasets as open-source resources to encourage further exploration of resource-efficient approaches to LLM reasoning enhancement.
Thoughts
Not sure what triggers this, but I have seen that Qwen models are especially susceptible to language drift during extensive evaluations (basically, responding in Mandarin instead of English). Stability is always a challenge when training with reinforcement learning. Catastrophic forgetting is something my team is well aware of.
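On the language-drift point: a crude way I'd catch it during evaluation is simply counting CJK characters in each response. A dependency-free sketch (the 5% threshold is arbitrary):

```python
def cjk_ratio(text: str) -> float:
    """Fraction of characters in the common CJK Unified Ideographs block."""
    if not text:
        return 0.0
    return sum('\u4e00' <= ch <= '\u9fff' for ch in text) / len(text)

def flags_language_drift(response: str, threshold: float = 0.05) -> bool:
    """Flag responses where more than `threshold` of the characters are CJK."""
    return cjk_ratio(response) > threshold
```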
The ideal small-model size still seems to be in the 7B+ parameter range for most specific applications, and can go as small as 3B parameters for extremely narrow, focused tasks.
What is the next breakthrough needed for stable long-term reasoning in these constrained settings? Perhaps different reward mechanisms, better management of internal state, architectural changes, or combining RL with other lines of research?
My 2025 focus is still on embeddings/tokenization, and perhaps deeper research around backpropagation.