Another interesting research paper: RL for Small LLM Reasoning: What Works and What Doesn't. The paper investigates how reinforcement learning (RL) can improve the reasoning capabilities of small LLMs under strict computational constraints. The researchers trained a 1.5-billion-parameter model (DeepSeek-R1-Distill-Qwen-1.5B) on 4 NVIDIA A40 GPUs over 24 hours, adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a dataset of mathematical reasoning problems. The performance gains were achieved with only 7,000 samples at a rough cost of $42. Across three experiments, they found that small LLMs can achieve rapid reasoning improvements within 50-100 training steps using limited high-quality data, but that performance degrades with prolonged training under strict length constraints. Mixing easy and hard problems improved training stability, while cosine rewards effectively regulated output length. Their Open-RS varia...
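
For readers unfamiliar with GRPO, the core idea is that each sampled answer is scored relative to the other answers drawn for the same prompt, so no separate value model is needed. The snippet below is a minimal sketch of that group-relative normalization only, not the paper's training code. The function name, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per sampled answer to a single prompt.
    `eps` (an assumed safeguard) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math problem
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one. If every answer in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one plausible reason a mix of easy and hard problems could help stability.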