Microsoft has quietly released several impressive open-weight reasoning models based on Phi-4, accompanied by three significant research papers. In this article, I'll guide you through each of these three papers, examining their key findings and discussing their implications for the broader development of reasoning-capable language models.
- "Phi-4-reasoning Technical Report"
- "Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math"
- "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs"
Why I like Phi
My interest in phi models began with the 2023 paper "Textbooks Are All You Need," which introduced the original phi-1 model. This pioneering work emphasized the development of small language models, highlighting the critical importance of high-quality data, the creation of synthetic training data, and extended training periods. Just three months later, the team released phi-1.5 along with their follow-up paper "Textbooks Are All You Need II," continuing their exploration of these principles.
The phi family of models has proven that Small Language Models (SLMs) can achieve performance comparable to much larger models through carefully curated and synthesized training data. The recent Phi-4 models have expanded these capabilities beyond text to include vision and audio modalities.
When we consider what remains crucial in 2025, the availability of large quantities of high-quality data stands out. These small models have transformed our understanding of what's possible in language model development, showing that careful data curation can compensate for smaller model sizes.
Phi 4 Family
The Phi-4 family consists of Phi-4-Mini, Phi-4-Multimodal, and Phi-4-reasoning/Phi-4-reasoning-plus.

Key Models & Architecture
Audio Modality (Phi-4-Multimodal): This is integrated using a fine-tuned Whisper large-v3 audio encoder and a 2-layer MLP audio projector that maps speech features into the text embedding space. LoRA is also applied to the attention and MLP layers of Phi-4-Mini with a rank of 320. The audio encoder and projector introduce 460M parameters, with the audio LoRA adding another 460M parameters. The speech token rate is one token per 80 ms, equating to 750 tokens per minute of audio.
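A quick sanity check on that token rate, as a minimal sketch assuming exactly one speech token per 80 ms frame:

```python
def audio_tokens(duration_s: float, frame_ms: float = 80.0) -> int:
    """Number of speech tokens for an audio clip, assuming one token per frame."""
    return int(duration_s * 1000 / frame_ms)

# One minute of audio at one token per 80 ms:
print(audio_tokens(60))  # 750
```

So a ten-minute recording would occupy 7,500 tokens of context before any text is added.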
Phi-4-Mini
- A compact 3.8-billion-parameter language model.
- Consists of 32 Transformer layers with a hidden state size of 3,072.
- Trained on "high-quality web and synthetic data," with a specific focus on "carefully curated synthetic data recipe emphasizing high-quality math and coding datasets."
- Features an expanded vocabulary size of 200K tokens for multilingual support.
- Employs Grouped-Query Attention (GQA) for efficient long-sequence generation, reducing KV cache consumption to one-third of that of full multi-head attention.
- Uses a fractional RoPE dimension for smoother handling of longer contexts.
- Demonstrates performance comparable to models twice its size on math and coding tasks requiring complex reasoning.
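The KV-cache saving from GQA is easy to verify with a back-of-the-envelope calculation. The head counts below are illustrative assumptions (hidden size 3,072 with a head dimension of 128 gives 24 query heads; 8 shared KV heads would produce the one-third reduction described above), not figures confirmed by the report:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a K and a V tensor per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative numbers: 32 layers, head_dim 128, 4k context, fp16 cache.
mha = kv_cache_bytes(layers=32, kv_heads=24, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
print(mha // gqa)  # 3: GQA needs one-third of the full-attention cache
```

The saving comes purely from sharing each KV head across several query heads; the arithmetic is the same at any context length.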
Phi-4-Multimodal
- Integrates text, vision, and speech/audio input modalities into a single model.
- Shares the same 3.8-billion-parameter language model backbone as Phi-4-Mini.
- Utilizes a novel "Mixture of LoRAs" technique for multimodal training. This involves freezing the language model backbone and training additional LoRA modules, modality-specific encoders, and projectors.
- Supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs.
- Vision modality: image encoder (SigLIP-400M), a 2-layer MLP projector, and a LoRA adapter (LoRA_V). The vision components introduce 440M parameters (encoder and projector) plus 370M (LoRA_V).
- Audio modality: audio encoder (Whisper large-v3), a 2-layer MLP projector, and a LoRA adapter (LoRA_A). The audio components introduce 460M parameters (encoder and projector) plus 460M (LoRA_A).
- The LoRA adapters and modality-specific routers allow for multiple inference modes without interference between modalities.
- Outperforms larger vision-language and speech-language models on a wide range of tasks.
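To make the Mixture-of-LoRAs idea concrete, here is a toy numpy sketch, not the actual architecture: the dimensions, rank, and dictionary-based "router" are all simplified assumptions. It shows the essential mechanics of a frozen backbone weight, one low-rank adapter per modality, and a forward pass that only routes through the adapter for the active modality.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size, far smaller than the real model

# Frozen backbone weight, shared by all modalities.
W = rng.standard_normal((d, d))

def make_lora(rank: int = 4, scale: float = 1.0):
    """Low-rank adapter: delta_W = scale * B @ A, with B initialized to zero."""
    A = rng.standard_normal((rank, d))
    B = np.zeros((d, rank))  # zero init => adapter starts as a no-op
    return A, B, scale

# One adapter per modality; a real system would train these while W stays frozen.
adapters = {"vision": make_lora(), "audio": make_lora()}

def forward(x, modality=None):
    y = W @ x  # frozen backbone path, identical for every modality
    if modality is not None:
        A, B, scale = adapters[modality]
        y = y + scale * (B @ (A @ x))  # adapter path for the active modality
    return y

x = rng.standard_normal(d)
# With zero-initialized B, every adapter is initially inactive:
assert np.allclose(forward(x), forward(x, "vision"))
```

Because each modality only ever touches its own A and B matrices, training or adding one adapter cannot perturb the backbone or the other modalities, which is the "no interference" property the report emphasizes.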
Phi-4-reasoning / Phi-4-reasoning-plus
- Obtained by supervised fine-tuning (SFT) of the 14-billion-parameter Phi-4 model (note: this is a larger base model than the 3.8B backbone used by Phi-4-Mini/Multimodal).
- Goal is to distill structured reasoning capabilities.
- Introduces two placeholder tokens, <think> and </think>, to mark reasoning blocks.
- Fine-tuning uses a system message that guides the model to use a systematic thinking process before providing a final solution, structuring responses into "Thought" and "Solution" sections.
- Leverages an "additive property" of data mixtures, allowing domain-specific data mixtures (e.g., math and code) to be optimized independently and then combined.
- Phi-4-reasoning-plus is an enhanced version further trained with reinforcement learning, using a 32k-token maximum generation length; it is tested to perform well up to 64k tokens.
- Utilizes a rule-based reward model during RL training to incentivize correctness, penalize undesirable behaviors (repetition, excessive length), and encourage proper formatting (including the use of <think> tags).
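As a rough illustration of how such a rule-based reward might look, here is a minimal sketch; the weights, rules, and answer-matching below are my own guesses for demonstration, not the paper's actual reward function:

```python
import re

def reasoning_reward(response: str, reference_answer: str,
                     max_len: int = 2000) -> float:
    """Toy rule-based reward: reward correctness and proper <think> formatting,
    penalize missing tags and excessive length. All weights are illustrative."""
    reward = 0.0

    # Formatting: exactly one well-formed <think>...</think> block.
    thoughts = re.findall(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reward += 0.2 if len(thoughts) == 1 else -0.2

    # Correctness: the final answer (text after </think>) must match the reference.
    final = response.split("</think>")[-1].strip()
    reward += 1.0 if reference_answer in final else -1.0

    # Length penalty to discourage overly long generations.
    if len(response) > max_len:
        reward -= 0.5
    return reward

print(reasoning_reward("<think>2+2 is 4</think> The answer is 4.", "4"))  # 1.2
```

A real implementation would use a stricter answer-equivalence check than substring matching, and would also need repetition detection, but the structure is the same: a handful of deterministic rules rather than a learned reward model.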
Key Themes
- Data Quality over Quantity: A central theme, consistent with previous Phi models, is that "carefully curated and synthesized data enables Small Language Models (SLMs) to achieve highly competitive performance despite having a significantly smaller number of parameters." This is explicitly mentioned as the driver for Phi-4-Mini's performance.
- Modality Extension with LoRAs: Phi-4-Multimodal introduces a novel approach to integrating new modalities without compromising the core language model's capabilities. This is achieved by freezing the language backbone and using LoRA adapters and modality-specific routers, allowing "seamless integration of new LoRAs to support additional modalities without impacting existing ones."
- Structured Reasoning: The Phi-4-reasoning models highlight the importance of structured thinking processes for complex tasks, particularly in math and coding. The introduction of <think> tokens and explicit instruction for step-by-step reasoning in the system message are key to this approach. The reports show that this fine-tuning significantly improves performance on reasoning benchmarks.
- Extensibility and Modularity: The Mixture-of-LoRAs design in Phi-4-Multimodal is emphasized for its extensibility, allowing easy addition of new modalities. The "additive property" of data mixtures in Phi-4-reasoning training suggests a modular approach to improving capabilities across different domains.
- Safety and Robustness: The reports specifically address safety evaluations, demonstrating that the new Phi-4 models are "more robust to jailbreaks than our previously released Phi-3.5-mini, and than other models of similar size." They also show a lower "Defect Rate" (fraction of responses containing harmful content) compared to Phi-3.5-Mini and comparable performance to other models.
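The "additive property" of data mixtures can be sketched in a few lines. The source names and weights below are invented for illustration: each domain's internal mixture is tuned on its own, then scaled by a domain weight so the combined mixture still sums to one.

```python
# Independently tuned per-domain mixtures (sampling probabilities over sources).
math_mix = {"math_synthetic": 0.7, "math_web": 0.3}  # tuned for math alone
code_mix = {"code_synthetic": 0.6, "code_web": 0.4}  # tuned for code alone

def combine_mixtures(mixes: dict, weights: dict) -> dict:
    """Scale each domain's internal mixture by its domain weight."""
    combined = {}
    for domain, mix in mixes.items():
        for source, p in mix.items():
            combined[source] = weights[domain] * p
    return combined

final_mix = combine_mixtures({"math": math_mix, "code": code_mix},
                             {"math": 0.5, "code": 0.5})
print(round(sum(final_mix.values()), 6))  # 1.0
```

The practical benefit is workflow modularity: the math team and the code team can each run their own ablations, and the final training mixture is just a weighted concatenation of the winners.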
Performance Highlights
- Phi-4-Mini: "Significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning."
- An experimental reasoning-enhanced version of Phi-4-Mini (3.8B) achieves "reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B."
- Strong performance on various language benchmarks, including BigBench-Hard, MMLU, GSM-8K, and MATH (see Table 7).
- Phi-4-Multimodal: ranks first on the OpenASR leaderboard at the time of writing, despite the speech LoRA component having only 460 million parameters.
- Outperforms larger vision-language and speech-language models on a wide range of tasks (see Tables 2, 3, 4, 5, 6, and associated benchmark results).
- Achieves competitive or leading scores on multimodal benchmarks like MMMU, ScienceQA, MathVista, MMBench, TextVQA, DocVQA, InfoVQA, and OCRBench.
- Strong performance on speech tasks like ASR, AST, and SQQA (Spoken Query Question Answering).
- The Phi-4-reasoning models show substantial improvements in reasoning performance compared to the base Phi-4 model (14B) and other models (see Figure 17 and Table 7).
- Demonstrate significantly higher accuracy on challenging math benchmarks like AIME, HMMT, and OmniMath, as well as general reasoning benchmarks like GPQA and BA-Calendar (see Table 7 and Figure 17).
- "Reasoning models... are robust to longer inputs compared to conventional models Phi-4 and GPT-4o, and they are also not affected by the location of key information in the context" on the FlenQA benchmark.
- Show improved precision and recall on the Kitab information retrieval benchmark when provided with context, although factual knowledge without retrieval remains challenging.
Safety and Alignment
- Both Phi-4-Mini and Phi-4-Multimodal show improved safety compared to Phi-3.5-Mini and comparable performance to other models of similar size.
- Evaluated using a Responsible AI (RAI) benchmark (measuring Defect Rate for harmful content categories: Violence, Sexual, Self-Harm, Hateful) and XSTest (measuring Refusal Rate for inappropriate vs. valid prompts).
- Reduced Defect Rates across harmful categories, both with and without the presence of known jailbreaks in prompts (see Tables 10, 11, and 13).
- Higher Refusal Rate for inappropriate prompts (IPRR) and lower Refusal Rate for valid prompts (VPRR) on XSTest compared to Phi-3.5-Mini (see Table 12).
- Multilingual safety evaluation across Tier 1 languages also shows improvement over Phi-3.5-Mini.
- Audio safety alignment for Phi-4-Multimodal was performed using an approach analogous to text safety alignment, utilizing TTS synthesized datasets.
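These metrics are simple to compute. Here is a minimal sketch of the definitions as I read them from the reports; the data layout and toy values are invented for illustration:

```python
def defect_rate(flags):
    """Defect Rate: fraction of responses flagged as harmful (lower is better)."""
    return sum(flags) / len(flags)

def refusal_rates(results):
    """results: list of (prompt_type, refused) pairs, with prompt_type in
    {"inappropriate", "valid"}.  Returns (IPRR, VPRR): refusal rate on
    inappropriate prompts (higher is better) and on valid prompts (lower is
    better)."""
    ip = [refused for kind, refused in results if kind == "inappropriate"]
    vp = [refused for kind, refused in results if kind == "valid"]
    return sum(ip) / len(ip), sum(vp) / len(vp)

# Toy evaluation over four prompts:
iprr, vprr = refusal_rates([("inappropriate", True), ("inappropriate", True),
                            ("valid", False), ("valid", True)])
print(iprr, vprr)  # 1.0 0.5
```

The two refusal rates pull in opposite directions, which is exactly why XSTest reports both: a model that refuses everything scores perfectly on IPRR while failing badly on VPRR.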
Future Directions
- Releasing the reasoning-enhanced Phi-4-Mini, which is currently in a preview stage.
- Exploring further benefits of enabling longer context windows (e.g., 64k tokens) for GRPO training in Phi-4-reasoning models.
- Improving information retrieval and factuality, particularly without retrieval context, for all models.
My Local / Hybrid Setup
Let me share my personal setup for testing language models locally, which I hope will help others get started with their own experimentation.
My testing environment consists of two machines that each serve different purposes. For my desktop work, I use a 2021 Alienware R10 equipped with 64GB of RAM and an NVIDIA RTX 3080 GPU. For portable testing, I have a 2023 Alienware M16 R1 laptop featuring an AMD Ryzen 9 7000 processor, 64GB of RAM, and an NVIDIA RTX 4080 GPU.
It's worth noting an important technical detail here: the same GPU model number doesn't mean identical performance across form factors. Desktop GPUs consistently outperform their laptop counterparts due to better cooling solutions and higher power delivery limits. This is something to keep in mind when planning your own setup.
With this hardware configuration, I've found some practical boundaries for running models locally. You can comfortably run small language models up to 14 billion parameters without any significant performance issues. If you need to push further, models in the 24 to 27 billion parameter range are still feasible with only a modest performance hit; I've had good experiences running Google's Gemma 3 27B model, for instance.
For those times when you need to experiment with even larger models, quantization becomes your friend. This technique reduces the precision of model weights, allowing you to run larger models that would otherwise exceed your hardware's capabilities.
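A rough rule of thumb helps decide when quantization is needed. This is a crude estimate assuming weights dominate memory use, with a 20% overhead factor of my own choosing for KV cache and activations:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough inference memory estimate: weight bytes plus ~20% overhead
    for KV cache and activations (a crude rule of thumb, not a guarantee)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"27B @ {name}: ~{model_memory_gb(27, bits):.0f} GB")
# 27B @ fp16: ~65 GB
# 27B @ int8: ~32 GB
# 27B @ int4: ~16 GB
```

This is why a 27B model is out of reach at full precision on a consumer GPU but becomes workable at 4-bit quantization, especially with partial CPU offloading.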
As for my software stack, I keep things relatively simple. I use Ollama for pulling and installing models—it's become my go-to tool for model management. When I need a more polished interface, I use the Msty app as my frontend, which provides a clean, user-friendly environment for interacting with these models.
However, I don't limit myself to local computing. When I need more computational power or want to experiment with models that exceed my local hardware capabilities, I turn to cloud solutions. Google Colab is excellent for quick experiments and prototyping, while Together.ai offers robust infrastructure for more demanding workloads.
This hybrid approach—combining local and cloud resources—gives me the flexibility to choose the right tool for each specific task while keeping costs manageable.