Deep Dive on Frontier AI Models — May 2026 

Anthropic · OpenAI · Google DeepMind · DeepSeek · Moonshot AI (Kimi)

TL;DR

  • The frontier is now a two-tier oligopoly with an open-weights spoiler tier: Claude Opus 4.7, GPT-5.5 / 5.5 Pro, and Gemini 3.1 Pro lead the closed tier on agentic coding, scientific reasoning, and multimodal work; DeepSeek V4-Pro and Kimi K2.6 close most of the remaining gap at roughly 1/10th–1/25th the cost under MIT-style open weights, with DeepSeek V4-Pro scoring 52 vs Kimi K2.6's 54 on the Artificial Analysis Intelligence Index v4.0 (against ~60 for GPT-5.5 xhigh).
  • Convergence is now near-total on architecture (sparse MoE + sparse/compressed attention), on test-time-compute reasoning (adaptive thinking / Deep Think / "xhigh" / Think-Max), on agentic IDE surfaces (Claude Code, Codex, Antigravity, Kimi Code), and on 1M-token contexts. The remaining axes of differentiation are post-training data recipes, alignment philosophy, hardware vertical integration, and enterprise distribution.
  • For a large-scale regulated deployment, the rational stack is a router-based, Mixture-of-Agents pattern: Claude Opus 4.7 / Sonnet 4.6 as the governed primary (it ships finance-specific agent templates, Microsoft 365 integration, the Moody's MCP, and Bedrock zero-operator-access guarantees), GPT-5.5 Pro for terminal/computer-use and FrontierMath-class quant work, Gemini 3.1 Pro Deep Think for long-context multimodal research and TPU-cheap throughput, Kimi K2.6 self-hosted for sovereign / on-prem agentic coding, and DeepSeek V4-Flash for ultra-cheap bulk reasoning.


Key Findings

Anthropic reclaimed the headline coding crown on April 16, 2026 with Claude Opus 4.7 (64.3% SWE-Bench Pro, 80.8% Verified, 1M context, adaptive thinking, high-resolution 3.75 MP vision) — but conceded internally that the unreleased Claude Mythos Preview is more capable and is gated to "Project Glasswing" cyber partners. Anthropic also concurrently re-platformed its developer surface around Skills + Subagents + Agent Teams + MCP inside Claude Code, and launched a $1.5B Blackstone/Goldman/H&F-backed financial-services JV plus a 12-agent Wall Street agent suite on May 5, 2026.

OpenAI's GPT-5.5 (April 23, 2026, codename "Spud") leads agentic computer use and frontier math at the cost of a 2× per-token price hike ($5/$30 per M tokens; $30/$180 for GPT-5.5 Pro). It hits 82.7% Terminal-Bench 2.0, 35.4% FrontierMath Tier 4, 78.7% OSWorld-Verified, 84.9% GDPval, 85.0% ARC-AGI-2, and 74.0% on MRCR v2 at 1M tokens — a generational long-context jump from GPT-5.4's 36.6%. The independent evaluator Artificial Analysis flagged an 86% hallucination rate on AA-Omniscience, versus 36% for Opus 4.7.

Google's Gemini 3.1 Pro Preview (Feb 11, 2026, building on Gemini 3 Pro of Nov 18, 2025) leads pure reasoning and abstract intelligence — 94.3% GPQA Diamond, 77.1% ARC-AGI-2, 44.4% Humanity's Last Exam, 85.9% BrowseComp, 2887 Elo on LiveCodeBench Pro — and undercuts both peers on price at $2/$12 per M tokens, with 1M context standard and TPU-only training. Gemini 3 Deep Think (Feb 2026 update) operates as a separately-gated heavy-reasoning mode for Ultra subscribers, and Google Antigravity is now the reference agentic IDE.

DeepSeek shipped V3.2 / V3.2-Speciale in Dec 2025 (introducing DeepSeek Sparse Attention — DSA — which cut KV cache from 656 B/token to 132 B/token in the V3 config and dropped long-context inference cost ~78% at 128K), then DeepSeek V4-Pro (1.6T total / 49B active) and V4-Flash (284B / 13B active) on April 24, 2026 under MIT license, with a novel hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) mechanism. V4-Pro-Max delivers 90.1% GPQA, 95.2% HMMT 2026, 93.5% LiveCodeBench, 80.6% SWE-Verified, 55.4% SWE-Pro, 83.4% BrowseComp at $0.435/$0.87 per M tokens with the 75%-discount-extended-to-May-31 rate. NIST/CAISI assessed V4-Pro as "the most capable PRC model to date" but ~8 months behind the closed frontier.

Moonshot AI's Kimi K2.6 (April 20, 2026) keeps the K2.5 architecture (1T MoE, 32B active, 384 experts with 8 routed + 1 shared, MLA, 256K context, native INT4 QAT, MoonViT 400M vision encoder) and re-trains the post-training recipe to deliver 80.2% SWE-Verified, 58.6% SWE-Pro (beats GPT-5.4 by 0.9 pts, edges Opus 4.6), 66.7% Terminal-Bench 2.0, 83.2% BrowseComp, plus a unique 300-sub-agent / 4,000-step / 12-hour-autonomous Agent Swarm. Open-weights MIT-style license, $0.60/$2.50 per M tokens API, self-hosts in INT4 (~500 GB VRAM) on 4× A100 80GB or 8× RTX 4090, with CoreWeave NVFP4-on-GB300 hitting 205 t/s at $0.7/M.

Open-weights frontier is now operationally credible: Kimi K2.6 ranks #1 among open weights on Artificial Analysis Intelligence Index v4.0 (54), DeepSeek V4-Pro is #2 (52), and both meaningfully exceed prior-generation closed frontier models. On SWE-Bench Pro, Kimi K2.6 at 58.6% briefly held the open-source crown over GPT-5.4 (57.7%), though Opus 4.7 (64.3%) and DeepSeek V4-Pro-Max (55.4%) reshuffled the order in late April.

Details

1. Anthropic — the regulated-enterprise incumbent

Model line-up (May 2026): Opus 4.7 (flagship, April 16, 2026), Sonnet 4.6 (Feb 17, 2026), Haiku 4.5 (Oct 15, 2025), Opus 4.6 still in API for transition, plus the restricted-access Claude Mythos Preview for cyber red-team partners under Project Glasswing.

  • Architecture & inference: dense transformer family with adaptive (hybrid) thinking; Opus 4.7 introduced a new tokenizer that emits ~1.0–1.35× as many tokens for the same text — a hidden cost increase even with stable headline pricing. 1M-token context window, 128K max output, high-resolution vision (3.75 MP, ~2576px long edge, 3× prior). Adaptive thinking is the only reasoning toggle; setting thinking: {type: "adaptive"} is now required (extended-thinking budgets removed). A new xhigh effort tier sits between high and max. Task budgets beta (header task-budgets-2026-03-13) lets agents see a token countdown. (A request sketch follows this list.)
  • Reasoning & agentic stack: Claude Code is now the de facto enterprise reference for terminal-native agents. It runs the five-layer stack of Model → MCP servers → Skills → Subagents/Agent Teams → Agent SDK. Subagents each have their own context window, prompts, and per-tool permissions; Agent Teams (experimental) let teammates message each other directly, not just report to a lead. Anthropic's Agent Skills are now an open standard ("Agent Skills" specification adopted by Codex, Gemini CLI, Cursor). On benchmarks: 80.8% SWE-Bench Verified, 64.3% SWE-Bench Pro, 79.1% MCP-Atlas, 69.4% Terminal-Bench 2.0, 78.0% OSWorld-Verified, ~89% GPQA Diamond, 40% Humanity's Last Exam (Opus 4.7), state-of-the-art on GDPval-AA finance/legal tasks.
  • Multimodal: vision-only at high resolution; no native audio/video. Image localization, bounding-box, and screen-coordinate (1:1) modeling improved for computer-use.
  • Long context: 1M tokens GA; retrieval quality at long range now trails GPT-5.5 (Opus 4.7 ~59.2% vs GPT-5.5's 87.5% on 128K–256K MRCR per OpenAI's eval).
  • Post-training: Constitutional AI lineage; explicit RL safety + cyber-uplift differential training in Opus 4.7. Palisade Research's "Language Models Can Autonomously Hack and Self-Replicate" (May 7, 2026, authors: Alena Air, Reworr, Nikolaj Kotov, Dmitrii Volkov, John Steidley, Jeffrey Ladish) reported "Replicating Qwen weights, frontier models reach 81% with Opus 4.6 and 33% with GPT-5.4" — note Opus 4.6 was acting as an agent replicating Qwen weights onto compromised hosts, not its own proprietary weights — motivating the new automatic-block cyber safeguards in 4.7.
  • Open vs Closed: fully closed weights; available on Claude API, Amazon Bedrock (with a new generation Bedrock inference engine offering zero-operator access for enterprise data privacy), Google Cloud Vertex AI, Microsoft Foundry, and now natively inside Google Antigravity through Google billing.
  • Hardware/inference: Anthropic disclosed an up-to-5 GW Amazon compute commitment plus an additional $5B AWS investment and up to $20B more later. Bedrock's next-gen scheduler now does dynamic capacity allocation across versions. A SpaceX compute partnership was announced concurrently with the Opus 4.7 release.
  • Pricing: Opus 4.7 = $5/$25 per M (Cache: 90% off; Batch: 50% off; US-only inference at 1.1×); Sonnet 4.6 = $3/$15; Haiku 4.5 = $1/$5.
  • Enterprise / regulated deployment: Anthropic ships anthropics/financial-services plugin marketplace with pitch-agent, GL-reconciler, market-researcher, IB-modeling, equity-research, KYC screener, NAV-tie-out, statement auditor, etc.; Claude for Excel, PowerPoint, Word, Outlook via Microsoft 365 add-in with Vertex/Bedrock back-end routing; full Compliance API, HIPAA BAA, SCIM, SSO, audit logging, custom retention on Enterprise tier (50-seat min, ~$50k/yr floor). Moody's launched an MCP with credit ratings and data on 600M+ public/private entities; Fiscal AI, IBISWorld, Guidepoint, Dun & Bradstreet, SS&C Intralinks, Third Bridge, Financial Modeling Prep all expose MCP connectors. Claude Managed Agents run autonomous multi-hour sessions with credential vaults and full audit trail in Claude Console.
  • Strategic positioning: the safety-aligned, regulated-vertical incumbent. The May 5, 2026 financial-services JV anchor partners — Anthropic, Blackstone, and Hellman & Friedman — are each putting in roughly $300 million, with Goldman Sachs committing around $150 million as a founding investor, plus participation from Apollo Global Management, General Atlantic, Leonard Green, GIC, and Sequoia Capital — bringing the vehicle to $1.5B to put Claude into PE portfolio companies. Per Anthropic's own April 6, 2026 disclosure, "Our run-rate revenue has surpassed $30 billion, up from $9 billion at the end of 2025," and "over 1,000 customers now spend over $1 million annually, doubling from 500+ in under two months." The first-ever shared-stage appearance of Jamie Dimon (JPM) and Dario Amodei on May 5, 2026 in NYC underlined the strategic centrality of financial services to Anthropic's go-to-market.
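For concreteness, a minimal request sketch for this control surface using the Anthropic Python SDK. The model id, the adaptive-thinking payload, output_config.effort, and the task-budgets beta header are taken from this report rather than from published SDK docs, so treat each literal as an assumption:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

# Hypothetical Opus 4.7 request: adaptive thinking is the only reasoning
# toggle, effort rides in output_config, and the task-budgets beta header
# (per this report) gives long-running agents a token countdown.
response = client.messages.create(
    model="claude-opus-4-7",                        # assumed model id
    max_tokens=8192,
    thinking={"type": "adaptive"},                  # required per this report
    messages=[{"role": "user",
               "content": "Tie out the NAV break in the attached extract."}],
    extra_headers={"anthropic-beta": "task-budgets-2026-03-13"},
    extra_body={"output_config": {"effort": "xhigh"}},  # high / xhigh / max
)
# temperature / top_p / top_k are deliberately omitted: per the practitioner
# notes later in this report, Opus 4.7 rejects them with HTTP 400.
print(response.content[-1].text)
```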

2. OpenAI — frontier capability + product breadth

Model line-up (May 2026): GPT-5.5 (April 23, 2026, "Spud"), GPT-5.5 Pro (parallel test-time compute on same base), GPT-5.5 Thinking (for Plus/Pro/Business/Enterprise), GPT-5.5 Instant (May 5, 2026 default ChatGPT model), and the GPT-5 base family still serving via the router.

  • Architecture & inference: GPT-5.5 is described as the first fully retrained base model since GPT-4.5 and natively omnimodal (text/image/audio/video in one architecture). The GPT-5-era hybrid router persists: a smart fast model + a deeper "thinking" model + a real-time router decides which to call per prompt. GPT-5.5 Pro applies parallel test-time compute to the same underlying base.
  • Reasoning: Reasoning effort tiers none, low, medium (default), high, xhigh. ARC-AGI-2 jumped from 73.3% (5.4) → 85.0% (5.5). FrontierMath Tier 4 doubled to 35.4% (5.5 Pro: 39.6%). AIME 2025 hits 81.2%. The internal Expert-SWE (20-hr human-equiv coding tasks) at 73.1%.
  • Agentic capabilities: state-of-the-art on Terminal-Bench 2.0 (82.7%), OSWorld-Verified (78.7%), GDPval (84.9%, 44 occupations), Tau2-bench Telecom (98.0% w/o prompt tuning). Codex now ships with Chronicle (screen-derived persistent memory). On SWE-Bench Pro, GPT-5.5 trails Opus 4.7 (58.6% vs 64.3%). OpenAI noted memorization signals in Anthropic's evaluation; treat the gap cautiously.
  • Multimodal: text, images, audio, video natively. MMMU-Pro 76% (up from 69.2% on 5.4). Voice intelligence API received a new set of models on May 7, 2026.
  • Long context: API supports 1M tokens (400K in Codex). Pricing increases to 2× input/1.5× output above 272K input tokens. MRCR v2 at 512K-1M jumped from 36.6% → 74.0%; Graphwalks BFS at 1M from 9.4% → 45.4%. This is the standout architectural delta vs 5.4. (A cost sketch follows this list.)
  • Post-training: nearly 200 trusted early-access partners contributed real-use feedback. Strongest safeguards yet, plus targeted bio/cyber red-teaming. Sycophancy down from 14.5% → <6% (carried from GPT-5). GPT-5.5 Instant produced 52.5% fewer hallucinated claims than 5.3 on high-stakes prompts (med/legal/finance).
  • Open vs closed: fully closed. Available via Responses + Chat Completions APIs, Codex, ChatGPT, Azure (no native Microsoft Foundry GPT-5.5 listing yet at the time of writing).
  • Hardware: NVIDIA-dominant infrastructure; ChatGPT began testing ads on May 7, 2026, signaling product-monetization pressure.
  • Pricing: GPT-5.5 = $5 input / $30 output; GPT-5.5 Pro = $30/$180; Batch/Flex at 0.5×; Priority at 2.5×. Regional (data residency) endpoints carry +10% uplift. Real cost vs 5.4 is ~+20% after token efficiency gains in Codex per Artificial Analysis.
  • Enterprise: Codex + Skills (compatible with Anthropic's Agent Skills standard) + Hosted Shell + Apply-Patch + MCP + Tool Search. Used internally by 85%+ of OpenAI employees weekly. Major weakness for regulated industries: 86% hallucination rate on AA-Omniscience (vs Opus 4.7 at 36%, Gemini 3.1 Pro at 50%) — the model is highly knowledgeable but confidently wrong much more often.
  • Strategic positioning: frontier capability + super-app horizontal play. Greg Brockman framed GPT-5.5 as a step toward an OpenAI "super app." Combined with the Codex Chronicle memory, AI browser in development, and GPT-Image-2, the bet is on a unified ChatGPT surface across knowledge work.
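Because the >272K surcharge changes long-context economics materially, here is a toy cost calculator over the list prices above. Whether the output multiplier applies to the whole request or only marginally past the threshold is not stated; the marginal treatment below is an assumption:

```python
def gpt55_cost_usd(input_tokens: int, output_tokens: int, pro: bool = False) -> float:
    """Single-request cost from the May 2026 list prices ($5/$30 per M tokens;
    Pro $30/$180), with the 2x-input / 1.5x-output multiplier above 272K
    input tokens applied marginally (an assumption)."""
    in_rate, out_rate = (30.0, 180.0) if pro else (5.0, 30.0)
    base_in = min(input_tokens, 272_000)
    long_in = max(input_tokens - 272_000, 0)
    out_mult = 1.5 if long_in else 1.0
    return (base_in * in_rate + long_in * in_rate * 2.0
            + output_tokens * out_rate * out_mult) / 1e6

# A 900K-token filing with a 20K-token answer costs ~$8.54 here,
# versus ~$5.10 if the surcharge did not exist.
print(f"${gpt55_cost_usd(900_000, 20_000):.2f}")
```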

3. Google DeepMind — full-stack vertical integration

Model line-up (May 2026): Gemini 3.1 Pro (Feb 11, 2026), Gemini 3 Pro (Nov 18, 2025), Gemini 3 Flash (Dec 17, 2025), Gemini 3.1 Deep Think (Feb 28, 2026 upgrade), Nano Banana 2 / Gemini 3 Pro Image, Lyria 3 (audio/music), plus Deep Research / Deep Research Max agents built on 3.1 Pro.

  • Architecture & inference: Sparse MoE transformer with native multimodality — text, vision, audio integrated at the architectural level (not bolted on). Gemini 3 Pro was trained from scratch, not a fine-tune of 2.5, on TPU pods using JAX + ML Pathways. Reuters reported the model has 500B+ total parameters with 50-100B activated per token (unverified). The full-stack TPU integration (v5e/v5p/Trillium for train, TPU v6 lite for inference) is the structural cost advantage that makes Google's $2/$12 pricing economically viable.
  • Reasoning: Deep Think mode — a heavy parallel test-time-compute reasoning system — pushes 3.1 Pro from 87.8% → 93.8% on AIME and from 44.4% → ~46%+ on Humanity's Last Exam; on Gemini 3 Pro, Deep Think lifted ARC-AGI-2 from 31.1% → 45.1%. Deep Think 3.1 is API-gated (early-access waitlist), plus Ultra-subscriber access in the Gemini app. (A request sketch follows this list.)
  • Agentic stack: Google Antigravity (Nov 18, 2025) — agent-first VS-Code-class desktop IDE with Editor view + Manager view, multi-agent orchestration across editor/terminal/browser, Artifacts (markdown plan docs), Skills (progressive disclosure, ~/.gemini/antigravity/skills/), MCP, and an integrated Gemini 2.5 Computer Use model for browser control plus Nano Banana for image editing. Free during preview, supports Gemini 3.1 Pro, Claude Sonnet/Opus 4.6, GPT-OSS 120B — i.e., Google is platforming competitor models inside its own IDE. Deep Research / Deep Research Max (April 21, 2026) are autonomous research agents with MCP support, native visualizations, multi-step planning, and proprietary-data integration; available via Gemini API.
  • Multimodal: best-in-class on combined text-vision-video-audio. MMMU-Pro 81.0%, Video-MMMU 87.6%, ScreenSpot-Pro high. Nano Banana surpassed 200M image edits and brought 10M new users to the Gemini app post-launch. Gemini 3 Pro Image is #2 on LM Arena text-to-image (1235 Elo).
  • Long context: standard 1M tokens (Gemini 3.1 Pro), 65,536 output tokens. Effective long-context retrieval is competitive but not at GPT-5.5's headline MRCR levels.
  • Post-training: multimodal instruction tuning + RL from human and critic feedback for multi-step reasoning, problem solving and theorem proving. Frontier Safety Framework evaluated, no Critical Capability Levels reached.
  • Open vs closed: closed flagship; Gemma 3 open-weights line continues separately.
  • Hardware/inference strategy: full vertical integration — TPU, compiler, and model co-design. ICI Pods solve the all-to-all communication bottleneck for MoE serving. TPU v6 lite serves 3.1 Pro inference. The economic implication: Google can serve sparse MoE at scale at structurally lower cost than NVIDIA-bound competitors, which is why the API input price is 2.5× cheaper than Claude Opus 4.7's ($2 vs $5) for comparable quality on many tasks.
  • Pricing: $2 input / $12 output per M tokens for Gemini 3.1 Pro. Gemini 3 Flash now the default in the Gemini app (free tier).
  • Enterprise: Vertex AI, Gemini Enterprise, Workspace, Antigravity, NotebookLM, Android Studio, Gemini CLI. Per the January 12, 2026 joint statement from Apple and Google: "Apple and Google have entered into a multi-year collaboration under which the next generation of Apple Foundation Models will be based on Google's Gemini models and cloud technology. These models will help power future Apple Intelligence features, including a more personalized Siri coming this year."
  • Strategic positioning: full-stack low-cost frontier with the broadest model menu. Demis Hassabis and Koray Kavukcuoglu framed Gemini 3 as the convergence of multimodality (Gemini 1), reasoning/tools (Gemini 2), and agentic action (Gemini 3). The Deep Research → Deep Research Max → Antigravity stack is the most coherent answer to "what comes after chat."
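A request sketch via the google-genai Python SDK, following the shape used for Gemini 2.5-era thinking models; the 3.1 Pro model id is this report's, and Deep Think remains waitlist-gated, so both are assumptions:

```python
from google import genai              # pip install google-genai
from google.genai import types

client = genai.Client()               # reads GEMINI_API_KEY

# Hypothetical Gemini 3.1 Pro call; thinking_config mirrors the 2.5-era SDK.
response = client.models.generate_content(
    model="gemini-3.1-pro",           # assumed model id from this report
    contents="Compare the risk-factor sections of these three 10-Ks.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
)
print(response.text)
```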

4. DeepSeek — open-weights efficiency frontier

Model line-up (May 2026): DeepSeek V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active), both released April 24, 2026 under MIT license; V3.2-Speciale still hostable for IMO-class reasoning; DeepSeekMath V2 specialist.

  • Architecture: V3.2 introduced DeepSeek Sparse Attention (DSA) with a lightning indexer that scores token relevance in FP8, then a top-K gather; combined with Multi-Head Latent Attention (MLA) for KV cache compression. KV per token at 128K dropped from ~656 B (dense MLA) to ~132 B with DSA, cutting long-context inference cost ~78%. V4 supersedes this with a hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA): per the V4 HF model card, "In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2." Manifold-Constrained Hyper-Connections (mHC) improve residual signal propagation; Muon optimizer is the first deployment at 1.6T scale; Multi-Token Prediction depth=1 retained; FP4 quantization-aware training for MoE experts. (A toy top-K attention sketch follows this list.)
  • Reasoning: V3.2's unified RL pipeline (single GRPO stage for reasoning + alignment + agentic) replaced separate post-training stages. V4 exposes three modes — Non-think / Think High / Think Max (OpenRouter maps to high / xhigh). V4-Pro-Max delivers 90.1% GPQA Diamond, 95.2% HMMT 2026 Feb, 89.8% IMOAnswerBench, 93.5% LiveCodeBench, Codeforces 3206 rating, 37.7% HLE.
  • Agentic: 80.6% SWE-Verified, 55.4% SWE-Pro, 76.2% SWE-Multilingual, 67.9% Terminal-Bench 2.0, 83.4% BrowseComp, 51.8% Toolathlon, 74.2% MCP-Atlas. Large-scale agentic task synthesis pipeline (1,800+ environments, 85K+ complex instructions). Native Claude Code / OpenCode / OpenClaw integrations; supports both Anthropic and OpenAI API formats (https://api.deepseek.com/anthropic).
  • Multimodal: text-only input/output in V4 preview. (V3.2 was text-only too; vision is on Moonshot's side, not DeepSeek's.)
  • Long context: 1M tokens GA on V4, 384K max output, Think Max mode requires ≥384K context window per recommendation. MRCR 1M at 83.5%.
  • Post-training: Specialist distillation → unified GRPO RL prevents catastrophic forgetting. The agentic task synthesis pipeline is the structural innovation that closed the open-source gap on tool use.
  • Open vs closed: fully open-weights MIT (commercial use allowed). HF: deepseek-ai/DeepSeek-V4-Pro and DeepSeek-V4-Flash, with FP8 base variants. File sizes ~865 GB (Pro), ~160 GB (Flash).
  • Hardware/inference: optimized for H800 / H100 / FP8 deployments. Production setup uses dual micro-batch overlap to hide all-to-all expert communication while DSA executes — only viable because DSA's predictable attention time enables reliable overlap scheduling.
  • Pricing: V4-Flash $0.14 / $0.28 per M tokens; V4-Pro $0.435 / $0.87 with the 75% discount extended to 2026-05-31; full rates $1.74 / $3.48. Input cache hit at 10% of cache-miss price since April 26, 2026. deepseek-chat / deepseek-reasoner aliases retire July 24, 2026.
  • Enterprise: NIST/CAISI (May 2026) called V4-Pro "the most capable PRC model to date across the domains evaluated," with capabilities lagging the closed frontier by ~8 months. For data-sovereignty-sensitive deployments, V4-Flash is the natural high-throughput open default at 13B active per token.
  • Strategic positioning: the open-weights efficiency benchmark. The published technical reports are exceptionally detailed and have become the de facto open reference for sparse-attention + GRPO + agentic task synthesis. Geopolitical headwinds (NIST/CAISI assessment, prior Anthropic distillation accusations) are real but have not slowed adoption.
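To make the sparse-attention idea concrete (see the architecture bullet above), here is a toy single-head top-K attention pass in PyTorch. The production system scores with an FP8 lightning indexer over MLA-compressed KV entries; both are reduced to plain dot products here:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          keep: int = 2048) -> torch.Tensor:
    """Toy DSA-style pass: a cheap indexer scores every cached token, then
    full attention runs only over the top-K gathered entries.
    q: (d,), k and v: (T, d). Illustration, not the production kernel."""
    scores = k @ q                            # indexer: one score per token
    keep = min(keep, k.shape[0])
    idx = torch.topk(scores, keep).indices    # select the relevant subset
    k_sel, v_sel = k[idx], v[idx]             # gather ~keep KV rows, not T
    attn = F.softmax((k_sel @ q) / (k.shape[-1] ** 0.5), dim=0)
    return attn @ v_sel

# At 128K context with keep=2048, the attention gather touches 2K rows
# instead of 128K; the indexer still scans all T tokens, which is why the
# end-to-end saving reported above is ~78% rather than the full 64x.
```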

5. Moonshot AI (Kimi) — open-weights agentic specialist

Model: Kimi K2.6 (April 20, 2026), Modified MIT license.

  • Architecture: 1T MoE, 32B active, 384 experts (8 routed + 1 shared), 61 layers (1 dense), 7,168 attention hidden dim, MoE hidden dim 2,048 per expert, 64 attention heads, MLA attention, SwiGLU, 160K vocab, 256K context, native INT4 QAT. Native multimodality via MoonViT 400M-parameter vision encoder (image + video). Identical topology to K2.5; gains come from re-trained post-training recipe.
  • Reasoning: Thinking mode (temp 1.0) and Instant mode (temp 0.6, top-p 0.95). preserve_thinking flag carries reasoning traces across multi-turn tool loops — important for stateful agentic systems (a tool-loop sketch follows this list).
  • Agentic stack: Agent Swarm scales to 300 sub-agents and 4,000 coordinated steps in a single autonomous run (up from 100 / 1,500 in K2.5); orchestrator dynamically decomposes tasks and recovers from sub-agent failures. Claw Groups (research preview) extend the swarm to heterogeneous human + multi-model collaborations — sub-agents can run on any device with any model carrying their own toolkits, with K2.6 as adaptive coordinator. Documented runs include 12-hour continuous coding, 13-hour rewrite of an 8-year-old financial matching engine, and reimplementing Qwen3.5-0.8B inference in Zig faster than LM Studio. Benchmarks (Moonshot-reported, thinking on): 80.2% SWE-Bench Verified, 58.6% SWE-Bench Pro, 76.7% SWE-Multilingual, 66.7% Terminal-Bench 2.0, 54.0% HLE-Full w/tools, 92.5% F1 DeepSearchQA, 83.2% BrowseComp, 50.0% Toolathlon, 86.7% CharXiv w/python, 93.2% MathVision w/python. Trails GPT-5.4 on AIME 2026 (96.4% vs 99.2%) and GPQA Diamond (90.5% vs 92.8%).
  • Multimodal: native text+image+video input; text output. Hallucination rate on AA-Omniscience dropped from 65% (K2.5) to 39% (K2.6).
  • Long context: 256K standard. Context-management strategy in tool-augmented benchmarks discards all but the most recent tool round.
  • Open vs closed: open weights under Modified MIT. HF: moonshotai/Kimi-K2.6. Day-zero hosting on Cloudflare Workers AI, Baseten, Fireworks, OpenRouter, Novita, Parasail, Ollama, DeepInfra, and CoreWeave.
  • Hardware/inference: native INT4 weights ~594 GB; runs on 4× A100 80GB or 8× RTX 4090 for INT4. Recommended inference engines: vLLM (production), SGLang (structured output / agents), KTransformers (Moonshot-native). CoreWeave on NVIDIA GB300 NVL72 with NVFP4 quantization and Eagle3 speculative decoding achieves 205 tokens/sec at $0.7/M tokens — #1 on Artificial Analysis's K2.6 benchmark across 11 inference providers. Unsloth ships UD-Q2_K_XL (~350 GB total memory) and lossless UD-Q8_K_XL.
  • Pricing: official API $0.60 / $2.50 per M tokens (Moonshot platform); DeepInfra at $0.74/$3.50; OpenRouter $0.75/$3.50.
  • Enterprise: Kimi Code CLI (Apache 2.0, 6,400+ GH stars) is the open Claude-Code analog. Day-zero Cursor / Vercel / Factory.ai / Augment Code / CodeBuddy integrations. CodeBuddy reports 96.6% tool-call success.
  • Strategic positioning: the open-weights agentic SOTA. Where DeepSeek leads on reasoning efficiency, Moonshot leads on long-horizon agent orchestration — the K2.6 release explicitly positions for "the runway to K3" infrastructure (12-hour runs + 300-agent swarms as load-bearing).
  • Heterogeneous Swarm: the frontier is moving from monolithic inference to dynamic, cost-weighted routing. "Coordinated Agents" now decompose prompts and dispatch them to competitor models based on cost-to-capability ratios.
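A sketch of the preserve_thinking tool loop referenced in the reasoning bullet above, over Moonshot's OpenAI-compatible API. The endpoint URL, model id, and flag placement are assumptions based on this report; the tool executor is a stub:

```python
import json
from openai import OpenAI  # Moonshot's API is OpenAI-compatible

client = OpenAI(base_url="https://api.moonshot.ai/v1",  # assumed endpoint
                api_key="MOONSHOT_API_KEY")

TOOLS = [{"type": "function", "function": {
    "name": "read_file",
    "description": "Read a file inside the agent sandbox",
    "parameters": {"type": "object",
                   "properties": {"path": {"type": "string"}},
                   "required": ["path"]}}}]

def run_tool(call) -> str:
    """Stub executor -- in production, scope tools to bounded permissions."""
    args = json.loads(call.function.arguments)
    with open(args["path"]) as f:
        return f.read()

messages = [{"role": "user", "content": "Audit config drift across ./envs."}]
while True:
    resp = client.chat.completions.create(
        model="kimi-k2.6",                       # assumed model id
        temperature=1.0,                         # thinking mode, per this report
        messages=messages,
        tools=TOOLS,
        extra_body={"preserve_thinking": True},  # carry traces across turns
    )
    msg = resp.choices[0].message
    messages.append(msg)                         # reasoning trace stays in history
    if not msg.tool_calls:
        break
    for call in msg.tool_calls:
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_tool(call)})
print(msg.content)
```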

Cross-cutting comparative tables

Headline benchmark table (May 2026, vendor-reported except where noted)

| Benchmark | Opus 4.7 | GPT-5.5 (xhigh) | Gemini 3.1 Pro | DeepSeek V4-Pro-Max | Kimi K2.6 (Thinking) |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.8% | ~80%* | 80.6% | 80.6% | 80.2% |
| SWE-Bench Pro | 64.3% | 58.6% | 54.2% | 55.4% | 58.6% |
| Terminal-Bench 2.0 | 69.4% | 82.7% | 68.5% | 67.9% | 66.7% |
| OSWorld-Verified | 78.0% | 78.7% | ~75% | n/r | n/r |
| MCP-Atlas | 79.1% | 75.3% | 78.2% | 73.6% | n/r |
| BrowseComp | 83.7% | 84.4% | 85.9% | 83.4% | 83.2% |
| GPQA Diamond | 91.3% | 93.0% | 94.3% | 90.1% | 90.5% |
| Humanity's Last Exam | 40.0% | 39.8% | 44.4% | 37.7% | 36.4% (text) |
| FrontierMath Tier 4 | 22.9% | 35.4% (Pro 39.6%) | 16.7% | n/r | n/r |
| AIME 2025 / HMMT 2026 | n/r | 81.2% / 97.7% | 95% / 94.7% | n/r / 95.2% | 96.4% / 92.7% |
| ARC-AGI-2 | 75.8% | 85.0% | 77.1% (Deep Think 93.8%) | n/r | n/r |
| MRCR v2 @ 1M | 59.2% | 74.0% | n/r | 83.5% (MRCR 1M) | n/r |
| GDPval (44 occ.) | 80.3% | 84.9% | 67.3% | 1554 Elo (GDPval-AA) | n/r |
| Hallucination (AA-Omniscience) | 36% (best) | 86% | 50% | n/r | 39% |

*GPT-5.5 SWE-Verified per third-party Artificial Analysis; not the primary OpenAI-reported number.

Pricing matrix ($/M tokens, input / output, May 2026 API list)

| Model | Input | Output | Context | Open? |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | No |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K (1M beta) | No |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | No |
| GPT-5.5 | $5.00 | $30.00 | 1M (>272K: 2×/1.5×) | No |
| GPT-5.5 Pro | $30.00 | $180.00 | 1M | No |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M | No |
| DeepSeek V4-Pro (75% disc.) | $0.435 | $0.87 | 1M | MIT |
| DeepSeek V4-Flash | $0.14 | $0.28 | 1M | MIT |
| Kimi K2.6 (Moonshot API) | $0.60 | $2.50 | 256K | MIT-style |
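Illustrative arithmetic over the list prices above for a fixed workload of 1M input + 100K output tokens per task (cache discounts and long-context surcharges excluded):

```python
# ($/M input, $/M output) list prices from the matrix above.
PRICES = {
    "Claude Opus 4.7":         (5.00, 25.00),
    "Claude Sonnet 4.6":       (3.00, 15.00),
    "GPT-5.5":                 (5.00, 30.00),
    "Gemini 3.1 Pro":          (2.00, 12.00),
    "DeepSeek V4-Pro (disc.)": (0.435, 0.87),
    "DeepSeek V4-Flash":       (0.14, 0.28),
    "Kimi K2.6":               (0.60, 2.50),
}

IN_M, OUT_M = 1.0, 0.1  # millions of tokens per task
for name, (cin, cout) in sorted(PRICES.items(),
                                key=lambda kv: kv[1][0] * IN_M + kv[1][1] * OUT_M):
    print(f"{name:24s} ${cin * IN_M + cout * OUT_M:6.3f} per task")
# GPT-5.5 lands near $8.00/task versus ~$0.85 for Kimi K2.6 and ~$0.52 for
# discounted V4-Pro -- roughly the 1/10th-1/25th gap cited in the TL;DR.
```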

Convergence vs. Divergence

Convergence (where everyone agrees):
  • Sparse MoE is universal at frontier scale (Gemini 3 Pro confirmed sparse MoE; DeepSeek V4 1.6T/49B; Kimi K2.6 1T/32B; GPT-5/5.5 router-based). The dense-only era ended in 2025.
  • Sparse / compressed attention is the new standard for long context economics — DeepSeek DSA → CSA+HCA, Anthropic's adaptive thinking with cached prefixes, Google's TPU-resident MoE, OpenAI's MRCR improvements all point to KV-cache and FLOPs-per-token compression as the dominant scaling axis above raw parameter count.
  • Test-time compute reasoning is the third scaling law — adaptive thinking, Deep Think, xhigh, Think-Max all instantiate the same idea: heavy parallel/serial inference compute on hard prompts. GPT-5.5 Pro and Gemini Deep Think are explicit parallel-sample architectures.
  • 1M context is the new floor for flagship models.
  • Agentic IDEs as the developer surface: Claude Code, Codex, Antigravity, Kimi Code, Gemini CLI all converged on the same primitives — Skills / Subagents / MCP / Hooks / Agent SDK. The Agent Skills standard is now cross-vendor.
  • MCP won as the tool protocol (Anthropic-originated, adopted by Moody's, Dun & Bradstreet, IBISWorld, Guidepoint, Fiscal AI, OpenAI Responses API, Google Antigravity Skills, Kimi Code, DeepSeek API).
  • Computer use / browser automation is table stakes — OSWorld, Terminal-Bench 2.0, ClawBench all matter; every flagship trains for it.

Divergence (where labs differentiate):
  • Open vs closed weights: Anthropic, OpenAI, Google fully closed. DeepSeek and Moonshot fully open under MIT-style licenses. Google straddles via Gemma. This is the single most consequential divergence for regulated, sovereign-AI and FinServ on-prem deployments.
  • Reasoning philosophy: Anthropic = single adaptive-thinking mode + xhigh effort tier (simplification); OpenAI = effort gradient + Pro parallel sampling; Google = separate Deep Think mode for heavy reasoning; DeepSeek = three explicit modes with Specialist→Generalist GRPO distillation; Moonshot = thinking on/off with preserve_thinking. Architecturally, all converge; UX/control surface diverges.
  • Alignment: Anthropic = Constitutional AI + safety-level differential training (cyber capability deliberately reduced in Opus 4.7 relative to Mythos); OpenAI = Preparedness Framework + safeguard stack + 200-partner pre-release; Google = Frontier Safety Framework; DeepSeek/Moonshot = lighter touch (and CAISI flags this).
  • Hardware bets: Google = TPU-only full-stack (structural cost advantage); OpenAI = NVIDIA-dominant + Azure; Anthropic = AWS Trainium + GCP TPU + new SpaceX compute + Bedrock zero-operator-access; DeepSeek = H800/H100 + FP4 QAT; Moonshot = INT4 QAT + NVFP4 (CoreWeave).
  • Target customer: Anthropic = regulated enterprise (FinServ, healthcare, legal) via verticalized agent suites and PE JV; OpenAI = consumer + developer + enterprise super-app; Google = horizontal everything (Search/Workspace/Android/Vertex); DeepSeek = open-weights research + sovereign deployments; Moonshot = open-weights agentic coding.
  • Distribution moats: Anthropic now has a $1.5B PE-backed services arm; OpenAI has ChatGPT consumer dominance + Codex; Google has Search/Android/Workspace + a multi-year Apple partnership for next-generation Apple Foundation Models powering Siri; DeepSeek has the academic mindshare + Chinese sovereign demand; Moonshot has the open-weights agentic ecosystem.
  • Hallucination calibration: Opus 4.7 (36% AA-Omniscience) ≫ Kimi K2.6 (39%) > Gemini 3.1 Pro (50%) > GPT-5.5 (86%). For regulated work, this is the most underappreciated divergence in the market.
  • Engineering the context bottleneck (Google vs. DeepSeek): the standard KV-cache memory required for 100K+ contexts breaks inference economics. Google attacks the bottleneck with TPU-resident MoE serving and ICI-pod interconnects; DeepSeek attacks it with compressed sparse attention (DSA → CSA+HCA) that shrinks KV state per token.

What each model is best at — practitioner cheat sheet

  • Claude Opus 4.7: production code-review, regulated enterprise workflows, long-horizon tool orchestration, MCP-heavy stacks, financial modeling, computer use with audit. Best calibration / lowest hallucination. Pick when correctness is more expensive than capability and an audit trail matters.
  • Claude Sonnet 4.6: 80% of Opus quality at 40% of cost; the production daily-driver under smart routing.
  • Claude Haiku 4.5: high-throughput routing, classification, RAG synthesis, low-latency conversational front-ends; first Haiku with extended thinking; sub-200ms latency.
  • GPT-5.5 / 5.5 Pro: frontier math (FrontierMath, AIME), terminal/CLI agents, 1M-token retrieval-heavy work, OSWorld computer use, scientific R&D scaffolding. Avoid for high-stakes factual recall.
  • GPT-5.5 Instant: ChatGPT default; smart routing across past chats / Gmail / files; conversational quality + lower hallucinations than 5.3.
  • Gemini 3.1 Pro / Deep Think: abstract reasoning, multimodal (video, audio, image) analysis, very-long-context with TPU-cheap pricing, scientific peer review, web research at scale, agentic IDE (Antigravity), Apple Siri integration.
  • Gemini 3 Flash: high-volume multimodal tasks at ChatGPT/Claude consumer-tier latency; default in Gemini app.
  • DeepSeek V4-Pro-Max: open-weights frontier reasoning at 1/30th the cost; math/STEM/coding; long-context (1M) retrieval; sovereign deployment. Text-only.
  • DeepSeek V4-Flash: ultra-cheap bulk reasoning, embeddings substitution for many low-stakes RAG flows.
  • Kimi K2.6: open-weights long-horizon agentic coding, 300-agent swarms, 12-hour autonomous runs, vision-enabled multimodal agents, terminal-first dev work, full-stack/UI generation. Best open-source choice for tool-heavy agents.

Recommendations 

Stage 1 (now — 30 days): Establish governed primary + production multi-model router

  • Adopt Claude Opus 4.7 + Sonnet 4.6 on Bedrock with zero-operator-access, run smart routing to Haiku 4.5 for classification and front-ends. This minimizes regulatory friction (Anthropic ships the only FinServ-grade vertical agent suite + Microsoft 365 add-in + Moody's MCP). Trigger to revisit: if AA-Omniscience hallucination rate of any competitor drops below 30% for two consecutive quarters.
  • Stand up a router-layer MoMA-style architecture (your prior work applies directly): SFT classifier → adaptive routing to Opus / Sonnet / GPT-5.5 / Gemini 3.1 Pro / Kimi K2.6, with caching at the prompt prefix level (90% cache savings on Anthropic; 90% on DeepSeek). Benchmark: target ≤30% Opus calls, ≥60% Sonnet, ≤10% Haiku for stable cost. (A router sketch follows this list.)
  • Deploy Kimi K2.6 INT4 on internal infrastructure (4× A100 80GB or 8× H100 cluster) for sovereign agentic coding behind the firewall — Kimi Code CLI under Apache 2.0 is the open Claude-Code analog. This is the natural home for your AMD Ryzen AI MAX+ 395 local fleet for prototyping smaller subagents in the swarm, with the full K2.6 INT4 served from the on-prem cluster. Trigger: if model self-replication or cyber-uplift safety eval scores cross internal red-team thresholds, freeze new local deployments.
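A minimal sketch of the Stage 1 router referenced above. The tier labels, model ids, and call-mix targets come from this report; the classifier is your own SFT model and is left abstract:

```python
from collections import Counter

ROUTES = {
    "classification": "claude-haiku-4.5",   # high-throughput front-end tier
    "standard":       "claude-sonnet-4.6",  # daily driver, target >=60% of calls
    "hard":           "claude-opus-4.7",    # governed primary, target <=30%
    "quant":          "gpt-5.5-pro",        # FrontierMath-class escalation
    "sovereign_code": "kimi-k2.6-onprem",   # self-hosted INT4 endpoint (assumed)
}
call_mix = Counter()  # monitor against the cost-stability targets above

def route(prompt: str, classify) -> str:
    """classify() is your SFT difficulty/domain classifier (assumption);
    it maps a prompt to one of the ROUTES labels."""
    model = ROUTES.get(classify(prompt), ROUTES["standard"])
    call_mix[model] += 1
    return model
```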

Stage 2 (30–90 days): Specialist routing + agent skills

  • Route quant/math/derivatives modeling and long-horizon scientific work to GPT-5.5 Pro (FrontierMath Tier 4 = 35.4%; Expert-SWE 73.1%). Use it specifically for catalyst analysis, complex stress tests, and PnL attribution that exceeds Opus's reasoning depth. Threshold: when Opus thinking budget exhausted without acceptable answer (auto-fallback).
  • Use Gemini 3.1 Pro Deep Think (when ungated) for due-diligence research and multimodal compliance review — video / audio recording review (earnings calls, expert network interviews), 900-page filings, Deep Research Max as a single API call for credit research workflows pulling proprietary data via MCP. Price advantage is structural ($2/$12 vs $5/$25).
  • Adopt the Agent Skills open standard across Claude Code / Codex / Antigravity / Kimi Code; codify firm-specific Skills (comps, dcf, ic-memo, kyc, gl-recon, nav-tie-out) once and run them everywhere. Use subagents with bounded tool permissions (Read, Grep, Glob, Bash only) for any code-touching agent; reserve write privileges for human-approved diffs.
  • Mixture-of-Agents pattern: front-line worker agents on Kimi K2.6 (cheap, open) → critic agents on Sonnet 4.6 → final-synthesis on Opus 4.7. This mirrors the original MoA topology you published, but with vendor diversity for resilience. The CASTER and ORCH 2026 papers suggest this can cut cost up to 72.4% vs strong-model-only baselines while matching success rates. (A sketch follows this list.)
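A sketch of that worker → critic → synthesis topology; call_model(model_id, prompt) is a placeholder for whatever router client you stood up in Stage 1, and the model ids are this report's:

```python
def mixture_of_agents(task: str, call_model, n_workers: int = 4) -> str:
    """Cross-vendor MoA: cheap open-weights drafts, mid-tier critiques,
    frontier synthesis. Topology only -- retries and timeouts omitted."""
    drafts = [call_model("kimi-k2.6", task) for _ in range(n_workers)]
    critiques = [call_model("claude-sonnet-4.6",
                            f"Critique this draft for factual and logical errors:\n{d}")
                 for d in drafts]
    bundle = "\n\n".join(f"DRAFT {i}:\n{d}\nCRITIQUE {i}:\n{c}"
                         for i, (d, c) in enumerate(zip(drafts, critiques)))
    return call_model("claude-opus-4.7",
                      f"Task: {task}\n\nSynthesize one final answer from:\n{bundle}")
```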

Stage 3 (90–180 days): Deep agentic + sovereignty

  • Pilot Kimi K2.6 Agent Swarm or Claw Groups for heterogeneous human + multi-model collaboration on a constrained workflow (e.g., end-of-day reconciliation across 100+ entities). Document failure modes; pair with deterministic ORCH-style routing for discrete-choice steps to control cost variance.
  • Self-host DeepSeek V4-Flash (13B active, ~160 GB) for bulk reasoning at near-zero marginal cost — RAG synthesis, entity resolution, document classification at hundreds of millions of tokens/day. Threshold: above 50M tokens/day per workload, self-hosting beats API by 5–10×. (A toy break-even calculator follows this list.)
  • Lock procurement against statistical noise. The LMArena top-3 (Opus 4.6 1504, Gemini 3.1 Pro Preview 1500, Opus 4.6 Thinking 1500) sit within overlapping 95% CI — do not sign single-vendor multi-year deals on a single Elo screenshot. Use a procurement watchlist tracking Elo + 95% CI + vote count.
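A toy break-even calculator for the self-hosting threshold above. API rates default to the DeepSeek V4-Flash list prices; the daily cluster figure is a placeholder for your own amortized hardware-plus-power cost, and throughput sharing across workloads shifts the answer substantially:

```python
def self_host_breakeven(tokens_per_day: float,
                        api_in: float = 0.14, api_out: float = 0.28,
                        output_frac: float = 0.2,
                        cluster_usd_per_day: float = 250.0) -> dict:
    """Compare daily API spend against a fixed self-hosted cluster cost.
    Every cost input is an assumption -- substitute negotiated rates and
    measured amortization before acting on the ratio."""
    out_tok = tokens_per_day * output_frac
    in_tok = tokens_per_day - out_tok
    api_usd = (in_tok * api_in + out_tok * api_out) / 1e6
    return {"api_usd_per_day": round(api_usd, 2),
            "cluster_usd_per_day": cluster_usd_per_day,
            "api_to_cluster_ratio": round(api_usd / cluster_usd_per_day, 2)}

# Run with your own volumes, e.g. self_host_breakeven(50e6) per workload,
# and sum across all workloads sharing the same cluster.
```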

Stage 4 (180+ days): Frontier-track watchlist

  • Track Claude Mythos Preview — Anthropic explicitly says Opus 4.7 is less capable; when Mythos releases broadly (post-Glasswing), reassess cyber and reasoning ceilings.
  • Track GPT-5.6 — Polymarket pricing (as of May 12, 2026) puts the probability of release by June 30, 2026 at 78%, and by July 31 at 92%; OpenAI's 5–6 week cadence makes a Q3-2026 model highly likely.
  • Track DeepSeek V4 final (current is preview) and Kimi K3 (telegraphed via the K2.6 "runway" framing).
  • Threshold to switch primary: hallucination rate parity (within 5 pts of Opus on AA-Omniscience), an MCP/finance vertical stack at parity, and SOC-2 / FINRA-grade compliance attestations for any new primary.

Practitioner prompting / context-engineering notes

  • Opus 4.7: omit temperature, top_p, top_k from requests (they now return 400 errors); use output_config.effort (high / xhigh / max) and the task_budget beta for long-running agents. Thinking content is hidden by default; opt in with display: "summarized" for streaming UIs.
  • GPT-5.5: lean on its long-context strength — feed entire codebases / document sets above 256K; pair with retrieval grounding because of the 86% AA-Omniscience risk. Use xhigh for FrontierMath-class quant problems.
  • Gemini 3.1 Pro: instruction-following is exceptional; prefer concise, intent-rich prompts. Use Deep Research Max as an atomic API call for end-to-end research pipelines rather than re-implementing the agent loop yourself.
  • DeepSeek V4-Pro Think Max: budget ≥384K context window; the model needs room to think. Use the Anthropic-compatible API endpoint to drop in behind Claude Code with zero code changes (sketch after this list).
  • Kimi K2.6: enable preserve_thinking for multi-turn tool agents; temperature 1.0 for thinking, 0.6/top_p 0.95 for instant mode. For the 300-agent swarm, scope each sub-agent's tools tightly (Anthropic-style bounded permissions) — the documented 12-hour runs assume the orchestrator can recover from sub-agent failures, which it can only do if blast radius is constrained.
  • Local Ryzen AI MAX+ 395 fleet: 128GB unified memory ceiling means K2.6 full INT4 (~500 GB) won't fit; use it for prototyping subagents, running embedding/retrieval workers, hosting 13–32B distillations (DeepSeek V4 distilled variants, Qwen 3.6 27B, Gemma 3), and routing the heavy K2.6 calls to the on-prem cluster. The right pattern is Ryzen-resident worker agents that call the cluster only on escalation, not the inverse.
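Finally, the drop-in pattern from the DeepSeek note above: point the Anthropic SDK (or Claude Code, via its base-URL setting) at DeepSeek's Anthropic-compatible endpoint. The endpoint URL is from this report; the model id is an assumption:

```python
import anthropic

# Same SDK, different backend: DeepSeek exposes an Anthropic-compatible API.
client = anthropic.Anthropic(
    base_url="https://api.deepseek.com/anthropic",  # endpoint per this report
    api_key="DEEPSEEK_API_KEY",
)
response = client.messages.create(
    model="deepseek-v4-pro",   # assumed model id
    max_tokens=8192,
    messages=[{"role": "user",
               "content": "Draft property-based tests for this parser."}],
)
print(response.content[-1].text)
```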

Near-future trajectory (next 6 months)

  • Test-time compute scaling will dominate the next benchmark jumps more than parameter scaling. Expect GPT-5.6 / Claude Mythos GA / Gemini 3.2 Deep Think to compete primarily on parallel-sample architectures.
  • Open-weights will keep closing the gap on agentic capability (Kimi K3 will likely surpass closed frontier on SWE-Pro within the year) but calibration / hallucination remains the structural moat for the closed labs.
  • Verticalized agent suites will be the procurement battleground, not raw model capability. Anthropic's FinServ JV, OpenAI's super-app, Google's Apple-Siri integration, and Moonshot's Claw Groups all aim at the same prize: owning the workflow surface, not the model.
  • MCP + Agent Skills will become the de facto control plane — any model that doesn't speak both fluently by Q4 2026 will be relegated to commodity inference.

Caveats

  • Several scores are vendor-reported and harness-dependent. Terminal-Bench 2.0 numbers swing 10+ points depending on harness (Terminus-2 vs Codex CLI vs custom). SWE-Bench Pro has reported memorization on some tasks; treat single-decimal-point differences as noise.
  • Anthropic Mythos and Gemini 3 Deep Think (API) are not generally available; the headline frontier benchmarks reflect models a typical enterprise cannot procure today.
  • GPT-5.5 hallucination rate (86% AA-Omniscience) is a serious deployment risk for any high-stakes factual workflow. Pair with retrieval-augmented grounding + a verifier model on every call.
  • DeepSeek V4 is a "preview" release; final weights and benchmarks may shift. NIST/CAISI lag-assessment (~8 months) is a U.S. government read; firms must form their own technical evaluation.
  • Kimi K2.6's 300-agent / 4,000-step / 12-hour autonomous run claims are vendor demonstrations; no third-party replication of the full swarm capability has been published as of late April 2026. Treat as a directional capability, not a benchmarked one.
  • API price stability is not the same as cost stability. Opus 4.7's new tokenizer can raise effective costs up to 35% on the same prompts at the same headline rates; GPT-5.5's higher >272K context multiplier (2× input / 1.5× output) penalizes long-context use disproportionately.
  • Open-weights does not equal cost-free at scale. Kimi K2.6 INT4 needs ~500 GB VRAM (4× A100 80GB or 8× RTX 4090) just to serve — the API at $0.60/$2.50 makes economic sense below ~50M tokens/day per workload.
  • Geopolitical and IP risk for Chinese open-weights models is real: Anthropic publicly accused DeepSeek, Moonshot, and MiniMax of industrial-scale distillation attacks in Feb 2026; CAISI's May 2026 evaluation explicitly framed DeepSeek V4 as a national-security-relevant benchmark. Procurement should treat Chinese open-weights deployments as sovereignty-policy decisions, not pure capability/cost decisions.
  • The "convergence" narrative can obscure structural moats — Google's TPU vertical integration, Anthropic's regulated-enterprise distribution + Apple's Gemini-Siri partnership effectively making Gemini the consumer iOS default, OpenAI's ChatGPT consumer surface, and the open-weights labs' MIT-license velocity are not commoditizable in a 12-month horizon even if model capability converges.
  • The strategy: in May 2026, AI strategy does not rely on choosing one winning model. It requires architecting systems that leverage Anthropic's reliability, DeepSeek's efficiency, OpenAI's reasoning, and Google's edge presence, all dynamically orchestrated by patterns like Kimi's parallel swarms.
