The AI model landscape shifted again on March 18, 2026, when Chinese AI company MiniMax released M2.7 — a self-evolving large language model that managed 30 to 50 percent of its own development workflow.
## MiniMax M2.7: The Self-Evolving Challenger
MiniMax M2.7 is the first major commercial model to publicly document recursive self-improvement: earlier MiniMax models built the research-agent harness that managed the data pipelines, training environments, and evaluation infrastructure used to develop M2.7 itself, handling 30-50% of the development workflow.
- Released March 18, 2026 by MiniMax (China)
- Self-evolving: model participated in its own training pipeline
- 205K context window, 131K max output
- Pricing: $0.30/M input, $1.20/M output
- PinchBench: 86.2% (within 1.2 points of Claude Opus 4.6)
On SWE-bench Pro, M2.7 scored 56.2%, matching GPT-5.3-Codex. On PinchBench it hit 86.2%, placing fifth overall. It scored 57.0% on Terminal-Bench 2.0 and 76.5% on SWE Multilingual.
The hallucination improvement is dramatic: M2.7 scored +1 on the AA-Omniscience Index, up from M2.5's -40, with a hallucination rate of 34% — lower than Claude Sonnet 4.6 (46%) and Gemini 3.1 Pro (50%).
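MiniMax has not published the harness itself, so the description above is all we have to go on. Purely to make the "self-evolving" idea concrete, the sketch below shows the general shape of such a loop: an agent proposes a change to the training pipeline and keeps it only if a held-out evaluation improves. Every name and number in it is a made-up stand-in, not the actual M2.7 system.

```python
# Hypothetical illustration only: MiniMax has not released the M2.7 harness code.
# Shape of the idea: propose a pipeline change, evaluate it, keep it only if it improves.

from dataclasses import dataclass, field
import random

@dataclass
class PipelineConfig:
    # Toy stand-in for the knobs such a harness might manage (data mix, filters, etc.)
    data_mix: dict = field(default_factory=lambda: {"code": 0.4, "web": 0.6})

def propose_change(cfg: PipelineConfig) -> PipelineConfig:
    """Agent step: suggest a modified pipeline config (random tweak here)."""
    new_mix = dict(cfg.data_mix)
    new_mix["code"] = min(max(new_mix["code"] + random.uniform(-0.05, 0.05), 0.0), 1.0)
    new_mix["web"] = 1.0 - new_mix["code"]
    return PipelineConfig(data_mix=new_mix)

def evaluate(cfg: PipelineConfig) -> float:
    """Stand-in for an expensive eval run (coding benchmarks, hallucination suites)."""
    return 1.0 - abs(cfg.data_mix["code"] - 0.5)  # toy objective

def harness_loop(cfg: PipelineConfig, steps: int = 10) -> PipelineConfig:
    best_score = evaluate(cfg)
    for _ in range(steps):
        candidate = propose_change(cfg)
        score = evaluate(candidate)
        if score > best_score:  # gate: keep only measurable improvements
            cfg, best_score = candidate, score
    return cfg

print(harness_loop(PipelineConfig()).data_mix)
```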
## Claude Opus 4.6: The Coding Leader
Anthropic released Claude Opus 4.6 on February 5, 2026. It leads SWE-bench Verified at 80.8% and holds the #1 Chatbot Arena ELO at 1503.
On ARC-AGI-2, Opus scores 68.8% — ahead of GPT-5.4's 52.9% but behind Gemini 3.1 Pro's 77.1%. It also leads on Humanity's Last Exam and DeepSearchQA.
The tradeoff is cost: $5.00/M input and $25.00/M output — the most expensive in this comparison.
## GPT-5.4: The Agentic Workhorse
OpenAI released GPT-5.4 on March 6, 2026. It carries a General Intelligence Index score of 57 (vs Opus's 53) and dominates agentic execution.
GPT-5.4 leads Terminal-Bench 2.0 at 75.1% vs Opus's 65.4%, and SWE-bench Pro at 57.7%. Its native Computer Use surpasses human performance on OSWorld at 75.0%.
## Gemini 3.1 Pro: The Reasoning Powerhouse
Google's Gemini 3.1 Pro (Feb 19, 2026) ranks #1 on the Artificial Analysis Intelligence Index across 115 models, leading 13 of 16 benchmarks.
On ARC-AGI-2 it scores 77.1%, more than double its predecessor's score, and its 94.3% on GPQA Diamond is an all-time record. But it trails on SWE-bench Verified at 68.5%, and latency averages 29 seconds to first token.
## Benchmark Comparison
| Benchmark | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | MiniMax M2.7 |
|---|---|---|---|---|
| SWE-bench Verified | 80.8% | ~78% | 68.5% | TBD |
| SWE-bench Pro | ~45% | 57.7% | — | 56.2% |
| ARC-AGI-2 | 68.8% | 52.9% | 77.1% | — |
| GPQA Diamond | 91.3% | 93.2% | 94.3% | — |
| PinchBench | ~87% | — | — | 86.2% |
| Terminal Bench 2.0 | 65.4% | 75.1% | — | 57.0% |
| Chatbot Arena ELO | 1503 | 1463 | — | TBD |
| MMLU | 91.1% | 89.6% | — | — |
## Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| MiniMax M2.7 | $0.30 | $1.20 | 205K |
| GPT-5.4 | $2.50 | $15.00 | 128K |
| Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M beta) |
| Gemini 3.1 Pro | ~$1.25 | ~$5.00 | 1M |
On input tokens, MiniMax M2.7 is roughly 8x cheaper than GPT-5.4 and 16x cheaper than Claude Opus 4.6; on output tokens the gaps are roughly 12x and 21x.
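To see what those per-token prices mean in practice, the sketch below turns the table into a rough monthly cost estimate for a single hypothetical workload. The prices come from the table above; the request volumes and token counts are illustrative assumptions, not measurements.

```python
# Rough cost estimate from the per-million-token prices in the table above.
# The workload numbers (requests/day, tokens per request) are illustrative assumptions.

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "MiniMax M2.7":    (0.30, 1.20),
    "GPT-5.4":         (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro":  (1.25, 5.00),  # approximate, per the "~" in the table
}

REQUESTS_PER_DAY = 10_000         # assumption
INPUT_TOKENS_PER_REQUEST = 3_000  # assumption
OUTPUT_TOKENS_PER_REQUEST = 800   # assumption
DAYS = 30

for model, (in_price, out_price) in PRICES.items():
    tokens_in = REQUESTS_PER_DAY * INPUT_TOKENS_PER_REQUEST * DAYS
    tokens_out = REQUESTS_PER_DAY * OUTPUT_TOKENS_PER_REQUEST * DAYS
    cost = tokens_in / 1e6 * in_price + tokens_out / 1e6 * out_price
    print(f"{model:16s} ~${cost:,.0f}/month")
```

Under these assumptions the spread is stark: roughly $560/month on M2.7 versus about $10,500/month on Claude Opus 4.6 for the same traffic.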
## Which Model Should You Choose?
- **Budget production:** MiniMax M2.7 — near-frontier performance at roughly a tenth of the price
- **Complex coding:** Claude Opus 4.6 — 80.8% SWE-bench, #1 Chatbot Arena
- **Agentic automation:** GPT-5.4 — Terminal-Bench leader, Computer Use built in
- **Research/reasoning:** Gemini 3.1 Pro — ARC-AGI-2 and GPQA Diamond champion
The AI model comparison in March 2026 reveals a market where no single model dominates every category. The era of one clear best model is over. The smartest strategy is matching the model to the task.
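One practical way to act on that is a thin routing layer that maps task categories to models. The sketch below simply restates the recommendations above in code; the model identifier strings are placeholders, not confirmed API model names.

```python
# Minimal task-based router that restates the recommendations above.
# The model identifier strings are placeholders, not confirmed API model names.

ROUTES = {
    "budget":    "minimax-m2.7",     # high-volume production on a budget
    "coding":    "claude-opus-4.6",  # complex, multi-file coding work
    "agentic":   "gpt-5.4",          # terminal / computer-use automation
    "reasoning": "gemini-3.1-pro",   # research and hard reasoning tasks
}

def pick_model(task_type: str) -> str:
    """Return the recommended model id for a task category, defaulting to the budget tier."""
    return ROUTES.get(task_type, ROUTES["budget"])

assert pick_model("coding") == "claude-opus-4.6"
print(pick_model("agentic"))
```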