Open models have crossed a threshold

LangChain Blog

Open models have crossed a threshold that matters for production agent systems. GLM-5 and MiniMax M2.7 now achieve correctness comparable to Claude Opus 4.6 and GPT-5.4 on core agentic tasks while delivering an 8-10x cost reduction and a 4x latency improvement. This isn't hype; it's measurable across standardized eval harnesses tracking file operations, tool use, and instruction following.

The cost delta is significant enough to change deployment economics. At 10M output tokens per day, you're looking at $250/day on Opus 4.6 versus $12/day on MiniMax M2.7. That's $87k annually, which for many teams represents the difference between a viable product and one that bleeds margin on every interaction. Input costs follow similar ratios: $0.30-0.95 per million tokens for open models versus $2.50-5.00 for frontier closed models.
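The arithmetic behind those figures can be sketched as follows. Note the per-million output prices here are back-derived from the post's daily totals ($250/day and $12/day at 10M tokens/day), so treat them as illustrative rather than quoted list prices:

```python
# Per-million output-token prices, back-derived from the daily figures
# above -- illustrative, not official pricing.
PRICE_PER_M_OUTPUT = {
    "opus-4.6": 25.00,
    "minimax-m2.7": 1.20,
}

def daily_cost(model: str, output_tokens_per_day: int) -> float:
    """USD per day for a given daily output-token volume."""
    return output_tokens_per_day / 1_000_000 * PRICE_PER_M_OUTPUT[model]

tokens = 10_000_000
opus = daily_cost("opus-4.6", tokens)         # 250.0 per day
minimax = daily_cost("minimax-m2.7", tokens)  # 12.0 per day
annual_delta = (opus - minimax) * 365         # ~86,870 -- the ~$87k in the text
```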

Latency tells a similar story. GLM-5 on Baseten averages 0.65s time-to-first-token and 70 tokens/second throughput compared to 2.56s and 34 tokens/second for Opus 4.6. For interactive agent workflows where users expect sub-second response initiation, that gap is architectural — you can't optimize your way around 4x slower TTFT without changing models.
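As a rough back-of-envelope using the figures above, modeling response time as TTFT plus decode time (which ignores network and queueing overhead):

```python
def response_time(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    # Simplified end-to-end latency: time-to-first-token plus decode time.
    return ttft_s + n_tokens / tokens_per_s

# For a 500-token reply, using the quoted Baseten numbers:
glm5 = response_time(0.65, 70, 500)   # ~7.8s
opus = response_time(2.56, 34, 500)   # ~17.3s
```

The TTFT gap dominates perceived responsiveness for short replies; for long replies the throughput gap takes over, and both favor GLM-5 here.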

The eval methodology here matters because agent benchmarks are notoriously noisy. The Deep Agents harness measures four concrete metrics: correctness (pass/fail on assertions), solve rate (correctness weighted by speed), step ratio (actual steps versus expected), and tool call ratio (actual tool invocations versus expected). These aren't proxy metrics — they directly measure whether the model completed the task and how efficiently it got there.
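Three of the four metrics aggregate straightforwardly; a minimal sketch follows. The record field names are hypothetical, and the post doesn't specify the exact speed weighting behind solve rate, so it's omitted:

```python
def eval_metrics(runs: list[dict]) -> dict:
    """Aggregate per-task eval records into harness-level metrics.

    Each record is assumed (hypothetically) to look like:
      {"passed": bool, "steps": int, "expected_steps": int,
       "tool_calls": int, "expected_tool_calls": int}
    """
    n = len(runs)
    return {
        # Fraction of tasks whose assertions all passed.
        "correctness": sum(r["passed"] for r in runs) / n,
        # Actual vs. expected agent steps; ~1.0 means no thrashing.
        "step_ratio": sum(r["steps"] for r in runs)
                      / sum(r["expected_steps"] for r in runs),
        # Actual vs. expected tool invocations.
        "tool_call_ratio": sum(r["tool_calls"] for r in runs)
                           / sum(r["expected_tool_calls"] for r in runs),
    }
```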

Results show GLM-5 at 0.64 correctness (88 of 138 test cases passed) with a 1.02 step ratio and 1.06 tool call ratio. That step ratio near 1.0 means it's not thrashing; it's solving tasks in roughly the expected number of moves. Compare that to Opus 4.6 at 0.68 correctness with a 0.99 step ratio. The correctness gap is 4 percentage points, not 40. For many production workloads, that's an acceptable tradeoff given the cost and latency wins.

Category-level breakdown reveals where open models actually compete. GLM-5 scores 1.0 on file operations, 0.82 on tool use, and 1.0 on unit tests — the same categories where Opus 4.6 also scores 1.0, 0.87, and 1.0 respectively. The gap shows up in conversation (0.38 versus 0.05 for Opus) and memory (0.44 versus 0.67). If your agent workflow is tool-heavy and file-manipulation-heavy rather than conversational, open models are already viable primary options, not just fallbacks.

MiniMax M2.7 shows a different profile: 0.57 correctness overall but 0.92 on file operations and 0.87 on tool use. It's weaker on conversation (0.14), but for structured, tool-driven workflows it performs within range of frontier models at a fraction of the cost.

The practical implication is that model selection now needs to be task-specific rather than defaulting to the most expensive option. If you're running high-volume code generation, file manipulation, or structured tool orchestration, open models deliver comparable quality at dramatically lower cost and latency. If you're building conversational agents with complex memory requirements, frontier models still lead.
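One way to operationalize task-specific selection is a simple router over the per-category scores quoted above. This is a hypothetical sketch: the cost ranking and the 0.8 quality bar are illustrative placeholders, not published numbers:

```python
# Per-category eval scores from the post; the cost ranking is an
# illustrative assumption (the post doesn't publish per-token prices
# for every model).
CATEGORY_SCORES = {
    "glm-5":    {"file_ops": 1.00, "tool_use": 0.82, "unit_tests": 1.00, "memory": 0.44},
    "opus-4.6": {"file_ops": 1.00, "tool_use": 0.87, "unit_tests": 1.00, "memory": 0.67},
}
COST_RANK = {"glm-5": 0, "opus-4.6": 1}  # cheaper models rank lower

def pick_model(category: str, min_score: float = 0.8) -> str:
    """Cheapest model that clears the quality bar for this category;
    fall back to the best scorer when none does."""
    ok = [m for m, s in CATEGORY_SCORES.items() if s.get(category, 0.0) >= min_score]
    if ok:
        return min(ok, key=COST_RANK.__getitem__)
    return max(CATEGORY_SCORES, key=lambda m: CATEGORY_SCORES[m].get(category, 0.0))

pick_model("file_ops")  # -> "glm-5": the open model clears the bar, so route cheap
pick_model("memory")    # -> "opus-4.6": nobody clears 0.8, so best scorer wins
```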

Integration friction is minimal. Swapping to GLM-5 or MiniMax in most agent frameworks is a one-line model parameter change. The harness handles context window detection, tool-calling format adaptation, and capability negotiation. You can run the same eval suite locally against any model to verify performance on your specific task distribution before committing.

The switching cost is primarily eval time and validation. You need to run your own test suite against candidate open models to confirm they handle your specific tool schemas and task patterns. But the infrastructure cost of that validation is low — most teams can run comprehensive evals in hours, not weeks.