Sharper agentic coding
K2.7 improves on K2's headline strength: longer reliable tool-use chains, fewer derailments on multi-file edits, and better recovery when a step fails. This is where the version earns its number.
Moonshot AI's latest open-weight model pushes agentic coding toward the closed frontier while charging a fraction of the per-token price. It ships with published weights you can self-host. For cost-sensitive, high-volume coding work, this is the open option we benchmark first.
Moonshot publishes the weights under a permissive licence, so you can call the hosted API for speed or serve the model on your own GPUs for residency and cost control. Most closed leaders give you neither option.
Hosted pricing is $0.60 input and $2.50 output per million tokens. Against Claude Opus at $5 and $25, that is about an eighth of the input cost and a tenth of the output cost for frontier-class agentic coding.
K2.7 continues the K2 line's focus: long tool-use chains, multi-file edits, and terminal work. On reported coding benchmarks it sits close to the closed leaders and ahead of every other open-weight model.
Around a trillion total parameters with roughly 32B active per token. You get frontier capacity without paying to activate the whole network on every call, which is part of why the hosted price is so low.
Cursor's Composer coding model is widely believed to be a fine-tune of an open-weight base, with Kimi K2 among the names raised. Whatever the truth, the debate is a signal: open-weight Kimi is now production-grade enough that a frontier tool may be built on it.
Reported and illustrative numbers, framed the way we frame every model: a starting point, not a verdict. K2.7 leads the open-weight field on agentic coding and reasoning and closes much of the gap to the closed leaders, at a fraction of their price.
| Capability | Kimi K2.7 | Kimi K2 | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|---|
Agentic coding SWE-Bench Verified | 71.3% +5.5 pts vs K2 | 65.8% | 74.5% | 72.1% |
Agentic terminal coding Terminal-Bench 2.1 Terminus-2 public harness | 63.4% +7.3 pts vs K2 | 56.1% | 74.6% | 78.2% |
Tool use Tau2-Bench | 72.0% +5.6 pts vs K2 | 66.4% | 76.8% | 73.5% |
Reasoning GPQA Diamond | 78.6% +3.5 pts vs K2 | 75.1% | 83.2% | 82.4% |
Math AIME 2025 | 89.4% +4.7 pts vs K2 | 84.7% | 91.0% | 92.3% |
Open-weight field Best open model on coding vs DeepSeek, Llama, Qwen | Leads | Prev. leader | Closed | Closed |
Figures are reported or illustrative and used for orientation only. We re-run our own evals on customer tasks before recommending any model, open or closed, and the cost advantage has to survive a quality check on your workload.
What changes for the engineering team. Two comparisons that matter: K2.7 against the K2 it succeeds, and against a closed flagship (Claude Opus 4.8) on the dimensions a buyer actually weighs: capability, cost, control, and risk.
| Dimension | vs Kimi K2 | vs Claude Opus 4.8 |
|---|---|---|
Coding workflows | +5.5 pts on SWE-Bench Verified and +7.3 pts on Terminal-Bench 2.1 vs K2. The agentic coding gap to the closed leaders is now small enough to matter on price. | Opus still leads on the hardest terminal and reasoning tasks. K2.7 closes most of the everyday coding gap at a fraction of the cost, so it is a strong default for high-volume, well-scoped work. |
Cost and latency | Same low hosted price as K2 ($0.60 / $2.50 per million). The capability went up, the price did not. | About an eighth of Opus input cost and a tenth of output cost. On a coding agent that burns millions of tokens a day, that is the difference between a viable margin and an unviable one. |
Control and residency | Open weights, same as K2. Self-hosting is a deployment decision, not a vendor negotiation. | Opus is API-only. Kimi's open weights let you run inference in your own environment for data residency, air-gapped work, or fixed-cost GPU economics. That is a capability Opus cannot offer. |
Risk and provenance | Same licence and origin questions as K2. Nothing new to diligence beyond the version bump. | A China-origin open model carries different diligence: licence terms, content-policy behaviour, and supply-chain review. We treat those as eval and governance line items, not blockers. |
Inside a Kensink build, Kimi is a routing option behind the same abstraction as Claude and GPT. The agent picks the model by task and cost at runtime, not by a vendor commitment frozen at design time.
K2.7 improves on K2's headline strength: longer reliable tool-use chains, fewer derailments on multi-file edits, and better recovery when a step fails. This is where the version earns its number.
Moonshot publishes the model weights for self-hosting. You can run K2.7 on your own GPUs for residency or fixed-cost economics, or call the hosted API when speed and zero ops matter more.
A long context window suits codebase-scale tasks and long agent transcripts. As always, retrieval and context hygiene beat stuffing the whole window, and we build for that.
Hosted input at $0.60 and output at $2.50 per million, with prompt caching that makes shared system preambles cheaper still. The economics are the headline feature for high-volume work.
Kimi speaks an OpenAI-compatible API surface, so adding it next to Claude and GPT in our provider layer is a config change plus an eval pass, not a rewrite.
The headline is the economics. Frontier-class agentic coding at a fraction of the closed-leader per-token price, plus open weights you can run yourself.
When Cursor shipped its in-house Composer coding model, a widely-discussed theory held that it was not trained from scratch but fine-tuned from an open-weight base, with Kimi K2 among the names raised alongside other open models. Cursor has not published which base, if any, it used. Treat the specific claim as unconfirmed. The signal worth taking seriously is the direction: open-weight Kimi is now strong enough that a frontier commercial tool plausibly building on it is a debate at all.
A permissive licence is what lets anyone, Cursor included, fine-tune and ship on top of an open model. That is the point of open weights, not an abuse of them. The real governance question is transparency: buyers deserve to know what a product is built on, which is exactly the diligence we run before we put any model in a customer's path.
Kimi's origin raises fair questions: licence terms, content-policy and refusal behaviour, and supply-chain review of weights you self-host. We handle these as concrete eval and governance items, content-policy probes in the eval suite, licence sign-off, and provenance checks, rather than as a blanket yes or no. For many workloads it clears the bar; for some regulated ones it will not, and we say so.
For high-volume, well-scoped coding and tool-use work, K2.7 is the open option we put on the eval suite before reaching for a closed flagship. If it clears the quality bar on your tasks, the cost difference is hard to argue with.
The ability to self-host changes what is buildable: data residency, air-gapped deployments, and fixed-cost GPU economics that closed APIs cannot match. We weigh that against the operational cost of running inference, per workload.
Whether or not Composer is built on Kimi, the fact that it is a credible theory tells you open-weight models have crossed into production-grade for coding. We do not repeat the unconfirmed parts as fact, and we do not let the headline replace our own evals.
Kimi is one more routing option in our provider layer. The agent picks Kimi, Claude, or GPT by task difficulty and cost at runtime. No lock-in, no rewrite, and a closed-model fallback for the steps that need it.
Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.
Read the K-FrameworkDirect API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.
An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.
Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.
A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.
Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.
Real distributions, not averages. We know which routes are slow, and why.
The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.
PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.
Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.