A three-model suite
Sol, Terra, and Luna span one family from flagship to fast-and-cheap. Same context window, same tool surface, same caching model. The only axis that changes is capability against price, which is exactly the axis you route on.
OpenAI's GPT-5.6 ships as a three-model suite. Sol is the flagship, and it sets a new state of the art on agentic terminal coding while pulling clear of GPT-5.5 across reasoning and tool use. We integrate it directly, eval-gated, behind a vendor-neutral abstraction. Preview now, general availability in the coming weeks.
GPT-5.6 ships in three sizes. Sol is the flagship at $5 / $30 per million tokens, Terra is the balanced tier at $2.50 / $15, and Luna is the fast, cheap tier at $1 / $6. The right answer is almost always a mix, routed by task difficulty.
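To make the "mix" concrete, here is a minimal sketch of the blended price for a given traffic split. The prices are the preview figures quoted above. The 10/60/30 split and the `blended_price` helper are hypothetical, chosen only for illustration.

```python
# Preview list prices from the text, in USD per million tokens: (input, output).
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def blended_price(mix: dict[str, float]) -> tuple[float, float]:
    """Return the (input, output) price per million tokens for a traffic mix.

    `mix` maps a tier name to its share of tokens; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    blended_in = sum(share * PRICES[tier][0] for tier, share in mix.items())
    blended_out = sum(share * PRICES[tier][1] for tier, share in mix.items())
    return blended_in, blended_out


# Hypothetical split: 10% of tokens on Sol, 60% on Terra, 30% on Luna.
example_mix = {"sol": 0.10, "terra": 0.60, "luna": 0.30}
blended_in, blended_out = blended_price(example_mix)
print(f"input ${blended_in:.2f}/M, output ${blended_out:.2f}/M")
```

With that split the blended rate comes to roughly $2.30 input and $13.80 output per million tokens, under half of running everything on Sol.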
Sol sets a new state of the art on Terminal-Bench 2.1, ahead of GPT-5.5 and the current frontier from Anthropic and Google. For command-line agents and CI-style automation, this is the headline win.
On GeneBench v1 (long-horizon quantitative analysis) Sol beats GPT-5.5 while spending fewer tokens. On ExploitBench it is competitive with the strongest models at roughly a third of the output tokens. Quality and unit cost moved in the same direction.
GPT-5.6 adds explicit cache breakpoints and a 30-minute minimum cache life. Cache reads keep the 90% discount; cache writes bill at 1.25x the uncached input rate. For agents with a large shared preamble, this changes the cost model.
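A rough worked example of that cost model, using the multipliers stated above and Sol's preview input price. The 50,000-token preamble, the 20-turn run, and the assumption that every turn after the first lands inside the cache lifetime are all hypothetical.

```python
INPUT_RATE = 5.00        # Sol uncached input, USD per million tokens (from the text)
CACHE_WRITE_MULT = 1.25  # cache writes bill at 1.25x the uncached input rate
CACHE_READ_MULT = 0.10   # a 90% discount means reads bill at 10% of the input rate


def preamble_cost(preamble_tokens: int, calls: int, cached: bool) -> float:
    """USD spent on the shared preamble alone across `calls` requests.

    Assumes every call after the first hits the cache (stays within its lifetime).
    """
    if calls <= 0:
        return 0.0
    millions = preamble_tokens / 1_000_000
    if not cached:
        return calls * millions * INPUT_RATE
    write = millions * INPUT_RATE * CACHE_WRITE_MULT               # first call
    reads = (calls - 1) * millions * INPUT_RATE * CACHE_READ_MULT  # later calls
    return write + reads


# Hypothetical workload: a 50,000-token preamble replayed over 20 agent turns.
uncached = preamble_cost(50_000, 20, cached=False)
cached = preamble_cost(50_000, 20, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the cache pays for itself on the second call: one write plus one read costs 1.25 + 0.10 = 1.35 preamble-units, against 2.00 uncached.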
Access is limited and general availability is weeks out. We re-run our own eval suite on customer tasks before recommending a switch, and we decide on measured cost and quality, not the launch slide.
From OpenAI's reported preview numbers, set against the current frontier. Sol leads on agentic terminal coding and financial agent tasks and posts gains over GPT-5.5 on every benchmark listed, but it trails Claude Opus 4.8 on SWE-Bench Pro, Humanity's Last Exam, OSWorld-Verified, and GDPval-AA. We treat these as claims until our own evals confirm them on your tasks.
| Capability | GPT-5.6 Sol | GPT-5.5 | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Agentic terminal coding (Terminal-Bench 2.1, Terminus-2 public harness) | 80.7% (+2.5 pts vs 5.5) | 78.2% | 74.6% | 70.3% |
| Agentic coding (SWE-Bench Pro) | 64.9% (+6.3 pts vs 5.5) | 58.6% | 69.2% | 54.2% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools / with tools) | 45.6% / 55.8% (+4.2 pts vs 5.5) | 41.4% / 52.2% | 49.8% / 57.9% | 44.4% / 51.4% |
| Agentic computer use (OSWorld-Verified) | 81.0% (+2.3 pts vs 5.5) | 78.7% | 83.4% | 76.2% |
| Knowledge work (GDPval-AA) | 1824 (+55 vs 5.5) | 1769 | 1890 | 1314 |
| Agentic financial analysis (Finance Agent v2) | 54.6% (+2.8 pts vs 5.5) | 51.8% | 53.9% | 43.0% |
Numbers as reported by OpenAI in the 26 Jun 2026 preview. OpenAI also reports Sol beating GPT-5.5 on GeneBench v1 with fewer tokens, and matching the strongest models on ExploitBench at roughly a third of the output tokens. We re-run our own evals on customer tasks before recommending a switch: a benchmark lead has to hold up on your workload first.
What changes for the engineering team. Two comparisons that matter: Sol against the model it replaces (GPT-5.5), and against the cheaper tiers in its own suite (Terra and Luna), where most production volume should actually run.
| Dimension | vs GPT-5.5 | vs Terra + Luna |
|---|---|---|
| Coding workflows | +2.5 pts on Terminal-Bench 2.1 and +6.3 pts on SWE-Bench Pro over GPT-5.5. The terminal-coding lead is the reason to reach for Sol on command-line agents and CI automation. | Terra handles most local edits and review at half the token price. Reserve Sol for the hard, multi-file, agentic runs where its lead actually shows up. |
| Cost and latency | Same headline price as GPT-5.5 ($5 / $30 per million), but more predictable caching and better token efficiency on long-horizon tasks. Effective cost per finished task drops. | Sol is 2x Terra and 5x Luna on input. Routing easy, high-volume steps down to Terra or Luna is where the suite economics work. |
| Caching + context | Explicit cache breakpoints and a 30-minute minimum cache life replace the old implicit, short-lived cache. Large shared preambles start paying back as cache reads sooner. | All three tiers share the 400K context and the same caching model, so a prompt built for Sol drops onto Terra or Luna without a rewrite. |
| Migration risk | Behind a vendor-neutral abstraction, the switch from 5.5 to 5.6 is a config change plus an eval pass. Most prompts work identically. Preview access and weeks-out GA are the real gating factors. | Tiering is a routing decision, not three integrations. We wire the suite behind one interface and let the agent pick the tier at runtime by difficulty. |
Inside a Kensink build, picking Sol over Terra or Luna is a routing decision the agent makes at runtime by task difficulty, not a vendor commitment frozen at design time.
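As an illustration only, runtime routing by difficulty can be as small as the sketch below. The `Task` type, the 0-to-1 difficulty score, and both thresholds are hypothetical. Real thresholds would come out of an eval suite, not be hard-coded.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LUNA = "luna"
    TERRA = "terra"
    SOL = "sol"


@dataclass(frozen=True)
class Task:
    description: str
    difficulty: float  # hypothetical score in [0, 1], e.g. from a cheap classifier


# Hypothetical cut-offs; in practice these would be tuned against evals.
LUNA_MAX = 0.3
TERRA_MAX = 0.8


def pick_tier(task: Task) -> Tier:
    """Map a task to the cheapest tier assumed able to handle its difficulty."""
    if not 0.0 <= task.difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    if task.difficulty <= LUNA_MAX:
        return Tier.LUNA
    if task.difficulty <= TERRA_MAX:
        return Tier.TERRA
    return Tier.SOL


for task in (
    Task("classify a support ticket", 0.1),
    Task("review a single-file diff", 0.5),
    Task("multi-file refactor driven from the shell", 0.95),
):
    print(f"{task.description} -> {pick_tier(task).value}")
```

Because the caller only ever sees a `Tier`, changing the thresholds or adding a tier stays a routing-table change and does not touch the calling code.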
Explicit cache breakpoints let you mark where a cacheable prefix ends, and a 30-minute minimum cache life replaces the old short, implicit window. Cache reads keep the 90% discount; cache writes bill at 1.25x the uncached input rate from GPT-5.6 onward.
OpenAI is reporting equal-or-better quality at fewer output tokens: stronger than GPT-5.5 on GeneBench v1 with fewer tokens, and competitive with the strongest models on ExploitBench at about a third of the output. For agentic loops, fewer tokens per step compounds.
Sol posts the top reported score on Terminal-Bench 2.1, the command-line agent benchmark. If your product runs shell-driven agents, this is the single most relevant number in the release.
GPT-5.6 is a limited preview as of 26 Jun 2026, with general availability promised in the coming weeks. Final GA pricing and context limits can still move, which is one more reason we eval before we commit a default.
The point of a suite is that you do not pick one model. You route. Here is how we map the three tiers onto real workloads inside a build.
Sol is the flagship. Route to it for the few genuinely hard steps in a workflow, not for everything.
Terra is the balanced default. Most tokens in a well-built system should run here, at half Sol's input price.
Luna is the fast, cheap tier. It is the right choice whenever a smaller model clears the eval bar, which is more often than teams expect.
Per million tokens. Cache reads keep the 90% discount; cache writes bill at 1.25x the uncached input rate from GPT-5.6 onward. Preview pricing, may shift at general availability.
OpenAI is gating GPT-5.6 behind a limited preview and staging general availability. The most capable tier carries the most scrutiny, and refusal behaviour on high-risk topics tends to be tighter on a flagship than on the smaller tiers.
Fewer tokens to reach the same answer means fewer chances to drift on a long agentic run, and a smaller surface for prompt injection to ride in on. We still treat it as a capability claim until our evals confirm it on your data.
A stronger, better-behaved model does not relax the need for task-specific evals on your prompts, your data, and your guardrails. We run the customer eval suite on every model change and every release, and a preview model raises the bar for what those evals must catch.
Limited access and weeks-out GA mean we do not move a production default onto Sol yet. We do wire it into the eval harness now, so the day GA lands we already know whether the gain clears on each customer's tasks.
The interesting design move is three tiers behind one interface. In our builds the agent already routes by difficulty, so adopting the suite is mostly a routing-table change. Most traffic should land on Terra and Luna, with Sol reserved for the hard steps.
If your product runs shell-driven or CI-style agents, Sol's Terminal-Bench 2.1 lead is the most concrete, testable claim in the release. That is the first eval we run, because it is the one most likely to change a routing decision.
Explicit breakpoints and a 30-minute minimum cache life make a large shared preamble genuinely cheap to reuse. For multi-step agents that replay a big system prompt every turn, this is a real line-item saving, and it is worth re-costing existing GPT-5.5 workloads against it.
Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.
Direct API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.
An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.
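A minimal sketch of what such a gate might look like, assuming each eval run boils down to a pass rate and a cost per task. The `EvalResult` shape, the tolerances, and the example numbers are hypothetical, not a description of any particular harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    pass_rate: float      # fraction of eval tasks passed, 0..1
    cost_per_task: float  # USD per finished task


def gate(
    baseline: EvalResult,
    candidate: EvalResult,
    max_quality_drop: float = 0.0,    # hypothetical tolerance
    max_cost_increase: float = 0.10,  # hypothetical tolerance (10%)
) -> bool:
    """Return True only if the candidate holds quality and stays within budget."""
    quality_ok = candidate.pass_rate >= baseline.pass_rate - max_quality_drop
    cost_ok = candidate.cost_per_task <= baseline.cost_per_task * (1 + max_cost_increase)
    return quality_ok and cost_ok


# Made-up numbers purely to exercise the gate.
current = EvalResult(pass_rate=0.82, cost_per_task=0.040)
better = EvalResult(pass_rate=0.86, cost_per_task=0.042)
cheaper_but_worse = EvalResult(pass_rate=0.80, cost_per_task=0.030)

print(gate(current, better))
print(gate(current, cheaper_but_worse))
```

The first candidate passes because quality rises and cost stays inside the 10% allowance. The second is rejected despite being cheaper, because its pass rate falls below the baseline.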
Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.
A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.
Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.
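For illustration, the slicing can be sketched as a small aggregation over usage records. The `Usage` record shape, the example records, and the budget figures are hypothetical. The per-tier prices are the preview figures quoted earlier on this page.

```python
from collections import defaultdict
from dataclasses import dataclass

# Preview prices in USD per million tokens: (input, output).
PRICES = {"sol": (5.00, 30.00), "terra": (2.50, 15.00), "luna": (1.00, 6.00)}


@dataclass(frozen=True)
class Usage:
    feature: str
    tenant: str
    route: str
    tier: str
    input_tokens: int
    output_tokens: int


def cost_usd(u: Usage) -> float:
    """Dollar cost of one request at the tier's list price (caching ignored)."""
    in_rate, out_rate = PRICES[u.tier]
    return (u.input_tokens * in_rate + u.output_tokens * out_rate) / 1_000_000


def spend_by(records: list[Usage], key: str) -> dict[str, float]:
    """Total spend grouped by one Usage field: 'feature', 'tenant', or 'route'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += cost_usd(record)
    return dict(totals)


def over_budget(spend: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Keys whose spend exceeds their budget; keys without a budget never trip."""
    return sorted(k for k, v in spend.items() if v > budgets.get(k, float("inf")))


records = [
    Usage("search", "acme", "/answer", "sol", 1_000_000, 100_000),
    Usage("search", "globex", "/answer", "luna", 2_000_000, 500_000),
    Usage("summarize", "acme", "/digest", "terra", 400_000, 200_000),
]
by_tenant = spend_by(records, "tenant")
print(by_tenant)
print(over_budget(by_tenant, {"acme": 10.0, "globex": 10.0}))
```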
Real distributions, not averages. We know which routes are slow, and why.
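A small sketch of why the distribution matters, using only Python's standard `statistics` module. The latency samples are invented: 95 fast requests and 5 slow ones.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Mean alongside p50/p95/p99 so that tail latency is not averaged away."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Invented data: 95 requests at 200 ms and 5 requests at 4000 ms.
samples = [200.0] * 95 + [4000.0] * 5
summary = latency_summary(samples)
print(summary)
```

On this invented data the mean is about 390 ms, which describes no real request: the median is 200 ms, while the slowest 5% take 4 seconds.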
The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.
PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.
Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.