GPT-5.5 on VM0. OpenAI's flagship reasoning model
OpenAI's flagship of the GPT-5 family. The strongest pick for agentic coding, deep reasoning and computer-use loops at the OpenAI tier.
400K tokens · Text / Vision / Code · Prompt cache
GPT-5.5 is the model you reach for when the work needs both deep reasoning and reliable tool use: orchestrating multi-step agent loops, code edits that have to land first try, and computer-use workflows that span many GUI actions. Vendor benchmarks (SWE-bench Verified, AIME 2025, GPQA Diamond) put concrete numbers on the gains over GPT-5.4.
Vendor list price is $5 / $30 per 1M tokens with cached input at $0.50 / 1M. It's the most expensive model in VM0's Built-in catalogue at ×2 credits, so the cost-effective pattern is to keep GPT-5.4 or Claude Sonnet 4.6 as the everywhere-default and route only the hardest steps to GPT-5.5.
What is GPT-5.5?
April 2026 (successor to GPT-5.4) · Top-tier of the GPT-5 family. OpenAI's flagship for agentic coding and reasoning.
GPT-5.5 is the flagship of OpenAI's GPT-5 generation, released in April 2026 as the recommended upgrade from GPT-5.4. OpenAI frames it as a step-change improvement on agentic tool use and computer-use tasks rather than a refresh on the surface API. The 400K-token context window and reasoning_effort parameter introduced with GPT-5 carry over unchanged, so existing Codex agents drop in without rewrites.
Compared to GPT-5.4 (the workhorse in the same family), GPT-5.5 invests more compute per token on reasoning. The behavioural payoff shows up in three places: stronger first-attempt code patches on multi-file refactors, materially fewer mis-routed tool calls on long agent loops, and noticeable gains on graduate-level science reasoning (GPQA Diamond) and competition math (AIME 2025). The trade-off is the highest list price among GPT-5 variants ($5 / $30 per 1M tokens) and a ×2 credit multiplier on VM0, which is why OpenAI itself positions GPT-5.5 as the planner or escalation tier rather than the everywhere-default.
Independent leaderboards (Artificial Analysis, Vellum) corroborate the relative ordering against GPT-5.4 and place GPT-5.5 within a few points of Claude Opus 4.7 on most agentic-coding tasks. Absolute numbers shift weekly and OpenAI itself has flagged training-data contamination on SWE-bench Verified across frontier models. Treat the public scores as directional rather than authoritative; the structured behavioural differences (tool-call accuracy, computer-use reliability, first-attempt patch quality) are the more durable signal.
What's notable about GPT-5.5
Headline architecture and capability features.
GPT-5.5 keeps the 400K-token context window from GPT-5.4, billed at standard input pricing across the entire window. It supports the reasoning_effort parameter at four levels (minimal, low, medium, high), prompt caching where cached input bills at one-tenth the input rate, and the Responses API surface that codex CLI uses by default. Tool-use, structured outputs and computer-use are unchanged from 5.4. Inputs are multimodal across text, vision and code; the model has no native image generation (use the Images API for that).
Specs at a glance
GPT-5.5 benchmarks
Vendor-reported scores from OpenAI's GPT-5.5 release materials, with deltas shown against the public GPT-5.4 numbers. Independent reviews place 5.5 within a few points of Claude Opus 4.7 on agentic-coding tasks. Treat absolute percentages as directional; OpenAI has flagged training-data contamination on SWE-bench Verified across all frontier models.
GPT-5.5 pricing
Provider list price, per 1M tokens.
How GPT-5.5 behaves in practice
Observed behaviour from production agent runs.
Tool routing
Lowest rate of mis-routed tool calls in the GPT-5 family. The gap versus 5.4 widens on hard edge cases such as conditional tool selection, deeply nested arguments, and tool calls dispatched after long stretches of reasoning.
First-attempt code edits
Strongest patch quality in the GPT-5 family. The right pick when an agent has to modify code that must keep compiling and passing tests, especially when the patch spans multiple files. Vendor-reported SWE-bench Verified reflects this directly.
Computer use
Materially more reliable than 5.4 on multi-step GUI sequences, which is what the OSWorld delta captures. Reach for it when the agent is driving a browser or desktop app over dozens of steps and the cost of a mid-run derailment is high.
Speed
Slower than 5.4 and noticeably slower than 5.4 Mini. Around 70 tokens/sec at medium effort per Artificial Analysis. Reserve it for the steps that actually need the extra reasoning depth and run lighter tiers in parallel.
Hallucination behaviour
GPT-5.5 carries OpenAI's stricter calibration from the GPT-5 generation and tends to admit uncertainty rather than confabulate, which is the reason production teams keep paying the premium for high-stakes reasoning despite cheaper alternatives like DeepSeek V4 Pro now matching it on benchmarks.
Best agent tasks for GPT-5.5
The orchestrator running a multi-tool plan
Use GPT-5.5 as the planner that breaks a customer's request into ten steps, dispatches each step to a GPT-5.4- or 5.4 Mini-tier sub-agent, and stitches the results back together. Running 5.5 only at the planner layer (and the cheaper tiers everywhere else) costs a fraction of running 5.5 end-to-end, with most of the quality preserved.
The first-try code edits that don't waste a CI run
Ask GPT-5.5 to migrate a 50-file codebase from one ORM to another, refactor a tangled module, or apply a security fix across the repo. The patch applies cleanly on the first attempt more often than any other model in the family, and that's exactly what your CI bill will reflect.
The computer-use agent that has to finish the workflow
When the agent is driving a browser through a multi-step booking flow, a desktop app, or a legacy admin UI, 5.5's stronger OSWorld score translates to fewer mid-run derailments and fewer human takeovers. The premium pays for itself the first time a long session doesn't have to be restarted.
The hard-math or hard-science research step
Drop a competition-grade math problem set or a graduate physics derivation in and 5.5 will work through it without the off-by-one slips you see in 5.4. AIME 2025 and GPQA Diamond pick up exactly this kind of behaviour.
When to skip GPT-5.5
Skip GPT-5.5 on high-volume routine work where GPT-5.4 hits the same quality bar at half the credit cost, on latency-sensitive chat replies where GPT-5.4 Mini is much faster, and on bulk classification or extraction jobs where DeepSeek V4 Flash is roughly 35× cheaper at the vendor level.
GPT-5.5 vs other models
GPT-5.5 vs GPT-5.4
GPT-5.4 is the workhorse default in the GPT-5 family and the right pick for most agents. Promote to GPT-5.5 only when 5.4 visibly fails on hard reasoning, long agentic loops or first-attempt code edits, usually as the orchestrator that delegates downward to 5.4- or 5.4 Mini-tier sub-agents.
GPT-5.5 vs Claude Opus 4.7
Same role in different families: the high-stakes orchestrator and the model you escalate to when the cheaper tier fails. Opus 4.7 has the 1M-token context window and Anthropic's safety profile; GPT-5.5 has stronger computer-use scores and is the natural pick for teams already on the Codex framework. Pick by which framework and ecosystem your existing agents target.
GPT-5.5 vs Gemini 3 Pro
Gemini 3 Pro leads on raw long-context reasoning (2M-token window) and on some multimodal benchmarks. GPT-5.5 leads on agentic coding (SWE-bench Verified, Terminal-Bench) and computer use. Pick GPT-5.5 when the agent edits code or drives a UI; pick Gemini 3 Pro when the workload is heavy document or video understanding.
Bottom line: should you use GPT-5.5?
GPT-5.5 is the escalation tier on the OpenAI side. Default to GPT-5.4; promote to 5.5 only on the specific steps where 5.4 visibly fails.
Frequently asked questions
What is GPT-5.5's context window?
400,000 tokens, with up to 128K tokens of output per response. The full window bills at standard rates.
Can GPT-5.5 handle images?
Yes. GPT-5.5 is multimodal. It accepts image inputs alongside text and code, so screenshot-driven and document-vision agents work natively. For image generation use the OpenAI Images API.
When should I pick GPT-5.5 over GPT-5.4?
When (a) the agent is the planner / orchestrator and decisions cascade, (b) the run is long enough that 5.4 starts mis-routing tool calls, or (c) the output must apply cleanly on the first attempt (code edits, structured payloads, computer-use workflows).
Does GPT-5.5 support prompt caching?
Yes. Cached input bills at $0.50 per 1M tokens — a 10× discount on the cached portion. Worth using whenever your system prompt or tool schema is stable across calls.
What framework does GPT-5.5 use on VM0?
Codex. VM0 routes GPT-5.5 through the Codex framework's Responses API surface, which is what codex CLI uses by default. Claude Code-framework agents are not compatible with GPT-5 models on VM0.
Alternatives
Using GPT-5.5 on VM0
Two ways to access GPT-5.5 on VM0
VM0 supports GPT-5.5 as a Built-in model billed in VM0 credits, and through bring-your-own with a OpenAI API key. The Built-in path uses VM0 Managed routing and the credit multiplier explained below; the bring-your-own path bills you directly with the upstream vendor and skips the VM0 credit conversion entirely.
VM0's recommendation
VM0 positions GPT-5.5 as a core agent model, recommended alongside Claude Opus 4.7, Claude Opus 4.6, and Claude Sonnet 4.6 for the steps that drive the actual outcome of an agent run. These are the models we'd pick for the orchestrator role, for code-touching agents, and for any step where a wrong answer is expensive.
Credits and the ×2 multiplier
Every Built-in model on VM0 is priced as a multiple of Claude Sonnet 4.6, which sits at the ×1 credit baseline. GPT-5.5 bills at ×2 credits. The multiplier is what shows up on your VM0 invoice; the vendor list price in the pricing table above is what the upstream provider charges before VM0 converts it into credits.
GPT-5.5 bills at ×2, which means a step here costs 2× the credits of an equivalent step on Sonnet 4.6 (the ×1 baseline). It's a premium tier on VM0, so the cost-effective pattern is to default to a cheaper model and route only the steps that genuinely need the extra reasoning depth to GPT-5.5.
Available on VM0 since April 2026.