You can now access three new models through the API: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
These models surpass GPT-4o and GPT-4o mini on coding, instruction following, and long-context processing. Each model supports a context window of up to 1 million tokens and has a knowledge cutoff of June 2024.
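All three models are available through the standard API. A minimal sketch of calling one with the official Python SDK (assumes the `openai` package is installed and `OPENAI_API_KEY` is set in your environment; the prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Any of the three new model IDs works here:
# "gpt-4.1", "gpt-4.1-mini", or "gpt-4.1-nano".
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "user", "content": "Summarize the key differences between REST and gRPC."},
    ],
)

print(response.choices[0].message.content)
```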
Performance Across Key Areas
Coding
GPT-4.1 scores 54.6% on the SWE-bench Verified benchmark, an improvement of 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5. You can rely on it to generate accurate code patches and diffs across programming languages: it more than doubles GPT-4o's score on Aider's polyglot diff benchmark, producing precise, correctly formatted edits.
Windsurf reports that GPT-4.1 scores 60% higher on its internal coding benchmark, with 30% more efficient tool calling and 50% fewer unnecessary edits. Qodo found that GPT-4.1 produced the better suggestion in 55% of head-to-head GitHub pull request reviews, excelling in both precision and depth of analysis.
For frontend development, GPT-4.1 produces functional and visually appealing web applications. Human graders prefer its output over GPT-4o’s in 80% of comparisons, as demonstrated in a flashcard web app with dynamic search and smooth animations.
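Because the model follows diff formats reliably, you can ask for edits as a unified diff instead of a full-file rewrite. A hedged sketch of that pattern (the file name, its contents, and the prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

source = open("app.py").read()  # hypothetical file to be patched

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": (
                "Fix the off-by-one error in the loop below and reply ONLY "
                "with a unified diff, no prose:\n\n" + source
            ),
        },
    ],
)

# The diff can then be applied with `git apply` or `patch`.
print(response.choices[0].message.content)
```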
Instruction Following
GPT-4.1 scores 87.4% on IFEval, compared to 81.0% for GPT-4o, and 38.3% on the MultiChallenge benchmark, a gain of 10.5 percentage points over GPT-4o. You can expect reliable adherence to output formats, negative instructions, and multi-turn conversational context.
Blue J measured a 53% accuracy improvement on complex tax scenarios. Hex reports close to a 2x improvement on its most challenging SQL evaluations, highlighting GPT-4.1's ability to handle ambiguous schemas and nuanced instructions.
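One way to exercise this in practice is to pair a strict output format with a negative instruction. A small illustrative sketch (the system prompt and question are assumptions, not a prescribed recipe):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer as a JSON object with keys 'answer' and 'confidence'. "
                "Do not include markdown fences or any text outside the JSON."
            ),
        },
        {
            "role": "user",
            "content": "What year was the first transatlantic telegraph cable completed?",
        },
    ],
)

print(response.choices[0].message.content)
```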
Long-Context Processing
With a 1 million-token context window, GPT-4.1 processes large codebases or document sets efficiently. It scores 72.0% on Video-MME's long, no-subtitles category, 6.7 percentage points above GPT-4o. Its needle-in-a-haystack retrieval accuracy remains consistent across all context lengths, and it performs strongly on OpenAI-MRCR, an evaluation of multi-round coreference over long inputs.
Carlyle achieves 50% better retrieval performance from complex, lengthy documents, overcoming limitations like lost-in-the-middle errors.
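Since long-context requests are billed at the standard per-token rates, using the large window is simply a matter of passing a large input. A sketch, assuming a local text file that fits within the window (the file name and question are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical large document; the 1M-token window comfortably fits
# hundreds of pages of text or a mid-sized codebase.
with open("annual_report.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": document + "\n\nList every credit-risk disclosure in the document above.",
        },
    ],
)

print(response.choices[0].message.content)
```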
Vision
GPT-4.1 mini outperforms GPT-4o on several vision benchmarks, enabling you to process images effectively for tasks like chart analysis or visual data interpretation.
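Image inputs go through the same endpoint as text. A minimal sketch with GPT-4.1 mini, using a hypothetical image URL:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/quarterly-revenue.png"},
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
```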
Cost and Accessibility
You benefit from lower costs with GPT-4.1, which is 26% less expensive than GPT-4o for median queries. GPT-4.1 nano offers the lowest cost at $0.10 per 1M input tokens. Prompt caching gives a 75% discount on cached input tokens, and long-context requests incur no fees beyond the standard per-token rates.
| Model | Input ($/1M tokens) | Cached Input ($/1M tokens) | Output ($/1M tokens) | Blended Pricing ($/1M tokens) |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $0.50 | $8.00 | $1.84 |
| GPT-4.1 mini | $0.40 | $0.10 | $1.60 | $0.42 |
| GPT-4.1 nano | $0.10 | $0.025 | $0.40 | $0.12 |
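To see how the table plays out in practice, here is a small back-of-the-envelope calculator using the GPT-4.1 rates above (the token counts in the example are illustrative):

```python
# Per-1M-token rates for GPT-4.1, taken from the table above.
INPUT, CACHED_INPUT, OUTPUT = 2.00, 0.50, 8.00

def request_cost(new_input_toks, cached_input_toks, output_toks):
    """Cost in dollars for a single request."""
    return (
        new_input_toks * INPUT
        + cached_input_toks * CACHED_INPUT  # 75% cheaper than fresh input
        + output_toks * OUTPUT
    ) / 1_000_000

# Example: an 8k-token prompt where 6k tokens hit the cache, 1k tokens out.
print(f"${request_cost(2_000, 6_000, 1_000):.4f}")  # $0.0150
```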
You can process large-scale, latency-tolerant workloads through the Batch API at a further 50% discount on these rates.
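Submitting a batch is a two-step process: upload a JSONL file of requests, then create the batch job. A sketch against the real endpoints (the file name and request contents are illustrative):

```python
import json

from openai import OpenAI

client = OpenAI()

# Each line of the JSONL file is one request in Batch API format.
with open("requests.jsonl", "w") as f:
    f.write(json.dumps({
        "custom_id": "req-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",
            "messages": [{"role": "user", "content": "Classify: 'Great service!'"}],
        },
    }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24 hours
)
print(batch.id, batch.status)
```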
Building AI Agents
GPT-4.1’s enhanced instruction following and long-context processing enable you to develop reliable AI agents. These agents handle complex tasks, such as software engineering, document analysis, and customer support, with minimal oversight when paired with tools like the Responses API.
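A minimal sketch of an agent loop on the Responses API with a single function tool; the tool name, schema, and stubbed result are hypothetical, and a production agent would loop until no further tool calls are emitted:

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical tool the agent may call; only this schema is sent to the model.
tools = [{
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch an order's status by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

response = client.responses.create(
    model="gpt-4.1",
    input="Where is order 8842?",
    tools=tools,
)

# If the model decided to call the tool, run it and send the result back.
for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        result = {"order_id": args["order_id"], "status": "shipped"}  # stubbed lookup
        followup = client.responses.create(
            model="gpt-4.1",
            previous_response_id=response.id,
            input=[{
                "type": "function_call_output",
                "call_id": item.call_id,
                "output": json.dumps(result),
            }],
            tools=tools,
        )
        print(followup.output_text)
```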
Transition from GPT-4.5 Preview
GPT-4.1 replaces GPT-4.5 Preview, offering similar or better performance at lower cost and latency. GPT-4.5 Preview will be deprecated in the API on July 14, 2025, so plan your transition before that date. GPT-4.1 preserves much of GPT-4.5's creativity and nuance while being far more scalable.
Conclusion
You can leverage GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano to build advanced applications with improved coding, instruction following, and long-context capabilities. These models provide cost-effective solutions for diverse use cases. Visit the API documentation to start integrating GPT-4.1 into your projects.