At Zencoder, our mission is to help you ship software faster. Every software improvement begins with the same basic step: taking an issue from the backlog and producing code that correctly solves it. To ensure we're building the best coding partner possible, our applied research team continuously benchmarks our agents against a growing set of public and private standards. This rigorous measurement approach follows our core belief that meaningful progress in AI coding assistance requires objective, real-world validation, because you can't manage what you can't measure.
The Benchmark That Matters
The cornerstone of these public benchmarks is SWE-bench, and specifically SWE-bench Verified, a high-quality subset of manually selected SWE-bench tasks. As we covered in depth in a previous post, SWE-bench challenges AI agents to solve real-world GitHub issues and validates those solutions against a verified set of unit tests, approximating a practical test of real-world coding capability.
Our Journey to the Top
When we first started working on Zencoder, the best large language models could solve less than 5% of these issues. As we built our testing infrastructure and made consistent progress sprint after sprint, we began to see a compelling trajectory emerge.
This progression revealed fascinating milestone effects:
- At around the 45% mark, coding agents became powerful enough to spark the "vibe coding" movement.
- When we crossed the 60% threshold, Zencoder usage jumped 5x, signaling a sea change in the utility and accuracy of our coding agent.

The Three Superpowers Behind Our Success
This achievement is powered by the full lifecycle of our unique agentic pipeline, which from day one has leveraged three fundamental capabilities:
1. Deep Contextual Understanding of Codebases
Just as you wouldn't expect a new developer to produce a useful pull request in their first five minutes on the job, LLMs need comprehensive context to become productive. Every session for an LLM is like "50 First Dates": the model has never seen your code before, which makes it sensitive to both the completeness of the context it receives and the signal-to-noise ratio of truly useful information within it. This understanding led us to develop Repo Grokking™ technology, which we've integrated into all our products.
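Repo Grokking itself is proprietary, but the underlying signal-to-noise idea can be illustrated with a deliberately simple sketch: rank repository files by lexical overlap with the issue text and hand the model only the top few, rather than the whole tree. Everything below is illustrative, not our production retrieval:

```python
from pathlib import Path

def rank_files_by_relevance(repo_root: str, issue_text: str, top_k: int = 5) -> list[Path]:
    """Toy relevance ranking: score each Python file by how many issue keywords it mentions.

    Real context engines use symbol graphs, embeddings, and repository structure; this
    sketch only shows why filtering for signal-to-noise matters before prompting an LLM.
    """
    keywords = {word.lower() for word in issue_text.split() if len(word) > 3}
    scored = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue  # unreadable files add noise, not signal
        score = sum(text.count(keyword) for keyword in keywords)
        if score:
            scored.append((score, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in scored[:top_k]]
```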
2. Powerful Tool Integration
Our platform’s performance on real-world engineering workflows has grown dramatically with the introduction of tool use. At Zencoder, we've always believed it's naive to expect LLMs to replace the diverse tools humans have created. Humans remain more intelligent than AI, and building valuable software requires numerous specialized tools (compilers, linters, etc.) and sophisticated processes. We've invested heavily in enabling our AI agents to leverage existing tools effectively, building both proprietary solutions and connectors to hundreds of others.
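Our production tool layer is far richer than this, but the basic pattern of exposing an existing developer tool to an agent is easy to sketch. The registry and linter choice below are illustrative, not our actual API:

```python
import subprocess

def lint_file(file_path: str) -> str:
    """Illustrative agent tool: run an existing linter (pyflakes here, if installed)
    and return its findings as plain text the model can read and act on."""
    result = subprocess.run(
        ["python", "-m", "pyflakes", file_path],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr

# A minimal registry an agent loop could dispatch tool calls against; real agents
# register many such tools: build, test, search, version control, issue trackers, ...
TOOLS = {"lint_file": lint_file}
```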
3. Verification Through Feedback Loops
Try walking a straight line with your eyes closed—it's nearly impossible. Similarly, engineers constantly rely on feedback: IDE error highlights, compiler warnings, failed tests, and code reviews. This principle applies to all intelligent agents, including AI. From our earliest development, we've emphasized verification capabilities that not only provide peace of mind but also enable AI to self-correct and improve its output.
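The simplest version of such a loop is easy to sketch: propose a patch, run the tests, and feed any failures back for another attempt. The `agent.propose_patch` interface below is a hypothetical stand-in, and the loop assumes a git checkout with a pytest suite:

```python
import subprocess

def solve_with_feedback(agent, max_attempts: int = 3) -> str | None:
    """Propose a patch, verify it against the test suite, and retry on failure.

    `agent.propose_patch(feedback)` is a hypothetical interface standing in for the
    real agent; the point is that a verification signal (failing tests) lets the
    agent self-correct instead of walking with its eyes closed.
    """
    feedback = ""
    for _ in range(max_attempts):
        patch = agent.propose_patch(feedback)                 # hypothetical call
        with open("candidate.patch", "w") as fh:
            fh.write(patch)
        subprocess.run(["git", "apply", "candidate.patch"], check=True)
        tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if tests.returncode == 0:
            return patch                                      # verified: tests pass
        subprocess.run(["git", "checkout", "--", "."], check=True)  # revert and retry
        feedback = tests.stdout[-2000:]                       # failure output becomes context
    return None
```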
Practical Principles That Make the Difference
Atop these foundations, two core approaches have proven invaluable in building the best-performing coding agent available today:
Model Agnosticism: We maintain a modular platform architecture that lets us benchmark not just entire processes but individual components, using the optimal model for each task. Sometimes that's our own model, sometimes it's one from Anthropic, OpenAI, or Google, and occasionally it's open source. This flexibility gives us a significant advantage over single-model agents in quality, price, and latency.
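The exact routing logic is internal, but its shape is roughly a per-component table that our benchmarks keep updating: each pipeline step is pinned to whichever model currently wins on quality, price, and latency for that step. The component names and the in-house model below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ModelChoice:
    provider: str
    model: str

# Illustrative routing table: each pipeline component is benchmarked separately and
# assigned whichever model currently performs best for that step.
COMPONENT_MODELS = {
    "context_retrieval": ModelChoice("in-house", "repo-ranker-v2"),   # hypothetical model name
    "code_editing":      ModelChoice("anthropic", "claude-3-7-sonnet"),
    "patch_review":      ModelChoice("openai", "o3"),
}

def model_for(component: str) -> ModelChoice:
    return COMPONENT_MODELS[component]
```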
Industry Experience: Our team comes from practical industry backgrounds. For instance, I previously built and scaled a SaaS business from zero to millions of users with hundreds of engineers and tens of thousands of automated tests. We understand that benchmarks alone don't always translate to production value, so our agents are designed with practical utility as their north star.
How We Achieved 70%
Our experimental setup employs two key components:
- Parallel Agent Execution: We ran four distinct Zen Agents simultaneously on each SWE-bench task, with each generating its own solution.
- Critic Evaluation: We used OpenAI's o3 model as a judge to select the best solution from among the four candidates.
As we worked on the next generation of autonomous agents, we wanted to explore these two directions in the lab before offering similar capabilities in production.
Each agent in our ensemble has the following characteristics (a minimal sketch of the full setup follows the list):
- One-Shot Solving: Each agent gets exactly one try to solve the task.
- Model Diversity: Agents use either Anthropic's Claude Sonnet 3.7 or OpenAI's o4-mini model.
- Specialized Tooling: Each agent is equipped with different combinations of edit, search, diagnostic, and "thinking" tools.
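The sketch below shows the shape of that orchestration, not our production pipeline: four differently configured agents each take one shot at the task in parallel, and an o3-based critic picks the candidate it judges most likely to be correct. The `run_agent` and `critic_choose` callables, along with the particular tool combinations, are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """One ensemble member: a model plus its tool loadout (combinations are illustrative)."""
    model: str
    tools: tuple[str, ...]

ENSEMBLE = [
    AgentConfig("claude-3-7-sonnet", ("edit", "search", "diagnostics")),
    AgentConfig("claude-3-7-sonnet", ("edit", "search", "thinking")),
    AgentConfig("o4-mini",           ("edit", "search", "diagnostics")),
    AgentConfig("o4-mini",           ("edit", "thinking")),
]

def solve_task(task: dict, run_agent, critic_choose) -> str:
    """Run every agent once, in parallel, then let the critic pick one patch.

    `run_agent(config, task)` returns a candidate patch; `critic_choose(task, candidates)`
    uses an o3-style judge to select the most promising candidate. Both are hypothetical.
    """
    with ThreadPoolExecutor(max_workers=len(ENSEMBLE)) as pool:
        candidates = list(pool.map(lambda cfg: run_agent(cfg, task), ENSEMBLE))
    return critic_choose(task, candidates)
```

Running the agents concurrently keeps wall-clock time close to a single run, while the critic turns four independent one-shot attempts into a single answer.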
The Power of Ensemble Approaches
What's particularly fascinating is the performance comparison: while our best individual agent achieved a solve rate of 66.6%, the four agents used in the ensemble individually achieved solve rates between 60.8% and 64.6%. Yet together, with the “critic” role selecting the optimal solution, they pushed our overall success rate to 70%.
Even more compelling, a theoretical "Best of 4" solution—where we always pick the correct solution if any agent found it—would reach an astonishing 78.6% success rate. This dramatic improvement underscores the immense potential of verification and feedback loops, a direction our lab is actively working toward.
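To make those numbers concrete, here is the arithmetic as a small sketch: the oracle counts a task as solved if any of the four candidates passes, while the critic-based ensemble only gets credit when the judge happens to pick a passing patch. The data structures are illustrative:

```python
def ensemble_solve_rates(
    results: dict[str, list[bool]],  # per task: pass/fail for each of the four candidate patches
    picked: dict[str, int],          # per task: index of the candidate the critic selected
) -> tuple[float, float]:
    """Return (oracle "Best of 4" rate, critic-selected rate) over all tasks."""
    tasks = list(results)
    oracle = sum(any(results[task]) for task in tasks) / len(tasks)           # 78.6% in the run above
    critic = sum(results[task][picked[task]] for task in tasks) / len(tasks)  # 70% in the run above
    return oracle, critic
```

The gap between the two numbers is exactly the headroom a better critic, or stronger verification, could still capture.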
Looking Forward: What This Means for You
This achievement isn't just a number on a leaderboard. It represents a significant step toward our vision of autonomous agents capable of producing valuable work in real production environments. As we continue refining our approach to verification and feedback systems, we expect to see even more dramatic improvements in capability and reliability.
For Zencoder users, this means increasingly powerful AI assistance that better understands your codebase, leverages the right tools at the right time, and learns from its own successes and failures. The 70% benchmark is not an endpoint but a milestone on our journey to transform how software gets built.
We'll continue pushing these boundaries in our lab while ensuring that every advance translates to practical value for your development workflow. After all, our ultimate measure of success isn't a benchmark score—it's how much faster you can ship great software with Zencoder by your side.