Cursor AI, an AI-powered code editor, recently published (and later removed) a blog post on a new model and inference method for high-accuracy full-file edits at 1000 tokens per second. The post is no longer accessible, so we are reposting the methodology here for readers. This is the link to the original Cursor AI blog.
Advanced models such as GPT-4o struggle with large edits, exhibiting laziness, inaccuracy, and high latency. This weakness is especially visible in coding agents: accurately editing hundreds of lines can take multiple model calls, sometimes trapping the agent in an infinite loop, and even small, isolated edits are plagued with bugs.
Worst of all, existing models are slow at large edits, breaking the programmer out of flow. Cursor trained a specialized model on a critical variant of the full-file code edit task, which they call fast apply.
Difficult code edits are broken down into two stages: planning and applying.
For Cursor, the planning phase takes the form of a chat interface with a powerful model such as GPT-4o or Claude 3.5 Sonnet. Applying the proposed change to the current file should then be straightforward and instant.
Cursor’s fast-apply model surpasses GPT-4 and GPT-4o performance. They achieved speeds of ~1000 tokens/s (around 3500 chars/s) on their 70B model using a speculative-decoding variant tailored for code edits, called speculative edits.
This amounts to a ~13x speedup over vanilla inference with Llama-3-70b and a ~9x speedup over their previous GPT-4 speculative-edits deployment.
What is Speculative Decoding?
Speculative decoding is an inference technique that speeds up language-model text generation by making educated guesses about future tokens in a sequence. Instead of generating tokens one at a time, which is slow and computationally expensive, speculative decoding uses a small draft (approximate) model to propose several tokens ahead; the larger target model then verifies those proposals in a single forward pass, accepting the longest prefix it agrees with. This significantly speeds up the overall process.
This method capitalizes on the predictability and structured nature of certain text sequences, such as code, allowing for faster and more efficient text generation while maintaining high accuracy. Speculative decoding is particularly useful in settings where low latency and rapid response times are critical, such as in interactive coding assistants and real-time communication tools.
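To make this concrete, here is a minimal sketch of the draft-and-verify loop. The `draft_model` and `target_model` objects and their methods are assumptions for illustration, not a real library API.

```python
# Minimal sketch of speculative decoding with a small draft model.
# draft_model.propose and target_model.greedy_predictions are assumed interfaces.

def speculative_decode(target_model, draft_model, prompt_ids, max_new_tokens=256, k=8):
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1. Draft: the cheap model proposes k candidate tokens.
        draft = draft_model.propose(tokens, k)
        # 2. Verify: the large model scores prompt + draft in one forward pass and
        #    returns its own greedy prediction at each of the k+1 positions.
        predicted = target_model.greedy_predictions(tokens, draft)
        # 3. Accept the longest prefix on which draft and target agree, then take one
        #    extra token from the target model so progress is always made.
        accepted = 0
        while accepted < k and draft[accepted] == predicted[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted])
        tokens.append(predicted[accepted])
    return tokens
```

Because every accepted token is one the target model would have produced anyway, the output matches greedy decoding while requiring far fewer sequential forward passes.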
Cursor’s Case
Cursor AI applied the concept of speculative decoding to the domain of code edits, which they termed “speculative edits”. This method was designed to expedite full-file code rewrites, achieving speeds up to 9x faster than traditional approaches. Instead of generating each token one by one, which is slow and disrupts the programmer’s workflow, Cursor’s speculative-edits algorithm predicts larger chunks of the file at once. This works because, during a code edit, most of the output is an unchanged copy of the original file, which gives a strong prior on the draft tokens at any point in time.
Cursor implemented speculative edits by having a deterministic algorithm speculate on future tokens, rather than depending solely on a draft model’s forward passes. This approach is tailored for their fast-apply model, optimizing full-file rewrites rather than focusing on diffs. Essentially, speculative edits enabled Cursor to process and apply significant code changes almost instantaneously, revolutionizing the speed and fluidity of large-scale code edits within their system.
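The sketch below illustrates that idea (it is not Cursor’s actual implementation): the original file itself serves as the deterministic draft, and the assumed `model.greedy_predictions` interface from the previous sketch verifies a whole chunk in one forward pass.

```python
# Illustrative sketch of speculative edits: the unchanged original file is the draft,
# so no draft model is needed. model.greedy_predictions is an assumed interface.

def speculative_edit(model, prompt_ids, original_file_ids, chunk=16):
    output = []
    position = 0  # how far into the original file the speculation has advanced
    while position < len(original_file_ids):
        # Speculate that the next chunk is copied verbatim from the original file.
        draft = original_file_ids[position:position + chunk]
        # Verify the whole chunk with a single forward pass of the rewrite model.
        predicted = model.greedy_predictions(prompt_ids + output, draft)
        accepted = 0
        while accepted < len(draft) and draft[accepted] == predicted[accepted]:
            accepted += 1
        output.extend(draft[:accepted])
        position += accepted
        if accepted < len(draft):
            # The model diverged from the original file: this is where the edit happens.
            # Emit the model's token and, as a simplification, skip one draft token;
            # a real implementation must re-align the draft with the output after
            # insertions or deletions.
            output.append(predicted[accepted])
            position += 1
    return output
```

Long unedited stretches of the file are accepted chunk-by-chunk in single forward passes, so the number of sequential model calls drops sharply on unchanged code.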
Their custom language models can generate the fully rewritten file conditioned on the current file, the conversation history, and the current code block.
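An illustrative prompt layout for such a model might look like the following; the tags and wording are assumptions for illustration, not Cursor’s actual template.

```python
# Hypothetical prompt layout for a fast-apply style rewrite model.
def build_fast_apply_prompt(current_file: str, conversation: str, code_block: str) -> str:
    return (
        "<current_file>\n" + current_file + "\n</current_file>\n"
        "<conversation>\n" + conversation + "\n</conversation>\n"
        "<code_block>\n" + code_block + "\n</code_block>\n"
        "Rewrite the entire file with the proposed change applied:\n"
    )
```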
Speculative Decoding for Faster Results
Cursor’s biggest breakthrough came from their speculative-edits algorithm, a variant of speculative decoding that achieved up to a 9x speed improvement. It speculates on future tokens with a deterministic algorithm, avoiding the forward passes a separate draft model would otherwise require.
In standard LLM inference, each token relies on the context of all previously generated tokens. The next token (n+1) cannot be produced until the current token (n) is available.
Speculative decoding relaxes this constraint: multiple speculated tokens are verified in a single forward pass and accepted as long as they remain consistent with the preceding context, so generation no longer proceeds strictly one token at a time.
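The snippet below shows the mechanics: one forward pass of a causal language model produces a next-token prediction at every position, so an entire run of speculated tokens can be checked at once. It uses Hugging Face Transformers with GPT-2 purely as a stand-in model.

```python
# Verify several speculated tokens with one forward pass of a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix_ids = tok.encode("def add(a, b):\n    return ")
draft_ids = tok.encode("a + b\n")  # speculated continuation

input_ids = torch.tensor([prefix_ids + draft_ids])
with torch.no_grad():
    logits = model(input_ids).logits[0]  # shape: [sequence_length, vocab_size]

# The logit at position i predicts token i + 1, so one pass scores every draft position.
preds = logits[len(prefix_ids) - 1 : -1].argmax(-1).tolist()
accepted = 0
for draft_tok, pred_tok in zip(draft_ids, preds):
    if draft_tok != pred_tok:
        break
    accepted += 1
print(f"accepted {accepted} of {len(draft_ids)} draft tokens in a single forward pass")
```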
Cursor partnered with Fireworks AI to deploy this model efficiently, further increasing the speed advantage of Llama-3 over GPT-4.
You can also use Fireworks AI for speculative decoding.
How Cursor Trained and Evaluated Their Model
To refine their model, Cursor built an evaluation set of 450 full-file edits, each under 400 lines, and measured the performance of various prompted models with Claude-3 Opus acting as the grader. Grading with Opus aligned more closely with Cursor’s internal assessments than grading with GPT-4-Turbo or GPT-4o did.
Although this setup may introduce some bias towards Claude models, the results matched Cursor’s qualitative evaluations. Notably, Claude-3 Sonnet outperformed GPT-4-Turbo, while GPT-4o performed similarly to GPT-4-Turbo.
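A minimal sketch of this kind of model-graded evaluation is shown below. The grading prompt, the field names, and the `call_grader` helper (standing in for a Claude-3 Opus API call) are assumptions for illustration, not Cursor’s actual setup.

```python
# Model-graded pairwise evaluation of full-file rewrites (illustrative sketch).

GRADER_PROMPT = """You are grading two full-file rewrites of the same file against an intended edit.
Intended edit:
{edit}

Rewrite A:
{a}

Rewrite B:
{b}

Answer with exactly one letter, A or B, for the more faithful rewrite."""

def grade_pair(edit, rewrite_a, rewrite_b, call_grader):
    reply = call_grader(GRADER_PROMPT.format(edit=edit, a=rewrite_a, b=rewrite_b))
    return "A" if reply.strip().upper().startswith("A") else "B"

def evaluate(eval_set, call_grader):
    # eval_set: e.g. 450 examples, each a dict with "edit", "model_a", "model_b" keys (assumed).
    wins = {"A": 0, "B": 0}
    for example in eval_set:
        winner = grade_pair(example["edit"], example["model_a"], example["model_b"], call_grader)
        wins[winner] += 1
    return wins
```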
Why Full-File Rewrites Instead of Diffs?
Cursor opted to have the model rewrite entire files instead of suggesting diffs. They discovered that language models struggled with diff-formatted edits for several reasons:
- More Tokens for Thinking: Rewriting an entire file allows the model to utilize more tokens, giving it more forward passes to find the right solution.
- Diffs are Rare in Training: Models likely see more full-file code examples than diff-format ones during training.
- Line Number Issues: Handling line numbers in different formats is challenging, especially if tokenizers treat them as a single token.
To summarize: Cursor sidesteps the line-number issue by using a diff format based on search-and-replace blocks, which needs no line numbers and stays robust even when the model makes minor errors. Even so, among the tested models, only Claude Opus managed to output accurate diffs consistently.
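For illustration, here is a generic sketch of how a search-and-replace edit block can be applied; it shows the format’s mechanics, not Cursor’s implementation.

```python
# Apply a search-and-replace style edit: the edit carries the exact text to find and its
# replacement, so no line numbers are required.

def apply_search_replace(file_text: str, search_block: str, replace_block: str) -> str:
    if search_block not in file_text:
        raise ValueError("search block not found; the model's edit does not match the file")
    # Replace only the first occurrence; a robust implementation would also flag ambiguous matches.
    return file_text.replace(search_block, replace_block, 1)

original = "def greet(name):\n    print('hi ' + name)\n"
patched = apply_search_replace(
    original,
    search_block="    print('hi ' + name)\n",
    replace_block="    print(f'hello, {name}!')\n",
)
print(patched)
```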
Speed and Performance Analysis
Cursor measured speed by dividing the number of rewritten characters by the time taken for the rewrite. This metric normalized performance across tokenizers and provided a meaningful single value for speed. The results placed Opus, Sonnet, GPT-4o, and Haiku on the speed-versus-accuracy Pareto frontier.
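In code, the metric is simply rewritten characters divided by wall-clock seconds; the numbers below are illustrative and are not Cursor’s measurements.

```python
# Characters-per-second speed metric (tokenizer-agnostic); example numbers only.
rewritten_chars = 35_000   # size of the rewritten file in characters
rewrite_seconds = 10.0     # wall-clock time taken for the rewrite
chars_per_second = rewritten_chars / rewrite_seconds
print(f"{chars_per_second:.0f} chars/s")  # 3500 chars/s, roughly ~1000 tokens/s for code
```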
Speculative edits added another layer of optimization, allowing GPT-4-Turbo to perform at a speed similar to GPT-4o. However, speculative edits were not yet available for GPT-4o at the time.
Custom Model Training
Since speculative edits weren’t feasible with any of Anthropic’s models, Cursor trained their own custom model. They started with a set of “fast-apply” prompts, generated additional training data using GPT-4, and fine-tuned on the resulting dataset. Their best-performing model, Llama-3-70b-ft, almost matched Claude-3 Opus and outperformed GPT-4-Turbo.
Cursor also downsampled their dataset to balance file sizes and reduce over-represented categories. Their training efforts paid off as the finetuned Llama-3 model consistently outperformed other models in evaluations.
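A sketch of that downsampling step might look like the following; the field names, bucket width, and per-bucket cap are assumptions for illustration.

```python
# Balance a fine-tuning dataset by file size: bucket examples by line count and cap each bucket.
import random
from collections import defaultdict

def downsample_by_file_size(examples, bucket_lines=100, cap_per_bucket=500, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        line_count = ex["file_text"].count("\n") + 1  # "file_text" is an assumed field name
        buckets[line_count // bucket_lines].append(ex)
    balanced = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        balanced.extend(bucket[:cap_per_bucket])  # keep at most cap_per_bucket per size range
    rng.shuffle(balanced)
    return balanced
```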
Future Enhancements
Cursor’s next steps involve expanding the model’s context to handle files up to 2,500 lines. They are also exploring knowledge distillation, which could transfer the “fast-apply” abilities into smaller models like Llama-3-8b. Additionally, on-policy reinforcement learning (RL) is being considered for further accuracy improvements.
These advancements are not only vital for code generation but also lay the foundation for more sophisticated systems, where low-latency applications are increasingly important.
Cursor’s approach to speculative edits and full-file rewrites exemplifies their commitment to building faster, more accurate models. Their success in this field highlights the depth and precision behind their product development. For more in-depth LLM comparisons and analysis, check the Bind AI Blog.