Three months after the release of Qwen 2.5, the Qwen team released the open-source Qwen 2.5 1M models. The first month of 2025 has already given us advanced models like Mistral Codestral 25.01 and DeepSeek R1, and the new Qwen models make that line-up even better. Let’s see how the Qwen 2.5 1M models differentiate themselves from previous Qwen 2.5 iterations, such as Qwen2.5-Turbo, and from other LLMs. But before that, let’s get a technical overview of Qwen 2.5 1M.
Introduction to Qwen 2.5-1M
The Qwen 2.5-1M models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, represent a leap forward in the Qwen series. The new models offer context lengths of up to one million tokens. This upgrade allows for more complex and nuanced interactions with large datasets, making the models particularly useful for applications that require extensive context analysis, such as legal document review, academic research, and long-form content and code generation.
This release follows the successful upgrade of Qwen2.5-Turbo, which also supports long-context processing but was limited to shorter token lengths. The new models are designed to enhance inference speed and accuracy through an open-source inference framework based on vLLM, the high-throughput LLM serving library.
Here’s what to expect from the Qwen2.5-1M release:
- Open-source Models: The release includes two primary instruction-tuned checkpoints (7B and 14B) with a significantly enhanced ability to manage long sequences.
- Inference Framework: The new framework integrates sparse attention methods that allow for processing inputs of up to one million tokens at speeds three to seven times faster than previous versions.
- Qwen Chat: An advanced AI assistant that uses the capabilities of the Qwen series, enabling users to engage in conversations, generate content, and perform various tasks seamlessly.
Qwen 2.5 1M Performance Evaluation
Long-Context Tasks
The Qwen 2.5-1M models have been evaluated on several long-context tasks, including the Passkey Retrieval task, where they demonstrated impressive accuracy in retrieving hidden information from documents containing up to one million tokens (a sketch of how such a test is built follows the list below). The results indicate:
- Superior Performance: The new models outperform their predecessors (128K versions) in most long-context tasks, particularly for sequences exceeding 64K tokens.
- Competitive Edge: The Qwen2.5-14B-Instruct-1M model not only surpasses the earlier Qwen2.5-Turbo but also consistently outperforms GPT-4o-mini across multiple datasets, positioning it as a robust alternative for long-context applications.
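To make the Passkey Retrieval task concrete, here is a minimal sketch of how such a test can be constructed. The filler sentences, passkey format, and scale here are illustrative assumptions, not the exact benchmark setup the Qwen team used:

```python
import random

def build_passkey_prompt(n_filler: int = 10_000) -> tuple[str, str]:
    """Build a long 'haystack' prompt with a hidden passkey.

    n_filler controls how many filler repetitions surround the passkey;
    scale it up to approach the million-token regime.
    """
    passkey = str(random.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler
    insert_at = random.randint(0, len(filler))
    document = (
        filler[:insert_at]
        + f" The passkey is {passkey}. Remember it. "
        + filler[insert_at:]
    )
    question = "What is the passkey mentioned in the document above?"
    return document + "\n\n" + question, passkey

prompt, expected = build_passkey_prompt()
# Send `prompt` to the model and check whether `expected` appears in the reply.
```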
Short-Context Tasks
Beyond their long-context capabilities, these models remain strong on short-context tasks:
- Both Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M maintain performance levels comparable to their 128K counterparts.
- When compared with GPT-4o-mini, both models achieve similar results on short text tasks while supporting a context length that is eight times longer.
Qwen 2.5 1M Long-Context Training
Training these models required substantial computational resources and a progressive approach:
- Pretraining: The process began with an intermediate checkpoint of pre-trained Qwen2.5 with a 4K token context length.
- Incremental Context Length: During pretraining, the context length was gradually increased from 4K to 256K tokens using the Adjusted Base Frequency (ABF) technique, which raises the RoPE base frequency (see the sketch after this list).
- Supervised Fine-tuning: This involved two stages:
- Stage 1 focused on short instructions (up to 32K tokens).
- Stage 2 mixed short and long instructions to enhance performance across both contexts.
- Reinforcement Learning: This phase involved training on short texts up to 8K tokens to improve alignment with human preferences.
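The "Adjusted Base Frequency" step refers to raising the base of the rotary position embedding (RoPE) so that positions far beyond the original window remain distinguishable. Below is a minimal sketch of the idea; the base values are illustrative, and the exact schedule used in Qwen's training is an assumption here:

```python
import torch

def rope_inverse_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base.

    Raising `base` slows the rotation of each dimension, which stretches
    the usable position range; that is the core idea behind ABF.
    """
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

short_ctx = rope_inverse_frequencies(head_dim=128, base=10_000.0)     # original base
long_ctx = rope_inverse_frequencies(head_dim=128, base=1_000_000.0)  # adjusted base

# Lower frequencies => slower rotation => longer effective context.
print(short_ctx[-1].item(), long_ctx[-1].item())
```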
Length Extrapolation
To extend the model’s capabilities up to one million tokens, length extrapolation techniques were employed:
- Dual Chunk Attention (DCA) was introduced to remap relative positions during attention calculations, thus avoiding issues related to unseen large relative positional distances.
This approach allowed even models trained on shorter sequences (e.g., up to 32K tokens) to achieve high accuracy in passkey retrieval tasks when dealing with one-million-token contexts; a simplified illustration follows.
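The toy sketch below illustrates the remapping idea behind DCA. The three-branch structure (intra-chunk, successive-chunk, inter-chunk) follows the published concept, but the constants and index arithmetic are deliberate simplifications, not the paper's implementation:

```python
def dca_relative_distance(q: int, k: int, chunk: int = 32_768, window: int = 1_024) -> int:
    """Remap the relative distance between query position q and key position k
    (k <= q, causal attention) so it never exceeds roughly chunk + window.

    - intra-chunk: exact distance (what the model saw in training)
    - successive-chunk: nearby tokens keep their exact small distance
    - inter-chunk: far distances are capped into the trained range
    """
    if q // chunk == k // chunk:               # same chunk: exact distance
        return q - k
    if q - k < window:                         # neighbouring tokens: keep locality
        return q - k
    return (chunk + window - 1) - (k % chunk)  # far keys: capped distance

# Distances stay within the trained range even for million-token separations:
print(dca_relative_distance(1_000_000, 5))        # far apart, still in-range
print(dca_relative_distance(1_000_000, 999_999))  # adjacent, kept exact
```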
Sparse Attention Mechanism
The Qwen team incorporated a sparse attention mechanism inspired by MInference to improve inference speed. This approach processes sequences in smaller chunks instead of handling the entire sequence at once, which dramatically lowers memory usage. Further optimization included dynamic chunked pipeline parallelism and improved kernel efficiency. These enhancements led to a substantial increase in prefill speed, ranging from approximately 3.2x to 6.7x faster across various model sizes and GPU configurations.
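The chunked-processing idea can be sketched with a Hugging Face-style prefill loop: the prompt is fed through the model in fixed-size pieces while the KV cache carries context forward, so peak activation memory scales with the chunk rather than the full sequence. This is a schematic sketch of the concept, not vLLM's optimized kernel path:

```python
import torch

def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 32_768):
    """Prefill a long prompt chunk by chunk, carrying the KV cache forward.

    `model` is assumed to be a Hugging Face-style causal LM accepting
    `past_key_values` and `use_cache`; the real framework layers sparse
    attention and pipeline parallelism on top of this basic loop.
    """
    past = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start : start + chunk_size]
        with torch.no_grad():
            out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values  # cache grows; activations stay chunk-sized
    return past  # ready to decode the first new token
```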
Deployment Instructions
For developers interested in deploying the Qwen 2.5-1M models locally, specific system requirements must be met:
- System Preparation: GPUs with Ampere or Hopper architecture are recommended for optimal performance.
- Installation Steps:
- Clone the vLLM repository.
- Install dependencies via pip.
- Launching API Service: Users can start an OpenAI-compatible API service using provided command-line instructions tailored for their hardware setups.
- Interaction Methods: You can interact with the models in various ways, including curl commands or Python scripts for more advanced use cases (see the example after this list).
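Once the API service is running, a minimal Python interaction looks like the following. The port and model name are illustrative; match them to whatever your launch command serves:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-1M",  # must match the served model name
    messages=[
        {"role": "user", "content": "Summarize the key clauses in the contract below:\n..."},
    ],
)
print(response.choices[0].message.content)
```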
Try these prompts to compare with Claude and GPT models
You can use Qwen 2.5 1M on Hugging Face; to test Claude and GPT models, you can go here.
- Python: Write a Python function to calculate the Fibonacci sequence up to the nth number.
- JavaScript: Create a JavaScript function that reverses a string.
- Java: Write a Java program to sort an array of integers using the bubble sort algorithm.
- C++: Implement a C++ class for a simple bank account with deposit and withdrawal methods.
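For reference, a typical correct answer to the first prompt looks something like this; any of the models above should produce an equivalent function:

```python
def fibonacci(n: int) -> list[int]:
    """Return the Fibonacci sequence up to the nth number (n terms)."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```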
Will you use Qwen?
Qwen 2.5-1M models boast a 1 million token context window, significantly outperforming previous Qwen iterations and competing strongly with GPT-4o-mini in long-context tasks (like retrieving information from massive documents). They achieve this through innovative training (gradually increasing context length, dual chunk attention), and a sparse attention mechanism for faster inference. Performance on short-context tasks remains comparable to their 128K counterparts. Deployment requires specific hardware (Ampere/Hopper GPUs) and involves using the vLLM framework. Overall, Qwen 2.5-1M is a powerful new option for applications needing to process extremely long sequences of text.
To try other GPT and Claude models, head over to Bind AI Copilot.