Chinese AI startup DeepSeek, known for training a ChatGPT competitor on 2 trillion English and Chinese tokens, has released DeepSeek Coder V2, an open-source mixture-of-experts (MoE) code language model.
Building on DeepSeek-V2, an MoE model launched last month, DeepSeek Coder V2 is tailored for coding and math tasks. It supports more than 300 programming languages and outperforms leading closed-source models, including GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, on coding and math benchmarks. According to DeepSeek, it is the first open model to reach this level of performance, and it significantly outpaces open models like Llama 3 70B.
Not only does DeepSeek Coder V2 perform exceptionally well in coding and math, but it also maintains strong general reasoning and language capabilities.
DeepSeek, founded last year with a mission to explore artificial general intelligence (AGI), has quickly made a name for itself in the AI industry alongside competitors like Qwen, 01.AI, and Baidu. Within a year, the company has open-sourced several models, including the DeepSeek Coder series.
The original DeepSeek Coder topped out at 33 billion parameters and performed well on benchmarks, offering features like project-level code completion, but it supported only 86 programming languages and a 16K context window. The new V2 model expands language support to 338 and stretches the context window to 128K tokens, enabling it to handle more complex and expansive coding tasks.
On benchmarks that assess the code generation, editing, and problem-solving capabilities of language models, DeepSeek Coder V2 scored 76.2 on MBPP+, 90.2 on HumanEval, and 73.7 on Aider, ahead of most competitors, including GPT-4 Turbo, Claude 3 Opus, Gemini 1.5 Pro, Codestral, and Llama 3 70B. It posted similarly strong results on mathematical benchmarks such as MATH and GSM8K.
On the Arena-Hard-Auto leaderboard, DeepSeek Coder V2 outperformed models such as Yi-Large, Claude 3 Opus, GLM-4, and Qwen2-72B.
The only model that managed to edge out DeepSeek's offering across multiple benchmarks was GPT-4o, which posted marginally higher scores on HumanEval, LiveCodeBench, MATH, and GSM8K.
DeepSeek achieved these gains by using DeepSeek-V2, built on its mixture-of-experts framework, as the foundation. The company pre-trained the base model on an additional 6 trillion tokens, largely comprising code and math-related data sourced from GitHub and CommonCrawl.
This training allows the model, which is available in 16B and 236B parameter versions, to activate only 2.4B and 21B "expert" parameters, respectively, for any given task, letting it scale to different computing and deployment needs.
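To make the sparse-activation idea concrete, below is a minimal, illustrative sketch of top-k expert routing in PyTorch. This is not DeepSeek's implementation, and every name and dimension in it is invented for illustration; it only shows the general MoE mechanic of a router selecting a few experts per token, so that most parameters sit idle on any single forward pass.

```python
import torch
import torch.nn as nn

# Illustrative only -- a toy MoE layer, not DeepSeek's actual architecture.
class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                         # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = weights.softmax(dim=-1)               # normalize the kept scores
        out = torch.zeros_like(x)
        # Run each token only through its selected experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

With top_k=2 of 8 experts, only a quarter of the expert parameters participate in each token's computation, which is the same principle that lets the 236B DeepSeek Coder V2 activate just 21B parameters at a time.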
Aside from coding and math, DeepSeek Coder V2 also holds its own in general reasoning and language tasks. On the MMLU benchmark, which evaluates language understanding across a range of tasks, it scored 79.2, better than other code-specific models and close to Llama 3 70B. GPT-4o and Claude 3 Opus lead the category with scores of 88.7 and 88.6, respectively, with GPT-4 Turbo not far behind.
Open coding-focused models like DeepSeek Coder V2 are steadily closing the gap with top-tier closed-source models, while proving versatile well beyond their primary coding function.
DeepSeek Coder V2 is available under an MIT license that permits both research and unrestricted commercial use. Users can download the 16B and 236B versions, in both instruct and base variants, from Hugging Face. The models can also be accessed via API through DeepSeek's platform on a pay-as-you-go basis, and those who want to test them first can interact with DeepSeek Coder V2 via a web-based chatbot.
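For readers who want to try the weights locally, a minimal sketch using the Hugging Face transformers library might look like the following. The repo ID is an assumption based on DeepSeek's naming on Hugging Face, where the 16B variant is published as a "Lite" model; check the deepseek-ai organization page for the exact identifiers.

```python
# Sketch only: assumes the repo ID below matches DeepSeek's Hugging Face naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # assumed 16B instruct repo
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the model ships custom code on the Hub
    torch_dtype="auto",
    device_map="auto",       # requires the accelerate package
)

messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

Note that even with MoE sparsity, all of a checkpoint's parameters must fit in memory, so the 16B version is the realistic option for a single machine, while the 236B model calls for a multi-GPU setup.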