Introducing DeepSeek Chat: China’s Newest Contender in the AI Arena with a 67B Model

As ChatGPT celebrates its first anniversary this week, the Chinese startup DeepSeek AI is stepping up to challenge its dominance with a conversational assistant of its own, DeepSeek Chat.

This new AI assistant, launched as part of an alpha test, is powered by the DeepSeek LLMs, which come in 7B- and 67B-parameter versions. Trained on a dataset of 2 trillion tokens spanning English and Chinese, the models have reportedly shown strong performance across evaluations including coding and mathematics, sometimes even surpassing Meta’s well-known Llama 2-70B.

This news introduces another Chinese competitor in the AI field, following recent launches from Qwen, 01.AI, and Baidu. DeepSeek has open-sourced the models in both base and instruction-tuned versions to encourage further research in academic and commercial communities. The company, founded a few months ago with the stated goal of exploring AGI with a curiosity-driven approach, also permits commercial use under specific conditions.

What exactly are DeepSeek Chat and its LLMs?

DeepSeek Chat can be accessed through a web interface similar to ChatGPT, where users can sign in and interact with the model for various tasks. However, only the 67B version is available through this interface.

According to DeepSeek, both models use the same auto-regressive transformer decoder architecture as Llama, but they differ in their inference approach. The smaller model employs multi-head attention (MHA), where every attention head has its own key and value projections, while the larger model uses grouped-query attention (GQA), which shares key-value heads across groups of query heads to reduce the cost of inference.
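
For readers less familiar with the distinction, the sketch below shows, in PyTorch-style code, how grouped-query attention shares key-value heads across groups of query heads. The head counts and dimensions are illustrative assumptions, not DeepSeek's actual configuration, and causal masking and rotary position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch; hyperparameters are illustrative, not DeepSeek's."""

    def __init__(self, d_model=4096, n_query_heads=32, n_kv_heads=8):
        super().__init__()
        assert n_query_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_query_heads, n_kv_heads
        self.head_dim = d_model // n_query_heads
        self.wq = nn.Linear(d_model, n_query_heads * self.head_dim, bias=False)
        # MHA gives every query head its own key/value projection; GQA lets
        # several query heads share one key/value head, shrinking the KV cache
        # that must be kept in memory while generating tokens.
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Repeat each key/value head so it serves its whole group of query heads.
        group = self.n_q // self.n_kv
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.wo(out)
```

Setting n_kv_heads equal to n_query_heads recovers standard MHA; using fewer key-value heads shrinks the inference-time KV cache roughly in proportion, which is the main appeal of GQA at the 67B scale.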

For training, the 7B model used a batch size of 2304 and a learning rate of 4.2e-4, whereas the 67B model was trained with a batch size of 4608 and a learning rate of 3.2e-4. DeepSeek follows a multi-step learning rate schedule: 2,000 warmup steps, then a reduction to 31.6% of the maximum learning rate after 1.6 trillion tokens and to 10% of the maximum after 1.8 trillion tokens.
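
As a rough illustration of how that schedule plays out, here is a small sketch that maps a training step to a learning rate using the 67B numbers quoted above. The 4,096-token sequence length used to convert steps into tokens is an assumption for illustration only, and DeepSeek's actual implementation may differ.

```python
WARMUP_STEPS = 2000
MAX_LR = 3.2e-4                       # 67B maximum learning rate
TOKENS_PER_STEP = 4608 * 4096         # batch size x assumed sequence length

def lr_at_step(step: int) -> float:
    """Multi-step schedule sketch: warmup, then two step-downs by token count."""
    if step < WARMUP_STEPS:
        # Linear warmup to the maximum learning rate over the first 2,000 steps.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    tokens = step * TOKENS_PER_STEP
    if tokens < 1.6e12:               # first stage: train at the full rate
        return MAX_LR
    if tokens < 1.8e12:               # second stage: 31.6% of the maximum
        return MAX_LR * 0.316
    return MAX_LR * 0.10              # final stage: 10% of the maximum
```

The two step-downs at 1.6 trillion and 1.8 trillion tokens correspond to 80% and 90% of the 2-trillion-token training corpus.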

In tests, the DeepSeek LLM 67B Base model demonstrated excellent general capabilities, often outperforming Llama 2 70B Base in areas such as reasoning, coding, mathematics, and Chinese comprehension. The only benchmark where Llama came out slightly ahead was 5-shot TriviaQA (79.5 vs. 78.9).

The chat version of the model, further refined with additional instruction data, also performed exceptionally well on previously unseen tests. For example, it scored 73.78 on HumanEval pass@1 for coding and 84.1 on GSM8K 0-shot for mathematics, ranking just behind GPT-4 and Anthropic’s Claude 2.
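
For context, HumanEval's pass@k metric estimates the probability that at least one of k sampled completions passes a problem's unit tests; pass@1 reduces to the fraction of problems solved on a single attempt. Below is a minimal sketch of the standard unbiased estimator that accompanies the benchmark, not DeepSeek's own evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples were generated, c of them passed the tests."""
    if n - c < k:
        # Every size-k subset is guaranteed to contain a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, the plain fraction of passing samples.
```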

Despite these impressive benchmarks, the DeepSeek model appears to have some censorship issues. A user on X highlighted that responses related to China were automatically redacted, with the content marked as “withdrawn” for security reasons. It is unclear if this censorship is also present in the base model.

The episode points to a broader challenge Chinese LLMs may face on the open internet: in the example shared, an innocuous question about modern China was enough for the response to be automatically censored.

This launch represents another significant advancement from China in the AI domain, broadening the range of model sizes available to different users. Recent arrivals include Baidu’s Ernie 4.0, 01.AI’s Yi 34B, and Qwen’s models ranging from 1.8B to 72B parameters. Interestingly, some smaller models, such as Yi 34B, have outperformed much larger ones like Llama-2-70B and Falcon-180B, potentially letting businesses save computational resources without sacrificing effectiveness.

Additionally, Microsoft recently unveiled its Orca 2 models, which performed better than models five to ten times their size, including Llama-2-Chat-70B.