Zyphra Introduces Zyda: A 1.3T-Token Language Modeling Dataset That Surpasses Pile, C4, and arXiv

Zyphra Technologies has introduced Zyda, a large dataset for training language models. Zyda consists of 1.3 trillion tokens and is built by combining, filtering, and deduplicating several existing high-quality open datasets: RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv. According to the company, Zyda outperforms each of the individual datasets it was built from. An early version of Zyda already powers Zyphra’s Zamba model, and the dataset is being made available for download on Hugging Face.

Zyda was conceived as a pretraining dataset for Zyphra’s Zamba models. Yury Tokpanov, a machine learning research engineer and product lead at Zyphra, explained that the dataset gives developers a massive, high-quality training set for language models and spares them from having to recreate such a dataset from scratch.

Zyphra’s goal was to improve on existing datasets by merging them and meticulously cleaning the data so that each document is unique. The team applied syntactic filtering to remove low-quality documents, followed by aggressive deduplication both within and across the datasets. The cross-dataset deduplication was crucial because many of the datasets draw on common sources and therefore contain the same content.
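To make the idea of filtering followed by within- and cross-dataset deduplication concrete, here is a minimal sketch. It is not Zyphra’s pipeline: the article does not describe the actual filters or matching method, so the word-count and alphabetic-ratio thresholds, the function names, and the use of exact hashing on normalized text are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_filter(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical syntactic filter: drop very short or mostly non-alphabetic documents."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio


def filter_and_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter each dataset, then drop exact duplicates within and across datasets.

    A single shared `seen` set is what makes this a cross-dataset deduplication:
    a document kept from an earlier dataset is removed from every later one.
    """
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept[name] = []
        for doc in docs:
            if not passes_filter(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of a document we already kept
            seen.add(digest)
            kept[name].append(doc)
    return kept


# Toy input (made-up documents, not real dataset contents).
toy = {
    "dataset_a": [
        "The quick brown fox jumps over the lazy dog.",
        "The quick  brown fox jumps over the lazy dog.",  # duplicate within dataset_a
        "12345 67890 !!!! #### $$$$ %%%%",                # fails the alphabetic filter
    ],
    "dataset_b": [
        "the quick brown fox jumps over the lazy dog.",   # duplicate across datasets
        "A completely different document about language models.",
    ],
}
print(filter_and_dedup(toy))
```

Exact hashing only catches documents that are identical after normalization. Near-duplicate detection, which large-scale pipelines commonly need, would require a fuzzy matching technique in place of the SHA-256 step.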

Of the seven datasets that make up Zyda, RefinedWeb accounts for the largest share at 43.6%, followed by SlimPajama at 18.7% and StarCoder at 17.8%. The remaining datasets contribute smaller percentages.

Zyphra discarded around 40% of the initial dataset, reducing it from approximately 2 trillion tokens to 1.3 trillion.
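As a quick sanity check on these figures, the short calculation below works out the reduction implied by the rounded totals and converts the quoted shares into approximate token counts. It assumes the percentages are measured in tokens of the final 1.3 trillion, which the article does not state. With the rounded totals, the reduction comes to about 35%; the gap from "around 40%" presumably reflects rounding in the quoted figures.

```python
# Figures quoted in the article (both are rounded).
initial_tokens = 2.0e12  # "approximately 2 trillion"
final_tokens = 1.3e12    # 1.3 trillion

removed_fraction = 1 - final_tokens / initial_tokens
print(f"Removed: {removed_fraction:.0%}")

# Shares quoted in the article; assumed here to be fractions of the final token count.
shares = {"RefinedWeb": 0.436, "SlimPajama": 0.187, "StarCoder": 0.178}
shares["other four datasets"] = 1 - sum(shares.values())

for name, share in shares.items():
    approx_billions = share * final_tokens / 1e9
    print(f"{name}: {share:.1%}, roughly {approx_billions:.0f}B tokens")
```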

Because Zyda is an open-source resource, developers can use it to train more capable language models, which could mean better word prediction, text generation, language translation, and more. If Zyda delivers as promised, it could streamline development by providing a single, comprehensive dataset, cutting production time and costs.

The name “Zyda” is a combination of “Zyphra” and “Dataset.” You can find and download Zyda on Zyphra’s Hugging Face page.