Large AI training datasets, often called the “backbone of large language models,” have come under scrutiny recently. EleutherAI, the group behind the Pile, an 825 GB open-source text corpus and one of the largest such datasets, faced significant controversy in 2023. The controversy arose from legal and ethical concerns about the datasets used to train popular language models such as OpenAI’s GPT-4 and Meta’s Llama.
EleutherAI started as a grassroots nonprofit research collective on Discord in 2020, aimed at understanding OpenAI’s GPT-3. The group was named in a lawsuit last year, alongside other AI entities, by former Arkansas Governor Mike Huckabee and other authors. They alleged that their books were used without permission in the Books3 dataset, part of the Pile project. Books3, initially uploaded by Shawn Presser in 2020, was removed from the internet in August 2023 following a legal notice.
Despite the controversy, EleutherAI is developing an updated version of the Pile dataset. This effort involves collaboration with the University of Toronto, the Allen Institute for AI, and independent researchers. Stella Biderman, EleutherAI’s executive director, and Aviya Skowron, head of policy and ethics, revealed that the new dataset is nearing completion.
The upcoming Pile v2 dataset is anticipated to be larger and significantly improved. Biderman mentioned that the new dataset would include novel data and undergo better preprocessing. This improvement comes from EleutherAI’s extensive experience in training numerous language models since the original Pile’s release in December 2020. The new dataset will feature more recent and diverse data, including a broader range of books and non-academic non-fiction.
The original Pile comprised 22 sub-datasets, including Books3, PubMed Central, arXiv, Stack Exchange, Wikipedia, YouTube subtitles, and Enron emails. Biderman noted that the Pile remains the most well-documented LLM training dataset globally. The aim was to create a vast dataset with billions of text passages, similar to what OpenAI used for GPT-3.
When released in 2020, the Pile was groundbreaking due to its size and diversity, far surpassing the C4 dataset used by Google. EleutherAI focused on selecting specific categories and domains to provide meaningful information for training language models.
EleutherAI maintains that using copyrighted data for model training constitutes fair use. However, they acknowledge the need to address copyright and data licensing issues, which will be reflected in the new Pile dataset. The updated dataset will include public domain data, text licensed under Creative Commons, open-source code, and other texts with explicit redistribution permissions.
The debate over AI training datasets intensified with the release of ChatGPT in November 2022, leading to numerous lawsuits from artists, writers, and publishers. Concerns about deepfake content and the presence of harmful images in datasets like LAION-5B have added to the controversy.
Biderman and Skowron argue that the complexity of AI training data issues is often oversimplified. For instance, removing harmful content from datasets like LAION-5B is challenging due to legal and resource constraints. They also empathize with creatives upset about their work being used in AI training but point out that many uploaded their work under permissive licenses without foreseeing future AI applications.
Despite these challenges, EleutherAI believes that open datasets like the Pile are safer for AI model training, arguing that transparency and thorough documentation are crucial for ethical AI development.
As EleutherAI continues to work on the Pile v2, Biderman expressed optimism about its impact. The group aims to train and release new models within the year, contributing to ongoing advancements in AI.