A recent report by the Stanford Internet Observatory has found that LAION-5B, a large open-source AI dataset used to train popular AI image generators like Stable Diffusion 1.5 and Google’s Imagen, contains at least 1,008 instances of child sexual abuse material (CSAM). The report suggests that there could be thousands more such instances in the dataset. The LAION-5B dataset, released in March 2022, includes over 5 billion images and captions, and its contents could enable AI products trained on it to produce new, realistic child abuse content.
In response to these findings, LAION announced on Tuesday that it will temporarily remove its datasets to ensure their safety before republishing them.
This isn’t the first time LAION’s datasets have faced criticism. Cognitive scientist Abeba Birhane highlighted issues with LAION-400M, an earlier dataset, in October 2021. Her paper pointed out the presence of problematic content, including explicit images and offensive stereotypes.
In September 2022, an artist discovered that private medical record photos taken by her doctor in 2013 were referenced in the LAION-5B dataset. Additionally, a class-action lawsuit filed in January 2023 by artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt mentioned LAION. The lawsuit claimed that Stability AI had used billions of copyrighted images without permission to train its AI models.
Ortiz, a renowned artist, discussed LAION-5B at a virtual FTC panel in October, noting that the dataset includes vast amounts of intellectual property, private medical records, non-consensual pornography, images of children, and even social media pictures.
AI pioneer Andrew Ng has expressed concerns about restricting access to datasets like LAION. He argues that the progress in machine learning relies heavily on large datasets sourced from the internet. Ng believes that limiting access to these resources could hinder advancements in various fields such as art, education, drug development, and manufacturing.
LAION was founded by Christoph Schuhmann, a high school teacher in Hamburg, Germany. Inspired by OpenAI’s DALL-E, Schuhmann aimed to create an open-source dataset to aid in training text-to-image diffusion models. From an initial collection of 3 million image-text pairs, LAION's datasets have grown to over 5 billion pairs.
The nonprofit organization has actively participated in debates on open-source AI. After a public call for an AI research pause in March 2023, LAION advocated for accelerating research and establishing a joint international cluster for large-scale open-source AI models.
LAION datasets have partly been compiled using visual data from online shopping sites like Shopify, eBay, and Amazon. A paper from the Allen Institute for AI revealed that a significant portion of LAION-2B-en, a subset of LAION-5B, consisted of data from Shopify.