Lasso Security recently discovered a serious security risk that could have had major consequences for Hugging Face, a popular platform for AI development. By scanning GitHub and Hugging Face repositories, Lasso’s researchers found 1,681 exposed API tokens that were vulnerable to compromise.
These tokens gave access to 723 different organizations, including giants like Meta, Microsoft, Google, and VMware. Alarmingly, 655 of the tokens had write permissions, and a smaller group of 77 had full control over several prominent repositories. The exposure could have put millions of users at risk, including the possibility of attackers manipulating widely used models like Bloom, Llama 2, and Pythia.
Hugging Face has become a critical resource for over 50,000 organizations working on large language models (LLMs) and generative AI projects. It serves as the main hub for LLM developers, hosting more than 500,000 AI models and 250,000 datasets. The popularity of Hugging Face is partly due to its open-source Transformers library, which fosters collaboration and accelerates model development.
However, this popularity also makes Hugging Face a prime target for cyberattacks focused on exploiting vulnerabilities in the AI supply chain. Potential attacks could include poisoning training data or stealing models, posing significant challenges for detection and mitigation.
Lasso Security decided to scrutinize Hugging Face’s security practices, especially how the platform and its users handle API tokens, to identify potential risks. Their research focused on three main areas of concern: supply chain vulnerabilities, training data poisoning, and model theft.
Supply chain vulnerabilities can expose LLM applications to attack when they rely on third-party datasets or plugins. Training data poisoning involves compromising the data used to train models, which can introduce biases or security flaws. Model theft involves unauthorized access to and exfiltration of proprietary models, which can be incredibly costly to develop.
Lasso’s research revealed that they could have “stolen” over 10,000 private models linked to more than 2,500 datasets. This led them to suggest that the term “Model Theft” should be updated to “AI Resource Theft (Models & Datasets)” to reflect the broader scope of potential threats.
The researchers stressed the importance of treating API tokens with the same level of security as user identities. They recommended that platforms like Hugging Face regularly scan for exposed tokens and revoke them or alert users. A similar approach is already in place at GitHub, which automatically revokes tokens if they are exposed in public repositories.
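To make the scanning recommendation concrete, here is a minimal sketch of how a scan for leaked tokens might work, assuming Hugging Face user access tokens that start with the common "hf_" prefix and the public whoami-v2 endpoint for validity checks; the regex, file path, and overall workflow are illustrative assumptions, not Lasso’s actual tooling.

```python
import re
import requests

# Assumption for illustration: Hugging Face user access tokens commonly start with "hf_".
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{30,}")


def find_candidate_tokens(text: str) -> set[str]:
    """Return strings in `text` that look like Hugging Face API tokens."""
    return set(TOKEN_PATTERN.findall(text))


def is_token_live(token: str) -> bool:
    """Check whether a candidate token is still valid via the whoami-v2 endpoint."""
    resp = requests.get(
        "https://huggingface.co/api/whoami-v2",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    return resp.status_code == 200  # 200 means the token is valid and unrevoked


if __name__ == "__main__":
    # Hypothetical example: scan a file pulled from a public repository.
    with open("config_from_public_repo.py") as f:
        for candidate in find_candidate_tokens(f.read()):
            status = "LIVE - revoke immediately" if is_token_live(candidate) else "inactive"
            print(f"{candidate[:8]}... {status}")
```

A platform-side scanner would run this kind of check continuously across newly pushed repositories, revoking any token that still validates and alerting its owner.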
Lasso’s findings underscore the need for a zero-trust approach to managing API tokens. This means each token should be unique, authenticated, and granted the least privilege necessary. Continuous validation and lifecycle management of each token are crucial for maintaining security.
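As a sketch of what least privilege looks like in practice, the snippet below downloads a model with a read-only token pulled from an environment variable, keeping write-scoped tokens out of inference code entirely; the variable name HF_READ_ONLY_TOKEN and the example model repository are assumptions for illustration.

```python
import os

from huggingface_hub import snapshot_download

# Least-privilege setup (assumed convention): a fine-grained, read-only token lives in
# an environment variable for download-only workloads, while write-scoped tokens are
# reserved for the CI jobs that actually publish models.
read_only_token = os.environ["HF_READ_ONLY_TOKEN"]  # hypothetical variable name

# Downloading a model only needs read access; if this token leaks, it cannot be used
# to push a poisoned revision back to the repository.
local_dir = snapshot_download(
    repo_id="bigscience/bloom-560m",  # example public model
    token=read_only_token,
)
print("Model files cached at:", local_dir)
```

Separating read and write tokens this way limits the blast radius of a leak: an exposed read token risks data exfiltration, but not tampering with the models and datasets other organizations depend on.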
In summary, Hugging Face narrowly avoided a significant cybersecurity incident thanks to Lasso Security’s proactive research and disclosure. The findings highlight the importance of stringent security practices, particularly in managing API tokens, to prevent future breaches and protect sensitive AI resources.