Thousands of live login credentials, including keys for Amazon Web Services (AWS), Mailchimp, and WalkScore, have been discovered in a web dataset used to train AI models, including those from developers such as DeepSeek.
Cybersecurity researchers at Truffle Security analyzed 400 terabytes of data from 2.67 billion web pages archived in 2024 by Common Crawl, a nonprofit that provides open-source web data. Their investigation revealed nearly 12,000 valid secrets, including API keys and passwords, embedded in the archived pages.
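Truffle Security is best known for its open-source TruffleHog scanner, and tools in this class find hardcoded credentials largely through pattern matching followed by live verification against the provider's API. The sketch below illustrates only the pattern-matching step, using two publicly documented key formats; the detector names and the `scan_page` helper are illustrative, not the company's actual code.

```python
import re

# Simplified detectors for two publicly documented credential formats.
# Real scanners pair pattern matching with live verification against the
# provider's API; this sketch covers only the pattern-matching step.
DETECTORS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase
    # alphanumeric characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Mailchimp API keys are 32 hex characters plus a "-usN" datacenter suffix.
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
}

def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (detector_name, matched_string) pairs found in raw page text."""
    hits = []
    for name, pattern in DETECTORS.items():
        for match in pattern.findall(html):
            hits.append((name, match))
    return hits

# A key hardcoded in front-end JavaScript, the kind of exposure the
# researchers reported (this key is AWS's documented fake example).
page = '<script>AWS.config.update({accessKeyId: "AKIAIOSFODNN7EXAMPLE"});</script>'
print(scan_page(page))  # [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```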
The discovery underscores a security risk in how large language models are trained: anything present in the crawl, credentials included, can end up in a model's training data. The researchers also found that exposed credentials were often repeated across many pages, with one WalkScore API key appearing over 57,000 times across nearly 1,900 subdomains.
Such heavy repetition makes it more likely that a model memorizes a credential during training, raising the risk that attackers could extract working keys from AI-generated output and use them to access private accounts or launch further attacks.
The findings add to ongoing concerns about data privacy and AI security. A recent Deloitte report found that nearly 75% of professionals rank data privacy among their top concerns related to generative AI, and the unauthorized inclusion of sensitive login details in training datasets deepens questions about how AI models are trained and whether they inadvertently expose private information.
Truffle Security is now working with affected vendors to revoke or rotate the exposed keys and prevent further leaks. The company has also recommended that AI developers implement stronger safeguards, such as Constitutional AI, to prevent models from reproducing sensitive information.
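Model-side guardrails aside, a complementary safeguard (not specific to Truffle Security's recommendations) is to scrub credentials from the corpus before training ever begins. A minimal sketch, reusing the hypothetical `DETECTORS` table from the earlier example:

```python
def redact_secrets(text: str) -> str:
    """Replace anything matching a known credential pattern with a typed
    placeholder, so the raw secret never enters the training corpus."""
    for name, pattern in DETECTORS.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

# Applied to a crawled page before it is tokenized for training:
clean = redact_secrets('api_key = "AKIAIOSFODNN7EXAMPLE"')
print(clean)  # api_key = "[REDACTED_AWS_ACCESS_KEY_ID]"
```

Redacting at preprocessing time is cheaper than trying to suppress memorized secrets after training, since a model cannot regurgitate a key it never saw.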
While some companies, such as Anthropic, advocate for AI safety frameworks, others, including OpenAI, prioritize rapid development. With governments and regulatory bodies paying closer attention to AI security, the debate over how to balance innovation with responsible AI development continues to intensify.