DeepSeek, a Chinese AI company, recently released an updated version of its reasoning model, R1-0528, which performs strongly on math and coding benchmarks. However, the company has not disclosed its training data sources, and some AI researchers suspect DeepSeek may have trained the model on outputs from Google's Gemini models.
In a social media post, Sam Paech, an Australian developer, pointed to evidence that the language and expressions of DeepSeek's model closely resemble those of Google's Gemini 2.5 Pro. Another developer, the creator of SpeechMap, a project that analyzes the "thoughts" (reasoning traces) AI models produce as they work toward an answer, observed that the model's traces also read like Gemini's.
While this isn't conclusive proof, it raises questions about the training practices behind DeepSeek's model. The company has faced similar accusations before: in December, developers noticed that DeepSeek's V3 model often identified itself as ChatGPT, suggesting it may have been trained on ChatGPT conversation logs.
OpenAI told the Financial Times earlier this year that it had found evidence linking DeepSeek to distillation, a technique in which a smaller model is trained on the outputs of a larger, more capable one. According to Bloomberg, in late 2024 Microsoft detected large volumes of data being exfiltrated through OpenAI developer accounts suspected of being linked to DeepSeek.
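For readers unfamiliar with the technique, here is a minimal sketch of classic knowledge distillation, assuming PyTorch; note that training on a rival's API outputs is a cruder variant, since an API exposes only sampled text, not the teacher's full probability distribution.

```python
# Minimal sketch of knowledge distillation: a small "student" model is
# trained to match the softened output distribution of a larger "teacher".
# Assumes PyTorch; names and temperature value are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```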
Although distillation is a common practice, OpenAI's terms of service prohibit using its outputs to build competing AI. Experts note that many models now converge on similar phrases and behaviors because the open web, their shared training ground, is increasingly contaminated with AI-generated content, from clickbait farms to bots flooding platforms like Reddit and X. Nathan Lambert of AI2 said that a company in DeepSeek's position, short on GPUs but flush with cash, would plausibly generate large amounts of synthetic data from the best available API model, such as Gemini, since doing so effectively buys extra compute.
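To make Lambert's point concrete, the following is an illustrative sketch, not DeepSeek's actual pipeline, of how synthetic training pairs are typically harvested from a commercial API. The endpoint, model name, and response schema are assumptions modeled on common chat-completions conventions.

```python
# Illustrative sketch of API-based synthetic data generation: prompts are
# sent to a strong commercial model and the responses are saved as
# supervised fine-tuning pairs. Endpoint, model name, and response schema
# below are hypothetical placeholders, not any vendor's real values.
import json
import requests

API_URL = "https://example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."  # placeholder credential

def generate_pair(prompt: str, model: str = "teacher-model") -> dict:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    return {"prompt": prompt, "response": answer}

# Append each pair to a JSONL file, a common format for fine-tuning data.
with open("synthetic_sft.jsonl", "a", encoding="utf-8") as f:
    pair = generate_pair("Prove that the square root of 2 is irrational.")
    f.write(json.dumps(pair) + "\n")
```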
In response, AI firms are tightening security against data exfiltration and unauthorized training. OpenAI now requires organizations to complete ID verification before accessing its most advanced models, a process unavailable in China, while Google has begun summarizing the raw reasoning traces available through its developer platform, making them harder for rivals to train on. The episode underscores growing tensions over transparency and competition in AI development.