DeepSeek’s Latest Models Might Have Trained with Google’s Data
Researchers claim to have spotted similarities between DeepSeek’s latest AI and Google’s Gemini.
DeepSeek, the Chinese AI company behind the fast-rising R1 model series, is back in the spotlight, and not just for performance gains. Last week, it released an updated version of its reasoning model, R1-0528, which quickly impressed the AI community with strong results on math and coding benchmarks.
But the applause had barely settled before controversy rolled in: several researchers now believe the model may have been partially trained on outputs from Google’s Gemini.
The suspicion came to light after Melbourne-based developer Sam Paech compared the language preferences of R1-0528 with those of Google’s Gemini 2.5 Pro. He pointed out that the DeepSeek model tends to favor phrases and expressions nearly identical to Gemini’s.
If you're wondering why new deepseek r1 sounds a bit different, I think they probably switched from training on synthetic openai to synthetic gemini outputs. pic.twitter.com/Oex9roapNv
— Sam Paech (@sam_paech) May 29, 2025
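To make the idea of "comparing language preferences" concrete, here is a toy sketch of one possible approach: building word-frequency profiles of two text samples and measuring their cosine similarity. This is purely illustrative and is not Paech's actual methodology; the sample sentences and function names are invented for this example.

```python
from collections import Counter
import math
import re

def word_freq(text):
    """Lowercase word-frequency profile of a text sample."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two frequency profiles, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical outputs from two models with similar stylistic habits:
sample_a = "Certainly! Let's delve into the nuances of this problem step by step."
sample_b = "Certainly! Let's delve into the key nuances of the problem together."
print(cosine_similarity(word_freq(sample_a), word_freq(sample_b)))
```

Real stylistic fingerprinting would use far larger samples and richer features (n-grams, characteristic phrase lists), but the principle, quantifying overlap in word choice, is the same.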
Another developer, who goes by the pseudonym "SpeechMap," noted that R1-0528’s "thought traces," the intermediate reasoning steps the model produces while solving problems, also resemble Gemini’s. While this isn’t direct evidence, it has sparked debate about how AI models are being trained behind the scenes.
This isn’t the first time DeepSeek has faced such accusations. Back in December 2024, DeepSeek’s V3 model raised concerns after it identified itself as ChatGPT, prompting speculation about its training data. Soon after, OpenAI and Microsoft reported that accounts believed to be linked to DeepSeek had extracted large volumes of data through OpenAI’s API.
Despite the controversy, V3 outperformed rivals on several benchmarks. It scored 88.5% on MMLU, beating GPT-4o’s 87.2%, and achieved 90.2% on MATH-500, far ahead of GPT-4o’s 74.6%. Remarkably, it did this while reportedly costing just $5.6 million to train, using a Mixture-of-Experts architecture that activates only 37B of its 671B parameters at a time.
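The Mixture-of-Experts idea behind that efficiency can be sketched in a few lines: a router selects a small number of "experts" per token, so only a fraction of the model’s total parameters do work on any given forward pass. The expert counts and parameter sizes below are illustrative stand-ins, not DeepSeek-V3’s actual configuration, and a real router is a learned neural gate rather than a fixed score sort.

```python
# Toy Mixture-of-Experts routing sketch (all numbers are illustrative).
NUM_EXPERTS = 16               # total routed experts in the toy model
TOP_K = 2                      # experts activated per token
PARAMS_PER_EXPERT = 1_000_000  # hypothetical parameters per expert

def route(token_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:k]

# A token's (hypothetical) router scores, one per expert:
scores = [0.1, 0.9, 0.3, 0.5] + [0.0] * (NUM_EXPERTS - 4)
active = route(scores)

active_params = TOP_K * PARAMS_PER_EXPERT
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
print(f"Active experts for this token: {active}")
print(f"Fraction of parameters used: {active_params / total_params:.3f}")
```

The payoff is that compute per token scales with the active parameters (37B in V3’s case), while model capacity scales with the total (671B), which is how a frontier-scale model can be trained on a comparatively modest budget.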
AI companies are reacting. OpenAI now requires ID verification for access to its most advanced models, effectively locking out developers in unsupported countries such as China. Google has also begun hiding the intermediate reasoning data ("traces") in its developer tools, making it harder to reverse-engineer Gemini’s outputs. Anthropic has done the same with Claude.
There is no confirmed evidence that DeepSeek used Gemini outputs without permission. But if investigations find that it did, DeepSeek could face serious legal consequences, including lawsuits over intellectual property violations and breaches of terms of service, which could significantly damage its operations and reputation.
DeepSeek’s latest update may be impressive, but if it is riding too closely on the backs of competitors, the industry may need more than better models; it may need better guardrails.