DeepSeek says it has found a way to make AI 85 per cent faster, flagship chip not required
As demand for AI models continues to grow, companies are facing a problem: computing resources. Data centres need thousands of the most advanced GPUs to run these models. And for Chinese AI companies like DeepSeek, advanced AI chips from the likes of Nvidia are not accessible due to US sanctions. But now, the Chinese startup claims that it has found a way to make its AI models respond much faster without needing the most advanced chips to do so.

DeepSeek has unveiled DSpark, a speculative decoding framework for its flagship V4 model family. DeepSeek says that this can speed up AI responses by as much as 85 per cent. For instance, a single GPU that previously handled 100 user queries could process about 185 in the same time.

Smaller AI model does the work, larger one verifies

DeepSeek claims that DSpark is aimed at speeding up AI inference, that is, the stage in which an AI model produces a response to your query.
AI inference is often seen as a major bottleneck in serving AI models. AI models generate text one token at a time, which is slow and makes poor use of GPUs when responses are long. Tokens are the basic units of text that AI models read and write, and longer requests and responses consume more of them.

DSpark addresses this with speculative decoding. According to the company, a lightweight draft model quickly proposes the next few tokens, and the main model then verifies them in batches rather than generating everything from scratch. In other words, a smaller model does the drafting and the larger one checks its work. If the draft created by the smaller model is correct, the system skips ahead; if it is not, it falls back to the main model's output. DeepSeek says that most tokens are easy to predict, so the system can often move ahead.
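To make the draft-and-verify loop concrete, here is a minimal sketch of generic greedy speculative decoding in Python. It is not DeepSeek's DSpark code: `target_model`, `draft_model`, `greedy_decode` and `speculative_decode` are hypothetical toy stand-ins, and the "models" are simple functions over integer tokens. A real system would verify all drafted tokens in a single batched pass of the large model; the sketch only counts those passes.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]


def target_model(context: List[Token]) -> Token:
    """Hypothetical 'large' model: the next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def draft_model(context: List[Token]) -> Token:
    """Hypothetical 'small' model: matches the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def greedy_decode(prompt: List[Token], num_new: int, target: Model) -> List[Token]:
    """Baseline: one call to the large model per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token], num_new: int, k: int, draft: Model, target: Model
) -> Tuple[List[Token], int]:
    """Draft up to k tokens, keep the prefix the target agrees with.

    Returns the full sequence and the number of verification rounds,
    each of which stands in for one batched pass of the large model.
    """
    out = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(out) < goal:
        # Step 1: the small model proposes a short run of tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # Step 2: the large model checks the proposal position by position.
        target_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft was right: skip ahead
            else:
                accepted.append(expected)  # draft was wrong: fall back
                break
        out.extend(accepted)
    return out, target_passes


tokens, passes = speculative_decode([0], num_new=12, k=4,
                                    draft=draft_model, target=target_model)
baseline = greedy_decode([0], num_new=12, target=target_model)
print(tokens == baseline, passes)
```

In this toy setup the output matches plain token-by-token decoding exactly, but the 12 new tokens need only 4 verification rounds instead of 12 separate calls to the large model. The saving depends entirely on how often the draft is right, which is why the claim that most tokens are easy to predict matters.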
DeepSeek claims that this approach allows output to be generated faster. All of this happens on the GPU, with no work being shifted to the CPU. The framework also uses a semi-autoregressive generation method. That is, instead of generating responses on a token-by-token basis, it can produce small chunks of tokens at once, making the process quicker.

More efficient way of using AI?

The company has open-sourced its DSpark research, a joint effort with Peking University, on GitHub and HuggingFace. DeepSeek says that DSpark does not improve a model's general capabilities, but it could help companies get better performance without large additional investment in computing resources. The company tested the framework on several open-source models, including Google DeepMind's Gemma and Alibaba's Qwen, suggesting the gains could be applied more widely. This could be crucial at a time when AI companies are spending billions of dollars to acquire more compute for data centres.
