Cerebras and Groq speeds

⚡ AI inference is speeding up

Transformers, the underlying technology behind today’s AI, arenotoriously slow at inference. Remember when ChatGPT launched and you had to wait for a couple of seconds while your answer was generated one word at a time? During that time generation speed was approximately 1-10 tokens per seconds with the possibility for a 10x increase after quantisation and other optimisations 🐌

Noticeably, AI response are almost instant nowadays. At the same time, there is a race happening at the hardware level for the provider that can run AI the fastest, with two of the most prominent players being Cerebras and Groq 🔥 A few months ago Groq broke into the scene with an advertised 100 tokens per second generation speed for Llama 70B which has now increased to 250 👌 And while this is extremely fast for such a large model, Cerebras recently announced a speed of 2200 tokens per second for the same model 😮

On the surface such speeds may seem irrelevant for your application, but this is not entirely true since AI applications nowadays consist of multiple AI calls and components. Those speeds can enable building more complex solutions that still feel instant to the user. They also allow to improve existing solutions by allowing the model to “think more” for the same time 🚀