The World’s Fastest LLM Inference: 3x Faster Than vLLM and TGI
How fast can LLMs get? The answer lies in the latest breakthrough in LLM inference.
TogetherAI claims to have built the world's fastest LLM inference engine, written in CUDA and running on NVIDIA Tensor Core GPUs. Judging by the benchmarks, it isn't just a step forward; it's a giant leap.
We obsess over system optimization and scaling so you don’t have to. As your application grows, capacity is automatically added to meet your API request volume. — TogetherAI
The Together Inference Engine lets you run 100+ open-source models such as Llama-2, generating 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat. Whether or not it is truly the fastest, that is impressive performance!
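To give you a feel for what calling the engine looks like before we dive into the details, here is a minimal sketch of a request to Together's OpenAI-compatible HTTP API. The endpoint URL, model identifier, and payload fields are my assumptions and may differ from your account or SDK version; we'll walk through the official Python API properly later in the article.

```python
# Minimal sketch: one chat-completion request to the Together Inference Engine.
# Endpoint, model name, and fields are assumptions; check Together's docs.
import os
import requests

TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]  # assumes the key is exported in your shell

response = requests.post(
    "https://api.together.xyz/v1/chat/completions",     # assumed endpoint
    headers={"Authorization": f"Bearer {TOGETHER_API_KEY}"},
    json={
        "model": "togethercomputer/llama-2-70b-chat",    # assumed model identifier
        "messages": [
            {"role": "user", "content": "Explain LLM inference in one sentence."}
        ],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```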
In this article, I will walk you through:
- The techniques behind the scenes
- Using the Python API for LLM inference
- Integration with LangChain
- Managing chat history
Let’s get started!