The World’s Fastest LLM Inference: 3x Faster Than vLLM and TGI
How fast can LLMs get? The answer lies in the latest breakthrough in LLM inference.
TogetherAI claims to have built the world's fastest LLM inference engine, written in CUDA and running on NVIDIA Tensor Core GPUs. Judging by the benchmarks, it isn't just a step forward; it's a giant leap.
We obsess over system optimization and scaling so you don’t have to. As your application grows, capacity is automatically added to meet your API request volume. — TogetherAI
The Together Inference Engine lets you run 100+ open-source models such as Llama-2, generating 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat. Whether or not it is truly the fastest, that is impressive performance!
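To give you a feel for what calling the engine looks like before we dive into the details, here is a minimal sketch of a request to Together's OpenAI-compatible HTTP API. The endpoint URL, model identifier, and payload fields are my assumptions and may differ from your account or SDK version; we'll walk through the official Python API properly later in the article.

```python
# Minimal sketch: one chat-completion request to the Together Inference Engine.
# Endpoint, model name, and fields are assumptions; check Together's docs.
import os
import requests

TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]  # assumes the key is exported in your shell

response = requests.post(
    "https://api.together.xyz/v1/chat/completions",     # assumed endpoint
    headers={"Authorization": f"Bearer {TOGETHER_API_KEY}"},
    json={
        "model": "togethercomputer/llama-2-70b-chat",    # assumed model identifier
        "messages": [
            {"role": "user", "content": "Explain LLM inference in one sentence."}
        ],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```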
In this article, I will walk you through:
- The techniques behind the scenes
- Using the Python API for LLM inference
- Integration with LangChain
- Managing chat history
Let’s get started!