
The World’s Fastest LLM Inference: 3x Faster Than vLLM and TGI

Agent Issue
9 min read · Nov 15, 2023

How fast can LLMs get? The answer lies in the latest breakthroughs in LLM inference.

TogetherAI claims to have built the world’s fastest LLM inference engine on CUDA, running on NVIDIA Tensor Core GPUs. Judging by the benchmarks, it isn’t just a step forward; it’s a giant leap.

We obsess over system optimization and scaling so you don’t have to. As your application grows, capacity is automatically added to meet your API request volume. — TogetherAI

The Together Inference Engine lets you run 100+ open-source models such as Llama-2, generating 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat. Whether or not it is truly the fastest, that is impressive performance!

Join our next cohort: Full-stack GenAI SaaS Product in 4 weeks!

In this article, I will walk you through:

  • The techniques behind the scenes
  • Using the Python API for LLM inference
  • Integration with LangChain
  • Managing chat history

Let’s get started!
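
Before diving into the details, here is a quick taste of what calling the engine from Python looks like. This is a minimal sketch, assuming the `together` SDK (installed with `pip install together`), a `TOGETHER_API_KEY` environment variable, and the Llama-2-70B-Chat model identifier shown below; the exact model string and parameters may differ on your account, so treat it as illustrative rather than definitive.

```python
import os

from together import Together  # pip install together

# Assumes TOGETHER_API_KEY is set in your environment.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# The model identifier is an assumption; check Together's model list
# for the exact string for Llama-2-70B-Chat.
response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[{"role": "user", "content": "What makes LLM inference fast?"}],
    max_tokens=256,
    temperature=0.7,
)

# Print the assistant's reply.
print(response.choices[0].message.content)
```

We will build on this pattern later when integrating with LangChain and managing chat history.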

TogetherAI Pushing the Limits of LLM Inference
