The Cheapest Way to Run LLMs in Production: GPT-4 Turbo, Llama 2, Claude 2.1

Agent Issue
6 min read · Nov 27, 2023


When Google introduced the Transformer architecture in 2017, it was already clear that a Cambrian explosion of SaaS products powered by language models would follow in the years to come.

Transformers paved the way for OpenAI’s GPT models, Google’s BERT, PaLM, and Bard, Baidu’s ERNIE, and Meta’s Llama.

Fast forward a few years, and GenAI is capturing imaginations and raising expectations about its potential to revolutionize industries and even move financial markets.

Everybody is eager to jump on the bandwagon with LLMs, either to quickly capitalize on the hype or to build something that can fundamentally change how industries operate or the way we live.

While you can build anything from simple Discord or Slack bots to complex systems augmenting medical diagnosis and accelerating research and development, running LLM-powered applications can be notoriously expensive.

Even for simple prototypes, you can quickly rack up thousands of dollars in bills, and few of us have the budget to subsidize free-tier users if the project takes off.

That’s why I want to briefly cover this topic by walking you through:

  • A simple framework to understand LLM pricing
  • Token math (a quick sketch follows this list)
  • An example of a cost estimate for a full-stack…
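To make the token math concrete before we dig in, here is a minimal back-of-the-envelope sketch in Python. The per-1K-token prices and the helper names (`estimate_request_cost`, `estimate_monthly_cost`) are illustrative assumptions for this article, not official figures; always check each provider's current pricing page before budgeting.

```python
# Back-of-the-envelope cost estimator for a chat-style LLM API.
# Prices are illustrative per-1K-token rates (assumed, not official).

PRICES_PER_1K = {
    # model: (input $/1K tokens, output $/1K tokens) -- assumed values
    "gpt-4-turbo": (0.01, 0.03),
    "claude-2.1": (0.008, 0.024),
}

def estimate_request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are billed separately, per 1K."""
    in_price, out_price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

def estimate_monthly_cost(model: str, requests_per_day: int,
                          avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Scale the per-request cost to a month of traffic (30 days)."""
    per_request = estimate_request_cost(model, avg_input_tokens, avg_output_tokens)
    return per_request * requests_per_day * 30

# Example: 1,000 requests/day, ~1,500 input tokens and ~500 output tokens each
# works out to $0.03 per request, or roughly $900/month at these assumed rates.
if __name__ == "__main__":
    print(f"${estimate_monthly_cost('gpt-4-turbo', 1000, 1500, 500):,.2f} / month")
```

The point of a sketch like this is that input and output tokens are priced differently, so the same traffic volume can cost very different amounts depending on how verbose your prompts and completions are.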

Written by Agent Issue

Your front-row seat to the future of Agents.