How to Choose Batch Sizes to Minimize Cost per Token in LLM Serving
by Vicki Powell, Jan 16, 2026

Running large language models (LLMs) at scale isn’t just about having powerful GPUs; it’s about using them wisely. If you’re paying for API calls or running your own models, every extra millisecond and unused GPU cycle is money going down the drain. The biggest lever you can pull to cut costs? Batch size.

Batching isn’t new. It’s the same idea as loading a dishwasher full of dishes instead of washing one plate at a time. But in LLM serving, getting it right can slash your costs by half or more. Companies like Scribd and First American cut their monthly LLM bills by 40-50% just by switching from single-request processing to smart batching. And it’s not magic. It’s math.

Why Batch Size Matters More Than You Think

When you send one request to an LLM, the GPU wakes up, loads the model into memory, runs inference, and then goes back to sleep. That’s inefficient. GPUs are built to do thousands of calculations at once. They’re hungry for parallel work.

Batching groups multiple requests together (say, 20 or 50) so the GPU runs them all in one go. The model loads once. The memory stays hot. The compute stays busy. Result? You get more tokens processed per dollar spent.

Here’s the real number: running a 7B-parameter model with batch size 1 can take 976 milliseconds per request. At batch size 8? That drops to 126 ms. That’s an 87% reduction in latency, and a 70%+ drop in cost per token. At scale, that’s thousands of dollars saved every month.

And it’s not just small models. For LLaMA2-70B, proper batching boosts throughput from 200 to 1,500 tokens per second. That’s a 650% increase. If you’re serving 10 million tokens a day, that’s the difference between a $3,000 daily bill and a $900 one.
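The arithmetic behind those figures is easy to sanity-check yourself. Here’s a minimal sketch that turns throughput into cost per million tokens; the $3/hour GPU rate is an assumption for illustration, not a quoted price:

```python
# Cost per token = hourly GPU cost / tokens processed per hour.
GPU_COST_PER_HOUR = 3.00  # USD per GPU-hour (assumed; plug in your real rate)

def cost_per_million_tokens(tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

# Throughput figures from the LLaMA2-70B example above:
for label, tps in [("unbatched (~200 tok/s)", 200), ("batched (~1,500 tok/s)", 1500)]:
    print(f"{label}: ${cost_per_million_tokens(tps):.2f} per 1M tokens")
```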

What’s the Best Batch Size?

There’s no universal number. The sweet spot depends on your model, your hardware, and your use case.

  • Text generation (creative writing, summarization): 10-50 requests per batch. Too big, and latency spikes. Too small, and you’re not squeezing the GPU.
  • Classification (sentiment, intent, spam detection): 100-500 requests. These are short inputs, so you can pack more in without hitting memory limits.
  • Q&A systems: 50-200 requests. Balanced between speed and throughput.

These aren’t guesses. They’re backed by benchmarks from Clarifai, Latitude, and Databricks. But here’s the catch: your input length matters. If your users are sending 2,000-token prompts, you might max out at batch size 8, even if your model could handle 50 with 500-token prompts. Longer sequences = less room in GPU memory.

Most systems hit a wall around batch size 64. Why? Memory. The key-value (KV) cache for each sequence eats up VRAM. Push past that, and you get out-of-memory crashes. No amount of software tuning gets around a hard memory limit.
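You can estimate where that wall sits before you ever see an OOM. Per token, the KV cache stores one key and one value vector per layer, so a rough sizing formula is 2 × layers × KV heads × head dim × bytes per element. The sketch below uses LLaMA-2-7B’s published shape; the 24 GB of free VRAM is an assumed figure for illustration:

```python
# Back-of-the-envelope KV cache sizing: how many sequences fit in leftover VRAM?

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # One key + one value vector per layer, per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_batch_size(free_vram_gb: float, seq_len: int, **model) -> int:
    per_sequence = kv_cache_bytes_per_token(**model) * seq_len
    return int(free_vram_gb * 1024**3 // per_sequence)

# LLaMA-2-7B: 32 layers, 32 KV heads, head_dim 128, fp16 cache (2 bytes/element).
llama2_7b = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2)

# Assuming roughly 24 GB of VRAM left after weights (illustrative, not measured):
print(max_batch_size(24, seq_len=500, **llama2_7b))   # short prompts: big batches fit
print(max_batch_size(24, seq_len=2000, **llama2_7b))  # long prompts: far fewer fit
```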

Static vs. Dynamic vs. Continuous Batching

Not all batching is created equal.

Static batching waits until you have, say, 30 requests before processing. Simple to implement, great for scheduled jobs like overnight document processing. But if you’re handling live chat, users wait too long.

Dynamic batching collects requests over a short window (say, 100 milliseconds) and sends the group when the timer hits or the batch fills up. This works well for chatbots and customer support systems. It balances latency and throughput without requiring complex infrastructure.
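The core of a dynamic batcher is just a timed collection loop. Below is a simplified, single-threaded sketch (real servers use async queues and per-request futures); `run_model_on_batch` is a placeholder for your own inference call:

```python
import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 32     # cap taken from your own benchmarks
MAX_WAIT_SECONDS = 0.1  # the "100 millisecond" collection window

def collect_batch(request_queue: Queue) -> list:
    """Pull requests until the batch is full or the window expires."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

# Serving loop (run_model_on_batch is a placeholder for your inference call):
# while True:
#     batch = collect_batch(requests)
#     results = run_model_on_batch(batch)
```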

Continuous batching is the future. Instead of waiting for a group, it slots new requests into gaps as others finish. Imagine a highway where cars don’t wait for a full truckload; they just merge in as space opens. Tools like vLLM and TensorRT-LLM do this automatically. One study showed continuous batching delivers 23x higher throughput than static methods. And latency? Down by half.

If you’re building something real-time, skip static. Go dynamic or continuous. Most enterprises now use one of the latter two: 63%, according to Gartner’s 2024 survey.
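If you go the continuous route, you rarely have to manage batches yourself. In vLLM, for example, continuous batching is the built-in scheduler: you hand it prompts and it packs sequences as memory allows. A minimal offline sketch, assuming a recent vLLM release and an open model like Mistral-7B-Instruct:

```python
# Continuous batching via vLLM's offline API: the engine schedules and
# interleaves all prompts on its own; no manual batch grouping needed.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any HF model you can load

params = SamplingParams(
    max_tokens=256,   # caps output length (pairs nicely with early stopping)
    temperature=0.7,
)

prompts = [
    "Summarize the key risks in this contract clause: ...",
    "Classify the sentiment of: 'The update broke my workflow.'",
    # ...hundreds more; vLLM packs them in as GPU memory allows
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```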

[Illustration: three batching methods shown as server racks with flowing request arrows and throughput indicators]

Hardware Choices Change the Game

Batch size doesn’t exist in a vacuum. It’s tied to your GPU.

Consumer-grade cards like the RTX 4090 offer 1.9x more memory bandwidth per dollar than enterprise A100s or H100s. That means you can run smaller models like Mistral 7B or Phi-3 at much lower cost. One study found that using the right GPU for the workload improved cost-efficiency by 2.27x.

And here’s the kicker: serving a 70B model on A100-class hardware can cost $2,000-$3,000 per day. That’s not sustainable unless you’re batching at scale. But if you route 90% of queries to a 7B model (costing $0.00006 per 300 tokens) and only escalate complex ones to GPT-4, you can cut costs by 87%, even before batching.
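That routing step is usually called model cascading, and the gate in front of the two models can start out embarrassingly simple. A sketch with a deliberately crude heuristic; the rule, the model names, and the `call_small_model` / `call_large_model` stubs are placeholders you’d replace with your own clients:

```python
# Toy model cascade: send easy queries to a cheap 7B model, escalate the rest.
COMPLEX_HINTS = ("explain why", "compare", "multi-step", "legal", "contract")

def is_complex(prompt: str) -> bool:
    # Placeholder heuristic; real systems use a small classifier or confidence scores.
    lowered = prompt.lower()
    return len(prompt) > 2000 or any(hint in lowered for hint in COMPLEX_HINTS)

def call_small_model(prompt: str) -> str:
    return f"[7B model answer to: {prompt[:40]}...]"        # stub: your 7B deployment

def call_large_model(prompt: str) -> str:
    return f"[frontier model answer to: {prompt[:40]}...]"  # stub: your GPT-4-class fallback

def route(prompt: str) -> str:
    return call_large_model(prompt) if is_complex(prompt) else call_small_model(prompt)

print(route("Do you ship to Canada?"))                          # stays on the cheap path
print(route("Compare these two contract clauses and explain why one is riskier."))  # escalates
```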

Combine that with batch processing, and you’re not just saving money. You’re making your system economically viable.

Real-World Wins (and Mistakes)

One fintech company switched from individual API calls to batch sizes of 35 for classifying customer support tickets. Result? 58% cost reduction. Took them three weeks to tune. But they didn’t guess. They tested batch sizes from 5 to 100, tracked latency and memory usage, and picked the point where cost per token flattened out.

Another team at a media company dropped their OpenAI bill from $18,500 to $9,200 a month by batching document summaries at size 22. No quality loss. Just smarter resource use.

But it’s not all smooth sailing. Common pitfalls:

  • Trying to batch too many long inputs → out-of-memory crashes
  • Ignoring latency requirements → users hate slow chatbots
  • Using streaming responses for non-interactive tasks → adds 20-40% to your cost

Streaming is great for real-time chat. But if you’re generating reports, summaries, or classifications? Batch. Don’t stream. Streaming keeps the GPU active the whole time. Batching lets it rest between jobs.

[Image: split-screen comparison of high-cost vs. optimized LLM serving, with efficiency icons and cost-savings callouts]

Stack the Savings: Batch + Other Tricks

Batching alone isn’t enough. Combine it with other optimizations:

  • Early stopping: Configure models to stop generating once they’ve reached a confident answer. Cuts output tokens by 20-40% without hurting quality.
  • Quantization: Run models with 4-bit or 8-bit precision. Reduces memory use by 50-75%, letting you fit bigger batches.
  • Caching: Store common queries and responses. If 10 users ask “What’s our return policy?”, serve the same answer 10 times. Cuts costs by 15-25%. A minimal caching sketch follows this list.
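Caching in particular is nearly free to bolt on. Here’s a minimal in-process sketch using an exact-match LRU cache; real deployments usually normalize prompts and back this with a shared store like Redis, and `generate_uncached` stands in for your actual model call:

```python
from functools import lru_cache

def generate_uncached(prompt: str) -> str:
    # Placeholder for the real (expensive) model call.
    return f"[model output for: {prompt}]"

@lru_cache(maxsize=10_000)
def generate(prompt: str) -> str:
    # Exact-match cache: identical prompts are answered once and reused.
    # Real systems normalize whitespace/casing or hash a canonical form first.
    return generate_uncached(prompt)

# Ten users asking the same question cost one model call, not ten.
for _ in range(10):
    generate("What's our return policy?")
print(generate.cache_info())  # CacheInfo(hits=9, misses=1, ...)
```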

One team at a SaaS company combined dynamic batching, quantization, and caching. Their cost per token dropped 72% in six weeks. No model changes. Just smarter ops.

What’s Next? The Future of Batch Optimization

Manual tuning won’t last. The next wave is automated batch sizing.

Research from February 2025 shows AI systems using mixed-integer linear programming can dynamically adjust batch sizes in real time to maximize throughput or minimize cost, depending on your goal. Some can boost throughput by 41% or cut latency by 54% without human input.
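You don’t need the research system to see the shape of that optimization. Here’s a toy version using SciPy’s MILP solver: pick one batch size from a set of benchmarked candidates to maximize throughput under a latency budget. The throughput and latency numbers are made up for illustration:

```python
# Toy batch-size selection as a mixed-integer linear program (SciPy >= 1.9):
# one binary variable per candidate size, maximize throughput, respect a p95 SLO.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

batch_sizes = np.array([4,    8,    16,   32,   64])
throughput  = np.array([260,  480,  820,  1250, 1500])   # tokens/sec (assumed)
p95_latency = np.array([0.15, 0.22, 0.35, 0.60, 1.10])   # seconds (assumed)
LATENCY_SLO = 0.5                                         # seconds

n = len(batch_sizes)
constraints = [
    LinearConstraint(np.ones((1, n)), lb=1, ub=1),                 # pick exactly one size
    LinearConstraint(p95_latency.reshape(1, -1), ub=LATENCY_SLO),  # stay under the SLO
]
res = milp(-throughput,                # milp minimizes, so negate to maximize
           constraints=constraints,
           integrality=np.ones(n),     # integer (here binary) variables
           bounds=Bounds(0, 1))
assert res.success, res.message

chosen = batch_sizes[res.x.round().astype(bool)][0]
print(f"Best batch size under a {LATENCY_SLO}s SLO: {chosen}")
```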

Anthropic plans to roll out auto-batching in late 2025. Tools like vLLM already do it. If you’re building today, you don’t need to code your own scheduler. Use a framework that does it for you.

And the trend won’t stop. With the LLM market projected to hit $36.1 billion by 2030, efficiency isn’t optional; it’s survival. Companies that treat LLM serving like a black box will bleed cash. Those that optimize batch size, hardware, and request patterns will thrive.

How to Start

Here’s your simple 5-step plan:

  1. Measure your current cost per token. Use your provider’s logs or internal tracking.
  2. Identify your workload type. Is it short queries? Long generations? Classification?
  3. Test batch sizes from 5 to 64. Monitor latency, memory usage, and cost per token (see the sweep sketch after this list).
  4. Switch to dynamic or continuous batching. Use vLLM, TensorRT-LLM, or your cloud provider’s built-in tools.
  5. Combine with early stopping and caching. Layer on savings.
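Step 3 is just a loop. A bare-bones sweep harness: replay real traffic at each candidate size, record latency and throughput, and watch where cost per token flattens. The `run_benchmark` below is a fake stand-in so the script runs (yours should replay recorded prompts and return measured numbers), and the $3/hour rate is again an assumption:

```python
# Batch-size sweep: print cost per million tokens at each candidate size.
GPU_COST_PER_HOUR = 3.00   # assumed rate; use your actual hourly cost

def run_benchmark(batch_size: int) -> dict:
    # Fake stand-in so the script runs end to end. Replace with real traffic replay.
    return {"p95_latency_s": 0.05 * batch_size,
            "tokens_per_sec": 200 * batch_size ** 0.7}

for batch_size in (5, 8, 16, 32, 48, 64):
    stats = run_benchmark(batch_size)
    cost_per_1m = GPU_COST_PER_HOUR / (stats["tokens_per_sec"] * 3600) * 1e6
    print(f"batch={batch_size:>2}  p95={stats['p95_latency_s']:.2f}s  "
          f"${cost_per_1m:.2f} per 1M tokens")
```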

You don’t need a PhD. You don’t need a $100k GPU cluster. You just need to stop treating each request like a snowflake. Group them. Optimize them. Watch your bill shrink.

What’s the ideal batch size for a chatbot?

For chatbots, aim for 20-50 requests per batch using dynamic batching. This keeps latency under 500ms while maximizing GPU use. If your prompts are long (over 1,000 tokens), reduce the batch size to 10-20 to avoid memory overload. Test with your actual user input lengths.

Does batch size affect response quality?

No. Batch size doesn’t change how the model generates text; it just changes how many requests it handles at once. The output quality stays identical. The only trade-off is latency: larger batches mean slightly longer wait times per request, but not worse answers.

Can I batch requests across different models?

Not directly. Each model has its own memory footprint and compute needs. You can’t batch a GPT-4 request with a Mistral 7B request in the same GPU run. But you can route different requests to different models first (model cascading), then batch within each model group. This is how companies cut costs by 80%+.

Why is my GPU memory full even with small batch sizes?

It’s likely your input sequences are too long. The KV cache stores past tokens for each request. A single 3,000-token prompt can use as much memory as 10 short prompts. Use input truncation or chunking for long texts. Also, check if your model is loaded in full precision; switching to 8-bit or 4-bit quantization can free up 50-75% of memory.
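If you serve with Hugging Face Transformers, 4-bit loading is a config flag away through the bitsandbytes integration. A minimal sketch, assuming a recent transformers/bitsandbytes install and a CUDA GPU (vLLM and other servers expose similar quantization options):

```python
# Load a 7B model in 4-bit so more VRAM is left for the KV cache and bigger batches.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # place layers on available GPUs
)
```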

Is continuous batching worth the complexity?

If you’re serving over 1,000 requests per minute, yes. Continuous batching can give you 20x+ better throughput than static methods. But if you’re under 100 requests per minute, dynamic batching is simpler and nearly as effective. Start simple. Scale up only when you see diminishing returns.

How long does it take to implement batch processing?

For teams familiar with LLM infrastructure, it takes 2-4 weeks. The biggest hurdle isn’t coding; it’s testing. You need to run real traffic through different batch sizes, measure latency spikes, and find the balance between speed and cost. Start with a non-critical endpoint, like email summarization, before rolling it out to your main API.