Tag: LLM inference
Distributed Transformer Inference: Master Tensor and Pipeline Parallelism for LLMs
Learn how to scale LLMs using Tensor and Pipeline Parallelism. Discover how vLLM and llm-d overcome memory limits to run massive models across multiple GPUs.
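As a quick taste of the idea, here is a minimal sketch of multi-GPU serving with vLLM. The checkpoint name, GPU counts, and parallel sizes are illustrative assumptions, not values from the article, and pipeline parallelism for offline inference requires a recent vLLM release.

```python
# Minimal sketch: serving a large model across multiple GPUs with vLLM.
# Assumptions (not from the article): an 8-GPU node, the
# meta-llama/Llama-3.1-70B-Instruct checkpoint, and a recent vLLM install.
from vllm import LLM, SamplingParams

# tensor_parallel_size shards each weight matrix across 4 GPUs;
# pipeline_parallel_size splits the layer stack into 2 sequential stages,
# so the 8 GPUs together hold one copy of the model.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example checkpoint
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```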
Constrained Decoding for LLMs: Mastering JSON, Regex, and Schema Control
Learn how constrained decoding ensures LLMs produce perfect JSON, regex, and schema-compliant outputs, eliminating syntax errors in production AI pipelines.
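To make the idea concrete, here is a toy, library-free sketch of constrained decoding via logit masking. The character vocabulary, the fake_logits stand-in for a real model, and the allowed-output set are all hypothetical; real implementations drive the mask from a JSON schema or regex compiled into a state machine.

```python
# Toy sketch of constrained (guided) decoding via logit masking.
# Not any specific library's implementation: the "model" returns random
# logits over a tiny character vocabulary, and the constraint forces the
# output to be one of a fixed set of strings.
import math
import random

VOCAB = list("truefals")          # toy character-level vocabulary
ALLOWED = {"true", "false"}       # outputs the constraint permits

def fake_logits(prefix: str) -> list[float]:
    """Stand-in for a real LLM forward pass: one logit per vocab entry."""
    random.seed(len(prefix))      # deterministic for the demo
    return [random.uniform(-1, 1) for _ in VOCAB]

def mask(prefix: str, logits: list[float]) -> list[float]:
    """Set logits of tokens that cannot lead to a valid output to -inf."""
    return [
        logit if any(s.startswith(prefix + ch) for s in ALLOWED) else -math.inf
        for ch, logit in zip(VOCAB, logits)
    ]

def constrained_decode() -> str:
    out = ""
    while out not in ALLOWED:
        logits = mask(out, fake_logits(out))
        out += VOCAB[logits.index(max(logits))]   # greedy pick among valid tokens
    return out

print(constrained_decode())  # always "true" or "false", never malformed output
```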
Speculative Decoding with Compressed Draft Models for LLMs: Faster Inference Without Losing Quality
Speculative decoding with compressed draft models cuts LLM inference time by up to 3x by letting a small draft model predict several tokens ahead while the large model verifies them in parallel. No quality loss, just faster responses.
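To illustrate the propose-and-verify loop, here is a self-contained toy sketch. target_next and draft_next are hypothetical stand-ins for the large and draft models, and K is an assumed draft length; none of it comes from the article.

```python
# Toy sketch of speculative decoding with a small draft model.
# Both "models" are deterministic toy functions over integer tokens; the
# point is the accept/verify loop, not the models themselves.

K = 4  # number of tokens the draft model proposes per step

def target_next(ctx: list[int]) -> int:
    """Expensive 'large model': greedy next token (toy rule)."""
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx: list[int]) -> int:
    """Cheap 'draft model': agrees with the target most of the time (toy rule)."""
    t = target_next(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50   # wrong roughly every 5th position

def speculative_decode(prompt: list[int], n_new: int) -> list[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1) Draft model proposes K tokens autoregressively (cheap).
        draft = []
        for _ in range(K):
            draft.append(draft_next(seq + draft))
        # 2) Target model scores all K positions "in parallel"
        #    (one batched forward pass in a real system).
        verified = [target_next(seq + draft[:i]) for i in range(K)]
        # 3) Accept the longest prefix where draft and target agree,
        #    then take one guaranteed-correct token from the target.
        n_accept = 0
        while n_accept < K and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        seq += draft[:n_accept]
        seq.append(verified[n_accept] if n_accept < K else target_next(seq))
    return seq[: len(prompt) + n_new]

# Matches greedy decoding with the target model alone, token for token.
print(speculative_decode([1, 2, 3], 12))
```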