vLLM for beginners: Deployment Options (Part III)

Intro In Part 2 of our vLLM for beginners series, we explored performance features like PagedAttention, attention backends, and prefill/decode optimization. In this final part, we'll shift from theory to practice, covering how to deploy vLLM across different environments, from source builds to Docker containers (Kubernetes deployment will be covered separately). 💡 In this series, we aim to provide …
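
As a quick taste of the container route covered there, here is a minimal sketch of serving a model with the official vllm/vllm-openai image; the model id, port, and cache path below are illustrative placeholders, and the flags assume an NVIDIA GPU host:

  # Run the OpenAI-compatible vLLM server in Docker; everything after the image name
  # is passed to the server, and mounting the local Hugging Face cache avoids re-downloads
  docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2

Once the container is up, any OpenAI-style client can point at http://localhost:8000/v1.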

vLLM for beginners: Key Features & Performance Optimization (Part II)

Intro In Part 1 of our vLLM for beginners series, we covered the fundamentals: the core concepts and terminology behind vLLM's architecture. In Part 2, we go deeper into what makes vLLM excel at performance: features like PagedAttention, attention backends, prefill and decode management, and more. 💡 This series is about building a strong foundation in vLLM, understanding how …

vLLM for beginners: The Fundamentals

Intro Last year, I dived deep into Ollama inference, where I ended up building and speaking about Ollama Kubernetes deployments, along with rich documentation in my ollama_lab repo and a quantization article. This year's Cloudthrill focus is vLLM inference, which is a next-level beast from a model serving standpoint. Exploring multiple inference options is time-intensive …

How to Quantize AI Models with Ollama CLI

Intro You’ve probably fired up ollama run some-cool-model tons of times, effortlessly pulling models from Ollama’s repo or even directly from Hugging Face. But have you ever wondered how those CPU-friendly GGUF quantized models actually land on places like Hugging Face in the first place? What if I told you that you could contribute back with tools you might already be …
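
For context on what the post builds toward, here is a rough sketch of both sides of that workflow; the repository and model names are hypothetical, and q4_K_M is just one common quantization level:

  # Pull and run a GGUF build straight from Hugging Face (hypothetical repo name)
  ollama run hf.co/someuser/some-cool-model-GGUF

  # Quantize a full-precision model described by a local Modelfile while creating it
  ollama create some-cool-model-q4 -f Modelfile --quantize q4_K_M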

LLM Quantization: All You Need to Know!

Intro Over the past year, I was drowning in GitHub PRs, half-baked Reddit discussions, videos, and scattered docs trying to decode the chaos of quantization for Large Language Models (LLMs). Everyone was talking about running Llama models on a laptop, but no one was explaining how it actually worked, and forget about finding proper research papers …