
Intro
You’ve probably fired up ollama run some-cool-model tons of times, effortlessly pulling models from Ollama’s Repo or even directly from Hugging Face. But have you ever wondered how those CPU-friendly GGUF quantized models actually land on places like Hugging Face in the first place?
What if I told you that you could contribute back with tools you might already be using? That’s right: Ollama isn’t just a slick way to run models; it’s also a tool for creating these quantized versions yourself!
Today, we’ll show you exactly how to take a base model (like the compact Phi) and quantize it using the Ollama CLI. Let’s check this out!
What is Quantization?
Quantization is a process that reduces the precision of a model’s weights (e.g., from 32-bit floating point down to 8-bit integers). A weight stored as a 32-bit float takes 4 bytes, while the same weight stored as an 8-bit integer takes just 1 byte, roughly a 4x reduction. This makes the model:
- Smaller on disk and in memory
- Faster to run, especially on CPUs
- Slightly less accurate, since some precision is lost
How to Quantize with Ollama
Ollama can quantize FP16/FP32 models down to various quantization levels using the ollama create command with the -q/--quantize flag.
The ollama create command can also be used to import an already-quantized GGUF model directly.
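For instance, if you already have a quantized GGUF file on disk, a one-line Modelfile pointing at it is enough to import it, no -q flag needed (the file and model names below are just placeholders):
echo "FROM ./some-model.Q4_K_M.gguf" > Modelfile
ollama create my-gguf-model -f Modelfile
ollama run my-gguf-model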
Let’s quantize a hypothetical FP16 (16-bit floating point) version of the Phi-1.5 model using the Ollama CLI.
Prerequisites
- Ollama Installed: Make sure you have the Ollama CLI installed and working. Windows | Linux | macOS
- Base Model Weights: You need the unquantized (FP16 or FP32) weights for the model you want to quantize.
For this example, let’s assume you have the Phi-1.5 FP16 weights downloaded from Hugging Face to a local directory.
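If you still need to grab the weights, one option is the Hugging Face CLI (this assumes the huggingface_hub CLI is installed; microsoft/phi-1_5 is the upstream Phi-1.5 repository, and the local directory name is just a placeholder):
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/phi-1_5 --local-dir ./phi-1.5-fp16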
Supported Quantizations
q4_0 | q4_1 | q5_0 | q5_1 | q8_0
K-means Quantizations:
q3_K_S | q3_K_M | q3_K_L | q4_K_S | q4_K_M | q5_K_S | q5_K_M | q6_K
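Any of these levels can be passed to -q/--quantize. For example, once you have a Modelfile (we’ll create one in Step 1 below), a higher-fidelity 8-bit build would look something like this (the model name is just a placeholder):
ollama create phi15-q8 -f Modelfile_phi15 -q q8_0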
Step 1: Create a Modelfile
After downloading the Phi-1.5 FP16 model, you need a simple text file named Modelfile_phi15. This file tells Ollama where to find the base model weights. Create the file with the following content:
vi Modelfile_phi15 # Choose any name you want
# Modelfile to quantize Phi-1.5
FROM /path/to/your/phi-1.5/fp16/model
# IMPORTANT: Replace the path above with the actual location of the downloaded weights!

Step 2: Run the Quantization Command
Now, open your terminal in the same directory where you created the Modelfile. Use the ollama create command with the --quantize (or -q) flag to perform the quantization, pointing at your Modelfile with -f since it isn’t named Modelfile. We’ll choose the q4_K_M quantization level (a popular K-means quantization offering a good balance of size and quality) and name our new quantized model phi15-q4km:
ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M
...
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
creating new layer sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
writing manifest
success

Step 3: Run Your Quantized Model
That’s it! You can now run your smaller, faster Phi-1.5 model:
ollama run phi15-q4km
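If you want to sanity-check the result first, ollama list and ollama show can confirm the new model exists and report its size and quantization level (exact output varies by setup):
ollama list
ollama show phi15-q4km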
Bonus: Sharing Your Quantized Model
Enjoying your quantized model locally? All the more reason to share some love with the Ollama community. You can now publish your model to the Ollama registry just like you’d share a Docker image on Docker Hub.
1. Rename (if needed):
Ensure the model name includes your Ollama username (e.g., myuser/phi15-q4km). Use ollama cp if you need to rename it:
ollama cp phi15-q4km your-ollama-username/phi15-q4km

2. Push:
Upload it to Ollama Hub (make sure you’ve added your public key to your ollama.com profile first):
ollama push your-ollama-username/phi15-q4km
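If the push is rejected because your key isn’t registered, your Ollama public key typically lives under the Ollama home directory; the exact path depends on your OS and install method, but on macOS/Linux it’s usually:
cat ~/.ollama/id_ed25519.pub # add this key to your ollama.com account settings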
Voila! Now others can easily download and run your quantized model using:
ollama run your-ollama-username/phi15-q4km

Conclusion
And there you have it! You’ve successfully performed digital liposuction on a powerful AI model using just a couple of Ollama commands. Now you can run your own quantized models without needing fancy GPUs or selling a kidney. Give it a try with your favorite models!
Happy shrinking!

Run AI Your Way – In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open – but not forever.
Secure your spot (only a few left), apply now!
Run AI assistants, RAG, or internal models on an AI backend privately in your cloud –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
