
Intro
You’ve probably fired up ollama run some-cool-model tons of times, effortlessly pulling models from Ollama’s library or even directly from Hugging Face. But have you ever wondered how those CPU-friendly GGUF quantized models land on places like Hugging Face in the first place?
What if I told you that you could contribute back with tools you might already be using? That’s right: Ollama isn’t just a slick way to run models; it’s also a tool for creating these quantized versions yourself!
Today, we’ll show you exactly how to take a base model (like the compact Phi-1.5) and quantize it using the Ollama CLI. Let’s check it out!
What is Quantization?
Quantization is a process that reduces the precision of a model’s weights, for example from 32-bit floating point down to 8-bit integers. This makes the model smaller on disk, faster to run on everyday hardware, and lighter on memory, usually at the cost of a small drop in output quality. To put rough numbers on it, Phi-1.5’s ~1.3 billion parameters take about 2.6 GB at FP16 but well under 1 GB at a 4-bit quantization level.
How to Quantize with Ollama
Ollama can quantize FP16 or FP32 base models down to a range of quantization levels using the ollama create command with its -q/--quantize flag. The same ollama create command can also import an already-quantized GGUF model directly.
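In outline, the whole workflow is a single command: point ollama create at a Modelfile, pick a quantization level, and give the result a name. The model and file names below are only placeholders:
ollama create my-quantized-model -f MyModelfile --quantize q4_K_M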
💡 Let’s quantize a hypothetical FP16 (16-bit floating point) version of the Phi-1.5 model using the Ollama CLI.
Prerequisites
- Ollama Installed: Make sure you have the Ollama CLI installed and working. Windows | Linux | macOS
- Base Model Weights: You need the unquantized (FP16 or FP32) weights for the model you want to quantize.
For this example, let’s assume you have the Phi-1.5 FP16 weights downloaded from Hugging Face to a local directory, as sketched below.
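If you still need the weights, the Hugging Face CLI is one way to fetch them. A minimal sketch, assuming the model lives in the microsoft/phi-1_5 repository and that ./phi-1.5-fp16 is where you want it:
pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli tool
huggingface-cli download microsoft/phi-1_5 --local-dir ./phi-1.5-fp16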
Supported Quantizations
q4_0 | q4_1
q5_0 | q5_1
q8_0
K-means Quantizations:
q3_K_S | q3_K_M | q3_K_L
q4_K_S | q4_K_M
q5_K_S | q5_K_M
q6_K
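If you’re unsure which level to pick, you can build the same base weights at more than one level (using the Modelfile we create in the next step) and compare size and output quality yourself; the model names here are only illustrative:
ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M   # small, good all-round choice
ollama create phi15-q8 -f Modelfile_phi15 --quantize q8_0       # larger, closer to FP16 quality
ollama list                                                     # compare the resulting sizes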
Step 1: Create a Modelfile
After downloading the Phi-1.5 FP16 model, you need a simple text file named Modelfile_phi15. This file tells Ollama where to find the base model weights. Create the file with the following content:
vi Modelfile_phi15 # Choose any name you want
# Modelfile to quantize Phi-1.5
FROM /path/to/your/phi-1.5/fp16/model
# IMPORTANT: Replace the path above with the actual location of your downloaded weights!
Step 2: Run the Quantization Command
Now, open your terminal in the directory where you created the Modelfile. Use the ollama create command with the --quantize (or -q) flag to perform the quantization, plus the -f flag to point at our Modelfile (needed because it isn’t named Modelfile). We’ll choose the q4_K_M quantization level (a popular K-means quantization offering a good balance of size and quality) and name our new quantized model phi15-q4km:
ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M
...
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
creating new layer sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
writing manifest
success
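At this point the quantized model lives in your local Ollama store. Two quick ways to sanity-check it (the exact output will vary on your machine):
ollama list              # the new model should be far smaller than the FP16 weights
ollama show phi15-q4km   # inspect details such as parameter count and quantization level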
Step 3: Run Your Quantized Model
That’s it! You can now run your smaller, faster Phi-1.5 model:
ollama run phi15-q4km
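You can also pass a prompt straight on the command line for a quick smoke test; the prompt here is just an example:
ollama run phi15-q4km "Explain quantization in one sentence."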
Bonus: Sharing Your Quantized Model 🚀
Enjoying your quantized model locally? All the more reason to share some love with the Ollama community. You can publish your model to the Ollama registry just like you’d share a Docker image on Docker Hub 🫡.
1. Rename (if needed):
Ensure the model name includes your Ollama username (e.g., myuser/phi15-q4km). Use ollama cp if you need to rename it:
ollama cp phi15-q4km your-ollama-username/phi15-q4km
2. Push:
Upload it to Ollama Hub (make sure you’ve added your public key to your ollama.com profile first):
ollama push your-ollama-username/phi15-q4km
Voila! Now others can easily download and run your quantized model using:
ollama run your-ollama-username/phi15-q4km
Conclusion
And there you have it! You’ve successfully performed digital liposuction on a powerful AI model using just a couple of Ollama commands. Now you can run your own quantized models without needing fancy GPUs or selling a kidney. Give it a try with your favorite models!
Happy shrinking!