
Intro
You’ve probably fired up ollama run some-cool-model tons of times, effortlessly pulling models from Ollama’s library or even directly from Hugging Face. But have you ever wondered how those CPU-friendly GGUF quantized models land on places like Hugging Face in the first place?
What if I told you that you could contribute back with tools you might already be using? That’s right: Ollama isn’t just a slick way to run models; it’s also a tool for creating these quantized versions yourself!
Today, we’ll show you exactly how to take a base model (like the compact Phi-1.5) and quantize it using the Ollama CLI. Let’s check it out!
What is Quantization?
Quantization is a process that reduces the precision of a model’s weights, for example from 32-bit floating point down to 8-bit integers. This makes the model smaller on disk, faster to run on everyday hardware, and lighter on memory, usually at the cost of a small drop in output quality. To put rough numbers on it, Phi-1.5’s ~1.3 billion parameters take about 2.6 GB at FP16 but well under 1 GB at a 4-bit quantization level.
How to Quantize with Ollama
Ollama can quantize FP16 or FP32 base models down to a range of quantization levels using the ollama create command with its -q/--quantize flag. The same ollama create command can also import an already-quantized GGUF model directly.
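In outline, the whole workflow is a single command: point ollama create at a Modelfile, pick a quantization level, and give the result a name. The model and file names below are only placeholders:
ollama create my-quantized-model -f MyModelfile --quantize q4_K_M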
💡 Let’s quantize a hypothetical FP16 (16-bit floating point) version of the Phi-1.5 model using the Ollama CLI.
Prerequisites
- Ollama Installed: Make sure you have the Ollama CLI installed and working. Windows | Linux | macOS
- Base Model Weights: You need the unquantized (FP16 or FP32) weights for the model you want to quantize.
For this example, let’s assume you have the Phi-1.5 FP16 weights downloaded from Hugging Face to a local directory, as sketched below.
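If you still need the weights, the Hugging Face CLI is one way to fetch them. A minimal sketch, assuming the model lives in the microsoft/phi-1_5 repository and that ./phi-1.5-fp16 is where you want it:
pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli tool
huggingface-cli download microsoft/phi-1_5 --local-dir ./phi-1.5-fp16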
Supported Quantizations
q4_0 | q4_1
q5_0 | q5_1
q8_0
K-means Quantizations:
q3_K_S | q3_K_M | q3_K_L
q4_K_S | q4_K_M
q5_K_S | q5_K_M
q6_K
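If you’re unsure which level to pick, you can build the same base weights at more than one level (using the Modelfile we create in the next step) and compare size and output quality yourself; the model names here are only illustrative:
ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M   # small, good all-round choice
ollama create phi15-q8 -f Modelfile_phi15 --quantize q8_0       # larger, closer to FP16 quality
ollama list                                                     # compare the resulting sizes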
Step 1: Create a Modelfile
After downloading the Phi-1.5 FP16 model, you need a simple text file named Modelfile_phi15. This file tells Ollama where to find the base model weights. Create the file with the following content:
vi Modelfile_phi15 # Choose any name you want
# Modelfile to quantize Phi-1.5
FROM /path/to/your/phi-1.5/fp16/model
# IMPORTANT: Replace the path above with the actual location of your downloaded weights!
Step 2: Run the Quantization Command
Now, open your terminal in the directory where you created the Modelfile. Use the ollama create command with the --quantize (or -q) flag to perform the quantization, plus the -f flag to point at our Modelfile (needed because it isn’t named Modelfile). We’ll choose the q4_K_M quantization level (a popular K-means quantization offering a good balance of size and quality) and name our new quantized model phi15-q4km:
ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M
...
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
creating new layer sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
writing manifest
success
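At this point the quantized model lives in your local Ollama store. Two quick ways to sanity-check it (the exact output will vary on your machine):
ollama list              # the new model should be far smaller than the FP16 weights
ollama show phi15-q4km   # inspect details such as parameter count and quantization level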
Step 3: Run Your Quantized Model
That’s it! You can now run your smaller, faster Phi-1.5 model:
ollama run phi15-q4km
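You can also pass a prompt straight on the command line for a quick smoke test; the prompt here is just an example:
ollama run phi15-q4km "Explain quantization in one sentence."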
Bonus: Sharing Your Quantized Model 🚀
Enjoying your quantized model locally? All the more reason to share some love with the Ollama community. You can publish your model to the Ollama registry just like you’d share a Docker image on Docker Hub 🫡.
1. Rename (if needed):
Ensure the model name includes your Ollama username (e.g., myuser/phi15-q4km). Use ollama cp if you need to rename it:
ollama cp phi15-q4km your-ollama-username/phi15-q4km
2. Push:
Upload it to Ollama Hub (make sure you’ve added your public key to your ollama.com profile first):
ollama push your-ollama-username/phi15-q4km
Voila! Now others can easily download and run your quantized model using:
ollama run your-ollama-username/phi15-q4km
Conclusion
And there you have it! You’ve successfully performed digital liposuction on a powerful AI model using just a couple of Ollama commands. Now you can run your own quantized models without needing fancy GPUs or selling a kidney. Give it a try with your favorite models!
Happy shrinking!