How to Quantize AI Models with Ollama CLI


Intro

You’ve probably fired up ollama run some-cool-model tons of times, effortlessly pulling models from Ollama’s model library or even directly from Hugging Face. But have you ever wondered how those CPU-friendly, quantized GGUF models actually land on places like Hugging Face in the first place?

What if I told you that you could contribute back with tools you might already be using? That’s right – Ollama isn’t just a slick way to run models; it’s also a tool for creating these quantized versions yourself!

Today, we’ll show you exactly how to take a base model (like the compact Phi-1.5) and quantize it using the Ollama CLI. Let’s check it out!

What is Quantization?

Quantization is a process that reduces the precision of a model’s weights (e.g., from 32-bit floating point, FP32, down to 8-bit integers, INT8). This makes the model:

  • Smaller: Consuming less disk space and RAM.
  • Faster: Requiring less computation during inference.

The trade-off: a potential reduction in accuracy, but for many use cases, the performance gains on modest hardware are well worth it. For a deeper dive, read our blog post on quantization (all you need to know).
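
As a rough back-of-the-envelope illustration (assuming roughly 1.3 billion parameters for Phi-1.5): at FP16 the weights alone take about 1.3B × 2 bytes ≈ 2.6 GB, while a 4-bit quantization shrinks that to around 1.3B × 0.5 bytes ≈ 0.65 GB, before accounting for metadata and any tensors kept at higher precision.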

How to Quantize with Ollama

Ollama can quantize FP16 or FP32 base models to different quantization levels using the ollama create command with the -q/--quantize flag.

Note: The ollama create command can also be used to import an already-quantized GGUF model directly.
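
For instance, a minimal Modelfile for importing an existing GGUF file could look like the sketch below (the file name is just a placeholder for whatever GGUF you already have on disk):

# Hypothetical Modelfile: import an already-quantized GGUF as-is
FROM ./some-model-q4_K_M.gguf

Running ollama create my-imported-model -f Modelfile then registers it locally – no --quantize flag needed.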

💡 Let’s walk through quantizing an FP16 (16-bit floating point) version of the Phi-1.5 model using the Ollama CLI.

Prerequisites

  • Ollama installed: Make sure you have the Ollama CLI installed and working. Windows | Linux | macOS
  • Base Model Weights: You need the unquantized (FP16 or FP32) weights for the model you want to quantize.

For this example, let’s assume you have the Phi-1.5 FP16 weights downloaded from Hugging Face to a local directory.
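
If you still need to fetch the weights, one option is the Hugging Face CLI (assuming you have the huggingface_hub CLI installed and want the weights from the microsoft/phi-1_5 repository):

pip install -U "huggingface_hub[cli]"                                    # provides the huggingface-cli tool
huggingface-cli download microsoft/phi-1_5 --local-dir ./phi-1.5-fp16    # download location is up to you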

Supported Quantizations

  • q4_0 | q4_1
  • q5_0 | q5_1
  • q8_0

K-means Quantizations:

  • q3_K_S | q3_K_M | q3_K_L
  • q4_K_S | q4_K_M
  • q5_K_S | q5_K_M
  • q6_K
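
(The _S, _M, and _L suffixes denote the small, medium, and large variants of each K-quant, trading a little extra size for quality.)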

Step 1: Create a Modelfile

After downloading the Phi-1.5 FP16 model, you need a simple text file; we’ll call it Modelfile_phi15. This file tells Ollama where to find the base model weights. Create the file with the following content:

vi Modelfile_phi15    # Choose any name you want
# Modelfile to quantize Phi-1.5
FROM /path/to/your/phi-1.5/fp16/model
# IMPORTANT: Replace the path above with the actual location of your downloaded weights!

Step 2: Run the Quantization Command

Now, open your terminal in the directory where you created the Modelfile. Use the ollama create command with the -f flag pointing at your Modelfile and the --quantize (or -q) flag to perform the quantization. We’ll choose the q4_K_M quantization level (a popular K-means quantization offering a good balance of size and quality) and name our new quantized model phi15-q4km:

ollama create phi15-q4km -f Modelfile_phi15 --quantize q4_K_M
...
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
creating new layer sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
writing manifest
success
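
Once Ollama reports success, you can confirm the new model is registered locally (and check its reduced on-disk size) with:

ollama list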

Step 3: Run Your Quantized Model

That’s it! You can now run your smaller, faster Phi-1.5 model:

ollama run phi15-q4km
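
Want to double-check what you built? On recent Ollama versions, ollama show prints the model’s details, which should include the quantization level:

ollama show phi15-q4km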

Bonus: Sharing Your Quantized Model 🚀

Enjoying your quantized model locally? All the more reason to share some love with the Ollama community. You can publish your model to the Ollama library, just like you’d share a Docker image on Docker Hub 🫡.

1. Rename (if needed):

Ensure the model name includes your Ollama username (e.g., myuser/phi15-q4km). Use ollama cp if you need to rename it:

ollama cp phi15-q4km your-ollama-username/phi15-q4km

2. Push:

Upload it to ollama.com (make sure you’ve added your Ollama public key to your ollama.com profile first):

ollama push your-ollama-username/phi15-q4km


Voila! Now others can easily download and run your quantized model using:

ollama run your-ollama-username/phi15-q4km

Conclusion

And there you have it! You’ve successfully performed digital liposuction on a powerful AI model using just a couple of Ollama commands. Now you can run your own quantized models without needing fancy GPUs or selling a kidney. Give it a try with your favorite models!

Happy shrinking!
