Deploy Your Fine-Tuned Llama Model: From Hugging Face to Production with Ollama and Cloud Run

A complete guide to training, converting, and serving your own fine-tuned language model at scale.

Sep 02, 2025

TL;DR: What We're Building

In my previous article, I walked through the process of fine-tuning a Llama model for specific use cases and storing it on Hugging Face. Today, we're taking the next crucial step: deploying that model in production where it can serve real-world traffic efficiently and cost-effectively.


By the end of this guide, you'll have:

Converted the fine-tuned model on Hugging Face to GGUF format for optimal inference performance with Ollama.
Deployed the model to Google Cloud Run using L4 GPU acceleration for production-ready serving.
Created a scalable API endpoint that can handle concurrent requests with proper authentication.

This approach gives you complete control over your model while leveraging Google Cloud's infrastructure for reliable, scalable deployment.

Let’s Dive Deep into it!

The first step is to convert the fine-tuned model to GGUF, the format that Ollama loads for inference. The code below is written for a notebook, so lines starting with ! run as shell commands:

#Imports
from huggingface_hub import snapshot_download
from huggingface_hub import HfApi
api = HfApi()

#Download the model from HuggingFace
model_id="rsher60/llama3.2-1B-text2sql-finetuned"
snapshot_download(repo_id=model_id, local_dir="rsher60-hf",
                  local_dir_use_symlinks=False, revision="main")

#Clone the llama.cpp git repo
!git clone https://github.com/ggerganov/llama.cpp.git

# Install the Python requirements for the llama.cpp conversion script
!pip install -r llama.cpp/requirements.txt

# Convert the model in GGUF format
!python llama.cpp/convert_hf_to_gguf.py rsher60-hf \
  --outfile rsher60-llama3.2-1B-text2sql-finetuned.gguf \
  --outtype q8_0

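# Upload the GGUF file to a new Hugging Face repo
# (requires a Hugging Face token with write access, e.g. via `huggingface-cli login`)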
model_id = "rsher60/llama3.2-1B-text2sql-finetuned-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="rsher60-llama3.2-1B-text2sql-finetuned.gguf",
    path_in_repo="rsher60-llama3.2-1B-text2sql-finetuned.gguf",
    repo_id=model_id,
)

What It Does:

Downloads the original fine-tuned model from Hugging Face (rsher60/llama3.2-1B-text2sql-finetuned)
Converts it to GGUF format using the llama.cpp conversion script with Q8_0 quantization (8-bit weights: high quality at roughly half the size of the 16-bit model)
Creates a new repository (rsher60/llama3.2-1B-text2sql-finetuned-gguf) and uploads the converted file
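
As a rough check on the size claim, the sketch below estimates file sizes from the parameter count. Q8_0 stores weights in blocks of 32 int8 values plus one 16-bit scale, so it costs about 8.5 bits per weight rather than exactly 8. The parameter count of 1.24 billion is an assumption for Llama 3.2 1B, and the estimate ignores file metadata and any tensors that the converter keeps at higher precision.

```python
# Back-of-the-envelope GGUF size estimate (a sketch, not an exact accounting).

Q8_0_BLOCK_WEIGHTS = 32   # weights stored per Q8_0 block
Q8_0_BLOCK_BYTES = 34     # 32 int8 weights + one 2-byte fp16 scale
F16_BYTES_PER_WEIGHT = 2  # 16-bit floats

# Assumption: approximate parameter count for Llama 3.2 1B.
ASSUMED_PARAMS = 1_240_000_000


def q8_0_bytes(params: int) -> int:
    """Approximate bytes needed to store `params` weights in Q8_0."""
    return params * Q8_0_BLOCK_BYTES // Q8_0_BLOCK_WEIGHTS


def f16_bytes(params: int) -> int:
    """Bytes needed to store `params` weights as 16-bit floats."""
    return params * F16_BYTES_PER_WEIGHT


q8 = q8_0_bytes(ASSUMED_PARAMS)
f16 = f16_bytes(ASSUMED_PARAMS)
print(f"F16:  {f16 / 1e9:.2f} GB")
print(f"Q8_0: {q8 / 1e9:.2f} GB ({q8 / f16:.0%} of F16)")
```

Under these assumptions the Q8_0 file comes out at about 1.3 GB, a little over half of the 16-bit size, which is in line with the model-weight figure used in the memory planning later in this post.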

Deploying the Model on Cloud Run

We will be using Cloud Run, Google Cloud's serverless container platform. Before we proceed with the deployment, we need to do some capacity and resource planning.

Production Capacity Planning

Before deploying, it's crucial to understand realistic user behavior patterns, not theoretical maximums. Based on production data from similar deployments:

User Activity Breakdown:

Total registered users: 20,000
Daily Active Users (DAU): 4,000 (20% of registered users)
Peak concurrent users: 280-400 (7-10% of DAU during peak hours)
Average session duration: 8-12 minutes
Requests per session: 3-8 requests
Think time between requests: 2-4 minutes

Realistic Request Patterns:

Sustained RPS: ~2 requests per second
Peak burst RPS: ~53 requests per second (during traffic spikes)
Average request processing time: 2-5 seconds per inference
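
The sustained figure can be sanity-checked from the activity numbers above. This is a minimal sketch, assuming a closed-loop model in which each concurrent user sends one request, waits for the response, and then pauses for the think time. The function names are illustrative, and the second helper applies Little's law (requests in flight = arrival rate x time in system).

```python
def sustained_rps(concurrent_users: int, think_time_s: float, service_time_s: float) -> float:
    """Requests per second if every user issues one request per (think + service) cycle."""
    return concurrent_users / (think_time_s + service_time_s)


def in_flight_requests(rps: float, service_time_s: float) -> float:
    """Little's law: average number of requests being processed at once."""
    return rps * service_time_s


# Low end: 280 users, 4-minute think time, 5-second inference.
low = sustained_rps(280, think_time_s=240, service_time_s=5)
# High end: 400 users, 2-minute think time, 2-second inference.
high = sustained_rps(400, think_time_s=120, service_time_s=2)
print(f"Sustained RPS range: {low:.2f} to {high:.2f}")

# Requests in flight during a 53 RPS burst, for 2 s and 5 s inference times.
print(in_flight_requests(53, 2), in_flight_requests(53, 5))
```

With these inputs the sustained rate falls between roughly 1.1 and 3.3 requests per second, which brackets the ~2 RPS quoted above. At the 53 RPS burst rate, between 106 and 265 requests would be in flight at once, and that is the number the GPU fleet has to absorb.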

GPU Memory Architecture Analysis

Understanding NVIDIA L4 GPU memory allocation is critical for proper scaling:

NVIDIA L4 GPU (24GB VRAM) Memory Allocation:
├── Model weights: 1.33 GB (loaded once, shared across users)
├── Framework overhead (llama.cpp): 2.0 GB
├── System buffers: 1.5 GB  
├── Available for user sessions: 19.17 GB
└── Concurrent users per GPU: 50-80 users

Per-user memory requirements:
├── KV Cache (conversation context): 250-400 MB
├── Request processing buffer: 50-100 MB
└── Total per active user: 300-500 MB
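
The budget above can be reproduced with a few lines of arithmetic. This sketch works in megabytes and assumes 1 GB = 1000 MB; the constants are the figures from the two breakdowns, and the helper names are illustrative.

```python
# GPU memory budget for one NVIDIA L4, in MB (assumption: 1 GB = 1000 MB).
TOTAL_VRAM_MB = 24_000
MODEL_WEIGHTS_MB = 1_330
FRAMEWORK_OVERHEAD_MB = 2_000
SYSTEM_BUFFERS_MB = 1_500


def available_for_sessions_mb() -> int:
    """VRAM left for user sessions after the fixed costs are subtracted."""
    return TOTAL_VRAM_MB - MODEL_WEIGHTS_MB - FRAMEWORK_OVERHEAD_MB - SYSTEM_BUFFERS_MB


def users_per_gpu(per_user_mb: int) -> int:
    """Whole number of users that fit in the session budget."""
    return available_for_sessions_mb() // per_user_mb


print(f"Available for sessions: {available_for_sessions_mb()} MB")
for label, low_mb, high_mb in [
    ("KV cache only", 250, 400),
    ("KV cache + request buffer", 300, 500),
]:
    # The heaviest per-user footprint gives the lowest user count, and vice versa.
    print(f"{label}: {users_per_gpu(high_mb)}-{users_per_gpu(low_mb)} users per GPU")
```

The available memory comes out at 19,170 MB, matching the 19.17 GB above. Dividing by the full per-user total gives 38 to 63 users per GPU, while dividing by the KV cache alone gives 47 to 76, so the 50-80 range quoted above sits closer to the KV-cache-only case and is best treated as an upper bound.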

Infrastructure Sizing Recommendations

Phase 1: Initial Production Deployment

Base capacity: 6-8 NVIDIA L4 GPUs
Supports: Up to 480 concurrent users
Headroom: 35-40% for traffic spikes
Auto-scaling trigger: When sustained concurrency exceeds 400 users

Phase 2: Scaled Production (Based on Usage Data)

Full capacity: 12-15 NVIDIA L4 GPUs
Peak concurrent support: 720-1,200 users
High availability: Includes 20% redundancy overhead
Cost optimization: Scale down during off-peak hours
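
The phase totals are consistent with 60 to 80 concurrent users per GPU. That per-GPU range is inferred from the quoted totals rather than stated explicitly, so treat it as an assumption. The sketch below reproduces the totals and adds a hypothetical helper for sizing a fleet from a peak-user target; the 20% headroom in the example is an illustrative value, not a recommendation.

```python
def supported_users(gpus: int, users_per_gpu: int) -> int:
    """Concurrent users a fleet can hold at a given per-GPU density."""
    return gpus * users_per_gpu


def gpus_needed(peak_users: int, users_per_gpu: int, headroom_pct: int) -> int:
    """GPUs required for `peak_users` plus headroom, rounded up (integer math)."""
    required = peak_users * (100 + headroom_pct)
    per_gpu = users_per_gpu * 100
    return -(-required // per_gpu)  # ceiling division


# Phase 1: 6-8 GPUs
print(supported_users(6, 80), supported_users(8, 60))
# Phase 2: 12-15 GPUs
print(supported_users(12, 60), supported_users(15, 80))
# Fleet size for 400 peak users with 20% headroom (illustrative)
print(gpus_needed(400, 80, 20), gpus_needed(400, 60, 20))
```

Both Phase 1 configurations land on 480 users, Phase 2 spans 720 to 1,200, and the sizing helper returns 6 to 8 GPUs for a 400-user peak with 20% headroom.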

Cloud Run Deployment Architecture

Google Cloud Run offers serverless GPUs with NVIDIA L4 support, providing pay-per-second billing and automatic scaling to zero. This architecture offers several compelling advantages:

Technical Benefits:

Cost efficiency: Pay only for compute time used, with scale-to-zero capability
Auto-scaling: Handle traffic spikes automatically while maintaining low latency
Managed infrastructure: No Kubernetes complexity while retaining enterprise features
Fast cold starts: GPU instances with drivers pre-installed start in approximately 5 seconds
Built-in security: IAM authentication keeps model endpoints private by default

Production Dockerfile Configuration

This production-ready Dockerfile incorporates best practices for reliability and performance:


FROM ollama/ollama:latest

# Install required tools
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Set your Hugging Face model repository and specific GGUF file
ENV HF_REPO="rsher60/llama3.2-1B-text2sql-finetuned-gguf"
ENV GGUF_FILE="rsher60-llama3.2-1B-text2sql-finetuned.gguf"
ENV MODEL_NAME="llama3.2-1B-text2sql-finetuned-gguf"

# Create models directory
RUN mkdir -p /models

# Download the GGUF model from Hugging Face
RUN wget -O /models/${GGUF_FILE} \
    "https://huggingface.co/${HF_REPO}/resolve/main/${GGUF_FILE}"

# Create a Modelfile for Ollama to use the GGUF model
RUN echo "FROM /models/${GGUF_FILE}" > /tmp/Modelfile

# Start Ollama service, create the model, then stop the background service
RUN ollama serve & \
    sleep 10 && \
    ollama create ${MODEL_NAME} -f /tmp/Modelfile && \
    pkill ollama

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build and deploy the service to Cloud Run by running the following command from the directory that contains the Dockerfile:

gcloud run deploy ollama-rsher60-finetuned \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600
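
To see what this configuration can serve, the sketch below relates the --concurrency and --max-instances flags to the request figures from the capacity planning section. It assumes each in-flight request occupies one concurrency slot for the full 2-5 second inference time; the function names are illustrative.

```python
import math


def max_rps(instances: int, concurrency: int, service_time_s: float) -> float:
    """Upper bound on throughput when every concurrency slot stays busy."""
    return instances * concurrency / service_time_s


def instances_needed(target_rps: float, service_time_s: float, concurrency: int) -> int:
    """Instances required so that in-flight requests fit into the available slots."""
    in_flight = target_rps * service_time_s  # Little's law
    return math.ceil(in_flight / concurrency)


# The deploy command above: 1 instance, 4 concurrent requests.
print(max_rps(1, 4, 5.0), max_rps(1, 4, 2.0))

# Instances needed for a 53 RPS burst at the same concurrency.
print(instances_needed(53, 2.0, 4), instances_needed(53, 5.0, 4))
```

With one instance and a concurrency of 4, the ceiling is roughly 0.8 to 2 requests per second, so the command as written suits testing and light traffic. If the planning figures hold, serving the 53 RPS burst at this concurrency would take on the order of 27 to 67 instances, which means raising --max-instances (and possibly --concurrency together with OLLAMA_NUM_PARALLEL) before production traffic arrives.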

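To test the deployment, start a local proxy to the authenticated service, then send a request with curl from a second terminal: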

gcloud run services proxy ollama-rsher60-finetuned --port=9090


curl http://localhost:9090/api/generate -d '{
  "model": "llama3.2-1B-text2sql-finetuned-gguf",
  "prompt": "Write a query to calculate the number of Mondays in the calendar year 2025"
}'
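
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the proxy from the previous command is listening on localhost:9090. The helper names are illustrative. It sets "stream" to false so that Ollama returns a single JSON object, and the parser also accepts the default streamed output (one JSON object per line).

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(body: bytes) -> str:
    """Join the "response" fields from a single JSON object or a line-delimited stream."""
    chunks = []
    for line in body.decode("utf-8").splitlines():
        if line.strip():
            chunks.append(json.loads(line).get("response", ""))
    return "".join(chunks)


def generate(base_url: str, model: str, prompt: str, timeout_s: float = 600) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_text(response.read())


# Example call (requires the proxy to be running):
# print(generate("http://localhost:9090",
#                "llama3.2-1B-text2sql-finetuned-gguf",
#                "Write a query to calculate the number of Mondays in the calendar year 2025"))
```

The 600-second default timeout mirrors the --timeout value in the deploy command, which leaves room for a cold start on the first request.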

Why This Architecture Matters

The combination of Ollama and Cloud Run with GPUs provides several compelling advantages:

Cost Efficiency: Pay only for the compute time you use with Cloud Run's scale-to-zero model
GPU Acceleration: Leverage NVIDIA L4 GPUs for fast inference without managing infrastructure
Auto-scaling: Automatically handle traffic spikes while maintaining low latency
Simplified Deployment: Skip the complexity of Kubernetes while retaining enterprise features
Security: Built-in IAM authentication keeps your model endpoints private by default

Conclusion

Deploying fine-tuned Llama models on Google Cloud Run with Ollama provides a powerful combination of simplicity, scalability, and cost-effectiveness. This architecture allows you to focus on improving your model's performance rather than managing infrastructure complexity.
Because Cloud Run scales to zero, you pay only while instances are running, making it economical for both development and production workloads. The GPU acceleration ensures fast response times, while the built-in scaling handles traffic variations automatically.
As you continue developing and refining your models, this deployment pattern provides a solid foundation that can evolve with your needs—from prototype to production scale.

Riddhiman Sherlekar
