Setting up a private local LLM with Ollama for use with OpenClaw: A Tale of Silent Failures
A narrative tutorial detailing the 48-hour debugging journey of running the massive OpenClaw 70B model locally via Ollama: tackling silent failures, CUDA mismatches, and multi-GPU tensor splitting.

I'll be honest: setting up local large language models (LLMs) rarely goes as smoothly as the documentation suggests. Last weekend, I decided to finally migrate my entire daily agentic coding flow off the cloud and onto my local hardware using Ollama and the newly released OpenClaw 70B architecture.
I wanted to share my adventure. It wasn't a seamless 15-minute install. It was a 48-hour descent into driver mismatches, silent failures, and out-of-memory kernel panics. But by the end, what I learned fundamentally changed how I deploy local inference.
Target Objective: Run OpenClaw 70B locally via Ollama with strict API compatibility for an autonomous coding agent, achieving at least 15 tokens/sec generation speed without heavily quantizing the model into oblivion.
The Hardware Baseline
Before we dive into the failures, context matters. You cannot run a 70B parameter model effectively on a 2018 ultrabook. Here is the exact rig I used for this investigation:
- CPU: AMD Ryzen 9 7950X (16-core, 32-thread)
- System RAM: 128GB DDR5 5200MHz
- GPU Ecosystem: 2x NVIDIA RTX 4090 (24GB VRAM each) connected via PCIe Gen4 x16
- Storage: 4TB WD Black SN850X NVMe (Critical for fast tensor loading)
- OS: Ubuntu 24.04 LTS (Kernel 6.8)
With 48GB of total VRAM available, a 70B model at 4-bit precision (roughly 39GB of pooled VRAM) should fit comfortably. Should.
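As a sanity check on that figure: q4_K_M quantization averages roughly 4.5 bits per weight (an approximation, since the mix of 4- and 6-bit blocks varies by tensor), so the weight footprint can be estimated in a couple of lines of Python:

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B parameters at ~4.5 bits/weight (assumed q4_K_M average)
print(f"~{quantized_weight_gb(70e9, 4.5):.1f} GB")  # ~39.4 GB
```

That lands within spitting distance of the 39GB blob Ollama actually downloads, before counting any KV cache or CUDA context overhead.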
Investigation Phase 1: The Initial Ollama Pull
Ollama has undeniably democratized local LLM access. The promise is intoxicatingly simple: you run one command, and the server handles the rest. My first attempt was pulling the default OpenClaw instruct variant.
# Attempt 1: The happy path
curl -fsSL https://ollama.com/install.sh | sh
ollama run openclaw:70b-instruct-q4_K_M
The 39GB blob downloaded seamlessly. The model initialized. I typed my first prompt: "Write a Python script to parse a CSV and upload it to Postgres."
And then... nothing.
The terminal hung. My system fans (usually deafening when the 4090s spin up) remained whisper-quiet. htop showed CPU usage pinned at exactly 100% on a single core, while nvtop confirmed zero GPU utilization. The inference was silently falling back to CPU execution, resulting in an agonizing 0.4 tokens per second.
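A quick way to catch this failure mode early is `ollama ps`, whose PROCESSOR column reports the CPU/GPU split of a loaded model. For continuous monitoring during generation, a small Python sketch like the following can flag the fallback (it assumes `nvidia-smi` is on PATH; the 10% threshold is an arbitrary heuristic, not a documented value):

```python
import subprocess

def parse_gpu_utilization(csv_text: str) -> list[int]:
    """Parse the output of:
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def inference_is_on_gpu(utilizations: list[int], threshold: int = 10) -> bool:
    """Heuristic: if every GPU sits below the threshold mid-generation,
    inference has likely fallen back silently to the CPU."""
    return any(u >= threshold for u in utilizations)

def live_check() -> bool:
    """Query the real GPUs; call this while a prompt is generating."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return inference_is_on_gpu(parse_gpu_utilization(out))
```

Had I run something like this during the first prompt, the zero-utilization readings would have screamed "CPU fallback" immediately instead of after minutes of staring at a hung terminal.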
"A silent failure is infinitely more dangerous than an explicit crash. It tricks you into believing the system is working, just slowly, while actively burning your patience and power budget."
Attempt #2: Diagnosing the CUDA Bridge
I killed the process. If Ollama wasn't seeing the GPUs, it was almost certainly a CUDA linking issue. Ubuntu 24.04 is notoriously picky about proprietary NVIDIA driver paths, depending on how the drivers were installed (apt versus the .run binary).
First, I checked the environment visibility:
nvidia-smi
# Output showed both RTX 4090s correctly addressed and idling at P8 state.
The GPUs were visible to the OS, but not to the Ollama binary. I discovered that Ollama dynamically links against whatever CUDA runtime libraries it finds on its library search path. Because I had multiple CUDA toolkit versions installed (a remnant of a previous PyTorch deep dive), the linker grabbed an incompatible libcudart.so that lacked support for the flash-attention kernels OpenClaw demands.
The Fix: I had to explicitly lock the shared library path for the Ollama systemd service.
sudo systemctl edit ollama.service
I added the specific environment overrides:
[Service]
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64"
Environment="CUDA_VISIBLE_DEVICES=0,1"
After reloading the daemon and restarting the service, I fired up the prompt again.
Phase 3: The PCIe Bottleneck and VRAM Splitting
This time, both GPUs spun up instantly. The tokens began flying across the screen... for exactly three seconds. Then, an abrupt hard crash.
llama.cpp: error: failed to allocate memory for tensor
ERROR [main] Out of memory (OOM)
How could this be? I had 48GB of VRAM, and the model only required 39GB!
This is where the reality of multi-GPU inference sets in. Ollama uses llama.cpp under the hood. When a model spans multiple discrete GPUs without NVLink (which NVIDIA dropped from the RTX 4000 series), the runtime has to synchronize intermediate activations between cards over the comparatively slow PCIe bus.
If the layer splitting isn't configured manually, llama.cpp attempts an equal 50/50 split of the weight tensors. However, it fails to account for the context window memory (KV Cache).
When querying OpenClaw with a massive block of code, the KV Cache spiked past the remaining 4.5GB of headroom on GPU 0, triggering the instant OOM crash.
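To see why the cache blows past a few gigabytes of headroom, here is the standard KV-cache arithmetic. The geometry is assumed Llama-70B-style (80 layers, grouped-query attention with 8 KV heads, head_dim 128); OpenClaw's real dimensions are a guess on my part:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors: 2 caches x layers x tokens x kv_heads x head_dim x dtype size."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed 70B geometry: 80 layers, GQA with 8 KV heads, head_dim 128, fp16 cache
size = kv_cache_bytes(n_layers=80, n_ctx=32000, n_kv_heads=8, head_dim=128)
print(f"~{size / 2**30:.1f} GiB of KV cache at a 32k context")  # ~9.8 GiB
```

Nearly 10 GiB of cache at a 32k context, more than double the 4.5GB of headroom that an even weight split leaves on GPU 0.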
Implementing the Asymmetric Split
The solution wasn't exposed in the main CLI; I had to build a custom Modelfile to manually dictate the tensor split and preserve KV-cache headroom on the primary GPU.
I created openclaw-custom.Modelfile:
FROM openclaw:70b-instruct-q4_K_M
# Offload all 80 transformer layers to the GPUs
PARAMETER num_gpu 80
PARAMETER num_ctx 32000
# The magic parameters to prevent KV OOM on multi-GPU setups:
# split the weights 40/60, leaving KV-cache headroom on GPU 0
PARAMETER split_mode row
PARAMETER tensor_split 40,60
Then I rebuilt the model locally:
ollama create openclaw-agent -f ./openclaw-custom.Modelfile
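The 40,60 ratio wasn't guesswork so much as budgeting: subtract the KV-cache reservation from the GPU that hosts it, then normalize what remains. A hypothetical helper (the function name and the 10GB reservation figure are purely illustrative):

```python
def tensor_split(free_vram_gb: list[float], kv_reserve_gb: float,
                 kv_gpu: int = 0) -> list[int]:
    """Derive per-GPU weight shares: subtract the KV reservation from the
    GPU hosting the cache, then normalize to integer percentages."""
    budget = [v - (kv_reserve_gb if i == kv_gpu else 0.0)
              for i, v in enumerate(free_vram_gb)]
    total = sum(budget)
    return [round(100 * b / total) for b in budget]

# Two 24GB cards, ~10GB reserved for KV cache on GPU 0
print(tensor_split([24.0, 24.0], kv_reserve_gb=10.0))  # [37, 63]
```

That 37/63 budget is in the same neighborhood as the 40,60 split that ended up working on my rig.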
Conclusion: Running at Scale
ollama run openclaw-agent
Success. The inference locked in at a blistering 18.4 tokens per second. The primary GPU handled the context caching seamlessly while the secondary GPU crunched the heavier weight tensors.
Setting up a private, massive-scale local LLM is rarely a plug-and-play experience once you step beyond the 8B parameter toy models. The abstractions that Ollama provides are brilliant, but they can actively obfuscate the underlying llama.cpp hardware errors.
If you are attempting to run a 70B model on a multi-GPU setup without NVLink, remember:
- Always aggressively verify your CUDA LD_LIBRARY_PATH.
- Hardware visibility (nvidia-smi) does not equal inference visibility.
- Your KV cache size will dictate your OOM crashes long before your weights do. Control your tensor splits manually.
My local agent architecture now operates fully independently of the cloud: secure, blindingly fast, and completely free of OpenAI's API rate limits. The silent failures were painful, but the resulting agency was entirely worth it.
