Live observability for self-hosted LLM fleets

Mission control for your
local + cloud AI fleet

AI Fleet Router watches every inference request in real time, tracks GPU health across every node, and routes traffic across your local Ollama boxes and the cloud — so the agents you run never wait on a busy GPU and never burn budget they don't need to.

// built for teams running fleets of AI agents on real hardware
AI FLEET ROUTER / overview
aifleetrouter.com
Requests: 2.4M (24h · 61.8k)
Tokens: 6.2B (24h · 184M)
Avg t/s: 64.2 (all time)
Inf. time: 318d (cumulative)
Clients: 184 (active · 31)
Local/Cloud: 84/16 (% split)
Live Feed · streaming
gemma4:26b · spark-02 · client · agent-research · 88.3 t/s · 10:07:46
qwen3.6:35b-a3b · acer-03 · client · openclaw-ops · 61.2 t/s · 10:07:54
gemma4:31b-cloud · gx10-1 · client · agent-support · 79.6 t/s · 10:07:58
spark-02 · DGX Spark #2 · HEALTHY · GPU 78% · MEM 85% · TEMP 49°
gx10-1 · GX10 Spark #1 · HEALTHY · GPU 41% · MEM 76% · TEMP 39°
12 GPU nodes · local + cloud
40 models · loaded across fleet
2.4M requests · routed
6.2B tokens · observed
100% self-hosted · your hardware
Why we built it

Running AI agents on your own GPUs
shouldn't be a black box

The moment you go past one model on one machine, you lose the plot. Here's the chaos AI Fleet Router was built to end.

01

No idea what's happening

Requests vanish into a cluster of boxes. Which node served it? How fast? Did it fall back to the cloud? You're flying blind.

02

Hot nodes, idle nodes

One Spark is pinned at 100% while three sit cold. Without a live view of GPU load, you can't balance the fleet you paid for.

03

Cloud bills you can't explain

Overflow quietly spills to paid cloud models. By month-end the invoice is a mystery and nobody can say which agent caused it.

04

No proof for the team

Leadership wants numbers — throughput, cost, uptime. Screenshotting terminals doesn't cut it. You need real reports.

What it does

One dashboard for the whole fleet

Every request, every node, every model — local and cloud — in a single real-time control plane.

📡

Live request feed

Watch inference stream in as it happens — model, node, client and tokens-per-second on every call, updating in real time.

// real-time
🖥️

Per-node fleet health

GPU, memory, CPU, disk, temperature and wattage for every backend — with loaded models and VRAM, all at a glance.

// gpu telemetry
🔀

Local + cloud routing

See exactly how traffic splits between your local GPUs and cloud models, and drain a node with one click for maintenance.

// smart routing
📊

Model performance

Average and peak t/s, time-to-first-token, request volume and token counts — ranked per model across the fleet.

// benchmarks
👥

Client analytics

Break usage down by client and agent. Know which workload drives load, tokens and spend over 24h, 7d or 30d.

// attribution
📄

One-click PDF reports

Export performance, model breakdown, request logs and TTFT analysis as a clean PDF — proof for the team in seconds.

// reporting
Inside the dashboard

Built like the terminal you live in

Dense, fast, and information-rich. No fluff — just the signal you need to run a fleet.

Every node, healthy or not — at a glance

The Fleet view gives each backend its own live card: utilization bars for GPU, memory, CPU and disk, plus temperature and power draw. Loaded models show their VRAM footprint, and a single Drain toggle pulls a node out of rotation cleanly.

  • Real-time GPU / MEM / CPU / DISK meters per node
  • Loaded models with warm/cold state and VRAM
  • Temp + wattage so you catch thermal throttling early
  • Health status and in-flight request counts
Fleet · backends · 12/12 healthy
spark-01 · DGX Spark #1 · HEALTHY · GPU 4% · MEM 85% · CPU 14% · DISK 9%
gx10-2 · GX10 Spark #2 · HEALTHY · GPU 62% · MEM 73% · CPU 20% · DISK 11%

Know which model is actually fast

The Models view ranks everything running on the fleet by throughput. Average and max tokens-per-second, time-to-first-token, total tokens and request counts — so you can right-size which model runs where, and spot the cloud models pulling their weight (or not).

  • Avg + peak t/s per model, across all nodes
  • Latency: average duration and TTFT
  • Local vs cloud models, side by side
  • 24h / 7d / 30d windows
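As a rough illustration of how a ranking like this could be derived from request logs, here is a minimal Python sketch. The `RequestLog` record, its field names, and `rank_models` are hypothetical and are not taken from AI Fleet Router. The sketch also assumes that "avg t/s" means the mean of per-request rates; a product could just as reasonably divide total tokens by total time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestLog:
    # Hypothetical log record; field names are assumptions for this sketch.
    model: str
    output_tokens: int
    duration_s: float  # time spent generating, in seconds


def rank_models(logs):
    """Group requests by model and rank models by average tokens per second."""
    rates = defaultdict(list)
    for r in logs:
        if r.duration_s > 0:  # skip malformed rows to avoid division by zero
            rates[r.model].append(r.output_tokens / r.duration_s)
    rows = [
        {
            "model": model,
            "avg_tps": sum(v) / len(v),  # mean of per-request rates
            "peak_tps": max(v),          # fastest single request
            "requests": len(v),
        }
        for model, v in rates.items()
    ]
    return sorted(rows, key=lambda row: row["avg_tps"], reverse=True)


logs = [
    RequestLog("model-a", 100, 2.0),
    RequestLog("model-a", 300, 3.0),
    RequestLog("model-b", 60, 1.0),
]
ranking = rank_models(logs)
```

Filtering `logs` by timestamp before calling `rank_models` would give the 24h, 7d and 30d windows.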
Models · performance · 7d
gemma4:26b · 88.3 avg t/s · 1.4B tok
kimi-k2.6 ☁ cloud · 64.1 avg t/s · 690M tok
gemma4:31b-cloud · 58.7 avg t/s · 212M tok
qwen3.6:35b-a3b · 54.2 avg t/s · 480M tok
glm-5.1 ☁ cloud · 49.2 avg t/s · 318M tok

Local first. Cloud when it counts.

The router keeps work on your own silicon by default and overflows to cloud models only when the fleet is saturated or a request needs a model you don't host. You see the split live — and the receipts at the end of the month.

  • Live local-vs-cloud traffic split
  • Per-tier request and token totals
  • Routing flow over the last 5m / 1h / 24h
  • One-click drain for clean maintenance
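The local-first rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the router's actual implementation: the `Node` type, the `route` function, and the 0.90 saturation threshold are all hypothetical, and the sketch assumes an overflowing request goes to a cloud model with the same name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical view of a local backend; not the product's data model.
    name: str
    gpu_util: float                      # 0.0 to 1.0
    models: set = field(default_factory=set)
    draining: bool = False               # drained nodes take no new work


def route(model, nodes, cloud_models, saturation=0.90):
    """Return ("local", node_name), ("cloud", model), or None if unroutable."""
    # Local nodes that host the model and are not drained for maintenance.
    hosts = [n for n in nodes if model in n.models and not n.draining]
    # Of those, keep only nodes below the (assumed) saturation threshold.
    ready = [n for n in hosts if n.gpu_util < saturation]
    if ready:
        # Local first: pick the least-loaded eligible node.
        return ("local", min(ready, key=lambda n: n.gpu_util).name)
    # Overflow: every host is saturated or drained, or the model is not hosted.
    if model in cloud_models:
        return ("cloud", model)
    return None


nodes = [
    Node("spark-01", 0.95, {"gemma4:26b"}),
    Node("spark-02", 0.40, {"gemma4:26b"}),
    Node("gx10-1", 0.20, {"qwen3.6:35b-a3b"}, draining=True),
]
cloud = {"kimi-k2.6"}
```

With these example nodes, a `gemma4:26b` request stays local on `spark-02`, a `kimi-k2.6` request goes to the cloud because no local node hosts it, and a `qwen3.6:35b-a3b` request returns `None` because its only host is drained and no cloud model matches.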
Routing · last 24h · 84% local
spark-01 · 31%
spark-02 · 27%
gx10-1 · 26%
local fleet
🔀
router
kimi-k2.6 ☁ · 9%
glm-5.1 ☁ · 7%
cloud overflow
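For readers who want to check the arithmetic behind a split like this one, here is a small Python sketch that sums per-target request counts into a local-versus-cloud percentage. The `traffic_split` helper is hypothetical, and the counts simply reuse the per-target percentages from the routing panel as if they were requests out of 100.

```python
def traffic_split(counts, cloud_targets):
    """Return the local and cloud share of requests as percentages."""
    total = sum(counts.values())
    if total == 0:
        return {"local": 0.0, "cloud": 0.0}
    cloud = sum(c for target, c in counts.items() if target in cloud_targets)
    return {
        "local": 100 * (total - cloud) / total,
        "cloud": 100 * cloud / total,
    }


# Illustrative counts taken from the routing panel percentages.
counts = {"spark-01": 31, "spark-02": 27, "gx10-1": 26, "kimi-k2.6": 9, "glm-5.1": 7}
split = traffic_split(counts, cloud_targets={"kimi-k2.6", "glm-5.1"})
```

Here the three local nodes sum to 84 and the two cloud models to 16, which matches the 84/16 split shown in the overview.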
The story

Why a marketer built a GPU router

AI Fleet Router didn't come from a lab. It came from needing to run a small army of AI agents — reliably, privately, and without a runaway cloud bill.

Jeff J Hunter
Built by Jeff J Hunter · Founder, VA Staffer

From AI Persona Method™ to a fleet of GPUs

Jeff has spent 11+ years scaling businesses with humans + automation — featured in Entrepreneur Magazine and Business Insider, creator of the AI Persona Method™, and the founder behind 1,000+ students building businesses that run without them.

As his team deployed 15+ AI Employees across messaging channels, the question stopped being "can AI do the work" and became "where does all this inference actually run?" Renting cloud tokens for every agent doesn't scale — so Jeff stood up a fleet of local GPU boxes to serve models privately. But a pile of Sparks with no visibility is just expensive guesswork. AI Fleet Router was the missing control plane — the dashboard that finally made the fleet observable, balanced, and accountable.

🦞
Early OpenClaw contributor & advocate. Jeff is an active supporter of OpenClaw — the open-source gateway that lets AI agents work across WhatsApp, Telegram, Discord and dozens more channels. Running 15+ AI Employees on OpenClaw + AI Persona OS, he's helped shape its security best practices and real-world deployment workflows — and AI Fleet Router is what keeps those agents fed with fast, local inference.
🌐 jeffjhunter.com 🦞 OpenClaw project 📬 TheTip.ai newsletter
Want this for your business?

Build your own AI-powered income

AI Fleet Router is the kind of infrastructure that runs a business on AI. Learn the playbook behind it — 8 proven ways to make money with AI, live group calls, and 100+ guides — inside AI Money Group.

// learn how Jeff runs 15+ AI Employees on his own fleet