Live observability for self-hosted LLM fleets

Mission control for your local + cloud AI fleet

AI Fleet Router watches every inference request in real time, tracks GPU health across every node, and routes traffic across your local Ollama boxes and the cloud — so the agents you run never wait on a busy GPU and never burn budget they don't need to.

Learn to make money with AI · See the dashboard ↓

// built for teams running fleets of AI agents on real hardware

AI FLEET ROUTER / overview

aifleetrouter.com

Requests: 2.4M (24h · 61.8k)
Tokens: 6.2B (24h · 184M)
Avg t/s: 64.2 (all time)
Inf. time: 318d (cumulative)
Clients: 184 (active · 31)
Local/Cloud: 84/16 (% split)

Live Feed · streaming

gemma4:26b · spark-02 · client: agent-research · 88.3 t/s · 10:07:46
qwen3.6:35b-a3b · acer-03 · client: openclaw-ops · 61.2 t/s · 10:07:54
gemma4:31b-cloud · gx10-1 · client: agent-support · 79.6 t/s · 10:07:58

spark-02 · DGX Spark #2 · HEALTHY · GPU 78% · MEM 85% · TEMP 49°
gx10-1 · GX10 Spark #1 · HEALTHY · GPU 41% · MEM 76% · TEMP 39°

12 GPU nodes · local + cloud

40 models · loaded across fleet

2.4M requests · routed

6.2B tokens · observed

100% self-hosted · your hardware

Why we built it

Running AI agents on your own GPUs shouldn't be a black box

The moment you go past one model on one machine, you lose the plot. Here's the chaos AI Fleet Router was built to end.

No idea what's happening

A request vanishes into a cluster of boxes. Which node served it? How fast? Did it fall back to the cloud? You're flying blind.

Hot nodes, idle nodes

One Spark is pinned at 100% while three sit cold. Without a live view of GPU load, you can't balance the fleet you paid for.

Cloud bills you can't explain

Overflow quietly spills to paid cloud models. By month-end the invoice is a mystery and nobody can say which agent caused it.

No proof for the team

Leadership wants numbers — throughput, cost, uptime. Screenshotting terminals doesn't cut it. You need real reports.

What it does

One dashboard for the whole fleet

Every request, every node, every model — local and cloud — in a single real-time control plane.

📡

Live request feed

Watch inference stream in as it happens — model, node, client and tokens-per-second on every call, updating in real time.

// real-time

🖥️

Per-node fleet health

GPU, memory, CPU, disk, temperature and wattage for every backend — with loaded models and VRAM, all at a glance.

// gpu telemetry

🔀

Local + cloud routing

See exactly how traffic splits between your local GPUs and cloud models, and drain a node with one click for maintenance.

// smart routing

📊

Model performance

Average and peak t/s, time-to-first-token, request volume and token counts — ranked per model across the fleet.

// benchmarks

👥

Client analytics

Break usage down by client and agent. Know which workload drives load, tokens and spend over 24h, 7d or 30d.

// attribution

📄

One-click PDF reports

Export performance, model breakdown, request logs and TTFT analysis as a clean PDF — proof for the team in seconds.

// reporting

Inside the dashboard

Built like the terminal you live in

Dense, fast, and information-rich. No fluff — just the signal you need to run a fleet.

Every node, healthy or not — at a glance

The Fleet view gives each backend its own live card: utilization bars for GPU, memory, CPU and disk, plus temperature and power draw. Loaded models show their VRAM footprint, and a single Drain toggle pulls a node out of rotation cleanly.

▸ Real-time GPU / MEM / CPU / DISK meters per node
▸ Loaded models with warm/cold state and VRAM
▸ Temp + wattage so you catch thermal throttling early
▸ Health status and in-flight request counts
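
One way per-node readings like these could be gathered on an NVIDIA box is to poll `nvidia-smi` and parse its CSV output. The sketch below is an assumption for illustration, not a description of how AI Fleet Router collects telemetry; `GpuSample`, `parse_sample` and `read_gpus` are hypothetical names. Readings the driver does not report (it prints values such as `[N/A]`) are kept as `None` rather than guessed.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional

# Fields requested from nvidia-smi; the order must match parse_sample below.
QUERY = "utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw"


@dataclass
class GpuSample:
    gpu_pct: Optional[float]   # GPU utilization, percent
    mem_pct: Optional[float]   # memory used / total, percent
    temp_c: Optional[float]    # temperature, degrees Celsius
    watts: Optional[float]     # power draw, watts


def _num(field: str) -> Optional[float]:
    """Convert one CSV field to float; unsupported readings become None."""
    try:
        return float(field.strip())
    except ValueError:
        return None


def parse_sample(line: str) -> GpuSample:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    util, used, total, temp, power = (_num(f) for f in line.split(","))
    mem_pct = None
    if used is not None and total:
        mem_pct = round(100.0 * used / total, 1)
    return GpuSample(util, mem_pct, temp, power)


def read_gpus() -> list[GpuSample]:
    """Poll the local GPUs once (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `read_gpus()` on a timer and shipping the samples to a central store would cover the GPU, memory, temperature and wattage meters; CPU and disk readings would need a separate source.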

Fleet · backends · 12/12 healthy

spark-01 · DGX Spark #1 · HEALTHY · GPU -- · MEM 85% · CPU 14% · DISK --
gx10-2 · GX10 Spark #2 · HEALTHY · GPU 62% · MEM 73% · CPU 20% · DISK 11%

Know which model is actually fast

The Models view ranks everything running on the fleet by throughput. Average and max tokens-per-second, time-to-first-token, total tokens and request counts — so you can right-size which model runs where, and spot the cloud models pulling their weight (or not).

▸ Avg + peak t/s per model, across all nodes
▸ Latency: average duration and TTFT
▸ Local vs cloud models, side by side
▸ 24h / 7d / 30d windows
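
As a rough illustration of how a per-model ranking like this can be computed, the sketch below aggregates request records by model. It assumes each record carries the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that Ollama returns with a completed generation; the record layout and the `rank_models` helper are hypothetical, and the dashboard's own calculation is not documented here.

```python
from collections import defaultdict


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed for one request; the duration is in nanoseconds."""
    return eval_count * 1e9 / eval_duration_ns


def rank_models(records: list[dict]) -> list[dict]:
    """Group request records by model and rank by average tokens/second.

    Each record is assumed to look like:
    {"model": str, "eval_count": int, "eval_duration": int}
    with eval_duration greater than zero.
    """
    per_model = defaultdict(list)
    for rec in records:
        per_model[rec["model"]].append(rec)

    rows = []
    for model, recs in per_model.items():
        rates = [
            tokens_per_second(r["eval_count"], r["eval_duration"]) for r in recs
        ]
        rows.append({
            "model": model,
            "requests": len(recs),
            "tokens": sum(r["eval_count"] for r in recs),
            "avg_tps": sum(rates) / len(rates),   # mean of per-request rates
            "peak_tps": max(rates),
        })
    # Fastest average first, matching a throughput-ranked view.
    return sorted(rows, key=lambda row: row["avg_tps"], reverse=True)
```

Averaging per-request rates weights every call equally. Dividing total tokens by total generation time would instead weight long generations more heavily, so the two definitions can rank models differently.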

Models · performance · 7d

gemma4:26b · 88.3 avg t/s · 1.4B tok
kimi-k2.6 ☁ cloud · 64.1 avg t/s · 690M tok
gemma4:31b-cloud · 58.7 avg t/s · 212M tok
qwen3.6:35b-a3b · 54.2 avg t/s · 480M tok
glm-5.1 ☁ cloud · 49.2 avg t/s · 318M tok

Local first. Cloud when it counts.

The router keeps work on your own silicon by default and overflows to cloud models only when the fleet is saturated or a request needs a model you don't host. You see the split live — and the receipts at the end of the month.

▸ Live local-vs-cloud traffic split
▸ Per-tier request and token totals
▸ Routing flow over the last 5m / 1h / 24h
▸ One-click drain for clean maintenance
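
The local-first policy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the router's actual logic: the `Node` fields, the `max_in_flight` saturation threshold, the least-loaded tie-break and the `cloud_fallback` mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    models: set[str]          # models this node hosts
    in_flight: int = 0        # requests currently being served
    max_in_flight: int = 4    # hypothetical saturation threshold
    healthy: bool = True
    drained: bool = False     # True while pulled out of rotation

    def can_serve(self, model: str) -> bool:
        return (
            self.healthy
            and not self.drained
            and model in self.models
            and self.in_flight < self.max_in_flight
        )


def route(model: str, nodes: list[Node],
          cloud_fallback: dict[str, str]) -> Optional[str]:
    """Pick a target for one request: local first, cloud only on overflow.

    Returns a local node name, "cloud:<model>" when no local node can take
    the request, or None when there is no cloud fallback either.
    """
    candidates = [n for n in nodes if n.can_serve(model)]
    if candidates:
        # Least-loaded eligible node keeps the fleet balanced.
        return min(candidates, key=lambda n: n.in_flight).name
    if model in cloud_fallback:
        return f"cloud:{cloud_fallback[model]}"
    return None
```

In this sketch, draining is just a flag: a drained node stops receiving new requests while its in-flight work finishes, and cloud is reached only when no local node is eligible or the model is not hosted locally.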

Routing · last 24h · 84% local

local fleet: spark-01 31% · spark-02 27% · gx10-1 26%
🔀 router
cloud overflow: kimi-k2.6 ☁ 9% · glm-5.1 ☁ 7%

The story

Why a marketer built a GPU router

AI Fleet Router didn't come from a lab. It came from needing to run a small army of AI agents — reliably, privately, and without a runaway cloud bill.


Jeff J Hunter

Built by Jeff J Hunter · Founder, VA Staffer

From AI Persona Method™ to a fleet of GPUs

Jeff has spent 11+ years scaling businesses with humans + automation. He has been featured in Entrepreneur Magazine and Business Insider, created the AI Persona Method™, and has taught 1,000+ students to build businesses that run without them.

As his team deployed 15+ AI Employees across messaging channels, the question stopped being "can AI do the work" and became "where does all this inference actually run?" Renting cloud tokens for every agent doesn't scale — so Jeff stood up a fleet of local GPU boxes to serve models privately. But a pile of Sparks with no visibility is just expensive guesswork. AI Fleet Router was the missing control plane — the dashboard that finally made the fleet observable, balanced, and accountable.

🦞

Early OpenClaw contributor & advocate. Jeff is an active supporter of OpenClaw — the open-source gateway that lets AI agents work across WhatsApp, Telegram, Discord and dozens more channels. Running 15+ AI Employees on OpenClaw + AI Persona OS, he's helped shape its security best practices and real-world deployment workflows — and AI Fleet Router is what keeps those agents fed with fast, local inference.

15+ AI Employees deployed

11+ years scaling with automation

1,000+ students taught

2× featured: Entrepreneur · Insider

🌐 jeffjhunter.com · 🦞 OpenClaw project · 📬 TheTip.ai newsletter

Want this for your business?

Build your own AI-powered income

AI Fleet Router is the kind of infrastructure that runs a business on AI. Learn the playbook behind it — 8 proven ways to make money with AI, live group calls, and 100+ guides — inside AI Money Group.

Join AI Money Group → · Explore the dashboard

// learn how Jeff runs 15+ AI Employees on his own fleet