
The AI Cost Problem Nobody Wants to Talk About

Three days ago I published a post about how we accidentally built a hyperagent -- a self-improving research platform that runs autonomously across nine domains. The response was interesting. Some people loved it. A consortium of security professionals reviewed it and gave us some honest, useful feedback. The gist: "cool story, but break down the actual economics."

Fair enough. Let's talk about money.

Every company I talk to that's serious about AI is spending $10,000 to $50,000 per month on API costs. OpenAI, Anthropic, Google -- the token meters run constantly, and the bills arrive like clockwork. I've seen startups burning through their seed rounds on inference costs alone. I've watched enterprise teams artificially limit how many queries their analysts can run per day because the per-token pricing makes unconstrained usage financially terrifying.

That's insane. You build a tool to make people more productive, then you ration it so the bill doesn't bankrupt you. The tool works best when you use it the most, but you can't afford to use it the most. It's a trap.

We don't have that problem. Our incremental inference cost is $0. Not "approximately zero." Not "negligible." Zero dollars. Zero cents. The platform runs 24 hours a day, has produced more than 25,000 structured findings to date, and costs us nothing beyond the electricity to keep the machines on.

Here's exactly how.

The Economics, Line by Line

I'm going to be specific because vague claims are worthless. These are real numbers from a system that's been running continuously since January 2026.

Monthly Cost Breakdown
LLM Inference (Ollama Cloud / Abacus bulk) -- $0/mo
Embedding Generation (M3 Max local) -- $0/mo
Database (Vultr VPS, PostgreSQL) -- $24/mo (already existed for other workloads)
Hardware Depreciation -- $0 incremental (owned before the project started)
Electricity -- ~$15-20/mo estimated (6 machines, not all running 24/7)
Total New Spend -- $0/mo (everything was already running and paid for)

The $24/month VPS was hosting our PostgreSQL database for SOC operations before any of this started. The hardware was bought for other purposes -- security operations, development, gaming. We didn't buy a single piece of equipment for this project. We just pointed existing machines at a new problem.

Compare that to what it would cost at retail API rates. We've generated over 25,000 structured findings from 12,000+ research papers. Each paper involves three parallel LLM workers plus a captain synthesis pass plus a topic discovery pass -- call it five inference calls per paper at roughly 4,000 tokens each. That's approximately 240 million tokens processed. At GPT-4-class pricing ($10-30 per million input tokens, $30-60 per million output), that works out to somewhere between $5,000 and $13,000 in API costs over the three months this system has been running, depending on how the tokens split between input and output. Thousands of dollars a month, every month, for as long as it keeps running.
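If you want to check that math, here's the back-of-envelope version as a runnable sketch. The paper count, call count, and token figure are the rough estimates above; the input/output split is an assumption, which is why the range stays wide.

```typescript
// Back-of-envelope retail API cost for the same workload.
// All inputs are the rough estimates quoted above, not measured billing data.
const papers = 12_000;        // research papers generated so far
const callsPerPaper = 5;      // 3 workers + captain synthesis + topic discovery
const tokensPerCall = 4_000;  // rough average per call, input + output combined

const totalTokens = papers * callsPerPaper * tokensPerCall; // 240,000,000

// GPT-4-class retail pricing, dollars per million tokens.
const price = { inputLow: 10, inputHigh: 30, outputLow: 30, outputHigh: 60 };

// Cheap case: half the tokens are input, everything priced at the low band.
const cheap =
  (totalTokens / 1e6) * (0.5 * price.inputLow + 0.5 * price.outputLow);

// Expensive case: output-heavy (75%), everything priced at the high band.
const expensive =
  (totalTokens / 1e6) * (0.25 * price.inputHigh + 0.75 * price.outputHigh);

console.log(`~${totalTokens / 1e6}M tokens processed`);
console.log(`Retail estimate: $${cheap.toFixed(0)} to $${expensive.toFixed(0)} over three months`);
// -> ~240M tokens, roughly $4,800 to $12,600
```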

We spent zero.

12,000+ -- Research papers generated
25,000+ -- Structured findings extracted
~240M -- Tokens processed (estimated)
$0 -- API costs paid

How: Open-Source Models and Free Inference

The secret isn't some proprietary trick. It's that the open-source LLM ecosystem has gotten absurdly good, and most companies haven't noticed yet because they're locked into API contracts with the big providers.

Our research workers run on models like Qwen 3 (80B), GPT-OSS (120B), and DeepSeek v3.1 (671B) -- all served through Ollama Cloud endpoints. These aren't toy models. DeepSeek v3.1 at 671 billion parameters produces research synthesis that's genuinely difficult to distinguish from Claude or GPT-4 output on knowledge-heavy tasks. The captain synthesis layer uses DeepSeek v3.1 and Mistral Large 3 (675B). These are frontier-class models available at zero cost through free inference tiers.

"Free inference" sounds too good to be true, so let me explain what's actually happening. Providers like Abacus AI offer bulk inference at rates so low that for our volume, they round to zero. Ollama Cloud provides free endpoints for open-source models. We route through LiteLLM, which gives us a unified API layer that can failover between providers automatically. If one endpoint rate-limits us, the next request goes somewhere else. The system doesn't care which provider serves a given request. It cares about uptime and quality.

Optionally, we run a Claude Sonnet pass as a "queen" polish layer on the highest-value outputs. When it's available, it adds a layer of refinement. When it's rate-limited, the system gracefully skips it and the output is still excellent. The queen is nice to have, not a dependency.
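To make the routing concrete, here's a stripped-down sketch of the failover pattern in TypeScript. This isn't our coordinator code and it isn't LiteLLM's config format; it's the shape of the idea -- a list of OpenAI-compatible endpoints tried in order, with the URLs and model names below as placeholders.

```typescript
// Minimal failover sketch: try each OpenAI-compatible endpoint in order
// and return the first successful completion. Endpoint URLs and model
// names are placeholders, not production config.
interface Endpoint { baseUrl: string; model: string; apiKey?: string }

const endpoints: Endpoint[] = [
  { baseUrl: "http://rog18.local:4000/v1", model: "deepseek-v3.1" }, // e.g. a LiteLLM gateway
  { baseUrl: "https://ollama.example.com/v1", model: "qwen3:80b" },  // hypothetical fallback
];

async function complete(prompt: string): Promise<string> {
  for (const ep of endpoints) {
    try {
      const res = await fetch(`${ep.baseUrl}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          ...(ep.apiKey ? { Authorization: `Bearer ${ep.apiKey}` } : {}),
        },
        body: JSON.stringify({
          model: ep.model,
          messages: [{ role: "user", content: prompt }],
        }),
      });
      if (!res.ok) continue; // rate-limited or down: try the next provider
      const data = await res.json();
      return data.choices[0].message.content;
    } catch {
      continue; // network error: try the next provider
    }
  }
  throw new Error("all inference endpoints failed");
}
```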

The Model Quality Question

I know what you're thinking: "Open-source models can't be as good as GPT-4 or Claude." And for some tasks, you're right. For single-turn creative writing or complex multi-step reasoning, the commercial models still have an edge. But here's what we've learned from running this system for months:

For research synthesis, the gap is nearly gone. When you have three different 80B+ models researching the same topic from different angles, and a 671B model synthesizing them, the output quality matches or exceeds what you'd get from a single GPT-4 call. The multi-model consensus approach compensates for individual model weaknesses. One model misses a nuance; another catches it. The captain keeps the best from each.
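In code, the worker/captain pattern is a parallel fan-out followed by one synthesis call. The sketch below illustrates the shape, not our production pipeline; the gateway URL and model identifiers are placeholders.

```typescript
// Sketch of the worker/captain pattern: three models research the same
// topic in parallel, a larger model synthesizes their drafts.
const GATEWAY = "http://localhost:4000/v1/chat/completions"; // placeholder gateway

async function ask(model: string, prompt: string): Promise<string> {
  const res = await fetch(GATEWAY, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function research(topic: string): Promise<string> {
  // Fan out: each worker attacks the topic from a different angle.
  const workers = ["qwen3:80b", "gpt-oss:120b", "deepseek-v3.1"];
  const drafts = await Promise.all(
    workers.map((m, i) => ask(m, `Research angle ${i + 1} on: ${topic}`)),
  );

  // Captain pass: a frontier-class open model merges the drafts,
  // keeping the strongest points from each and resolving conflicts.
  return ask(
    "deepseek-v3.1",
    `Synthesize these three drafts into one paper on "${topic}":\n\n` +
      drafts.map((d, i) => `--- Draft ${i + 1} ---\n${d}`).join("\n\n"),
  );
}
```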

We've done blind comparisons. When I can't reliably tell whether a research paper came from our open-source pipeline or a Claude Opus pass, the quality argument stops being relevant.

The Mesh: Six Nodes, Zero Planning

Our compute infrastructure sounds impressive in aggregate -- 90+ CPU cores, 296 GB of RAM, three GPUs -- but the individual components are just stuff we already had.

The Mesh
mac00 -- M3 Max, 64GB RAM. Primary embedding engine. 330 embeddings/minute via nomic-embed-text.
mac01 -- M3 Max, 64GB RAM. Orchestrator and coordinator. Runs the daemon.
rog18 -- i9-14900HX + RTX 4060 8GB. GPU inference node. LiteLLM gateway.
loki -- RTX 4070. GPU inference overflow. Handles burst capacity.
amd00 -- AMD workstation. CPU-heavy tasks and batch processing.
cr-vultr01 -- Cloud VPS. PostgreSQL primary with pgvector. 24/7 persistence layer.

The mac00 is a workstation we bought for video editing and security operations. The gaming laptops were, well, gaming laptops. The Vultr box was running our SOC database. None of this was purchased for research. We just wrote a topology file that describes what each machine is good at, and a coordinator script that routes work accordingly.

GPU work goes to the RTX machines. Embeddings go to the M3 Max because Apple Silicon is absurdly efficient at that workload -- 330 embeddings per minute, sustained, while the machine runs other tasks. Heavy synthesis goes to the high-RAM nodes. Database writes go to the cloud box because it needs to be reachable from everywhere.

The glue is SSH, Node.js scripts, and a JSON topology file. That's it. No Kubernetes. No Terraform. No cloud orchestration platform. Just six machines that know about each other and a coordinator that knows what each one is good at.
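Here's a toy version of the topology file and the routing rule. The hostnames mirror the mesh above; the capability tags and the RAM figures for the non-Mac nodes are illustrative, not our real config.

```typescript
// Toy topology + routing. In production the node list lives in a JSON file
// the coordinator reads at startup; capability tags and some RAM figures
// below are placeholders.
type Capability = "gpu" | "embeddings" | "synthesis" | "cpu-batch" | "database";

interface Node { host: string; ramGB: number; capabilities: Capability[] }

const topology: Node[] = [
  { host: "mac00",      ramGB: 64, capabilities: ["embeddings"] },
  { host: "mac01",      ramGB: 64, capabilities: ["synthesis"] },
  { host: "rog18",      ramGB: 32, capabilities: ["gpu"] },
  { host: "loki",       ramGB: 32, capabilities: ["gpu"] },
  { host: "amd00",      ramGB: 64, capabilities: ["cpu-batch"] },
  { host: "cr-vultr01", ramGB: 8,  capabilities: ["database"] },
];

// Route a task to a node that advertises the needed capability,
// preferring the one with the most RAM. No scheduler, no Kubernetes.
function pickNode(need: Capability): Node {
  const candidates = topology
    .filter((n) => n.capabilities.includes(need))
    .sort((a, b) => b.ramGB - a.ramGB);
  if (candidates.length === 0) throw new Error(`no node can handle: ${need}`);
  return candidates[0];
}

console.log(pickNode("embeddings").host); // -> mac00
```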

I'm not going to pretend this is elegant. It's held together with PM2 process management and a launchd service that restarts the coordinator on boot. But it's been running continuously since January without anyone babysitting it. The coordinator handles crash recovery -- if a worker dies mid-research, it resets the topic to "pending" and picks it up again on the next cycle. Simple, boring, reliable.
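The crash-recovery sweep is just as boring. Something like the sketch below, assuming a topics table with status and claimed_at columns; the column names and the one-hour staleness threshold are illustrative.

```typescript
// Sketch of the crash-recovery sweep. Table/column names and the one-hour
// threshold are illustrative, not the actual schema.
import { Client } from "pg";

async function recoverStalledTopics(client: Client): Promise<number> {
  // Any topic claimed by a worker a while ago but never marked complete
  // is assumed dead and returned to the queue.
  const result = await client.query(
    `UPDATE topics
        SET status = 'pending', claimed_at = NULL
      WHERE status = 'in_progress'
        AND claimed_at < NOW() - INTERVAL '1 hour'`,
  );
  return result.rowCount ?? 0; // number of topics put back in the queue
}
```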

What It Produces

Numbers are nice but they don't mean anything without context. Here's what 12,000 research papers and 25,000 findings actually look like in practice.

The system covers ten domains right now, each with its own configuration, seed topics, and output directory.

Each domain runs on a weighted round-robin schedule. Cyber threat intel and consciousness studies get double weight because they produce the most follow-up questions. The coordinator cycles through: cloud, cyber, cyber, GRC, political accountability, health, pentesting, consciousness, consciousness. Then repeats. Forever.
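In code, that scheduler is nothing fancier than a fixed array and a counter. The domain slugs below are illustrative.

```typescript
// Weighted round-robin as a fixed cycle. Domains that spawn the most
// follow-up questions simply appear twice.
const cycle = [
  "cloud",
  "cyber", "cyber",
  "grc",
  "political-accountability",
  "health",
  "pentesting",
  "consciousness", "consciousness",
];

function* domainScheduler(): Generator<string> {
  let i = 0;
  while (true) {
    yield cycle[i % cycle.length];
    i++;
  }
}

// Usage: each daemon cycle pulls the next domain and researches one
// pending topic from it.
const next = domainScheduler();
console.log(next.next().value); // -> "cloud"
console.log(next.next().value); // -> "cyber"
```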

Every completed paper goes through the topic discoverer, which reads the output and generates three to five new research questions. Those go into the queue. Right now there are over 19,000 pending topics that the system discovered on its own from a few hundred initial seeds. The knowledge base is self-expanding at a rate that will take years to exhaust, even running 24/7.
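The discoverer itself is one prompt and a few inserts. A simplified sketch, with the gateway URL, model name, and table names standing in for whatever you actually run:

```typescript
// Sketch of the topic discoverer: read a finished paper, ask a model for
// a handful of follow-up questions, queue them as pending topics.
// Gateway URL, model, and table/column names are illustrative.
import { Client } from "pg";

async function discoverTopics(client: Client, domain: string, paper: string): Promise<void> {
  const res = await fetch("http://localhost:4000/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "deepseek-v3.1",
      messages: [{
        role: "user",
        content: `List 3-5 follow-up research questions raised by this paper, one per line:\n\n${paper}`,
      }],
    }),
  });
  const data = await res.json();
  const questions: string[] = data.choices[0].message.content
    .split("\n")
    .map((q: string) => q.trim())
    .filter((q: string) => q.length > 0);

  for (const q of questions) {
    await client.query(
      `INSERT INTO topics (domain, question, status) VALUES ($1, $2, 'pending')`,
      [domain, q],
    );
  }
}
```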

The Embedding Pipeline

Raw text is useless at scale. You need semantic search -- the ability to find things by meaning, not just keywords. Every finding we extract gets vector-embedded using nomic-embed-text running locally on the M3 Max.

330 embeddings per minute. That's the sustained rate. The vectors are 768-dimensional, stored in PostgreSQL with pgvector and indexed with IVFFlat for fast approximate nearest-neighbor search. When an analyst asks "what do we know about identity-based attacks that bypass MFA," the system doesn't do keyword matching. It finds the 50 most conceptually similar findings across all ten domains, regardless of the exact words used.
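Under the hood that lookup is two steps: embed the question locally, then let pgvector rank findings by cosine distance. A simplified sketch, with illustrative table and column names:

```typescript
// Sketch of the semantic search path: embed the analyst's question locally
// with nomic-embed-text via Ollama, then pull the 50 nearest findings from
// pgvector. Table and column names are illustrative.
import { Client } from "pg";

async function embed(text: string): Promise<number[]> {
  // Ollama's embeddings endpoint; nomic-embed-text returns 768-dim vectors.
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}

async function similarFindings(client: Client, question: string) {
  const vector = await embed(question);
  // pgvector's cosine-distance operator (<=>); the IVFFlat index makes this
  // an approximate nearest-neighbor scan rather than a full table scan.
  const { rows } = await client.query(
    `SELECT domain, finding, embedding <=> $1::vector AS distance
       FROM findings
      ORDER BY embedding <=> $1::vector
      LIMIT 50`,
    [`[${vector.join(",")}]`],
  );
  return rows;
}
```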

We've had it surface connections between pen testing evasion techniques and GRC compliance gaps that no human would have thought to link. A finding about a specific Active Directory attack path showed up as semantically related to a compliance framework gap in our GRC domain. That's not something you'd find with grep. That's the whole point of embeddings.

330/min -- Embedding generation rate
768-dim -- Vector dimensionality
10 -- Research domains
19,000+ -- Auto-discovered topics queued

Why This Matters for Small Companies

Here's the thing that bothers me about the current AI landscape: it's being built to benefit companies that can afford $50,000/month in API costs. The narrative is that you need GPT-4 or Claude for anything serious, and those models are metered by the token. The more you use them, the more you pay. The companies that benefit most from AI are the ones that can afford to use it without flinching at the bill.

That's backwards. Small companies -- small SOCs, small consultancies, small research teams -- are the ones who need force multiplication the most. A four-person security team doesn't have the luxury of dedicated threat researchers. A small MSSP can't afford a full-time compliance analyst who does nothing but track framework updates. These are exactly the organizations where AI should be doing the most work.

But at $10-30 per million tokens, they can't afford to run it at the volume that actually makes a difference.

Our system proves there's another way. The open-source models are good enough. The free inference endpoints are reliable enough. The hardware most companies already own is powerful enough. You don't need a data center. You don't need an OpenAI enterprise contract. You need a topology file, a coordinator script, and the willingness to wire some machines together.

The barrier to autonomous AI research is no longer cost. It's imagination.

Honest Caveats

I'd be doing you a disservice if I didn't mention the downsides, because there are some, and they're real.

Free inference endpoints aren't guaranteed. Ollama Cloud and Abacus could change their pricing tomorrow. We've designed for this -- LiteLLM can failover between providers, and we can fall back to running models locally on our GPU nodes. The throughput drops, but the system doesn't stop. If every free endpoint disappeared overnight, we'd still be running, just slower.

The hardware isn't actually free. I said "zero incremental cost" and I meant it -- we didn't buy anything for this project. But those machines cost money when we bought them. If you're starting from scratch, you're looking at real hardware costs. A refurbished Mac Mini with an M-series chip runs about $600. A used gaming laptop with an RTX card is $800-1,200. You could build a minimal version of this mesh for under $2,000 in hardware. That's still vastly cheaper than three months of API costs, but it's not free.

Setup is not turnkey. This took real engineering time to build. The coordinator, the worker pipeline, the topic discoverer, the embedding pipeline, the LiteLLM routing, the crash recovery -- that's weeks of development work. We're a security company that happens to have strong engineering capability. Not everyone does. We're thinking about how to make parts of this available as open-source tooling, but we're not there yet.

Quality varies by domain. Open-source models are excellent at research synthesis and knowledge extraction. They're less reliable for tasks that require very precise reasoning, mathematical proof, or nuanced ethical judgment. We use commercial models (Claude, specifically) for our highest-stakes security work. The daemon handles the volume. The premium models handle the critical path.


The Compounding Effect

After the hyperagent post, the consortium review pushed us to articulate why this matters beyond "look at our cool system." Here's my honest answer.

The value isn't in any single research paper. Most of them are decent but not extraordinary. The value is in the compound effect of 12,000 papers building on each other across ten domains, with semantic search connecting them, running continuously for months.

When one of our analysts investigates an alert involving a novel attack technique, they're not starting from a Google search. They're querying a knowledge base that's been researching attack techniques, defense strategies, compliance implications, and red team methodologies 24/7 for three months. The answer draws from thousands of synthesized sources, cross-referenced across domains, ranked by semantic similarity to their specific question.

That's not something you can buy from a vendor. It's not something you can download. It's institutional knowledge that compounds automatically at machine speed. And it cost us nothing beyond what we were already spending.

Our tagline is "Don't replace your people. UPGRADE them." This system is that tagline made real. It doesn't replace analysts. It gives every analyst on the team access to a research corpus that would take a team of twenty full-time researchers years to build. And it keeps growing while everyone sleeps.

The question isn't whether you can afford to build something like this. At zero marginal cost, the question is whether you can afford not to.

6 -- Compute nodes in the mesh
90+ -- CPU cores distributed
296 GB -- Total RAM across mesh
$0 -- Monthly API cost

Want to Build Your Own Zero-Cost Research Platform?

We're exploring ways to help other small teams build autonomous knowledge systems on hardware they already own. If you're interested in what that looks like for your organization, let's talk.
