The $200 Cron Job: How We Burned 7M Tokens in One Hour (And How You Won't)
Published: February 17, 2026
We're sharing this because we wish someone had warned us. If you're building AI agents with scheduled tasks, this post could save you real money.
TL;DR: The Mistake
Never use wakeMode: "next-heartbeat" for scheduled cron jobs. Use "now" instead.
That's it. That one setting cost us 7 million tokens in under an hour. Read on for why, and how to avoid our exact mistake.
The Setup
We run a multi-agent AI platform. One agent, Flo, has a daily summary task: an "evening report" scheduled for 6 PM CST. Classic cron job, nothing fancy:
{
  "name": "evening-report",
  "schedule": {
    "kind": "cron",
    "expr": "0 0 * * *"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate the daily summary report..."
  },
  "sessionTarget": "isolated"
}
This should fire once per day. It was firing every 2 minutes.
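A quick sanity check on the expression itself: 6 PM CST is 00:00 UTC, so `0 0 * * *` only matches the intended time if the scheduler evaluates cron expressions in UTC. That is an assumption here, since this post does not state the scheduler's time zone. A minimal Python sketch of the conversion, using a fixed UTC-6 offset for CST (the helper name `cron_fields_utc` is ours, purely for illustration):

```python
from datetime import datetime, timedelta, timezone

# CST is UTC-6 (fixed offset; daylight saving time is deliberately ignored here).
CST = timezone(timedelta(hours=-6), name="CST")

def cron_fields_utc(local_dt: datetime) -> str:
    """Return a daily cron expression (minute hour * * *) in UTC for a local time."""
    utc_dt = local_dt.astimezone(timezone.utc)
    return f"{utc_dt.minute} {utc_dt.hour} * * *"

# 6 PM CST on an arbitrary winter date.
evening = datetime(2026, 2, 16, 18, 0, tzinfo=CST)
print(cron_fields_utc(evening))  # 0 0 * * *
```

So the expression was doing what we wanted: one trigger per day, at 6 PM CST.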
The Root Cause: Wake Mode Confusion
Here's what we got wrong. Our cron system has a wakeMode setting that controls what happens when the scheduler wakes up (after a restart, sleep, or idle period):
| Wake Mode | Behavior |
|---|---|
| "now" | If job is past due, run it once immediately, then resume normal schedule |
| "next-heartbeat" | Run on the next heartbeat cycle if conditions are met |
We thought "next-heartbeat" meant "run at the next appropriate time." It doesn't.
What actually happens: The heartbeat is a frequent polling mechanism (every ~2 minutes) that keeps agents responsive. When wakeMode: "next-heartbeat" is set, the scheduler can interpret every heartbeat as a valid trigger opportunity under certain state conditions.
The job's cron expression was correct. But the wake mode turned every heartbeat into: "Hey, should I run this job? The wake mode says next-heartbeat... okay, running."
35 executions in one hour. Each one spinning up Claude Sonnet 4.5 to generate a comprehensive daily report. 7 million+ tokens burned.
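To make the failure mode concrete, here is a toy model in Python. It is not the real scheduler: the function name `count_runs`, the exact 2-minute heartbeat, and the "every heartbeat triggers" rule are simplifying assumptions based only on the behavior described above.

```python
def count_runs(wake_mode: str, window_minutes: int = 60, heartbeat_minutes: int = 2) -> int:
    """Toy model: count executions of one past-due daily job within a time window."""
    runs = 0
    already_caught_up = False
    # One loop iteration per heartbeat inside the window.
    for _minute in range(0, window_minutes, heartbeat_minutes):
        if wake_mode == "next-heartbeat":
            # Assumed faulty behavior: each heartbeat counts as a trigger opportunity.
            runs += 1
        elif wake_mode == "now" and not already_caught_up:
            # Past-due job runs once, then waits for the next scheduled time.
            runs += 1
            already_caught_up = True
    return runs

print(count_runs("next-heartbeat"))  # 30
print(count_runs("now"))             # 1
```

An exact 2-minute heartbeat gives 30 runs per hour in this model, the same order of magnitude as the 35 we actually saw with a heartbeat of roughly 2 minutes.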
The Fix
Immediate
cron action=update jobId="f4070232-..." patch='{"enabled": false}'
Stop the bleeding first.
Permanent
Change the wake mode:
{
  "wakeMode": "now"
}
With "now", if the job is past due (say, the system was offline at 6 PM), it runs once immediately and then waits for the next scheduled time. No heartbeat chaos.
The Checklist We Now Follow
Before enabling any new cron job, we run through this:
1. Wake Mode Selection
| Use Case | Wake Mode |
|---|---|
| Time-based schedule (daily, hourly, etc.) | "now" |
| Event-driven task that should resume on next opportunity | "next-heartbeat" |
| One-shot reminder | "now" |
| Periodic batch job | "now" |
Rule of thumb: If it has a cron expression, use "now".
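The rule of thumb is simple enough to check mechanically before a job is ever enabled. A minimal sketch of such a lint step in Python, assuming job configs shaped like the JSON in this post; `check_wake_mode` is a hypothetical helper, not part of our cron tooling. It also flags a missing `wakeMode`, because we are not assuming what the default is.

```python
def check_wake_mode(job: dict) -> list[str]:
    """Return warnings for a job config that breaks the cron/wakeMode rule of thumb."""
    warnings = []
    schedule_kind = job.get("schedule", {}).get("kind")
    wake_mode = job.get("wakeMode")
    if schedule_kind == "cron" and wake_mode != "now":
        warnings.append(
            f'job {job.get("name", "<unnamed>")!r}: cron schedule with '
            f'wakeMode={wake_mode!r}; use "now"'
        )
    return warnings

dangerous = {
    "name": "evening-report",
    "schedule": {"kind": "cron", "expr": "0 0 * * *"},
    "wakeMode": "next-heartbeat",
}
safe = {**dangerous, "wakeMode": "now"}

print(check_wake_mode(dangerous))  # one warning
print(check_wake_mode(safe))       # []
```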
2. Model Selection for Testing
Start cheap. New cron jobs get a free-tier model until we've verified they work:
{
  "payload": {
    "kind": "agentTurn",
    "model": "openrouter/nvidia/nemotron-3-nano-30b-a3b:free"
  }
}
Upgrade to Claude only after 2-3 successful scheduled runs.
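If you track successful runs somewhere, the promotion rule can be written down explicitly. A small Python sketch, assuming a hypothetical `successful_runs` counter that you maintain yourself; the threshold of 3 is the conservative end of our "2-3 runs" guideline.

```python
CHEAP_MODEL = "openrouter/nvidia/nemotron-3-nano-30b-a3b:free"
TARGET_MODEL = "anthropic/claude-sonnet-4-5"

def pick_model(successful_runs: int, required_runs: int = 3) -> str:
    """Stay on the free model until the job has proven itself on schedule."""
    return TARGET_MODEL if successful_runs >= required_runs else CHEAP_MODEL

print(pick_model(0))  # openrouter/nvidia/nemotron-3-nano-30b-a3b:free
print(pick_model(3))  # anthropic/claude-sonnet-4-5
```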
3. Manual Test Before Enabling
cron action=run jobId="your-job-id"
Watch it execute once. Check the output. Verify it completes without errors.
4. Monitor the First Few Runs
After enabling, actively watch the job for its first 2-3 scheduled executions. Check:
- Did it run at the expected time?
- Did it run only once?
- Is the output correct?
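The first two questions can be answered from run timestamps alone. A minimal Python sketch, assuming you can export a list of execution times for the job; `check_runs` and the 5-minute tolerance are illustrative choices, not something our platform provides. Output correctness still needs a human look.

```python
from datetime import datetime, timedelta, timezone

def check_runs(run_times, expected, tolerance=timedelta(minutes=5)):
    """Answer the first two checklist questions for one scheduled slot."""
    on_time = [t for t in run_times if abs(t - expected) <= tolerance]
    return {
        "ran_at_expected_time": len(on_time) > 0,
        "ran_only_once": len(run_times) == 1,
    }

# The 6 PM CST slot, expressed in UTC.
expected = datetime(2026, 2, 17, 0, 0, tzinfo=timezone.utc)
healthy = [expected + timedelta(seconds=20)]
runaway = [expected + timedelta(minutes=2 * i) for i in range(35)]

print(check_runs(healthy, expected))  # {'ran_at_expected_time': True, 'ran_only_once': True}
print(check_runs(runaway, expected))  # {'ran_at_expected_time': True, 'ran_only_once': False}
```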
5. Set Execution Limits (If Your System Supports It)
We're implementing max-runs-per-hour guards. A daily job that runs more than twice in an hour is obviously broken.
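As a rough idea of what such a guard could look like, here is a sliding-window sketch in Python. The class name `RunGuard` and its interface are hypothetical; this is not our production implementation, just the shape of the check, with the limit of 2 taken from the sentence above.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

class RunGuard:
    """Sliding-window guard: refuse a run once the hourly limit is reached."""

    def __init__(self, max_runs_per_hour: int = 2):
        self.max_runs = max_runs_per_hour
        self.window = timedelta(hours=1)
        self.recent = deque()  # timestamps of allowed runs, oldest first

    def allow(self, now: datetime) -> bool:
        # Drop timestamps that have aged out of the one-hour window.
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        if len(self.recent) >= self.max_runs:
            return False
        self.recent.append(now)
        return True

# Replay a runaway job: one trigger every 2 minutes for an hour.
guard = RunGuard(max_runs_per_hour=2)
start = datetime(2026, 2, 17, 0, 0, tzinfo=timezone.utc)
allowed = sum(guard.allow(start + timedelta(minutes=2 * i)) for i in range(30))
print(allowed)  # 2
```

With a guard like this in front of the job, the runaway hour would have cost 2 executions instead of 35.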
The Before/After
Before (dangerous):
{
  "name": "evening-report",
  "schedule": { "kind": "cron", "expr": "0 0 * * *" },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate daily summary",
    "model": "anthropic/claude-sonnet-4-5"
  },
  "wakeMode": "next-heartbeat"
}
After (safe):
{
  "name": "evening-report",
  "schedule": { "kind": "cron", "expr": "0 0 * * *" },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate daily summary",
    "model": "openrouter/nvidia/nemotron-3-nano-30b-a3b:free"
  },
  "wakeMode": "now",
  "enabled": false
}
Note: We also start with enabled: false now, manually test, then enable.
Why This Matters for AI Infrastructure
Misconfigured cron jobs in traditional systems are annoying. They might spam emails or fill up logs. The blast radius is limited.
Misconfigured cron jobs in AI systems are expensive. Every execution burns real money: API calls, token usage, compute time. A runaway job doesn't just waste resources; it can blow through your monthly budget before you notice.
The stakes are higher. The safeguards need to match.
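To put rough numbers on that, here is the back-of-the-envelope arithmetic for our incident in Python. It treats the headline $200 as the total bill and "7 million+" as exactly 7,000,000 tokens, so every figure is an approximate average, and the 24-hour projection is a simple linear extrapolation rather than something we measured.

```python
TOTAL_TOKENS = 7_000_000   # "7 million+" from the incident, rounded down
TOTAL_COST_USD = 200.0     # headline figure, treated as the total bill (assumption)
RUNS = 35                  # executions in the first hour

tokens_per_run = TOTAL_TOKENS / RUNS
cost_per_run = TOTAL_COST_USD / RUNS

def projected_cost(hours_unnoticed: float, runs_per_hour: int = RUNS) -> float:
    """Linear extrapolation of the spend if the runaway job is not stopped."""
    return hours_unnoticed * runs_per_hour * cost_per_run

print(f"{tokens_per_run:,.0f} tokens per run")       # 200,000 tokens per run
print(f"${cost_per_run:.2f} per run")                # $5.71 per run
print(f"${projected_cost(24):,.0f} after 24 hours")  # $4,800 after 24 hours
```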
Other Lessons From This Week
While debugging the cron disaster, we also fixed some other infrastructure issues:
Tailscale Gateway Access
Our control UI wasn't accessible over Tailscale because the gateway was binding to localhost only. Fix:
{ "gateway": { "bind": "tailnet" } }
Rogue Gateway Detection
Found a stray gateway process on a Windows machine conflicting with our main VPS gateway. Two gateways polling the same Telegram bot = chaos. Always know where your processes are running.
Quick Reference Card
┌─────────────────────────────────────────────────────────┐
│                CRON JOB SAFETY CHECKLIST                │
├─────────────────────────────────────────────────────────┤
│  [ ] wakeMode set to "now" (NOT "next-heartbeat")       │
│  [ ] Model is free/cheap for initial testing            │
│  [ ] Manual test run completed: cron action=run         │
│  [ ] Only THEN set enabled: true                        │
│  [ ] Watched 2-3 scheduled runs behave correctly        │
│  [ ] Only THEN upgrade to expensive model               │
└─────────────────────────────────────────────────────────┘
Print it. Tape it to your monitor. We did.
Closing Thoughts
Building in public means sharing the expensive lessons, not just the wins. This one stung, but now we have battle-tested safeguards, and so do you.
If this saves even one person from the same mistake, the 7 million tokens were worth it.
(Okay, they still hurt. But you get the idea.)
This is part of our Building in Public series. Follow along as we build Atlas-OS, a multi-agent AI platform for personal and business automation.