The $200 Cron Job: How We Burned 7M Tokens in One Hour (And How You Won't)

Published: February 17, 2026

We're sharing this because we wish someone had warned us. If you're building AI agents with scheduled tasks, this post could save you real money.

TL;DR — The Mistake

Never use wakeMode: "next-heartbeat" for scheduled cron jobs. Use "now" instead.

That's it. That one setting cost us 7 million tokens in under an hour. Read on for why, and how to avoid our exact mistake.

The Setup

We run a multi-agent AI platform. One agent, Flo, has a daily summary task — an "evening report" scheduled for 6 PM CST. Classic cron job, nothing fancy:

{
  "name": "evening-report",
  "schedule": {
    "kind": "cron",
    "expr": "0 0 * * *"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate the daily summary report..."
  },
  "sessionTarget": "isolated"
}

This should fire once per day. It was firing every 2 minutes.

The Root Cause: Wake Mode Confusion

Here's what we got wrong. Our cron system has a wakeMode setting that controls what happens when the scheduler wakes up (after a restart, sleep, or idle period):

Wake Mode	Behavior
`"now"`	If job is past due, run it once immediately, then resume normal schedule
`"next-heartbeat"`	Run on the next heartbeat cycle if conditions are met

We thought "next-heartbeat" meant "run at the next appropriate time." It doesn't.

What actually happens: The heartbeat is a frequent polling mechanism (every ~2 minutes) that keeps agents responsive. When wakeMode: "next-heartbeat" is set, the scheduler can interpret every heartbeat as a valid trigger opportunity under certain state conditions.

The job's cron expression was correct. But the wake mode turned every heartbeat into: "Hey, should I run this job? The wake mode says next-heartbeat... okay, running."

35 executions in one hour. Each one spinning up Claude Sonnet 4.5 to generate a comprehensive daily report. 7 million+ tokens burned.
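
A toy model makes the failure mode concrete. This is a sketch, not the platform's real scheduler code: `count_runs` and `HEARTBEAT` are made-up names, and the heartbeat is assumed to tick exactly every 2 minutes.

```python
from datetime import datetime, timedelta

HEARTBEAT = timedelta(minutes=2)  # assumed interval; the post says "every ~2 minutes"


def count_runs(wake_mode: str, start: datetime, hours: int = 1) -> int:
    """Count how often a past-due daily job fires in a toy scheduler model."""
    runs = 0
    already_ran = False
    tick = start
    end = start + timedelta(hours=hours)
    while tick < end:
        if wake_mode == "next-heartbeat":
            # Faulty behaviour described above: every heartbeat is a trigger.
            runs += 1
        elif wake_mode == "now" and not already_ran:
            # Past-due job runs once, then waits for its next scheduled time.
            runs += 1
            already_ran = True
        tick += HEARTBEAT
    return runs


start = datetime(2026, 2, 16, 18, 0)
print(count_runs("next-heartbeat", start))  # 30
print(count_runs("now", start))             # 1
```

With an exact 2-minute heartbeat the model gives 30 runs in an hour, the same ballpark as the 35 we saw with a slightly faster real-world interval.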

The Fix

Immediate

cron action=update jobId="f4070232-..." patch='{"enabled": false}'

Stop the bleeding first.

Permanent

Change the wake mode:

{
  "wakeMode": "now"
}

With "now", if the job is past due (say, the system was offline at 6 PM), it runs once immediately and then waits for the next scheduled time. No heartbeat chaos.
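
The "run once, then resume" behaviour can be sketched as below. This is an illustration under assumptions, not the scheduler's implementation: `catch_up` is a hypothetical helper, the job is assumed to fire daily at 18:00, and time zones are ignored.

```python
from datetime import datetime, timedelta


def catch_up(last_run: datetime, now: datetime, hour: int = 18):
    """Return (run_now, next_scheduled) for a daily job due at `hour`:00."""
    today_slot = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Most recent scheduled slot that is not in the future.
    latest_slot = today_slot if now >= today_slot else today_slot - timedelta(days=1)
    # Past due if the last run predates that slot. However many slots were
    # missed, the answer is a single catch-up run.
    run_now = last_run < latest_slot
    return run_now, latest_slot + timedelta(days=1)


# System was offline at 6 PM and came back at 7:30 PM.
last_run = datetime(2026, 2, 15, 18, 0)
now = datetime(2026, 2, 16, 19, 30)
print(catch_up(last_run, now))  # run once now, next slot is Feb 17 at 18:00
print(catch_up(now, now))       # after that run: nothing to do until Feb 17
```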

The Checklist We Now Follow

Before enabling any new cron job, we run through this:

1. Wake Mode Selection

Use Case	Wake Mode
Time-based schedule (daily, hourly, etc.)	`"now"`
Event-driven task that should resume on next opportunity	`"next-heartbeat"`
One-shot reminder	`"now"`
Periodic batch job	`"now"`

Rule of thumb: If it has a cron expression, use "now".
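
That rule is easy to turn into a pre-flight check. The sketch below assumes job configs shaped like the JSON in this post. `wake_mode_problems` is a hypothetical helper, not part of any cron tooling, and it treats a missing `wakeMode` as a problem because we don't want to rely on whatever the default is.

```python
def wake_mode_problems(job: dict) -> list:
    """Return a list of human-readable problems with a job's wake mode."""
    problems = []
    schedule = job.get("schedule", {})
    wake_mode = job.get("wakeMode")  # None if the key is absent
    if schedule.get("kind") == "cron" and wake_mode != "now":
        name = job.get("name", "<unnamed>")
        problems.append(
            f'cron job {name!r} has wakeMode={wake_mode!r}; expected "now"'
        )
    return problems


risky = {
    "name": "evening-report",
    "schedule": {"kind": "cron", "expr": "0 0 * * *"},
    "wakeMode": "next-heartbeat",
}
for problem in wake_mode_problems(risky):
    print(problem)
```

Running a check like this before any job is enabled turns the rule of thumb into something that fails loudly.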

2. Model Selection for Testing

Start cheap. New cron jobs get a free-tier model until we've verified they work:

{
  "payload": {
    "kind": "agentTurn",
    "model": "openrouter/nvidia/nemotron-3-nano-30b-a3b:free"
  }
}

Upgrade to Claude only after 2-3 successful scheduled runs.

3. Manual Test Before Enabling

cron action=run jobId="your-job-id"

Watch it execute once. Check the output. Verify it completes without errors.

4. Monitor the First Few Runs

After enabling, actively watch the job for its first 2-3 scheduled executions. Check:

Did it run at the expected time?
Did it run only once?
Is the output correct?
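
The first two questions can be answered mechanically from a run log. A minimal sketch, assuming you can pull a list of execution timestamps for the day: `check_daily_runs` and the 5-minute tolerance are illustrative choices, not an existing API.

```python
from datetime import datetime, timedelta


def check_daily_runs(run_times, expected, tolerance=timedelta(minutes=5)):
    """Check one day's executions of a daily job against its expected time."""
    on_time = [t for t in run_times if abs(t - expected) <= tolerance]
    return {
        "ran_on_time": len(on_time) >= 1,  # at least one run near the slot
        "ran_once": len(run_times) == 1,   # and no extra runs that day
    }


expected = datetime(2026, 2, 16, 18, 0)
healthy = [datetime(2026, 2, 16, 18, 0, 30)]
runaway = [expected + timedelta(minutes=2 * i) for i in range(35)]

print(check_daily_runs(healthy, expected))
print(check_daily_runs(runaway, expected))
```

Whether the output is correct still needs a human to read the report.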

5. Set Execution Limits (If Your System Supports It)

We're implementing max-runs-per-hour guards. A daily job that runs more than twice in an hour is obviously broken.
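
One way such a guard could look is a sliding-window counter in front of the job runner. This is a sketch of the idea, not our production implementation: `RunGuard` and its parameters are hypothetical names.

```python
from collections import deque
from datetime import datetime, timedelta


class RunGuard:
    """Refuse executions once `max_runs` have happened inside `window`."""

    def __init__(self, max_runs: int, window: timedelta = timedelta(hours=1)):
        self.max_runs = max_runs
        self.window = window
        self._runs = deque()  # timestamps of allowed runs, oldest first

    def allow(self, now: datetime) -> bool:
        # Forget runs that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return False  # over the limit: skip the run (and ideally alert)
        self._runs.append(now)
        return True


guard = RunGuard(max_runs=2)
start = datetime(2026, 2, 16, 18, 0)
for minutes in (0, 2, 4, 61):
    print(minutes, guard.allow(start + timedelta(minutes=minutes)))
```

With a limit of two, the third attempt at minute 4 is refused, and the attempt at minute 61 goes through again because the first run has left the window. Applied to our incident, a guard like this would have stopped the job after its second run.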

The Before/After

Before (dangerous):

{
  "name": "evening-report",
  "schedule": { "kind": "cron", "expr": "0 0 * * *" },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate daily summary",
    "model": "anthropic/claude-sonnet-4-5"
  },
  "wakeMode": "next-heartbeat"
}

After (safe):

{
  "name": "evening-report",
  "schedule": { "kind": "cron", "expr": "0 0 * * *" },
  "payload": {
    "kind": "agentTurn",
    "message": "Generate daily summary",
    "model": "openrouter/nvidia/nemotron-3-nano-30b-a3b:free"
  },
  "wakeMode": "now",
  "enabled": false
}

Note: We also start with enabled: false now, manually test, then enable.

Why This Matters for AI Infrastructure

Misconfigured cron jobs in traditional systems are annoying. They might spam emails or fill up logs. The blast radius is limited.

Misconfigured cron jobs in AI systems are expensive. Every execution burns real money — API calls, token usage, compute time. A runaway job doesn't just waste resources; it can blow through your monthly budget before you notice.
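
Some back-of-the-envelope arithmetic shows how fast this compounds. The token and run counts are the rounded figures from this post; treating the $200 in the title as the cost of that single hour, and extrapolating it linearly over 24 hours, are assumptions made for illustration.

```python
TOKENS_BURNED = 7_000_000  # from the post (rounded; actual was "7 million+")
RUNS = 35                  # executions in the one-hour incident
COST_USD = 200             # from the title; assumed to be that hour's cost

tokens_per_run = TOKENS_BURNED / RUNS
cost_per_run = COST_USD / RUNS
cost_if_unnoticed_for_a_day = COST_USD * 24  # linear extrapolation, hypothetical

print(f"{tokens_per_run:,.0f} tokens per run")
print(f"${cost_per_run:.2f} per run")
print(f"${cost_if_unnoticed_for_a_day:,} if it had run for 24 hours")
```

That works out to about 200,000 tokens and $5.71 per run, and roughly $4,800 had nobody noticed for a full day.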

The stakes are higher. The safeguards need to match.

Other Lessons From This Week

While debugging the cron disaster, we also fixed some other infrastructure issues:

Tailscale Gateway Access

Our control UI wasn't accessible over Tailscale because the gateway was binding to localhost only. Fix:

{ "gateway": { "bind": "tailnet" } }

Rogue Gateway Detection

We found a stray gateway process on a Windows machine that was conflicting with our main VPS gateway. Two gateways polling the same Telegram bot = chaos. Always know where your processes are running.

Quick Reference Card

┌─────────────────────────────────────────────────────────┐
│  CRON JOB SAFETY CHECKLIST                              │
├─────────────────────────────────────────────────────────┤
│  [ ] wakeMode set to "now" (NOT "next-heartbeat")       │
│  [ ] Model is free/cheap for initial testing            │
│  [ ] Manual test run completed: cron action=run         │
│  [ ] Only THEN set enabled: true                        │
│  [ ] Watched 2-3 scheduled runs behave correctly        │
│  [ ] Only THEN upgrade to expensive model               │
└─────────────────────────────────────────────────────────┘

Print it. Tape it to your monitor. We did.

Closing Thoughts

Building in public means sharing the expensive lessons, not just the wins. This one stung, but now we have battle-tested safeguards — and so do you.

If this saves even one person from the same mistake, the 7 million tokens were worth it.

(Okay, they still hurt. But you get the idea.)

This is part of our Building in Public series. Follow along as we build Atlas-OS — a multi-agent AI platform for personal and business automation.
