
HEARTBEAT.md: Set Up Health Monitoring for Your OpenClaw Agent

The worst kind of agent failure is the one nobody notices. Your agent stops responding to customer emails on a Friday afternoon. The integration token expires and the agent silently drops every Slack message. The LLM provider has a partial outage and your agent starts returning empty responses instead of errors.

These aren't hypothetical. We've seen all three in ClawSprout deployments. The teams that caught them quickly had one thing in common: a properly configured HEARTBEAT.md.

## Why Agents Fail Silently

Traditional software crashes loudly. A web server returns a 500 error. A database connection timeout throws an exception. A crashed process gets restarted by the OS.

Agents are different. An agent that can't reach its email provider doesn't crash. It just doesn't send email. An agent whose SOUL.md has a broken variable reference doesn't error out. It generates responses with raw template tags instead of real values. An agent that hits a rate limit doesn't stop. It queues up retries until it falls hours behind.

The common thread is that agents degrade gracefully by default, which sounds like a feature until you realize "graceful degradation" often means "failing without telling anyone." HEARTBEAT.md exists to turn silent failures into loud alerts.

## What HEARTBEAT.md Monitors

HEARTBEAT.md is a configuration file that defines health checks for your OpenClaw agent. Each check is a probe that runs on a schedule and reports the result. If a check fails, HEARTBEAT.md triggers the alert routing you've configured.

A typical HEARTBEAT.md has three types of checks:

**Ping checks** verify that the agent process is running and responsive. This is the most basic health check. The runtime sends a request to the agent's health endpoint and expects a response within a timeout. If the agent doesn't respond, something is fundamentally wrong: the process crashed, the container died, or the host is unreachable.

```markdown
## Checks

### ping
- interval: 60s
- timeout: 5s
- endpoint: /health
- alert_after: 3 failures
```
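As a rough illustration (this is not the OpenClaw runtime, just a sketch of the logic a ping check implies), the check boils down to: call the health endpoint, enforce the timeout, and treat anything other than a fast 200 as a failure. The `probe` callable stands in for the real HTTP request:

```python
import time

def run_ping_check(probe, timeout_s=5.0):
    """Run one ping check. `probe` is any callable that performs the
    actual health request and returns an HTTP status code (or raises)."""
    start = time.monotonic()
    try:
        status = probe()
    except Exception:
        return False  # connection refused, DNS failure, crashed process, etc.
    elapsed = time.monotonic() - start
    # Healthy only if the agent answered 200 within the timeout window.
    return status == 200 and elapsed <= timeout_s

def dead_agent():
    # Stand-in for an agent whose process has died.
    raise ConnectionError("agent process not responding")

run_ping_check(lambda: 200)   # healthy agent: True
run_ping_check(dead_agent)    # unreachable agent: False
```

Note that an exception and a slow response are both failures here: from the monitor's point of view, "didn't answer in time" and "didn't answer at all" mean the same thing.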

**Skill checks** verify that each integration is working. A skill check doesn't just confirm the agent is running. It confirms the agent can actually reach its external services. Can it connect to the email provider? Is the database credential still valid? Does the Slack token have the right scopes?

```markdown
### skill:email
- interval: 300s
- test: send_test_message
- alert_after: 2 failures

### skill:database
- interval: 120s
- test: query_health_table
- alert_after: 1 failure
```
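Conceptually, the runtime resolves each check's `test:` value to a callable and records whether it succeeded. A toy dispatcher (the registry contents are made up to mirror the config above) might look like:

```python
def run_skill_check(test_name, registry):
    """Look up the named test and run it; any exception counts as a failure."""
    test = registry.get(test_name)
    if test is None:
        return False  # an unknown test name is itself a failure worth surfacing
    try:
        test()
        return True
    except Exception:
        return False

# Hypothetical registry mirroring the example config:
registry = {
    "send_test_message": lambda: None,    # pretend the email provider is up
    "query_health_table": lambda: 1 / 0,  # pretend the database is down
}

run_skill_check("send_test_message", registry)   # True
run_skill_check("query_health_table", registry)  # False
```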

**Deep checks** verify that the agent produces correct output. A deep check sends a known input to the agent and validates the response. This catches problems that ping and skill checks miss: a corrupted SOUL.md, a model that's returning garbage, or a logic error in the agent's reasoning chain.

```markdown
### deep
- interval: 3600s
- test_input: "What is your name and role?"
- expect_contains: ["support agent", "Acme"]
- alert_after: 1 failure
```

Deep checks are the most powerful but also the most expensive since they consume LLM tokens on every run. Run them hourly or daily, not every minute.
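The validation side of a deep check is simple: every phrase in `expect_contains` must appear in the response. A minimal sketch (case-insensitive matching is an assumption on my part; the actual matching rules may differ):

```python
def deep_check_passes(response, expect_contains):
    """A deep check passes only if every expected phrase appears
    somewhere in the response (case-insensitive)."""
    lowered = response.lower()
    return all(phrase.lower() in lowered for phrase in expect_contains)

reply = "I'm the Acme support agent. How can I help?"
deep_check_passes(reply, ["support agent", "Acme"])  # True
deep_check_passes("", ["support agent", "Acme"])     # False: catches empty output
```

The empty-string case is exactly the silent failure mode from the introduction: the agent is "up," but it's returning nothing useful.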

## Alert Routing

Detecting a failure is useless if the alert goes to the wrong place. HEARTBEAT.md supports multiple alert channels with escalation.

```markdown
## Alerts

### primary
- channel: slack
- target: #agent-alerts
- on: any_failure

### escalation
- channel: email
- target: ops-team@company.com
- on: 3_consecutive_failures
- include: logs, last_10_interactions

### critical
- channel: pagerduty
- target: agent-oncall
- on: ping_failure
```

The pattern most teams settle on: Slack for first alerts, email with logs for repeated failures, PagerDuty (or equivalent) for complete outages. Adjust based on how critical your agent is. A research assistant that's down for an hour is inconvenient. A customer support agent that's down for an hour is losing you money.
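The escalation ladder above can be read as a simple routing function: primary fires on any failure, escalation kicks in at three consecutive failures, and critical fires whenever the failing check is the ping check. A sketch of that logic (channel strings are just labels from the example config):

```python
def channels_to_notify(check_name, consecutive_failures):
    """Map a failing check to alert channels, following the
    escalation ladder in the example config above."""
    channels = []
    if consecutive_failures >= 1:
        channels.append("slack:#agent-alerts")         # primary: any_failure
    if consecutive_failures >= 3:
        channels.append("email:ops-team@company.com")  # escalation: 3_consecutive_failures
    if check_name == "ping":
        channels.append("pagerduty:agent-oncall")      # critical: ping_failure
    return channels

channels_to_notify("skill:email", 1)  # Slack only
channels_to_notify("ping", 3)         # all three: Slack, email, PagerDuty
```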

## Auto-Recovery

HEARTBEAT.md can do more than alert. It can take automatic recovery actions when specific checks fail.

```markdown
## Recovery

### on_skill_failure
- action: refresh_credentials
- retry_after: 30s
- max_retries: 3

### on_ping_failure
- action: restart_agent
- retry_after: 60s
- max_retries: 2
- alert_if_recovery_fails: true

### on_deep_failure
- action: reload_config
- retry_after: 10s
- max_retries: 1
```

**Credential refresh** handles the most common integration failure: expired tokens. When a skill check fails, the recovery action requests a new token from the OAuth provider and retries. This fixes 60-70% of integration outages without human intervention.

**Agent restart** is the nuclear option for ping failures. If the agent process is unresponsive, HEARTBEAT.md can kill it and start a fresh instance. This works well for transient issues like memory leaks or stuck event loops.

**Config reload** handles cases where a configuration change caused a problem. It re-reads SOUL.md and AGENTS.md from storage and reinitializes the agent. This is useful when deep checks fail after a config update.

Auto-recovery should always have a retry limit and an alert-on-failure fallback. An agent stuck in a restart loop is worse than an agent that's down, because the restart loop consumes resources and can cascade to other services.
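The shape of safe auto-recovery is a bounded loop with an alert fallback. A sketch (not the OpenClaw implementation; `action` and `alert` are stand-ins for the configured recovery action and alert channel):

```python
import time

def attempt_recovery(action, max_retries, retry_after_s, alert):
    """Run a recovery action at most max_retries times. If every attempt
    fails, fall back to alerting a human instead of looping forever."""
    for attempt in range(max_retries):
        if action():  # action returns True when the agent is healthy again
            return True
        time.sleep(retry_after_s)
    alert(f"recovery exhausted after {max_retries} attempts")
    return False

# Usage with stubs: a recovery action that never succeeds.
alerts = []
attempt_recovery(lambda: False, max_retries=2, retry_after_s=0, alert=alerts.append)
# alerts now holds one escalation message instead of an endless restart loop.
```

The key property is that the loop always terminates: either the agent comes back healthy, or a human gets paged. There is no third state.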

## Common Monitoring Mistakes

**No HEARTBEAT.md at all.** This is the default for most new agents and it's the most dangerous state. You're relying on users to report problems, which means you'll find out about outages hours or days after they start.

**Ping checks only.** A ping check confirms the agent is alive. It doesn't confirm the agent is useful. Your agent can respond to health checks perfectly while returning empty responses to every real request because an integration token expired.

**Alert fatigue.** If every minor hiccup triggers a PagerDuty alert, your team starts ignoring alerts. Use the alert_after field to require multiple consecutive failures before alerting. A single failed health check is noise. Three consecutive failures is a problem.
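The alert_after debounce is just a consecutive-failure counter that resets on any success. A minimal sketch of the idea:

```python
class FailureDebouncer:
    """Fire an alert only after `alert_after` consecutive failures;
    a single passing check resets the streak."""

    def __init__(self, alert_after):
        self.alert_after = alert_after
        self.streak = 0

    def record(self, check_passed):
        """Record one check result; return True when it's time to alert."""
        if check_passed:
            self.streak = 0
            return False
        self.streak += 1
        return self.streak >= self.alert_after

debouncer = FailureDebouncer(alert_after=3)
debouncer.record(False)  # failure 1: no alert
debouncer.record(False)  # failure 2: no alert
debouncer.record(False)  # failure 3: alert fires
```

Note that the reset on success matters: one flaky check every few hours never accumulates into an alert, while a genuine outage alerts within three intervals.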

**No deep checks.** Ping checks and skill checks verify infrastructure. Deep checks verify behavior. Without deep checks, you won't catch the class of failures where everything is technically running but the agent is producing wrong or empty output.

**Missing recovery limits.** Auto-recovery without a max_retries limit can create infinite restart loops. Always set a ceiling and always alert when recovery exhausts its retries.

## Setting Up HEARTBEAT.md in ClawSprout

ClawSprout generates a default HEARTBEAT.md for every agent. The defaults include a 60-second ping check, skill checks for each configured integration, and Slack alerts. You can customize everything from the monitoring tab in your agent's dashboard.

The dashboard also shows a real-time view of check results: green for passing, yellow for degraded (one failure but not yet alerting), red for failing. Historical data is retained for 30 days so you can spot patterns like integrations that fail every night when tokens rotate.
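The three-color status described above follows directly from the failure streak and the check's alert_after threshold. A sketch of that mapping (my reading of the green/yellow/red semantics, not ClawSprout's actual dashboard code):

```python
def check_status(consecutive_failures, alert_after):
    """Derive a dashboard color: green when passing, yellow while
    failures are below the alert threshold, red at or past it."""
    if consecutive_failures == 0:
        return "green"
    if consecutive_failures < alert_after:
        return "yellow"
    return "red"

check_status(0, alert_after=3)  # "green": passing
check_status(1, alert_after=3)  # "yellow": degraded, not yet alerting
check_status(3, alert_after=3)  # "red": failing, alert fired
```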

If you already have agents running without HEARTBEAT.md, add one today. Start with just ping checks and one alert channel. You can add skill checks and deep checks incrementally. The goal is to stop finding out about agent failures from your users and start finding out from your monitoring.

