SM-01: How One Agent Monitors 23 Websites Every Hour Without Me

The Worst Way to Find Out Your Site Is Down

A client calls. Their site has been returning a 503 error for four hours. You check – they are right. The hosting provider had a blip, the site went down, and nobody noticed because nobody was watching. Four hours of lost traffic, lost leads, and lost trust.

This happened to me once. It never happened again, because I built SM-01.

SM-01 is the first agent in my autonomous fleet. It runs every 60 minutes via Windows Task Scheduler, checks 23 websites across my client portfolio, and reports to Slack only when it finds a problem. No dashboard to check. No email digest to read. Silence means everything is fine. A Slack message means something needs attention.

What SM-01 Checks

HTTP status: Is the site returning 200? A 503, 502, or 500 triggers an immediate red alert. A 301 or 302 redirect chain triggers a yellow alert – the site works but something changed.
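The status rules above can be sketched as a small pure function. This is an illustrative sketch, not SM-01's actual source: the name `classify_status` and the `redirect_count` parameter are assumptions, and the treatment of codes the article does not mention (a 404, for example) is a guess that is labeled as such in the code.

```python
def classify_status(status_code: int, redirect_count: int = 0) -> str:
    """Map an HTTP result to an alert level: 'ok', 'yellow', or 'red'."""
    # Server errors such as 500, 502, and 503 are red.
    if status_code >= 500:
        return "red"
    # A redirect means the site works but something changed: yellow.
    if redirect_count > 0 or status_code in (301, 302):
        return "yellow"
    if status_code == 200:
        return "ok"
    # Assumption: the article does not say how other codes are handled,
    # so this sketch flags them for a human to look at.
    return "yellow"


# With the requests library, which follows redirects by default, the inputs
# could come from:
#   response = requests.get(url, timeout=10)
#   level = classify_status(response.status_code, len(response.history))
```

Keeping the classification separate from the network call makes the rules easy to test without touching a live site.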

Response time: How long does the homepage take to respond? Baseline is established over 30 days of monitoring. If response time exceeds 2x the baseline, a yellow alert fires. If it exceeds 5x, red alert. Slow sites lose rankings and visitors before they fully go down – response time degradation is an early warning.
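The 2x and 5x thresholds translate into a comparison against the stored baseline. A minimal sketch, assuming the baseline is already available in milliseconds; the function name is hypothetical.

```python
def classify_response_time(elapsed_ms: float, baseline_ms: float) -> str:
    """Compare a measured response time with the site's baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    ratio = elapsed_ms / baseline_ms
    if ratio > 5:   # more than 5x the baseline
        return "red"
    if ratio > 2:   # more than 2x the baseline
        return "yellow"
    return "ok"
```

For a site with a 400ms baseline, anything above 800ms would be yellow and anything above 2000ms would be red.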

SSL certificate expiration: SM-01 checks the SSL certificate expiry date on every pass. If a certificate expires within 14 days, yellow alert. Within 3 days, red alert. Expired, critical alert. An expired SSL certificate turns your site into a browser warning page and kills organic traffic instantly.
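One way to read a certificate's expiry date with only the ssl and socket modules is sketched below. The function names are assumptions, and the sketch relies on Python's default TLS context, which rejects an already-expired certificate during the handshake, so a caller would need to treat that exception as the critical case.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def fetch_cert_expiry(hostname: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Return the notAfter date of the certificate served by hostname."""
    context = ssl.create_default_context()
    # Note: an expired or otherwise invalid certificate raises
    # ssl.SSLCertVerificationError here instead of returning a date.
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def classify_cert(expires_at: datetime, now: datetime) -> str:
    """Apply the 14-day, 3-day, and expired thresholds."""
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "critical"
    if remaining <= timedelta(days=3):
        return "red"
    if remaining <= timedelta(days=14):
        return "yellow"
    return "ok"
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the threshold logic testable with fixed dates.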

Content integrity: The agent checks for the presence of specific strings on each homepage – the site name, a key heading, or a footer element. If these strings disappear, it means the homepage content changed unexpectedly – possibly a defacement, a bad deploy, or a theme crash. This catches the subtle failures that return a 200 status code but serve broken content.
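The content check can be as simple as a substring test on the returned HTML. A sketch, with a hypothetical function name and an exact, case-sensitive match assumed:

```python
def missing_markers(html: str, markers: list[str]) -> list[str]:
    """Return the expected strings that are absent from the page body."""
    return [marker for marker in markers if marker not in html]


# Example: a page that returns 200 but has lost its footer text.
page = "<html><h1>Acme Plumbing</h1><p>Welcome</p></html>"
problems = missing_markers(page, ["Acme Plumbing", "All rights reserved"])
# A non-empty list means the homepage changed unexpectedly.
```

With requests, the `html` argument would typically be `response.text`.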

The Architecture Is Deliberately Boring

SM-01 is a Python script. It uses the requests library for HTTP checks, the ssl and socket libraries for certificate inspection, and a Slack webhook for alerts. No monitoring platform. No subscription. No agent framework. Under 250 lines of code.
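Slack incoming webhooks accept a JSON body with a `text` field sent by HTTP POST, so the alert step needs very little code. The sketch below uses the standard library's urllib instead of requests, purely to stay dependency-free; the function names and the message format are assumptions.

```python
import json
import urllib.request


def build_alert_payload(site: str, level: str, detail: str) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"[{level.upper()}] {site}: {detail}"}


def send_alert(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Because the script posts only when a check fails, a quiet channel is the normal state.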

The site list is a JSON file with 23 entries. Each entry has the URL, expected status code, content check string, and baseline response time. Adding a new site takes 30 seconds – add an entry to the JSON file.
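A site entry with those four fields might look like the sample below. The key names (`url`, `expected_status`, `content_check`, `baseline_ms`) are assumptions, since the article describes the fields but not how they are spelled; the loader simply fails fast if an entry is incomplete.

```python
import json

# Hypothetical sample of the site list; the real file would hold 23 entries.
SAMPLE_SITES_JSON = """
[
  {
    "url": "https://example.com/",
    "expected_status": 200,
    "content_check": "Example Domain",
    "baseline_ms": 400
  }
]
"""

REQUIRED_KEYS = {"url", "expected_status", "content_check", "baseline_ms"}


def load_sites(raw_json: str) -> list[dict]:
    """Parse the site list and reject entries with missing fields."""
    sites = json.loads(raw_json)
    for index, site in enumerate(sites):
        missing = REQUIRED_KEYS - site.keys()
        if missing:
            raise ValueError(f"site entry {index} is missing keys: {sorted(missing)}")
    return sites


sites = load_sites(SAMPLE_SITES_JSON)
```

Validating on load means a typo in a new entry surfaces right away rather than as a confusing failure in the middle of a check run.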

Results are stored in a local SQLite database for trend analysis. I can query historical uptime, average response time, and alert frequency for any site over any time period. The database is 12MB after six months of hourly checks across 23 sites.
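A single table is enough to support those queries. The schema and function names below are assumptions for illustration; timestamps are stored as ISO 8601 text so that string comparison matches chronological order.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS checks (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    checked_at TEXT NOT NULL,   -- ISO 8601 timestamp, UTC
    is_up INTEGER NOT NULL,     -- 1 = check passed, 0 = check failed
    response_ms REAL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the results database and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_check(conn, url, checked_at, is_up, response_ms):
    """Store one check result; the with-block commits the insert."""
    with conn:
        conn.execute(
            "INSERT INTO checks (url, checked_at, is_up, response_ms) VALUES (?, ?, ?, ?)",
            (url, checked_at, int(is_up), response_ms),
        )


def uptime_percent(conn, url, since, until):
    """Percentage of passed checks for url in [since, until), or None if no rows."""
    row = conn.execute(
        "SELECT AVG(is_up) * 100.0 FROM checks "
        "WHERE url = ? AND checked_at >= ? AND checked_at < ?",
        (url, since, until),
    ).fetchone()
    return row[0]
```

Average response time and alert frequency follow the same pattern, with `AVG(response_ms)` or `COUNT(*)` in place of the uptime expression.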

What Six Months of Data Revealed

Across 23 sites monitored hourly for six months, SM-01 recorded 99.7% average uptime. The 0.3% downtime was concentrated in three sites on shared hosting – every other site on dedicated or managed hosting had 99.99%+ uptime.

SSL certificate alerts saved two near-misses where auto-renewal failed silently. Without SM-01, those certificates would have expired and the sites would have shown browser security warnings until someone manually noticed and renewed.

Response time trending caught one hosting degradation issue three weeks before it became a visible problem. A site’s response time crept from 400ms baseline to 900ms over 10 days. SM-01 flagged it at the 800ms mark. Investigation revealed a database table that needed optimization. Fixed in 20 minutes, before any traffic impact.

Frequently Asked Questions

Why not use UptimeRobot or Pingdom?

I have. They work well for basic uptime monitoring. SM-01 adds content integrity checking, custom response time baselines per site, and integration with my existing Slack alert ecosystem. The biggest advantage is cost at scale – monitoring 23 sites on UptimeRobot Pro means paying a monthly subscription. SM-01 costs nothing.

Does hourly checking miss short outages?

Yes – any outage that starts and ends between two checks goes unnoticed, so a 30-minute outage can easily be missed. For critical production sites, you could reduce the interval to 5 minutes. I chose hourly because my sites are content sites, not e-commerce or SaaS platforms where minutes of downtime have direct revenue impact. The monitoring frequency should match the cost of missed downtime.

How do you handle false positives from network issues?

SM-01 requires two consecutive failed checks before alerting. When the hourly check fails, the agent rechecks 60 seconds later. A single timeout or error is logged but not reported; the alert fires only if the recheck fails too. This eliminates the vast majority of false positives from transient network blips or temporary DNS issues.
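The two-strike rule can be sketched as a wrapper around any check function. The names here are hypothetical, and the `sleep` parameter exists only so the delay can be skipped in tests.

```python
import time
from typing import Callable


def confirmed_failure(
    check: Callable[[], bool],
    recheck_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the check fails twice in a row.

    check() should return True when the site looks healthy.
    """
    if check():
        return False          # first attempt passed, nothing to report
    sleep(recheck_delay_s)    # wait before the confirming recheck
    return not check()        # alert only if the recheck fails as well
```

A transient blip shows up as a failure followed by a pass, which this function treats as healthy.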

Monitoring Is Not Optional

Every website you manage is a promise to a client. That promise includes being available when their customers look for them. SM-01 is how I keep that promise without manually checking 23 URLs every day. It is the simplest agent in my fleet and arguably the most important.
