The Moment Everything Stops
It’s 11 PM on a Wednesday. I’m deploying a WordPress optimization batch across a 5-site cluster running on a single GCP Compute Engine VM. Midway through site three, the SSH connection drops. Not a timeout — a hard refusal. Connection refused. Port 22.
I try again. Same result. I try from a different terminal. Same. I check the GCP Console — the VM shows as running. CPU is at 4%. Memory is fine. The machine is alive but unreachable. SSH is dead and it’s not coming back without intervention.
Most people would stop here, file a support ticket, and go to bed. I didn’t have that luxury. I had three more sites to process and a client deadline in the morning. So I did what any reasonable person with API access and a grudge would do — I built a workaround in real time.
Why SSH Dies on GCP VMs (And Why It’s More Common Than You Think)
SSH failures on Compute Engine instances are not rare. The common causes include firewall rule changes that block port 22, the SSH daemon crashing after a bad package update, disk space filling up (which prevents SSH from writing session files), and metadata server issues that break OS Login or key propagation.
In my case, the culprit was disk space. The optimization scripts had been writing temporary files and logs. The 20GB boot disk — which seemed generous when I provisioned it — had filled to 98%. The SSH daemon couldn’t create a new session file, so it refused all connections. The VM was fine. The services were running. But the front door was locked from the inside.
This is a pattern I’ve seen across dozens of GCP deployments: the VM isn’t down, it’s just unreachable. And the solution isn’t to wait for SSH to magically recover. It’s to have a plan that doesn’t depend on SSH at all.
The GCP API Workaround: Reboot With a Startup Script
GCP Compute Engine exposes a full REST API that lets you manage VMs without ever touching SSH. The key operations: stop an instance, update its metadata (including startup scripts), and start it again. All authenticated via service account or OAuth token.
Here’s the approach I used that Wednesday night:
Step 1: Stop the VM via API. A simple POST to compute.instances.stop. This is a clean shutdown — it sends an ACPI shutdown signal to the guest OS, waits for confirmation, then reports the instance as TERMINATED. Takes about 30-60 seconds.
Step 2: Inject a startup script via metadata. GCP lets you set a startup-script metadata key on any instance. Whatever script you put there runs automatically when the instance boots. I wrote a bash script that does three things: cleans up temp files to free disk space, restarts the SSH daemon, and then resumes the WordPress optimization batch from where it left off.
Step 3: Start the VM. POST to compute.instances.start. The VM boots, runs the startup script, frees the disk space, restarts SSHD, and picks up the work. Total downtime: under 3 minutes.
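Under some assumptions — a valid OAuth2 access token with Compute scope, and hypothetical project, zone, and instance names — the three steps can be sketched against the REST API with nothing but the Python standard library. Note that setMetadata requires the current metadata fingerprint, which is why the instance is fetched before the script is injected:

```python
import json
import urllib.request

API_ROOT = "https://compute.googleapis.com/compute/v1"

# Hypothetical cleanup script; the real one also resumed the WordPress batch.
STARTUP_SCRIPT = """#!/bin/bash
rm -rf /tmp/* /var/tmp/*
journalctl --vacuum-size=200M
systemctl restart ssh
"""

def instance_url(project, zone, instance, verb=""):
    """Build the Compute Engine REST endpoint for an instance or an action on it."""
    url = f"{API_ROOT}/projects/{project}/zones/{zone}/instances/{instance}"
    return f"{url}/{verb}" if verb else url

def api_call(url, token, method="GET", body=None):
    """Authenticated JSON request against the Compute Engine API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def recover(token, project="my-project", zone="us-central1-a", name="wp-vm"):
    # Step 1: clean shutdown via the API -- no SSH involved.
    api_call(instance_url(project, zone, name, "stop"), token, method="POST")
    # ...poll instance status until TERMINATED (omitted for brevity)...

    # Step 2: inject the startup script. setMetadata needs the current
    # fingerprint, so fetch the instance first and merge the items.
    inst = api_call(instance_url(project, zone, name), token)
    meta = inst["metadata"]
    items = [i for i in meta.get("items", []) if i["key"] != "startup-script"]
    items.append({"key": "startup-script", "value": STARTUP_SCRIPT})
    api_call(instance_url(project, zone, name, "setMetadata"), token,
             method="POST",
             body={"fingerprint": meta["fingerprint"], "items": items})

    # Step 3: boot it back up; the startup script runs automatically.
    api_call(instance_url(project, zone, name, "start"), token, method="POST")
```

This is a sketch, not a drop-in tool — production code should poll the returned operations for completion between steps rather than firing the calls blind.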
No SSH required at any point. No support ticket. No waiting until morning.
The Self-Healing Script I Built That Night
After solving the immediate crisis, I turned the workaround into a permanent tool. A Python script that does the following:
Health check: Every 5 minutes, attempt an SSH connection to the VM. If it fails twice consecutively, trigger the recovery sequence. This uses the paramiko library for SSH and the google-cloud-compute library for the API calls.
Recovery sequence: Stop the instance, wait for TERMINATED status, set a cleanup startup script in metadata, start the instance, wait for RUNNING status, verify SSH access returns within 120 seconds. If SSH still fails after reboot, escalate to Slack with full diagnostic output.
Resume logic: The startup script checks for a resume.json file on the persistent disk. This file tracks which sites have been processed and which operation was in progress when the failure occurred. On boot, the script reads this file and picks up from the exact point of failure — not from the beginning of the batch.
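A minimal sketch of that resume logic — file layout and site names are illustrative, not the actual format I used:

```python
import json
import os

RESUME_FILE = "resume.json"  # lives on the persistent disk in the real setup

def load_progress(path=RESUME_FILE):
    """Return the list of already-completed sites, or empty on first run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f).get("done", [])

def mark_done(site, path=RESUME_FILE):
    """Record a completed site, writing atomically so a crash mid-write
    can't corrupt the progress file."""
    done = load_progress(path)
    done.append(site)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done}, f)
    os.replace(tmp, path)

def run_batch(sites, process, path=RESUME_FILE):
    """Process each site once, skipping anything a previous run finished."""
    done = set(load_progress(path))
    for site in sites:
        if site in done:
            continue  # already processed before the failure
        process(site)
        mark_done(site, path)
```

If the batch dies on site three, the next invocation of run_batch skips sites one and two and picks up exactly where the failure happened.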
The entire recovery script is 180 lines of Python. It’s run as a background process on my local machine, watching the VM like a lifeguard watches a pool.
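The watchdog's trigger logic can be sketched as follows — here with a plain TCP probe of port 22 standing in for the full paramiko handshake, and the recovery sequence stubbed out:

```python
import socket

def ssh_port_open(host, port=22, timeout=5):
    """Cheap reachability probe: can we open a TCP connection to sshd?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Watchdog:
    """Fire recovery only after `threshold` consecutive failed checks,
    so a single dropped packet doesn't reboot a healthy VM."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0

    def observe(self, healthy):
        """Record one health check; return True when recovery should run."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

In the real script the loop sleeps five minutes between checks, and a True from observe() kicks off the stop / inject-script / start sequence described above.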
IAP Tunneling: The Backup Access Method
After this incident, I also set up Identity-Aware Proxy (IAP) TCP tunneling as a permanent backup access method. IAP tunneling lets you SSH into a VM through Google’s infrastructure, bypassing standard firewall rules and port 22 entirely.
The command is simple: gcloud compute ssh instance-name --tunnel-through-iap. It works even when port 22 is blocked, because the traffic routes through Google’s IAP service on port 443. The VM doesn’t need a public IP address, and you don’t need any firewall rules allowing SSH.
I should have set this up on day one. It’s now part of my standard VM provisioning checklist — every Compute Engine instance gets IAP tunneling configured before anything else. The extra 5 minutes of setup would have saved me the Wednesday night adventure entirely.
Lessons That Apply Beyond GCP
Never depend on a single access method. SSH is not a guarantee. It’s a service running on a Linux machine, and services fail. Always have a second path — IAP tunneling on GCP, Serial Console on AWS, Bastion hosts, or API-based management. If your only way into a server is SSH, you will eventually be locked out at the worst possible time.
Disk space kills more deployments than bad code. I’ve seen this pattern at companies of every size. Nobody monitors disk space on VMs that “aren’t doing much.” Then a log file grows, or temp files accumulate, and suddenly the machine is functionally dead even though every dashboard says it’s healthy. Set an 80% disk alert on every VM you provision. It takes 30 seconds and prevents hours of debugging.
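That alert needn’t wait for a monitoring product; a cron-able few lines of Python cover the basic case (the 80% threshold and root path are the assumptions here):

```python
import shutil

def disk_usage_pct(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def over_threshold(pct_used, threshold=80.0):
    """True when usage has crossed the alert threshold."""
    return pct_used > threshold

if __name__ == "__main__":
    pct = disk_usage_pct("/")
    if over_threshold(pct):
        # In a real setup this would notify Slack or Cloud Monitoring.
        print(f"DISK ALERT: {pct:.1f}% used")
```

Drop it in a crontab entry and the “mysteriously unreachable VM” class of failure announces itself days before SSH dies.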
Startup scripts are the most underused feature in cloud computing. Every major cloud provider supports them — GCP metadata startup scripts, AWS EC2 user data, Azure custom script extensions. They turn a reboot into a deployment. If your recovery plan is “SSH in and run commands,” your recovery plan fails exactly when you need it most. If your recovery plan is “reboot and let the startup script handle it,” you can recover from anything, from anywhere, including your phone.
Build resume logic into every batch process. If a script processes 10 items and fails on item 7, restarting should begin at item 7, not item 1. This is trivial to implement — write progress to a JSON file after each step — but most people don’t do it until they’ve lost work to a mid-batch failure. I now build resume logic into every automation by default.
Frequently Asked Questions
Can I use the GCP API to manage VMs without the gcloud CLI?
Yes. The Compute Engine REST API is fully documented and works with any HTTP client. You authenticate with an OAuth2 token or service account key, then make standard REST calls. The gcloud CLI is a convenience wrapper — everything it does, the API can do directly. I use Python with the google-cloud-compute library for programmatic access.
How do I prevent disk space issues on GCP VMs?
Three steps: set up Cloud Monitoring alerts at 80% disk usage, add a cron job that cleans temp directories weekly, and size your boot disk with 50% headroom beyond what you think you need. A 30GB disk costs pennies more than 20GB and prevents the most common cause of mysterious SSH failures.
Is IAP tunneling slower than standard SSH?
Marginally. IAP adds about 50-100ms of latency because traffic routes through Google’s proxy infrastructure. For interactive terminal work, you won’t notice the difference. For bulk file transfers, use gcloud compute scp with the --tunnel-through-iap flag and expect about 10-15% slower throughput compared to direct SSH.
What if the VM won’t stop via the API?
If instances.stop hangs for more than 90 seconds, use instances.reset instead. This is a hard reset — equivalent to pulling the power cord. It’s not graceful, but it works when the OS is unresponsive. The startup script still runs on reboot, so your recovery logic kicks in either way.
The Real Takeaway
The Wednesday night SSH failure cost me about 45 minutes, including building the workaround. If it had happened before I understood the GCP API, it would have cost me a full day and a missed deadline. The difference isn’t talent or experience — it’s having built systems that assume failure and recover automatically.
Every server will become unreachable. Every batch process will fail mid-run. Every disk will fill up. The question isn’t whether these things happen. It’s whether your systems are built to handle them without you being the single point of failure at 11 PM on a Wednesday.