
Runbook: Incident Response Checklist

Severity Assessment

| Severity | Definition | Example |
| --- | --- | --- |
| P1 (Critical) | Site down or major feature broken for all users | Homepage 500, database down |
| P2 (High) | Significant feature broken | Search not working, no ads loading |
| P3 (Medium) | Partial degradation | Slow page loads, one page type broken |
| P4 (Low) | Minor issue | Cosmetic bug, admin-only issue |

P1: Site Down

Immediate (0–5 minutes)

  • [ ] Verify the site is actually down: curl -I https://www.dezeen.com/
  • [ ] Check Cloudflare status: cloudflarestatus.com
  • [ ] Check if it's a specific node: test each web node IP directly
  • [ ] Check Varnish backend health: varnishadm "backend.list"
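
The per-node check above can be scripted so that each node is tested in the same way. The following is a minimal sketch, not the team's actual tooling: the IP addresses are documentation placeholders rather than the real node addresses, and it assumes each node accepts HTTPS on port 443 with a certificate valid for www.dezeen.com. If the nodes only serve plain HTTP behind Varnish, switch the scheme and port accordingly.

```shell
# Placeholder IPs (TEST-NET range), NOT the real node addresses.
NODES="${NODES:-192.0.2.11 192.0.2.12 192.0.2.13 192.0.2.14}"
HOST="www.dezeen.com"

# Map an HTTP status code to a coarse health verdict.
classify_status() {
  case "$1" in
    2??|3??) echo "OK" ;;
    000)     echo "UNREACHABLE" ;;   # curl reports 000 when it received no response
    5??)     echo "SERVER_ERROR" ;;
    *)       echo "CHECK" ;;
  esac
}

# Request the homepage from one node; --resolve pins the hostname to that IP,
# so the request bypasses Cloudflare and DNS.
probe_node() {
  local ip="$1" code
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' \
         --resolve "${HOST}:443:${ip}" "https://${HOST}/")
  echo "${ip} ${code} $(classify_status "${code}")"
}

# Probe every node in NODES and print one result line per node.
check_all_nodes() {
  local ip
  for ip in ${NODES}; do probe_node "${ip}"; done
}
```

Run `check_all_nodes` after setting `NODES` to the real addresses. A single node reporting `SERVER_ERROR` or `UNREACHABLE` while the others report `OK` points to a node-specific fault rather than a site-wide one.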

Diagnose (5–15 minutes)

  • [ ] SSH into web nodes (WS1–WS4) and check Apache status: systemctl status apache2
  • [ ] Check PHP errors: tail -n 100 /var/log/apache2/error.log
  • [ ] Check database: mysql -u root -p -e "SHOW PROCESSLIST"
  • [ ] Check disk space: df -h
  • [ ] Check memory: free -m
  • [ ] Check WordPress debug log: tail -n 100 wp-content/debug.log
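
When checking several nodes in a row, it can help to reduce the disk and memory output to a single verdict per node. The helpers below are an illustrative sketch: the 90% threshold is an assumed default rather than a documented limit, and `mem_available_pct` assumes a `free` version that prints the "available" column (procps-ng 3.3.10 or later).

```shell
# Print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; the threshold defaults to 90 (assumption).
full_mounts() {
  awk -v limit="${1:-90}" \
    'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6, $5 }'
}

# Print the percentage of RAM still available as a whole number.
# Reads `free -m` output on stdin; field 7 of the "Mem:" row is "available".
mem_available_pct() {
  awk '/^Mem:/ { printf "%d\n", ($7 / $2) * 100 }'
}
```

Typical use on a node is `df -P | full_mounts 90` and `free -m | mem_available_pct`. An empty result from the first command means no filesystem has reached the threshold.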

Resolve

  • [ ] Apache down: systemctl restart apache2
  • [ ] MySQL down: systemctl restart mysql
  • [ ] Memory exhaustion: Kill runaway processes; increase limits
  • [ ] Disk full: Clear WP Rocket cache, old logs, temporary files
  • [ ] Bad deployment: Redeploy last known good commit via DeployHQ
  • [ ] Plugin crash: Disable the problem plugin via WP-CLI: wp plugin deactivate <plugin-name>

Post-Incident

  • [ ] Purge all caches (see cache-purge.md)
  • [ ] Verify all 4 web nodes are healthy
  • [ ] Monitor for 30 minutes
  • [ ] Write incident report

P2: Feature Broken

Search Down

  • [ ] Check Algolia dashboard for API status
  • [ ] Verify ALGOLIA_APPLICATION_ID and API keys
  • [ ] Check which Algolia plugin is active (only one PHP variant should be active at a time)
  • [ ] Test search endpoint directly: curl https://www.dezeen.com/wp-json/...

Ads Not Loading

  • [ ] Check Google Ad Manager status
  • [ ] Verify Cookiebot is not blocking ad scripts
  • [ ] Check browser console for JS errors
  • [ ] Test in incognito / with consent accepted

Comments Not Loading

  • [ ] Check Disqus status: status.disqus.com
  • [ ] Verify Disqus plugin is active
  • [ ] Check for JavaScript errors in browser console

Newsletter Forms Broken

  • [ ] Test the Campaign Monitor (CM) API: POST to dezeen-campaign-monitor/v1/test
  • [ ] Check Campaign Monitor service status
  • [ ] Verify API keys in wp-config.php

P3: Performance Degradation

  • [ ] Check Query Monitor for slow queries
  • [ ] Check Varnish hit rate: varnishstat
  • [ ] Check Cloudflare analytics for traffic spike
  • [ ] Review recent deployments for performance regression
  • [ ] Check if a heavy admin operation is running (bulk edit, import)
  • [ ] Run wp cron event list to check for stuck cron jobs
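
varnishstat reports raw counters rather than a hit rate, so the ratio has to be computed. The sketch below assumes the Varnish 4+ counter names `MAIN.cache_hit` and `MAIN.cache_miss`. Both counters are cumulative since varnishd started, so the result is a long-run average and a recent drop may barely move it.

```shell
# Compute the cache hit rate (percent) from `varnishstat -1` output on stdin.
# Field 1 is the counter name, field 2 is its cumulative value.
hit_rate() {
  awk '$1 == "MAIN.cache_hit"  { hit = $2 }
       $1 == "MAIN.cache_miss" { miss = $2 }
       END {
         if (hit + miss > 0) printf "%.1f\n", 100 * hit / (hit + miss)
         else print "n/a"
       }'
}
```

Usage on the Varnish host: `varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss | hit_rate`. Comparing two readings taken a few minutes apart gives a better picture of the current rate than a single reading.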

Communication Template

[TIMESTAMP] - Dezeen.com Incident

Status: [Investigating / Identified / Resolved]
Severity: P[1-4]
Impact: [Description of user impact]
Cause: [Known cause or "Under investigation"]
ETA: [Expected resolution time]
Actions taken: [List of actions]
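
To keep updates consistent under pressure, the template can be filled in from the command line. This is a small sketch; the function name, the `TIMESTAMP` override variable, and the choice of UTC are assumptions made for illustration.

```shell
# Render an incident update using the template fields.
# Arguments: status severity impact cause eta actions
# TIMESTAMP is a hypothetical override; by default the current UTC time is used.
incident_update() {
  printf '[%s] - Dezeen.com Incident\n\n' \
    "${TIMESTAMP:-$(date -u '+%Y-%m-%d %H:%M UTC')}"
  printf 'Status: %s\n' "$1"
  printf 'Severity: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Cause: %s\n' "${4:-Under investigation}"
  printf 'ETA: %s\n' "${5:-Unknown}"
  printf 'Actions taken: %s\n' "$6"
}
```

For example: `incident_update Investigating P1 "Homepage returns 500" "" "" "Restarted Apache on WS1"`. Empty cause and ETA arguments fall back to "Under investigation" and "Unknown".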

Key Contacts

| Role | Contact | Notes |
| --- | --- | --- |
| Hosting (Jelastic) | | Check Enscale dashboard |
| CDN (Cloudflare) | | Check dashboard / status page |
| Outgoing team | | Available until October 2026 |
| Stakeholders | | Notify for P1/P2 incidents |

Rollback Procedure

  1. Open DeployHQ
  2. Find the last successful deployment
  3. Click "Redeploy" to restore previous version
  4. Alternatively: git revert <commit> and push to master
  5. Purge all caches after rollback

Gotchas During Incidents

  • Don't restart all nodes at once: keep at least 2 nodes in rotation
  • Check the admin server separately: admin.dezeen.com is on a different server
  • Database changes are not versioned: SQL rollback requires a backup restore
  • Varnish takes time to recover: after a node comes back, expect a 5–15 second delay before traffic routes to it
  • Cloudflare may mask the issue: if Cloudflare is serving cached pages, users may see old content even when the origin is down
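
The first and fourth points can be combined into a rolling restart that handles one node at a time and waits out the Varnish recovery delay before moving on. This is a sketch under stated assumptions: the node names are hypothetical SSH host aliases for WS1–WS4, the operator has sudo rights on each node, and "healthy" here only means that Apache reports itself as active.

```shell
# Hypothetical SSH host aliases; substitute the real hosts for WS1–WS4.
NODES="${NODES:-ws1 ws2 ws3 ws4}"

# Restart Apache on one node. Assumes SSH access and sudo rights.
restart_node() { ssh "$1" 'sudo systemctl restart apache2'; }

# A node counts as healthy when systemd reports Apache as active.
node_healthy() { ssh "$1" 'systemctl is-active --quiet apache2'; }

# Restart nodes one at a time and stop at the first failure,
# so the remaining nodes stay in rotation untouched.
rolling_restart() {
  for node in ${NODES}; do
    echo "Restarting ${node}"
    restart_node "${node}" || { echo "Restart failed on ${node}, stopping" >&2; return 1; }
    # Wait for Varnish to route traffic back (the 5-15 second delay noted above).
    sleep "${SETTLE_SECONDS:-15}"
    node_healthy "${node}" || { echo "${node} is unhealthy, stopping" >&2; return 1; }
  done
}
```

Because the loop aborts on the first failure, at most one node is out of rotation at any point, which is stricter than the two-node minimum in the list.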