What Uptime Monitoring Actually Means and Why It Matters

A website that goes down without anyone noticing is still a problem. The longer an outage lasts, the more it affects visitors, transactions, and trust. Uptime monitoring runs checks against your server and services at regular intervals and alerts you when something stops responding correctly, so the delay before you find out is bounded by the check interval rather than by how long it takes a visitor to complain.

This is different from general server monitoring with tools like htop and Netdata, which focus on resource usage such as CPU load, memory consumption, and active processes. Uptime monitoring is specifically about service availability. Both approaches are useful, and many production setups use both together.

The core question uptime monitoring answers is simple: is my service reachable, and is it returning the right response?

Why Waiting for Users to Report Problems Is a Poor Strategy

Most visitors who encounter a broken website do not report it. Some assume the problem is on their end. Some leave immediately and do not return. Some vent on social media rather than reaching out directly. By the time you hear about an outage, it has typically been running long enough to affect a meaningful portion of your audience.

Automated monitoring catches problems that users do not report. Intermittent failures that resolve before a visitor decides to complain, degraded performance that slows pages without making them completely unavailable, and regional issues that affect only certain network providers or geographic areas are all visible through active monitoring but invisible to passive observation.

For a production website, this is not optional. It is a basic operational requirement.

A Simple Cron-Based Uptime Monitor

For a single server or a small number of services, a bash script run by cron provides effective monitoring without installing additional software. The script checks whether each service responds correctly and sends an alert only when something fails.

#!/bin/bash

# uptime_monitor.sh - checks if services are responding

check_service() {
    local url="$1"
    local name="$2"
    local expected_code="${3:-200}"

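    # curl prints 000 when no HTTP response arrives (timeout, DNS failure, refused connection)
    local response
    response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")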

    if [ "$response" != "$expected_code" ]; then
        echo "ALERT: $name (expected $expected_code, got $response)"
        return 1
    else
        echo "OK: $name ($response)"
        return 0
    fi
}

# Check main website and API endpoints
check_service "https://example.com" "Main Website" "200"
check_service "https://api.example.com/health" "API Health Endpoint" "200"
check_service "https://shop.example.com" "Shop" "200"

Run this script every five minutes from cron:

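*/5 * * * * out=$(/root/scripts/uptime_monitor.sh 2>&1 | grep ALERT) && echo "$out" | mail -s "Uptime Alert on $(hostname)" admin@example.com

The grep ALERT filter keeps only the failure lines, and because grep exits non-zero when nothing matches, the && ensures mail runs only when at least one check failed. Piping grep straight into mail would not achieve this, since many mail implementations send an empty message when given empty input. When all services are healthy, the OK lines are discarded and no email is sent. The address shown is a placeholder; substitute your own.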
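
One gap remains with a five-minute schedule: an outage that lasts an hour produces twelve near-identical emails. A possible refinement is to remember each service's last known state and alert only when it changes. The sketch below assumes a writable state directory. STATE_DIR and state_changed are hypothetical names introduced for illustration, not part of the script above.

```shell
#!/bin/bash
# Sketch: detect a change in a service's status between runs.
# STATE_DIR and state_changed are illustrative names (assumptions).

STATE_DIR="${STATE_DIR:-/var/tmp/uptime_state}"

# state_changed NAME STATUS
# Returns 0 if STATUS differs from the status stored for NAME and records
# the new value. Returns 1 if nothing changed. A service with no stored
# state is assumed to have been UP.
state_changed() {
    local name="$1"
    local status="$2"
    local file previous

    mkdir -p "$STATE_DIR" || return 2

    # Replace characters that are awkward in file names
    file="$STATE_DIR/$(printf '%s' "$name" | tr -c 'A-Za-z0-9._-' '_')"

    previous="UP"
    if [ -f "$file" ]; then
        previous=$(cat "$file")
    fi

    if [ "$previous" = "$status" ]; then
        return 1
    fi

    printf '%s\n' "$status" > "$file"
    return 0
}
```

Called as state_changed "Main Website" "DOWN" after a failed check, it returns 0 only on the first failure, so the mail step can be guarded with it. Calling it with "UP" after a successful check gives a natural place to send a recovery notice.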

Adding Retry Logic to Reduce False Positives

Network glitches occasionally cause a single check to fail even when the service is running fine. This creates noise and can desensitise you to real alerts. A script that retries before alerting keeps most transient network issues from triggering unnecessary notifications.

#!/bin/bash

check_service() {
    local url="$1"
    local name="$2"
    local max_attempts=3
    local attempt=1

    while [ $attempt -le $max_attempts ]; do
        http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 "$url")
        curl_exit=$?

        if [ $curl_exit -eq 0 ] && [ "$http_code" = "200" ]; then
            return 0
        fi

        echo "Attempt $attempt/$max_attempts failed for $name (HTTP $http_code)"
        # Pause between attempts, but not after the final one
        [ $attempt -lt $max_attempts ] && sleep 5
        attempt=$((attempt + 1))
    done

    echo "ALERT: $name is DOWN after $max_attempts attempts"
    return 1
}

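log_status() {
    local status="$1"
    local message="$2"
    # Include the status so the log can be filtered with grep
    echo "$(date '+%Y-%m-%d %H:%M:%S') $status: $message" >> /var/log/uptime_monitor.log
}

# Main checks
if ! check_service "https://example.com" "Main Website"; then
    log_status "DOWN" "Main Website failed"
    # mail reads the message body from stdin, so supply one explicitly
    echo "Main Website failed 3 consecutive checks" | mail -s "Main Website Down on $(hostname)" admin@example.com
else
    log_status "UP" "Main Website OK"
fi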

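The script waits five seconds between attempts, so a service must fail three checks spread over at least ten seconds before an alert fires. Most network hiccups resolve within that window, which keeps the signal-to-noise ratio high. The trade-off is that a real outage lasting only a few seconds can also go unreported.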
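
The retry loop already captures curl's exit status, which can make an alert more informative than the 000 HTTP code curl prints when no response arrives. A small sketch follows, using exit codes documented in the curl man page. describe_curl_exit is a hypothetical helper name, and the wording of each message is an assumption.

```shell
# describe_curl_exit CODE
# Translates a curl exit status into a short human-readable reason.
describe_curl_exit() {
    case "$1" in
        0)  echo "request completed" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect" ;;
        28) echo "operation timed out" ;;
        35) echo "TLS handshake failed" ;;
        60) echo "certificate verification failed" ;;
        *)  echo "curl error $1" ;;
    esac
}
```

Inside the loop, the retry message could then end with $(describe_curl_exit "$curl_exit"), which distinguishes a DNS failure from a timeout at a glance.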

Managing Multiple Services with a Configuration File

As the number of monitored services grows, hardcoding URLs directly into the script becomes difficult to maintain. A simple configuration file lets you add, remove, or modify monitored services without touching the script logic.

Create a configuration file at /root/scripts/services.conf:

# Format: URL|EXPECTED_CODE|ALERT_EMAIL
# Skip lines starting with # and empty lines

https://example.com|200|admin@example.com
https://api.example.com/health|200|admin@example.com
https://shop.example.com|200|admin@example.com

The monitoring script reads this file and processes each service:

#!/bin/bash

CONFIG="/root/scripts/services.conf"
LOG_FILE="/var/log/uptime_monitor.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LOG_FILE"
}

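send_alert() {
    local service="$1"
    local email="$2"
    # Pipe a body into mail; otherwise it would read stdin, which inside
    # check_all is the config file, and swallow the remaining entries
    echo "$service failed its check at $(date '+%Y-%m-%d %H:%M:%S')" | mail -s "Alert: $service on $(hostname)" "$email"
}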

check_all() {
    local failed=0

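    # The extra test keeps a final line that lacks a trailing newline from being skipped
    while IFS='|' read -r url expected_code alert_email || [ -n "$url" ]; do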
        # Skip comments and empty lines
        [[ "$url" =~ ^# ]] && continue
        [[ -z "$url" ]] && continue

        http_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 "$url")
        curl_exit=$?

        if [ $curl_exit -ne 0 ] || [ "$http_code" != "$expected_code" ]; then
            log "DOWN: $url (expected $expected_code, got $http_code, curl exit $curl_exit)"
            send_alert "$url" "$alert_email"
            failed=$((failed + 1))
        else
            log "UP: $url"
        fi
    done < "$CONFIG"

    return $failed
}

check_all

Adding a new service means adding a single line to the configuration file. No script changes are required.
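
Because a malformed line would otherwise surface only as a confusing alert, it can help to validate the file before the checks run. The following is a sketch under the URL|EXPECTED_CODE|ALERT_EMAIL format shown above. validate_line is a hypothetical helper, and its checks are deliberately loose.

```shell
# validate_line LINE
# Returns 0 for a comment, a blank line, or a well-formed entry; 1 otherwise.
validate_line() {
    local line="$1"
    local url code email extra

    case "$line" in
        ''|'#'*) return 0 ;;
    esac

    # Split on '|' into three fields; anything further lands in $extra
    IFS='|' read -r url code email extra <<EOF
$line
EOF

    [ -z "$extra" ] || return 1

    # URL must start with an HTTP scheme
    case "$url" in
        http://*|https://*) ;;
        *) return 1 ;;
    esac

    # Expected code must be exactly three digits
    case "$code" in
        [0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac

    # Loose email check: something@something
    case "$email" in
        ?*@?*) ;;
        *) return 1 ;;
    esac

    return 0
}
```

Running each line of services.conf through this function at the start of the script, and logging any line that fails, turns a silent typo into an explicit message.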

Monitoring DNS Resolution Separately

HTTP checks alone do not cover every failure mode. DNS issues can make a site completely unreachable even when the web server is functioning correctly. If DNS resolution fails or points to the wrong IP address, visitors cannot reach your site regardless of how well your web server responds.

check_dns() {
    local domain="$1"
    local expected_ip="${2:-}"

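    # A CNAME makes dig print the alias target first, so keep only the first IPv4 line
    resolved_ip=$(dig +short "$domain" A | grep -E -m1 '^[0-9]{1,3}(\.[0-9]{1,3}){3}$')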

    if [ -z "$resolved_ip" ]; then
        echo "ALERT: DNS resolution failed for $domain"
        return 1
    fi

    if [ -n "$expected_ip" ] && [ "$resolved_ip" != "$expected_ip" ]; then
        echo "ALERT: DNS mismatch for $domain (expected $expected_ip, got $resolved_ip)"
        return 1
    fi

    echo "OK: $domain resolves to $resolved_ip"
    return 0
}

# The second argument is the address you expect; substitute your own server's IP
check_dns "example.com" "93.184.216.34"

DNS changes are rare compared to web server issues, so running DNS checks every 15 to 30 minutes is usually sufficient. Checking less frequently keeps the monitoring lightweight while still catching DNS problems before they cause extended outages.
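
Domains served through a load balancer or CDN often publish several A records, and resolvers may return them in any order, so comparing a single returned address can raise false mismatches. One possible variant checks whether the expected address appears anywhere in the answer. ip_in_records is a hypothetical helper that reads dig +short output on standard input, which also keeps it testable without a network.

```shell
# ip_in_records EXPECTED_IP
# Reads resolver output (one record per line) from stdin and returns 0
# if EXPECTED_IP is among the lines, 1 otherwise.
ip_in_records() {
    local expected="$1"
    # -x matches whole lines only, -F treats the address as a literal string
    grep -qxF -- "$expected"
}
```

A call would look like dig +short example.com A | ip_in_records "192.0.2.10", where the address is a documentation-range placeholder.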

Tracking SSL Certificate Expiry

An expired SSL certificate causes modern browsers to show a full-page security warning, and on sites that use HSTS visitors cannot click through it at all. Rather than discovering an expired certificate when visitors start complaining, monitor expiry proactively and renew before the deadline arrives.

#!/bin/bash

check_ssl_expiry() {
    local domain="$1"
    local warn_days="${2:-30}"

    expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain":443 2>/dev/null | \
        openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

    if [ -z "$expiry_date" ]; then
        echo "ERROR: Could not retrieve SSL certificate for $domain"
        return 1
    fi

    # Whole days between now and the expiry timestamp (date -d requires GNU date)
    days_until_expiry=$(( ( $(date -d "$expiry_date" +%s) - $(date +%s) ) / 86400 ))

    if [ "$days_until_expiry" -lt "$warn_days" ]; then
        echo "ALERT: SSL certificate for $domain expires in $days_until_expiry days ($expiry_date)"
        return 1
    else
        echo "OK: $domain SSL certificate valid for $days_until_expiry days"
        return 0
    fi
}

check_ssl_expiry "example.com" 30

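Run this script daily via cron. A 30-day warning threshold gives adequate time to investigate renewal issues and complete the process before the certificate lapses. Certificate authorities that support automated issuance, such as Let's Encrypt, can be paired with a client like Certbot to handle renewals without manual intervention. Certbot typically renews once a certificate is within 30 days of expiry, so on auto-renewed sites a somewhat lower threshold avoids alerts for renewals that are about to happen anyway.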
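
A single threshold treats a certificate with 29 days left the same as one with 2 days left. A possible refinement is to grade the alert by urgency. The 30-day warning tier matches the default above. The 7-day critical tier and the name ssl_severity are assumptions made for illustration.

```shell
# ssl_severity DAYS_LEFT [WARN_DAYS] [CRIT_DAYS]
# Prints OK, WARNING, CRITICAL, or EXPIRED for the given number of days.
ssl_severity() {
    local days="$1"
    local warn="${2:-30}"
    local crit="${3:-7}"

    if [ "$days" -lt 0 ]; then
        echo "EXPIRED"
    elif [ "$days" -lt "$crit" ]; then
        echo "CRITICAL"
    elif [ "$days" -lt "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

The result could be placed in the email subject so that a CRITICAL message stands out from a routine WARNING.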

Including Disk Space in Your Monitoring

Running out of disk space causes services to fail in unpredictable ways. Databases stop accepting writes, log files cannot be created, and applications crash without clear error messages. Adding a disk space check to your monitoring script catches this before it causes a production incident.

check_disk_space() {
    local threshold="${1:-90}"

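    # -P forces POSIX output so a long device name cannot wrap onto a second line
    usage=$(df -P / | tail -1 | awk '{print $5}' | tr -d '%')

    if [ "$usage" -gt "$threshold" ]; then
        echo "ALERT: Disk usage at ${usage}% (threshold: ${threshold}%)"
        # Supply a body on stdin so mail does not wait for input
        df -h / | mail -s "WARNING: $(hostname) disk at ${usage}%" admin@example.com
    else
        echo "OK: Disk usage at ${usage}%"
    fi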
}

check_disk_space 90

Set the threshold based on your typical usage patterns. If your server routinely sits at 75% disk usage, a 90% threshold gives early warning. If you typically run at 40%, you can set a lower threshold without triggering false alerts.
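
The function above inspects only the root filesystem, while databases and logs often live on separate mounts such as /var. A sketch of one way to cover every mount: the hypothetical over_threshold helper parses df -P output from standard input, so it can be fed live data or a saved sample. It assumes mount points contain no spaces.

```shell
# over_threshold THRESHOLD
# Reads `df -P` output from stdin and prints "MOUNT PERCENT" for each
# filesystem whose usage exceeds THRESHOLD. Returns 1 if any were printed.
over_threshold() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 > limit + 0) {
                print $6, use
                found = 1
            }
        }
        END { exit found ? 1 : 0 }
    '
}
```

A typical call would be df -P | over_threshold 90. On GNU systems, adding -x tmpfs -x devtmpfs to df leaves out in-memory filesystems that would otherwise clutter the result.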

Scheduling Checks at Appropriate Intervals

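Different monitoring checks suit different frequencies. HTTP and API health checks should run frequently enough to catch outages quickly, while DNS and SSL checks can run less often since those values change infrequently. The entries below send all output to the log file, so any check that only prints its ALERT line, as the DNS and SSL functions above do, needs its own mail step before it can notify anyone.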

# /etc/cron.d/uptime-monitoring

# HTTP checks every 5 minutes
*/5 * * * * root /root/scripts/http_monitor.sh >> /var/log/uptime_monitor.log 2>&1

# DNS checks every 30 minutes
*/30 * * * * root /root/scripts/dns_monitor.sh >> /var/log/uptime_monitor.log 2>&1

# Disk checks every 2 hours
0 */2 * * * root /root/scripts/disk_monitor.sh >> /var/log/uptime_monitor.log 2>&1

# SSL expiry check daily at 9am
0 9 * * * root /root/scripts/ssl_expiry_check.sh >> /var/log/uptime_monitor.log 2>&1

Use a dedicated log file and rotate it to prevent it from growing indefinitely. Add this to /etc/logrotate.d/uptime-monitoring:

/var/log/uptime_monitor.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}

This configuration keeps seven days of logs compressed on disk, which is usually enough to identify when an issue started and what preceded it.
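
When reviewing an incident, a quick summary of the log is often more useful than reading it line by line. The sketch below assumes the timestamped "DOWN: URL" line format written by the log function in the configuration-driven script. down_summary is a hypothetical helper name.

```shell
# down_summary
# Reads the monitor log from stdin and prints, for each URL, the number of
# failed checks and the timestamp of the first failure.
down_summary() {
    awk '
        $3 == "DOWN:" {
            url = $4
            if (!(url in count)) {
                first[url] = $1 " " $2
                order[++n] = url
            }
            count[url]++
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i], count[order[i]], first[order[i]]
        }
    '
}
```

Run it as down_summary < /var/log/uptime_monitor.log. Lines in other formats, such as plain OK or ALERT output from the other scripts, are ignored.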

When to Add a Third-Party Monitoring Service

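A self-hosted monitoring script works well for internal checks and small deployments. It has a fundamental limitation, though: it shares the fate of the server it runs on. If that server crashes, the checks stop and no alert is sent. If it loses network connectivity, every check fails even though the services may be running fine, and the alert email cannot leave the machine either. In both cases you hear nothing, which is the blind spot.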

Third-party monitoring services run checks from multiple geographic locations. They can alert you when your entire server is unreachable, not just when individual services fail. Popular options include UptimeRobot, Pingdom, and HetrixTools, which provide HTTP monitoring, DNS checks, SSL validation, and alerting through email, SMS, and webhooks.

A practical approach is to run your own monitoring for fast, internal checks and use a third-party service as an independent layer that verifies your public-facing services are actually reachable from the outside. This hybrid approach catches issues your internal monitoring cannot see, including complete server connectivity failures.

Putting It All Together

The monitoring scripts described here share a common pattern. They send no alerts during normal operation, notify you only when something requires attention, and log their activity for later review. This keeps noise low while maintaining visibility.

When setting up monitoring for the first time, document which services you are monitoring, what thresholds you have configured, and what each alert should trigger. It is worth testing the alerting mechanism periodically by temporarily lowering a threshold or by verifying that alert emails arrive in your inbox.

Back up your monitoring scripts and configuration files along with your other server configurations. If you ever need to rebuild or migrate your monitoring setup, having the scripts in version control or documented somewhere accessible speeds up the recovery process.