Server Monitoring with htop and Netdata: Spotting Problems Before They Happen
Server problems rarely announce themselves. A web application slows down, customers complain, and by the time someone logs in to check what is happening, the disk is full or the database has been swapping for twenty minutes. Server monitoring tools exist to close that gap. When set up correctly, they show you what is happening inside a server before the symptoms become problems that affect users.
Two tools that work well for this purpose are htop and Netdata. Both run on Ubuntu and other Linux distributions, both are free and open source, and both give you a real-time view of what a server is doing. They serve different purposes and they complement each other. This guide covers how to read what each tool is telling you, which numbers matter most, and how to act on the information before a slow server becomes an outage.
What htop Shows You About Your Server
htop is an interactive process viewer. It displays running processes, CPU usage per core, memory consumption, and swap usage in a continuously updating screen. It is more readable than the standard top command because it uses colour and a more intuitive layout.
When you run htop on an Ubuntu server, the display is split into several areas. The top section shows CPU cores across the screen, each displayed as a bar that fills as the core is used. Below that, memory and swap bars show how much RAM is in use and how much swap space has been allocated. The main body of the screen lists running processes, sorted by CPU usage by default.
The key parts of the htop display are:
- CPU bars: One bar per core at the top. In the default colour scheme, green is time spent in normal user processes, red is kernel (system) time, and blue is low-priority (niced) processes. If every bar stays close to full, the CPU is overloaded and requests are queuing.
- Mem bar: Shows physical RAM. This is the number to watch most carefully on a server that runs databases or PHP applications. When this approaches capacity, the system starts moving data to swap.
- Swap bar: When the RAM bar is full, the system writes data to disk to free RAM. Swap usage that is climbing continuously is a sign that the server is running out of memory and performance will degrade.
- Load average: Three numbers shown in the header, labelled "load average". The first is the average load over the last minute, the second over five minutes, the third over fifteen minutes. On Linux, load counts the processes that are running, waiting for a CPU, or blocked on uninterruptible I/O. A load of 4 on a 4-core server means roughly all cores are busy. A load of 8 on a 4-core server means that, on average, about four processes are waiting behind the four that are running.
- Process list: Running processes sorted by CPU or memory usage. Press F6 to change the sort field. Sorting by memory usage identifies which process is consuming RAM. Sorting by CPU identifies which process is generating load.
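The load-average rule of thumb above can be checked from the shell. The following is a minimal sketch, assuming a Linux system with /proc/loadavg and nproc available; load_per_core is a hypothetical helper name, not part of htop.

```shell
# load_per_core (hypothetical helper): divide a load average by the core count.
# $1 = load average, $2 = number of cores
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live value: the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
fi
```

A result around 1.00 corresponds to all cores being busy; a value that stays well above 1.00 suggests work is queuing.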
Running htop is straightforward. On Ubuntu, install it with:
sudo apt update && sudo apt install htop
Then run it with:
htop
htop also accepts command-line options. For example, to refresh every two seconds in monochrome mode:
htop -d 20 -C
The -d flag sets the delay between updates in tenths of a second and requires a value, so -d 20 means two seconds. The -C flag (--no-color) starts htop in monochrome mode, which is useful on terminals that render colour poorly.
Interpreting CPU Usage Patterns in htop
Not all high CPU usage is a problem. Understanding what normal looks like is the first step to spotting abnormal. A web server that handles mostly static files will show low CPU usage except when generating dynamic content or running PHP scripts. A database server will show higher CPU when running complex queries, but the usage should drop back down when the query completes.
Signs that CPU usage is a problem:
- Sustained high usage across all cores: Occasional spikes are normal, especially during backups, log rotations, or scheduled tasks. Sustained usage above 80 percent on all cores means the server is overloaded and requests are queuing.
- Load average climbing and not dropping: If the one-minute load average is climbing and the five-minute load average follows, the server is accumulating work it cannot process fast enough.
- A single process at the top of the list with consistently high CPU: This pinpoints the source. Run htop, press F6 to sort by CPU, then press F5 to show the process tree and see whether the high-CPU process has child processes.
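To identify what a high-CPU process actually is, read its entry in the Command column, which shows the full command line including any arguments. If the line is cut off, scroll right with the arrow keys; in htop 3.0 and later, pressing w on the selected process shows the wrapped command line in a separate screen. A PHP-FPM worker consuming high CPU might indicate a slow or infinite loop in application code. A MySQL process consuming high CPU might indicate a query that is missing an index.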
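When an interactive session is not practical, for example in a script or over a slow connection, a similar list can be produced with ps. This is a minimal sketch, assuming the procps version of ps that ships with Ubuntu (the --sort option is not available in every ps implementation).

```shell
# Five busiest processes by CPU, with parent PID and full command line.
# Note: ps reports %CPU averaged over each process's lifetime, so a
# long-running process that only recently became busy may rank lower
# here than it does in htop.
ps -eo pid,ppid,pcpu,pmem,args --sort=-pcpu | head -n 6
```

The first output line is the column header, which is why head keeps six lines rather than five.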
When dealing with database performance issues that show up in CPU usage, it is worth reviewing how your database queries are structured. Poorly indexed queries can cause the database to work harder than necessary, and this often manifests as elevated CPU on the database server. A database indexing strategy can help identify which queries benefit most from additional indexes.
Reading Memory Usage and Swap in htop
Memory is where most server performance problems originate. Servers that run out of RAM slow down because the operating system moves data to disk (swap), and disk access is orders of magnitude slower than RAM access. Watching the Mem and Swap bars in htop is the most important habit a server operator can develop.
Understanding the Mem bar:
- Total RAM: The full length of the Mem bar represents all installed RAM. A 4 GB bar means 4 GB of RAM is installed.
- Used memory: The coloured portion of the bar. Linux uses otherwise idle RAM for disk caching to improve I/O performance, so some used memory is normal and beneficial. htop draws application memory, buffers, and cache as separate coloured segments, so the amount actually used by applications is the bar minus the buffer and cache segments.
- Free memory: The empty portion of the bar. A server with almost no free memory is not necessarily a problem if the cached portion is large, but if the cached portion is small and used is high, the server is constrained.
Understanding when swap is a problem:
- Zero swap used: Normal. Swap is a safety net, not a working memory.
- Small amount of swap used: Some swap usage is not unusual on idle servers. Linux may swap out rarely-used application memory to make more room for disk cache.
- Swap bar growing continuously: This is a problem. The server is trying to use more memory than is physically available. Applications will slow down noticeably. Identify the process consuming RAM with htop and either restart it, optimise its memory usage, or add more RAM.
- Swap usage equal to or greater than physical RAM: Critical. The server is severely memory-constrained. The database or application will be extremely slow. Immediate action is required.
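The same RAM and swap figures can be read outside htop from /proc/meminfo. The sketch below assumes a Linux kernel recent enough to report MemAvailable (3.14 or later); mem_summary is a hypothetical helper name, and "used" here means RAM that is not available to applications, so reclaimable cache is not counted as used.

```shell
# mem_summary (hypothetical helper): read /proc/meminfo-style text on stdin
# and print the share of RAM unavailable to applications and of swap in use.
mem_summary() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { stotal = $2 }
        /^SwapFree:/     { sfree = $2 }
        END {
            if (total > 0)
                printf "ram_used_pct=%.0f\n", (total - avail) * 100 / total
            if (stotal > 0)
                printf "swap_used_pct=%.0f\n", (stotal - sfree) * 100 / stotal
            else
                print "swap_used_pct=0"
        }'
}

if [ -r /proc/meminfo ]; then
    mem_summary < /proc/meminfo
fi
```

Running it a few minutes apart gives a rough view of whether swap usage is stable or growing.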
When managing database servers, paying close attention to memory usage becomes especially important. Database engines like MySQL and PostgreSQL can consume significant RAM when handling large datasets, and insufficient memory often leads to the swap issues described above. If you are running MySQL and need to manage the database through a web interface, installing and securing phpMyAdmin on Ubuntu provides a convenient way to monitor database performance and spot memory-related issues.
What Netdata Adds Beyond htop
htop shows what is happening right now. Netdata shows what has been happening over time and presents a much broader picture of system health. Netdata is a distributed monitoring agent that collects hundreds of metrics from a server, stores them locally (in memory or, in recent versions, in an on-disk database), and presents them through a web interface.
Installing Netdata on Ubuntu is one command:
wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh
After installation, Netdata starts collecting data immediately and the dashboard is available on port 19999. Access it by browsing to http://your-server-ip:19999.
Security note: By default, Netdata listens on all network interfaces. Before exposing it publicly, you should restrict access using firewall rules or authentication. In production environments, it is best to bind Netdata to localhost or access it through a VPN.
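One way to confirm how the dashboard is exposed is to inspect the listening sockets. The following is a minimal sketch, assuming the ss utility from iproute2 and the default port 19999; netdata_exposure is a hypothetical helper name. If the result shows an exposed address, the usual remedy is the bind to setting in the [web] section of netdata.conf, or a firewall rule.

```shell
# netdata_exposure (hypothetical helper): read `ss -tln` output on stdin and
# report whether port 19999 is bound to a loopback address only.
netdata_exposure() {
    awk -v port=19999 '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            addr = $4
            sub(":" port "$", "", addr)      # strip the port, keep the address
            if (addr == "127.0.0.1" || addr == "[::1]")
                print "local-only " addr
            else
                print "exposed " addr
        }'
}

if command -v ss >/dev/null 2>&1; then
    ss -tln | netdata_exposure
fi
```

No output means nothing is listening on that port; an "exposed" line such as 0.0.0.0 means the dashboard is reachable on every interface.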
What Netdata monitors by default:
- CPU: Usage per core, steal time (time lost to virtualisation), and softirq time.
- Memory: RAM, swap, and the memory used by specific kernel subsystems.
- Disks: Per-disk throughput (read/write MB/s), operations per second, and latency.
- Network: Interface throughput in bits or bytes per second, TCP connections, TCP retransmits, and bandwidth per port.
- Processes: Number of running processes, I/O wait time, and detailed process-level CPU and memory usage.
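- Application metrics: If Nginx, MySQL, PostgreSQL, or PHP-FPM is running, Netdata tries to detect it and collect application-specific metrics. This often works without additional configuration, although some collectors need a status endpoint or database credentials before they return data.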
Reading the Netdata Dashboard
The Netdata dashboard is arranged into sections, each showing one category of metrics. Each chart shows the last few minutes of data by default, but you can zoom in to see individual seconds or zoom out to see hours of history.
The most important charts for general server health are:
- System CPU: Shows the overall CPU usage broken down by user, system, softirq, and steal. Steal time matters if the server runs in a virtual machine. High steal means the hypervisor is oversubscribing the physical host and the virtual server is waiting for CPU time.
- System RAM: Shows used, cached, and free RAM over time. A chart that shows used memory climbing steadily over several hours indicates a memory leak in an application.
- Disk utilisation: Shows the percentage of time the disk is busy. A disk that is consistently above 80 percent busy during normal operation is a bottleneck. Applications will stall waiting for the disk.
- Network interfaces: Shows inbound and outbound bandwidth. Useful for confirming that expected traffic is flowing and for spotting unusual spikes that might indicate an attack or a misconfigured service broadcasting excessive data.
When monitoring network traffic, it helps to have a clear picture of how your website infrastructure handles visitor requests. A CDN setup for business websites can reduce the load on your origin server by serving static content from edge locations, which often shows up in Netdata as lower bandwidth usage on the primary network interface.
Setting Up Basic Alerts in Netdata
Netdata includes a health monitoring system that can trigger alerts when a metric crosses a threshold. Alerts are defined in configuration files and can send notifications by email, by webhook, or through several other notification mechanisms.
Alerts are configured in files under /etc/netdata/health.d/. Each file defines one or more alarm entities with a metric to watch, a condition, and a threshold. For example, to alert when CPU usage averages above 80 percent over 5 minutes, create a file ending in .conf in health.d:
template: system_cpu_usage
on: system.cpu
lookup: average -5m unaligned of user,system,softirq,irq,guest
units: %
every: 1m
crit: $this > 80
Restart Netdata after adding or modifying alarms:
sudo systemctl restart netdata
The most useful alerts to configure first are:
- CPU usage above 80 percent for 5 minutes: Sustained high CPU means the server is overloaded.
- Memory usage above 85 percent: Warns before the server starts swapping.
- Disk usage above 80 percent on any partition: Prevents the server from running out of disk space, which causes application failures and system errors.
- Swap usage above 10 percent: Any significant swap usage means the server is short on memory.
- TCP retransmit rate above 1 percent: High retransmit rates indicate network problems or congestion.
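Before encoding these thresholds as alarms, it can help to check current values by hand. As one example, the sketch below lists partitions above a disk-usage threshold using df in POSIX output mode; disks_over is a hypothetical helper name, and the parsing assumes mount points without spaces.

```shell
# disks_over (hypothetical helper): read `df -P` output on stdin and print
# every mount point whose usage is above the given percentage.
# $1 = threshold in percent
disks_over() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                    # "85%" -> "85"
        if (pct + 0 > limit) print $6, pct "%"
    }'
}

df -P | disks_over 80
```

An empty result means no partition is currently above the threshold.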
Common Server Problems and How to Spot Them
Knowing what normal looks like makes it easier to spot problems. Here are the most common server issues and the metrics that reveal them.
Problem: Server is slow but CPU is not maxed. Check the disk I/O chart in Netdata. High disk wait time means applications are blocked waiting for the disk. This often happens when a database grows too large for available RAM and the server is constantly reading and writing data to disk. Adding RAM or optimising database queries that read too much data are common fixes.
Problem: Server was fast and is now slow. Check the memory chart over time. If used memory is climbing over hours or days, an application has a memory leak. The fix is usually to restart the application periodically or to find and fix the leak in the code. Until the leak is fixed, restart the application when memory usage approaches 80 percent.
Problem: Server is slow and htop shows high load but no single process stands out. Check I/O wait in htop. Press F2 to open Setup, go to "Display options", and enable "Detailed CPU time". The CPU bars then show I/O wait as a separate segment. High I/O wait means the CPU is idle while processes wait for storage. This is usually a disk bottleneck or slow network-backed storage such as NFS.
Problem: Web server returns slow responses intermittently. Check the number of connections in Netdata. If the number of concurrent connections is climbing toward the default limit for the web server software, new connections queue and response times increase. The fix is usually to increase the connection limit or add more server capacity.
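For the connection-count case, a quick cross-check from the shell is to count TCP sockets by state. This is a minimal sketch, assuming the ss utility from iproute2; tcp_states is a hypothetical helper name.

```shell
# tcp_states (hypothetical helper): read `ss -tan` output on stdin and print
# the number of sockets in each TCP state, largest count first.
tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

if command -v ss >/dev/null 2>&1; then
    ss -tan | tcp_states
fi
```

An ESTAB count that sits close to the web server's configured connection limit is consistent with the queuing described above.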
Using htop and Netdata Together
htop is a snapshot tool. It shows what is happening right now and is best used when you suspect something is wrong. Netdata is a history tool. It shows what has been happening and makes patterns visible that are impossible to see in a single htop snapshot.
Run htop when a server feels slow or when users report performance problems. Use Netdata to review what happened in the hours leading up to the problem. Together they cover both real-time diagnosis and historical analysis.
A good workflow is to check Netdata first, note any anomalies in the charts, then open htop to drill down into the specific process or resource that the Netdata charts flagged as unusual.
Keeping Monitoring Active Over Time
Setting up monitoring tools is only the first step. Monitoring only helps if it stays active and if someone reviews the alerts. On servers that run for months or years without interruption, it is easy to overlook alerts that fire during out-of-hours periods.
Schedule regular reviews of your monitoring setup. Check that alerts are firing correctly and that notifications reach the right person. Over time, you will learn which alerts are useful and which generate too much noise. Tune thresholds based on what you observe during normal operation.
Documentation matters too. Keep a record of what each metric threshold means for your specific applications, what the typical values look like during normal operation, and what steps to take when an alert fires. This makes it easier to respond quickly when something goes wrong, especially if multiple people are involved in server management.
Getting Started Today
Install htop on any Ubuntu server you manage and get familiar with the display. Press F1 in htop for a help screen that explains all the available options and keyboard shortcuts. Set up the columns to show the information that matters most to your applications.
Install Netdata on servers that need more monitoring than htop alone provides, particularly production servers where historical performance data is important for diagnosing intermittent problems. Configure at least the five basic alerts described above so that you know something is wrong before a customer reports it.
Server monitoring is not a luxury. A slow server that is not monitored will eventually become a failed server. The thirty minutes spent setting up htop and Netdata on a new server is recovered the first time they help you identify and fix a problem before it becomes an outage.