Understanding Kernel Isolation in Docker Containers

Docker containers share the host Linux kernel, which means a misconfiguration can allow a compromised container to break out of its isolation and access the host filesystem, network interfaces, or other running containers. Unlike virtual machines that each run a full operating system, containers rely on Linux kernel features such as namespaces, control groups, and capabilities to provide separation. When these are not properly configured, the boundary between the container and the host becomes thin enough for an attacker to cross.

Docker ships with default settings that prioritise convenience for local development environments. Those defaults include running as root inside the container, a capability set broader than most applications need, and a shared default bridge network on which containers can reach one another. Moving to a production environment requires deliberate hardening steps. This guide walks through the practical measures that reduce the attack surface of a containerised web application running on a Linux server.

Running Containers as Non-Root Users

Unless the image or the run command says otherwise, the process inside a container runs as root, as the Docker documentation notes. This is convenient for installation scripts and package managers but unnecessary for most applications and dangerous if the container is ever compromised. Even a limited foothold inside a root container can sometimes be escalated to host-level access.

Most official images, including those for PostgreSQL, Nginx, and common web application stacks, can run as a standard unprivileged user. The Dockerfile directive USER sets the user for every subsequent RUN, CMD, and ENTRYPOINT instruction, so it is usually placed after the build steps that need root.

FROM php:8.3-fpm

RUN groupadd -r appuser && useradd -r -g appuser appuser

COPY --chown=appuser:appuser . /var/www/html

USER appuser

This creates a system user called appuser and runs the application process under that identity. The --chown flag on the COPY directive makes appuser the owner of the copied files, so the process can read them without permissions being loosened for every other user.
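One way to confirm that the USER directive took effect is to read the user recorded in the image configuration. The following is a minimal sketch, assuming an image tagged myapp as in the later examples; the function name image_runs_as_root is illustrative and not part of Docker.

```shell
# image_runs_as_root IMAGE
# Returns 0 if the image's configured user is root (or unset, which
# Docker treats as root), 1 for any other user, 2 if inspect fails.
image_runs_as_root() {
  user="$(docker image inspect --format '{{.Config.User}}' "$1")" || return 2
  # USER may be "name", "uid", "name:group" or "uid:gid"; keep the user part.
  case "${user%%:*}" in
    ""|root|0) return 0 ;;
    *)         return 1 ;;
  esac
}
```

A build script could call image_runs_as_root myapp and fail when the exit status is 0. Note that this only inspects the image default; a --user flag passed to docker run can still override it at runtime.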

Some applications need to bind to privileged ports below 1024. Rather than running the entire container as root, you can grant only the specific capability needed at container startup:

docker run --cap-add NET_BIND_SERVICE -p 443:443 myapp

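The NET_BIND_SERVICE capability allows a process to bind to ports below 1024. It is already part of Docker's default capability set, so the explicit --cap-add matters mainly once you have dropped all capabilities, as described in the next section. Capabilities added this way are also not automatically effective for a process that runs as a non-root user. For non-root containers, a simpler option is often to listen on an unprivileged port inside the container and publish it on the privileged host port, for example with -p 443:8443. Either way, the aim is to grant the minimum privilege required for the task.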

Limiting Linux Capabilities

Linux capabilities break down the traditional all-or-nothing root privilege into individual units. A web server process rarely needs all of root's capabilities. It may need to bind to privileged ports, read certain system files, or change ownership of specific resources, but it almost never needs to load kernel modules, configure network interfaces, or reboot the host.

The principle of least privilege means granting only the capabilities the specific application requires and dropping everything else. You can audit what capabilities a container currently holds by entering the container and checking:

docker run --rm -it myapp /bin/bash
capsh --print

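The output shows the current capability set. If capsh is not installed in the image (on Debian-based images it is provided by the libcap2-bin package), the Cap lines in /proc/self/status report the same sets as hexadecimal bitmasks. From there, you can identify which capabilities are actually necessary and drop the rest: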

docker run --rm \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --cap-add DAC_READ_SEARCH \
  myapp

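NET_BIND_SERVICE is a common addition for web servers that bind to ports below 1024. DAC_READ_SEARCH bypasses the permission checks for reading files and searching directories, so add it only when the application genuinely has to read files it does not own; correcting file ownership is usually the safer fix. Audit your specific application before deciding which capabilities to retain.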
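When capsh is not available, the hexadecimal CapEff mask from /proc/self/status can be decoded by hand. The sketch below maps each set bit to its capability name, using the bit order defined in linux/capability.h; the function name decode_caps is hypothetical, and the CAP_ prefix is omitted to match the names Docker's flags accept.

```shell
# Capability names indexed by bit position (bit 0 first).
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# decode_caps HEXMASK: print one capability name per line for each set bit.
decode_caps() {
  mask=$((0x$1))
  bit=0
  # CAP_NAMES is intentionally unquoted so it splits into one name per word.
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    bit=$((bit + 1))
  done
}
```

The mask itself can be read with docker run --rm myapp grep CapEff /proc/self/status, assuming grep exists in the image. For a root process under Docker's default settings the value is typically 00000000a80425fb, which decode_caps expands to the 14 default capabilities.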

Restricting System Calls with seccomp

Seccomp, short for secure computing mode, restricts the system calls a container can make to the Linux kernel. Every operation a container performs, from opening a file to opening a network socket, eventually calls the kernel through a system call. By controlling which system calls are permitted, you reduce the attack surface available to an attacker who has gained code execution inside the container.

Docker provides a default seccomp profile that blocks approximately 44 system calls considered dangerous or unnecessary for most container workloads. You do not need to specify anything to use this default profile. For tighter restrictions, you can provide a custom seccomp profile:

docker run --security-opt seccomp=profile.json myapp

A custom seccomp profile allows you to define exactly which system calls your application needs. For a typical PHP web application, you might allow only a carefully selected subset of calls. Testing custom profiles thoroughly before applying them in production is essential. Blocking a system call that the application requires causes immediate failures and can be difficult to diagnose under pressure.
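A custom profile is a JSON file with a default action and a list of rules. The sketch below writes a skeleton allowlist to profile.json purely to show that structure. The syscall names are placeholders: a container cannot start with a list this short, and a practical profile is usually derived by tracing the application or by trimming Docker's published default profile.

```shell
# Write a skeleton allowlist profile to profile.json.
# NOTE: the syscall names are placeholders that only illustrate the
# structure; a real application needs a much longer, traced list.
cat > profile.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "execve", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
```

With defaultAction set to SCMP_ACT_ERRNO, any system call that is not listed fails with an error instead of reaching the kernel, which is why an incomplete list shows up as immediate application failures.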

Avoiding the Privileged Flag

Never run a container with the --privileged flag in production. A privileged container receives every capability, has access to all devices on the host, and runs without the default seccomp and AppArmor restrictions, so it can escape its isolation entirely. If a container truly needs access to specific hardware, use --device to expose only those devices:

docker run --device=/dev/sda:/dev/sda myapp

This grants access to a specific disk device without exposing every device on the host. Treat --privileged as a last resort and never use it in any environment where security matters.

Read-Only Root Filesystem

Setting the root filesystem to read-only prevents a compromised container from modifying application files or writing malicious files to the container filesystem. Applications that need to write data can only do so in mounted volumes:

docker run --read-only myapp

For applications that legitimately need to write temporary files, mount tmpfs volumes for those specific directories:

docker run --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --tmpfs /var/cache:rw,noexec,nosuid,size=128m \
  myapp

Before enabling a read-only root filesystem, identify every directory the application writes to. Common writable locations include /tmp, /var/cache, upload directories, and log files. For logs, use a logging driver or a mounted volume rather than relying on filesystem writes inside the container.
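One way to find those directories is to exercise the application in a container started without --read-only and then inspect the output of docker diff, which lists added paths with an A prefix and changed paths with a C prefix. The helper below is a sketch that reduces the added paths to their parent directories; the name written_dirs is hypothetical, and files modified in place appear only as C lines, which still need a manual look.

```shell
# written_dirs: read `docker diff` output on stdin and print the unique
# parent directories of every added (A) path.
written_dirs() {
  awk '
    $1 == "A" {
      path = substr($0, 3)        # drop the "A " prefix
      sub("/[^/]*$", "", path)    # strip the last path component
      print (path == "" ? "/" : path)
    }
  ' | LC_ALL=C sort -u
}
```

Usage would look like docker diff mycontainer | written_dirs, where mycontainer is a placeholder for the name of the test container. Each directory it prints is a candidate for a tmpfs mount or a volume.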

Setting Resource Limits

Containers without resource limits can consume all available CPU and memory on the host, affecting other containers and the host system itself. A compromised or buggy container that is stuck in an infinite loop or leaking memory can cause a denial of service across the entire host. Setting limits ensures that a misbehaving container cannot take down unrelated services:

docker run \
  --memory="256m" \
  --memory-swap="512m" \
  --cpus="0.5" \
  --restart=on-failure:3 \
  myapp

The --memory-swap value controls the total memory and swap available. Setting it larger than --memory allows the container to swap to disk when it hits the memory limit. Setting it equal to --memory prevents swap usage entirely, which may cause the container to be terminated by the OOM killer when memory is exhausted. The --restart=on-failure:3 policy is not a resource limit; it caps automatic restarts at three attempts, so a container that keeps crashing does not restart forever. Choose values based on your application's actual requirements and monitor usage in production.
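To check that limits are actually in place, the running containers can be audited. The sketch below relies on docker inspect reporting HostConfig.Memory as 0 when no --memory value was set; the function name unlimited_containers is illustrative.

```shell
# unlimited_containers: print the names of running containers that have
# no memory limit (HostConfig.Memory is 0 when --memory was not set).
unlimited_containers() {
  ids="$(docker ps -q)" || return 1
  [ -n "$ids" ] || return 0        # nothing running, nothing to report
  # $ids is intentionally unquoted so each ID becomes its own argument.
  docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' $ids |
    awk '$2 == 0 { sub("^/", "", $1); print $1 }'
}
```

An empty result means every running container has a memory limit. The same pattern could be applied to HostConfig.NanoCpus to audit CPU limits.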

Network Namespace Isolation

By default, Docker creates a bridge network that allows all containers on the same host to communicate with each other. If one container is compromised, an attacker may be able to reach other containers on the same network, including databases, caches, or internal APIs that should not be exposed.

Custom Docker networks let you control which containers can communicate with each other. Using separate networks for different components enforces a clear boundary:

docker network create --driver bridge appnet
docker network create --driver bridge dbanet

docker run -d --name app --network=appnet myapp
docker network connect dbanet app
docker run -d --name db --network=dbanet mydb

In this setup, the database container connects only to dbanet, while the application container is attached to both appnet and dbanet. The application can reach the database, but containers attached only to appnet, such as a front-end web server or reverse proxy, cannot. If one of those is compromised, it cannot directly reach the database because the database is not attached to appnet.

Image Security and Vulnerability Scanning

Base images provide the operating system and libraries that your application runs on. Outdated base images accumulate known vulnerabilities over time. Using minimal base images that contain only what the application needs reduces the attack surface significantly.

  • Choose minimal images: Use alpine or distroless images instead of full Ubuntu or Debian base images where possible.
  • Pin image versions: Always specify a version tag rather than :latest to ensure reproducible builds and avoid unexpected updates during rebuilds.
  • Rebuild regularly: Schedule regular rebuilds of application images to pick up security patches applied to the base image.
  • Scan for vulnerabilities: Use tools such as Docker Scout or Trivy to scan images for known vulnerabilities during the build pipeline and reject images with critical issues before they reach production.

docker scout cves myapp:latest

Integrating image scanning into a continuous integration pipeline means that images with critical vulnerabilities fail the build automatically. This prevents vulnerable containers from reaching production environments where they could be exploited.
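A pipeline step for this can be small. The sketch below assumes Trivy is installed on the build agent and treats HIGH and CRITICAL findings as blocking; the function name scan_or_fail and the severity threshold are choices made for this example, not requirements.

```shell
# scan_or_fail IMAGE: return non-zero when the Trivy scan fails or reports
# HIGH or CRITICAL findings, so a CI step that calls it fails the build.
scan_or_fail() {
  # --exit-code 1 makes trivy exit with status 1 when it finds
  # vulnerabilities at the listed severities.
  if ! trivy image --exit-code 1 --severity HIGH,CRITICAL "$1"; then
    echo "Blocking $1: scan failed or found HIGH/CRITICAL vulnerabilities" >&2
    return 1
  fi
}
```

In a CI script, a line such as scan_or_fail myapp:latest placed before the push step stops the pipeline when the scan does not pass.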

Managing Secrets in Containers

Passing sensitive data such as database passwords, API keys, and encryption tokens to containers requires care. Environment variables are visible through docker inspect and are inherited by child processes, making them unsuitable for secrets. A better approach is Docker secrets, a Swarm mode feature in which secrets are stored encrypted in the cluster state and mounted as files only within the containers of services that have been granted access:

# printf avoids the trailing newline that echo would store in the secret
printf '%s' "example-password" | docker secret create db_password -

docker service create --secret db_password myapp

Secrets are mounted as files in /run/secrets/ within the container. By default each file is owned by root and is world-readable inside the container (mode 0444); the uid, gid, and mode options of --secret can tighten this when the application runs as a specific unprivileged user. Secrets are never written to disk as part of the image and are not exposed through environment variables. If you are not using Docker Swarm, consider a dedicated secrets management tool such as HashiCorp Vault or AWS Secrets Manager and retrieve secrets at container startup rather than baking them into images or passing them as plain environment variables.
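Inside the container, an entrypoint script can read a secret from its file at startup. The sketch below is one possible helper; the function name read_secret and the SECRETS_DIR override (useful for testing outside a container) are assumptions made for this example.

```shell
# read_secret NAME: print the contents of the secret file NAME.
# Looks in /run/secrets unless SECRETS_DIR points somewhere else.
read_secret() {
  secret_file="${SECRETS_DIR:-/run/secrets}/$1"
  if [ ! -r "$secret_file" ]; then
    echo "secret '$1' not found or not readable at $secret_file" >&2
    return 1
  fi
  cat "$secret_file"
}
```

A typical use is DB_PASSWORD="$(read_secret db_password)" || exit 1 near the top of the entrypoint. Keeping the value in an unexported shell variable avoids passing it to child processes through the environment.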

Protecting the Docker Socket

The Docker socket at /var/run/docker.sock provides full control over the Docker daemon. Mounting this socket into a container is one of the most dangerous misconfigurations possible. A container with access to the Docker socket can create new containers, mount the host filesystem, and effectively gain root access to the host:

# Never mount the Docker socket into a container
docker run -v /var/run/docker.sock:/var/run/docker.sock myapp

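If your application needs to manage containers programmatically, place a filtering proxy in front of the Docker API that exposes only the endpoints the application requires, or use a container orchestration tool that provides its own access control layer. Mounting the socket read-only does not help, because API requests sent over it can still create and control containers. Never grant a workload container direct access to the Docker socket.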
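Existing hosts can be audited for this misconfiguration. The sketch below lists running containers that have a mount whose source path ends in docker.sock; the function name socket_mounts is hypothetical, and mount sources containing spaces are not handled.

```shell
# socket_mounts: print the names of running containers that mount a
# path ending in docker.sock.
socket_mounts() {
  ids="$(docker ps -q)" || return 1
  [ -n "$ids" ] || return 0        # nothing running, nothing to report
  # Emits "<name> <mount source> <mount source> ..." per container.
  # $ids is intentionally unquoted so each ID becomes its own argument.
  docker inspect --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}' $ids |
    awk '{
      for (i = 2; i <= NF; i++) {
        if ($i ~ /docker\.sock$/) {
          sub("^/", "", $1)   # inspect reports names with a leading slash
          print $1
          break
        }
      }
    }'
}
```

Any name this prints deserves a review, since that container can control the Docker daemon.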

Multi-Container Environments and Security Boundaries

When deploying multi-container applications using Docker Compose, the same security principles apply but the configuration becomes distributed across multiple service definitions. Each service should run with its own user, limited capabilities, and restricted network access. A web application container should not be able to reach the database container directly unless that communication path is explicitly required.

For projects involving multiple interconnected containers, reviewing the overall architecture for security boundaries is worth the effort. A well-structured compose setup places each component on its own network and grants only the capabilities each service genuinely needs.
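As an illustration, the sketch below writes a Compose file that applies several of the measures from this guide to two services. The service names, the file name compose.hardened.yml, and the limit values are assumptions for this example, and the capability settings for the database are left out because they depend on the specific image.

```shell
# Write an illustrative Compose file. Service names, image names, the
# file name, and the limit values are assumptions for this sketch.
cat > compose.hardened.yml <<'EOF'
services:
  app:
    image: myapp
    user: appuser          # must exist in the image
    read_only: true
    tmpfs:
      - /tmp
    cap_drop:
      - ALL
    mem_limit: 256m
    cpus: 0.5
    networks:
      - appnet
      - dbanet
  db:
    image: mydb
    mem_limit: 512m
    networks:
      - dbanet             # reachable only from services on dbanet

networks:
  appnet:
  dbanet:
    internal: true         # no route to external networks
EOF
```

Here the application joins both networks so that it can reach the database, while the database stays on the internal network only. Any front-end service added later would join appnet alone.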

Regular Security Reviews

Container security is not a one-time configuration. New vulnerabilities are discovered regularly in base images, application dependencies, and the container runtime itself. A scheduled review process helps catch issues before they are exploited.

Key review tasks include rebuilding images to pull the latest base image patches, re-scanning for newly disclosed vulnerabilities, checking that capability and network configurations remain appropriate as the application evolves, and verifying that resource limits are still correctly set for the current workload.

Putting These Practices Together

Docker container security works through layers. No single hardening step provides complete protection, but each measure reduces the potential impact of a compromise. Running containers as non-root users limits what an attacker can do inside the container. Dropping unnecessary capabilities restricts access to dangerous kernel operations. Setting resource limits prevents a compromised container from affecting the wider host. Read-only filesystems stop attackers from modifying application files. Network isolation ensures that compromising one container does not automatically grant access to others. Regular image scanning catches vulnerabilities before they reach production. Proper secrets management prevents sensitive data from leaking through environment variables.

These measures complement each other. A read-only filesystem protects against file modification even if a capability has been incorrectly granted. Network isolation limits lateral movement even if an attacker gains code execution inside one container. Together, they create a defence-in-depth posture that is significantly harder to breach than a default container configuration.
