Production Ready VPS

Updated on 05 Jun 26 22:06 UTC

Tags: VPS, Linux, Server Hardening, DevOps, Linux Security

Most VPS setup guides stop at the point where an application is running. That is understandable -- getting software deployed is usually the goal. The problem is that a running application and a production-ready server are not the same thing.

A server can happily serve traffic for months while quietly accumulating operational debt: logs grow unchecked, Docker images pile up, access controls become haphazard, deployments grow riskier, and before long you are spending more time maintaining the server than building the application it exists to host.

A production-ready VPS does not need to be complicated. In fact, the best server setups are usually boring. What they do need is a solid foundation -- one where security, monitoring, maintenance, and deployment workflows are considered from the start rather than bolted on after the first incident.

In this guide we will walk through each component that goes into provisioning a production-ready server. Every section includes real configuration examples drawn from the bootstrap script we use at DeployCrate to provision thousands of servers.

What Does "Production-Ready" Actually Mean?

The term gets thrown around a lot, often without much explanation. To us, a production-ready VPS is a server that can host a real application, receive real traffic, and continue operating reliably without constant manual intervention. That means thinking beyond simply deploying code and checking it off a list.

A production-ready VPS should have each of the following foundations in place:

Secure access controls so that only authorized users can reach the server and they cannot escalate beyond what they need.
Proper user management so that administrative users, service accounts, and emergency recovery accounts are clearly separated.
Firewall protection so that only the ports you intend to expose are actually reachable from the internet.
Brute-force protection so that automated attack traffic is detected and blocked automatically.
Monitoring so that CPU, memory, disk, and service health are visible before users notice a problem.
Log retention so that when something does break you have the evidence needed to investigate.
Automated maintenance so that logs, Docker artifacts, and other accumulated cruft are cleaned up without anyone needing to remember.
Safe deployment workflows so that pushing new code is predictable, reversible, and ideally invisible to users.

These are not separate concerns. They are all part of the same goal: building a server that you can trust to run your application without constant babysitting.

User Management

One of the first things we do on a fresh VPS is create proper users. Using root directly is convenient, but convenience tends to age poorly. It is difficult to audit, easy to misuse, and encourages a habit of running everything with more permissions than necessary.

A production server should have at least three categories of users: an administrative user for day-to-day work, service accounts for running daemons, and an emergency recovery account that can get you back in if something goes wrong with the primary account.

Creating the administrative user

We start by creating a dedicated user for server administration. This account gets sudo access, membership in the deploy-crate group (which controls access to operator-managed directories), and an SSH key for authentication.

if id "$USERNAME" >/dev/null 2>&1; then
    log "User $USERNAME already exists, updating configuration..."
else
    useradd -m -s /bin/bash "$USERNAME"
fi

echo "${USERNAME}:${PASSWORD}" | chpasswd
usermod -aG sudo,deploy-crate "$USERNAME"

Notice the guard clause at the top: the script is designed to be safely re-runnable, meaning it can be run against an existing server to bring it up to spec without breaking anything. This is an important principle. Production setup scripts should be idempotent.

After creating the user, we install their SSH public key and configure proper permissions:

mkdir -p "/home/${USERNAME}/.ssh"
chmod 700 "/home/${USERNAME}/.ssh"
printf '%s\n' "$SSH_PUBLIC_KEY" > "/home/${USERNAME}/.ssh/authorized_keys"
chmod 600 "/home/${USERNAME}/.ssh/authorized_keys"
chown -R "${USERNAME}:${USERNAME}" "/home/${USERNAME}/.ssh"

The permissions here matter a great deal. The .ssh directory must be 700 (read/write/execute for the owner only) and authorized_keys must be 600 (read/write for the owner only). If either is writable by anyone other than the owner, sshd refuses to use the key file entirely (this is what StrictModes yes enforces).
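A quick way to confirm the modes described above is to read them back after the script has run. The following is a minimal sketch, not part of the bootstrap script: check_ssh_perms is a hypothetical helper name, and it assumes GNU stat (stat -c), as found on typical Linux distributions.

```bash
# Report whether an .ssh directory and its authorized_keys file carry the
# expected modes (700 and 600). Assumes GNU stat, which supports -c '%a'.
check_ssh_perms() {
    local ssh_dir="$1"
    local dir_mode file_mode
    dir_mode="$(stat -c '%a' "$ssh_dir")" || return 2
    file_mode="$(stat -c '%a' "$ssh_dir/authorized_keys")" || return 2
    if [ "$dir_mode" = "700" ] && [ "$file_mode" = "600" ]; then
        echo "ok"
    else
        echo "unexpected modes: dir=$dir_mode authorized_keys=$file_mode"
        return 1
    fi
}
```

Calling check_ssh_perms "/home/${USERNAME}/.ssh" at the end of provisioning would surface a wrong mode immediately, rather than at the first failed login.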

Finally, we drop a sudoers entry so the admin user can run commands as root:

echo "${USERNAME} ALL=(ALL) ALL" > "/etc/sudoers.d/${USERNAME}"
chmod 440 "/etc/sudoers.d/${USERNAME}"

The sudoers file lives in /etc/sudoers.d/ rather than editing /etc/sudoers directly. This keeps configuration modular -- each user or service gets its own file, and you can audit them independently.
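Because a syntax error in any sudoers file can break sudo for everyone, it can be worth validating a drop-in before it lands in /etc/sudoers.d/. The sketch below shows one way to do that with visudo's check mode. The function name install_sudoers_dropin and the VISUDO override variable are hypothetical and are not taken from the bootstrap script.

```bash
# Write a sudoers drop-in only after it passes a syntax check.
# VISUDO is a hypothetical override so the check can be stubbed in dry runs;
# by default the real visudo binary is used.
install_sudoers_dropin() {
    local name="$1" content="$2" dest_dir="${3:-/etc/sudoers.d}"
    local tmp
    tmp="$(mktemp)" || return 1
    printf '%s\n' "$content" > "$tmp"
    # visudo -c checks syntax only; -f points it at the candidate file.
    if "${VISUDO:-visudo}" -cf "$tmp" >/dev/null; then
        install -m 0440 "$tmp" "${dest_dir}/${name}"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "sudoers syntax check failed for ${name}" >&2
        return 1
    fi
}
```

A call such as install_sudoers_dropin "${USERNAME}" "${USERNAME} ALL=(ALL) ALL" would replace the echo and chmod pair. One caveat: sudo skips files in /etc/sudoers.d/ whose names contain a dot or end in a tilde, so usernames with dots need a different file name.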

Creating the emergency recovery account

If your admin user's SSH key is lost or the account is accidentally locked, you need a way back in. This is what the emergency user provides. It is a separate account with its own SSH key, its password is locked (meaning no password-based login is possible), and its sudo access is scoped to a very narrow set of commands.

useradd -m -s /bin/bash "$EMERGENCY_USER"
passwd -l "$EMERGENCY_USER"

Locking the password with passwd -l invalidates the stored password hash, so no password will ever authenticate for this account, even if password authentication were re-enabled by mistake. Only the SSH key works.

The emergency user's sudo access is restricted to controlling the operator's blue/green systemd services -- nothing else. This is enforced through a tightly-scoped sudoers policy:

# Both echo commands emit sudoers syntax; their output needs to end up in a
# drop-in under /etc/sudoers.d/ (the redirection is not shown in this excerpt).
echo "Cmnd_Alias EMERGENCY_OPERATOR = \\
  ${SYSTEMCTL} start crate-operator-green, \\
  ${SYSTEMCTL} stop crate-operator-green, \\
  ${SYSTEMCTL} restart crate-operator-green, \\
  ${SYSTEMCTL} status crate-operator-green, \\
  ${SYSTEMCTL} start crate-operator-blue, \\
  ${SYSTEMCTL} stop crate-operator-blue, \\
  ${SYSTEMCTL} restart crate-operator-blue, \\
  ${SYSTEMCTL} status crate-operator-blue"

echo "${EMERGENCY_USER} ALL=(root) NOPASSWD: EMERGENCY_OPERATOR"

The emergency account cannot run arbitrary commands. It exists solely to restart the operator if something goes wrong, giving you a recovery path that does not require full root access.

SSH Key Authentication

SSH is the primary way administrators access a server. That makes it one of the most critical parts of the entire setup to get right. Password authentication remains one of the most common attack vectors against internet-facing servers -- bots will attempt tens of thousands of password guesses per day against any server with port 22 open. SSH keys provide a significantly stronger foundation and are easier to manage as a team grows.

We recommend Ed25519 keys. They offer roughly 128 bits of security, comparable to a 3072-bit RSA key, in a far more compact format, and key generation and signing are much faster than with RSA. Generating a key pair looks like this:

ssh-keygen -t ed25519 -C "you@example.com"

Add a passphrase to the private key. A passphrase acts as a second factor: even if someone steals the private key file, they cannot use it without the passphrase. Modern SSH agents can cache the passphrase so you only type it once per session.

The key installation we showed in the previous section uses printf to write the public key into authorized_keys. The format of an authorized_keys file is one key per line, and each line is just the raw public key string -- no special markup required.
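Since a malformed line in authorized_keys fails silently (the key simply never matches), a sanity check on the key string before writing it can save a confusing lockout. The following is a simplified sketch: is_plain_pubkey_line is a hypothetical helper, and it only recognizes the common "type base64 comment" shape for Ed25519, RSA, and ECDSA keys. It deliberately ignores option prefixes and hardware-backed key types, which are also valid in authorized_keys.

```bash
# Return success if the argument looks like a plain public key line:
#   <key-type> <base64-data> [comment]
# Simplification: lines with leading options (e.g. command="...") and sk-*
# key types are valid for sshd but are rejected by this check.
is_plain_pubkey_line() {
    local line="$1"
    local re='^(ssh-ed25519|ssh-rsa|ecdsa-sha2-nistp(256|384|521)) [A-Za-z0-9+/]+={0,2}( .*)?$'
    [[ "$line" =~ $re ]]
}
```

In a provisioning script this could guard the printf that writes the key, for example is_plain_pubkey_line "$SSH_PUBLIC_KEY" || exit 1.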

SSH Hardening

Installing SSH keys is only the beginning. A production VPS should treat SSH as a critical entry point and apply additional protections at the server configuration level. The default SSH configuration on most distributions is permissive in ways that you do not want on a public-facing server.

Here is the hardened sshd_config we deploy:

cat > /etc/ssh/sshd_config << EOF
Port ${SSH_PORT}
AddressFamily inet
Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
UsePrivilegeSeparation yes
KeyRegenerationInterval 3600
SyslogFacility AUTH
LogLevel VERBOSE
LoginGraceTime 30
StrictModes yes
RSAAuthentication yes
PubkeyAuthentication yes
IgnoreRhosts yes
RhostsRSAAuthentication no
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no
X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes
AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server
UsePAM yes
AllowUsers ${USERNAME} ${EMERGENCY_USER}
MaxAuthTries 3
MaxSessions 2
ClientAliveInterval 300
ClientAliveCountMax 2
EOF

systemctl restart ssh

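Let us walk through the settings that matter most and why each one is there. A few directives (Protocol, UsePrivilegeSeparation, KeyRegenerationInterval, RSAAuthentication, RhostsRSAAuthentication) are holdovers from older OpenSSH releases. Current versions log them as deprecated and otherwise ignore them, so they can be dropped on modern distributions.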

PasswordAuthentication no is arguably the single most impactful change. Once set, the only way to authenticate is with an SSH key. Password-guessing bots become immediately irrelevant because there is no password prompt to guess against.

AllowUsers ${USERNAME} ${EMERGENCY_USER} restricts SSH access to only the two accounts we created. Even if someone manages to create a new user on the system (for example, through a compromised application), that user cannot log in via SSH because they are not in the allow list. This is a defense-in-depth measure.

MaxAuthTries 3 limits each connection to three authentication attempts before the connection is terminated. With PasswordAuthentication no this primarily acts as a rate limit on public key attempts, but it is good hygiene regardless.

MaxSessions 2 prevents a single SSH connection from multiplexing too many sessions. This limits the blast radius of a compromised session.

ClientAliveInterval 300 and ClientAliveCountMax 2 together mean that if a client becomes unresponsive for 10 minutes (5 minutes × 2 probes), the server terminates the connection. This prevents stale sessions from accumulating.

Root login over SSH is blocked implicitly because root is not in the AllowUsers list. Whatever PermitRootLogin is set to (current OpenSSH defaults to prohibit-password), a user missing from the allow list is rejected.

LogLevel VERBOSE increases SSH logging detail, which helps when investigating authentication failures or suspicious activity. The verbose logs feed into Fail2Ban, which we will configure in a later section.

X11Forwarding no disables X11 forwarding entirely. Production servers have no use for graphical forwarding, and leaving it enabled is unnecessary exposure.

After writing the configuration, we restart the SSH daemon. Importantly, this does not terminate existing connections -- your current session remains active while the new configuration applies to all new connections. If you are making these changes over SSH and worry about locking yourself out, open a second terminal and test the new configuration before closing the original session.
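One extra safeguard against locking yourself out is to have sshd parse the new configuration in test mode before it replaces the live file. The sketch below illustrates the idea under a few assumptions: apply_sshd_config and the SSHD_BIN override are hypothetical names, and /usr/sbin/sshd is assumed as the binary location (the usual path on Debian and Ubuntu).

```bash
# Install a candidate sshd_config only if sshd accepts it in test mode.
# SSHD_BIN is a hypothetical override (handy for dry runs); it defaults to
# /usr/sbin/sshd.
apply_sshd_config() {
    local candidate="$1" target="${2:-/etc/ssh/sshd_config}"
    # -t checks the configuration and host keys without starting a daemon;
    # -f selects the file to check.
    if ! "${SSHD_BIN:-/usr/sbin/sshd}" -t -f "$candidate"; then
        echo "sshd rejected ${candidate}; leaving ${target} untouched" >&2
        return 1
    fi
    install -m 0644 "$candidate" "$target"
    echo "installed ${target}; restart ssh, then log in from a second terminal"
}
```

With this approach the heredoc would write to a temporary file first, and systemctl restart ssh would only run after apply_sshd_config succeeds.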

Firewall Protection

Every internet-facing server should have an explicit firewall policy. The simplest principle in server security is this: if a service does not need to be reachable from the internet, do not expose it.

UFW (Uncomplicated Firewall) provides a straightforward interface to iptables. Our policy is simple -- deny everything by default, then explicitly allow only the ports we need:

ufw allow "${SSH_PORT}/tcp"
ufw allow 80,443/tcp
echo y | ufw enable
ufw reload

For most web applications, the only publicly accessible ports should be SSH (for administration), HTTP (port 80, for redirecting to HTTPS), and HTTPS (port 443, for encrypted traffic). Everything else -- databases, monitoring endpoints, internal APIs, Docker daemon ports -- should be denied by default.

UFW's default policy is to deny all incoming traffic and allow all outgoing traffic. The ufw enable command activates this policy, and ufw reload ensures the rules are applied immediately. The echo y pipes an automatic confirmation to the enable command so the script runs without interactive prompts.

Note that SSH and HTTP/HTTPS are the only services exposed. Services like the node_exporter metrics endpoint (port 9100) and the Caddy admin API (port 2019) bind to 127.0.0.1 only -- they are not reachable from outside the server, and the firewall reinforces this boundary at the network level.
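The claim that internal services bind to loopback only is easy to verify from the server itself. As a rough sketch, the filter below reads the output of ss -Htln (from iproute2; -H drops the header, -t selects TCP, -l listening sockets, -n numeric ports) and prints every listening address that is not on loopback. The function name non_loopback_listeners is hypothetical.

```bash
# Read `ss -Htln` output on stdin and print listening addresses that are
# not bound to loopback. Column 4 of ss output is "Local Address:Port".
non_loopback_listeners() {
    awk '{
        addr = $4
        if (addr !~ /^127\./ && addr !~ /^\[::1\]:/) print addr
    }'
}
```

Running ss -Htln | non_loopback_listeners on a correctly configured server should list only the SSH port and ports 80 and 443. Anything else in the output is a service that relies on the firewall alone for protection.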

Brute Force Protection

The moment a server becomes publicly accessible, automated login attempts begin. You do not need to be running a popular application or hosting sensitive data for this to happen -- bots scan the entire IPv4 address space continuously, and any server with port 22 open will receive connection attempts within minutes.

Fail2Ban monitors log files for repeated authentication failures and temporarily blocks offending IP addresses by adding iptables rules. It is one of the simplest improvements you can make to a server's security posture.

Here is the configuration we deploy:

cat > /etc/fail2ban/jail.local << EOF
[sshd]
enabled = true
port = ${SSH_PORT}
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
EOF

if systemctl is-active --quiet fail2ban; then
    systemctl restart fail2ban
else
    systemctl enable --now fail2ban
fi

The configuration is deliberately simple. maxretry = 3 means that after three failed authentication attempts from the same IP address within the default findtime window (10 minutes), Fail2Ban blocks that IP. bantime = 3600 means the block lasts for one hour.

Three attempts is a reasonable threshold. It gives a legitimate user room to offer the wrong key or connect with the wrong username without getting locked out, while making brute-force attacks impractical. An attacker that can try three keys per hour per IP address is not going to get very far.
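To put a number on "not going to get very far", the arithmetic can be sketched as follows. This is a simplified upper bound that assumes each ban cycle allows maxretry failures and then costs the full bantime; it ignores findtime and connection overhead. The function name attempts_per_day is hypothetical.

```bash
# Upper bound on guesses per day from a single IP address: each ban cycle
# allows `maxretry` failures and then costs `bantime` seconds.
# Simplification: ignores findtime and connection overhead.
attempts_per_day() {
    local maxretry="$1" bantime="$2"
    echo $(( maxretry * 86400 / bantime ))
}

attempts_per_day 3 3600   # prints 72
```

At 72 attempts per day per address, even a small key or password space is far out of reach for a single source, which is why distributed botnets rather than single hosts are the remaining concern.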

The port parameter uses the ${SSH_PORT} variable so that the configuration stays in sync with whatever SSH port you have chosen. If you changed SSH to port 2222, Fail2Ban automatically watches the correct port.

The restart logic is also worth noting: if Fail2Ban is already running, we restart it to pick up the new configuration. If it is not running, we enable and start it. This keeps the script idempotent -- you can run it on an existing server without worrying about the current state.

One important detail: Fail2Ban uses iptables for blocking by default, which integrates cleanly with UFW. When Fail2Ban adds a ban rule, UFW respects it because both tools manage the same underlying netfilter subsystem.

Swap Configuration

Memory exhaustion is one of the fastest ways to destabilize a server. When a Linux system runs out of physical RAM and has no swap, the kernel's OOM (Out-Of-Memory) killer starts terminating processes -- and it does not always pick the right ones. Your database might get killed while the application process that is actually leaking memory survives.

Swap is not a replacement for RAM. It is a safety net. Particularly on smaller VPS instances with 1-2 GB of RAM, having swap configured provides valuable breathing room during temporary resource spikes like a deployment artifact build or a sudden traffic surge.

Here is how we configure swap:

ensure_swap() {
    if swapon --noheadings --raw 2>/dev/null | grep -q .; then
        return 0
    fi

    if [ ! -f "${SWAP_FILE}" ]; then
        if ! (command -v fallocate >/dev/null 2>&1 && fallocate -l "${SWAP_SIZE_MB}M" "${SWAP_FILE}"); then
            dd if=/dev/zero of="${SWAP_FILE}" bs=1M count="${SWAP_SIZE_MB}" status=none
        fi
    fi

    chmod 0600 "${SWAP_FILE}"

    if ! /sbin/mkswap "${SWAP_FILE}" >/dev/null 2>&1; then
        mkswap "${SWAP_FILE}" >/dev/null
    fi

    swapon "${SWAP_FILE}" || true

    if ! grep -q "^${SWAP_FILE}[[:space:]]" /etc/fstab; then
        echo "${SWAP_FILE} none swap sw 0 0" >> /etc/fstab
    fi
}

Several details in this function are worth calling out.

First, the guard clause at the top: if swap is already active (swapon returns output), we exit early. This is another example of idempotency -- the function only creates swap if it does not already exist.

Second, the file allocation uses fallocate as the primary method, with dd as a fallback. fallocate is faster because it allocates blocks without writing zeros to them, but it is not available on every filesystem. The dd fallback ensures the script works anywhere.

Third, the swap file permissions are set to 0600. This prevents other users on the system from reading the swap file, which could contain sensitive data that was paged out of memory -- API keys, session tokens, database queries.

Fourth, the fstab entry uses grep to check whether the swap line already exists before appending it. Running echo >> /etc/fstab blindly would add duplicate entries every time the script runs.

We default to 1 GB of swap (SWAP_SIZE_MB=1024), which is a reasonable starting point for most small-to-medium VPS instances. You can adjust it through the environment variable if you need more.
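The append-only-if-missing pattern used for the fstab entry comes up repeatedly in idempotent setup scripts, so it can be factored into a small helper. The sketch below is one possible shape, with ensure_line as a hypothetical name. Unlike the fstab check above, which matches on the swap file path at the start of a line, this version requires an exact whole-line match.

```bash
# Append a line to a file only if an identical line is not already present.
# grep -F matches the text literally; -x requires a whole-line match.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Calling ensure_line /etc/fstab "${SWAP_FILE} none swap sw 0 0" any number of times leaves exactly one matching entry in the file.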

Logging and Retention

When something breaks, logs are often the first place you will look. Unfortunately, many servers treat logs as an afterthought -- default configurations keep a few days of data, store it ephemerally (lost on reboot), or let it grow unbounded until the disk fills up.

We configure systemd-journald for persistent, size-capped logging. Journald is the logging system used by systemd-based distributions (essentially every modern Linux distribution) and captures logs from services, the kernel, and boot processes in a structured binary format.

CONF_DIR="/etc/systemd/journald.conf.d"
CONF_FILE="${CONF_DIR}/deploy-crate.conf"

install -d -m 0755 "${CONF_DIR}"
install -d -m 2755 /var/log/journal

cat > "${CONF_FILE}" <<EOC
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=1G
SystemKeepFree=1G
RuntimeMaxUse=256M
RuntimeKeepFree=256M
MaxRetentionSec=14day
EOC

chmod 0644 "${CONF_FILE}"
systemctl restart systemd-journald

Let us break down what each directive does and why it matters.

Storage=persistent tells journald to write logs to /var/log/journal rather than only to the in-memory runtime journal. Without this setting, logs are lost on every reboot -- which is exactly when you need them most (to diagnose why the server rebooted).

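Compress=yes compresses larger journal entries before they are written to disk. This is already journald's default, and the algorithm (XZ, LZ4, or Zstandard) depends on how systemd was built. The CPU overhead is negligible, and the space savings on repetitive log output can be substantial.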

SystemMaxUse=1G caps the total disk space the journal can consume to 1 GB. This prevents logs from filling the disk entirely.

SystemKeepFree=1G is a complementary guard: journald also tries to leave at least 1 GB of the filesystem free for other uses, and it honors whichever of the two limits is stricter. This protects against the edge case where the journal is small but the disk is near capacity for other reasons.

RuntimeMaxUse=256M and RuntimeKeepFree=256M apply the same pair of limits, at 256 MB each, to the volatile runtime journal under /run/log/journal, which is used during early boot and on systems without persistent storage.

MaxRetentionSec=14day ensures that logs older than 14 days are automatically purged, regardless of disk usage. Two weeks is usually enough time to investigate most issues -- if you need longer retention for compliance reasons, you can increase this value or ship logs to an external service.

The install -d -m 2755 /var/log/journal line creates the persistent journal directory if it does not already exist, with the setgid bit so that files created inside inherit the directory's group ownership. This is important for access control when multiple services need to read the journal.

We drop our configuration in a .conf.d directory rather than editing the main journald.conf file directly. This is the cleanest approach: the distribution's defaults stay intact, and our overrides are clearly separated.

Monitoring

Monitoring is easy to postpone because everything appears to work without it -- right up until the moment something does not. By the time a user reports that the application is slow, the root cause could have been building for hours or days. Monitoring catches problems before they become incidents.

We deploy the Prometheus Node Exporter on every server. Node Exporter exposes hardware- and OS-level metrics (CPU, memory, disk, network, filesystem) in a format that Prometheus and compatible systems can scrape. It is widely used, actively maintained, and has a minimal resource footprint.

setup_node_exporter() {
    if ! id node_exporter >/dev/null 2>&1; then
        useradd --system --no-create-home --shell /bin/false node_exporter
    fi

    if [ ! -f /usr/local/bin/node_exporter ]; then
        cd /tmp
        # -f makes curl fail on HTTP errors instead of saving an error page
        curl -fsSLO "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
        tar xf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"

        cp "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
        chown node_exporter:node_exporter /usr/local/bin/node_exporter

        rm -rf "/tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64"*
    fi
}

The node_exporter runs under a dedicated system user with no login shell (--shell /bin/false) and no home directory (--no-create-home). This is the principle of least privilege: the exporter does not need, and should not have, any more access than necessary to collect and expose metrics.
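The download step above trusts whatever arrives over HTTPS. A cheap extra safeguard is to compare the tarball's SHA-256 digest against a value pinned in the script or taken from the checksum file published with the release. The sketch below shows the comparison only. verify_sha256 is a hypothetical helper, and it assumes sha256sum from GNU coreutils is installed.

```bash
# Compare a file's SHA-256 digest with an expected hex string.
# Assumes sha256sum from GNU coreutils is available.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "checksum ok"
    else
        echo "checksum mismatch for ${file}" >&2
        return 1
    fi
}
```

In setup_node_exporter this would sit between the curl and tar steps, so that a tampered or truncated archive is never unpacked.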

The systemd service unit adds further security hardening:

cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=127.0.0.1:9100
Restart=on-failure
RestartSec=5
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
ProtectControlGroups=true
ProtectKernelTunables=true
ProtectKernelModules=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictSUIDSGID=true
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=

[Install]
WantedBy=multi-user.target
EOF

The most important line here is --web.listen-address=127.0.0.1:9100. Node Exporter binds to localhost only, meaning its metrics endpoint is not reachable from outside the server. External monitoring systems connect through the Crate Operator's API (or a VPN tunnel), never by exposing the raw exporter to the internet.

The security directives deserve attention because they represent good systemd hygiene that should be used on every service:

NoNewPrivileges=true prevents the process from gaining new privileges, even through setuid binaries.
PrivateTmp=true gives the service a private /tmp directory, preventing it from seeing or interfering with other services' temporary files.
ProtectSystem=full mounts /usr and /etc as read-only, preventing the service from modifying system binaries or configuration.
ProtectHome=true makes /home directories inaccessible, which the exporter has no reason to access.
CapabilityBoundingSet= (empty) drops all Linux capabilities. The exporter does not need any.
MemoryDenyWriteExecute=true prevents memory regions from being both writable and executable, which blocks an entire class of code-injection attacks.
RestrictSUIDSGID=true prevents the service from setting the setuid or setgid bits on files and directories, so it cannot plant a privileged binary for later use.

These directives do not affect the exporter's functionality -- it only needs to read /proc and serve HTTP -- but they dramatically reduce what an attacker could do if they compromised the exporter process.

After writing the unit file, we enable and start the service:

systemctl daemon-reload
systemctl enable --now node_exporter
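Once the service is up, a quick smoke test is to pull a single metric from the local endpoint. The helper below is a minimal sketch: metric_value is a hypothetical name, and it only handles unlabelled metrics in the Prometheus text format, where each sample line is a name followed by a value and lines starting with # are comments.

```bash
# Extract the value of an unlabelled metric from Prometheus text-format
# output on stdin. Comment lines start with "#", so they never match.
metric_value() {
    local name="$1"
    awk -v name="$name" '$1 == name { print $2; exit }'
}
```

For example, curl -fsS http://127.0.0.1:9100/metrics | metric_value node_memory_MemAvailable_bytes should print the available memory in bytes. An empty result suggests the exporter is not serving on the expected address.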

Automated Maintenance

Servers accumulate operational debt over time. Docker images pile up after each deployment. Stopped containers linger. Build caches grow. Log files expand. None of these issues are particularly interesting by themselves, but left unattended they eventually consume disk space, degrade performance, and cause opaque failures that are difficult to diagnose under pressure.

This is why we prefer automating routine maintenance. It is the kind of work that humans are bad at remembering and computers are excellent at performing on schedule.

Host health guard scripts

We install three scripts that work together as a health monitoring and garbage collection system.

deploy-crate-host-preflight checks disk space, available memory, and swap. Each check has configurable thresholds, and the script exits with a non-zero status code if any threshold is violated:

cat > /usr/local/bin/deploy-crate-host-preflight << EOC
#!/usr/bin/env bash
set -euo pipefail

MIN_ROOT_FREE_MB="${MIN_ROOT_FREE_MB}"
MIN_ROOT_FREE_PERCENT="${MIN_ROOT_FREE_PERCENT}"
MIN_MEM_AVAILABLE_MB="${MIN_MEM_AVAILABLE_MB}"
MIN_SWAP_FREE_MB="${MIN_SWAP_FREE_MB}"

fail() {
  echo "\$1" >&2
  exit 1
}

check_disk() {
  local mountpoint="\$1"
  local min_mb="\$2"
  local min_pct="\$3"
  local free_mb use_pct pct_free

  read -r free_mb use_pct < <(df -Pm "\${mountpoint}" | awk 'NR==2 {print \$4, \$5}')
  use_pct="\${use_pct%%%}"
  pct_free=\$((100 - use_pct))

  if [ "\${free_mb}" -lt "\${min_mb}" ]; then
    fail "disk free below threshold on \${mountpoint}: \${free_mb}MB < \${min_mb}MB"
  fi

  if [ "\${pct_free}" -lt "\${min_pct}" ]; then
    fail "disk free percent below threshold on \${mountpoint}: \${pct_free}% < \${min_pct}%"
  fi
}

check_disk "/" "\${MIN_ROOT_FREE_MB}" "\${MIN_ROOT_FREE_PERCENT}"

if [ -d /var/lib/docker ]; then
  check_disk "/var/lib/docker" "\${MIN_ROOT_FREE_MB}" "\${MIN_ROOT_FREE_PERCENT}"
fi

mem_available_mb=\$(awk '/MemAvailable:/ {print int(\$2/1024)}' /proc/meminfo)
swap_free_mb=\$(awk '/SwapFree:/ {print int(\$2/1024)}' /proc/meminfo)

if [ "\${mem_available_mb}" -lt "\${MIN_MEM_AVAILABLE_MB}" ]; then
  fail "mem available below threshold: \${mem_available_mb}MB < \${MIN_MEM_AVAILABLE_MB}MB"
fi

if [ "\${swap_free_mb}" -lt "\${MIN_SWAP_FREE_MB}" ]; then
  fail "swap free below threshold: \${swap_free_mb}MB < \${MIN_SWAP_FREE_MB}MB"
fi

echo "ok"
EOC
chmod 0755 /usr/local/bin/deploy-crate-host-preflight

The script checks both absolute free space (MB) and percentage free space for each mount point. The absolute threshold catches a small disk whose free percentage looks healthy but whose remaining megabytes are too few to work with, and the percentage threshold catches a large disk that has plenty of free space in absolute terms but is proportionally tight. Both thresholds must be satisfied.
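The two failure cases are easier to see with concrete numbers. The sketch below reduces the dual check to a pure function and runs it against one small and one large disk. The function name disk_ok and the thresholds of 2048 MB and 10% are illustrative assumptions, not the script's actual defaults.

```bash
# Succeed only if both the absolute and the percentage threshold are met.
disk_ok() {
    local free_mb="$1" total_mb="$2" min_mb="$3" min_pct="$4"
    local pct_free=$(( free_mb * 100 / total_mb ))
    [ "$free_mb" -ge "$min_mb" ] && [ "$pct_free" -ge "$min_pct" ]
}

# Illustrative thresholds: at least 2048 MB and at least 10% free.
# 10 GB disk with 1536 MB free: 15% passes, but the absolute check fails.
disk_ok 1536 10240 2048 10 || echo "small disk: 15% free but only 1536 MB"
# 1 TB disk with 30 GB free: the absolute check passes, the percentage fails.
disk_ok 30720 1048576 2048 10 || echo "large disk: 30720 MB free but under 3%"
```

Either threshold alone would have let one of these two servers through.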

deploy-crate-docker-gc performs Docker garbage collection -- removing stopped containers, dangling images, and stale build cache:

cat > /usr/local/bin/deploy-crate-docker-gc << 'EOC'
#!/usr/bin/env bash
set -euo pipefail

if ! command -v docker >/dev/null 2>&1; then
  exit 0
fi

docker container prune --force --filter "until=24h" >/dev/null 2>&1 || true
docker image prune --force >/dev/null 2>&1 || true
docker builder prune --force --filter "until=168h" >/dev/null 2>&1 || true
EOC
chmod 0755 /usr/local/bin/deploy-crate-docker-gc

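The filters are intentional. The until=24h filter on containers matches on creation time, so a stopped container is only pruned once it was created more than 24 hours ago, which leaves time to inspect a recently failed container before it is cleaned up. Build cache is retained for 7 days (168 hours) because recent cache is valuable for accelerating repeated builds. Dangling images (untagged layers left behind by rebuilds and re-pulls) are pruned immediately because they accumulate quickly and can always be rebuilt or re-pulled. Images that still carry a tag are left alone, since docker image prune only removes all unused images when given --all.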

deploy-crate-host-guard ties the two together. It runs the preflight check first, and only triggers garbage collection if the check fails:

cat > /usr/local/bin/deploy-crate-host-guard << 'EOC'
#!/usr/bin/env bash
set -euo pipefail

if /usr/local/bin/deploy-crate-host-preflight >/dev/null 2>&1; then
  exit 0
fi

logger -t deploy-crate-host-guard "host pressure detected, attempting safe docker cleanup"
/usr/local/bin/deploy-crate-docker-gc || true

if ! /usr/local/bin/deploy-crate-host-preflight >/dev/null 2>&1; then
  logger -t deploy-crate-host-guard "host pressure persists after cleanup"
  exit 1
fi

logger -t deploy-crate-host-guard "host pressure recovered after cleanup"
EOC
chmod 0755 /usr/local/bin/deploy-crate-host-guard

The logger -t deploy-crate-host-guard calls mean that every run that detects pressure leaves a trace in the system journal; healthy runs exit silently. If you are troubleshooting a disk pressure issue days later, you can check the journal to see when the guard detected the problem and whether cleanup resolved it.

Systemd timers for scheduled execution

The scripts alone do nothing -- they need to be run on a schedule. We use systemd timers, the systemd-native alternative to cron. Timers have several advantages over cron: they integrate with systemd's dependency system, they can catch up on runs that were missed while the server was off (Persistent=true on calendar-based timers), and they support randomized delays to avoid thundering-herd problems.

cat > /etc/systemd/system/deploy-crate-docker-gc.service << 'EOC'
[Unit]
Description=Deploy Crate Docker Garbage Collection
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/deploy-crate-docker-gc
Nice=10
EOC

cat > /etc/systemd/system/deploy-crate-docker-gc.timer << 'EOC'
[Unit]
Description=Daily Deploy Crate Docker Garbage Collection

[Timer]
OnCalendar=daily
RandomizedDelaySec=15m
Persistent=true

[Install]
WantedBy=timers.target
EOC

The Docker GC runs once daily, with a randomized delay of up to 15 minutes so that hundreds of servers do not all run garbage collection at the exact same second (which could strain shared infrastructure like container registries).

The Nice=10 setting on the service unit lowers the CPU scheduling priority, so garbage collection does not compete with application workloads.

For the host guard, we use a more aggressive schedule:

cat > /etc/systemd/system/deploy-crate-host-guard.service << 'EOC'
[Unit]
Description=Deploy Crate Host Guard
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/deploy-crate-host-guard
Nice=10
EOC

cat > /etc/systemd/system/deploy-crate-host-guard.timer << 'EOC'
[Unit]
Description=Periodic Deploy Crate Host Guard

[Timer]
OnBootSec=2m
OnUnitActiveSec=15m
RandomizedDelaySec=60s
Persistent=true

[Install]
WantedBy=timers.target
EOC

The host guard runs every 15 minutes (OnUnitActiveSec=15m) starting 2 minutes after boot (OnBootSec=2m). Persistent=true is harmless here but has no effect, because systemd only honors it for OnCalendar= timers. This frequency catches disk pressure before it becomes critical -- if a runaway process fills the disk in under 15 minutes, you have bigger problems that a timer cannot solve.

To activate the timers:

systemctl daemon-reload
systemctl enable --now deploy-crate-docker-gc.timer
systemctl enable --now deploy-crate-host-guard.timer

You can inspect timer status at any time with systemctl list-timers, which shows when each timer last ran and when it will run next.

Safe Deployments

Deployments are one of the highest-risk operations performed on a production server. A deployment workflow should aim for predictability rather than raw speed. If you cannot roll back a deployment in under a minute, you are not deploying safely -- you are gambling.

Our deployment architecture has three layers: Docker configuration for reliable container operation, a blue/green service pattern for zero-downtime updates, and Caddy as a reverse proxy for traffic routing. Each layer contributes to making deployments boring -- in the best possible sense of the word.

Docker daemon configuration

The default Docker configuration works for development but is not suitable for production. We apply a daemon configuration that addresses log management and service continuity:

install -m 0755 -d /etc/docker
cat > /etc/docker/daemon.json << EOF
{
  "log-driver": "local",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5",
    "compress": "true"
  },
  "live-restore": true
}
EOF

systemctl enable --now docker
systemctl restart docker

The log-driver setting switches from the default json-file driver to the local driver. The local driver stores logs in an internal format that is more space-efficient and supports automatic rotation. The max-size (10 MB per log file), max-file (5 rotated files), and compress options ensure container logs do not consume unbounded disk space.
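To make the disk bound concrete, here is a small arithmetic sketch. The function name and the figure of 20 containers are illustrative assumptions, not part of the bootstrap script. It computes the worst-case footprint before compression, so real usage should come in lower.

```shell
#!/usr/bin/env bash
# Worst-case container log footprint in MB:
# containers x max-size x max-file (ignores the savings from compression).
log_budget_mb() {
    local containers=$1 max_size_mb=$2 max_file=$3
    echo $(( containers * max_size_mb * max_file ))
}

# One container with the settings above: 10 MB x 5 files
log_budget_mb 1 10 5    # prints 50

# Hypothetical host running 20 containers
log_budget_mb 20 10 5   # prints 1000
```

Knowing the ceiling up front makes it easier to size the disk and to choose thresholds for the host guard.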

live-restore: true is critical for production. Without it, restarting the Docker daemon stops all running containers. With live-restore enabled, containers continue running during a daemon restart or upgrade, and the restarted daemon reconnects to them. This means routine Docker package updates do not cause application downtime. Docker documents live restore as supported across patch releases, so a major-version daemon upgrade may still require restarting containers.

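We also add the admin user and operator user to the docker group so they can manage containers without sudo. Membership in the docker group is effectively root-equivalent on the host, so keep it limited to these two accounts: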

usermod -aG docker -- "${USERNAME}"
usermod -aG docker -- "${DEPLOY_CRATE_USER}"

Blue/green operator deployment

The blue/green pattern runs two identical instances of a service -- one active (green) and one standby (blue) -- on different ports. When deploying a new version, you update the standby instance, verify it is healthy, and then flip traffic to it. If something goes wrong, you flip back. There is no window where the service is down or running partially updated code.

We run two systemd units for the operator, each binding to a different port:

cat > /etc/systemd/system/crate-operator-green.service << EOF
[Unit]
Description=Deploy Crate Operator (Green)
After=network.target
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=${DEPLOY_CRATE_USER}
Environment="CRATE_OPERATOR_BIND_ADDR=127.0.0.1:9640"
Environment="CRATE_OPERATOR_SERVICE_INSTANCE=green"
EnvironmentFile=${OPERATOR_ENV_FILE}
ExecStart=${OPERATOR_BINARY_PATH}
Restart=on-failure
RestartSec=5
TimeoutStartSec=30
TimeoutStopSec=30
LimitNOFILE=65535
PrivateTmp=true
PrivateDevices=true
ProtectClock=true
ProtectControlGroups=true
ProtectKernelTunables=true
ProtectKernelModules=true
LockPersonality=true
RestrictRealtime=true
[Install]
WantedBy=multi-user.target
EOF

cat > /etc/systemd/system/crate-operator-blue.service << EOF
[Unit]
Description=Deploy Crate Operator (Blue)
After=network.target caddy.service
Wants=caddy.service
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=${DEPLOY_CRATE_USER}
Environment="CRATE_OPERATOR_BIND_ADDR=127.0.0.1:9641"
Environment="CRATE_OPERATOR_SERVICE_INSTANCE=blue"
EnvironmentFile=${OPERATOR_ENV_FILE}
ExecStart=${OPERATOR_BINARY_PATH}
Restart=on-failure
RestartSec=5
TimeoutStartSec=30
TimeoutStopSec=30
LimitNOFILE=65535
PrivateTmp=true
PrivateDevices=true
ProtectClock=true
ProtectControlGroups=true
ProtectKernelTunables=true
ProtectKernelModules=true
LockPersonality=true
RestrictRealtime=true
[Install]
WantedBy=multi-user.target
EOF

The two units are nearly identical. They differ in the CRATE_OPERATOR_SERVICE_INSTANCE environment variable and the port, and the blue unit is additionally ordered after caddy.service. Green listens on 127.0.0.1:9640 and blue on 127.0.0.1:9641. Because CRATE_OPERATOR_BIND_ADDR uses 127.0.0.1, the operator is not directly reachable from the public internet. All external traffic goes through Caddy.

Note the StartLimitIntervalSec=300 and StartLimitBurst=5 settings. These define a failure window: if the service fails to start 5 times within 300 seconds (5 minutes), systemd stops trying and marks the unit as failed. This prevents a broken deployment from entering an infinite restart loop that floods logs and wastes resources.

After writing the unit files, we enable and start green, while keeping blue disabled and stopped:

systemctl daemon-reload
systemctl enable crate-operator-green
systemctl start crate-operator-green
systemctl disable crate-operator-blue || true
systemctl stop crate-operator-blue || true

The || true on the disable and stop commands makes them best-effort. If either returns a non-zero status, for example because blue was never enabled or started on this host, the failure is swallowed and does not abort the entire script.
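The effect of || true is easiest to see in isolation. The sketch below assumes the bootstrap script runs under set -e, which this section does not show. It uses false as a stand-in for a failing systemctl call and runs each variant in its own bash process.

```shell
#!/usr/bin/env bash
# Each snippet runs in a separate bash process so its errexit setting
# cannot affect the calling shell.

# With "|| true", the failing command is tolerated and the script continues.
tolerant=$(bash -c 'set -e; false || true; echo "continued"')

# Without it, errexit stops the script at the failing command,
# so the echo never runs and nothing is printed.
strict=$(bash -c 'set -e; false; echo "continued"' || true)

echo "tolerant: '${tolerant}'"   # prints: tolerant: 'continued'
echo "strict:   '${strict}'"     # prints: strict:   ''
```

The trade-off is that || true hides every failure of that command, so it is best reserved for steps that are genuinely optional.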

Health check validation

After starting the operator, we validate that it is healthy before proceeding. A deployment that does not verify the service came up correctly is not a deployment -- it is a wish:

for i in $(seq 1 12); do
    if curl -sf -H "X-API-KEY: ${DEPLOY_CRATE_API_KEY}" \
       "http://127.0.0.1:9640/v1/health" >/dev/null 2>&1; then
        log "Operator is healthy"
        break
    fi
    if [ "$i" -eq 12 ]; then
        log "ERROR: Operator health check failed" >&2
        journalctl -u crate-operator-green --no-pager -n 50 || true
        exit 1
    fi
    sleep 5
done

The health check loop tries 12 times with 5 seconds between attempts, giving the operator roughly 55 seconds (11 sleeps, plus the time each request takes) to start and become healthy. If it never responds, we dump the last 50 lines of the service journal so you can see what went wrong, and exit the script with a failure code.
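The same pattern can be factored into a reusable helper, which is handy when the blue instance needs the identical check on its own port. This is a sketch, not the bootstrap script's code: retry and probe_operator are hypothetical names, and the --max-time value is an assumption added so that a hung connection cannot stall the loop.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Sleeps DELAY_SECONDS between attempts, but not after the last one.
retry() {
    local attempts=$1 delay=$2 i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        if "$@"; then
            return 0
        fi
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical probe for the health endpoint shown above.
# The port defaults to green (9640); pass 9641 to probe blue.
probe_operator() {
    curl -sf --max-time 3 -H "X-API-KEY: ${DEPLOY_CRATE_API_KEY:-}" \
        "http://127.0.0.1:${1:-9640}/v1/health" >/dev/null 2>&1
}

# Usage (not executed here):
#   retry 12 5 probe_operator 9641 || { echo "blue is unhealthy" >&2; exit 1; }
```

Keeping the retry logic separate from the probe means the same loop can wrap any readiness check, not only an HTTP request.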

Caddy reverse proxy

Caddy acts as the TLS-terminating reverse proxy. It handles Let's Encrypt certificate automation and routes traffic to the appropriate backend. The Caddy API is used at runtime to configure routes -- no configuration file edits required for blue/green switching:

cat > /etc/caddy/Caddyfile <<EOF
{
  admin localhost:2019
}
EOF

systemctl enable --now caddy

The minimal Caddyfile only enables the admin API on localhost. All route configuration happens dynamically through the API, which is how we implement blue/green traffic switching without restarting Caddy or reloading configuration files.

The routing configuration uses a weighted round-robin strategy that initially directs all traffic to green:

{
  "@id": "crate_operator_api",
  "match": [
    {
      "host": ["${OPERATOR_DOMAIN}"]
    }
  ],
  "handle": [
    {
      "handler": "subroute",
      "routes": [
        {
          "handle": [
            {
              "handler": "reverse_proxy",
              "load_balancing": {
                "selection_policy": {
                  "policy": "weighted_round_robin",
                  "weights": [100, 0]
                }
              },
              "upstreams": [
                { "dial": "127.0.0.1:9640" },
                { "dial": "127.0.0.1:9641" }
              ]
            }
          ]
        }
      ]
    }
  ],
  "terminal": true
}

The weights: [100, 0] sends 100% of traffic to the green upstream (port 9640) and 0% to the blue upstream (port 9641). To perform a blue/green deployment, you update the blue service, verify it is healthy, then flip the weights to [0, 100] -- traffic moves to blue with zero downtime. If the blue deployment is broken, flip back to [100, 0] and investigate. The rollback is instantaneous.
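One way to script the flip is a PATCH against the Caddy admin API, addressed through the route's @id. The sketch below is an assumption about how this could be wired up rather than the actual DeployCrate tooling. The function names are hypothetical, and the URL path simply walks the route JSON shown above down to the weights array.

```shell
#!/usr/bin/env bash
# Map the instance that should receive traffic to a weights array.
# Order matches the upstreams list: [green (9640), blue (9641)].
weights_for() {
    case "$1" in
        green) echo '[100, 0]' ;;
        blue)  echo '[0, 100]' ;;
        *)     echo "unknown instance: $1" >&2; return 1 ;;
    esac
}

# Replace the weights in place through the admin API on localhost:2019.
# Returns non-zero if the instance name is invalid or the request fails.
switch_traffic() {
    local payload
    payload=$(weights_for "$1") || return 1
    curl -sf -X PATCH \
        -H 'Content-Type: application/json' \
        -d "$payload" \
        "http://localhost:2019/id/crate_operator_api/handle/0/routes/0/handle/0/load_balancing/selection_policy/weights"
}

# Usage (not executed here):
#   switch_traffic blue    # cut over to the new version
#   switch_traffic green   # roll back
```

Because rollback is the same call with the other instance name, it is worth rehearsing both directions before relying on them during an incident.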

Operator binary download and verification

Before any of the services can run, the operator binary must be downloaded and verified. When a checksum is provided, we compare it against the SHA-256 of the download to ensure the binary has not been tampered with or corrupted in transit:

# Start from a clean slate so a stale partial download is never reused.
rm -f "${OPERATOR_TMP_BINARY}"
curl -f -L -o "${OPERATOR_TMP_BINARY}" "${DOWNLOAD_URL}" \
  || { log "ERROR: Failed to download operator binary" >&2; exit 1; }

# Verification only runs when CHECKSUM is set.
if [ -n "${CHECKSUM}" ]; then
    actual_checksum=$(sha256sum "${OPERATOR_TMP_BINARY}" | awk '{print $1}')
    if [ "$actual_checksum" != "${CHECKSUM}" ]; then
        rm -f "${OPERATOR_TMP_BINARY}"
        log "ERROR: Checksum mismatch" >&2
        exit 1
    fi
fi

chmod +x "${OPERATOR_TMP_BINARY}"
mkdir -p "${OPERATOR_INSTALL_DIR}"
# Stage next to the destination, then rename into place.
mv "${OPERATOR_TMP_BINARY}" "${OPERATOR_BINARY_PATH}.tmp" && \
  mv "${OPERATOR_BINARY_PATH}.tmp" "${OPERATOR_BINARY_PATH}"

chown "${DEPLOY_CRATE_USER}:deploy-crate" "${OPERATOR_BINARY_PATH}"

Staging the binary next to its destination and then renaming it is an atomic install pattern. The first mv may cross filesystems (for example, from a temporary directory), in which case it is a copy rather than a rename. The second mv is a rename within the same directory, which is atomic on the same filesystem: anything that opens the path sees either the old binary or the complete new one, never a partially written file.
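The stage-then-rename idea generalizes to any file that must never be observed half-written. Here is a minimal sketch with a hypothetical atomic_install helper. It differs from the script above in that it picks a unique staging name with mktemp and cleans up after itself on failure.

```shell
#!/usr/bin/env bash
# atomic_install SRC DEST
# Copies SRC to a staging file beside DEST, then renames it over DEST.
# Readers of DEST see the old file or the new one, never a partial copy.
atomic_install() {
    local src=$1 dest=$2 staged
    # Staging in the destination directory keeps the final rename
    # on a single filesystem.
    staged=$(mktemp "${dest}.XXXXXX") || return 1
    # install copies the content and sets the mode in one step.
    if install -m 0755 "$src" "$staged" && mv -f "$staged" "$dest"; then
        return 0
    fi
    rm -f "$staged"
    return 1
}

# Usage (not executed here):
#   atomic_install /tmp/crate-operator.download /usr/local/bin/crate-operator
```

The paths in the usage comment are placeholders. A process that is already running the old binary keeps its open file and is unaffected until it is restarted.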

The operator environment file stores sensitive configuration separately from the systemd unit files:

mkdir -p /etc/crate-operator
printf 'CRATE_OPERATOR_API_KEY=%s\nCRATE_OPERATOR_SERVER_ID=%s\n' \
  "${DEPLOY_CRATE_API_KEY}" "${CRATE_OPERATOR_SERVER_ID}" > "${OPERATOR_ENV_FILE}"
chmod 600 "${OPERATOR_ENV_FILE}"

Permissions on the environment file are 600, so only the owner (root) can read it. The operator's service user does not need read access: systemd reads the file named by EnvironmentFile with root privileges when it starts the unit and passes the variables into the service's environment.
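One refinement worth considering: when the file is created fresh, it briefly carries the default umask permissions until chmod runs. The sketch below closes that window by setting a restrictive umask before writing. write_env_file is a hypothetical helper, not part of the bootstrap script.

```shell
#!/usr/bin/env bash
# write_env_file PATH API_KEY SERVER_ID
# Writes the operator environment file so that it is never world-readable.
write_env_file() {
    local path=$1 api_key=$2 server_id=$3
    (
        # In this subshell, newly created files get mode 0600 from the start.
        umask 077
        printf 'CRATE_OPERATOR_API_KEY=%s\nCRATE_OPERATOR_SERVER_ID=%s\n' \
            "$api_key" "$server_id" > "$path"
    ) || return 1
    # Still needed when the file already existed with looser permissions.
    chmod 600 "$path"
}

# Usage (not executed here):
#   write_env_file "${OPERATOR_ENV_FILE}" "${DEPLOY_CRATE_API_KEY}" "${CRATE_OPERATOR_SERVER_ID}"
```

Running umask inside a subshell keeps the change scoped to this one write, so the rest of the script keeps its normal default.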

Production Readiness Checklist

Before considering a server production-ready, we want to be able to answer "yes" to every question on this checklist. Each item is linked to the section of this guide that explains how to implement it.

#	Check	Implementation
1	Are SSH keys configured and is password authentication disabled?	User Management + SSH Hardening
2	Is a dedicated admin user created with sudo access?	User Management
3	Is an emergency recovery account configured with scoped sudo access?	User Management
4	Is the SSH daemon hardened (restricted users, rate-limited connections, idle timeouts)?	SSH Hardening
5	Is a firewall enabled that denies all traffic except SSH, HTTP, and HTTPS?	Firewall Protection
6	Is Fail2Ban running with SSH jail configured?	Brute Force Protection
7	Is swap enabled?	Swap Configuration
8	Are system logs configured for persistent, size-capped retention?	Logging and Retention
9	Is Node Exporter installed and running with restricted privileges?	Monitoring
10	Is container log rotation configured (Docker local log driver)?	Docker Configuration
11	Are automated maintenance timers running (Docker GC, host guard)?	Automated Maintenance
12	Are blue/green service units configured with health checks?	Blue/Green Deployment
13	Is Caddy serving as TLS-terminating reverse proxy?	Caddy Reverse Proxy
14	Is the operator binary verified with checksum before installation?	Binary Installation
15	Can you roll back a deployment in under one minute without downtime?	Blue/Green Weight Flipping

If the answer to any of these questions is no, the server is probably not production-ready yet. Each "no" represents an operational risk that will eventually become an operational incident -- it is only a matter of time.

Closing Thoughts

A production-ready VPS is not defined by how much software is installed on it. It is defined by how many operational problems have already been considered before they occur. Security, monitoring, maintenance, and deployment workflows are not separate projects to tackle after the application is running. They are all part of the same goal: building a server that you can trust.

Most of the configurations in this guide are one-time setup. Once the SSH daemon is hardened, it stays hardened. Once Fail2Ban is running, it keeps running. Once the timers are installed, they fire on schedule forever. The investment is front-loaded, and the return is years of not having to think about these things at 3 AM during an incident.

The bootstrap script we use at DeployCrate runs all of these steps automatically when provisioning a new server or auditing an existing one. You can adapt the individual sections to your own needs -- take what applies to your stack and leave the rest. The principles are universal: least privilege, defense in depth, automation over heroism, and always having a way back.