A self-hosted Nextcloud cluster spread across multiple Proxmox nodes, built so that losing a node doesn’t take down file storage, Talk, Collabora, or Whiteboard. Same clustering approach as the HAHA project — Pacemaker, DRBD, the works — applied to a different problem.
## Overview
| Aspect | Details |
|---|---|
| Load Balancing | HAProxy (Layer 7) + DNS Round Robin |
| Database | Galera-MariaDB cluster |
| Cache | Redis Sentinel |
| Storage | DRBD + OCFS2 shared volumes |
| Authentication | FreeIPA LDAP |
| Services | Nextcloud, Talk, Collabora CODE, Whiteboard |
## Technology Stack
### Orchestration & HA
- Corosync + Pacemaker for VIP management
- HAProxy for Layer 7 load balancing
- DNS Round Robin for geographic distribution
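Assuming a standard Pacemaker setup, the VIP-plus-HAProxy pairing can be sketched with `pcs`. The resource names and address below are placeholders, not values from this cluster:

```shell
# Floating VIP managed by Pacemaker (address is a placeholder)
pcs resource create nc_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=10s

# Run HAProxy under cluster control and pin it to whichever node holds the VIP
pcs resource create haproxy systemd:haproxy op monitor interval=15s
pcs constraint colocation add haproxy with nc_vip INFINITY
pcs constraint order nc_vip then haproxy
```

The colocation plus ordering constraints are what make failover atomic: the VIP and the load balancer always move together, in that order.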
### Storage Backend
- DRBD + OCFS2 shared volumes
- Clustered Galera-MariaDB for database HA
- Redis Sentinel for distributed caching
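OCFS2 requires both DRBD peers to be writable at once, so the DRBD resource has to run dual-primary. A minimal sketch in DRBD 8.4-style syntax, with hostnames, devices, and addresses as placeholders:

```
# /etc/drbd.d/nc_data.res (illustrative names and devices)
resource nc_data {
    protocol C;                      # synchronous replication, required for dual-primary
    net {
        allow-two-primaries yes;     # both nodes primary so OCFS2 can mount on each
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
    }
    on node1 {
        device    /dev/drbd0;
        disk      /dev/nvme0n1p3;
        address   192.168.1.11:7789;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/nvme0n1p3;
        address   192.168.1.12:7789;
        meta-disk internal;
    }
}
```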
### Networking & Security
- FreeIPA-based LDAP authentication
- SSL offloading at HAProxy
- Health checks for automatic failover
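SSL offloading and the failover health checks meet in `haproxy.cfg`. A hedged sketch (certificate path, server names, and addresses are illustrative); `/status.php` is Nextcloud's built-in status endpoint:

```
frontend nextcloud_https
    bind *:443 ssl crt /etc/haproxy/certs/nextcloud.pem   # TLS terminates here
    http-request set-header X-Forwarded-Proto https
    default_backend nextcloud_web

backend nextcloud_web
    balance roundrobin
    option httpchk GET /status.php
    http-check expect status 200
    server nc1 192.168.1.21:80 check inter 2s fall 3 rise 2
    server nc2 192.168.1.22:80 check inter 2s fall 3 rise 2
    server nc3 192.168.1.23:80 check inter 2s fall 3 rise 2 backup
```

The `backup` keyword matches the third frontend's standby role: it only receives traffic when both primaries fail their checks.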
### Nextcloud Services
- Frontend nodes (web servers)
- Backend services: Talk, Collabora CODE, Imaginary, Whiteboard
- WebDAV for file access
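For the frontends to share one data layer safely, Nextcloud itself must use Redis for file locking rather than per-node locks. A `config.php` fragment along these lines (host and password are placeholders):

```php
<?php
// config/config.php (fragment): Redis for cluster-wide locking and caching
'memcache.local' => '\OC\Memcache\APCu',        // per-node local cache
'memcache.locking' => '\OC\Memcache\Redis',     // shared file locking across frontends
'memcache.distributed' => '\OC\Memcache\Redis',
'redis' => [
    'host' => '192.168.1.30',   // placeholder: VIP or Sentinel-resolved master
    'port' => 6379,
    'password' => 'changeme',
],
```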
### Infrastructure
- Two NVMe-equipped nodes with PCI passthrough
- Third arbitrator node for quorum
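The arbitrator node holds no data; it only breaks ties. At the Galera layer that job can go to `garbd`, the Galera Arbitrator daemon. A sketch of its Debian-style defaults file, with placeholder addresses:

```shell
# /etc/default/garb: Galera Arbitrator on the third node (quorum vote only, no data)
GALERA_NODES="192.168.1.11:4567 192.168.1.12:4567"
GALERA_GROUP="nc_cluster"   # must match wsrep_cluster_name on the data nodes
```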
## Architecture
```
┌───────────────────────────────────────────────────────┐
│                    DNS Round Robin                    │
│                (nextcloud.domain.tld)                 │
└───────────────────────────┬───────────────────────────┘
                            │
         ┌──────────────────┴──────────────────┐
         ▼                                     ▼
┌─────────────────┐                   ┌─────────────────┐
│    HAProxy 1    │                   │    HAProxy 2    │
│    (Active)     │◄─────────────────►│    (Standby)    │
│  Floating VIP   │     Pacemaker     │                 │
└────────┬────────┘                   └────────┬────────┘
         │                                     │
         └──────────────────┬──────────────────┘
                            │ Health Checks
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
  │ Nextcloud 1 │    │ Nextcloud 2 │    │ Nextcloud 3 │
  │  Frontend   │    │  Frontend   │    │  (Backup)   │
  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            │
┌───────────────────────────┴───────────────────────────┐
│                   Backend Services                    │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐     │
│   │   Talk    │    │ Collabora │    │ Whiteboard│     │
│   └───────────┘    └───────────┘    └───────────┘     │
└───────────────────────────┬───────────────────────────┘
                            │
┌───────────────────────────┴───────────────────────────┐
│                      Data Layer                       │
│   ┌─────────────────┐      ┌─────────────────┐        │
│   │ Galera-MariaDB  │      │ Redis Sentinel  │        │
│   │     Cluster     │      │    (3 nodes)    │        │
│   └─────────────────┘      └─────────────────┘        │
│   ┌─────────────────────────────────────────┐         │
│   │        DRBD + OCFS2 (NFS Export)        │         │
│   └─────────────────────────────────────────┘         │
└───────────────────────────┬───────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                        FreeIPA                        │
│                 (LDAP Authentication)                 │
└───────────────────────────────────────────────────────┘
```
## What Made This Tricky
The challenge wasn’t any single component — it was getting Galera, Redis Sentinel, DRBD, OCFS2, and HAProxy to all agree on what “healthy” means at the same time. Each layer has its own idea of quorum, its own failure detection, and its own timeout semantics.
HAProxy health checks needed careful tuning to avoid flapping during Galera donor/desynced states. The DRBD + OCFS2 lock handling issues I’d already dealt with in the HAHA project showed up again here, plus new ones from Nextcloud’s file locking interacting with the clustered filesystem.
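A common way to stop that flapping is the `clustercheck` pattern: a small HTTP responder (typically exposed via xinetd on port 9200) that returns 200 only when `wsrep_local_state` is Synced, so HAProxy ejects donor/desynced nodes cleanly instead of letting them half-serve traffic. A sketch with placeholder addresses; the exact thresholds are illustrative:

```
backend galera_mysql
    mode tcp
    balance leastconn
    # Health check hits the clustercheck shim on 9200, not MySQL itself;
    # only Synced nodes answer 200, so donor/desynced nodes drop out cleanly
    option httpchk GET /
    server db1 192.168.1.11:3306 check port 9200 inter 2s fall 3 rise 5
    server db2 192.168.1.12:3306 check port 9200 inter 2s fall 3 rise 5
```

The high `rise` value is deliberate: a node returning from donor state must pass several consecutive checks before taking traffic again.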
FreeIPA LDAP as the auth backend added another dependency — if the LDAP server is down, nobody logs in, regardless of how available everything else is.
The system tolerates node failures and restarts without data loss. I’ve tested it by pulling nodes out mid-operation.
Inspired by Matthias Wobben’s “Guide to Nextcloud Cluster Design” for architectures supporting up to 500k users.
