# ns8-backup-monitor

> **NethServer 8 backup failure notification service.**
>
> Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a
> detailed HTML/text email through the NS8 mail relay.

---

## Table of contents

1. [Architecture](#architecture)
2. [File layout](#file-layout)
3. [Runtime paths](#runtime-paths)
4. [Requirements](#requirements)
5. [Installation](#installation)
6. [Configuration](#configuration)
7. [Alertmanager integration](#alertmanager-integration)
8. [Outcome classification](#outcome-classification)
9. [Redis key structure](#redis-key-structure)
10. [Service management](#service-management)
11. [Troubleshooting](#troubleshooting)
12. [Uninstallation](#uninstallation)
13. [License](#license)

---

## Architecture

```
Alertmanager ──POST /alert──► receiver.py
                                   │
                     (wait N seconds for all modules
                      to finish writing their status)
                                   │
                                   ▼
                             correlator.py
                       (reads Redis KEYS/HGETALL,
                        classifies outcome:
                        SUCCESS / PARTIAL / REPO_FAILURE)
                                   │
                                   ▼
                             repo_check.py  ← optional
                       (runagent → restic snapshots
                        on each module's repository)
                                   │
                                   ▼
                              notifier.py
                       (builds HTML + plain-text email,
                        dispatches via ns8-sendmail)
```
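
The flow in the diagram can be read as a four-stage pipeline. The sketch below is illustrative only and is not taken from the project's code: `handle_alert` and the injected `correlate`, `check_repos`, and `notify` callables are hypothetical stand-ins for `correlator.py`, `repo_check.py`, and `notifier.py`, whose real signatures are not documented here. Passing `sleep` as a parameter is likewise an assumption, made so the grace period can be skipped when experimenting.

```python
import time
from typing import Any, Callable, Optional


def handle_alert(
    payload: dict,
    wait_seconds: float,
    correlate: Callable[[dict], Any],
    notify: Callable[[Any, Optional[Any]], None],
    check_repos: Optional[Callable[[Any], Any]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the stages in the order shown in the diagram (hypothetical helper)."""
    # 1. Grace period: let every module finish writing its status.
    sleep(wait_seconds)
    # 2. Correlate the per-module status into a single report.
    report = correlate(payload)
    # 3. Optional repository probe; skipped when no checker is supplied.
    repo_status = check_repos(report) if check_repos is not None else None
    # 4. Build and send the notification.
    notify(report, repo_status)
```
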

**Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. This means it is always ready to receive an
alert regardless of whether the backup was triggered manually or by a scheduled
timer.

---

## File layout

```
ns8-backup-monitor/
│
├── README.md                        ← this file
│
├── config/
│   └── config.yml.example           ← annotated configuration template
│                                      (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│   ├── install.sh                   ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service   ← systemd unit file
│
└── ns8_backup_monitor/              ← Python package
    ├── __init__.py                  ← package metadata, version string
    ├── __main__.py                  ← entry point: arg parsing, logging init,
    │                                  hands off to receiver.run_server()
    ├── receiver.py                  ← HTTP webhook server (POST /alert)
    ├── correlator.py                ← reads Redis, classifies backup outcome
    ├── repo_check.py                ← probes restic repositories via runagent
    ├── notifier.py                  ← builds and sends email notifications
    └── utils.py                     ← load_config(), setup_logging()
```

---

## Runtime paths

The following paths are created by `deploy/install.sh` and assumed by the
default configuration.

| Purpose | Path |
|---------|------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |

---

## Requirements

| Dependency | Provided by | Notes |
|------------|------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed |
| `runagent` | NethServer 8 | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |

> **This service must run on an NS8 leader node** (or any node that has
> read access to the cluster Redis socket and `runagent` in `PATH`).

---

## Installation

### One-liner (recommended)

```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```

The installer will:

1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source archive from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service.

### Manual installation

```bash
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor

# Install Python dependency
pip3 install pyyaml

# Create directories
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor

# Copy source and config template
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
# Edit the config before starting
nano /etc/ns8-backup-monitor/config.yml

# Install systemd unit
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
```

---

## Configuration

The configuration file is a YAML document. The installer writes it to
`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
at `config/config.yml.example`.

```yaml
# ---------------------------------------------------------------------------
# Email notification settings
# ---------------------------------------------------------------------------
# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
# configured in NethServer 8. No SMTP credentials are needed here.
mail:
  # Envelope / header sender address.
  from: "ns8-backup-monitor@yourdomain.com"

  # One or more recipient addresses. At least one is required.
  to:
    - "admin@yourdomain.com"

  # String prepended to every email subject line.
  subject_prefix: "[NS8 Backup]"

# ---------------------------------------------------------------------------
# Webhook receiver (HTTP server)
# ---------------------------------------------------------------------------
receiver:
  # Interface to listen on. 127.0.0.1 is recommended when Alertmanager
  # runs on the same host; use 0.0.0.0 only if it runs on a different node.
  host: "127.0.0.1"
  # TCP port. Must match the webhook URL configured in Alertmanager.
  port: 9099

# ---------------------------------------------------------------------------
# Timing
# ---------------------------------------------------------------------------
correlator:
  # Seconds to wait after receiving the alert before reading Redis.
  # This grace period allows all module agents to finish writing their
  # per-module status hashes. 30 s is sufficient for most deployments.
  wait_seconds: 30

  # Look-back window in seconds used when the alert does not include a
  # backup_id label. Any plan whose Redis status was updated within this
  # window is considered "recent" and included in the report.
  recent_window: 3600

# ---------------------------------------------------------------------------
# Redis connection
# ---------------------------------------------------------------------------
redis:
  # Path to the NS8 cluster Redis Unix socket.
  # On a standard NS8 installation this path never changes.
  socket: "/var/lib/nethserver/cluster/state/redis.sock"

# ---------------------------------------------------------------------------
# Repository check (optional, uses runagent + restic)
# ---------------------------------------------------------------------------
repo_check:
  # Maximum seconds to wait for each repository check before giving up.
  timeout: 60
  # Extra flags passed verbatim to every restic invocation.
  # Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
  restic_flags: ""

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging:
  # Python log level: DEBUG, INFO, WARNING, ERROR.
  level: INFO
  # Absolute path for the rotating log file (5 MB × 3 backups).
  # Leave empty to log to stdout / journald only.
  file: "/var/log/ns8-backup-monitor.log"
```
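
How `utils.load_config()` treats missing or invalid keys is not described here. As a rough sketch, assuming the values shown in the template are also the defaults, a parsed configuration dictionary (for example the result of `yaml.safe_load`) could be merged with defaults and checked as follows. `DEFAULTS`, `merge_defaults`, and `validate` are hypothetical names used only for illustration.

```python
import copy

# Defaults mirroring the annotated template; assumed, not taken from the code.
DEFAULTS = {
    "mail": {"subject_prefix": "[NS8 Backup]"},
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"timeout": 60, "restic_flags": ""},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_defaults(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of the defaults, overridden section by section."""
    merged = copy.deepcopy(defaults)
    for section, values in (user or {}).items():
        if isinstance(values, dict):
            merged.setdefault(section, {}).update(values)
        else:
            merged[section] = values
    return merged


def validate(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    mail = cfg.get("mail", {})
    if not mail.get("from"):
        problems.append("mail.from is required")
    if not mail.get("to"):
        problems.append("mail.to needs at least one recipient")
    port = cfg.get("receiver", {}).get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("receiver.port must be an integer between 1 and 65535")
    wait = cfg.get("correlator", {}).get("wait_seconds", 0)
    if not isinstance(wait, (int, float)) or wait < 0:
        problems.append("correlator.wait_seconds must be a non-negative number")
    return problems
```
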

---

## Alertmanager integration

Add a receiver pointing to the service in your Alertmanager configuration:

```yaml
# alertmanager.yml (relevant excerpt)
route:
  receiver: ns8-backup-monitor
  # Only route backup-related alerts to this receiver.
  routes:
    - match:
        alertname: NethServerBackupFailed
      receiver: ns8-backup-monitor

receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: "http://127.0.0.1:9099/alert"
        # Send resolved alerts too so the service can log them.
        send_resolved: true
```

Reload Alertmanager after editing:

```bash
systemctl reload alertmanager
# or, for the NS8 metrics module:
runagent -m metrics1 systemctl reload alertmanager
```
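
Alertmanager delivers webhook notifications as a JSON document whose `alerts` array holds one object per alert, each with a `status` (`firing` or `resolved`) and a `labels` map. The following is a minimal sketch of how a receiver might pick out the relevant alerts. `extract_backup_alerts` is a hypothetical helper, not the project's actual parser, and `backup_id` is the optional label mentioned in the configuration comments.

```python
import json
from typing import List, Optional, Tuple


def extract_backup_alerts(
    body: bytes, alertname: str = "NethServerBackupFailed"
) -> List[Tuple[str, Optional[str]]]:
    """Return (status, backup_id) for every alert matching alertname.

    backup_id is None when the alert carries no such label.
    """
    payload = json.loads(body)
    matches = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if labels.get("alertname") != alertname:
            continue  # unrelated alert routed to the same receiver
        matches.append((alert.get("status", ""), labels.get("backup_id")))
    return matches
```
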

---

## Outcome classification

For each backup plan the correlator reads all per-module status hashes and
produces one of three outcomes:

| Outcome | Condition | Email subject |
|---------|-----------|---------------|
| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` |
| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` |
| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` |
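
The rules in the table can be expressed as a small function. This is a sketch under stated assumptions rather than the code in `correlator.py`: `classify` is a hypothetical name, it takes only the `result` value of each per-module hash, and it treats any value other than `"success"` as a failure.

```python
from typing import Iterable


def classify(results: Iterable[str]) -> str:
    """Map per-module result values to a plan-level outcome."""
    results = list(results)
    succeeded = sum(1 for r in results if r == "success")
    if results and succeeded == len(results):
        return "SUCCESS"
    if succeeded > 0:
        # Some modules succeeded, so at least one of the others failed.
        return "PARTIAL"
    # Every module failed, or no status was found in Redis at all.
    return "REPO_FAILURE"
```
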

---

## Redis key structure

The correlator reads two families of keys from the NS8 cluster Redis:

| Key pattern | Description |
|-------------|-------------|
| `cluster/backup/<backup_id>/status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). |
| `module/<module_id>/backup/<backup_id>/status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). |

`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601
string in UTC (e.g. `2024-01-15T03:00:05Z`).
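
To make the key patterns and the timestamp format concrete, here is a small sketch. All function names are hypothetical, the parser handles only the exact format shown in the example timestamp, and `is_recent` is one possible reading of the `correlator.recent_window` setting rather than the correlator's actual logic.

```python
from datetime import datetime, timedelta, timezone


def plan_status_key(backup_id: str) -> str:
    """Key of the plan-level status hash."""
    return f"cluster/backup/{backup_id}/status"


def module_status_key(module_id: str, backup_id: str) -> str:
    """Key of the per-module status hash."""
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_timestamp(value: str) -> datetime:
    """Parse e.g. '2024-01-15T03:00:05Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_recent(value: str, window_seconds: int, now: datetime) -> bool:
    """True when the status was written within the look-back window."""
    age = now - parse_timestamp(value)
    return timedelta(0) <= age <= timedelta(seconds=window_seconds)
```
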

---

## Service management

```bash
# Check service status
systemctl status ns8-backup-monitor

# Follow live logs via journald
journalctl -u ns8-backup-monitor -f

# Follow the rotating log file directly
tail -f /var/log/ns8-backup-monitor.log

# Restart after a config change
systemctl restart ns8-backup-monitor

# Test the webhook endpoint manually
curl -s -X POST http://127.0.0.1:9099/alert \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
```

---

## Troubleshooting

### Service starts but no email is received

1. Verify `ns8-sendmail` works independently:
   ```bash
   echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
   ```
2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
3. Increase log level to `DEBUG` and restart the service.

### `REPO_FAILURE` on every alert even though backups succeed

- The correlator may be reading Redis before all modules have finished.
  Increase `correlator.wait_seconds` (e.g. to `60`).
- Check that the Redis socket path is correct:
  `redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`

### Alertmanager does not reach the webhook

- Confirm the service is listening:
  `ss -tlnp | grep 9099`
- If Alertmanager runs on a different host, change `receiver.host` to
  `0.0.0.0` and open the port in the firewall.

---

## Uninstallation

```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```

The script will stop and disable the service, remove the install directory,
and optionally remove the configuration directory.

---

## License

MIT. See [LICENSE](LICENSE) if present; otherwise contact the repository owner.