2026-05-18 15:09:33 +00:00
# ns8-backup-monitor
> **NethServer 8 backup failure notification service.**
>
> Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a
> structured plain-text email through the NS8 mail relay.
---
## Table of contents
1. [Architecture](#architecture)
2. [File layout](#file-layout)
3. [Runtime paths](#runtime-paths)
4. [Requirements](#requirements)
5. [Installation](#installation)
6. [Configuration](#configuration)
7. [Alertmanager integration](#alertmanager-integration)
8. [Outcome classification](#outcome-classification)
9. [Redis key structure](#redis-key-structure)
10. [Service management](#service-management)
11. [Troubleshooting](#troubleshooting)
12. [Uninstallation](#uninstallation)
13. [License](#license)
---
## Architecture
```
Alertmanager ──POST /alert──► receiver.py
                                  │  (wait N seconds for all modules
                                  │   to finish writing their status)
                                  ▼
                              correlator.py
                                  │  (reads Redis KEYS/HGETALL,
                                  │   classifies outcome:
                                  │   SUCCESS / PARTIAL / REPO_FAILURE)
                                  ▼
                              repo_check.py   ← skipped on SUCCESS
                                  │  (restic snapshots --last --no-cache
                                  │   on each configured repository)
                                  ▼
                              notifier.py
                                     (builds plain-text email,
                                      dispatches via ns8-sendmail)
```
**Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. It is always ready to receive alerts whether
the backup is triggered manually from the UI or by an automatic scheduled timer.
> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom) — both are checked.
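
The receiving step described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual `receiver.py`: `matching_alerts`, `run_pipeline`, and `AlertHandler` are hypothetical names, and the pipeline body is a placeholder.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Alert names listed in this README (NS8 native rules plus custom/legacy rules).
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def matching_alerts(payload):
    """Return the firing alerts whose alertname is one of ALERT_NAMES."""
    if not isinstance(payload, dict):
        return []
    return [
        alert
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname") in ALERT_NAMES
    ]


def run_pipeline(alert):
    """Hypothetical placeholder for the correlate / repo check / notify steps."""
    print("pipeline started for", alert["labels"]["alertname"])


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alert":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        # Answer Alertmanager right away; the slow work runs on background threads.
        for alert in matching_alerts(payload):
            threading.Thread(target=run_pipeline, args=(alert,), daemon=True).start()
        self.send_response(200)
        self.end_headers()


# To serve (blocks forever):
#   HTTPServer(("127.0.0.1", 9099), AlertHandler).serve_forever()
```

Handing each alert to a thread keeps the HTTP response fast even though the pipeline waits before reading Redis, which avoids webhook timeouts on the Alertmanager side.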
---
## File layout
```
ns8-backup-monitor/
├── README.md                       ← this file
├── config/
│   └── config.yml.example          ← annotated configuration template
│                                     (copy to /etc/ns8-backup-monitor/config.yml)
├── deploy/
│   ├── install.sh                  ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service  ← systemd unit file
└── ns8_backup_monitor/             ← Python package (the service code)
    ├── __init__.py                 ← package metadata and version string
    ├── __main__.py                 ← entry point: argument parsing, logging
    │                                 initialisation, calls receiver.run_server()
    ├── receiver.py                 ← HTTP webhook server (POST /alert)
    │                                 matches alert names, spawns pipeline thread
    ├── correlator.py               ← reads NS8 Redis, classifies backup outcome
    ├── repo_check.py               ← probes restic repositories for health status
    ├── notifier.py                 ← builds and sends the status email
    └── utils.py                    ← load_config() and setup_logging() helpers
```
---
## Runtime paths
Paths created by `deploy/install.sh` and assumed by the default configuration.
| Purpose | Default path |
|---------|-------------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
---
## Requirements
| Dependency | Provided by | Notes |
|------------|-------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |
> **This service must run on an NS8 leader node** (or any node with read access
> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
---
## Installation
### One-liner (recommended)
```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```
The installer will:
1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service.
### Manual installation
```bash
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor
pip3 install pyyaml
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml # edit before starting
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
```
---
## Configuration
Full reference is in `config/config.yml.example`. Key parameters:
```yaml
receiver:
  host: 127.0.0.1       # bind address (use 0.0.0.0 only for remote Alertmanager)
  port: 9099            # webhook listening port
correlator:
  wait_seconds: 30      # wait after alert before reading Redis (allow slow modules to finish)
  recent_window: 3600   # fallback scan window (seconds) when no backup_id label is present
repo_check:
  enabled: true         # set false to skip restic health checks entirely
  timeout: 60           # per-repository restic timeout (seconds)
notification:
  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com
redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock
logging:
  level: INFO           # DEBUG for verbose output during troubleshooting
  file: /var/log/ns8-backup-monitor.log
```
---
## Alertmanager integration
Add a receiver and route to your Alertmanager configuration:
```yaml
# alertmanager.yml
receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: http://127.0.0.1:9099/alert
        send_resolved: false   # resolved alerts are intentionally ignored
route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
```
### Supported alert names
| Alert name | Rule set | Trigger |
|------------|----------|---------|
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
### Label mapping
| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within `correlator.recent_window`.
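
The label lookup and the time-window fallback could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, checking `id` before `backup_id` is an arbitrary choice (the README does not state a precedence), and timestamps are assumed to be ISO 8601 strings as listed under Redis key structure.

```python
from datetime import datetime, timezone


def plan_id_from_labels(labels):
    """Return the plan identifier from `id` or `backup_id`, or None if absent."""
    for key in ("id", "backup_id"):  # assumed order: native label first
        value = labels.get(key)
        if value:
            return str(value)
    return None


def is_recent(timestamp, window_seconds, now=None):
    """True if an ISO 8601 timestamp lies within the last `window_seconds`."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalise it.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return 0 <= (now - parsed).total_seconds() <= window_seconds
```

When `plan_id_from_labels` returns `None`, a caller would keep only the plan status hashes for which `is_recent(status["timestamp"], recent_window)` holds.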
---
## Outcome classification
| Outcome | Condition | Email subject |
|---------|-----------|---------------|
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). When `repo_check.enabled` is `true`, the repository health check
(`repo_check.py`) runs automatically for `PARTIAL` and `REPO_FAILURE` outcomes
to provide additional diagnostics; it is skipped on `SUCCESS`.
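
The conditions in the table translate directly into a small function. The sketch below is illustrative only; `classify` is a hypothetical name and may not match what `correlator.py` actually defines.

```python
def classify(total, failed):
    """Map module counts to an outcome, following the table above."""
    if total == 0 or failed == total:
        return "REPO_FAILURE"  # nothing reported, or every module failed
    if failed == 0:
        return "SUCCESS"
    return "PARTIAL"           # 0 < failed < total
```

Testing `total == 0` first matters: with no modules, `failed == 0` is also true, and the table assigns that case to `REPO_FAILURE` rather than `SUCCESS`.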
---
## Redis key structure
Keys read by `correlator.py` and `repo_check.py`:
```
cluster/backup/<backup_id>/status
  Hash fields:
    result                 "success" | "error"
    timestamp              ISO 8601 (UTC)
    errors                 integer count of failed modules

module/<module_id>/backup/<backup_id>/status
  Hash fields:
    result                 "success" | "error"
    timestamp              ISO 8601 (UTC)
    error                  human-readable error message (empty on success)

cluster/backup_repository/<repo_id>/parameters
  Hash fields:
    url                    cloud backend URL (S3, B2, rclone, ...)
    path                   local or SFTP path
    password               restic repository password
    backend                "s3" | "b2" | "sftp" | "rclone" | "local"
    aws_access_key_id      S3 key ID (also used as b2_account_id for B2)
    aws_secret_access_key  S3 secret (also used as b2_account_key for B2)
    rclone_config          path to rclone configuration file
```
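Because the service talks to Redis through `redis-cli` rather than a Python client, reading one of these hashes amounts to running `HGETALL` and pairing up the output lines. The following is a rough sketch, not the project's code: `parse_hgetall` and `read_hash` are hypothetical helpers, and it assumes that the socket is readable without authentication and that no field value contains a newline.

```python
import subprocess

REDIS_SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def parse_hgetall(raw):
    """Turn redis-cli's non-interactive HGETALL output into a dict.

    When stdout is not a terminal, redis-cli prints each field name and each
    value on its own line, so even-indexed lines are fields and odd-indexed
    lines are the matching values.
    """
    lines = raw.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key, socket=REDIS_SOCKET, timeout=10):
    """Run `redis-cli -s <socket> HGETALL <key>` and return the parsed hash."""
    result = subprocess.run(
        ["redis-cli", "-s", socket, "HGETALL", key],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_hgetall(result.stdout)
```

A missing key yields an empty dict, which a caller can treat as "no status found".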
---
## Service management
```bash
# Status
systemctl status ns8-backup-monitor
# Restart after config change
systemctl restart ns8-backup-monitor
# Live logs
journalctl -u ns8-backup-monitor -f
# Follow live logs with DEBUG lines filtered out (this does not change the log level)
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
-H "Content-Type: application/json" \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
```
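For scripted checks, the same test alert can be sent from Python without extra dependencies. This is a sketch that mirrors the `curl` command above; `build_test_request` and `send` are hypothetical helper names, and the URL assumes the default bind address and port.

```python
import json
import urllib.request


def build_test_request(url="http://127.0.0.1:9099/alert",
                       alertname="backup_failed", plan_id="1"):
    """Build a POST request carrying a single firing test alert."""
    payload = {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": alertname, "id": plan_id, "name": "Daily"},
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send(build_test_request())` on the node running the service should trigger the pipeline in the same way as the `curl` example.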
---
## Troubleshooting
### Pipeline does not trigger on automatic backups
**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
when the scheduled timer fires.
**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.
**Fix:** update to the current version; both alert name sets and both labels
are now handled automatically.
**Verify:** set `logging.level` to `DEBUG`, restart the service, run
`journalctl -u ns8-backup-monitor -f`, and trigger a test alert with
`amtool alert add alertname=backup_failed id=1`. You should see the
`Received alert:` DEBUG line followed by the pipeline start.
### No email received
1. Check the service is running: `systemctl status ns8-backup-monitor`.
2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
`echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase log level to `DEBUG` in `config.yml` and restart the service.
### REPO_FAILURE with no modules found
**Cause:** the correlator ran before backup modules finished writing their
status to Redis.
**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60 to 120
seconds is safe for most clusters. Restart the service after changing the value.
---
## Uninstallation
```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```
---
## License
MIT — see [LICENSE](LICENSE).