# ns8-backup-monitor
> **NethServer 8 backup failure notification service.**
>
> Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a
> structured plain-text email through the NS8 mail relay.
---
## Table of contents
1. [Architecture](#architecture)
2. [File layout](#file-layout)
3. [Runtime paths](#runtime-paths)
4. [Requirements](#requirements)
5. [Installation](#installation)
6. [Configuration](#configuration)
7. [Alertmanager integration](#alertmanager-integration)
8. [Outcome classification](#outcome-classification)
9. [Redis key structure](#redis-key-structure)
10. [Service management](#service-management)
11. [Troubleshooting](#troubleshooting)
12. [Uninstallation](#uninstallation)
13. [License](#license)
---
## Architecture
```
Alertmanager ──POST /alert──► receiver.py
                                  │  (wait N seconds for all modules
                                  │   to finish writing their status)
                                  ▼
                              correlator.py
                                  │  (reads Redis KEYS/HGETALL,
                                  │   classifies outcome:
                                  │   SUCCESS / PARTIAL / REPO_FAILURE)
                                  ▼
                              repo_check.py   ← skipped on SUCCESS
                                  │  (restic snapshots --last --no-cache
                                  │   on each configured repository)
                                  ▼
                              notifier.py
                                     (builds plain-text email,
                                      dispatches via ns8-sendmail)
```
**Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. It is always ready to receive alerts whether
the backup is triggered manually from the UI or by an automatic scheduled timer.
> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom) — both are checked.
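
The name matching described above can be sketched as follows. This is a minimal illustration, not the code in `receiver.py`: the function name `firing_backup_alerts` is hypothetical, and the payload shape assumed is the standard Alertmanager webhook body (an `alerts` list whose items carry `status` and `labels`).

```python
import json

# The four alert names listed in this README (native and custom/legacy).
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def firing_backup_alerts(body: bytes) -> list:
    """Return the firing alerts in a webhook body whose alertname is in ALERT_NAMES.

    Hypothetical helper: the real receiver may structure this differently.
    """
    payload = json.loads(body)
    matched = []
    for alert in payload.get("alerts", []):
        # Resolved alerts are ignored, mirroring send_resolved: false.
        if alert.get("status") != "firing":
            continue
        if alert.get("labels", {}).get("alertname") in ALERT_NAMES:
            matched.append(alert)
    return matched
```

Keeping the names in a single set means a new rule name only needs to be added in one place.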
---
## File layout
```
ns8-backup-monitor/
├── README.md                          ← this file
├── config/
│   └── config.yml.example             ← annotated configuration template
│                                        (copy to /etc/ns8-backup-monitor/config.yml)
├── deploy/
│   ├── install.sh                     ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
└── ns8_backup_monitor/                ← Python package (the service code)
    ├── __init__.py                    ← package metadata and version string
    ├── __main__.py                    ← entry point: argument parsing, logging
    │                                    initialisation, calls receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
    │                                    matches alert names, spawns pipeline thread
    ├── correlator.py                  ← reads NS8 Redis, classifies backup outcome
    ├── repo_check.py                  ← probes restic repositories for health status
    ├── notifier.py                    ← builds and sends the status email
    └── utils.py                       ← load_config() and setup_logging() helpers
```
---
## Runtime paths
Paths created by `deploy/install.sh` and assumed by the default configuration.
| Purpose | Default path |
|---------|-------------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
---
## Requirements
| Dependency | Provided by | Notes |
|------------|-------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |
> **This service must run on an NS8 leader node** (or any node with read access
> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
---
## Installation
### One-liner (recommended)
```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```
The installer will:
1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service.
### Manual installation
```bash
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor
pip3 install pyyaml
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml # edit before starting
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
```
---
## Configuration
Full reference is in `config/config.yml.example`. Key parameters:
```yaml
receiver:
  host: 127.0.0.1        # bind address (use 0.0.0.0 only for remote Alertmanager)
  port: 9099             # webhook listening port
correlator:
  wait_seconds: 30       # wait after alert before reading Redis (allow slow modules to finish)
  recent_window: 3600    # fallback scan window (seconds) when no backup_id label is present
repo_check:
  enabled: true          # set false to skip restic health checks entirely
  timeout: 60            # per-repository restic timeout (seconds)
notification:
  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com
redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock
logging:
  level: INFO            # DEBUG for verbose output during troubleshooting
  file: /var/log/ns8-backup-monitor.log
```
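
One plausible way to consume this file is to overlay it on a set of defaults, so that a partial `config.yml` still yields a complete configuration. The sketch below is an assumption about how such loading could work, not the actual `utils.load_config()` implementation. The `DEFAULTS` values are simply copied from the example above, and `merge` is a hypothetical helper.

```python
import copy

# Illustrative defaults copied from the example configuration above;
# the values the service really falls back to may differ.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "repo_check": {"enabled": True, "timeout": 60},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on a deep copy of `defaults`."""
    result = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def load_config(path: str = "/etc/ns8-backup-monitor/config.yml") -> dict:
    """Read the YAML file at `path` and fill in missing keys from DEFAULTS."""
    import yaml  # PyYAML, the only non-stdlib dependency listed in Requirements

    with open(path, encoding="utf-8") as handle:
        # safe_load returns None for an empty file; merge() treats that as {}.
        return merge(DEFAULTS, yaml.safe_load(handle))
```

With this approach a file that sets only `correlator.wait_seconds` leaves `recent_window` and every other section at its default.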
---
## Alertmanager integration
Add a receiver and route to your Alertmanager configuration:
```yaml
# alertmanager.yml
receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: http://127.0.0.1:9099/alert
        send_resolved: false   # resolved alerts are intentionally ignored
route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
```
### Supported alert names
| Alert name | Rule set | Trigger |
|------------|----------|---------|
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
### Label mapping
| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within `correlator.recent_window`.
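
The label lookup can be sketched as below. The function name `plan_id_from_labels` is hypothetical, and preferring `id` over `backup_id` when both are present is an assumption; the README only states that both labels are checked.

```python
from typing import Optional


def plan_id_from_labels(labels: dict) -> Optional[str]:
    """Return the backup plan identifier from an alert's labels, or None.

    Checks `id` (NS8 native alerts) first, then `backup_id` (custom / legacy).
    The order of preference is an assumption made for this sketch.
    """
    for key in ("id", "backup_id"):
        value = labels.get(key)
        if value:
            return str(value)
    # None signals the caller to fall back to the recent_window scan.
    return None
```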
---
## Outcome classification
| Outcome | Condition | Email subject |
|---------|-----------|---------------|
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). The repository health check (`repo_check.py`) runs automatically for
`PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
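
The table above translates directly into a small decision function. The sketch below is illustrative: `classify` and `build_subject` are hypothetical names, and the `reason` text for `REPO_FAILURE` is left to the caller because its wording is not specified here.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to an outcome, following the classification table."""
    if total > 0 and failed == 0:
        return "SUCCESS"
    if 0 < failed < total:
        return "PARTIAL"
    # failed == total, or no module status found at all (total == 0)
    return "REPO_FAILURE"


def build_subject(outcome: str, total: int, failed: int,
                  reason: str = "", prefix: str = "[ns8-backup]") -> str:
    """Format the email subject for an outcome, using the patterns in the table."""
    if outcome == "SUCCESS":
        detail = f"all {total} module(s) backed up successfully"
    elif outcome == "PARTIAL":
        detail = f"{failed}/{total} module(s) failed"
    else:
        detail = reason
    return f"{prefix} {outcome} - {detail}"
```

Testing `SUCCESS` first and falling through to `REPO_FAILURE` keeps the `total == 0` case from being mistaken for a clean run.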
---
## Redis key structure
Keys read by `correlator.py` and `repo_check.py`:
```
cluster/backup/<backup_id>/status
  Hash fields:
    result                  "success" | "error"
    timestamp               ISO 8601 (UTC)
    errors                  integer count of failed modules

module/<module_id>/backup/<backup_id>/status
  Hash fields:
    result                  "success" | "error"
    timestamp               ISO 8601 (UTC)
    error                   human-readable error message (empty on success)

cluster/backup_repository/<repo_id>/parameters
  Hash fields:
    url                     cloud backend URL (S3, B2, rclone, ...)
    path                    local or SFTP path
    password                restic repository password
    backend                 "s3" | "b2" | "sftp" | "rclone" | "local"
    aws_access_key_id       S3 key ID (also used as b2_account_id for B2)
    aws_secret_access_key   S3 secret (also used as b2_account_key for B2)
    rclone_config           path to rclone configuration file
```
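
Because Redis is reached through `redis-cli` rather than a Python client, each hash has to be rebuilt from the command's text output. The sketch below shows one way to do that. It assumes `redis-cli` prints alternating field and value lines when its output is piped (its raw, non-TTY format), that no value contains a newline, and that the socket needs no extra authentication. `parse_hgetall` and `read_hash` are hypothetical names.

```python
import subprocess


def parse_hgetall(raw: str) -> dict:
    """Turn alternating field/value lines from `redis-cli HGETALL` into a dict."""
    lines = raw.splitlines()
    # Even positions are field names, odd positions are their values.
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key: str,
              socket_path: str = "/var/lib/nethserver/cluster/state/redis.sock") -> dict:
    """Fetch one hash via redis-cli over the cluster Unix socket."""
    completed = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_hgetall(completed.stdout)
```

For example, `read_hash("cluster/backup/1/status")` would return the `result`, `timestamp`, and `errors` fields as strings. An empty `error` field appears as an empty line in the output and is preserved as an empty string.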
---
## Service management
```bash
# Status
systemctl status ns8-backup-monitor
# Restart after config change
systemctl restart ns8-backup-monitor
# Live logs
journalctl -u ns8-backup-monitor -f
# Follow live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
-H "Content-Type: application/json" \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
```
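
Where `curl` is not available, the same test alert can be posted from Python using only the standard library. This is a sketch under the same assumptions as the `curl` command above (default bind address and port, plan `id` of `1`). The function names are hypothetical.

```python
import json
import urllib.request


def build_test_request(url: str = "http://127.0.0.1:9099/alert",
                       plan_id: str = "1") -> urllib.request.Request:
    """Build a POST request carrying a single firing backup_failed alert."""
    payload = {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "backup_failed", "id": plan_id, "name": "Daily"},
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_alert(url: str = "http://127.0.0.1:9099/alert") -> int:
    """Send the test alert and return the HTTP status code of the response."""
    with urllib.request.urlopen(build_test_request(url), timeout=10) as response:
        return response.status
```

Calling `send_test_alert()` on the node running the service should start the pipeline after `correlator.wait_seconds` has elapsed.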
---
## Troubleshooting
### Pipeline does not trigger on automatic backups
**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
when the scheduled timer fires.
**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.
**Fix:** update to the current version; both alert name sets and both labels
are now handled automatically.
**Verify:** set `logging.level` to `DEBUG`, run `journalctl -u ns8-backup-monitor -f`,
and trigger a test alert with `amtool alert add alertname=backup_failed id=1`.
You should see the `Received alert:` DEBUG line followed by the pipeline start.
### No email received
1. Check the service is running: `systemctl status ns8-backup-monitor`.
2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
`echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase log level to `DEBUG` in `config.yml` and restart the service.
### REPO_FAILURE with no modules found
**Cause:** the correlator ran before backup modules finished writing their
status to Redis.
**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60 to 120
seconds is safe for most clusters. Restart the service after changing the value.
---
## Uninstallation
```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```
---
## License
MIT — see [LICENSE](LICENSE).