Browse Library
BASIC RESOURCE MONITORING
Prometheus self-monitoring
28 RULES: Prometheus job missing, Prometheus target missing, Prometheus all targets missing, Prometheus target missing with warmup time, Prometheus configuration reload failure
Host and hardware
35 RULES: Host out of memory, Host memory under memory pressure, Host Memory is underutilized, Host unusual network throughput in, Host unusual network throughput out
S.M.A.R.T Device Monitoring
8 RULES: SMART device temperature warning, SMART device temperature critical, SMART device temperature over trip value, SMART device temperature nearing trip value, SMART status
Docker containers
9 RULES: Container killed, Container absent, Container High CPU utilization, Container High Memory usage, Container Volume usage
Blackbox
9 RULES: Blackbox probe failed, Blackbox configuration reload failure, Blackbox slow probe, Blackbox probe HTTP failure, Blackbox SSL certificate will expire soon
Windows Server
5 RULES: Windows Server collector Error, Windows Server service Status, Windows Server CPU Usage, Windows Server memory Usage, Windows Server disk Space Usage
VMware
4 RULES: Virtual Machine Memory Warning, Virtual Machine Memory Critical, High Number of Snapshots, Outdated Snapshots
Netdata
9 RULES: Netdata high cpu usage, Host CPU steal noisy neighbor, Netdata high memory usage, Netdata low disk space, Netdata predicted disk full
DATABASES AND BROKERS
MySQL
14 RULES: MySQL down, MySQL too many connections (> 80%), MySQL high prepared statements utilization (> 80%), MySQL high threads running, MySQL Slave IO thread not running
PostgreSQL
22 RULES: Postgresql down, Postgresql restarted, Postgresql exporter error, Postgresql table not auto vacuumed, Postgresql table not auto analyzed
SQL Server
2 RULES: SQL Server down, SQL Server deadlock
Patroni
1 RULE: Patroni has no Leader
PGBouncer
3 RULES: PGBouncer active connections, PGBouncer errors, PGBouncer max connections
Redis
12 RULES: Redis down, Redis missing master, Redis too many masters, Redis disconnected slaves, Redis replication broken
MongoDB
18 RULES: MongoDB Down, Mongodb replica member unhealthy, MongoDB replication lag, MongoDB replication headroom, MongoDB number cursors open
RabbitMQ
21 RULES: RabbitMQ node down, RabbitMQ node not distributed, RabbitMQ instances different versions, RabbitMQ memory high, RabbitMQ file descriptors usage
Elasticsearch
19 RULES: Elasticsearch Heap Usage Too High, Elasticsearch Heap Usage warning, Elasticsearch disk out of space, Elasticsearch disk space low, Elasticsearch Cluster Red
Meilisearch
2 RULES: Meilisearch index is empty, Meilisearch http response time
Cassandra
30 RULES: Cassandra Node is unavailable, Cassandra many compaction tasks are pending, Cassandra commitlog pending tasks, Cassandra compaction executor blocked tasks, Cassandra flush writer blocked tasks
Clickhouse
20 RULES: ClickHouse node down, ClickHouse Memory Usage Critical, ClickHouse Memory Usage Warning, ClickHouse Disk Space Low on Default, ClickHouse Disk Space Critical on Default
CouchDB
18 RULES: CouchDB node down, CouchDB atom memory usage critical, CouchDB open databases critical, CouchDB open OS files critical, CouchDB 5xx error ratio high
Zookeeper
4 RULES: No description
Kafka
4 RULES: Kafka topics replicas, Kafka consumers group
Pulsar
10 RULES: Pulsar subscription high number of backlog entries, Pulsar subscription very high number of backlog entries, Pulsar topic large backlog storage size, Pulsar topic very large backlog storage size, Pulsar high write latency
Nats
19 RULES: Nats high connection count, Nats high subscriptions count, Nats high routes count, Nats high memory usage, Nats slow consumers
Solr
4 RULES: Solr update errors, Solr query errors, Solr replication errors, Solr low live node count
Hadoop
10 RULES: Hadoop Name Node Down, Hadoop Resource Manager Down, Hadoop Data Node Out Of Service, Hadoop HDFS Disk Space Low, Hadoop Map Reduce Task Failures
REVERSE PROXIES AND LOAD BALANCERS
Nginx
3 RULES: Nginx high HTTP 4xx error rate, Nginx high HTTP 5xx error rate, Nginx latency high
Apache
3 RULES: Apache down, Apache workers load, Apache restart
HaProxy
30 RULES: HAProxy high HTTP 4xx error rate backend, HAProxy high HTTP 5xx error rate backend, HAProxy high HTTP 4xx error rate server, HAProxy high HTTP 5xx error rate server, HAProxy server response errors
Traefik
6 RULES: Traefik service down, Traefik high HTTP 4xx error rate service, Traefik high HTTP 5xx error rate service
Caddy
3 RULES: Caddy Reverse Proxy Down, Caddy high HTTP 4xx error rate service, Caddy high HTTP 5xx error rate service
RUNTIMES
PHP-FPM
1 RULE: PHP-FPM max-children reached
JVM
1 RULE: JVM memory filling up
Sidekiq
2 RULES: Sidekiq queue size, Sidekiq scheduling latency too high
ORCHESTRATORS
Kubernetes
36 RULES: Kubernetes Node not ready, Kubernetes Node scheduling disabled, Kubernetes Node memory pressure, Kubernetes Node disk pressure, Kubernetes Node network unavailable
Nomad
4 RULES: Nomad job failed, Nomad job lost, Nomad job queued, Nomad blocked evaluation
Consul
3 RULES: Consul service healthcheck failed, Consul missing master node, Consul agent unhealthy
Etcd
13 RULES: Etcd insufficient Members, Etcd no Leader, Etcd high number of leader changes, Etcd high number of failed GRPC requests, Etcd high number of failed GRPC requests
Linkerd
1 RULE: Linkerd high error rate
Istio
10 RULES: Istio Kubernetes gateway availability drop, Istio Pilot high total request rate, Istio Mixer Prometheus dispatches low, Istio high total request rate, Istio low total request rate
ArgoCD
2 RULES: ArgoCD service not synced, ArgoCD service unhealthy
FluxCD
4 RULES: Flux Kustomization Failure, Flux HelmRelease Failure, Flux Source Issue, Flux Image Issue
NETWORK, SECURITY AND STORAGE
Ceph
13 RULES: Ceph State, Ceph monitor clock skew, Ceph monitor low space, Ceph OSD Down, Ceph high OSD latency
SpeedTest
2 RULES: SpeedTest Slow Internet Download, SpeedTest Slow Internet Upload
ZFS
4 RULES: ZFS offline pool
OpenEBS
1 RULE: OpenEBS used pool capacity
Minio
3 RULES: Minio cluster disk offline, Minio node disk offline, Minio disk space usage
SSL/TLS
4 RULES: SSL certificate probe failed, SSL certificate OCSP status unknown, SSL certificate revoked, SSL certificate expiry (< 7 days)
Juniper
3 RULES: Juniper switch down, Juniper high Bandwidth Usage 1GiB, Juniper high Bandwidth Usage 1GiB
CoreDNS
1 RULE: CoreDNS Panic Count
Freeswitch
3 RULES: Freeswitch down, Freeswitch Sessions Warning, Freeswitch Sessions Critical
Hashicorp Vault
4 RULES: Vault sealed, Vault too many pending tokens, Vault too many infinity tokens, Vault cluster health
Cloudflare
2 RULES: Cloudflare http 4xx error rate, Cloudflare http 5xx error rate
OTHER
Thanos
45 RULES: Thanos Compactor Multiple Running, Thanos Compactor Halted, Thanos Compactor High Compaction Failures, Thanos Compact Bucket High Operation Failures, Thanos Compact Has Not Run
Loki
4 RULES: Loki process too many restarts, Loki request errors, Loki request panic, Loki request latency
Promtail
2 RULES: Promtail request errors, Promtail request latency
Cortex
6 RULES: Cortex ruler configuration reload failure, Cortex not connected to Alertmanager, Cortex notification are being dropped, Cortex notification error, Cortex ingester unhealthy
Grafana Alloy
1 RULE: Grafana Alloy service down
OpenTelemetry Collector
12 RULES: OpenTelemetry Collector down, OpenTelemetry Collector receiver refused spans, OpenTelemetry Collector receiver refused metric points, OpenTelemetry Collector receiver refused log records, OpenTelemetry Collector exporter failed spans
Jenkins
8 RULES: Jenkins node offline, Jenkins no node online, Jenkins healthcheck, Jenkins outdated plugins, Jenkins builds health score
APC UPS
6 RULES: APC UPS Battery nearly empty, APC UPS Less than 15 Minutes of battery time remaining, APC UPS AC input outage, APC UPS low battery voltage, APC UPS high temperature
Graph Node
6 RULES: Provider failed because net_version failed, Provider failed because get genesis failed, Provider failed because net_version timeout, Provider failed because get genesis timeout, Store connection is too slow