Browse Library

BASIC RESOURCE MONITORING

Prometheus self-monitoring

28 rules: Prometheus job missing, Prometheus target missing, Prometheus all targets missing, Prometheus target missing with warmup time, Prometheus configuration reload failure

Host and hardware

35 rules: Host out of memory, Host memory under memory pressure, Host Memory is underutilized, Host unusual network throughput in, Host unusual network throughput out

S.M.A.R.T Device Monitoring

8 rules: SMART device temperature warning, SMART device temperature critical, SMART device temperature over trip value, SMART device temperature nearing trip value, SMART status

Docker containers

9 rules: Container killed, Container absent, Container High CPU utilization, Container High Memory usage, Container Volume usage

Blackbox

9 rules: Blackbox probe failed, Blackbox configuration reload failure, Blackbox slow probe, Blackbox probe HTTP failure, Blackbox SSL certificate will expire soon

Windows Server

5 rules: Windows Server collector Error, Windows Server service Status, Windows Server CPU Usage, Windows Server memory Usage, Windows Server disk Space Usage

VMware

4 rules: Virtual Machine Memory Warning, Virtual Machine Memory Critical, High Number of Snapshots, Outdated Snapshots

Netdata

9 rules: Netdata high cpu usage, Host CPU steal noisy neighbor, Netdata high memory usage, Netdata low disk space, Netdata predicted disk full

DATABASES AND BROKERS

MySQL

14 rules: MySQL down, MySQL too many connections (> 80%), MySQL high prepared statements utilization (> 80%), MySQL high threads running, MySQL Slave IO thread not running

PostgreSQL

22 rules: Postgresql down, Postgresql restarted, Postgresql exporter error, Postgresql table not auto vacuumed, Postgresql table not auto analyzed

SQL Server

2 rules: SQL Server down, SQL Server deadlock

Patroni

1 rule: Patroni has no Leader

PGBouncer

3 rules: PGBouncer active connections, PGBouncer errors, PGBouncer max connections

Redis

12 rules: Redis down, Redis missing master, Redis too many masters, Redis disconnected slaves, Redis replication broken

MongoDB

18 rules: MongoDB Down, Mongodb replica member unhealthy, MongoDB replication lag, MongoDB replication headroom, MongoDB number cursors open

RabbitMQ

21 rules: RabbitMQ node down, RabbitMQ node not distributed, RabbitMQ instances different versions, RabbitMQ memory high, RabbitMQ file descriptors usage

Elasticsearch

19 rules: Elasticsearch Heap Usage Too High, Elasticsearch Heap Usage warning, Elasticsearch disk out of space, Elasticsearch disk space low, Elasticsearch Cluster Red

Meilisearch

2 rules: Meilisearch index is empty, Meilisearch http response time

Cassandra

30 rules: Cassandra Node is unavailable, Cassandra many compaction tasks are pending, Cassandra commitlog pending tasks, Cassandra compaction executor blocked tasks, Cassandra flush writer blocked tasks

Clickhouse

20 rules: ClickHouse node down, ClickHouse Memory Usage Critical, ClickHouse Memory Usage Warning, ClickHouse Disk Space Low on Default, ClickHouse Disk Space Critical on Default

CouchDB

18 rules: CouchDB node down, CouchDB atom memory usage critical, CouchDB open databases critical, CouchDB open OS files critical, CouchDB 5xx error ratio high

Zookeeper

4 rules (no description)

Kafka

4 rules: Kafka topics replicas, Kafka consumers group

Pulsar

10 rules: Pulsar subscription high number of backlog entries, Pulsar subscription very high number of backlog entries, Pulsar topic large backlog storage size, Pulsar topic very large backlog storage size, Pulsar high write latency

Nats

19 rules: Nats high connection count, Nats high subscriptions count, Nats high routes count, Nats high memory usage, Nats slow consumers

Solr

4 rules: Solr update errors, Solr query errors, Solr replication errors, Solr low live node count

Hadoop

10 rules: Hadoop Name Node Down, Hadoop Resource Manager Down, Hadoop Data Node Out Of Service, Hadoop HDFS Disk Space Low, Hadoop Map Reduce Task Failures

REVERSE PROXIES AND LOAD BALANCERS

Nginx

3 rules: Nginx high HTTP 4xx error rate, Nginx high HTTP 5xx error rate, Nginx latency high

Apache

3 rules: Apache down, Apache workers load, Apache restart

HaProxy

30 rules: HAProxy high HTTP 4xx error rate backend, HAProxy high HTTP 5xx error rate backend, HAProxy high HTTP 4xx error rate server, HAProxy high HTTP 5xx error rate server, HAProxy server response errors

Traefik

6 rules: Traefik service down, Traefik high HTTP 4xx error rate service, Traefik high HTTP 5xx error rate service

Caddy

3 rules: Caddy Reverse Proxy Down, Caddy high HTTP 4xx error rate service, Caddy high HTTP 5xx error rate service

RUNTIMES

PHP-FPM

1 rule: PHP-FPM max-children reached

JVM

1 rule: JVM memory filling up

Sidekiq

2 rules: Sidekiq queue size, Sidekiq scheduling latency too high

ORCHESTRATORS

Kubernetes

36 rules: Kubernetes Node not ready, Kubernetes Node scheduling disabled, Kubernetes Node memory pressure, Kubernetes Node disk pressure, Kubernetes Node network unavailable

Nomad

4 rules: Nomad job failed, Nomad job lost, Nomad job queued, Nomad blocked evaluation

Consul

3 rules: Consul service healthcheck failed, Consul missing master node, Consul agent unhealthy

Etcd

13 rules: Etcd insufficient Members, Etcd no Leader, Etcd high number of leader changes, Etcd high number of failed GRPC requests, Etcd high number of failed GRPC requests

Linkerd

1 rule: Linkerd high error rate

Istio

10 rules: Istio Kubernetes gateway availability drop, Istio Pilot high total request rate, Istio Mixer Prometheus dispatches low, Istio high total request rate, Istio low total request rate

ArgoCD

2 rules: ArgoCD service not synced, ArgoCD service unhealthy

FluxCD

4 rules: Flux Kustomization Failure, Flux HelmRelease Failure, Flux Source Issue, Flux Image Issue

NETWORK, SECURITY AND STORAGE

Ceph

13 rules: Ceph State, Ceph monitor clock skew, Ceph monitor low space, Ceph OSD Down, Ceph high OSD latency

SpeedTest

2 rules: SpeedTest Slow Internet Download, SpeedTest Slow Internet Upload

ZFS

4 rules: ZFS offline pool

OpenEBS

1 rule: OpenEBS used pool capacity

Minio

3 rules: Minio cluster disk offline, Minio node disk offline, Minio disk space usage

SSL/TLS

4 rules: SSL certificate probe failed, SSL certificate OCSP status unknown, SSL certificate revoked, SSL certificate expiry (< 7 days)

Juniper

3 rules: Juniper switch down, Juniper high Bandwidth Usage 1GiB, Juniper high Bandwidth Usage 1GiB

CoreDNS

1 rule: CoreDNS Panic Count

Freeswitch

3 rules: Freeswitch down, Freeswitch Sessions Warning, Freeswitch Sessions Critical

Hashicorp Vault

4 rules: Vault sealed, Vault too many pending tokens, Vault too many infinity tokens, Vault cluster health

Cloudflare

2 rules: Cloudflare http 4xx error rate, Cloudflare http 5xx error rate

OTHER

Thanos

45 rules: Thanos Compactor Multiple Running, Thanos Compactor Halted, Thanos Compactor High Compaction Failures, Thanos Compact Bucket High Operation Failures, Thanos Compact Has Not Run

Loki

4 rules: Loki process too many restarts, Loki request errors, Loki request panic, Loki request latency

Promtail

2 rules: Promtail request errors, Promtail request latency

Cortex

6 rules: Cortex ruler configuration reload failure, Cortex not connected to Alertmanager, Cortex notification are being dropped, Cortex notification error, Cortex ingester unhealthy

Grafana Alloy

1 rule: Grafana Alloy service down

OpenTelemetry Collector

12 rules: OpenTelemetry Collector down, OpenTelemetry Collector receiver refused spans, OpenTelemetry Collector receiver refused metric points, OpenTelemetry Collector receiver refused log records, OpenTelemetry Collector exporter failed spans

Jenkins

8 rules: Jenkins node offline, Jenkins no node online, Jenkins healthcheck, Jenkins outdated plugins, Jenkins builds health score

APC UPS

6 rules: APC UPS Battery nearly empty, APC UPS Less than 15 Minutes of battery time remaining, APC UPS AC input outage, APC UPS low battery voltage, APC UPS high temperature

Graph Node

6 rules: Provider failed because net_version failed, Provider failed because get genesis failed, Provider failed because net_version timeout, Provider failed because get genesis timeout, Store connection is too slow