[
  {
    "path": ".github/FUNDING.yml",
    "content": "github: [samber]\nko_fi: samuelberthe\n"
  },
  {
    "path": ".github/dependabot.yml",
    "content": "---\nversion: 2\nupdates:\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"monthly\"\n"
  },
  {
    "path": ".github/workflows/dist.yml",
    "content": "name: Publish\n\non:\n  workflow_dispatch:\n  push:\n    branches:\n      - master\n\npermissions:\n  contents: write\n\njobs:\n  publish:\n    name: Publish\n    # Check if the PR is not from a fork\n    if: github.repository_owner == 'samber'\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repo\n        uses: actions/checkout@v6\n\n      - name: Set up Ruby\n        uses: ruby/setup-ruby@v1\n        with:\n          ruby-version: 3.4\n\n      - name: Set up yq\n        uses: mikefarah/yq@v4\n\n      - name: Install liquid\n        run: |\n         gem install liquid -v 5.5.1\n         gem install liquid-cli \n\n      - name: Build rule configuration\n        run: |\n          cat _data/rules.yml | yq -I 0 -o json > _data/rules.json\n\n          rm -rf dist/rules\n\n          for service in $(cat _data/rules.json | jq -r '.groups[].services[] | @base64'); do\n            subdir=dist/rules/$(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(\" \") | join(\"-\")')\n            mkdir -p \"${subdir}\"\n\n            # groupName=$(echo \"{% assign groupName = name | split: ' ' %}{% capture groupNameCamelcase %}{% for word in groupName %}{{ word | capitalize }} {% endfor %}{% endcapture %} {{ groupNameCamelcase | remove: ' ' | remove: '-' }}\" | liquid $(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(\" \") | join(\"-\")'))\n    \n            for exporter in $(echo ${service} | base64 --decode | jq -r '.exporters[] | @base64'); do\n              exporterName=$(echo ${exporter} | base64 --decode | jq -r '.slug')\n              cat dist/template.yml | liquid \"$(echo ${exporter} | base64 --decode)\" > ${subdir}/${exporterName}.yml\n              echo ${subdir}/${exporterName}.yml\n            done\n          done\n\n          rm _data/rules.json\n\n      # https://peterevans.dev/posts/github-actions-how-to-automate-code-formatting-in-pull-requests/\n      - name: Check for modified files\n  
      id: git-check\n        run: echo \"modified=$(git status -s --porcelain | wc -l | awk '{$1=$1};1')\" >> $GITHUB_OUTPUT\n      - name: Push changes\n        if: steps.git-check.outputs.modified != '0'\n        run: |\n          git config --global user.name 'samber'\n          git config --global user.email 'samber@users.noreply.github.com'\n          git remote set-url origin https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}\n          git add .\n          git commit -m \"Publish\"\n          git push\n"
  },
  {
    "path": ".github/workflows/test.yml",
    "content": "name: Promtool check\n\non:\n  pull_request:\n  push:\n    branches:\n      - master\n\njobs:\n  promtool-check:\n    name: Check alert rules syntax\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repo\n        uses: actions/checkout@v6\n\n      - name: Set up Ruby\n        uses: ruby/setup-ruby@v1\n        with:\n          ruby-version: 3.4\n\n      - name: Set up yq\n        uses: mikefarah/yq@v4\n\n      - name: Install liquid\n        run: gem install liquid-cli\n\n      - name: Build rule configuration\n        run: |\n          cat _data/rules.yml | yq -I 0 -o json > _data/rules.json\n\n          for service in $(cat _data/rules.json | jq -r '.groups[].services[] | @base64'); do\n            subdir=test/rules/$(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(\" \") | join(\"-\")')\n            mkdir -p \"${subdir}\"\n\n            # groupName=$(echo \"{% assign groupName = name | split: ' ' %}{% capture groupNameCamelcase %}{% for word in groupName %}{{ word | capitalize }} {% endfor %}{% endcapture %} {{ groupNameCamelcase | remove: ' ' | remove: '-' }}\" | liquid $(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(\" \") | join(\"-\")'))\n\n            for exporter in $(echo ${service} | base64 --decode | jq -r '.exporters[] | @base64'); do\n              exporterName=$(echo ${exporter} | base64 --decode | jq -r '.slug')\n              cat dist/template.yml | liquid \"$(echo ${exporter} | base64 --decode)\" > ${subdir}/${exporterName}.yml\n              echo ${subdir}/${exporterName}.yml\n            done\n          done\n\n          rm _data/rules.json\n\n      - name: Check Prometheus alert rules\n        uses: peimanja/promtool-github-actions@master\n        with:\n          promtool_actions_subcommand: 'rules'\n          promtool_actions_files: 'test/rules/*/*.yml'\n          promtool_actions_comment: true\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN 
}}\n"
  },
  {
    "path": ".gitignore",
    "content": "_site/\n.sass-cache/\n.jekyll-cache/\n.jekyll-metadata\n_data/rules.json\ntest/rules/\n/node_modules\n.worktrees/"
  },
  {
    "path": ".travis.yml",
    "content": "language: node_js\nnode_js:\n  - 'node'\n"
  },
  {
    "path": "CLAUDE.md",
    "content": "# CLAUDE.md\n\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\n\n## Project Overview\n\nA curated collection of ~940 Prometheus alerting rules covering 90+ services across 100+ exporters, organized in 7 categories: basic resource monitoring (Prometheus, host/hardware, SMART, Docker, Blackbox, Windows, VMware, Netdata), databases and brokers (MySQL, PostgreSQL, Redis, MongoDB, RabbitMQ, Elasticsearch, Cassandra, Clickhouse, Kafka, etc.), reverse proxies and load balancers (Nginx, Apache, HaProxy, Traefik, Caddy), runtimes (PHP-FPM, JVM, Sidekiq), orchestrators (Kubernetes, Nomad, Consul, Etcd, Istio, ArgoCD, FluxCD), network/security/storage (Ceph, ZFS, Minio, SSL/TLS, CoreDNS, Vault, Cloudflare), and observability tools (Thanos, Loki, Cortex, OpenTelemetry Collector, Jenkins).\n\nAll rules are stored in a single YAML data file (`_data/rules.yml`) and rendered as a Jekyll-based GitHub Pages site at https://samber.github.io/awesome-prometheus-alerts. The site provides copy-pasteable Prometheus alert snippets and downloadable rule files per exporter.\n\nThe project is community-driven. Most contributions are PRs adding or updating rules in `_data/rules.yml`. Files in `dist/rules/` are auto-generated on merge — never edit them manually.\n\n## Architecture\n\n- **`_data/rules.yml`** — The single source of truth for all alerting rules. This is the main file contributors edit. 
It is NOT a valid Prometheus config; the site renders each rule into copy-pasteable Prometheus alert format.\n- **`rules.md`** — Jekyll template that iterates over `_data/rules.yml` and renders the rules page with copy buttons and formatted YAML blocks.\n- **`alertmanager.md`** — Static page with Prometheus/Alertmanager configuration examples.\n- **`_layouts/default.html`** — Site layout (Jekyll theme: cayman).\n- **`_config.yml`** — Jekyll configuration.\n- **`dist/rules/`** — Pre-built downloadable rule files organized by service/exporter (referenced in the site for `wget` commands).\n\n## Rules YAML Structure\n\nServices are listed in README.md.\n\n`_data/rules.yml` hierarchy:\n```\ngroups:\n  - name: \"<category>\"        # e.g. \"Basic resource monitoring\"\n    services:\n      - name: \"<service>\"     # e.g. \"Host and hardware\"\n        exporters:\n          - name: \"<exporter>\"\n            slug: \"<slug>\"          # used for download URLs\n            doc_url: \"<url>\"        # optional link to exporter docs\n            comments:               # optional, exporter-level multiline notes rendered before rules\n              \"<comment>\"\n            rules:\n              - name: \"<alert name>\"\n                description: \"<text>\"\n                query: \"<PromQL>\"\n                severity: warning|critical|info\n                for: \"<duration>\"   # optional, defaults to 0m\n                comments:           # optional, rendered as multiline YAML comments\n                  \"<comment>\"\n```\n\nServices are grouped by category. 
If you are not sure about the classification, ask the developer.\n\n## Running Locally\n\n```bash\n# With Ruby/Bundler\ngem install bundler\nbundle install\njekyll serve\n\n# With Docker Compose\ndocker compose up -d\n\n# With Docker directly\ndocker run --rm -it -p 4000:4000 -v $(pwd):/srv/jekyll jekyll/jekyll jekyll serve\n```\n\nSite serves at http://localhost:4000/awesome-prometheus-alerts.\n\n## Contributing Rules\n\nAll rule changes go in `_data/rules.yml`. Each rule needs: `name`, `description`, `query` (valid PromQL), and `severity`. The `for` field is optional. Descriptions should be factual (\"what\") and include root cause hints (\"why\"). Queries must be tested against the latest exporter version. Never modify files in `dist/` — they are auto-generated on merge.\n\n## Query Validation\n\n- When adding or updating an alert, verify that the PromQL query references metric series that actually exist in the related exporter. Check the exporter's documentation or source code to confirm series names.\n- If a metric series has been deprecated or removed in a newer version of the exporter, update the query to use the replacement series, or remove the rule if no replacement exists. Known examples: `kube_hpa_*` renamed to `kube_horizontalpodautoscaler_*` in kube-state-metrics 2.x; `node_hwmon_temp_alarm` does not exist (correct: `node_hwmon_temp_crit_alarm_celsius`); node-exporter CLI flags get renamed across versions.\n- When writing or reviewing a query, search the internet (exporter docs, GitHub issues, changelogs) to validate correctness and catch outdated series names. When you are not sure about a metric name, always search the internet to confirm it exists and is spelled correctly before using it.\n- Pay special attention to metric naming conventions: many exporters add `_total` suffixes for counters and `_seconds_total` for time-based counters. Verify the exact name from source code, not just docs. 
Known examples: Spark's PrometheusResource adds `_total` and `_seconds_total` suffixes (e.g., `metrics_executor_failedTasks_total`, not `metrics_executor_failedTasks`); Oracle's `oracledb_sessions_value` not `oracledb_sessions_activity`.\n- Verify that label names used in `{{ $labels.xxx }}` template variables actually exist on the metric. Check the exporter source code for the exact label names. Known examples: cloudflare/ebpf_exporter uses `id` not `name` for programs, and `config` not `name` for decoder errors.\n- When a metric uses info-style patterns (value always 1, information carried in labels), `== 0` will never be true — the metric simply won't exist. Use `absent()` instead. Known example: `ebpf_exporter_enabled_configs`.\n- Some metrics are version-dependent. When a metric was renamed or removed in a newer version, add a comment noting the version requirement. Known examples: `go_memstats_gc_cpu_fraction` removed in client_golang v1.12+; cert-manager renamed `certmanager_http_acme_client_request_count` to `certmanager_acme_client_request_count` in v1.19+.\n- Verify the unit of a metric before setting thresholds. Some metrics use milliseconds while descriptions assume seconds. Known example: Keycloak's `keycloak_request_duration` is in milliseconds, so `> 2` means 2ms not 2s.\n- Some exporters expose labels that differ between services even within the same ecosystem. Known example: OpenStack Neutron uses `adminState=\"up\"` while Nova and Cinder use `adminState=\"enabled\"`.\n- When an official mixin exists for a service, compare thresholds and time windows against it. Known deviations to watch for: Mimir store-gateway sync uses 1800s (not 600s), Mimir compactor skipped blocks uses `[24h]` (not `[5m]`), Tempo normalizes outstanding blocks per worker.\n\n## Common Review Pitfalls (learned from PR history)\n\nThese are the most frequent issues raised during code review on this repo:\n\n### Severity levels\n- `critical` = requires immediate human attention. 
Do not use for informational/security notifications.\n- `warning` = needs attention soon but not urgent.\n- `info` = awareness only (e.g., config changes, underutilized resources).\n- Authentication failures, security notifications, and config-change detections are typically `info`, not `critical`.\n\n### `for` duration\n- Omit `for` when the default (0m) is intentional and appropriate — do not add `for: 0m` explicitly.\n- Add a `for` duration (e.g., `for: 2m` or `for: 5m`) to tolerate brief unavailability from restarts or transient spikes. Most \"service down\" rules should have at least `for: 1m`–`2m`.\n- Do not blanket-change all `for: 0m` to `for: 1m` — it depends on the alert's semantics and the range window used in `increase()`/`rate()`.\n\n### Query design\n- Prefer symptom-based alerts over cause-based alerts to reduce alert fatigue. Example: \"service is unreachable\" is better than \"specific internal counter changed\". Metrics like heap object count, allocation rate, or free heap slots are causes, not symptoms — prefer GC duration, latency, or error rate alerts instead.\n- Don't add unnecessary aggregation (`avg()`, `avg_over_time()`) on metrics that are local to a single node/instance. Only aggregate when the alert is cluster-wide.\n- Don't combine `min_over_time()[1m]` with `for: 2m` redundantly — pick one mechanism for smoothing. Same applies to `avg_over_time()[5m]` with `for: 5m`.\n- Remove unnecessary label filters (e.g., `job=\"cassandra\"` or `cluster=~\".*\"`) that add noise without value.\n- Verify comparison operators match the intent — e.g., \"high snapshot count\" must use `> N`, not `< N`.\n- When dividing counters (e.g., error rate = errors / total), guard against division by zero with `and total > 0` or filter appropriately. 
This is the most common issue in new PRs — check every ratio query.\n- Filter out system/template databases explicitly in DB queries (e.g., PostgreSQL: add `datid!=\"0\"` alongside `datname!~\"template.*|postgres\"`).\n- Never use `rate()` on a gauge metric — use `deriv()` instead. `rate()` is for monotonically increasing counters only.\n- When using `increase()` for ratio calculations, prefer `rate()` instead — `increase()` can produce incorrect results when counters reset mid-window.\n- When filtering gRPC error codes, don't use `grpc_code!=\"OK\"` — this includes normal application responses like `NotFound`, `AlreadyExists`, and `Cancelled`. Filter to actual errors: `grpc_code=~\"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown\"`.\n- When computing ratios with `rate()` on a metric that is itself already a normalized rate (e.g., Oracle's `v$waitclassmetric`), applying `rate()` computes the rate-of-change of a rate, which is not meaningful.\n- When a multi-label metric is used in a binary operation with a metric that has fewer labels, use `ignoring(extra_label)` to avoid join failures. Known example: `systemd_unit_tasks_current / ignoring(type) systemd_unit_tasks_max`.\n- When a query groups by labels (e.g., `by (le, worker)`), consider the cardinality impact — hundreds of label values means hundreds of independent alerts.\n- Ensure `{{ $value | humanizeDuration }}` is only used on values in seconds. If the metric is in milliseconds, divide by 1000 first or use `{{ $value | humanize }}ms`.\n- Avoid using `up{job=~\"exporter-name\"} == 0` or `absent(up{job=~\"exporter-name\"})` to detect whether a service is down. When targets are managed via service discovery or a job reaches multiple targets, a disappeared target causes the `up` series to become stale and vanish rather than drop to 0, so the alert never fires. 
Prefer application-level or cluster-level metrics instead (e.g., \"number of consul cluster members < 3\", \"PostgreSQL primary node absent\").\n\n### Thresholds\n- Alert thresholds are inherently arbitrary and depend on workload. Use `comments:` to note this when a threshold is a rough default.\n- When threshold values in a PR seem unreasonable (too high or too low), challenge them with real-world reasoning or exporter docs.\n- Watch for thresholds that are so high they only catch catastrophic scenarios and miss real problems. Examples: Go goroutine spike at 100/s (misses gradual leaks), Ruby major GC at 5/s (only fires if app is non-functional), Python gen2 GC at >1/s (extremely rare).\n- Watch for thresholds that will fire on normal healthy operation. Examples: Memcached at 90% memory is desired (it's a cache), Flink TaskManager at 90% JVM heap is normal, cache hit rate < 80% is common for cold caches.\n- For SNMP bandwidth utilization, `ifSpeed` (Gauge32) maxes at ~4.29 Gbps. For 10G+ interfaces, use `ifHighSpeed * 1000000` instead.\n- For alerts using `> 0` on counters with `rate()` or `increase()`, consider whether a single event truly warrants alerting. In most cases, a small threshold (e.g., `> 0.05` for rate, `> 3` for increase) better distinguishes real problems from transient noise.\n\n### Comments\n- When an alert or its query needs explanation (e.g., non-obvious PromQL logic, threshold rationale, edge cases), use the rule-level `comments:` field. 
Use multiline comments when needed.\n- Use the exporter-level `comments:` field for notes that apply to all rules under that exporter (e.g., exporter version requirements, known quirks, setup prerequisites).\n- Comments are rendered as YAML `#` comments in the output, so they are visible to users who copy-paste the rules.\n\n### Descriptions\n- Keep descriptions short, factual, and actionable.\n- Include what is happening (\"Disk is almost full\") and why it matters or what to check.\n- Use `{{ $labels.instance }}`, `{{ $value }}`, and other template variables in descriptions when useful.\n- If the description says \"average\" but the query uses `histogram_quantile(0.95, ...)`, fix the description to say \"p95\" (or vice versa).\n- When alerting on rates or ratios that may not be intuitive, include `{{ $value }}` in the description so operators can see the actual number.\n\n### Structure\n- Some services have multiple exporters (e.g., MongoDB has `percona/mongodb_exporter` and `dcu/mongodb_exporter`). 
Place rules under the correct exporter.\n- Search for duplicates before adding a new rule — a similar alert may already exist under a different exporter or with different thresholds.\n- The `slug` field must be unique per exporter and is used for download URLs.\n\n## Reference Sources for Cross-Checking Alerts\n\nUse these sources to critique and validate PromQL queries, compare thresholds, and find inspiration for new rules.\n\nEvery time you use an external resource to change a PromQL query, compare the query before and after and explain why you think the external source is right.\n\n### Official project mixins (alerts maintained by the project itself)\n- https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin/alerts\n- https://github.com/prometheus/prometheus/tree/main/documentation/prometheus-mixin\n- https://github.com/prometheus/alertmanager/tree/main/doc/alertmanager-mixin\n- https://github.com/prometheus/snmp_exporter/tree/main/snmp-mixin\n- https://github.com/prometheus/mysqld_exporter/tree/main/mysqld-mixin\n- https://github.com/prometheus-community/postgres_exporter/tree/master/postgres_mixin\n- https://github.com/prometheus-community/elasticsearch_exporter (mixin via Grafana docs)\n- https://github.com/etcd-io/etcd/tree/main/contrib/mixin\n- https://github.com/thanos-io/thanos/tree/main/mixin (also: examples/alerts/)\n- https://github.com/grafana/loki/tree/main/production/loki-mixin (also: promtail-mixin/)\n- https://github.com/grafana/mimir/tree/main/operations/mimir-mixin\n- https://github.com/grafana/tempo/tree/main/operations/tempo-mixin\n- https://github.com/grafana/grafana/tree/main/grafana-mixin\n- https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin (in-tree; also https://github.com/ceph/ceph-mixins)\n- https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin\n- https://github.com/kubernetes-monitoring/kubernetes-mixin (includes runbook.md)\n- 
https://github.com/kubernetes/kube-state-metrics/tree/main/jsonnet/kube-state-metrics-mixin\n- https://github.com/prometheus-operator/prometheus-operator/tree/main/jsonnet/mixin\n- https://github.com/prometheus-operator/kube-prometheus\n- https://github.com/cortexproject/cortex-jsonnet\n- https://github.com/gluster/gluster-mixins\n\n### Standalone mixin repositories\n- https://github.com/povilasv/coredns-mixin\n- https://github.com/adinhodovic/rabbitmq-mixin\n- https://github.com/adinhodovic/blackbox-exporter-mixin\n- https://github.com/adinhodovic/django-mixin\n- https://github.com/adinhodovic/argo-cd-mixin\n- https://github.com/adinhodovic/ingress-nginx-mixin\n- https://github.com/adinhodovic/kubernetes-autoscaling-mixin\n- https://github.com/metalmatze/kube-cockroachdb (CockroachDB on Kubernetes)\n- https://github.com/bitnami-labs/sealed-secrets (sealed-secrets mixin)\n- https://github.com/lukas-vlcek/elasticsearch-mixin (includes runbook.md)\n- https://github.com/adinhodovic/postgresql-mixin\n- https://github.com/imusmanmalik/cert-manager-mixin\n- https://gitlab.com/uneeq-oss/cert-manager-mixin (alternative cert-manager mixin)\n- https://github.com/uneeq-oss/spinnaker-mixin\n- https://github.com/metalmatze/slo-libsonnet (SLO alerting/recording rules generation library)\n\n### Grafana jsonnet-libs (93 mixins — browse for specific services)\n- https://github.com/grafana/jsonnet-libs\n- Notable mixins with alerts: consul, memcached, elasticsearch, haproxy, clickhouse, opensearch, redis, mongodb, kafka, nginx, rabbitmq, jvm, vault, envoy, istio, jenkins, caddy, cloudflare, docker, traefik, windows, snmp, argocd, nomad, pgbouncer, minio, ceph, and 60+ more.\n\n### Mixin aggregators\n- https://monitoring.mixins.dev/ (central registry of all monitoring mixins)\n- https://github.com/monitoring-mixins/website/blob/master/mixins.json (machine-readable list of all mixins with source URLs)\n- https://github.com/nlamirault/monitoring-mixins (hub aggregating many 
mixins)\n\n### GitLab monitoring & infrastructure\n- https://gitlab.com/gitlab-com/runbooks (GitLab.com SRE runbooks — production alert rules, runbook docs, alertmanager config)\n- https://gitlab.com/gitlab-com/runbooks/-/tree/master/mimir-rules (production Mimir alerting rules organized by tenant/environment)\n- https://gitlab.com/gitlab-com/runbooks/-/tree/master/mimir-rules-jsonnet (jsonnet sources for GitLab alerting rules)\n- https://gitlab.com/gitlab-org/omnibus-gitlab/-/tree/master/files/gitlab-cookbooks/monitoring/templates/rules (default Prometheus rules shipped with GitLab Omnibus)\n\n### Community alert collections\n- https://github.com/jpweber/prometheus-alert-rules\n- https://github.com/bdossantos/prometheus-alert-rules\n- https://github.com/giantswarm/prometheus-rules\n- https://github.com/last9/awesome-prometheus-toolkit\n- https://github.com/warpnet/awesome-prometheus (meta-list of Prometheus resources)\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "\n# Contributing\n\n## Adding alerting rule\n\nIf you don't have time to write a PR, just copy and paste some alerts into an issue. We will format it accordingly.\n\nRules are here: `_data/rules.yml`.\n\n### Guidelines\n\nPlease ensure your pull request adheres to the following guidelines:\n\n- Search previous suggestions before making a new one, as yours may be a duplicate.\n- Keep descriptions short and simple, but descriptive.\n- Description must be factual (the \"what?\") and should provide root cause suggestions (the \"why?\"), for faster resolution.\n- Queries must be tested on latest exporter version.\n\n## Improving Github page\n\n### Run locally\n\n```\ngem install bundler\nbundle install\njekyll serve\n```\n\nOr with Docker:\n\n```\ndocker run --rm -it -p 4000:4000 -v $(pwd):/srv/jekyll jekyll/jekyll jekyll serve\n```\n\nOr with Docker Compose:\n\n```\ndocker compose up -d\n```\n"
  },
  {
    "path": "Gemfile",
    "content": "source 'https://rubygems.org'\ngem 'github-pages', '>= 232', group: :jekyll_plugins\ngem 'webrick', '~> 1.8'"
  },
  {
    "path": "LICENSE",
    "content": "Creative Commons Attribution 4.0 International License (CC BY 4.0)\n\nhttp://creativecommons.org/licenses/by/4.0/\n"
  },
  {
    "path": "README.md",
    "content": "# 👋 Awesome Prometheus Alerts [![Awesome](https://awesome.re/badge-flat.svg)](https://awesome.re)\n\n> Most alerting rules are common to every Prometheus setup. We need a place to find them all. 🤘 🚨 📊\n\nCollection available here: **[https://samber.github.io/awesome-prometheus-alerts](https://samber.github.io/awesome-prometheus-alerts)**\n\n<div align=\"center\">\n  <hr>\n  <sup><b>Sponsored by:</b></sup>\n  <br>\n  <a href=\"https://cast.ai/samuel\">\n    <div>\n      <img src=\"https://samber.github.io/awesome-prometheus-alerts/assets/sponsor-cast-ai.png\" width=\"200\" alt=\"Cast AI\">\n    </div>\n    <div>\n      Cut Kubernetes & AI costs, boost application stability.\n    </div>\n  </a>\n  <br>\n  <a href=\"https://betterstack.com\">\n    <div>\n      <img src=\"https://samber.github.io/awesome-prometheus-alerts/assets/sponsor-betterstack.png\" width=\"200\" alt=\"Better Stack\">\n    </div>\n    <div>\n      Better Stack lets you centralize, search, and visualize your logs.\n    </div>\n  </a>\n  <hr>\n</div>\n\n## ✨ Contents\n\n- [Rules](#-rules)\n- [Contributing](#-contributing)\n- [Improvements](#-improvements)\n- [Help us](#-show-your-support)\n- [License](#-license)\n\n## 🚨 Rules\n\n#### Basic resource monitoring\n\n- [Prometheus self-monitoring](https://samber.github.io/awesome-prometheus-alerts/rules#prometheus-internals)\n- [Host/Hardware](https://samber.github.io/awesome-prometheus-alerts/rules#host-and-hardware)\n- [SMART](https://samber.github.io/awesome-prometheus-alerts/rules#smart)\n- [IPMI](https://samber.github.io/awesome-prometheus-alerts/rules#ipmi)\n- [Docker Containers](https://samber.github.io/awesome-prometheus-alerts/rules#docker-containers)\n- [Blackbox](https://samber.github.io/awesome-prometheus-alerts/rules#blackbox)\n- [Windows](https://samber.github.io/awesome-prometheus-alerts/rules#windows-server)\n- [VMWare](https://samber.github.io/awesome-prometheus-alerts/rules#vmware)\n- [Proxmox 
VE](https://samber.github.io/awesome-prometheus-alerts/rules#proxmox-ve)\n- [Netdata](https://samber.github.io/awesome-prometheus-alerts/rules#netdata)\n- [eBPF](https://samber.github.io/awesome-prometheus-alerts/rules#ebpf)\n- [Process Exporter](https://samber.github.io/awesome-prometheus-alerts/rules#process-exporter)\n- [Systemd](https://samber.github.io/awesome-prometheus-alerts/rules#systemd)\n\n#### Databases\n\n- [MySQL](https://samber.github.io/awesome-prometheus-alerts/rules#mysql)\n- [PostgreSQL](https://samber.github.io/awesome-prometheus-alerts/rules#postgresql)\n- [SQL Server](https://samber.github.io/awesome-prometheus-alerts/rules#sql-server)\n- [Oracle Database](https://samber.github.io/awesome-prometheus-alerts/rules#oracle-database)\n- [Patroni](https://samber.github.io/awesome-prometheus-alerts/rules#patroni)\n- [PGBouncer](https://samber.github.io/awesome-prometheus-alerts/rules#pgbouncer)\n- [Redis](https://samber.github.io/awesome-prometheus-alerts/rules#redis)\n- [Memcached](https://samber.github.io/awesome-prometheus-alerts/rules#memcached)\n- [MongoDB](https://samber.github.io/awesome-prometheus-alerts/rules#mongodb)\n- [Elasticsearch](https://samber.github.io/awesome-prometheus-alerts/rules#elasticsearch)\n- [Meilisearch](https://samber.github.io/awesome-prometheus-alerts/rules#meilisearch)\n- [Cassandra](https://samber.github.io/awesome-prometheus-alerts/rules#cassandra)\n- [Clickhouse](https://samber.github.io/awesome-prometheus-alerts/rules#clickhouse)\n- [CouchDB](https://samber.github.io/awesome-prometheus-alerts/rules#couchdb)\n- [Solr](https://samber.github.io/awesome-prometheus-alerts/rules#solr)\n\n#### Message brokers\n\n- [RabbitMQ](https://samber.github.io/awesome-prometheus-alerts/rules#rabbitmq)\n- [Zookeeper](https://samber.github.io/awesome-prometheus-alerts/rules#zookeeper)\n- [Kafka](https://samber.github.io/awesome-prometheus-alerts/rules#kafka)\n- 
[Pulsar](https://samber.github.io/awesome-prometheus-alerts/rules#pulsar)\n- [Nats](https://samber.github.io/awesome-prometheus-alerts/rules#nats)\n\n#### Proxies, load balancers and service meshes\n\n- [Nginx](https://samber.github.io/awesome-prometheus-alerts/rules#nginx)\n- [Apache](https://samber.github.io/awesome-prometheus-alerts/rules#apache)\n- [HaProxy](https://samber.github.io/awesome-prometheus-alerts/rules#haproxy)\n- [Traefik](https://samber.github.io/awesome-prometheus-alerts/rules#traefik)\n- [Caddy](https://samber.github.io/awesome-prometheus-alerts/rules#caddy)\n- [Envoy](https://samber.github.io/awesome-prometheus-alerts/rules#envoy)\n- [Linkerd](https://samber.github.io/awesome-prometheus-alerts/rules#linkerd)\n- [Istio](https://samber.github.io/awesome-prometheus-alerts/rules#istio)\n\n#### Runtimes\n\n- [PHP-FPM](https://samber.github.io/awesome-prometheus-alerts/rules#php-fpm)\n- [JVM](https://samber.github.io/awesome-prometheus-alerts/rules#jvm)\n- [Golang](https://samber.github.io/awesome-prometheus-alerts/rules#golang)\n- [Ruby](https://samber.github.io/awesome-prometheus-alerts/rules#ruby)\n- [Python](https://samber.github.io/awesome-prometheus-alerts/rules#python)\n- [Sidekiq](https://samber.github.io/awesome-prometheus-alerts/rules#sidekiq)\n\n#### Data engineering\n\n- [Apache Flink](https://samber.github.io/awesome-prometheus-alerts/rules#apache-flink)\n- [Apache Spark](https://samber.github.io/awesome-prometheus-alerts/rules#apache-spark)\n- [Hadoop](https://samber.github.io/awesome-prometheus-alerts/rules#hadoop)\n\n#### Orchestrators\n\n- [Kubernetes](https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes)\n- [Nomad](https://samber.github.io/awesome-prometheus-alerts/rules#nomad)\n- [Consul](https://samber.github.io/awesome-prometheus-alerts/rules#consul)\n- [Etcd](https://samber.github.io/awesome-prometheus-alerts/rules#etcd)\n- [OpenStack](https://samber.github.io/awesome-prometheus-alerts/rules#openstack)\n\n#### 
CI/CD\n\n- [Jenkins](https://samber.github.io/awesome-prometheus-alerts/rules#jenkins)\n- [ArgoCD](https://samber.github.io/awesome-prometheus-alerts/rules#argocd)\n- [FluxCD](https://samber.github.io/awesome-prometheus-alerts/rules#fluxcd)\n- [GitLab CI](https://samber.github.io/awesome-prometheus-alerts/rules#gitlab-ci)\n- [Spinnaker](https://samber.github.io/awesome-prometheus-alerts/rules#spinnaker)\n\n#### Network and security\n\n- [SpeedTest](https://samber.github.io/awesome-prometheus-alerts/rules#speedtest)\n- [SSL/TLS](https://samber.github.io/awesome-prometheus-alerts/rules#ssl/tls)\n- [cert-manager](https://samber.github.io/awesome-prometheus-alerts/rules#cert-manager)\n- [Juniper](https://samber.github.io/awesome-prometheus-alerts/rules#juniper)\n- [CoreDNS](https://samber.github.io/awesome-prometheus-alerts/rules#coredns)\n- [FreeSwitch](https://samber.github.io/awesome-prometheus-alerts/rules#freeswitch)\n- [Hashicorp Vault](https://samber.github.io/awesome-prometheus-alerts/rules#hashicorp-vault)\n- [Keycloak](https://samber.github.io/awesome-prometheus-alerts/rules#keycloak)\n- [Cloudflare](https://samber.github.io/awesome-prometheus-alerts/rules#cloudflare)\n- [SNMP](https://samber.github.io/awesome-prometheus-alerts/rules#snmp)\n- [Cilium](https://samber.github.io/awesome-prometheus-alerts/rules#cilium)\n- [WireGuard](https://samber.github.io/awesome-prometheus-alerts/rules#wireguard)\n\n#### Storage\n\n- [Ceph](https://samber.github.io/awesome-prometheus-alerts/rules#ceph)\n- [ZFS](https://samber.github.io/awesome-prometheus-alerts/rules#zfs)\n- [OpenEBS](https://samber.github.io/awesome-prometheus-alerts/rules#openebs)\n- [Minio](https://samber.github.io/awesome-prometheus-alerts/rules#minio)\n\n#### Cloud providers\n\n- [AWS CloudWatch](https://samber.github.io/awesome-prometheus-alerts/rules#aws-cloudwatch)\n- [Google Cloud Stackdriver](https://samber.github.io/awesome-prometheus-alerts/rules#google-cloud-stackdriver)\n- 
[DigitalOcean](https://samber.github.io/awesome-prometheus-alerts/rules#digitalocean)\n- [Azure](https://samber.github.io/awesome-prometheus-alerts/rules#azure)\n\n#### Observability\n\n- [Thanos](https://samber.github.io/awesome-prometheus-alerts/rules#thanos)\n- [Loki](https://samber.github.io/awesome-prometheus-alerts/rules#loki)\n- [Promtail](https://samber.github.io/awesome-prometheus-alerts/rules#promtail)\n- [Cortex](https://samber.github.io/awesome-prometheus-alerts/rules#cortex)\n- [Grafana Tempo](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-tempo)\n- [Grafana Mimir](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-mimir)\n- [Grafana Alloy](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-alloy)\n- [OpenTelemetry Collector](https://samber.github.io/awesome-prometheus-alerts/rules#opentelemetry-collector)\n- [Jaeger](https://samber.github.io/awesome-prometheus-alerts/rules#jaeger)\n\n#### Other\n\n- [APC UPS](https://samber.github.io/awesome-prometheus-alerts/rules#apc-ups)\n- [Graph Node](https://samber.github.io/awesome-prometheus-alerts/rules#graph-node)\n\n## 🤝 Contributing\n\nContributions from community (you!) 
are most welcome!\n\nThere are many ways to contribute: writing code, alerting rules, documentation, reporting issues, discussing better error tracking...\n\n[Instructions here](CONTRIBUTING.md)\n\n## 🏋️ Improvements\n\n- Create an alert rule builder in Jekyll for custom alerts (severity, thresholds, instances...)\n- Add resolution suggestions to rule descriptions, for faster incident resolution ([#85](https://github.com/samber/awesome-prometheus-alerts/issues/85)).\n\n## 💫 Show your support\n\nGive a ⭐️ if this project helped you!\n\n[![support us](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/samber)\n\n## 📝 License\n\n[![CC4](https://mirrors.creativecommons.org/presskit/cc.srr.primary.svg)](https://creativecommons.org/licenses/by/4.0/legalcode)\n\nLicensed under the Creative Commons Attribution 4.0 License. See the LICENSE file for more details.\n"
  },
  {
    "path": "_config.yml",
    "content": "theme: jekyll-theme-cayman\n\ntitle: Awesome Prometheus alerts\ndescription: Collection of alerting rules\n\nrepository: samber/awesome-prometheus-alerts\n\nbaseurl: /awesome-prometheus-alerts\n"
  },
  {
    "path": "_data/rules.yml",
    "content": "#\n# The following yaml cannot be copy-pasted to Prometheus configuration.\n#     Please navigate to https://samber.github.io/awesome-prometheus-alerts/rules instead.\n#\n# Contributing guidelines:\n#      https://github.com/samber/awesome-prometheus-alerts/blob/master/CONTRIBUTING.md\n#\n\ngroups:\n  - name: Basic resource monitoring\n    services:\n      - name: Prometheus self-monitoring\n        exporters:\n          - slug: embedded-exporter\n            rules:\n              - name: Prometheus job missing\n                description: A Prometheus job has disappeared\n                query: 'absent(up{job=\"prometheus\"})'\n                severity: warning\n              - name: Prometheus target missing\n                description: A Prometheus target has disappeared. An exporter might be crashed.\n                query: \"up == 0 unless on(job) (sum by (job) (up) == 0)\"\n                severity: critical\n                for: 1m\n                comments: |\n                  Only fire if at least one target in the job is still up.\n                  If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.\n              - name: Prometheus all targets missing\n                description: A Prometheus job does not have living target anymore.\n                query: \"sum by (job) (up) == 0\"\n                severity: critical\n                for: 1m\n              - name: Prometheus target missing with warmup time\n                description: \"Allow a job time to start up (10 minutes) before alerting that it's down.\"\n                query: \"sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))\"\n                severity: critical\n                for: 1m\n              - name: Prometheus configuration reload failure\n                description: Prometheus configuration reload error\n                query: 
\"prometheus_config_last_reload_successful != 1\"\n                severity: warning\n              - name: Prometheus too many restarts\n                description: Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n                query: 'changes(process_start_time_seconds{job=~\"prometheus|pushgateway|alertmanager\"}[15m]) > 2'\n                severity: warning\n              - name: Prometheus AlertManager job missing\n                description: A Prometheus AlertManager job has disappeared\n                query: 'absent(up{job=\"alertmanager\"})'\n                severity: warning\n              - name: Prometheus AlertManager configuration reload failure\n                description: AlertManager configuration reload error\n                query: \"alertmanager_config_last_reload_successful != 1\"\n                severity: warning\n              - name: Prometheus AlertManager config not synced\n                description: Configurations of AlertManager cluster instances are out of sync\n                query: 'count(count_values(\"config_hash\", alertmanager_config_hash)) > 1'\n                severity: warning\n              - name: Prometheus AlertManager E2E dead man switch\n                description: \"Prometheus DeadManSwitch is an always-firing alert. 
It's used as an end-to-end test of Prometheus through the Alertmanager.\"\n                query: \"vector(1)\"\n                severity: critical\n              - name: Prometheus not connected to alertmanager\n                description: Prometheus cannot connect to the Alertmanager\n                query: \"prometheus_notifications_alertmanagers_discovered < 1\"\n                severity: critical\n              - name: Prometheus rule evaluation failures\n                description: \"Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\"\n                query: \"increase(prometheus_rule_evaluation_failures_total[3m]) > 0\"\n                severity: critical\n              - name: Prometheus template text expansion failures\n                description: \"Prometheus encountered {{ $value }} template text expansion failures\"\n                query: \"increase(prometheus_template_text_expansion_failures_total[3m]) > 0\"\n                severity: critical\n              - name: Prometheus rule evaluation slow\n                description: \"Prometheus rule evaluation took more time than the scheduled interval. 
This indicates slow storage backend access or an overly complex query.\"\n                query: \"prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds\"\n                severity: warning\n                for: 5m\n              - name: Prometheus notifications backlog\n                description: The Prometheus notification queue has not been empty for 10 minutes\n                query: \"min_over_time(prometheus_notifications_queue_length[10m]) > 0\"\n                severity: warning\n              - name: Prometheus AlertManager notification failing\n                description: \"Alertmanager is failing to send notifications ({{ $value }} notifications/s)\"\n                query: \"rate(alertmanager_notifications_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus target empty\n                description: Prometheus has no target in service discovery\n                query: \"prometheus_sd_discovered_targets == 0\"\n                severity: critical\n              - name: Prometheus target scraping slow\n                description: Prometheus is scraping exporters slowly since it exceeded the requested interval time. 
Your Prometheus server is under-provisioned.\n                query: 'prometheus_target_interval_length_seconds{quantile=\"0.9\"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile=\"0.5\"} > 1.05'\n                severity: warning\n                for: 5m\n              - name: Prometheus large scrape\n                description: \"Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)\"\n                query: \"increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10\"\n                severity: warning\n                for: 5m\n              - name: Prometheus target scrape duplicate\n                description: \"Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)\"\n                query: \"increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3\"\n                severity: warning\n              - name: Prometheus TSDB checkpoint creation failures\n                description: \"Prometheus encountered {{ $value }} checkpoint creation failures\"\n                query: \"increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB checkpoint deletion failures\n                description: \"Prometheus encountered {{ $value }} checkpoint deletion failures\"\n                query: \"increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB compactions failed\n                description: \"Prometheus encountered {{ $value }} TSDB compactions failures\"\n                query: \"increase(prometheus_tsdb_compactions_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB head truncations failed\n                description: \"Prometheus encountered {{ $value }} TSDB head truncation failures\"\n        
        query: \"increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB reload failures\n                description: \"Prometheus encountered {{ $value }} TSDB reload failures\"\n                query: \"increase(prometheus_tsdb_reloads_failures_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB WAL corruptions\n                description: \"Prometheus encountered {{ $value }} TSDB WAL corruptions\"\n                query: \"increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus TSDB WAL truncations failed\n                description: \"Prometheus encountered {{ $value }} TSDB WAL truncation failures\"\n                query: \"increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0\"\n                severity: critical\n              - name: Prometheus timeseries cardinality\n                description: 'The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}'\n                query: 'label_replace(count by(__name__) ({__name__=~\".+\"}), \"name\", \"$1\", \"__name__\", \"(.+)\") > 10000'\n                severity: warning\n\n      - name: Host and hardware\n        exporters:\n          - name: node-exporter\n            slug: node-exporter\n            doc_url: https://github.com/prometheus/node_exporter\n            rules:\n              - name: Host out of memory\n                description: Node memory is filling up (< 10% left)\n                query: \"(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < .10)\"\n                severity: warning\n                for: 2m\n              - name: Host memory under memory pressure\n                description: \"The node is under heavy memory pressure. 
High rate of major page faults ({{ $value }}/s).\"\n                query: \"(rate(node_vmstat_pgmajfault[5m]) > 1000)\"\n                severity: warning\n              - name: Host Memory is underutilized\n                description: \"Node memory usage is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\"\n                query: \"min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8\"\n                severity: info\n                comments: |\n                  You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly\n              - name: Host unusual network throughput in\n                description: Host receive bandwidth is high (>80%).\n                query: \"((rate(node_network_receive_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0\"\n                severity: warning\n              - name: Host unusual network throughput out\n                description: Host transmit bandwidth is high (>80%)\n                query: \"((rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0\"\n                severity: warning\n              - name: Host disk IO utilization high\n                description: Disk utilization is high (> 80%)\n                query: \"(rate(node_disk_io_time_seconds_total[5m]) > .80)\"\n                severity: warning\n              - name: Host out of disk space\n                description: Disk is almost full (< 10% left)\n                query: '(node_filesystem_avail_bytes{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"} / node_filesystem_size_bytes < .10 and on (instance, device, mountpoint) node_filesystem_readonly == 0)'\n                severity: critical\n                comments: |\n                  Please add ignored mountpoints in node_exporter parameters like\n                  
\"--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)\".\n                  Same rule using \"node_filesystem_free_bytes\" will fire when disk fills for non-root users.\n                for: 2m\n              - name: Host disk may fill in 24 hours\n                description: Filesystem will likely run out of space within the next 24 hours.\n                query: 'predict_linear(node_filesystem_avail_bytes{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"}[3h], 86400) <= 0 and node_filesystem_avail_bytes > 0'\n                severity: warning\n                comments: |\n                  Please add ignored mountpoints in node_exporter parameters like\n                  \"--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)\".\n                  Same rule using \"node_filesystem_free_bytes\" will fire when disk fills for non-root users.\n                for: 2m\n              - name: Host out of inodes\n                description: Disk is almost running out of available inodes (< 10% left)\n                query: \"(node_filesystem_files_free / node_filesystem_files < .10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) and node_filesystem_files > 0\"\n                severity: critical\n                for: 2m\n              - name: Host filesystem device error\n                description: \"Error stat-ing the {{ $labels.mountpoint }} filesystem\"\n                query: 'node_filesystem_device_error{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"} == 1'\n                severity: critical\n                for: 2m\n              - name: Host inodes may fill in 24 hours\n                description: Filesystem will likely run out of inodes within the next 24 hours at current write rate\n                query: 'predict_linear(node_filesystem_files_free{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"}[1h], 86400) <= 0 and node_filesystem_files_free > 0'\n                severity: warning\n                for: 2m\n              - name: Host 
unusual disk read latency\n                description: Disk latency is growing (read operations > 100ms)\n                query: \"(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0)\"\n                severity: warning\n                for: 2m\n              - name: Host unusual disk write latency\n                description: Disk latency is growing (write operations > 100ms)\n                query: \"(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0)\"\n                severity: warning\n                for: 2m\n              - name: Host high CPU load\n                description: CPU load is > 80%\n                query: '1 - (avg without (cpu) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) > .80'\n                severity: warning\n                for: 10m\n              - name: Host CPU is underutilized\n                description: \"CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.\"\n                query: '(min without (cpu) (rate(node_cpu_seconds_total{mode=\"idle\"}[1h]))) > 0.8'\n                severity: info\n                for: 1w\n                comments: |\n                  You may want to increase the Alertmanager 'repeat_interval' for this type of alert to daily or weekly\n              - name: Host CPU steal noisy neighbor\n                description: CPU steal is > 10%. A noisy neighbor is killing VM performance or a spot instance may be out of credit.\n                query: 'avg without (cpu) (rate(node_cpu_seconds_total{mode=\"steal\"}[5m])) * 100 > 10'\n                severity: warning\n              - name: Host CPU high iowait\n                description: CPU iowait > 10%. 
Your CPU is idling waiting for storage to respond.\n                query: 'avg without (cpu) (rate(node_cpu_seconds_total{mode=\"iowait\"}[5m])) > .10'\n                severity: warning\n              - name: Host unusual disk IO\n                description: \"Disk usage >80%. Check storage for issues or increase IOPS capabilities.\"\n                query: \"rate(node_disk_io_time_seconds_total[5m]) > 0.8\"\n                severity: warning\n                for: 5m\n              - name: Host context switching high\n                description: Context switching is growing on the node (twice the daily average during the last 15m)\n                query: '(rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode=\"idle\"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode=\"idle\"})) > 2'\n                severity: warning\n                comments: |\n                  x2 context switches is an arbitrary number.\n                  The alert threshold depends on the nature of the application.\n                  Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58\n              - name: Host swap is filling up\n                description: Swap is filling up (>80%)\n                query: \"((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) and node_memory_SwapTotal_bytes > 0\"\n                severity: warning\n                for: 2m\n              - name: Host systemd service crashed\n                description: \"systemd service {{ $labels.name }} crashed\"\n                query: '(node_systemd_unit_state{state=\"failed\"} == 1)'\n                severity: warning\n              - name: Host physical component too hot\n                description: \"Physical hardware component too hot\"\n                query: \"node_hwmon_temp_celsius > node_hwmon_temp_max_celsius\"\n                severity: warning\n 
               for: 5m\n              - name: Host node overtemperature alarm\n                description: \"Physical node temperature alarm triggered\"\n                query: \"((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))\"\n                severity: critical\n              - name: Host software RAID insufficient drives\n                description: \"MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.\"\n                query: '((node_md_disks_required - ignoring(state) node_md_disks{state=\"active\"}) > 0)'\n                comments: |\n                  Uses ignoring(state) to handle additional labels on node_md_disks. Matches the official node-exporter mixin.\n                severity: critical\n              - name: Host software RAID disk failure\n                description: \"MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.\"\n                query: '(node_md_disks{state=\"failed\"} > 0)'\n                severity: warning\n                for: 2m\n              - name: Host kernel version deviations\n                description: Kernel version for {{ $labels.instance }} has changed.\n                query: \"changes(node_uname_info[1h]) > 0\"\n                severity: info\n              - name: Host OOM kill detected\n                description: OOM kill detected\n                query: \"(increase(node_vmstat_oom_kill[30m]) > 0)\"\n                severity: warning\n                comments: |\n                  When a machine runs out of memory, the node exporter can become unresponsive for several minutes. 
Even if the system takes 15–20 minutes to recover, the alert should still trigger.\n              - name: Host EDAC Correctable Errors detected\n                description: 'Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'\n                query: \"(increase(node_edac_correctable_errors_total[1m]) > 0)\"\n                severity: info\n              - name: Host EDAC Uncorrectable Errors detected\n                description: 'Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.'\n                query: \"(node_edac_uncorrectable_errors_total > 0)\"\n                severity: warning\n              - name: Host Network Receive Errors\n                description: 'Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.'\n                query: \"(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) and rate(node_network_receive_packets_total[2m]) > 0\"\n                severity: warning\n                for: 2m\n              - name: Host Network Transmit Errors\n                description: 'Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.'\n                query: \"(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) and rate(node_network_transmit_packets_total[2m]) > 0\"\n                severity: warning\n                for: 2m\n              - name: Host Network Bond Degraded\n                description: 'Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".'\n                query: \"((node_bonding_active - node_bonding_slaves) != 0)\"\n                severity: warning\n                for: 2m\n              
- name: Host conntrack limit\n                description: \"The number of conntrack is approaching limit\"\n                query: \"(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) and node_nf_conntrack_entries_limit > 0\"\n                severity: warning\n                for: 5m\n              - name: Host clock skew\n                description: \"Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\"\n                query: \"((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0))\"\n                severity: warning\n                for: 10m\n              - name: Host clock not synchronising\n                description: \"Clock not synchronising. Ensure NTP is configured on this host.\"\n                query: \"(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16)\"\n                severity: warning\n                for: 2m\n\n      - name: S.M.A.R.T Device Monitoring\n        exporters:\n          - name: smartctl-exporter\n            slug: smartctl-exporter\n            doc_url: https://github.com/prometheus-community/smartctl_exporter\n            rules:\n              - name: SMART device temperature warning\n                description: Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C\n                query: '(avg_over_time(smartctl_device_temperature{temperature_type=\"current\"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type=\"drive_trip\"}) > 60'\n                severity: warning\n              - name: SMART device temperature critical\n                description: Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C\n                query: '(max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [5m]) unless on 
(instance, device) smartctl_device_temperature{temperature_type=\"drive_trip\"}) > 70'\n                severity: critical\n              - name: SMART device temperature over trip value\n                description: Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}\n                query: 'max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [10m]) >= on(device, instance) smartctl_device_temperature{temperature_type=\"drive_trip\"}'\n                severity: critical\n              - name: SMART device temperature nearing trip value\n                description: Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}\n                query: 'max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [10m]) >= on(device, instance) (smartctl_device_temperature{temperature_type=\"drive_trip\"} * .80)'\n                severity: warning\n              - name: SMART status\n                description: Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}\n                query: \"smartctl_device_smart_status != 1\"\n                severity: critical\n              - name: SMART critical warning\n                description: Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}\n                query: \"smartctl_device_critical_warning > 0\"\n                severity: critical\n              - name: SMART media errors\n                description: Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}\n                query: \"smartctl_device_media_errors > 0\"\n                severity: critical\n              - name: SMART Wearout Indicator\n                description: Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}\n                query: \"smartctl_device_available_spare < smartctl_device_available_spare_threshold\"\n     
           severity: critical\n\n      - name: IPMI\n        exporters:\n          - name: prometheus-community/ipmi_exporter\n            slug: ipmi-exporter\n            doc_url: https://github.com/prometheus-community/ipmi_exporter\n            rules:\n              - name: IPMI collector down\n                description: \"IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\"\n                query: 'ipmi_up == 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The ipmi_up metric is per-collector. A value of 0 means the collector could not retrieve data from the BMC.\n              - name: IPMI temperature sensor warning\n                description: \"IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\"\n                query: 'ipmi_temperature_state == 1'\n                severity: warning\n                for: 5m\n                comments: |\n                  State values: 0=nominal, 1=warning, 2=critical. Thresholds are defined in the BMC firmware.\n              - name: IPMI temperature sensor critical\n                description: \"IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.\"\n                query: 'ipmi_temperature_state == 2'\n                severity: critical\n              - name: IPMI fan speed sensor warning\n                description: \"IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\"\n                query: 'ipmi_fan_speed_state == 1'\n                severity: warning\n                for: 5m\n              - name: IPMI fan speed sensor critical\n                description: \"IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. 
A fan may have failed.\"\n                query: 'ipmi_fan_speed_state == 2'\n                severity: critical\n              - name: IPMI fan speed zero\n                description: \"IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\"\n                query: 'ipmi_fan_speed_rpm == 0'\n                severity: critical\n                for: 5m\n              - name: IPMI voltage sensor warning\n                description: \"IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\"\n                query: 'ipmi_voltage_state == 1'\n                severity: warning\n                for: 5m\n              - name: IPMI voltage sensor critical\n                description: \"IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\"\n                query: 'ipmi_voltage_state == 2'\n                severity: critical\n              - name: IPMI current sensor warning\n                description: \"IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\"\n                query: 'ipmi_current_state == 1'\n                severity: warning\n                for: 5m\n              - name: IPMI current sensor critical\n                description: \"IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\"\n                query: 'ipmi_current_state == 2'\n                severity: critical\n              - name: IPMI power sensor warning\n                description: \"IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\"\n                query: 'ipmi_power_state == 1'\n                severity: warning\n                for: 5m\n              - name: IPMI power sensor critical\n                description: \"IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\"\n                query: 
'ipmi_power_state == 2'\n                severity: critical\n              - name: IPMI generic sensor critical\n                description: \"IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\"\n                query: 'ipmi_sensor_state == 2'\n                severity: critical\n                for: 5m\n                comments: |\n                  Catches any sensor type not covered by the specific temperature/fan/voltage/current/power alerts.\n              - name: IPMI chassis power off\n                description: \"IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\"\n                query: 'ipmi_chassis_power_state == 0'\n                severity: critical\n              - name: IPMI chassis drive fault\n                description: \"IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.\"\n                query: 'ipmi_chassis_drive_fault_state == 0'\n                severity: critical\n                comments: |\n                  The metric uses inverted logic: 1=no fault, 0=fault detected.\n              - name: IPMI chassis cooling fault\n                description: \"IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\"\n                query: 'ipmi_chassis_cooling_fault_state == 0'\n                severity: critical\n                comments: |\n                  The metric uses inverted logic: 1=no fault, 0=fault detected.\n              - name: IPMI SEL almost full\n                description: \"IPMI System Event Log on {{ $labels.instance }} has only {{ printf \\\"%.0f\\\" $value }} bytes free. Clear the SEL to prevent loss of new events.\"\n                query: 'ipmi_sel_free_space_bytes < 512'\n                severity: warning\n                for: 5m\n                comments: |\n                  SEL storage is typically very limited (e.g., 16KB). 
When full, new events may be dropped.\n\n      - name: Docker containers\n        exporters:\n          - name: google/cAdvisor\n            slug: google-cadvisor\n            doc_url: https://github.com/google/cadvisor\n            rules:\n              - name: Container killed\n                description: A container has disappeared\n                query: \"time() - container_last_seen > 60\"\n                severity: warning\n                comments: |\n                  This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.\n              - name: Container absent\n                description: A container is absent for 5 min\n                query: \"absent(container_last_seen)\"\n                severity: warning\n                for: 5m\n                comments: |\n                  This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.\n              - name: Container High CPU utilization\n                description: 'Container CPU utilization is above 80% (current: {{ $value | printf \"%.2f\" }}%)'\n                query: '(sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) * 100) > 80 and sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) > 0'\n                comments: |\n                  Only fires for containers with explicit CPU limits. 
Containers without limits have cpu_quota=0, which is filtered out by the guard.\n                severity: warning\n                for: 2m\n              - name: Container High Memory usage\n                description: Container Memory usage is above 80%\n                query: '(sum(container_memory_working_set_bytes{name!=\"\"}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80'\n                severity: warning\n                comments: See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d\n                for: 2m\n              - name: Container Volume usage\n                description: Container Volume usage is above 80%\n                query: '(1 - (sum(container_fs_inodes_free{name!=\"\"}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80 and sum(container_fs_inodes_total) BY (instance) > 0'\n                severity: warning\n                for: 2m\n              - name: Container high throttle rate\n                description: \"Container is being throttled ({{ $value | humanizePercentage }})\"\n                query: 'sum(rate(container_cpu_cfs_throttled_periods_total{container!=\"\"}[5m])) by (container, pod, namespace) / sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 ) and sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > 0'\n                severity: warning\n                for: 5m\n              - name: Container high low change CPU usage\n                description: This alert rule monitors the absolute change in CPU usage within a time window and triggers an alert when the change exceeds 25%.\n                query: '(abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[1m] offset 1m)) * 100)) or abs((sum by (instance, name) 
(rate(container_cpu_usage_seconds_total{name!=\"\"}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[5m] offset 1m)) * 100))) > 25'\n                severity: info\n              - name: Container Low CPU utilization\n                description: 'Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU. (current: {{ $value | printf \"%.2f\" }}%)'\n                query: '(sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) * 100) < 20'\n                severity: info\n                for: 7d\n              - name: Container Low Memory usage\n                description: Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n                query: '(sum(container_memory_working_set_bytes{name!=\"\"}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20'\n                severity: info\n                for: 7d\n\n      - name: Blackbox\n        exporters:\n          - name: prometheus/blackbox_exporter\n            slug: blackbox-exporter\n            doc_url: https://github.com/prometheus/blackbox_exporter\n            rules:\n              - name: Blackbox probe failed\n                description: Probe failed\n                query: probe_success == 0\n                severity: critical\n              - name: Blackbox configuration reload failure\n                description: Blackbox configuration reload failure\n                query: \"blackbox_exporter_config_last_reload_successful != 1\"\n                severity: warning\n              - name: Blackbox slow probe\n                description: Blackbox probe took more than 1s to complete\n                query: \"probe_duration_seconds > 1\"\n                severity: warning\n                for: 1m\n              
- name: Blackbox probe HTTP failure\n                description: HTTP status code is not 200-399\n                query: \"probe_http_status_code <= 199 OR probe_http_status_code >= 400\"\n                severity: critical\n              - name: Blackbox SSL certificate will expire soon\n                description: SSL certificate expires in less than 20 days\n                query: \"3 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 20\"\n                severity: warning\n              - name: Blackbox SSL certificate will expire very soon\n                description: SSL certificate expires in less than 3 days\n                query: \"0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3\"\n                severity: critical\n              - name: Blackbox SSL certificate expired\n                description: SSL certificate has expired already\n                query: \"round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0\"\n                severity: critical\n                comments: |\n                  For probe_ssl_earliest_cert_expiry to be exposed after expiration, you\n                  need to enable insecure_skip_verify. 
Note that this will disable\n                  certificate validation.\n                  See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config\n              - name: Blackbox probe slow HTTP\n                description: HTTP request took more than 1s\n                query: \"probe_http_duration_seconds > 1\"\n                severity: warning\n                for: 1m\n              - name: Blackbox probe slow ping\n                description: Blackbox ping took more than 1s\n                query: \"probe_icmp_duration_seconds > 1\"\n                severity: warning\n                for: 1m\n\n      - name: Windows Server\n        exporters:\n          - name: prometheus-community/windows_exporter\n            slug: windows-exporter\n            doc_url: https://github.com/prometheus-community/windows_exporter\n            rules:\n              - name: Windows Server collector Error\n                description: \"Collector {{ $labels.collector }} was not successful\"\n                query: \"windows_exporter_collector_success == 0\"\n                severity: critical\n              - name: Windows Server service Status\n                description: Windows Service state is not OK\n                query: 'windows_service_status{status=\"ok\"} != 1'\n                severity: critical\n                for: 1m\n              - name: Windows Server CPU Usage\n                description: CPU Usage is more than 80%\n                query: '100 - (avg by (instance) (rate(windows_cpu_time_total{mode=\"idle\"}[2m])) * 100) > 80'\n                severity: warning\n              - name: Windows Server memory Usage\n                description: Memory usage is more than 90%\n                query: \"100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90\"\n                severity: warning\n                for: 2m\n              - name: Windows Server disk Space Usage\n                
description: Disk usage is more than 80%\n                query: \"100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80 and windows_logical_disk_size_bytes > 0\"\n                severity: critical\n                for: 2m\n\n      - name: VMware\n        exporters:\n          - name: pryorda/vmware_exporter\n            slug: pryorda-vmware-exporter\n            doc_url: https://github.com/pryorda/vmware_exporter\n            rules:\n              - name: Virtual Machine Memory Warning\n                description: 'High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90\"\n                severity: warning\n                for: 5m\n              - name: Virtual Machine Memory Critical\n                description: 'High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"vmware_vm_mem_usage_average / 100 >= 90\"\n                severity: critical\n                for: 1m\n              - name: High Number of Snapshots\n                description: \"High snapshots number on {{ $labels.instance }}: {{ $value }}\"\n                query: \"vmware_vm_snapshots > 3\"\n                severity: warning\n                for: 30m\n              - name: Outdated Snapshots\n                description: 'Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\"}} days'\n                query: \"(time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3\"\n                severity: warning\n                for: 5m\n\n      - name: Proxmox VE\n        exporters:\n          - name: prometheus-pve/prometheus-pve-exporter\n            slug: prometheus-pve-exporter\n            doc_url: https://github.com/prometheus-pve/prometheus-pve-exporter\n            rules:\n              - name: PVE node down\n                description: 'Proxmox VE node 
{{ $labels.id }} is down.'\n                query: 'pve_up{id=~\"node/.*\"} == 0'\n                severity: critical\n                for: 2m\n              - name: PVE VM/CT down\n                description: 'Proxmox VE guest {{ $labels.id }} is not running.'\n                query: 'pve_up{id=~\"(qemu|lxc)/.*\"} == 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  This alert triggers for all VMs and containers that are not running.\n                  You may want to filter by specific guests using the `id` label, or exclude\n                  intentionally stopped guests with additional label matchers.\n              - name: PVE high CPU usage\n                description: 'Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%'\n                query: 'pve_cpu_usage_ratio * 100 > 90'\n                severity: warning\n                for: 5m\n              - name: PVE high memory usage\n                description: 'Proxmox VE memory usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%'\n                query: 'pve_memory_usage_bytes / pve_memory_size_bytes * 100 > 90 and pve_memory_size_bytes > 0'\n                severity: warning\n                for: 5m\n              - name: PVE storage filling up\n                description: 'Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf \"%.2f\" }}%'\n                query: 'pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100 > 80 and pve_disk_size_bytes{id=~\"storage/.*\"} > 0'\n                severity: warning\n                for: 5m\n              - name: PVE storage almost full\n                description: 'Proxmox VE storage {{ $labels.id }} is above 95% used. 
Current value: {{ $value | printf \"%.2f\" }}%'\n                query: 'pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100 > 95 and pve_disk_size_bytes{id=~\"storage/.*\"} > 0'\n                severity: critical\n                for: 2m\n              - name: PVE guest not backed up\n                description: '{{ $value }} Proxmox VE guest(s) are not covered by any backup job.'\n                query: 'pve_not_backed_up_total > 0'\n                severity: warning\n              - name: PVE replication failed\n                description: 'Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).'\n                query: 'pve_replication_failed_syncs > 0'\n                severity: warning\n              - name: PVE cluster not quorate\n                description: 'Proxmox VE cluster has lost quorum.'\n                query: 'pve_cluster_info{quorate=\"0\"} == 1'\n                severity: critical\n                comments: |\n                  Loss of quorum means the cluster cannot make decisions about VM placement\n                  and fencing. This requires immediate attention.\n\n      - name: Netdata\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://github.com/netdata/netdata/blob/master/backends/prometheus/README.md\n            rules:\n              - name: Netdata high cpu usage\n                description: Netdata high CPU usage (> 80%)\n                query: 'netdata_cpu_cpu_percentage_average{dimension=\"idle\"} < 20'\n                severity: warning\n                for: 5m\n                comments: |\n                  This is a gauge metric (not a counter). Checking idle < 20% means CPU usage > 80%.\n              - name: Netdata CPU steal noisy neighbor\n                description: CPU steal is > 10%. 
A noisy neighbor is killing VM performance, or a spot instance may be out of credit.\n                query: 'netdata_cpu_cpu_percentage_average{dimension=\"steal\"} > 10'\n                severity: warning\n                for: 5m\n              - name: Netdata high memory usage\n                description: Netdata high memory usage (> 80%)\n                query: '100 / netdata_system_ram_MiB_average * netdata_system_ram_MiB_average{dimension=~\"free|cached\"} < 20 and netdata_system_ram_MiB_average > 0'\n                severity: warning\n                for: 5m\n              - name: Netdata low disk space\n                description: Netdata low disk space (> 80% used)\n                query: '100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~\"avail|cached\"} < 20 and netdata_disk_space_GB_average > 0'\n                severity: warning\n                for: 5m\n              - name: Netdata predicted disk full\n                description: Netdata predicted disk full in 24 hours\n                query: 'predict_linear(netdata_disk_space_GB_average{dimension=~\"avail|cached\"}[3h], 24 * 3600) < 0'\n                severity: warning\n              - name: Netdata MD mismatch cnt unsynchronized blocks\n                description: RAID array has unsynchronized blocks\n                query: \"netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024\"\n                severity: warning\n                for: 2m\n              - name: Netdata disk reallocated sectors\n                description: \"Disk reallocated sectors detected ({{ $value }} sectors)\"\n                query: \"increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0\"\n                severity: info\n              - name: Netdata disk current pending sector\n                description: Disk current pending sector\n                query: \"netdata_smartd_log_current_pending_sector_count_sectors_average > 0\"\n                severity: 
warning\n              - name: Netdata reported uncorrectable disk sectors\n                description: \"Reported uncorrectable disk sectors ({{ $value }} sectors)\"\n                query: \"increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0\"\n                severity: warning\n\n      - name: eBPF\n        exporters:\n          - name: cloudflare/ebpf_exporter\n            slug: ebpf-exporter\n            doc_url: https://github.com/cloudflare/ebpf_exporter\n            rules:\n              - name: eBPF exporter program not attached\n                description: \"eBPF program {{ $labels.id }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }})\"\n                query: 'ebpf_exporter_ebpf_program_attached == 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The exporter uses loose attachment: if a program fails to load (missing BTF, kernel incompatibility), it sets this metric to 0 and continues running.\n              - name: eBPF exporter decoder errors\n                description: \"eBPF exporter is experiencing decoder errors for config {{ $labels.config }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }})\"\n                query: 'rate(ebpf_exporter_decoder_errors_total[5m]) > 0'\n                severity: warning\n                for: 5m\n              - name: eBPF exporter no enabled configs\n                description: \"eBPF exporter has no enabled configurations. No eBPF programs are being run. 
(instance {{ $labels.instance }})\"\n                query: 'ebpf_exporter_enabled_configs == 0 or absent(ebpf_exporter_enabled_configs)'\n                severity: warning\n                for: 5m\n\n      - name: Process Exporter\n        exporters:\n          - name: ncabatoff/process-exporter\n            slug: process-exporter\n            doc_url: https://github.com/ncabatoff/process-exporter\n            rules:\n              - name: Process exporter group down\n                description: \"No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_num_procs == 0'\n                severity: warning\n                for: 5m\n              - name: Process exporter high memory usage\n                description: \"Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_memory_bytes{memtype=\"resident\"} > 4e+09'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 4GB is arbitrary and depends on the process being monitored. Adjust per group.\n              - name: Process exporter high CPU usage\n                description: \"Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }})\"\n                query: 'rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80'\n                severity: warning\n                for: 5m\n                comments: |\n                  Value is core-equivalent %: 100% = 1 full core, 200% = 2 cores, etc. Threshold of 80% is per-core. Adjust based on expected workload.\n              - name: Process exporter high file descriptor usage\n                description: \"Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. 
(instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_worst_fd_ratio > 0.8'\n                severity: warning\n                for: 5m\n              - name: Process exporter file descriptors exhausted\n                description: \"Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_worst_fd_ratio > 0.95'\n                severity: critical\n                for: 2m\n              - name: Process exporter high swap usage\n                description: \"Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_memory_bytes{memtype=\"swapped\"} > 512e+06'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 512MB is arbitrary. Adjust per group and environment.\n              - name: Process exporter zombie processes\n                description: \"Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }})\"\n                query: 'namedprocess_namegroup_states{state=\"Zombie\"} > 5'\n                severity: warning\n                for: 5m\n              - name: Process exporter high context switching\n                description: \"Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }})\"\n                query: 'rate(namedprocess_namegroup_context_switches_total{ctxswitchtype=\"voluntary\"}[5m]) > 50000'\n                severity: warning\n                for: 5m\n                comments: |\n                  Filters to voluntary switches only — involuntary switches are normal under CPU contention. Threshold of 50000/s is a rough default. 
Adjust based on workload.\n              - name: Process exporter high disk write IO\n                description: \"Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }})\"\n                query: 'rate(namedprocess_namegroup_write_bytes_total[5m]) > 100e+06'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 100MB/s is arbitrary. Adjust per group.\n              - name: Process exporter process restarting\n                description: \"Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }})\"\n                query: 'changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) > 0 and namedprocess_namegroup_num_procs > 0'\n                severity: info\n                comments: |\n                  Detects restarts by watching for changes in the oldest process start time within the group.\n\n      - name: Systemd\n        exporters:\n          - name: prometheus-community/systemd_exporter\n            slug: systemd-exporter\n            doc_url: https://github.com/prometheus-community/systemd_exporter\n            rules:\n              - name: Systemd unit failed\n                description: \"Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }})\"\n                query: 'systemd_unit_state{state=\"failed\"} == 1'\n                severity: warning\n                for: 5m\n              - name: Systemd unit inactive\n                description: \"Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }})\"\n                query: 'systemd_unit_state{state=\"inactive\", type=\"service\", name=~\"your-critical-service.+\"} == 1'\n                severity: warning\n                for: 5m\n                comments: |\n                  Many units are legitimately inactive. 
You must adjust the name=~ filter to match your critical services.\n              - name: Systemd service crash looping\n                description: \"Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }})\"\n                query: 'increase(systemd_service_restart_total[1h]) > 5'\n                severity: critical\n                for: 5m\n              - name: Systemd unit tasks near limit\n                description: \"Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. (instance {{ $labels.instance }})\"\n                query: 'systemd_unit_tasks_current / ignoring(type) systemd_unit_tasks_max > 0.9 and ignoring(type) systemd_unit_tasks_max > 0'\n                severity: warning\n                for: 5m\n              - name: Systemd socket refused connections\n                description: \"Systemd socket {{ $labels.name }} is refusing connections. ({{ $value }} refused in last 5m, instance {{ $labels.instance }})\"\n                query: 'increase(systemd_socket_refused_connections_total[5m]) > 0'\n                severity: warning\n                for: 2m\n              - name: Systemd socket high connections\n                description: \"Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }})\"\n                query: 'systemd_socket_current_connections > 100'\n                severity: warning\n                for: 2m\n                comments: |\n                  Threshold of 100 connections is arbitrary. Adjust to your workload.\n              - name: Systemd timer missed trigger\n                description: \"Systemd timer {{ $labels.name }} has not triggered for over 24 hours. 
(instance {{ $labels.instance }})\"\n                query: '(time() - systemd_timer_last_trigger_seconds) / 3600 > 24 and systemd_timer_last_trigger_seconds > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  Triggers if timer hasn't fired in 24 hours. Adjust threshold per timer schedule.\n\n  - name: Databases\n    services:\n      - name: MySQL\n        exporters:\n          - name: prometheus/mysqld_exporter\n            slug: mysqld-exporter\n            doc_url: https://github.com/prometheus/mysqld_exporter\n            rules:\n              - name: MySQL down\n                description: MySQL instance is down on {{ $labels.instance }}\n                query: \"mysql_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: MySQL too many connections (> 80%)\n                description: \"More than 80% of MySQL connections are in use on {{ $labels.instance }}\"\n                query: \"max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80 and mysql_global_variables_max_connections > 0\"\n                severity: warning\n                for: 2m\n              - name: MySQL high prepared statements utilization (> 80%)\n                description: \"High utilization of prepared statements (>80%) on {{ $labels.instance }}\"\n                query: \"max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80 and mysql_global_variables_max_prepared_stmt_count > 0\"\n                severity: warning\n                for: 2m\n              - name: MySQL high threads running\n                description: \"More than 60% of MySQL connections are in running state on {{ $labels.instance }}\"\n                query: 
\"max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60 and mysql_global_variables_max_connections > 0\"\n                severity: warning\n                for: 2m\n              - name: MySQL Slave IO thread not running\n                description: \"MySQL Slave IO thread not running on {{ $labels.instance }}\"\n                query: \"( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: MySQL Slave SQL thread not running\n                description: \"MySQL Slave SQL thread not running on {{ $labels.instance }}\"\n                query: \"( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: MySQL Slave replication lag\n                description: \"MySQL replication lag on {{ $labels.instance }}\"\n                query: \"( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and ON (instance) mysql_slave_status_master_server_id > 0 ) > 30\"\n                severity: critical\n                for: 1m\n              - name: MySQL slow queries\n                description: \"MySQL server mysql has some new slow query ({{ $value }} in the last minute).\"\n                query: increase(mysql_global_status_slow_queries[1m]) > 0\n                severity: warning\n                for: 2m\n              - name: MySQL InnoDB log waits\n                description: \"MySQL innodb log writes stalling ({{ $value }} waits/s)\"\n                query: rate(mysql_global_status_innodb_log_waits[15m]) > 10\n                
severity: warning\n              - name: MySQL restarted\n                description: MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n                query: \"mysql_global_status_uptime < 60\"\n                severity: info\n              - name: MySQL High QPS\n                description: MySQL is being overloaded with unusual QPS (> 10k QPS).\n                query: \"irate(mysql_global_status_questions[1m]) > 10000\"\n                severity: info\n                for: 2m\n              - name: MySQL too many open files\n                description: MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}.\n                query: \"mysql_global_status_innodb_num_open_files / mysql_global_variables_open_files_limit * 100 > 75 and mysql_global_variables_open_files_limit > 0\"\n                severity: warning\n                for: 2m\n              - name: MySQL InnoDB Force Recovery is enabled\n                description: \"MySQL InnoDB force recovery is enabled on {{ $labels.instance }}\"\n                query: \"mysql_global_variables_innodb_force_recovery != 0\"\n                severity: warning\n                for: 2m\n              - name: MySQL InnoDB history_len too long\n                description: \"MySQL history_len (undo log) too long on {{ $labels.instance }}\"\n                query: \"mysql_info_schema_innodb_metrics_transaction_trx_rseg_history_len > 50000\"\n                severity: warning\n                for: 2m\n\n      - name: PostgreSQL\n        exporters:\n          - name: prometheus-community/postgres_exporter\n            slug: postgres-exporter\n            doc_url: https://github.com/prometheus-community/postgres_exporter\n            rules:\n              - name: Postgresql down\n                description: Postgresql instance is down\n                query: \"pg_up == 0\"\n                severity: critical\n                for: 1m\n            
    comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Postgresql restarted\n                description: Postgresql restarted\n                query: \"time() - pg_postmaster_start_time_seconds < 60\"\n                severity: critical\n              - name: Postgresql exporter error\n                description: Postgresql exporter is showing errors. A query may be buggy in query.yaml\n                query: \"pg_exporter_last_scrape_error > 0\"\n                severity: critical\n              - name: Postgresql table not auto vacuumed\n                description: Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n                query: \"((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10\"\n                severity: warning\n              - name: Postgresql table not auto analyzed\n                description: Table {{ $labels.relname }} has not been auto analyzed for 10 days\n                query: \"((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10\"\n                severity: warning\n              - name: Postgresql too many connections\n                description: PostgreSQL instance has too many connections (> 80%).\n                query: \"sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8)\"\n                severity: warning\n                for: 2m\n              - name: Postgresql not enough connections\n                description: PostgreSQL instance should have more connections (> 5)\n                query: 'sum by (datname) 
(pg_stat_activity_count{datname!~\"template.*|postgres\"}) < 5'\n                severity: critical\n                for: 2m\n              - name: Postgresql dead locks\n                description: \"PostgreSQL has dead-locks ({{ $value }} in the last minute)\"\n                query: 'increase(pg_stat_database_deadlocks{datname!~\"template.*|postgres\"}[1m]) > 5'\n                severity: warning\n              - name: Postgresql high rollback rate\n                description: Ratio of transactions being aborted compared to committed is > 2 %\n                query: 'sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])))) > 0.02'\n                severity: warning\n              - name: Postgresql commit rate low\n                description: Postgresql seems to be processing very few transactions\n                query: 'increase(pg_stat_database_xact_commit{datname!~\"template.*|postgres\",datid!=\"0\"}[5m]) < 5'\n                severity: critical\n                for: 2m\n              - name: Postgresql low XID consumption\n                description: Postgresql seems to be consuming transaction IDs very slowly\n                query: \"rate(pg_txid_current[1m]) < 5\"\n                severity: warning\n                for: 2m\n              - name: Postgresql unused replication slot\n                description: Unused Replication Slots\n                query: \"(pg_replication_slots_active == 0) and (pg_replication_is_replica == 0)\"\n                severity: warning\n                for: 1m\n              - name: Postgresql too many dead tuples\n                description: PostgreSQL dead tuples is too large\n                query: \"((pg_stat_user_tables_n_dead_tup > 10000) / 
(pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 and (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > 0\"\n                severity: warning\n                for: 2m\n              - name: Postgresql configuration changed\n                description: Postgres Database configuration change has occurred\n                query: '{__name__=~\"pg_settings_.*\",__name__!=\"pg_settings_transaction_read_only\"} != ON(__name__, instance) {__name__=~\"pg_settings_.*\",__name__!=\"pg_settings_transaction_read_only\"} OFFSET 5m'\n                severity: info\n              - name: Postgresql SSL compression active\n                description: Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n                query: \"sum by (instance) (pg_stat_ssl_compression) > 0\"\n                severity: warning\n              - name: Postgresql too many locks acquired\n                description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n                query: \"((sum by (instance) (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 and (pg_settings_max_locks_per_transaction * pg_settings_max_connections) > 0\"\n                severity: critical\n                for: 2m\n              - name: Postgresql bloat index high (> 80%)\n                description: \"The index {{ $labels.idxname }} is bloated. 
You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\"\n                query: \"pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000)\"\n                severity: warning\n                for: 1h\n                comments: |\n                  See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n              - name: Postgresql bloat table high (> 80%)\n                description: \"The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\"\n                query: \"pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000)\"\n                severity: warning\n                for: 1h\n                comments: |\n                  See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n              - name: Postgresql invalid index\n                description: \"The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. 
You should execute `DROP INDEX {{ $labels.indexrelname }};`\"\n                query: 'pg_general_index_info_pg_relation_size{indexrelname=~\".*ccnew.*\"}'\n                severity: warning\n                for: 6h\n                comments: |\n                  See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n              - name: Postgresql replication lag\n                description: The PostgreSQL replication lag is high (> 5s)\n                query: \"pg_replication_lag_seconds > 5\"\n                severity: warning\n                for: 30s\n\n      - name: SQL Server\n        exporters:\n          - name: Ozarklake/prometheus-mssql-exporter\n            slug: ozarklake-mssql-exporter\n            doc_url: https://github.com/Ozarklake/prometheus-mssql-exporter\n            rules:\n              - name: SQL Server down\n                description: SQL server instance is down\n                query: mssql_up == 0\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: SQL Server deadlock\n                description: SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s)\n                query: mssql_deadlocks > 5\n                severity: warning\n                for: 1m\n\n      - name: Oracle Database\n        exporters:\n          - name: iamseth/oracledb_exporter\n            slug: iamseth-oracledb-exporter\n            doc_url: https://github.com/iamseth/oracledb_exporter\n            rules:\n              - name: Oracle DB down\n                description: Oracle Database instance is down on {{ $labels.instance }}\n                query: \"oracledb_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Oracle DB 
sessions reaching limit (> 85%)\n                description: \"Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"oracledb_resource_current_utilization{resource_name=\\\"sessions\\\"} / oracledb_resource_limit_value{resource_name=\\\"sessions\\\"} * 100 > 85 and oracledb_resource_limit_value{resource_name=\\\"sessions\\\"} > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is workload-dependent. Adjust 85% to suit your environment.\n              - name: Oracle DB processes reaching limit (> 85%)\n                description: \"Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"oracledb_resource_current_utilization{resource_name=\\\"processes\\\"} / oracledb_resource_limit_value{resource_name=\\\"processes\\\"} * 100 > 85 and oracledb_resource_limit_value{resource_name=\\\"processes\\\"} > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is workload-dependent. 
Adjust 85% to suit your environment.\n              - name: Oracle DB tablespace reaching capacity (> 85%)\n                description: \"Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"oracledb_tablespace_used_percent > 85\"\n                severity: warning\n                for: 5m\n              - name: Oracle DB tablespace full (> 95%)\n                description: \"Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"oracledb_tablespace_used_percent > 95\"\n                severity: critical\n                for: 5m\n              - name: Oracle DB high user rollbacks\n                description: \"Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back)\"\n                query: \"rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100 > 20 and (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  A high rollback rate (>20%) often indicates application-level issues such as deadlocks, constraint violations, or poorly designed transactions.\n              - name: Oracle DB too many active sessions\n                description: \"Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }})\"\n                query: \"oracledb_sessions_value{status=\\\"ACTIVE\\\", type=\\\"USER\\\"} > 200\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is highly workload-dependent. 
Adjust 200 to suit your environment.\n              - name: Oracle DB high wait time (user I/O)\n                description: \"Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time\"\n                query: \"oracledb_wait_time_user_io > 300\"\n                severity: warning\n                for: 5m\n                comments: |\n                  The metric from v$waitclassmetric is already a normalized rate (centiseconds per second). Threshold 300 means 3 seconds of I/O wait per second of wall time.\n\n      - name: Patroni\n        exporters:\n          - name: Embedded exporter (Patroni >= 2.1.0)\n            slug: embedded-exporter-patroni\n            doc_url: https://patroni.readthedocs.io/en/latest/rest_api.html?highlight=prometheus#monitoring-endpoint\n            rules:\n              - name: Patroni has no Leader\n                description: A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}\n                query: (max by (scope) (patroni_primary) < 1) and (max by (scope) (patroni_standby_leader) < 1)\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n\n      - name: PGBouncer\n        exporters:\n          - name: spreaker/prometheus-pgbouncer-exporter\n            slug: spreaker-pgbouncer-exporter\n            doc_url: https://github.com/spreaker/prometheus-pgbouncer-exporter\n            rules:\n              - name: PGBouncer active connections\n                description: PGBouncer pools are filling up\n                query: \"pgbouncer_pools_server_active_connections > 200\"\n                severity: warning\n                for: 2m\n              - name: PGBouncer errors\n                description: PGBouncer is logging errors. 
This may be due to a server restart or an admin typing commands at the pgbouncer console.\n                query: 'increase(pgbouncer_errors_count{errmsg!=\"server conn crashed?\"}[1m]) > 10'\n                severity: warning\n              - name: PGBouncer max connections\n                description: The number of PGBouncer client connections has reached max_client_conn.\n                query: 'increase(pgbouncer_errors_count{errmsg=\"no more connections allowed (max_client_conn)\"}[2m]) > 0'\n                severity: critical\n\n      - name: Redis\n        exporters:\n          - name: oliver006/redis_exporter\n            slug: oliver006-redis-exporter\n            doc_url: https://github.com/oliver006/redis_exporter\n            rules:\n              - name: Redis down\n                description: Redis instance is down\n                query: \"redis_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Redis missing master\n                description: Redis cluster has no node marked as master.\n                query: '(count(redis_instance_info{role=\"master\"}) or vector(0)) < 1'\n                severity: critical\n              - name: Redis too many masters\n                description: Redis cluster has too many nodes marked as master.\n                query: 'count(redis_instance_info{role=\"master\"}) > 1'\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Redis disconnected slaves\n                description: Redis not replicating for all slaves. 
Consider reviewing the redis replication status.\n                query: \"count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0\"\n                severity: critical\n              - name: Redis replication broken\n                description: Redis instance lost a slave\n                query: \"delta(redis_connected_slaves[1m]) < 0\"\n                severity: critical\n              - name: Redis cluster flapping\n                description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n                query: \"changes(redis_connected_slaves[1m]) > 1\"\n                severity: critical\n                for: 2m\n              - name: Redis missing backup\n                description: Redis has not been backed up for 48 hours\n                query: \"time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 48\"\n                severity: critical\n              - name: Redis out of system memory\n                description: Redis is running out of system memory (> 90%)\n                query: \"redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 and redis_total_system_memory_bytes > 0\"\n                severity: warning\n                for: 2m\n                comments: |\n                  The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable.\n              - name: Redis out of configured maxmemory\n                description: Redis is running out of configured maxmemory (> 90%)\n                query: \"redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0\"\n                severity: warning\n                for: 2m\n              - name: Redis too many connections\n                description: Redis is running out of connections (> 90% 
used)\n                query: \"redis_connected_clients / redis_config_maxclients * 100 > 90 and redis_config_maxclients > 0\"\n                severity: warning\n                for: 2m\n              - name: Redis not enough connections\n                description: Redis instance should have more connections (> 5)\n                query: \"redis_connected_clients < 5\"\n                severity: warning\n                for: 2m\n              - name: Redis rejected connections\n                description: Some connections to Redis have been rejected\n                query: \"increase(redis_rejected_connections_total[1m]) > 5\"\n                severity: warning\n\n      - name: Memcached\n        exporters:\n          - name: prometheus/memcached_exporter\n            slug: memcached-exporter\n            doc_url: https://github.com/prometheus/memcached_exporter\n            rules:\n              - name: Memcached down\n                description: Memcached instance is down on {{ $labels.instance }}\n                query: \"memcached_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Memcached connection limit approaching (> 80%)\n                description: \"Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"(memcached_current_connections / memcached_max_connections * 100) > 80 and memcached_max_connections > 0\"\n                severity: warning\n                for: 2m\n              - name: Memcached connection limit approaching (> 95%)\n                description: \"Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"(memcached_current_connections / memcached_max_connections * 100) > 95 and memcached_max_connections > 0\"\n                severity: critical\n      
          for: 2m\n              - name: Memcached out of memory errors\n                description: \"Memcached is returning out-of-memory errors on {{ $labels.instance }}\"\n                query: \"sum without (slab) (rate(memcached_slab_items_outofmemory_total[5m])) > 0\"\n                severity: warning\n                for: 5m\n              - name: Memcached memory usage high (> 90%)\n                description: \"Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: \"(memcached_current_bytes / memcached_limit_bytes * 100) > 90 and memcached_limit_bytes > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  High memory usage is expected if the cache is well-utilized. This alert fires when it approaches the configured limit, which may cause evictions.\n              - name: Memcached high eviction rate\n                description: \"Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s)\"\n                query: \"rate(memcached_items_evicted_total[5m]) > 10\"\n                severity: warning\n                for: 5m\n                comments: |\n                  A sustained eviction rate indicates memory pressure. Consider increasing memcached memory limit or reducing cache usage. 
Threshold of 10 evictions/s is a rough default — adjust based on your workload.\n              - name: Memcached low cache hit rate (< 80%)\n                description: \"Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%)\"\n                query: '(rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) / (rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) + rate(memcached_commands_total{command=\"get\", status=\"miss\"}[5m])) * 100) < 80 and (rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) + rate(memcached_commands_total{command=\"get\", status=\"miss\"}[5m])) > 0'\n                severity: warning\n                for: 10m\n                comments: |\n                  A low hit rate may indicate poor cache utilization, incorrect cache keys, or TTLs that are too short. Threshold of 80% is a rough default — adjust based on your workload and access patterns.\n              - name: Memcached connections rejected\n                description: \"Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\"\n                query: \"increase(memcached_connections_rejected_total[5m]) > 0\"\n                severity: warning\n                for: 5m\n              - name: Memcached items too large\n                description: \"Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\"\n                query: \"increase(memcached_item_too_large_total[5m]) > 0\"\n                severity: info\n                for: 5m\n\n      - name: MongoDB\n        exporters:\n          - name: percona/mongodb_exporter\n            slug: percona-mongodb-exporter\n            doc_url: https://github.com/percona/mongodb_exporter\n            rules:\n              - name: MongoDB Down\n                description: MongoDB instance is down\n                query: \"mongodb_up == 0\"\n       
         severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Mongodb replica member unhealthy\n                description: MongoDB replica member is not healthy\n                query: \"mongodb_rs_members_health == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: MongoDB replication lag (Percona)\n                description: Mongodb replication lag is more than 10s\n                query: '(mongodb_rs_members_optimeDate{member_state=\"PRIMARY\"} - on (set) group_right mongodb_rs_members_optimeDate{member_state=\"SECONDARY\"}) / 1000 > 10'\n                severity: critical\n              - name: MongoDB replication headroom\n                description: MongoDB replication headroom is <= 0\n                query: 'sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state=\"PRIMARY\"} - on (set) group_right mongodb_rs_members_optimeDate{member_state=\"SECONDARY\"})) <= 0'\n                severity: critical\n                comments: |\n                  This query mixes old (mongodb_mongod_*) and new (mongodb_rs_*) metric names. 
It requires the Percona exporter to run with --compatible-mode to expose both.\n              - name: MongoDB number cursors open (Percona)\n                description: Too many cursors opened by MongoDB for clients (> 10k)\n                query: 'mongodb_ss_metrics_cursor_open{csr_type=\"total\"} > 10 * 1000'\n                severity: warning\n                for: 2m\n              - name: MongoDB cursors timeouts (Percona)\n                description: \"Too many cursors are timing out ({{ $value }} in the last minute)\"\n                query: \"increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100\"\n                severity: warning\n                for: 2m\n              - name: MongoDB too many connections (Percona)\n                description: Too many connections (> 80%)\n                query: 'mongodb_ss_connections{conn_type=\"current\"} / (mongodb_ss_connections{conn_type=\"current\"} + mongodb_ss_connections{conn_type=\"available\"}) * 100 > 80 and (mongodb_ss_connections{conn_type=\"current\"} + mongodb_ss_connections{conn_type=\"available\"}) > 0'\n                severity: warning\n                for: 2m\n\n          - name: dcu/mongodb_exporter\n            slug: dcu-mongodb-exporter\n            doc_url: https://github.com/dcu/mongodb_exporter\n            rules:\n              - name: MongoDB replication lag (DCU)\n                description: MongoDB replication lag is more than 10s\n                query: 'avg(mongodb_replset_member_optime_date{state=\"PRIMARY\"}) - avg(mongodb_replset_member_optime_date{state=\"SECONDARY\"}) > 10'\n                severity: critical\n              - name: MongoDB replication Status 3\n                description: MongoDB Replication set member is either performing startup self-checks, or transitioning from completing a rollback or resync\n                query: \"mongodb_replset_member_state == 3\"\n                severity: critical\n              - name: MongoDB replication Status 6\n                
description: MongoDB Replication set member as seen from another member of the set, is not yet known\n                query: \"mongodb_replset_member_state == 6\"\n                severity: critical\n              - name: MongoDB replication Status 8\n                description: MongoDB Replication set member as seen from another member of the set, is unreachable\n                query: \"mongodb_replset_member_state == 8\"\n                severity: critical\n              - name: MongoDB replication Status 9\n                description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n                query: \"mongodb_replset_member_state == 9\"\n                severity: critical\n              - name: MongoDB replication Status 10\n                description: MongoDB Replication set member was once in a replica set but was subsequently removed\n                query: \"mongodb_replset_member_state == 10\"\n                severity: critical\n              - name: MongoDB number cursors open (DCU)\n                description: Too many cursors opened by MongoDB for clients (> 10k)\n                query: 'mongodb_metrics_cursor_open{state=\"total_open\"} > 10000'\n                severity: warning\n                for: 2m\n              - name: MongoDB cursors timeouts (DCU)\n                description: \"Too many cursors are timing out ({{ $value }} in the last minute)\"\n                query: \"increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100\"\n                severity: warning\n                for: 2m\n              - name: MongoDB too many connections (DCU)\n                description: Too many connections (> 80%)\n                query: 'mongodb_connections{state=\"current\"} / (mongodb_connections{state=\"current\"} + mongodb_connections{state=\"available\"}) * 100 > 80 and (mongodb_connections{state=\"current\"} + mongodb_connections{state=\"available\"}) > 0'\n                severity: 
warning\n                for: 2m\n          - name: stefanprodan/mgob\n            slug: stefanprodan-mgob-exporter\n            doc_url: https://github.com/stefanprodan/mgob\n            rules:\n              - name: Mgob backup failed\n                description: MongoDB backup has failed\n                query: 'changes(mgob_scheduler_backup_total{status=\"500\"}[1h]) > 0'\n                severity: critical\n\n      - name: Elasticsearch\n        exporters:\n          - name: prometheus-community/elasticsearch_exporter\n            slug: prometheus-community-elasticsearch-exporter\n            doc_url: https://github.com/prometheus-community/elasticsearch_exporter\n            rules:\n              - name: Elasticsearch Heap Usage Too High\n                description: \"The heap usage is over 90%\"\n                query: '(elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"}) * 100 > 90 and elasticsearch_jvm_memory_max_bytes{area=\"heap\"} > 0'\n                severity: critical\n                for: 2m\n              - name: Elasticsearch Heap Usage warning\n                description: \"The heap usage is over 80%\"\n                query: '(elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"}) * 100 > 80 and elasticsearch_jvm_memory_max_bytes{area=\"heap\"} > 0'\n                severity: warning\n                for: 2m\n              - name: Elasticsearch disk out of space\n                description: The disk usage is over 90%\n                query: \"elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 and elasticsearch_filesystem_data_size_bytes > 0\"\n                severity: critical\n              - name: Elasticsearch disk space low\n                description: The disk usage is over 80%\n                query: \"elasticsearch_filesystem_data_available_bytes / 
elasticsearch_filesystem_data_size_bytes * 100 < 20 and elasticsearch_filesystem_data_size_bytes > 0\"\n                severity: warning\n                for: 2m\n              - name: Elasticsearch Cluster Red\n                description: Elastic Cluster Red status\n                query: 'elasticsearch_cluster_health_status{color=\"red\"} == 1'\n                severity: critical\n              - name: Elasticsearch Cluster Yellow\n                description: Elastic Cluster Yellow status\n                query: 'elasticsearch_cluster_health_status{color=\"yellow\"} == 1'\n                severity: warning\n              - name: Elasticsearch Healthy Nodes\n                description: \"Missing node in Elasticsearch cluster\"\n                query: \"elasticsearch_cluster_health_number_of_nodes < 3\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Elasticsearch Healthy Data Nodes\n                description: \"Missing data node in Elasticsearch cluster\"\n                query: \"elasticsearch_cluster_health_number_of_data_nodes < 3\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Elasticsearch relocating shards\n                description: \"Elasticsearch is relocating shards\"\n                query: \"elasticsearch_cluster_health_relocating_shards > 0\"\n                severity: info\n              - name: Elasticsearch relocating shards too long\n                description: \"Elasticsearch has been relocating shards for 15min\"\n                query: \"elasticsearch_cluster_health_relocating_shards > 0\"\n                severity: warning\n                for: 15m\n              - name: Elasticsearch initializing shards\n                description: \"Elasticsearch is 
initializing shards\"\n                query: \"elasticsearch_cluster_health_initializing_shards > 0\"\n                severity: info\n              - name: Elasticsearch initializing shards too long\n                description: \"Elasticsearch has been initializing shards for 15 min\"\n                query: \"elasticsearch_cluster_health_initializing_shards > 0\"\n                severity: warning\n                for: 15m\n              - name: Elasticsearch unassigned shards\n                description: \"Elasticsearch has unassigned shards\"\n                query: \"elasticsearch_cluster_health_unassigned_shards > 0\"\n                severity: critical\n                for: 2m\n              - name: Elasticsearch pending tasks\n                description: \"Elasticsearch has pending tasks. Cluster works slowly.\"\n                query: \"elasticsearch_cluster_health_number_of_pending_tasks > 0\"\n                severity: warning\n                for: 15m\n              - name: Elasticsearch no new documents\n                description: \"No new documents for 10 min!\"\n                query: 'increase(elasticsearch_indices_indexing_index_total{es_data_node=\"true\"}[10m]) < 1'\n                severity: warning\n              - name: Elasticsearch High Indexing Latency\n                description: \"The indexing latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\"\n                query: \"rate(elasticsearch_indices_indexing_index_time_seconds_total[1m]) / rate(elasticsearch_indices_indexing_index_total[1m]) > 0.0005 and rate(elasticsearch_indices_indexing_index_total[1m]) > 0\"\n                severity: warning\n                for: 10m\n              - name: Elasticsearch High Indexing Rate\n                description: \"The indexing rate on Elasticsearch cluster is higher than the threshold.\"\n                query: \"sum(rate(elasticsearch_indices_indexing_index_total[1m]))> 10000\"\n                
severity: warning\n                for: 5m\n              - name: Elasticsearch High Query Rate\n                description: \"The query rate on Elasticsearch cluster is higher than the threshold.\"\n                query: \"sum(rate(elasticsearch_indices_search_query_total[1m])) > 100\"\n                severity: warning\n                for: 5m\n              - name: Elasticsearch High Query Latency\n                description: \"The query latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\"\n                query: \"rate(elasticsearch_indices_search_query_time_seconds[1m]) / rate(elasticsearch_indices_search_query_total[1m]) > 1 and rate(elasticsearch_indices_search_query_total[1m]) > 0\"\n                severity: warning\n                for: 5m\n\n      - name: Meilisearch\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://github.com/orgs/meilisearch/discussions/625\n            rules:\n              - name: Meilisearch index is empty\n                description: Meilisearch index {{ $labels.index }} has zero documents\n                query: \"meilisearch_index_docs_count == 0\"\n                severity: warning\n              - name: Meilisearch http response time\n                description: Meilisearch http response time is too high\n                query: \"meilisearch_http_response_time_seconds > 0.5\"\n                severity: warning\n\n      - name: Cassandra\n        exporters:\n          - name: instaclustr/cassandra-exporter\n            slug: instaclustr-cassandra-exporter\n            doc_url: https://github.com/instaclustr/cassandra-exporter\n            rules:\n              - name: \"Cassandra Node is unavailable\"\n                description: \"Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\"\n                query: \"cassandra_endpoint_active < 1\"\n                
severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: \"Cassandra many compaction tasks are pending\"\n                description: \"Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\"\n                query: \"cassandra_table_estimated_pending_compactions > 100\"\n                severity: warning\n              - name: \"Cassandra commitlog pending tasks (Instaclustr)\"\n                description: \"Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\"\n                query: \"cassandra_commit_log_pending_tasks > 15\"\n                for: 2m\n                severity: warning\n              - name: \"Cassandra compaction executor blocked tasks (Instaclustr)\"\n                description: \"Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\"\n                query: 'cassandra_thread_pool_blocked_tasks{pool=\"CompactionExecutor\"} > 15'\n                for: 2m\n                severity: warning\n              - name: \"Cassandra flush writer blocked tasks (Instaclustr)\"\n                description: \"Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\"\n                query: 'cassandra_thread_pool_blocked_tasks{pool=\"MemtableFlushWriter\"} > 15'\n                for: 2m\n                severity: warning\n              - name: \"Cassandra connection timeouts total (Instaclustr)\"\n                description: \"Some connections between nodes are ending in timeout - {{ $labels.cassandra_cluster }}\"\n                query: \"sum by (cassandra_cluster,instance) (rate(cassandra_client_request_timeouts_total[5m])) > 5\"\n                for: 2m\n                severity: critical\n              - name: \"Cassandra storage exceptions (Instaclustr)\"\n                description: \"Something is going wrong with Cassandra storage - {{ 
$labels.cassandra_cluster }}\"\n                query: \"changes(cassandra_storage_exceptions_total[1m]) > 1\"\n                severity: critical\n              - name: \"Cassandra tombstone dump (Instaclustr)\"\n                description: \"Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\"\n                query: 'avg(cassandra_table_tombstones_scanned{quantile=\"0.99\"}) by (instance,cassandra_cluster,keyspace) > 100'\n                for: 2m\n                severity: critical\n              - name: \"Cassandra client request unavailable write (Instaclustr)\"\n                description: \"Some Cassandra client requests are unavailable to write - {{ $labels.cassandra_cluster }}\"\n                query: 'changes(cassandra_client_request_unavailable_exceptions_total{operation=\"write\"}[1m]) > 0'\n                for: 2m\n                severity: critical\n              - name: \"Cassandra client request unavailable read (Instaclustr)\"\n                description: \"Some Cassandra client requests are unavailable to read - {{ $labels.cassandra_cluster }}\"\n                query: 'changes(cassandra_client_request_unavailable_exceptions_total{operation=\"read\"}[1m]) > 0'\n                for: 2m\n                severity: critical\n              - name: \"Cassandra client request write failure (Instaclustr)\"\n                description: \"Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\"\n                query: 'increase(cassandra_client_request_failures_total{operation=\"write\"}[1m]) > 0'\n                for: 2m\n                severity: critical\n              - name: \"Cassandra client request read failure (Instaclustr)\"\n                description: \"Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\"\n                query: 'increase(cassandra_client_request_failures_total{operation=\"read\"}[1m]) > 0'\n         
       for: 2m\n                severity: critical\n\n          - name: criteo/cassandra_exporter\n            slug: criteo-cassandra-exporter\n            doc_url: https://github.com/criteo/cassandra_exporter\n            rules:\n              - name: Cassandra hints count\n                description: Cassandra hints count has changed on {{ $labels.instance }} some nodes may go down\n                query: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:storage:totalhints:count\"}[1m]) > 3'\n                severity: critical\n              - name: Cassandra compaction task pending\n                description: Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:compaction:pendingtasks:value\"} > 100'\n                severity: warning\n                for: 2m\n              - name: Cassandra viewwrite latency\n                description: High viewwrite latency on {{ $labels.instance }} cassandra node\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile\"} > 100000'\n                severity: warning\n                for: 2m\n              - name: Cassandra authentication failures\n                description: Increase of Cassandra authentication failures\n                query: 'rate(cassandra_stats{name=\"org:apache:cassandra:metrics:client:authfailure:count\"}[1m]) > 5'\n                severity: warning\n                for: 2m\n              - name: Cassandra node down\n                description: Cassandra node down\n                query: 'sum(cassandra_stats{name=\"org:apache:cassandra:net:failuredetector:downendpointcount\"}) by (service,group,cluster,env) > 0'\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an 
alert.\n              - name: Cassandra commitlog pending tasks (Criteo)\n                description: Unexpected number of Cassandra commitlog pending tasks\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:commitlog:pendingtasks:value\"} > 15'\n                severity: warning\n                for: 2m\n              - name: Cassandra compaction executor blocked tasks (Criteo)\n                description: Some Cassandra compaction executor tasks are blocked\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count\"} > 0'\n                severity: warning\n                for: 2m\n              - name: Cassandra flush writer blocked tasks (Criteo)\n                description: Some Cassandra flush writer tasks are blocked\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count\"} > 0'\n                severity: warning\n                for: 2m\n              - name: Cassandra repair pending tasks\n                description: Some Cassandra repair tasks are pending\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value\"} > 2'\n                severity: warning\n                for: 2m\n              - name: Cassandra repair blocked tasks\n                description: Some Cassandra repair tasks are blocked\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count\"} > 0'\n                severity: warning\n                for: 2m\n              - name: Cassandra connection timeouts total (Criteo)\n                description: Some connections between nodes are ending in timeout\n                query: 'rate(cassandra_stats{name=\"org:apache:cassandra:metrics:connection:totaltimeouts:count\"}[1m]) > 
5'\n                severity: critical\n                for: 2m\n              - name: Cassandra storage exceptions (Criteo)\n                description: Something is going wrong with Cassandra storage\n                query: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:storage:exceptions:count\"}[1m]) > 1'\n                severity: critical\n              - name: Cassandra tombstone dump (Criteo)\n                description: Too many tombstones scanned in queries\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile\"} > 1000'\n                severity: critical\n              - name: Cassandra client request unavailable write (Criteo)\n                description: Write failures have occurred because too many nodes are unavailable\n                query: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:write:unavailables:count\"}[1m]) > 0'\n                severity: critical\n              - name: Cassandra client request unavailable read (Criteo)\n                description: Read failures have occurred because too many nodes are unavailable\n                query: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:read:unavailables:count\"}[1m]) > 0'\n                severity: critical\n              - name: Cassandra client request write failure (Criteo)\n                description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate\"} > 0'\n                severity: critical\n              - name: Cassandra client request read failure (Criteo)\n                description: A lot of read failures encountered. 
A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find to the root cause. The most common cause for this type of error is when batch sizes are too large.\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate\"} > 0'\n                severity: critical\n              - name: Cassandra cache hit rate key cache\n                description: Key cache hit rate is below 85%\n                query: 'cassandra_stats{name=\"org:apache:cassandra:metrics:cache:keycache:hitrate:value\"} < .85'\n                severity: critical\n                for: 2m\n\n      - name: Clickhouse\n        exporters:\n          - name: Embedded Exporter\n            slug: embedded-exporter\n            doc_url: https://clickhouse.com/docs/en/operations/system-tables/metrics\n            rules:\n              - name: ClickHouse node down\n                description: \"No metrics received from ClickHouse exporter for over 2 minutes.\"\n                query: 'up{job=\"clickhouse\"} == 0'\n                severity: critical\n                for: 2m\n                comments: |\n                  Adjust the job label to match your Prometheus configuration.\n              - name: ClickHouse Memory Usage Critical\n                description: \"Memory usage is critically high, over 90%.\"\n                query: \"ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0\"\n                severity: critical\n                for: 5m\n              - name: ClickHouse Memory Usage Warning\n                description: \"Memory usage is over 80%.\"\n                query: \"ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0\"\n                severity: warning\n                for: 5m\n              - name: 
ClickHouse Disk Space Low on Default\n                description: \"Disk space on default is below 20%.\"\n                query: \"ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0\"\n                severity: warning\n                for: 2m\n              - name: ClickHouse Disk Space Critical on Default\n                description: \"Disk space on default disk is critically low, below 10%.\"\n                query: \"ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0\"\n                severity: critical\n                for: 2m\n              - name: ClickHouse Disk Space Low on Backups\n                description: \"Disk space on backups is below 20%.\"\n                query: \"ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) > 0\"\n                severity: warning\n                for: 2m\n              - name: ClickHouse Replica Errors\n                description: \"Critical replica errors detected, either all replicas are stale or lost.\"\n                query: \"ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1\"\n                severity: critical\n\n              - name: ClickHouse No Available Replicas\n                description: \"No available replicas in ClickHouse.\"\n                query: \"ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1\"\n                severity: critical\n\n              - name: ClickHouse No Live 
Replicas\n                description: \"There are too few live replicas available, risking data loss and service disruption.\"\n                query: \"ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1\"\n                severity: critical\n\n              - name: ClickHouse High TCP Connections\n                description: \"High number of TCP connections, indicating heavy client or inter-cluster communication.\"\n                query: \"ClickHouseMetrics_TCPConnection > 400\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Please replace the threshold with an appropriate value\n              - name: ClickHouse Interserver Connection Issues\n                description: \"High number of interserver connections may indicate replication or distributed query handling issues.\"\n                query: \"ClickHouseMetrics_InterserverConnection > 50\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Adjust the threshold based on your cluster size and expected replication traffic.\n              - name: ClickHouse ZooKeeper Connection Issues\n                description: \"ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\"\n                query: \"ClickHouseMetrics_ZooKeeperSession != 1\"\n                severity: warning\n                for: 3m\n              - name: ClickHouse Authentication Failures\n                description: \"Authentication failures detected, indicating potential security issues or misconfiguration.\"\n                query: \"increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 3\"\n                severity: info\n\n              - name: ClickHouse Access Denied Errors\n                description: \"Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\"\n                query: 
\"increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 3\"\n                severity: info\n\n              - name: ClickHouse rejected insert queries\n                description: \"INSERTs rejected due to too many active data parts. Reduce insert frequency.\"\n                query: \"increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 0\"\n                severity: warning\n                for: 1m\n              - name: ClickHouse delayed insert queries\n                description: \"INSERTs delayed due to high number of active parts.\"\n                query: \"increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 0\"\n                severity: warning\n                for: 2m\n              - name: ClickHouse zookeeper hardware exception\n                description: \"Zookeeper hardware exception: network issues communicating with ZooKeeper\"\n                query: \"increase(ClickHouseProfileEvents_ZooKeeperHardwareExceptions[1m]) > 0\"\n                severity: critical\n                for: 1m\n              - name: ClickHouse high network usage\n                description: High network usage. 
ClickHouse network usage exceeds 100MB/s.\n                query: \"rate(ClickHouseProfileEvents_NetworkSendBytes[1m]) > 100*1024*1024 or rate(ClickHouseProfileEvents_NetworkReceiveBytes[1m]) > 100*1024*1024\"\n                severity: warning\n                for: 2m\n                comments: |\n                  Please replace the threshold with an appropriate value\n              - name: ClickHouse distributed rejected inserts\n                description: \"INSERTs into Distributed tables rejected due to pending bytes limit.\"\n                query: \"increase(ClickHouseProfileEvents_DistributedRejectedInserts[5m]) > 0\"\n                severity: critical\n                for: 2m\n\n      - name: CouchDB\n        exporters:\n          - name: gesellix/couchdb-prometheus-exporter\n            slug: gesellix-couchdb-prometheus-exporter\n            doc_url: https://github.com/gesellix/couchdb-prometheus-exporter\n            rules:\n              - name: CouchDB node down\n                description: CouchDB node is not responding (node_up metric is 0) for more than 2 minutes\n                query: \"couchdb_httpd_node_up == 0 or couchdb_httpd_up == 0\"\n                severity: critical\n                for: 2m\n              - name: CouchDB atom memory usage critical\n                description: Atom memory usage is above 90% of limit\n                query: \"couchdb_erlang_memory_atom_used > 0.9 * couchdb_erlang_memory_atom\"\n                severity: critical\n                for: 5m\n              - name: CouchDB open databases critical\n                description: Number of open databases exceeds 90% of node capacity\n                query: \"couchdb_httpd_open_databases > 0.9 * 1000\"\n                severity: critical\n                for: 5m\n              - name: CouchDB open OS files critical\n                description: CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files\n                query: 
\"couchdb_httpd_open_os_files > 0.9 * 65535\"\n                severity: critical\n                for: 5m\n              - name: CouchDB 5xx error ratio high\n                description: More than 5% of HTTP requests are returning 5xx errors\n                query: \"rate(couchdb_httpd_status_codes{code=~\\\"5..\\\"}[5m]) / rate(couchdb_httpd_requests[5m]) > 0.05 and rate(couchdb_httpd_requests[5m]) > 0\"\n                severity: critical\n                for: 5m\n              - name: CouchDB temporary view read rate critical\n                description: Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation\n                query: \"rate(couchdb_httpd_temporary_view_reads[5m]) > 100\"\n                severity: critical\n                for: 5m\n              - name: CouchDB Mango queries scanning too many docs\n                description: Some Mango queries are scanning too many documents, consider adding indexes\n                query: \"rate(couchdb_mango_too_many_docs_scanned[5m]) > 50\"\n                severity: warning\n                for: 5m\n              - name: CouchDB Mango queries failed due to invalid index\n                description: Some Mango queries failed to execute because the index was missing or invalid\n                query: \"rate(couchdb_mango_query_invalid_index[5m]) > 5\"\n                severity: warning\n                for: 5m\n              - name: CouchDB Mango docs examined high\n                description: High number of documents examined per Mango queries, consider indexing\n                query: \"rate(couchdb_mango_docs_examined[5m]) > 1000\"\n                severity: warning\n                for: 5m\n              - name: CouchDB Replicator manager died\n                description: Replication manager process has crashed\n                query: \"increase(couchdb_replicator_changes_manager_deaths[5m]) > 0\"\n                severity: critical\n                for: 1m\n             
 - name: CouchDB Replicator queue process died\n                description: Replication queue process has crashed\n                query: \"increase(couchdb_replicator_changes_queue_deaths[5m]) > 0\"\n                severity: critical\n                for: 1m\n              - name: CouchDB Replicator reader process died\n                description: Replication reader process has crashed\n                query: \"increase(couchdb_replicator_changes_reader_deaths[5m]) > 0\"\n                severity: critical\n                for: 1m\n              - name: CouchDB Replicator failed to start\n                description: One or more replication tasks failed to start\n                query: \"increase(couchdb_replicator_failed_starts[5m]) > 0\"\n                severity: critical\n                for: 1m\n              - name: CouchDB replication cluster unstable\n                description: The replication cluster is unstable, replication may be interrupted\n                query: \"couchdb_replicator_cluster_is_stable == 0\"\n                severity: critical\n                for: 2m\n              - name: CouchDB replication read failures\n                description: Replication changes feed has failed reads more than 5 times in 5 minutes\n                query: \"increase(couchdb_replicator_changes_read_failures[5m]) > 5\"\n                severity: warning\n                for: 5m\n              - name: CouchDB file descriptors high\n                description: Process is using more than 85% of allowed file descriptors\n                query: \"process_open_fds / process_max_fds > 0.85 and process_max_fds > 0\"\n                severity: warning\n                for: 5m\n              - name: CouchDB process restarted\n                description: CouchDB process has restarted recently\n                query: \"changes(process_start_time_seconds[1h]) > 0\"\n                severity: info\n                for: 1m\n              - name: CouchDB critical log 
entries\n                description: Critical or error log entries detected in the last 5 minutes\n                query: \"increase(couchdb_server_couch_log{level=~\\\"error|critical\\\"}[5m]) > 0\"\n                severity: critical\n                for: 1m\n\n      - name: Solr\n        exporters:\n          - name: embedded exporter\n            slug: embedded-exporter\n            doc_url: https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html\n            rules:\n              - name: Solr update errors\n                description: Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n                query: \"increase(solr_metrics_core_update_handler_errors_total[1m]) > 1\"\n                severity: critical\n              - name: Solr query errors\n                description: Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n                query: 'increase(solr_metrics_core_errors_total{category=\"QUERY\"}[1m]) > 1'\n                severity: warning\n                for: 5m\n              - name: Solr replication errors\n                description: Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n                query: 'increase(solr_metrics_core_errors_total{category=\"REPLICATION\"}[1m]) > 1'\n                severity: critical\n              - name: Solr low live node count\n                description: Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n                query: \"solr_collections_live_nodes < 2\"\n                severity: critical\n\n  - name: Message brokers\n    services:\n      - name: RabbitMQ\n        exporters:\n          - name: rabbitmq/rabbitmq-prometheus\n            slug: 
rabbitmq-exporter\n            doc_url: https://github.com/rabbitmq/rabbitmq-prometheus\n            rules:\n              - name: RabbitMQ node down\n                description: Less than 3 nodes running in RabbitMQ cluster\n                query: \"sum(rabbitmq_build_info) < 3\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: RabbitMQ node not distributed\n                description: Distribution link state is not 'up'\n                query: \"erlang_vm_dist_node_state < 3\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: RabbitMQ instances different versions\n                description: Running different version of RabbitMQ in the same cluster, can lead to failure.\n                query: \"count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1\"\n                severity: warning\n                for: 1h\n              - name: RabbitMQ memory high\n                description: A node use more than 90% of allocated RAM\n                query: \"rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 and rabbitmq_resident_memory_limit_bytes > 0\"\n                severity: warning\n                for: 2m\n              - name: RabbitMQ file descriptors usage\n                description: A node use more than 90% of file descriptors\n                query: \"rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 and rabbitmq_process_max_fds > 0\"\n                severity: warning\n                for: 2m\n              - name: RabbitMQ too many ready messages\n                description: RabbitMQ too many ready messages on {{ $labels.instance }}\n                query: \"sum(rabbitmq_queue_messages_ready) BY (queue) > 1000\"\n             
   severity: warning\n                for: 1m\n              - name: RabbitMQ too many unack messages\n                description: Too many unacknowledged messages\n                query: \"sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000\"\n                severity: warning\n                for: 1m\n              - name: RabbitMQ too many connections\n                description: The total connections of a node is too high\n                query: \"rabbitmq_connections > 1000\"\n                severity: warning\n                for: 2m\n              - name: RabbitMQ no queue consumer\n                description: A queue has less than 1 consumer\n                query: \"rabbitmq_queue_consumers < 1\"\n                severity: warning\n                for: 1m # allows a short service restart\n              - name: RabbitMQ unroutable messages\n                description: A queue has unroutable messages ({{ $value }} in the last 1m)\n                query: \"increase(rabbitmq_channel_messages_unroutable_returned_total[1m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[1m]) > 0\"\n                severity: warning\n                for: 2m\n\n          - name: kbudde/rabbitmq-exporter\n            slug: kbudde-rabbitmq-exporter\n            doc_url: https://github.com/kbudde/rabbitmq_exporter\n            rules:\n              - name: RabbitMQ down\n                description: RabbitMQ node down\n                query: \"rabbitmq_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: RabbitMQ cluster down\n                description: Less than 3 nodes running in RabbitMQ cluster\n                query: \"sum(rabbitmq_running) < 3\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n 
             - name: RabbitMQ cluster partition\n                description: Cluster partition\n                query: \"rabbitmq_partitions > 0\"\n                severity: critical\n              - name: RabbitMQ out of memory\n                description: Memory available for RabbitMQ is low (< 10%)\n                query: \"rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 and rabbitmq_node_mem_limit > 0\"\n                severity: warning\n                for: 2m\n              - name: RabbitMQ instance too many connections\n                description: RabbitMQ instance has too many connections (> 1000)\n                query: \"rabbitmq_connectionsTotal > 1000\"\n                severity: warning\n                for: 2m\n              - name: RabbitMQ dead letter queue filling up\n                description: Dead letter queue is filling up (> 10 msgs)\n                query: 'rabbitmq_queue_messages{queue=\"my-dead-letter-queue\"} > 10'\n                severity: warning\n                for: 1m\n                comments: |\n                  Indicate the queue name in dedicated label.\n              - name: RabbitMQ too many messages in queue\n                description: Queue is filling up (> 1000 msgs)\n                query: 'rabbitmq_queue_messages_ready{queue=\"my-queue\"} > 1000'\n                severity: warning\n                for: 2m\n                comments: |\n                  Indicate the queue name in dedicated label.\n              - name: RabbitMQ slow queue consuming\n                description: Queue messages are consumed slowly (> 60s)\n                query: 'time() - rabbitmq_queue_head_message_timestamp{queue=\"my-queue\"} > 60'\n                severity: warning\n                for: 2m\n                comments: |\n                  Indicate the queue name in dedicated label.\n              - name: RabbitMQ no consumer\n                description: Queue has no consumer\n                query: 
\"rabbitmq_queue_consumers == 0\"\n                severity: critical\n                for: 5m\n                comments: |\n                  Allows a short service restart.\n              - name: RabbitMQ too many consumers\n                description: Queue should have only 1 consumer\n                query: 'rabbitmq_queue_consumers{queue=\"my-queue\"} > 1'\n                severity: critical\n                comments: |\n                  Indicate the queue name in dedicated label.\n              - name: RabbitMQ inactive exchange\n                description: Exchange receive less than 5 msgs per second\n                query: 'rate(rabbitmq_exchange_messages_published_in_total{exchange=\"my-exchange\"}[1m]) < 5'\n                severity: warning\n                comments: |\n                  Indicate the exchange name in dedicated label.\n                for: 2m\n\n      - name: Zookeeper\n        exporters:\n          - name: cloudflare/kafka_zookeeper_exporter\n            slug: cloudflare-kafka-zookeeper-exporter\n            doc_url: https://github.com/cloudflare/kafka_zookeeper_exporter\n            rules:\n          - name: dabealu/zookeeper-exporter\n            slug: dabealu-zookeeper-exporter\n            doc_url: https://github.com/dabealu/zookeeper-exporter\n            rules:\n              - name: Zookeeper Down\n                description: \"Zookeeper down on instance {{ $labels.instance }}\"\n                query: \"zk_up == 0\"\n                severity: critical\n                for: 1m\n                comments: |\n                  1m delay allows a restart without triggering an alert.\n              - name: Zookeeper missing leader\n                description: \"Zookeeper cluster has no node marked as leader\"\n                query: \"sum(zk_server_leader) == 0\"\n                severity: critical\n              - name: Zookeeper Too Many Leaders\n                description: \"Zookeeper cluster has too many nodes marked as 
leader\"\n                query: \"sum(zk_server_leader) > 1\"\n                severity: critical\n              - name: Zookeeper Not Ok\n                description: \"Zookeeper instance is not ok\"\n                query: \"zk_ruok == 0\"\n                severity: warning\n                for: 3m\n\n      - name: Kafka\n        exporters:\n          - name: danielqsj/kafka_exporter\n            slug: danielqsj-kafka-exporter\n            doc_url: https://github.com/danielqsj/kafka_exporter\n            rules:\n              - name: Kafka topics replicas\n                description: Kafka topic in-sync partition\n                query: \"min(kafka_topic_partition_in_sync_replica) by (topic) < 3\"\n                severity: critical\n              - name: Kafka consumer group lag\n                description: Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages)\n                query: \"sum(kafka_consumergroup_lag) by (consumergroup) > 10000\"\n                severity: warning\n                for: 1m\n          - name: linkedin/Burrow\n            slug: linkedin-kafka-exporter\n            doc_url: https://github.com/linkedin/Burrow\n            rules:\n              - name: Kafka topic offset decreased\n                description: Kafka topic offset has decreased\n                query: \"delta(kafka_burrow_partition_current_offset[1m]) < 0\"\n                severity: warning\n              - name: Kafka consumer lag\n                description: Kafka consumer has a 30 minutes and increasing lag\n                query: \"kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset >= (kafka_burrow_topic_partition_offset offset 15m - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset offset 15m) AND kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset > 0\"\n       
         severity: warning\n                for: 15m\n\n      - name: Pulsar\n        exporters:\n          - name: embedded exporter\n            slug: embedded-exporter\n            doc_url: https://pulsar.apache.org/docs/reference-metrics/\n            rules:\n              - name: Pulsar subscription high number of backlog entries\n                description: \"The number of subscription backlog entries is over 5k\"\n                query: sum(pulsar_subscription_back_log) by (subscription) > 5000\n                for: 1h\n                severity: warning\n              - name: Pulsar subscription very high number of backlog entries\n                description: \"The number of subscription backlog entries is over 100k\"\n                query: sum(pulsar_subscription_back_log) by (subscription) > 100000\n                for: 1h\n                severity: critical\n              - name: Pulsar topic large backlog storage size\n                description: \"The topic backlog storage size is over 5 GB\"\n                query: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024\n                for: 1h\n                severity: warning\n              - name: Pulsar topic very large backlog storage size\n                description: \"The topic backlog storage size is over 20 GB\"\n                query: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024\n                for: 1h\n                severity: critical\n              - name: Pulsar high write latency\n                description: \"Messages cannot be written in a timely fashion\"\n                query: sum(pulsar_storage_write_latency_overflow > 0) by (topic)\n                for: 1h\n                severity: critical\n              - name: Pulsar large message payload\n                description: \"Observing large message payload (> 1MB)\"\n                query: sum(pulsar_entry_size_overflow > 0) by (topic)\n                for: 1h\n                severity: warning\n              - 
name: Pulsar high ledger disk usage\n                description: \"Observing Ledger Disk Usage (> 75%)\"\n                query: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75\n                for: 1h\n                severity: critical\n              - name: Pulsar read only bookies\n                description: \"Observing Readonly Bookies\"\n                query: count(bookie_SERVER_STATUS{} == 0) by (pod)\n                for: 5m\n                severity: critical\n              - name: Pulsar high number of function errors\n                description: \"Observing more than 10 Function errors per minute\"\n                query: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10\n                for: 1m\n                severity: critical\n              - name: Pulsar high number of sink errors\n                description: \"Observing more than 10 Sink errors per minute\"\n                query: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10\n                for: 1m\n                severity: critical\n\n      - name: Nats\n        exporters:\n          - name: nats-io/prometheus-nats-exporter\n            slug: nats-exporter\n            doc_url: https://github.com/nats-io/prometheus-nats-exporter\n            rules:\n              - name: Nats high routes count\n                description: High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n                query: \"gnatsd_varz_routes > 10\"\n                severity: warning\n                for: 3m\n              - name: Nats high memory usage\n                description: NATS server memory usage is above 200MB for {{ $labels.instance }}\n                query: \"gnatsd_varz_mem > 200 * 1024 * 1024\"\n                severity: warning\n                for: 5m\n              - name: Nats slow consumers\n                description: There are slow consumers 
in NATS for {{ $labels.instance }}\n                query: \"gnatsd_varz_slow_consumers > 0\"\n                severity: critical\n                for: 3m\n              - name: Nats server down\n                description: NATS server has been down for more than 5 minutes\n                query: 'absent(up{job=\"nats\"})'\n                severity: critical\n                for: 5m\n              - name: Nats high CPU usage\n                description: NATS server is using more than 80% CPU for the last 5 minutes\n                query: \"gnatsd_varz_cpu > 80\"\n                severity: warning\n                for: 5m\n                comments: |\n                  gnatsd_varz_cpu is a gauge reporting CPU percentage (0-100 scale).\n              - name: Nats high number of connections\n                description: NATS server has more than 1000 active connections\n                query: \"gnatsd_connz_num_connections > 1000\"\n                severity: warning\n                for: 5m\n              - name: Nats high JetStream store usage\n                description: JetStream store usage is over 80%\n                query: \"gnatsd_varz_jetstream_stats_storage / gnatsd_varz_jetstream_config_max_storage > 0.8 and gnatsd_varz_jetstream_config_max_storage > 0\"\n                severity: warning\n                for: 5m\n              - name: Nats high JetStream memory usage\n                description: JetStream memory usage is over 80%\n                query: \"gnatsd_varz_jetstream_stats_memory / gnatsd_varz_jetstream_config_max_memory > 0.8 and gnatsd_varz_jetstream_config_max_memory > 0\"\n                severity: warning\n                for: 5m\n              - name: Nats high number of subscriptions\n                description: NATS server has more than 1000 active subscriptions\n                query: \"gnatsd_connz_subscriptions > 1000\"\n                severity: warning\n                for: 5m\n              - name: Nats high pending bytes\n     
           description: NATS server has more than 100,000 pending bytes\n                query: \"gnatsd_connz_pending_bytes > 100000\"\n                severity: warning\n                for: 5m\n              - name: Nats too many errors\n                description: NATS server has encountered {{ $value }} JetStream API errors in the last 5 minutes\n                query: \"increase(gnatsd_varz_jetstream_stats_api_errors[5m]) > 0\"\n                severity: warning\n                for: 5m\n              - name: Nats JetStream accounts exceeded\n                description: JetStream has more than 100 active accounts\n                query: \"sum(gnatsd_varz_jetstream_stats_accounts) > 100\"\n                severity: warning\n                for: 5m\n              - name: Nats leaf node connection issue\n                description: No leaf node connections on {{ $labels.instance }}\n                query: \"gnatsd_varz_leafnodes == 0\"\n                severity: warning\n                for: 5m\n\n  - name: Proxies, load balancers and service meshes\n    services:\n      - name: Nginx\n        exporters:\n          - name: knyar/nginx-lua-prometheus\n            slug: knyar-nginx-exporter\n            doc_url: https://github.com/knyar/nginx-lua-prometheus\n            rules:\n              - name: Nginx high HTTP 4xx error rate\n                description: Too many HTTP requests with status 4xx (> 5%)\n                query: 'sum(rate(nginx_http_requests_total{status=~\"^4..\"}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'\n                severity: critical\n                for: 1m\n              - name: Nginx high HTTP 5xx error rate\n                description: Too many HTTP requests with status 5xx (> 5%)\n                query: 'sum(rate(nginx_http_requests_total{status=~\"^5..\"}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'\n  
              severity: critical\n                for: 1m\n              - name: Nginx latency high\n                description: Nginx p99 latency is higher than 3 seconds\n                query: \"histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3\"\n                severity: warning\n                for: 2m\n\n      - name: Apache\n        exporters:\n          - name: Lusitaniae/apache_exporter\n            slug: lusitaniae-apache-exporter\n            doc_url: https://github.com/Lusitaniae/apache_exporter\n            rules:\n              - name: Apache down\n                description: Apache down\n                query: \"apache_up == 0\"\n                severity: critical\n              - name: Apache workers load\n                description: Apache workers in busy state approach the max workers count 80% workers busy on {{ $labels.instance }}\n                query: '(sum by (instance) (apache_workers{state=\"busy\"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 and sum by (instance) (apache_scoreboard) > 0'\n                severity: warning\n                for: 2m\n              - name: Apache restart\n                description: Apache has just been restarted.\n                query: \"apache_uptime_seconds_total / 60 < 1\"\n                severity: warning\n\n      - name: HaProxy\n        exporters:\n          - name: Embedded exporter (HAProxy >= v2)\n            slug: embedded-exporter-v2\n            doc_url: https://github.com/haproxy/haproxy/tree/master/contrib/prometheus-exporter\n            rules:\n              - name: HAProxy high HTTP 4xx error rate backend\n                description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n                query: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by 
(proxy) (rate(haproxy_server_http_responses_total[1m])) > 0\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 5xx error rate backend\n                description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n                query: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (proxy) (rate(haproxy_server_http_responses_total[1m])) > 0\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 4xx error rate server\n                description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n                query: ((sum by (server) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 5xx error rate server\n                description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n                query: ((sum by (server) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0\n                severity: critical\n                for: 1m\n              - name: HAProxy server response errors\n                description: Too many response errors to {{ $labels.server }} server (> 5%).\n                query: (sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0\n                
severity: critical\n                for: 1m\n              - name: HAProxy backend connection errors\n                description: Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n                query: (sum by (proxy) (rate(haproxy_backend_connection_errors_total[1m]))) > 100\n                severity: critical\n                for: 1m\n              - name: HAProxy server connection errors\n                description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n                query: (sum by (server) (rate(haproxy_server_connection_errors_total[1m]))) > 100\n                severity: critical\n              - name: HAProxy backend max active session > 80%\n                description: Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n                query: ((haproxy_backend_current_sessions >0) * 100) / (haproxy_backend_limit_sessions > 0) > 80\n                severity: warning\n                for: 2m\n              - name: HAProxy pending requests\n                description: Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n                query: sum by (proxy) (haproxy_backend_current_queue) > 0\n                comments: |\n                  haproxy_backend_current_queue is a gauge (current queue depth), not a counter.\n                severity: warning\n                for: 2m\n              - name: HAProxy HTTP slowing down\n                description: Average request time is increasing - {{ $value | printf \"%.2f\"}}\n                query: avg by (instance, proxy) (haproxy_backend_max_total_time_seconds) > 1\n                severity: warning\n                for: 1m\n              - name: HAProxy retry high\n                description: High rate of retry on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n     
           query: sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10\n                severity: warning\n                for: 2m\n              - name: HAProxy has no alive backends\n                description: HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n                query: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0\n                severity: critical\n              - name: HAProxy frontend security blocked requests\n                description: HAProxy is blocking requests for security reasons\n                query: sum by (proxy) (rate(haproxy_frontend_denied_connections_total[2m])) > 10\n                severity: warning\n                for: 2m\n              - name: HAProxy server healthcheck failure\n                description: Some server health checks are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\n                query: increase(haproxy_server_check_failures_total[1m]) > 0\n                severity: warning\n                for: 1m\n          - name: prometheus/haproxy_exporter (HAProxy < v2)\n            slug: haproxy-exporter-v1\n            doc_url: https://github.com/prometheus/haproxy_exporter\n            rules:\n              - name: HAProxy down\n                description: HAProxy down\n                query: \"haproxy_up == 0\"\n                severity: critical\n              - name: HAProxy high HTTP 4xx error rate backend (v1)\n                description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n                query: 'sum by (backend) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m]) * 100) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 5xx error rate backend (v1)\n                
description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n                query: 'sum by (backend) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m]) * 100) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 4xx error rate server (v1)\n                description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n                query: 'sum by (server) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n                severity: critical\n                for: 1m\n              - name: HAProxy high HTTP 5xx error rate server (v1)\n                description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n                query: 'sum by (server) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n                severity: critical\n                for: 1m\n              - name: HAProxy server response errors (v1)\n                description: Too many response errors to {{ $labels.server }} server (> 5%).\n                query: \"sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0\"\n                severity: critical\n                for: 1m\n              - name: HAProxy backend connection errors (v1)\n                description: Too many connection errors to {{ $labels.fqdn }}/{{ 
$labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n                query: \"sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100\"\n                severity: critical\n                for: 1m\n              - name: HAProxy server connection errors (v1)\n                description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n                query: \"sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100\"\n                severity: critical\n              - name: HAProxy backend max active session\n                description: HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n                query: \"((sum by (backend) (haproxy_backend_current_sessions * 100) / sum by (backend) (haproxy_backend_limit_sessions))) > 80 and sum by (backend) (haproxy_backend_limit_sessions) > 0\"\n                severity: warning\n                for: 2m\n              - name: HAProxy pending requests (v1)\n                description: Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n                query: \"sum by (backend) (haproxy_backend_current_queue) > 0\"\n                severity: warning\n                for: 2m\n              - name: HAProxy HTTP slowing down (v1)\n                description: Average request time is increasing\n                query: \"avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1\"\n                severity: warning\n                for: 1m\n              - name: HAProxy retry high (v1)\n                description: High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n                query: \"sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10\"\n                severity: warning\n                for: 2m\n              - name: HAProxy backend down\n                description: 
HAProxy backend is down\n                query: \"haproxy_backend_up == 0\"\n                severity: critical\n              - name: HAProxy server down\n                description: HAProxy server is down\n                query: \"haproxy_server_up == 0\"\n                severity: critical\n              - name: HAProxy frontend security blocked requests (v1)\n                description: HAProxy is blocking requests for security reasons\n                query: \"sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10\"\n                severity: warning\n                for: 2m\n              - name: HAProxy server healthcheck failure (v1)\n                description: Some server health checks are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\n                query: \"increase(haproxy_server_check_failures_total[1m]) > 0\"\n                severity: warning\n                for: 1m\n\n      - name: Traefik\n        exporters:\n          - name: Embedded exporter v2\n            slug: embedded-exporter-v2\n            doc_url: https://docs.traefik.io/observability/metrics/prometheus/\n            rules:\n              - name: Traefik service down\n                description: All Traefik services are down\n                query: \"count(traefik_service_server_up) by (service) == 0\"\n                severity: critical\n              - name: Traefik high HTTP 4xx error rate service\n                description: Traefik service 4xx error rate is above 5%\n                query: 'sum(rate(traefik_service_requests_total{code=~\"4.*\"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 and sum(rate(traefik_service_requests_total[3m])) by (service) > 0'\n                severity: critical\n                for: 1m\n              - name: Traefik high HTTP 5xx error rate service\n                description: Traefik service 5xx error rate is above 5%\n                query: 
'sum(rate(traefik_service_requests_total{code=~\"5.*\"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 and sum(rate(traefik_service_requests_total[3m])) by (service) > 0'\n                severity: critical\n                for: 1m\n          - name: Embedded exporter v1\n            slug: embedded-exporter-v1\n            doc_url: https://docs.traefik.io/observability/metrics/prometheus/\n            rules:\n              - name: Traefik backend down\n                description: All Traefik backends are down\n                query: \"count(traefik_backend_server_up) by (backend) == 0\"\n                severity: critical\n              - name: Traefik high HTTP 4xx error rate backend\n                description: Traefik backend 4xx error rate is above 5%\n                query: 'sum(rate(traefik_backend_requests_total{code=~\"4.*\"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 and sum(rate(traefik_backend_requests_total[3m])) by (backend) > 0'\n                severity: critical\n                for: 1m\n              - name: Traefik high HTTP 5xx error rate backend\n                description: Traefik backend 5xx error rate is above 5%\n                query: 'sum(rate(traefik_backend_requests_total{code=~\"5.*\"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 and sum(rate(traefik_backend_requests_total[3m])) by (backend) > 0'\n                severity: critical\n                for: 1m\n\n      - name: Caddy\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://caddyserver.com/docs/metrics\n            rules:\n              - name: Caddy Reverse Proxy Down\n                description: \"All Caddy reverse proxies are down\"\n                query: \"count(caddy_reverse_proxy_upstreams_healthy) by (upstream) == 0\"\n                severity: critical\n\n          
    - name: Caddy high HTTP 4xx error rate service\n                description: \"Caddy service 4xx error rate is above 5%\"\n                query: 'sum(rate(caddy_http_request_duration_seconds_count{code=~\"4..\"}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'\n                severity: critical\n                for: 1m\n              - name: Caddy high HTTP 5xx error rate service\n                description: \"Caddy service 5xx error rate is above 5%\"\n                query: 'sum(rate(caddy_http_request_duration_seconds_count{code=~\"5..\"}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'\n                severity: critical\n                for: 1m\n\n      - name: Envoy\n        exporters:\n          - name: Built-in metrics\n            slug: embedded-exporter\n            doc_url: https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/statistics\n            rules:\n              - name: Envoy server not live\n                description: \"Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\"\n                query: \"envoy_server_live != 1\"\n                severity: critical\n                for: 1m\n              - name: Envoy high memory usage\n                description: \"Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\"\n                query: \"envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 and envoy_server_memory_heap_size > 0\"\n                severity: warning\n                for: 5m\n              - name: Envoy high downstream HTTP 5xx error rate\n                description: \"More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf 
\\\"%.1f\\\" }}%)\"\n                query: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class=\"5\"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'\n                severity: critical\n                for: 1m\n              - name: Envoy high downstream HTTP 4xx error rate\n                description: \"More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \\\"%.1f\\\" }}%)\"\n                query: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class=\"4\"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Envoy downstream connections overflowing\n                description: \"Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_listener_downstream_cx_overflow[5m]) > 5\"\n                severity: warning\n              - name: Envoy cluster membership empty\n                description: \"Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\"\n                query: \"envoy_cluster_membership_healthy == 0\"\n                severity: critical\n                for: 1m\n              - name: Envoy cluster membership degraded\n                description: \"More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy\"\n                query: \"envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0\"\n                severity: warning\n                for: 5m\n              - name: Envoy high cluster upstream 
connection failures\n                description: \"High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10\"\n                severity: warning\n                for: 5m\n              - name: Envoy high cluster upstream request timeout rate\n                description: \"More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\"\n                query: \"rate(envoy_cluster_upstream_rq_timeout[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0\"\n                severity: warning\n                for: 5m\n              - name: Envoy high cluster upstream 5xx error rate\n                description: \"More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\"\n                query: 'rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class=\"5\"}[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0'\n                severity: critical\n                for: 1m\n              - name: Envoy cluster health check failures\n                description: \"Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_cluster_health_check_failure[5m]) > 5\"\n                severity: warning\n                for: 5m\n              - name: Envoy cluster outlier detection ejections active\n                description: \"There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\"\n                query: \"envoy_cluster_outlier_detection_ejections_active > 0\"\n                severity: 
info\n                for: 5m\n              - name: Envoy listener SSL connection errors\n                description: \"Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_listener_ssl_connection_error[5m]) > 5\"\n                severity: warning\n              - name: Envoy global downstream connections overflowing\n                description: \"Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_listener_downstream_global_cx_overflow[5m]) > 5\"\n                severity: critical\n              - name: Envoy SSL certificate expiring soon\n                description: \"SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\"\n                query: \"envoy_server_days_until_first_cert_expiring < 7\"\n                severity: warning\n              - name: Envoy SSL certificate expired\n                description: \"SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\"\n                query: \"envoy_server_days_until_first_cert_expiring < 0\"\n                severity: critical\n              - name: Envoy cluster circuit breaker tripped\n                description: \"Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\"\n                query: \"envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1\"\n                severity: critical\n              - name: Envoy no healthy upstream\n                description: \"Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 0\"\n                
severity: critical\n              - name: Envoy high downstream request timeout rate\n                description: \"Downstream requests are timing out on {{ $labels.instance }} ({{ $value }} in the last 5m)\"\n                query: \"increase(envoy_http_downstream_rq_timeout[5m]) > 5\"\n                severity: warning\n                for: 5m\n\n      - name: Linkerd\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://linkerd.io/2/tasks/exporting-metrics/\n            rules:\n              - name: Linkerd high error rate\n                description: \"Linkerd error rate for {{ $labels.deployment }}{{ $labels.statefulset }}{{ $labels.daemonset }} is over 10%\"\n                query: 'sum(rate(response_total{classification=\"failure\"}[1m])) by (deployment, statefulset, daemonset) / sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10 and sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) > 0'\n                comments: |\n                  Linkerd does not expose request_errors_total. Errors are tracked via response_total{classification=\"failure\"}.\n                severity: warning\n                for: 1m\n\n      - name: Istio\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://istio.io/latest/docs/tasks/observability/metrics/querying-metrics/\n            rules:\n              - name: Istio Kubernetes gateway availability drop\n                description: Gateway pods have dropped. 
Inbound traffic will likely be affected.\n                query: 'min(kube_deployment_status_replicas_available{deployment=\"istio-ingressgateway\", namespace=\"istio-system\"}) without (instance, pod) < 2'\n                severity: warning\n                for: 1m\n              - name: Istio Pilot high push error rate\n                description: Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\n                query: \"sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5 and sum(rate(pilot_xds_pushes[1m])) > 0\"\n                severity: warning\n                for: 1m\n              - name: Istio Mixer Prometheus dispatches low\n                description: Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be exported properly.\n                query: 'sum(rate(mixer_runtime_dispatches_total{adapter=~\"prometheus\"}[1m])) < 180'\n                severity: warning\n                for: 1m\n              - name: Istio high total request rate\n                description: Global request rate in the service mesh is unusually high.\n                query: 'sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 1000'\n                severity: warning\n                for: 2m\n              - name: Istio low total request rate\n                description: Global request rate in the service mesh is unusually low.\n                query: 'sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) < 100'\n                severity: warning\n                for: 2m\n              - name: Istio high 4xx error rate\n                description: High percentage of HTTP 4xx responses in Istio (> 5%).\n                query: 'sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"4.*\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100 > 5 and 
sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 0'\n                severity: warning\n                for: 1m\n              - name: Istio high 5xx error rate\n                description: High percentage of HTTP 5xx responses in Istio (> 5%).\n                query: 'sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5.*\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 0'\n                severity: warning\n                for: 1m\n              - name: Istio high request latency\n                description: Istio average request duration is longer than 100ms.\n                query: 'rate(istio_request_duration_milliseconds_sum{reporter=\"destination\"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter=\"destination\"}[1m]) > 100 and rate(istio_request_duration_milliseconds_count{reporter=\"destination\"}[1m]) > 0'\n                severity: warning\n                for: 1m\n              - name: Istio latency 99 percentile\n                description: The slowest 1% of Istio requests take longer than 1000ms.\n                query: \"histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000\"\n                severity: warning\n                for: 1m\n              - name: Istio Pilot Duplicate Entry\n                description: Istio Pilot duplicate entry error.\n                query: \"sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0\"\n                severity: critical\n\n  - name: Runtimes\n    services:\n      - name: PHP-FPM\n        exporters:\n          - name: bakins/php-fpm-exporter\n            slug: bakins-fpm-exporter\n            doc_url: https://github.com/bakins/php-fpm-exporter\n            rules:\n              - name: PHP-FPM 
max-children reached\n                description: PHP-FPM reached max children on {{ $labels.instance }} ({{ $value }} times in the last 5m)\n                query: \"sum(increase(phpfpm_max_children_reached_total[5m])) by (instance) > 3\"\n                severity: warning\n\n      - name: JVM\n        exporters:\n          - name: java-client\n            slug: jvm-exporter\n            doc_url: https://github.com/prometheus/client_java\n            rules:\n              - name: JVM memory filling up\n                description: JVM memory is filling up (> 80%)\n                query: '(sum by (instance)(jvm_memory_used_bytes{area=\"heap\"}) / sum by (instance)(jvm_memory_max_bytes{area=\"heap\"})) * 100 > 80 and sum by (instance)(jvm_memory_max_bytes{area=\"heap\"}) > 0'\n                severity: warning\n                for: 2m\n              - name: JVM non-heap memory filling up\n                description: JVM non-heap memory (metaspace/code cache) is filling up (> 80%)\n                query: '(sum by (instance)(jvm_memory_used_bytes{area=\"nonheap\"}) / (sum by (instance)(jvm_memory_max_bytes{area=\"nonheap\"}) > 0)) * 100 > 80'\n                severity: warning\n                for: 2m\n                comments: |\n                  Many JVM configurations leave metaspace unbounded, in which case jvm_memory_max_bytes{area=\"nonheap\"} is -1 and this alert will not fire.\n                  The query filters out max_bytes <= 0 to avoid false negatives.\n              - name: JVM GC time too high\n                description: JVM is spending too much time in garbage collection (> 5% of wall clock time)\n                query: 'sum by (instance)(rate(jvm_gc_collection_seconds_sum[5m])) > 0.05'\n                severity: warning\n                for: 5m\n              - name: JVM threads deadlocked\n                description: JVM has deadlocked threads\n                query: 'jvm_threads_deadlocked > 0'\n                severity: critical\n             
   for: 1m\n              - name: JVM thread count high\n                description: JVM thread count is high (> 300), potential thread leak\n                query: 'jvm_threads_current > 300'\n                severity: warning\n                for: 5m\n              - name: JVM threads BLOCKED\n                description: JVM has high number of BLOCKED threads, indicating lock contention\n                query: 'jvm_threads_state{state=\"BLOCKED\"} > 50'\n                severity: warning\n                for: 5m\n              - name: JVM old gen GC frequency\n                description: Frequent old/major GC cycles, indicating memory pressure\n                query: 'rate(jvm_gc_collection_seconds_count{gc=~\".*old.*|.*major.*\"}[5m]) > 0.3'\n                severity: warning\n                for: 5m\n                comments: |\n                  This regex matches CMS, G1, and Parallel collector names. It will not match ZGC or Shenandoah cycle names.\n                  Adjust the gc label filter if you use a different collector.\n              - name: JVM direct buffer pool filling up\n                description: JVM direct buffer pool is filling up (> 90%)\n                query: '(jvm_buffer_pool_used_bytes / jvm_buffer_pool_capacity_bytes) * 100 > 90 and jvm_buffer_pool_capacity_bytes > 0'\n                severity: warning\n                for: 5m\n              - name: JVM objects pending finalization\n                description: JVM has objects pending finalization, potential memory leak\n                query: 'jvm_memory_objects_pending_finalization > 1000'\n                severity: warning\n                for: 5m\n              - name: JVM file descriptors exhaustion\n                description: JVM process is running out of file descriptors (> 90% used)\n                query: '(process_open_fds / process_max_fds) * 100 > 90 and process_max_fds > 0'\n                severity: warning\n                for: 5m\n                comments: |\n     
             process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not JVM-specific.\n                  This alert will also fire for Go, Python, or any process exposing these metrics.\n              - name: JVM class loading anomaly\n                description: Rapid class loading detected, potential classloader leak\n                query: 'rate(jvm_classes_loaded_total[5m]) > 100'\n                severity: warning\n                for: 5m\n              - name: JVM compilation time spike\n                description: Excessive JIT compilation time consuming CPU\n                query: 'rate(jvm_compilation_time_seconds_total[5m]) > 0.1'\n                severity: warning\n                for: 5m\n\n      - name: Golang\n        exporters:\n          - name: client_golang\n            slug: golang-exporter\n            doc_url: https://github.com/prometheus/client_golang\n            rules:\n              - name: Go goroutine count high\n                description: Go application has too many goroutines (> 1000), potential goroutine leak\n                query: 'go_goroutines > 1000'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is a rough default. High-concurrency servers may legitimately run thousands of goroutines. Adjust to match your baseline.\n              - name: Go GC duration high\n                description: Go GC pause duration is too high (max > 1s)\n                query: 'go_gc_duration_seconds{quantile=\"1\"} > 1'\n                severity: warning\n                for: 5m\n                comments: |\n                  quantile=\"1\" is the maximum observed GC pause in the current summary window, not p99.\n                  A single outlier pause can push this above 1s. 
The for: 5m ensures the max stays elevated.\n              - name: Go memory usage high\n                description: Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak\n                query: '(go_memstats_heap_alloc_bytes / go_memstats_sys_bytes) * 100 > 90'\n                severity: warning\n                for: 5m\n                comments: |\n                  go_memstats_sys_bytes is the total memory obtained from the OS by the Go runtime, not total host memory.\n                  This ratio measures Go-internal memory utilization, not system-level memory pressure.\n              - name: Go thread count high\n                description: Go OS thread count is high (> 500), potential blocking syscall or CGo leak\n                query: 'go_threads > 500'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is workload-dependent. Applications with heavy CGo or blocking I/O may legitimately use more OS threads. Adjust to match your baseline.\n              - name: Go heap objects count high\n                description: Go heap has too many live objects (> 10M), high GC pressure\n                query: 'go_memstats_heap_objects > 10000000'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is a rough default. 
Adjust based on your application's normal object count.\n              - name: Go GC CPU fraction high\n                description: Go GC is consuming too much CPU (> 5%)\n                query: 'go_memstats_gc_cpu_fraction > 0.05'\n                severity: warning\n                for: 5m\n                comments: |\n                  go_memstats_gc_cpu_fraction is deprecated since Go 1.20 and may return 0 in newer versions.\n                  Consider using runtime/metrics-based alternatives if running Go >= 1.20.\n              - name: Go goroutine spike\n                description: Go goroutine count is growing rapidly\n                query: 'deriv(go_goroutines[5m]) > 100'\n                severity: warning\n                for: 5m\n              - name: Go heap fragmentation\n                description: Go heap has high idle ratio (> 90%), indicating memory fragmentation\n                query: 'go_memstats_heap_idle_bytes / go_memstats_heap_sys_bytes > 0.9'\n                severity: warning\n                for: 5m\n              - name: Go memory leak\n                description: Go application has sustained high allocation rate (> 1GB/s), potential memory leak\n                query: 'rate(go_memstats_alloc_bytes_total[5m]) > 1e9'\n                severity: warning\n                for: 5m\n              - name: Go stack memory high\n                description: Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion\n                query: 'go_memstats_stack_inuse_bytes > 1e9'\n                severity: warning\n                for: 5m\n\n      - name: Ruby\n        exporters:\n          - name: prometheus_exporter\n            slug: ruby-exporter\n            doc_url: https://github.com/discourse/prometheus_exporter\n            rules:\n              - name: Ruby heap live slots high\n                description: Ruby heap has too many live slots (> 500k), heap bloat\n                query: 'ruby_heap_live_slots > 
500000'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is a rough default. Adjust based on your application's normal heap size.\n              - name: Ruby heap free slots high\n                description: Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations\n                query: 'ruby_heap_free_slots > 500000'\n                severity: warning\n                for: 5m\n              - name: Ruby major GC rate high\n                description: Ruby is performing too many major GC cycles, indicating memory pressure\n                query: 'rate(ruby_major_gc_ops_total[5m]) > 5'\n                severity: warning\n                for: 5m\n                comments: |\n                  Major GC rate > 5/s is extremely high. Consider lowering to > 1 or > 2 for earlier detection.\n              - name: Ruby RSS high\n                description: Ruby process RSS is high (> 1GB)\n                query: 'ruby_rss > 1e9'\n                severity: warning\n                for: 5m\n              - name: Ruby allocated objects spike\n                description: Ruby is allocating objects at a high rate\n                query: 'rate(ruby_allocated_objects_total[5m]) > 100000'\n                severity: warning\n                for: 5m\n\n      - name: Python\n        exporters:\n          - name: client_python\n            slug: python-exporter\n            doc_url: https://github.com/prometheus/client_python\n            rules:\n              - name: Python GC objects uncollectable\n                description: Python has uncollectable objects, potential memory leak via reference cycles\n                query: 'increase(python_gc_objects_uncollectable_total[5m]) > 0'\n                severity: warning\n                for: 5m\n              - name: Python GC collections high\n                description: Python GC is collecting too many objects (> 10k/s), high 
allocation pressure\n                query: 'rate(python_gc_objects_collected_total[5m]) > 10000'\n                severity: warning\n                for: 5m\n              - name: Python file descriptors exhaustion\n                description: Python process is running out of file descriptors (> 90% used)\n                query: '(process_open_fds / process_max_fds) * 100 > 90 and process_max_fds > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not Python-specific.\n              - name: Python GC generation 2 collections high\n                description: Python full GC (generation 2) is running too frequently, indicating memory pressure\n                query: 'rate(python_gc_collections_total{generation=\"2\"}[5m]) > 1'\n                severity: warning\n                for: 5m\n                comments: |\n                  Gen2 collection rate > 1/s is very high. In most applications, gen2 runs are infrequent. Adjust threshold based on your workload.\n              - name: Python virtual memory high\n                description: Python process virtual memory is high (> 4GB)\n                query: 'process_virtual_memory_bytes > 4e9'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold is a rough default. 
Adjust based on your application's expected memory footprint.\n\n      - name: Sidekiq\n        exporters:\n          - name: Strech/sidekiq-prometheus-exporter\n            slug: strech-sidekiq-exporter\n            doc_url: https://github.com/Strech/sidekiq-prometheus-exporter\n            rules:\n              - name: Sidekiq queue size\n                description: Sidekiq queue {{ $labels.name }} is growing\n                query: \"sidekiq_queue_size > 100\"\n                severity: warning\n                for: 1m\n              - name: Sidekiq scheduling latency too high\n                description: Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\n                query: \"max(sidekiq_queue_latency) > 60\"\n                severity: critical\n\n  - name: Data engineering\n    services:\n      - name: Apache Flink\n        exporters:\n          - name: Built-in Prometheus reporter\n            slug: flink-prometheus-reporter\n            doc_url: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/\n            rules:\n              - name: Flink job is not running\n                description: \"No Flink jobs are currently running. All jobs may have failed or been cancelled.\"\n                query: \"flink_jobmanager_numRunningJobs == 0\"\n                severity: critical\n                for: 1m\n              - name: Flink no TaskManagers registered\n                description: \"No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\"\n                query: \"flink_jobmanager_numRegisteredTaskManagers == 0\"\n                severity: critical\n                for: 1m\n              - name: Flink all task slots used\n                description: \"All Flink task slots are in use ({{ $value }} available). 
New jobs cannot be scheduled.\"\n                query: \"flink_jobmanager_taskSlotsAvailable == 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.\n              - name: Flink job restart increasing\n                description: \"Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\"\n                query: \"increase(flink_jobmanager_job_numRestarts[5m]) > 1\"\n                severity: warning\n                for: 5m\n                comments: |\n                  A single restart may be normal during deployments. Adjust threshold based on restart tolerance.\n              - name: Flink checkpoint failures\n                description: \"Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\"\n                query: \"increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1\"\n                severity: warning\n                for: 5m\n              - name: Flink checkpoint duration high\n                description: \"Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete.\"\n                query: \"flink_jobmanager_job_lastCheckpointDuration / 1000 > 60\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Value is converted from milliseconds to seconds for correct humanizeDuration display.\n                  Threshold is 60 seconds. 
Adjust based on your checkpoint interval and state size.\n              - name: Flink task backpressured\n                description: \"Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\"\n                query: \"flink_taskmanager_job_task_isBackPressured == 1\"\n                severity: warning\n                for: 5m\n              - name: Flink task high backpressure time\n                description: \"Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\"\n                query: \"flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.\n              - name: Flink TaskManager heap memory high\n                description: \"Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\"\n                query: \"flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9\"\n                severity: warning\n                for: 5m\n              - name: Flink JobManager heap memory high\n                description: \"Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\"\n                query: \"flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9\"\n                severity: warning\n                for: 5m\n              - name: Flink TaskManager GC time high\n                description: \"Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\"\n                query: \"rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold: more than 100ms/sec of GC time 
(10% of wall clock). Adjust based on your workload.\n              - name: Flink no records processed\n                description: \"Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\"\n                query: \"rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Only fires for tasks that have previously received records, to avoid false positives during startup.\n\n      - name: Apache Spark\n        exporters:\n          - name: Built-in Prometheus (PrometheusServlet + PrometheusResource)\n            slug: spark-prometheus\n            doc_url: https://spark.apache.org/docs/latest/monitoring.html\n            comments: |\n              Spark exposes metrics via two built-in endpoints:\n              - PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)\n              - PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)\n              Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.\n              Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet\n            rules:\n              - name: Spark no alive workers\n                description: \"No Spark workers are alive. 
The cluster has no processing capacity.\"\n                query: \"metrics_master_aliveWorkers_Value == 0\"\n                severity: critical\n                for: 1m\n              - name: Spark too many waiting apps\n                description: \"Spark has {{ $value }} applications waiting for resources.\"\n                query: \"metrics_master_waitingApps_Value > 10\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Adjust the threshold based on your cluster's typical queuing behavior.\n              - name: Spark worker memory exhausted\n                description: \"Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\"\n                query: \"metrics_worker_memFree_MB_Value == 0\"\n                severity: warning\n                for: 2m\n              - name: Spark worker cores exhausted\n                description: \"Spark worker {{ $labels.instance }} has no free cores.\"\n                query: \"metrics_worker_coresFree_Value == 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Fires when a worker has no free cores. 
This may be normal under high load but can indicate capacity issues.\n              - name: Spark executor high GC time\n                description: \"Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\"\n                query: \"metrics_executor_totalGCTime_seconds_total / metrics_executor_totalDuration > 0.1 and metrics_executor_totalDuration > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Fires when more than 10% of executor time is spent in garbage collection.\n                  This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/).\n              - name: Spark executor all tasks failing\n                description: \"Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\"\n                query: \"metrics_executor_failedTasks_total > 0 and metrics_executor_completedTasks_total == 0\"\n                severity: critical\n                for: 5m\n              - name: Spark executor high task failure rate\n                description: \"Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\"\n                query: \"metrics_executor_failedTasks_total / metrics_executor_totalTasks_total > 0.1 and metrics_executor_totalTasks_total > 0\"\n                severity: warning\n                for: 5m\n              - name: Spark executor high disk spill\n                description: \"Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\"\n                query: \"metrics_executor_diskUsed_bytes > 1e9\"\n                severity: warning\n                for: 5m\n                comments: |\n                  diskUsed is a gauge, not a counter — do not use rate(). 
Threshold of 1GB is a rough default.\n                  Disk spilling indicates insufficient memory for the workload.\n\n      - name: Hadoop\n        exporters:\n          - name: hadoop/jmx_exporter\n            slug: jmx_exporter\n            doc_url: https://github.com/prometheus/jmx_exporter\n            rules:\n              # Alert rule for NameNode availability\n              - name: Hadoop Name Node Down\n                query: up{job=\"hadoop-namenode\"} == 0\n                for: 5m\n                severity: critical\n                description: \"The Hadoop NameNode service is unavailable.\"\n\n              # Alert rule for ResourceManager availability\n              - name: Hadoop Resource Manager Down\n                query: up{job=\"hadoop-resourcemanager\"} == 0\n                for: 5m\n                severity: critical\n                description: \"The Hadoop ResourceManager service is unavailable.\"\n\n              # Alert rule for DataNode status\n              - name: Hadoop Data Node Out Of Service\n                query: hadoop_datanode_last_heartbeat == 0\n                for: 10m\n                severity: warning\n                description: \"The Hadoop DataNode is not sending heartbeats.\"\n\n              # Alert rule for low HDFS disk space\n              - name: Hadoop HDFS Disk Space Low\n                query: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 and hadoop_hdfs_bytes_total > 0\n                for: 15m\n                severity: warning\n                description: \"Available HDFS disk space is running low.\"\n\n              # Alert rule for excessive MapReduce task failures\n              - name: Hadoop Map Reduce Task Failures\n                query: increase(hadoop_mapreduce_task_failures_total[1h]) > 100\n                for: 10m\n                severity: critical\n                description: \"There is an unusually high number of MapReduce task failures.\"\n\n          
    # Alert rule for high ResourceManager memory usage\n              - name: Hadoop Resource Manager Memory High\n                query: hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8\n                for: 15m\n                severity: warning\n                description: \"The Hadoop ResourceManager is approaching its memory limit.\"\n\n              # Alert rule for high YARN container allocation failures\n              - name: Hadoop YARN Container Allocation Failures\n                query: increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10\n                for: 10m\n                severity: warning\n                description: \"There is a significant number of YARN container allocation failures.\"\n\n              # Alert rule for excessive HBase region server region count\n              - name: Hadoop HBase Region Count High\n                query: hadoop_hbase_region_count > 5000\n                for: 15m\n                severity: warning\n                description: \"The HBase cluster has an unusually high number of regions.\"\n\n              # Alert rule for low HBase region server heap space\n              - name: Hadoop HBase Region Server Heap Low\n                query: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8\n                for: 10m\n                severity: warning\n                description: \"HBase Region Servers are running low on heap space.\"\n\n              # Alert rule for high HBase Write Requests latency\n              - name: Hadoop HBase Write Requests Latency High\n                query: hadoop_hbase_write_requests_latency_seconds > 0.5\n                for: 10m\n                severity: warning\n                description: \"HBase Write Requests are experiencing high latency.\"\n\n  - name: Orchestrators\n    services:\n      - name: Kubernetes\n        exporters:\n          - name: kube-state-metrics\n            slug: 
kubestate-exporter\n            doc_url: https://github.com/kubernetes/kube-state-metrics/tree/master/docs\n            rules:\n              - name: Kubernetes Node not ready\n                description: Node {{ $labels.node }} has been unready for a long time\n                query: 'kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 0'\n                severity: critical\n                for: 10m\n              - name: Kubernetes Node scheduling disabled\n                description: Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes.\n                query: 'kube_node_spec_taint{key=\"node.kubernetes.io/unschedulable\"} == 1'\n                severity: warning\n                for: 30m\n                comments: |\n                  A Kubernetes node with scheduling disabled is fine in itself.\n                  This alert is useful for catching nodes that stay unschedulable for longer than expected.\n              - name: Kubernetes Node memory pressure\n                description: \"Node {{ $labels.node }} has MemoryPressure condition\"\n                query: 'kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\"} == 1'\n                severity: critical\n                for: 2m\n              - name: Kubernetes Node disk pressure\n                description: \"Node {{ $labels.node }} has DiskPressure condition\"\n                query: 'kube_node_status_condition{condition=\"DiskPressure\",status=\"true\"} == 1'\n                severity: critical\n                for: 2m\n              - name: Kubernetes Node network unavailable\n                description: \"Node {{ $labels.node }} has NetworkUnavailable condition\"\n                query: 'kube_node_status_condition{condition=\"NetworkUnavailable\",status=\"true\"} == 1'\n                severity: critical\n                for: 2m\n              - name: Kubernetes Node out of pod capacity\n                description: \"Node {{ $labels.node }} is 
out of pod capacity\"\n                query: 'sum by (node) ((kube_pod_status_phase{phase=\"Running\"} == 1) + on(uid, instance) group_left(node) (0 * kube_pod_info{pod_template_hash=\"\"})) / sum by (node) (kube_node_status_allocatable{resource=\"pods\"}) * 100 > 90'\n                severity: warning\n                for: 2m\n              - name: Kubernetes Container oom killer\n                description: \"Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\"\n                query: '(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"}[10m]) == 1'\n                severity: warning\n              - name: Kubernetes Job failed\n                description: \"Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\"\n                query: \"kube_job_status_failed > 0\"\n                severity: warning\n              - name: Kubernetes Job not starting\n                description: \"Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes\"\n                query: \"kube_job_status_active == 0 and kube_job_status_failed == 0 and kube_job_status_succeeded == 0 and (time() - kube_job_status_start_time) > 600\"\n                severity: warning\n              - name: Kubernetes CronJob failing\n                description: \"CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing\"\n                query: \"(kube_cronjob_status_last_schedule_time > kube_cronjob_status_last_successful_time) AND (kube_cronjob_status_active == 0) AND (kube_cronjob_spec_suspend == 0)\"\n                severity: critical\n              - name: Kubernetes CronJob suspended\n                description: \"CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\"\n                
query: \"kube_cronjob_spec_suspend != 0\"\n                severity: warning\n              - name: Kubernetes PersistentVolumeClaim pending\n                description: \"PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\"\n                query: 'kube_persistentvolumeclaim_status_phase{phase=\"Pending\"} == 1'\n                severity: warning\n                for: 2m\n              - name: Kubernetes Volume out of disk space\n                description: Volume is almost full (< 10% left)\n                query: \"kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 and kubelet_volume_stats_capacity_bytes > 0\"\n                severity: warning\n                for: 2m\n              - name: Kubernetes Volume full in four days\n                description: \"Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\"\n                query: \"predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0\"\n                severity: critical\n              - name: Kubernetes PersistentVolume error\n                description: \"Persistent volume {{ $labels.persistentvolume }} is in bad state\"\n                query: 'kube_persistentvolume_status_phase{phase=~\"Failed|Pending\", job=\"kube-state-metrics\"} > 0'\n                severity: critical\n              - name: Kubernetes StatefulSet down\n                description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\n                query: \"kube_statefulset_replicas != kube_statefulset_status_replicas_ready > 0\"\n                severity: critical\n                for: 1m\n              - name: Kubernetes HPA scale inability\n                description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale\n                query: 
'(kube_horizontalpodautoscaler_spec_max_replicas - kube_horizontalpodautoscaler_status_desired_replicas) * on (horizontalpodautoscaler,namespace) (kube_horizontalpodautoscaler_status_condition{condition=\"ScalingLimited\", status=\"true\"} == 1) == 0'\n                severity: warning\n                for: 2m\n              - name: Kubernetes HPA metrics unavailability\n                description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\n                query: 'kube_horizontalpodautoscaler_status_condition{status=\"false\", condition=\"ScalingActive\"} == 1'\n                severity: warning\n              - name: Kubernetes HPA scale maximum\n                description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\n                query: \"(kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas)\"\n                severity: info\n                for: 2m\n              - name: Kubernetes HPA underutilized\n                description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is constantly at minimum replicas for 50% of the time. 
Potential cost saving here.\n                query: \"max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3\" # allow minimum 3 replicas running\n                severity: info\n              - name: Kubernetes Pod not healthy\n                description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n                query: 'sum by (namespace, pod) (kube_pod_status_phase{phase=~\"Pending|Unknown|Failed\"}) > 0'\n                severity: critical\n                for: 15m\n              - name: Kubernetes pod crash looping\n                description: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n                query: \"increase(kube_pod_container_status_restarts_total[1m]) > 3\"\n                severity: warning\n                for: 2m\n              - name: Kubernetes ReplicaSet replicas mismatch\n                description: ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n                query: \"kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas\"\n                severity: warning\n                for: 10m\n              - name: Kubernetes Deployment replicas mismatch\n                description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n                query: \"kube_deployment_spec_replicas != kube_deployment_status_replicas_available\"\n                severity: warning\n                for: 10m\n              - name: Kubernetes StatefulSet replicas mismatch\n                description: StatefulSet does not match the expected number of replicas.\n                query: \"kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas\"\n                severity: warning\n                for: 10m\n              - name: Kubernetes Deployment generation mismatch\n    
            description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\n                query: \"kube_deployment_status_observed_generation != kube_deployment_metadata_generation\"\n                severity: critical\n                for: 10m\n              - name: Kubernetes StatefulSet generation mismatch\n                description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\n                query: \"kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation\"\n                severity: critical\n                for: 10m\n              - name: Kubernetes StatefulSet update not rolled out\n                description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\n                query: \"max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)\"\n                severity: warning\n                for: 10m\n              - name: Kubernetes DaemonSet rollout stuck\n                description: Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n                query: \"(kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 and kube_daemonset_status_desired_number_scheduled > 0) or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0\"\n                severity: warning\n                for: 10m\n              - name: Kubernetes DaemonSet misscheduled\n                description: Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\n                query: \"kube_daemonset_status_number_misscheduled > 0\"\n                severity: critical\n          
      for: 1m\n              - name: Kubernetes CronJob too long\n                description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n                query: \"kube_job_status_start_time > 0 and absent(kube_job_status_completion_time) and (time() - kube_job_status_start_time) > 3600\"\n                severity: warning\n                comments: |\n                  Threshold should be customized for each cronjob name.\n              - name: Kubernetes Job slow completion\n                description: Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n                query: \"kube_job_spec_completions - kube_job_status_succeeded - kube_job_status_failed > 0\"\n                severity: critical\n                for: 12h\n              - name: Kubernetes API server errors\n                description: \"Kubernetes API server is experiencing {{ $value | humanize }}% error rate\"\n                query: 'sum(rate(apiserver_request_total{job=\"apiserver\",code=~\"(?:5..)\"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job=\"apiserver\"}[1m])) by (instance, job) * 100 > 3 and sum(rate(apiserver_request_total{job=\"apiserver\"}[1m])) by (instance, job) > 0'\n                severity: critical\n                for: 2m\n              - name: Kubernetes API client errors\n                description: \"Kubernetes API client is experiencing {{ $value | humanize }}% error rate\"\n                query: '(sum(rate(rest_client_requests_total{code=~\"(4|5)..\"}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 and sum(rate(rest_client_requests_total[1m])) by (instance, job) > 0'\n                severity: critical\n                for: 2m\n              - name: Kubernetes client certificate expires next week\n                description: A client certificate used to authenticate to the apiserver is expiring next week.\n          
      query: 'apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 7*24*60*60'\n                severity: warning\n              - name: Kubernetes client certificate expires soon\n                description: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n                query: 'apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 24*60*60'\n                severity: critical\n              - name: Kubernetes API server latency\n                description: \"Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\"\n                query: 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"(?:CONNECT|WATCHLIST|WATCH|PROXY)\"} [10m])) WITHOUT (subresource)) > 1'\n                severity: warning\n                for: 2m\n\n      - name: Nomad\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Nomad job failed\n                description: Nomad job failed\n                query: \"nomad_nomad_job_summary_failed > 0\"\n                severity: warning\n              - name: Nomad job lost\n                description: Nomad job lost\n                query: \"nomad_nomad_job_summary_lost > 0\"\n                severity: warning\n              - name: Nomad job queued\n                description: Nomad job queued\n                query: \"nomad_nomad_job_summary_queued > 0\"\n                severity: warning\n                for: 2m\n              - name: Nomad blocked evaluation\n                description: Nomad 
blocked evaluation\n                query: \"nomad_nomad_blocked_evals_total_blocked > 0\"\n                severity: warning\n\n      - name: Consul\n        exporters:\n          - name: prometheus/consul_exporter\n            slug: consul-exporter\n            doc_url: https://github.com/prometheus/consul_exporter\n            rules:\n              - name: Consul service healthcheck failed\n                description: \"Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\"\n                query: \"consul_catalog_service_node_healthy == 0\"\n                severity: critical\n                for: 1m # allows a short service restart\n              - name: Consul missing master node\n                description: The number of Consul raft peers should be at least 3 to preserve quorum.\n                query: \"consul_raft_peers < 3\"\n                severity: critical\n              - name: Consul agent unhealthy\n                description: A Consul agent is down\n                query: 'consul_health_node_status{status=\"critical\"} == 1'\n                severity: critical\n\n      - name: Etcd\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Etcd insufficient Members\n                description: Etcd cluster should have an odd number of members\n                query: \"count(etcd_server_id) % 2 == 0\"\n                severity: critical\n              - name: Etcd no Leader\n                description: Etcd cluster has no leader\n                query: \"etcd_server_has_leader == 0\"\n                severity: critical\n              - name: Etcd high number of leader changes\n                description: \"Etcd leader changed {{ $value }} times during 10 minutes\"\n                query: \"increase(etcd_server_leader_changes_seen_total[10m]) > 2\"\n                severity: warning\n              - name: Etcd high number of failed GRPC 
requests warning\n                description: More than 1% GRPC request failure detected in Etcd\n                query: 'sum(rate(grpc_server_handled_total{grpc_code=~\"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown\"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'\n                severity: warning\n                for: 2m\n                comments: |\n                  Filters to actual error codes. grpc_code!=\"OK\" includes benign codes like NotFound, AlreadyExists, and Cancelled.\n              - name: Etcd high number of failed GRPC requests critical\n                description: More than 5% GRPC request failure detected in Etcd\n                query: 'sum(rate(grpc_server_handled_total{grpc_code=~\"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown\"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'\n                severity: critical\n                for: 2m\n                comments: |\n                  Filters to actual error codes. 
grpc_code!=\"OK\" includes benign codes like NotFound, AlreadyExists, and Cancelled.\n              - name: Etcd GRPC requests slow\n                description: GRPC requests slowing down, 99th percentile is over 0.15s\n                query: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type=\"unary\"}[1m])) by (grpc_service, grpc_method, le)) > 0.15'\n                severity: warning\n                for: 2m\n              - name: Etcd high number of failed HTTP requests warning\n                description: More than 1% HTTP failure detected in Etcd\n                query: \"sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0\"\n                severity: warning\n                for: 2m\n              - name: Etcd high number of failed HTTP requests critical\n                description: More than 5% HTTP failure detected in Etcd\n                query: \"sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0\"\n                severity: critical\n                for: 2m\n              - name: Etcd HTTP requests slow\n                description: HTTP requests slowing down, 99th percentile is over 0.15s\n                query: \"histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15\"\n                severity: warning\n                for: 2m\n              - name: Etcd member communication slow\n                description: Etcd member communication slowing down, 99th percentile is over 0.15s\n                query: \"histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15\"\n                severity: warning\n                for: 2m\n              - name: Etcd high number of failed proposals\n                description: \"Etcd 
server got {{ $value }} failed proposals in the past hour\"\n                query: \"increase(etcd_server_proposals_failed_total[1h]) > 5\"\n                severity: warning\n                for: 2m\n              - name: Etcd high fsync durations\n                description: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n                query: \"histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5\"\n                severity: warning\n                for: 2m\n              - name: Etcd high commit durations\n                description: Etcd commit duration increasing, 99th percentile is over 0.25s\n                query: \"histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25\"\n                severity: warning\n                for: 2m\n\n      - name: OpenStack\n        exporters:\n          - name: openstack-exporter/openstack-exporter\n            slug: openstack-exporter\n            doc_url: https://github.com/openstack-exporter/openstack-exporter\n            rules:\n              - name: OpenStack exporter down\n                description: The OpenStack exporter is down. 
OpenStack cloud metrics are no longer being collected.\n                query: 'up{job=~\".*openstack.*\"} == 0'\n                severity: critical\n                for: 2m\n              - name: OpenStack Nova agent down\n                description: \"Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\"\n                query: 'openstack_nova_agent_state{adminState=\"enabled\"} == 0'\n                severity: critical\n                for: 2m\n              - name: OpenStack Neutron agent down\n                description: \"Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down\"\n                query: 'openstack_neutron_agent_state{adminState=\"up\"} == 0'\n                severity: critical\n                for: 2m\n              - name: OpenStack Cinder agent down\n                description: \"Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\"\n                query: 'openstack_cinder_agent_state{adminState=\"enabled\"} == 0'\n                severity: critical\n                for: 2m\n              - name: OpenStack hypervisor high vCPU usage\n                description: \"Hypervisor {{ $labels.hostname }} vCPU usage is above 90%\"\n                query: 'openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9 and openstack_nova_vcpus_available > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The threshold of 90% is a rough default. 
Adjust based on your overcommit ratio and workload patterns.\n              - name: OpenStack hypervisor high memory usage\n                description: \"Hypervisor {{ $labels.hostname }} memory usage is above 90%\"\n                query: 'openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.9 and openstack_nova_memory_available_bytes > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns.\n              - name: OpenStack hypervisor high disk usage\n                description: \"Hypervisor {{ $labels.hostname }} local disk usage is above 90%\"\n                query: 'openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.9 and openstack_nova_local_storage_available_bytes > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Nova tenant vCPU quota nearly exhausted\n                description: \"Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota\"\n                query: 'openstack_nova_limits_vcpus_used / openstack_nova_limits_vcpus_max > 0.9 and openstack_nova_limits_vcpus_max > 0'\n                severity: warning\n                comments: |\n                  A value of -1 for limits_vcpus_max means unlimited quota (no limit set).\n              - name: OpenStack Nova tenant memory quota nearly exhausted\n                description: \"Tenant {{ $labels.tenant }} has used over 90% of its memory quota\"\n                query: 'openstack_nova_limits_memory_used / openstack_nova_limits_memory_max > 0.9 and openstack_nova_limits_memory_max > 0'\n                severity: warning\n              - name: OpenStack Nova tenant instance quota nearly exhausted\n                description: \"Tenant {{ $labels.tenant }} has used over 90% of its instance quota\"\n                query: 
'openstack_nova_limits_instances_used / openstack_nova_limits_instances_max > 0.9 and openstack_nova_limits_instances_max > 0'\n                severity: warning\n              - name: OpenStack Cinder tenant volume quota nearly exhausted\n                description: \"Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota\"\n                query: 'openstack_cinder_limits_volume_used_gb / openstack_cinder_limits_volume_max_gb > 0.9 and openstack_cinder_limits_volume_max_gb > 0'\n                severity: warning\n              - name: OpenStack Cinder pool low free capacity\n                description: \"Cinder storage pool {{ $labels.name }} has less than 10% free capacity\"\n                query: 'openstack_cinder_pool_capacity_free_gb / openstack_cinder_pool_capacity_total_gb < 0.1 and openstack_cinder_pool_capacity_total_gb > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Neutron floating IPs associated but not active\n                description: \"{{ $value }} floating IPs are associated to a private IP but are not in ACTIVE state\"\n                query: 'openstack_neutron_floating_ips_associated_not_active > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Neutron routers not active\n                description: \"{{ $value }} Neutron routers are not in ACTIVE state\"\n                query: 'openstack_neutron_routers_not_active > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Neutron subnet IP pool exhaustion\n                description: \"Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool\"\n                query: 'openstack_neutron_network_ip_availabilities_used / openstack_neutron_network_ip_availabilities_total > 0.9 and openstack_neutron_network_ip_availabilities_total > 0'\n                severity: warning\n              - name: 
OpenStack Neutron ports without IPs\n                description: \"{{ $value }} active ports have no IP addresses assigned\"\n                query: 'openstack_neutron_ports_no_ips > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack load balancer not online\n                description: \"Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }}\"\n                query: 'openstack_loadbalancer_loadbalancer_status{operating_status!=\"ONLINE\"} > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Nova instances in ERROR state\n                description: \"{{ $value }} Nova instances are in ERROR state\"\n                query: 'sum(openstack_nova_server_status{status=\"ERROR\"}) > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack Cinder volumes in error state\n                description: \"{{ $value }} Cinder volumes are in an error state\"\n                query: 'openstack_cinder_volume_status_counter{status=~\"error.*\"} > 0'\n                severity: warning\n                for: 5m\n              - name: OpenStack placement resource high usage\n                description: \"Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation\"\n                query: 'openstack_placement_resource_usage / (openstack_placement_resource_total * openstack_placement_resource_allocation_ratio) > 0.9 and openstack_placement_resource_total > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  This alert factors in the allocation ratio to compute effective capacity.\n                  The threshold of 90% is a rough default. 
Adjust based on your allocation ratios and workload patterns.\n\n  - name: CI/CD\n    services:\n      - name: Jenkins\n        exporters:\n          - name: Metric plugin\n            slug: metric-plugin\n            doc_url: https://plugins.jenkins.io/prometheus/\n            rules:\n              - name: Jenkins node offline\n                description: \"At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"jenkins_node_offline_value > 0\"\n                severity: critical\n                for: 5m\n              - name: Jenkins no node online\n                description: \"No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"jenkins_node_online_value == 0\"\n                severity: critical\n              - name: Jenkins healthcheck\n                description: \"Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"jenkins_health_check_score < 1\"\n                severity: critical\n              - name: Jenkins outdated plugins\n                description: \"{{ $value }} plugins need update\"\n                query: \"sum(jenkins_plugins_withUpdate) by (instance) > 3\"\n                severity: warning\n                for: 1d\n              - name: Jenkins builds health score\n                description: \"Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"default_jenkins_builds_health_score < 1\"\n                severity: critical\n              - name: Jenkins run failure total\n                description: \"Job run failures: ({{$value}}) {{$labels.jenkins_job}}. 
Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"delta(jenkins_runs_failure_total[1h]) > 100\"\n                severity: warning\n              - name: Jenkins build tests failing\n                description: \"Last build tests failed: {{$labels.jenkins_job}}. Failed build tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"default_jenkins_builds_last_build_tests_failing > 0\"\n                severity: warning\n              - name: Jenkins last build failed\n                description: \"Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\"\n                query: \"default_jenkins_builds_last_build_result_ordinal == 2\"\n                severity: warning\n                comments: |\n                  * RUNNING  -1 true  - The build is still running.\n                  * SUCCESS   0 true  - The build had no errors.\n                  * UNSTABLE  1 true  - The build had some errors but they were not fatal. 
For example, some tests failed.\n                  * FAILURE   2 false - The build had a fatal error.\n                  * NOT_BUILT 3 false - The module was not built.\n                  * ABORTED   4 false - The build was manually aborted.\n\n      - name: ArgoCD\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://argo-cd.readthedocs.io/en/stable/operator-manual/metrics/\n            rules:\n              - name: ArgoCD service not synced\n                description: Service {{ $labels.name }} run by argo is currently not in sync.\n                query: 'argocd_app_info{sync_status!=\"Synced\"} != 0'\n                severity: warning\n                for: 15m\n              - name: ArgoCD service unhealthy\n                description: Service {{ $labels.name }} run by argo is currently not healthy.\n                query: 'argocd_app_info{health_status!=\"Healthy\"} != 0'\n                severity: warning\n                for: 15m\n\n      - name: FluxCD\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://fluxcd.io/flux/monitoring/metrics/\n            rules:\n              - name: Flux Kustomization Failure\n                description: The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n                query: 'gotk_resource_info{ready=\"False\", customresource_kind=\"Kustomization\"} > 0'\n                severity: warning\n                for: 15m\n              - name: Flux HelmRelease Failure\n                description: The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n                query: 'gotk_resource_info{ready=\"False\", customresource_kind=\"HelmRelease\"} > 0'\n                severity: warning\n                for: 15m\n              - name: 
Flux Source Issue\n                description: Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s).\n                query: 'gotk_resource_info{ready=\"False\", customresource_kind=~\"GitRepository|HelmRepository|Bucket|OCIRepository\"} > 0'\n                severity: warning\n                for: 15m\n              - name: Flux Image Issue\n                description: The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready.\n                query: 'gotk_resource_info{ready=\"False\", customresource_kind=~\"ImagePolicy|ImageRepository|ImageUpdateAutomation\"} > 0'\n                severity: warning\n                for: 15m\n\n      - name: GitLab CI\n        exporters:\n          - name: GitLab built-in exporter\n            slug: gitlab-built-in-exporter\n            doc_url: https://docs.gitlab.com/administration/monitoring/prometheus/gitlab_metrics/\n            rules:\n              # Puma web server\n              - name: GitLab Puma high queued connections\n                description: \"GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread.\"\n                query: \"puma_queued_connections > 5\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Queued connections indicate Puma workers are saturated.\n                  Consider increasing puma['worker_processes'] or puma['max_threads'] in gitlab.rb.\n              - name: GitLab Puma no available pool capacity\n                description: \"GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. 
All threads are busy.\"\n                query: \"puma_pool_capacity == 0\"\n                severity: critical\n                for: 5m\n              - name: GitLab Puma workers not running\n                description: \"GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total.\"\n                query: \"puma_running_workers < puma_workers\"\n                severity: warning\n                for: 5m\n              # HTTP request handling\n              - name: GitLab high HTTP error rate\n                description: \"GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}.\"\n                query: 'sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 and sum(rate(http_requests_total[5m])) > 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  Threshold is 5% of all requests returning server errors.\n                  Check GitLab logs at /var/log/gitlab/ for root cause.\n              - name: GitLab high HTTP request latency\n                description: \"GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds.\"\n                query: \"histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 10\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 10s may need adjustment based on your instance size and workload.\n              # Sidekiq background jobs\n              - name: GitLab Sidekiq jobs failing\n                description: \"GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}.\"\n                query: \"rate(sidekiq_jobs_failed_total[5m]) > 0.1\"\n                severity: warning\n                for: 10m\n                comments: |\n                  This metric requires the emit_sidekiq_histogram_metrics feature flag to 
be enabled.\n                  A sustained failure rate indicates background processing issues.\n              - name: GitLab Sidekiq queue too large\n                description: \"GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}.\"\n                query: \"sum(sidekiq_running_jobs) >= sum(sidekiq_concurrency) * 0.9\"\n                severity: warning\n                for: 10m\n                comments: |\n                  When running jobs approach the concurrency limit, new jobs will queue up.\n                  Consider scaling Sidekiq workers or increasing concurrency.\n              - name: GitLab Sidekiq high job completion time\n                description: \"GitLab Sidekiq job p95 completion time on {{ $labels.instance }} is above 5 minutes ({{ $value | humanizeDuration }}).\"\n                query: \"histogram_quantile(0.95, sum(rate(sidekiq_jobs_completion_seconds_bucket[5m])) by (le, worker)) > 300\"\n                severity: warning\n                for: 10m\n                comments: |\n                  This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.\n              - name: GitLab Sidekiq high queue latency\n                description: \"GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed.\"\n                query: \"histogram_quantile(0.95, sum(rate(sidekiq_jobs_queue_duration_seconds_bucket[5m])) by (le)) > 60\"\n                severity: warning\n                for: 5m\n                comments: |\n                  This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.\n                  High queue latency means jobs are stuck waiting. 
Check Sidekiq concurrency and queue sizes.\n              # Database connection pool\n              - name: GitLab database connection pool saturation\n                description: \"GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy.\"\n                query: \"gitlab_database_connection_pool_busy / gitlab_database_connection_pool_size * 100 > 90 and gitlab_database_connection_pool_size > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  When the pool is near saturation, requests may block waiting for a connection.\n                  Increase db_pool_size in gitlab.rb or investigate slow queries.\n              - name: GitLab database connection pool dead connections\n                description: \"GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections.\"\n                query: \"gitlab_database_connection_pool_dead > 0\"\n                severity: warning\n                for: 5m\n              - name: GitLab database connection pool waiting\n                description: \"GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection.\"\n                query: \"gitlab_database_connection_pool_waiting > 0\"\n                severity: warning\n                for: 5m\n              # CI/CD pipelines\n              - name: GitLab CI pipeline creation slow\n                description: \"GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds.\"\n                query: \"histogram_quantile(0.95, sum(rate(gitlab_ci_pipeline_creation_duration_seconds_bucket[5m])) by (le)) > 30\"\n                severity: warning\n                for: 5m\n              - name: GitLab CI pipeline failures increasing\n                description: \"GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s).\"\n                
query: \"rate(gitlab_ci_pipeline_failure_reasons[5m]) > 0\"\n                severity: warning\n                for: 10m\n                comments: |\n                  This metric may not exist in all GitLab versions. Verify against your GitLab installation.\n              - name: GitLab CI runner authentication failures\n                description: \"GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures).\"\n                query: \"increase(gitlab_ci_runner_authentication_failure_total[5m]) > 5\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Frequent runner auth failures may indicate expired tokens or misconfigured runners.\n              # Ruby process health\n              - name: GitLab high memory usage\n                description: \"GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory.\"\n                query: \"process_resident_memory_bytes{job=~\\\".*gitlab.*\\\"} > 2e+9\"\n                severity: warning\n                for: 10m\n                comments: |\n                  Threshold of 2GB may need adjustment based on your instance size.\n                  High memory usage can lead to OOM kills and service disruptions.\n              - name: GitLab Ruby heap fragmentation\n                description: \"GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. 
High fragmentation wastes memory.\"\n                query: \"ruby_gc_stat_ext_heap_fragmentation{job=~\\\".*gitlab.*\\\"} > 0.5\"\n                severity: warning\n                for: 15m\n                comments: |\n                  Heap fragmentation above 50% means a significant amount of memory is wasted.\n                  A Puma worker restart may help reclaim memory.\n              # Uncaught errors\n              - name: GitLab rack uncaught errors\n                description: \"GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s).\"\n                query: \"rate(rack_uncaught_errors_total[5m]) > 0\"\n                severity: warning\n                for: 5m\n              # Application version / deployment\n              - name: GitLab version mismatch\n                description: \"Multiple GitLab versions are running across the fleet.\"\n                query: 'count(count by (version) (gitlab_build_info)) > 1'\n                severity: warning\n                comments: |\n                  This may happen during a rolling deployment. 
If it persists, investigate incomplete upgrades.\n              # File descriptors\n              - name: GitLab high file descriptor usage\n                description: \"GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors.\"\n                query: 'process_open_fds{job=~\".*gitlab.*\"} / process_max_fds * 100 > 80 and process_max_fds > 0'\n                severity: warning\n                for: 5m\n              # Ruby threads\n              - name: GitLab Ruby threads saturated\n                description: \"GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}).\"\n                query: \"sum by (instance) (gitlab_ruby_threads_running_threads) > on(instance) gitlab_ruby_threads_max_expected_threads * 1.5\"\n                severity: warning\n                for: 10m\n\n          - name: Workhorse\n            slug: workhorse\n            doc_url: https://docs.gitlab.com/administration/monitoring/prometheus/gitlab_metrics/#gitlab-workhorse\n            rules:\n              - name: GitLab Workhorse high error rate\n                description: \"GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors.\"\n                query: 'sum(rate(gitlab_workhorse_http_request_duration_seconds_count{code=~\"5..\"}[5m])) / sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) * 100 > 10 and sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) > 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  Workhorse sits in front of Puma and handles Git HTTP, file uploads, and proxying.\n                  Threshold from GitLab Omnibus default rules: 10% for high-traffic instances.\n              - name: GitLab Workhorse high latency\n                description: \"GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds.\"\n                query: 
\"histogram_quantile(0.95, sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket[5m])) by (le)) > 10\"\n                severity: warning\n                for: 5m\n              - name: GitLab Workhorse high in-flight requests\n                description: \"GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests.\"\n                query: \"gitlab_workhorse_http_in_flight_requests > 100\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 100 may need adjustment based on instance size.\n\n          - name: Gitaly\n            slug: gitaly\n            doc_url: https://docs.gitlab.com/administration/gitaly/monitoring/\n            rules:\n              - name: GitLab Gitaly high gRPC error rate\n                description: \"Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors.\"\n                query: 'sum(rate(grpc_server_handled_total{job=\"gitaly\",grpc_code!=\"OK\"}[5m])) / sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) * 100 > 5 and sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  grpc_code!=\"OK\" includes non-error codes like NotFound, AlreadyExists. 
Consider filtering to specific error codes for less noise.\n              - name: GitLab Gitaly resource exhausted\n                description: \"Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%).\"\n                query: 'sum(rate(grpc_server_handled_total{job=\"gitaly\",grpc_code=\"ResourceExhausted\"}[5m])) / sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) * 100 > 1 and sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  ResourceExhausted errors from Gitaly mean Git operations are being rejected due to\n                  concurrency limits. This directly impacts users trying to push, pull, or clone.\n                  This alert is derived from the GitLab Omnibus default rules.\n              - name: GitLab Gitaly high RPC latency\n                description: \"Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s).\"\n                query: 'histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job=\"gitaly\",grpc_type=\"unary\"}[5m])) by (le)) > 1'\n                severity: warning\n                for: 5m\n              - name: GitLab Gitaly CPU throttled\n                description: \"Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups.\"\n                query: \"rate(gitaly_cgroup_cpu_cfs_throttled_seconds_total[5m]) > 0\"\n                severity: warning\n                for: 5m\n              - name: GitLab Gitaly authentication failures\n                description: \"Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}).\"\n                query: 'increase(gitaly_authentications_total{status=\"failed\"}[5m]) > 0'\n                severity: warning\n              - name: GitLab Gitaly circuit breaker tripped\n                description: \"Gitaly circuit breaker has 
tripped on {{ $labels.instance }}. Git operations are failing.\"\n                query: 'increase(gitaly_circuit_breaker_transitions_total{to_state=\"open\"}[5m]) > 0'\n                severity: critical\n                comments: |\n                  When the circuit breaker trips to \"open\" state, Git operations (push, pull, clone) will fail.\n                  Check Gitaly service health and logs.\n\n      - name: Spinnaker\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://spinnaker.io/docs/setup/other_config/monitoring/\n            rules:\n              - name: Spinnaker circuit breaker open\n                description: \"Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures.\"\n                query: 'resilience4j_circuitbreaker_state{state=\"open\"} == 1'\n                severity: warning\n                for: 5m\n              - name: Spinnaker Orca queue backing up\n                description: \"Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed.\"\n                query: 'queue_ready_depth > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  In a healthy Spinnaker, queue_ready_depth should stay at or near 0.\n                  Sustained non-zero values indicate Orca cannot keep up with incoming work.\n              - name: Spinnaker Orca queue message lag high\n                description: \"Orca queue message lag is {{ $value }}s. 
Pipeline stages are waiting too long before being processed.\"\n                query: 'rate(queue_message_lag_seconds_sum[5m]) / rate(queue_message_lag_seconds_count[5m]) > 30 and rate(queue_message_lag_seconds_count[5m]) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The 30s threshold is a rough default. Adjust based on your pipeline SLOs.\n              - name: Spinnaker dead messages\n                description: \"Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed.\"\n                query: 'rate(queue_dead_messages_total[5m]) > 0'\n                severity: critical\n                for: 2m\n              - name: Spinnaker zombie executions\n                description: \"{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages.\"\n                query: 'rate(queue_zombies_total[5m]) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  Zombies are pipeline executions that are running but have lost their queue entry.\n                  See https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/\n              - name: Spinnaker thread pool exhaustion\n                description: \"Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. 
Pipeline execution throughput is degraded.\"\n                query: 'threadpool_blockingQueueSize > 0'\n                severity: warning\n                for: 5m\n              - name: Spinnaker polling monitor items over threshold\n                description: \"Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers.\"\n                query: 'sum by (monitor, partition) (pollingMonitor_itemsOverThreshold) > 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  When this threshold is exceeded, Igor stops triggering pipelines for the affected monitor.\n                  See https://kb.armory.io/s/article/Hitting-Igor-s-caching-thresholds\n              - name: Spinnaker polling monitor failures\n                description: \"Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines.\"\n                query: 'rate(pollingMonitor_failed_total[5m]) > 0'\n                severity: warning\n                for: 5m\n              - name: Spinnaker high API error rate\n                description: \"Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}.\"\n                query: 'sum by (instance) (rate(controller_invocations_total{status=\"5xx\"}[5m])) / sum by (instance) (rate(controller_invocations_total[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  The 5% threshold is a rough default. 
Adjust based on your traffic patterns.\n              - name: Spinnaker API rate limit throttling\n                description: \"Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second).\"\n                query: 'rate(rateLimitThrottling_total[5m]) > 0'\n                severity: warning\n                for: 2m\n              - name: Spinnaker Clouddriver high error rate\n                description: \"Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing.\"\n                query: 'sum by (instance) (rate(controller_invocations_total{status=\"5xx\", job=~\".*clouddriver.*\"}[5m])) / sum by (instance) (rate(controller_invocations_total{job=~\".*clouddriver.*\"}[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total{job=~\".*clouddriver.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Spinnaker AWS rate limiting\n                description: \"Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower.\"\n                query: 'amazonClientProvider_rateLimitDelayMil > 1000'\n                severity: warning\n                for: 5m\n                comments: |\n                  This metric is specific to AWS cloud providers in Clouddriver.\n                  The 1000ms threshold is a rough default. 
Adjust based on your AWS usage patterns.\n\n  - name: Network and security\n    services:\n      - name: SpeedTest\n        exporters:\n          - name: Speedtest exporter\n            slug: nlamirault-speedtest-exporter\n            doc_url: https://github.com/nlamirault/speedtest_exporter\n            rules:\n              - name: SpeedTest Slow Internet Download\n                description: Internet download speed is currently {{humanize $value}} Mbps.\n                query: \"avg_over_time(speedtest_download[10m]) < 100\"\n                severity: warning\n              - name: SpeedTest Slow Internet Upload\n                description: Internet upload speed is currently {{humanize $value}} Mbps.\n                query: \"avg_over_time(speedtest_upload[10m]) < 20\"\n                severity: warning\n\n      - name: SSL/TLS\n        exporters:\n          - name: ssl_exporter\n            slug: ribbybibby-ssl-exporter\n            doc_url: https://github.com/ribbybibby/ssl_exporter\n            rules:\n              - name: SSL certificate probe failed\n                description: Failed to fetch SSL information for {{ $labels.instance }}\n                query: ssl_probe_success == 0\n                severity: critical\n              - name: SSL certificate OCSP status unknown\n                description: Failed to get the OCSP status for {{ $labels.instance }}\n                query: ssl_ocsp_response_status == 2\n                severity: warning\n              - name: SSL certificate revoked\n                description: SSL certificate revoked for {{ $labels.instance }}\n                query: ssl_ocsp_response_status == 1\n                severity: critical\n              - name: SSL certificate expiry (< 7 days)\n                description: \"{{ $labels.instance }} certificate expires in less than 7 days\"\n                query: ssl_verified_cert_not_after{chain_no=\"0\"} - time() < 86400 * 7\n                severity: warning\n\n      - name: cert-manager\n       
 exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://cert-manager.io/docs/devops-tips/prometheus-metrics/\n            rules:\n              - name: Cert-Manager absent\n                description: Cert-Manager has disappeared from Prometheus service discovery. New certificates will not be able to be minted, and existing ones can't be renewed until cert-manager is back.\n                query: 'absent(up{job=\"cert-manager\"})'\n                severity: critical\n                for: 10m\n              - name: Cert-Manager certificate expiring soon\n                description: The certificate {{ $labels.name }} is expiring in less than 21 days.\n                query: 'avg by (exported_namespace, namespace, name) (certmanager_certificate_expiration_timestamp_seconds - time()) < (21 * 24 * 3600)'\n                severity: warning\n                for: 1h\n                comments: |\n                  Threshold of 21 days is a rough default. ACME certificates are typically renewed 30 days before expiry, so expiring within 21 days may indicate issuer misconfiguration.\n              - name: Cert-Manager certificate not ready\n                description: \"The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic.\"\n                query: 'max by (name, exported_namespace, namespace, condition) (certmanager_certificate_ready_status{condition!=\"True\"} == 1)'\n                severity: critical\n                for: 10m\n              - name: Cert-Manager hitting ACME rate limits\n                description: Cert-Manager is being rate-limited by the ACME provider. 
Certificate issuance and renewal may be blocked for up to a week.\n                query: 'sum by (host) (rate(certmanager_http_acme_client_request_count{status=\"429\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  In cert-manager 1.19+, the metric was renamed (dropped http_ prefix). Verify metric name against your version.\n\n      - name: Juniper\n        exporters:\n          - name: czerwonk/junos_exporter\n            slug: czerwonk-junos-exporter\n            doc_url: https://github.com/czerwonk/junos_exporter\n            rules:\n              - name: Juniper switch down\n                description: The switch appears to be down\n                query: junos_up == 0\n                severity: critical\n              - name: Juniper critical Bandwidth Usage 1GiB\n                description: Interface is highly saturated. (> 0.90 Gbit/s)\n                query: \"rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90\"\n                severity: critical\n                for: 1m\n              - name: Juniper warning Bandwidth Usage 1GiB\n                description: Interface is getting saturated. 
(> 0.80 Gbit/s)\n                query: \"rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80\"\n                severity: warning\n                for: 1m\n\n      - name: CoreDNS\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: CoreDNS Panic Count\n                description: Number of CoreDNS panics encountered\n                query: \"increase(coredns_panics_total[1m]) > 0\"\n                severity: critical\n\n      - name: Freeswitch\n        exporters:\n          - name: znerol/prometheus-freeswitch-exporter\n            slug: znerol-freeswitch-exporter\n            doc_url: https://pypi.org/project/prometheus-freeswitch-exporter\n            rules:\n              - name: Freeswitch down\n                description: Freeswitch is unresponsive\n                query: \"freeswitch_up == 0\"\n                severity: critical\n              - name: Freeswitch Sessions Warning\n                description: 'High session usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"(freeswitch_session_active * 100 / freeswitch_session_limit) > 80 and freeswitch_session_limit > 0\"\n                severity: warning\n                for: 10m\n              - name: Freeswitch Sessions Critical\n                description: 'High session usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"(freeswitch_session_active * 100 / freeswitch_session_limit) > 90 and freeswitch_session_limit > 0\"\n                severity: critical\n                for: 5m\n\n      - name: Hashicorp Vault\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://github.com/hashicorp/vault/blob/master/website/content/docs/configuration/telemetry.mdx#prometheus\n            rules:\n              - name: Vault sealed\n                description: \"Vault instance 
is sealed on {{ $labels.instance }}\"\n                query: \"vault_core_unsealed == 0\"\n                severity: critical\n              - name: Vault too many pending tokens\n                description: 'Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"avg(vault_token_create_count - vault_token_store_count) > 0\"\n                severity: warning\n                for: 5m\n              - name: Vault too many infinity tokens\n                description: 'Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: 'vault_token_count_by_ttl{creation_ttl=\"+Inf\"} > 3'\n                severity: warning\n                for: 5m\n              - name: Vault cluster health\n                description: 'Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%'\n                query: \"sum(vault_core_active) / count(vault_core_active) <= 0.5\"\n                severity: critical\n\n      - name: Keycloak\n        exporters:\n          - name: aerogear/keycloak-metrics-spi\n            slug: aerogear-keycloak-metrics-spi\n            doc_url: https://github.com/aerogear/keycloak-metrics-spi\n            rules:\n              - name: Keycloak high login failure rate\n                description: \"More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\"\n                query: '(sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])) / (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])))) * 100 > 5 and (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m]))) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  Threshold of 5% is a rough default. 
Adjust based on your user base and expected error rates.\n                  A spike in failed logins may indicate a brute-force attack or misconfigured client.\n              - name: Keycloak no successful logins\n                description: \"No successful logins in realm {{ $labels.realm }} for the last 15 minutes.\"\n                query: 'sum by (realm) (rate(keycloak_logins_total[15m])) == 0 and (sum by (realm) (rate(keycloak_logins_total[15m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[15m]))) > 0'\n                severity: critical\n                for: 5m\n                comments: Only fires when login attempts exist but none succeed — may indicate an authentication outage.\n              - name: Keycloak high token refresh error rate\n                description: \"More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\"\n                query: '(sum by (realm) (rate(keycloak_refresh_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_refresh_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_refresh_tokens_total[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold of 10% is a rough default. High refresh token errors may indicate expired sessions or token store issues.\n              - name: Keycloak high code-to-token exchange error rate\n                description: \"More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\"\n                query: '(sum by (realm) (rate(keycloak_code_to_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_code_to_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_code_to_tokens_total[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold of 10% is a rough default. 
Code-to-token failures may indicate misconfigured OAuth clients or replay attacks.\n              - name: Keycloak high registration failure rate\n                description: \"More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\"\n                query: '(sum by (realm) (rate(keycloak_registrations_errors_total[5m])) / sum by (realm) (rate(keycloak_registrations_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_registrations_total[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold of 10% is a rough default.\n              - name: Keycloak slow request response time\n                description: \"Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average.\"\n                query: 'sum by (method) (rate(keycloak_request_duration_sum[5m])) / sum by (method) (rate(keycloak_request_duration_count[5m])) > 2000 and sum by (method) (rate(keycloak_request_duration_count[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: |\n                  keycloak_request_duration is in milliseconds. 
Threshold of 2000ms (2 seconds) is a rough default.\n\n      - name: Cloudflare\n        exporters:\n          - name: lablabs/cloudflare-exporter\n            slug: lablabs-cloudflare-exporter\n            doc_url: https://github.com/lablabs/cloudflare-exporter\n            rules:\n              - name: Cloudflare http 4xx error rate\n                description: \"Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\"\n                query: '(sum by(zone) (rate(cloudflare_zone_requests_status{status=~\"^4..\"}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[15m])) > 0'\n                severity: warning\n              - name: Cloudflare http 5xx error rate\n                description: \"Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }})\"\n                query: '(sum by (zone) (rate(cloudflare_zone_requests_status{status=~\"^5..\"}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[5m])) > 0'\n                severity: critical\n\n      - name: SNMP\n        exporters:\n          - name: prometheus/snmp_exporter\n            slug: snmp-exporter\n            doc_url: https://github.com/prometheus/snmp_exporter\n            comments: |\n              These rules use standard IF-MIB and SNMPv2-MIB metrics. 
Metric names depend on your snmp.yml module configuration.\n              Thresholds for bandwidth and error rates are rough defaults - adjust to your environment.\n            rules:\n              - name: SNMP target down\n                description: \"SNMP device {{ $labels.instance }} is unreachable.\"\n                query: 'up{job=~\"snmp.*\"} == 0'\n                severity: critical\n                for: 5m\n                comments: From the official snmp-mixin.\n              - name: SNMP interface down\n                description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up.\"\n                query: '(ifOperStatus{job=~\"snmp.*\"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~\"snmp.*\"} == 1)'\n                severity: critical\n                for: 2m\n              - name: SNMP interface high inbound error rate\n                description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%.\"\n                query: 'rate(ifInErrors{job=~\"snmp.*\"}[5m]) / (rate(ifHCInUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInMulticastPkts{job=~\"snmp.*\"}[5m])) > 0.05 and (rate(ifHCInUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInMulticastPkts{job=~\"snmp.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold is a rough default. 
Adjust based on your network environment.\n              - name: SNMP interface high outbound error rate\n                description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%.\"\n                query: 'rate(ifOutErrors{job=~\"snmp.*\"}[5m]) / (rate(ifHCOutUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutMulticastPkts{job=~\"snmp.*\"}[5m])) > 0.05 and (rate(ifHCOutUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutMulticastPkts{job=~\"snmp.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold is a rough default. Adjust based on your network environment.\n              - name: SNMP interface high bandwidth usage inbound\n                description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%.\"\n                query: 'rate(ifHCInOctets{job=~\"snmp.*\"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0'\n                severity: warning\n                for: 15m\n                comments: |\n                  Threshold is a rough default. ifSpeed is a Gauge32 that maxes out at ~4.29 Gbps. For 10G+ interfaces, use ifHighSpeed (in Mbps) instead.\n              - name: SNMP interface high bandwidth usage outbound\n                description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%.\"\n                query: 'rate(ifHCOutOctets{job=~\"snmp.*\"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0'\n                severity: warning\n                for: 15m\n                comments: |\n                  Threshold is a rough default. ifSpeed is a Gauge32 that maxes out at ~4.29 Gbps. 
For 10G+ interfaces, use ifHighSpeed (in Mbps) instead.\n              - name: SNMP device restarted\n                description: \"SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes).\"\n                query: \"sysUpTime / 100 < 300\"\n                severity: info\n                comments: sysUpTime is in centiseconds (hundredths of a second).\n\n      - name: Cilium\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://docs.cilium.io/en/stable/observability/metrics/\n            rules:\n              # Agent health\n              - name: Cilium agent unreachable nodes\n                description: \"Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\"\n                query: \"sum(cilium_unreachable_nodes{}) by (pod) > 0\"\n                severity: warning\n                for: 15m\n                comments: |\n                  Metric name depends on Cilium version. Use cilium_unreachable_nodes (older) or cilium_node_connectivity_status (1.14+).\n              - name: Cilium agent unreachable health endpoints\n                description: \"Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\"\n                query: \"sum(cilium_unreachable_health_endpoints{}) by (pod) > 0\"\n                severity: warning\n                for: 15m\n                comments: |\n                  Metric name depends on Cilium version. Use cilium_unreachable_health_endpoints (older) or cilium_node_connectivity_status (1.14+).\n              - name: Cilium agent failing controllers\n                description: \"Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). 
Check cilium-agent logs for details.\"\n                query: \"sum(cilium_controllers_failing{}) by (pod) > 0\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Metric name depends on Cilium version. Use cilium_controllers_failing (older) or cilium_controllers_runs_total (1.14+).\n              # Endpoints\n              - name: Cilium agent endpoint failures\n                description: \"Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\"\n                query: 'sum(cilium_endpoint_state{endpoint_state=\"invalid\"}) by (pod) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent endpoint regeneration failures\n                description: \"Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\"\n                query: 'sum(rate(cilium_endpoint_regenerations_total{outcome=\"fail\"}[5m])) by (pod) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent endpoint update failure\n                description: \"Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\"\n                query: 'sum(rate(cilium_k8s_client_api_calls_total{method=~\"(PUT|POST|PATCH)\", endpoint=\"endpoint\", return_code!~\"2[0-9][0-9]\"}[5m])) by (pod, method, return_code) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent endpoint create failure\n                description: \"Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. 
New pods may fail to get networking.\"\n                query: 'sum(rate(cilium_api_limiter_processed_requests_total{api_call=~\"endpoint-create\", outcome=\"fail\"}[1m])) by (pod, api_call) > 0'\n                severity: info\n                for: 5m\n              # BPF maps\n              - name: Cilium agent map operation failures\n                description: \"Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\"\n                query: 'sum(rate(cilium_bpf_map_ops_total{outcome=\"fail\"}[5m])) by (map_name, pod) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent BPF map pressure\n                description: \"Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\"\n                query: \"cilium_bpf_map_pressure{} > 0.9\"\n                severity: warning\n                for: 5m\n                comments: Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped.\n              # Conntrack and NAT\n              - name: Cilium agent conntrack table full\n                description: \"Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.\"\n                query: 'sum(rate(cilium_drop_count_total{reason=\"CT: Map insertion failed\"}[5m])) by (pod) > 0'\n                severity: critical\n                for: 5m\n              - name: Cilium agent conntrack failed garbage collection\n                description: \"Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. 
Stale entries may accumulate.\"\n                query: 'sum(rate(cilium_datapath_conntrack_gc_runs_total{status=\"uncompleted\"}[5m])) by (pod) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent NAT table full\n                description: \"Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\"\n                query: 'sum(rate(cilium_drop_count_total{reason=\"No mapping for NAT masquerade\"}[1m])) by (pod) > 0'\n                severity: critical\n                for: 5m\n              # Packet drops\n              - name: Cilium agent high denied rate\n                description: \"Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.\"\n                query: 'sum(rate(cilium_drop_count_total{reason=\"Policy denied\"}[1m])) by (pod) > 0'\n                severity: info\n                for: 10m\n                comments: Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked.\n              - name: Cilium agent high drop rate\n                description: \"Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\"\n                query: 'sum(rate(cilium_drop_count_total{reason!~\"Policy denied\"}[5m])) by (pod, reason) > 0'\n                severity: warning\n                for: 5m\n              # Policy\n              - name: Cilium agent policy map pressure\n                description: \"Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. 
New policies may fail to apply.\"\n                query: 'sum(cilium_bpf_map_pressure{map_name=~\"cilium_policy_.*\"}) by (pod) > 0.9'\n                severity: warning\n                for: 5m\n              - name: Cilium agent policy import errors\n                description: \"Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\"\n                query: 'sum(rate(cilium_policy_change_total{outcome=\"fail\"}[5m])) by (pod) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent policy implementation delay\n                description: \"Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\"\n                query: \"histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60\"\n                severity: warning\n                for: 5m\n                comments: Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity.\n              # Identity\n              - name: Cilium node-local high identity allocation\n                description: \"Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\"\n                query: '(sum(cilium_identity{type=\"node_local\"}) by (pod) / (2^16-1)) > 0.8'\n                severity: warning\n                for: 5m\n              - name: Cilium cluster high identity allocation\n                description: \"Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\"\n                query: '(sum(cilium_identity{type=\"cluster_local\"}) by () / (2^16-256)) > 0.8'\n                severity: warning\n                for: 5m\n              # IPAM\n              - name: Cilium operator exhausted IPAM IPs\n                description: \"Cilium operator has no available IPAM IPs. 
New pods will fail to schedule networking.\"\n                query: 'sum(cilium_operator_ipam_ips{type=\"available\"}) by () <= 0'\n                severity: critical\n                for: 5m\n              - name: Cilium operator low available IPAM IPs\n                description: \"Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\"\n                query: 'sum(cilium_operator_ipam_ips{type!=\"available\"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size.\n              - name: Cilium operator IPAM interface creation failures\n                description: \"Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\"\n                query: 'sum(rate(cilium_operator_ipam_interface_creation_ops{status!=\"success\"}[5m])) by () > 0'\n                severity: warning\n                for: 10m\n                comments: |\n                  Some Cilium versions may not have a status label on this metric. Verify against your Cilium version.\n              # API and K8s client\n              - name: Cilium agent API errors\n                description: \"Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). 
Agent may be unhealthy.\"\n                query: 'sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~\"5[0-9][0-9]\"}[5m])) by (pod, return_code) > 0'\n                severity: warning\n                for: 5m\n              - name: Cilium agent Kubernetes client errors\n                description: \"Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\"\n                query: 'sum(rate(cilium_k8s_client_api_calls_total{endpoint!=\"metrics\", return_code!~\"2[0-9][0-9]\"}[5m])) by (pod, endpoint, return_code) > 0'\n                severity: info\n                for: 5m\n              # ClusterMesh\n              - name: Cilium ClusterMesh remote cluster not ready\n                description: \"Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\"\n                query: \"count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0\"\n                severity: critical\n                for: 5m\n              - name: Cilium ClusterMesh remote cluster failing\n                description: \"Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing.\"\n                query: \"sum(rate(cilium_clustermesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0\"\n                severity: critical\n                for: 5m\n              # KVStoreMesh\n              - name: Cilium KVStoreMesh remote cluster not ready\n                description: \"Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\"\n                query: \"count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0\"\n                severity: critical\n                for: 5m\n              - name: Cilium KVStoreMesh remote 
cluster failing\n                description: \"Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures.\"\n                query: \"sum(rate(cilium_kvstoremesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0\"\n                severity: critical\n                for: 5m\n              - name: Cilium KVStoreMesh sync errors\n                description: \"Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\"\n                query: \"sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0\"\n                severity: critical\n                for: 5m\n              # Hubble\n              - name: Cilium Hubble lost events\n                description: \"Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\"\n                query: \"sum(rate(hubble_lost_events_total[5m])) by (pod) > 0\"\n                severity: warning\n                for: 5m\n              - name: Cilium Hubble high DNS error rate\n                description: \"Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\"\n                query: 'sum(rate(hubble_dns_responses_total{rcode!=\"No Error\"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0'\n                severity: warning\n                for: 5m\n                comments: Threshold of 10% is a rough default. 
Some DNS errors may be normal depending on your workload.\n\n      - name: WireGuard\n        exporters:\n          - name: MindFlavor/prometheus_wireguard_exporter\n            slug: mindflavor-prometheus-wireguard-exporter\n            doc_url: https://github.com/MindFlavor/prometheus_wireguard_exporter\n            rules:\n              - name: WireGuard peer handshake too old\n                description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down.\"\n                query: 'time() - wireguard_latest_handshake_seconds > 300 and wireguard_latest_handshake_seconds > 0'\n                severity: warning\n                for: 2m\n                comments: |\n                  The threshold of 300 seconds (5 minutes) is a rough default. WireGuard peers that are idle but reachable\n                  typically re-handshake every 2 minutes. Adjust based on your keepalive interval.\n                  The `> 0` guard excludes peers that have never completed a handshake (covered by a separate rule).\n              - name: WireGuard peer handshake never established\n                description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. Check peer configuration and network connectivity.\"\n                query: 'wireguard_latest_handshake_seconds == 0'\n                severity: critical\n                for: 5m\n                comments: |\n                  This alert will fire for all offline mobile/laptop peers. 
Consider filtering by expected-online peers.\n              - name: WireGuard no traffic on peer\n                description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake.\"\n                query: '(rate(wireguard_sent_bytes_total[15m]) + rate(wireguard_received_bytes_total[15m])) == 0 and wireguard_latest_handshake_seconds > 0 and (time() - wireguard_latest_handshake_seconds) < 300'\n                severity: warning\n                for: 15m\n                comments: |\n                  This alert fires when a peer has a recent handshake but zero traffic flow.\n                  May indicate routing issues or a misconfigured allowed-ips.\n                  Only useful if you expect continuous traffic on all peers.\n\n  - name: Storage\n    services:\n      - name: Ceph\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://docs.ceph.com/en/quincy/mgr/prometheus/\n            rules:\n              - name: Ceph State\n                description: Ceph instance unhealthy\n                query: \"ceph_health_status != 0\"\n                severity: critical\n              - name: Ceph monitor clock skew\n                description: Ceph monitor clock skew detected. 
Please check NTP and hardware clock settings\n                query: \"abs(ceph_monitor_clock_skew_seconds) > 0.2\"\n                severity: warning\n                for: 2m\n              - name: Ceph monitor low space\n                description: Ceph monitor storage is low.\n                query: \"ceph_monitor_avail_percent < 10\"\n                severity: warning\n                for: 2m\n              - name: Ceph OSD Down\n                description: Ceph Object Storage Daemon Down\n                query: \"ceph_osd_up == 0\"\n                severity: critical\n              - name: Ceph high OSD latency\n                description: \"Ceph Object Storage Daemon latency is high. Please check that it is not stuck in a bad state.\"\n                query: \"ceph_osd_perf_apply_latency_seconds > 5\"\n                severity: warning\n                for: 1m\n              - name: Ceph OSD low space\n                description: Ceph Object Storage Daemon is running out of space. Please add more disks.\n                query: ceph_osd_utilization > 90\n                severity: warning\n                for: 2m\n              - name: Ceph OSD reweighted\n                description: Ceph Object Storage Daemon takes too much time to resize.\n                query: \"ceph_osd_weight < 1\"\n                severity: warning\n                for: 2m\n              - name: Ceph PG down\n                description: Some Ceph placement groups are down. Please ensure that all the data are available.\n                query: \"ceph_pg_down > 0\"\n                severity: critical\n              - name: Ceph PG incomplete\n                description: Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n                query: \"ceph_pg_incomplete > 0\"\n                severity: critical\n              - name: Ceph PG inconsistent\n                description: Some Ceph placement groups are inconsistent. 
Data is available but inconsistent across nodes.\n                query: ceph_pg_inconsistent > 0\n                severity: warning\n              - name: Ceph PG activation long\n                description: Some Ceph placement groups are taking too long to activate.\n                query: \"ceph_pg_activating > 0\"\n                severity: warning\n                for: 2m\n              - name: Ceph PG backfill full\n                description: Some Ceph placement groups are located on a full Object Storage Daemon in the cluster. Those PGs may become unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\n                query: \"ceph_pg_backfill_toofull > 0\"\n                severity: warning\n                for: 2m\n              - name: Ceph PG unavailable\n                description: Some Ceph placement groups are unavailable.\n                query: \"ceph_pg_total - ceph_pg_active > 0\"\n                severity: critical\n\n      - name: ZFS\n        exporters:\n          - name: node-exporter\n            slug: node-exporter\n            doc_url: https://github.com/prometheus/node_exporter\n            rules:\n              - name: ZFS offline pool\n                description: \"A ZFS zpool is in an unexpected state: {{ $labels.state }}.\"\n                query: 'node_zfs_zpool_state{state!=\"online\"} > 0'\n                severity: critical\n                for: 1m\n          - name: ZFS exporter\n            slug: zfs_exporter\n            doc_url: https://github.com/pdf/zfs_exporter\n            rules:\n              - name: ZFS pool out of space\n                description: Disk is almost full (< 10% left)\n                query: \"zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0 and zfs_pool_size_bytes > 0\"\n                severity: warning\n              - name: ZFS pool unhealthy\n                description: ZFS pool state is {{ $value }}. 
See comments for more information.\n                query: \"zfs_pool_health > 0\"\n                severity: critical\n                comments: |\n                  0: ONLINE\n                  1: DEGRADED\n                  2: FAULTED\n                  3: OFFLINE\n                  4: UNAVAIL\n                  5: REMOVED\n                  6: SUSPENDED\n              - name: ZFS collector failed\n                description: ZFS collector for {{ $labels.instance }} has failed to collect information\n                query: \"zfs_scrape_collector_success != 1\"\n                severity: warning\n\n      - name: OpenEBS\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: OpenEBS used pool capacity\n                description: \"OpenEBS pool uses more than 80% of its capacity\"\n                query: \"openebs_used_pool_capacity_percent > 80\"\n                severity: warning\n                for: 2m\n\n      - name: Minio\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Minio cluster disk offline\n                description: \"Minio cluster disk is offline\"\n                query: \"minio_cluster_drive_offline_total > 0\"\n                severity: critical\n              - name: Minio node disk offline\n                description: \"Minio cluster node disk is offline\"\n                query: \"minio_cluster_nodes_offline_total > 0\"\n                severity: critical\n              - name: Minio disk space usage\n                description: \"Minio available free space is low (< 10%)\"\n                query: minio_cluster_capacity_raw_free_bytes / minio_cluster_capacity_raw_total_bytes * 100 < 10 and minio_cluster_capacity_raw_total_bytes > 0\n                severity: warning\n\n  - name: Cloud providers\n    services:\n      - name: AWS CloudWatch\n        exporters:\n          - 
name: prometheus/cloudwatch_exporter\n            slug: prometheus-cloudwatch-exporter\n            doc_url: https://github.com/prometheus/cloudwatch_exporter\n            comments: |\n              CloudWatch metrics are exported as aws_{namespace}_{metric_name}_{statistic} gauges.\n              The rules below cover both exporter health and common AWS service alerts.\n              Adjust thresholds and label filters to match your CloudWatch exporter configuration.\n            rules:\n              - name: CloudWatch exporter scrape error\n                description: \"CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API.\"\n                query: \"cloudwatch_exporter_scrape_error > 0\"\n                severity: warning\n                for: 5m\n              - name: CloudWatch exporter slow scrape\n                description: \"CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters.\"\n                query: \"cloudwatch_exporter_scrape_duration_seconds > 300\"\n                severity: warning\n                for: 5m\n              - name: CloudWatch API high request rate\n                description: \"CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs.\"\n                query: \"sum by (instance, namespace) (rate(cloudwatch_requests_total[5m])) * 60 > 100\"\n                severity: warning\n                comments: |\n                  CloudWatch API calls cost money (~$0.01 per 1000 GetMetricData requests).\n                  100 requests/minute ≈ $45/month. 
Adjust the threshold based on your budget.\n              - name: AWS EC2 high CPU utilization\n                description: \"EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%).\"\n                query: \"aws_ec2_cpuutilization_average > 90\"\n                severity: warning\n                for: 15m\n                comments: Requires EC2 CPUUtilization metric configured in the CloudWatch exporter.\n              - name: AWS RDS low free storage space\n                description: \"RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining).\"\n                query: \"aws_rds_free_storage_space_average < 2000000000\"\n                severity: warning\n                for: 5m\n                comments: |\n                  Requires RDS FreeStorageSpace metric. The threshold of 2GB is a rough default.\n                  Adjust based on your database size.\n              - name: AWS RDS high CPU utilization\n                description: \"RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%).\"\n                query: \"aws_rds_cpuutilization_average > 90\"\n                severity: warning\n                for: 15m\n                comments: Requires RDS CPUUtilization metric configured in the CloudWatch exporter.\n              - name: AWS RDS high database connections\n                description: \"RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections.\"\n                query: \"aws_rds_database_connections_average > 100\"\n                severity: warning\n                for: 5m\n                comments: |\n                  The threshold depends on the RDS instance class. 
Adjust based on your\n                  instance type's max_connections parameter.\n              - name: AWS SQS queue messages visible\n                description: \"SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed.\"\n                query: \"aws_sqs_approximate_number_of_messages_visible_average > 1000\"\n                severity: warning\n                for: 10m\n                comments: |\n                  Requires SQS ApproximateNumberOfMessagesVisible metric. The threshold of 1000\n                  is a rough default. Adjust based on your expected queue depth.\n              - name: AWS SQS message age too old\n                description: \"SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s).\"\n                query: \"aws_sqs_approximate_age_of_oldest_message_maximum > 3600\"\n                severity: warning\n                comments: Requires SQS ApproximateAgeOfOldestMessage metric.\n              - name: AWS ALB unhealthy targets\n                description: \"ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}.\"\n                query: \"aws_applicationelb_unhealthy_host_count_average > 0\"\n                severity: critical\n                for: 5m\n                comments: Requires ApplicationELB UnHealthyHostCount metric.\n              - name: AWS ALB high 5xx error rate\n                description: \"ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%).\"\n                query: \"(aws_applicationelb_httpcode_elb_5_xx_count_sum / aws_applicationelb_request_count_sum) * 100 > 5 and aws_applicationelb_request_count_sum > 0\"\n                severity: critical\n                for: 5m\n                comments: Requires ApplicationELB HTTPCode_ELB_5XX_Count and RequestCount metrics.\n              - name: AWS ALB high target response time\n                description: \"ALB {{ 
$labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s).\"\n                query: \"aws_applicationelb_target_response_time_average > 2\"\n                severity: warning\n                for: 5m\n                comments: Requires ApplicationELB TargetResponseTime metric.\n              - name: AWS Lambda high error rate\n                description: \"Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%).\"\n                query: \"(aws_lambda_errors_sum / aws_lambda_invocations_sum) * 100 > 5 and aws_lambda_invocations_sum > 0\"\n                severity: warning\n                for: 5m\n                comments: Requires Lambda Errors and Invocations metrics.\n\n      - name: Google Cloud Stackdriver\n        exporters:\n          - name: prometheus-community/stackdriver_exporter\n            slug: stackdriver-exporter\n            doc_url: https://github.com/prometheus-community/stackdriver_exporter\n            comments: |\n              Self-monitoring metrics use the stackdriver_monitoring_* prefix.\n              All self-monitoring metrics include a project_id label.\n            rules:\n              - name: Stackdriver exporter scrape error\n                description: \"Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}.\"\n                query: \"stackdriver_monitoring_last_scrape_error > 0\"\n                severity: warning\n                for: 5m\n              - name: Stackdriver exporter slow scrape\n                description: \"Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s).\"\n                query: \"stackdriver_monitoring_last_scrape_duration_seconds > 300\"\n                severity: warning\n                for: 5m\n              - name: Stackdriver exporter scrape errors increasing\n                description: \"Stackdriver exporter has had 
{{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}.\"\n                query: \"increase(stackdriver_monitoring_scrape_errors_total[15m]) > 5\"\n                severity: warning\n              - name: Stackdriver exporter high API calls\n                description: \"Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas.\"\n                query: \"rate(stackdriver_monitoring_api_calls_total[5m]) * 60 > 100\"\n                severity: warning\n              - name: Stackdriver exporter scrape stale\n                description: \"Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes.\"\n                query: \"time() - stackdriver_monitoring_last_scrape_timestamp > 600\"\n                severity: warning\n\n      - name: DigitalOcean\n        exporters:\n          - name: metalmatze/digitalocean_exporter\n            slug: digitalocean-exporter\n            doc_url: https://github.com/metalmatze/digitalocean_exporter\n            rules:\n              - name: DigitalOcean droplet down\n                description: \"DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running.\"\n                query: \"digitalocean_droplet_up == 0\"\n                severity: critical\n                for: 5m\n              - name: DigitalOcean account not active\n                description: \"DigitalOcean account is not active. 
It may be suspended or locked.\"\n                query: \"digitalocean_account_active != 1\"\n                severity: critical\n                for: 5m\n              - name: DigitalOcean database down\n                description: \"DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline.\"\n                query: \"digitalocean_database_status == 0\"\n                severity: critical\n                for: 2m\n              - name: DigitalOcean Kubernetes cluster down\n                description: \"DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running.\"\n                query: \"digitalocean_kubernetes_cluster_up == 0\"\n                severity: critical\n                for: 5m\n              - name: DigitalOcean load balancer down\n                description: \"DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active.\"\n                query: \"digitalocean_loadbalancer_status == 0\"\n                severity: critical\n                for: 2m\n              - name: DigitalOcean load balancer no backends\n                description: \"DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached.\"\n                query: \"digitalocean_loadbalancer_droplets == 0\"\n                severity: warning\n                for: 1m\n              - name: DigitalOcean floating IP not assigned\n                description: \"DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet.\"\n                query: \"digitalocean_floating_ipv4_active == 0\"\n                severity: warning\n              - name: DigitalOcean active incidents\n                description: \"DigitalOcean platform has {{ $value }} active incident(s).\"\n                query: \"digitalocean_incidents_total > 0\"\n                severity: warning\n              - name: DigitalOcean 
exporter collection errors\n                description: \"DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors.\"\n                query: \"increase(digitalocean_errors_total[5m]) > 0\"\n                severity: warning\n                for: 5m\n              - name: DigitalOcean droplet limit approaching\n                description: \"DigitalOcean account is using {{ $value }}% of its droplet quota.\"\n                query: \"(count(digitalocean_droplet_up) / digitalocean_account_droplet_limit) * 100 > 80 and digitalocean_account_droplet_limit > 0\"\n                severity: warning\n                comments: Fires when more than 80% of the account's droplet limit is in use.\n\n      - name: Azure\n        exporters:\n          - name: webdevops/azure-metrics-exporter\n            slug: azure-metrics-exporter\n            doc_url: https://github.com/webdevops/azure-metrics-exporter\n            comments: |\n              The exporter uses azurerm_resource_metric as the default metric name for forwarded Azure Monitor metrics.\n              The metric name can be customized via the name parameter in probe configuration.\n              Self-monitoring metrics use the azurerm_stats_* and azurerm_api_* prefixes.\n            rules:\n              - name: Azure exporter request errors\n                description: \"Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes.\"\n                query: 'increase(azurerm_stats_metric_requests{result=\"error\"}[15m]) > 5'\n                severity: warning\n              - name: Azure exporter high error rate\n                description: \"Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%).\"\n                query: 'sum by (instance) (rate(azurerm_stats_metric_requests{result=\"error\"}[5m])) / sum by (instance) (rate(azurerm_stats_metric_requests[5m])) * 100 > 10 and sum by (instance) 
(rate(azurerm_stats_metric_requests[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Azure API read rate limit approaching\n                description: \"Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\"\n                query: 'azurerm_api_ratelimit{type=\"read\"} < 100'\n                severity: warning\n                comments: |\n                  Azure Resource Manager enforces rate limits per subscription.\n                  The threshold of 100 remaining calls is a rough default. Adjust based on your\n                  scrape interval and number of monitored resources.\n              - name: Azure API write rate limit approaching\n                description: \"Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\"\n                query: 'azurerm_api_ratelimit{type=\"write\"} < 50'\n                severity: warning\n              - name: Azure exporter slow collection\n                description: \"Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s).\"\n                query: \"azurerm_stats_metric_collecttime > 300\"\n                severity: warning\n                for: 5m\n\n\n  - name: Observability\n    services:\n      - name: Thanos\n        exporters:\n          - name: Thanos Compactor\n            slug: thanos-compactor\n            rules:\n              - name: Thanos Compactor Multiple Running\n                description: \"No more than one Thanos Compact instance should be running at once. 
There are {{$value}} instances running.\"\n                query: 'sum by (job) (up{job=~\".*thanos-compact.*\"}) > 1'\n                severity: warning\n                for: 5m\n              - name: Thanos Compactor Halted\n                description: \"Thanos Compact {{$labels.job}} has failed to run and now is halted.\"\n                query: 'thanos_compact_halted{job=~\".*thanos-compact.*\"} == 1'\n                severity: warning\n                for: 5m\n              - name: Thanos Compactor High Compaction Failures\n                description: \"Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\"\n                query: '(sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~\".*thanos-compact.*\"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~\".*thanos-compact.*\"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_compact_group_compactions_total{job=~\".*thanos-compact.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Compact Bucket High Operation Failures\n                description: \"Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\"\n                query: '(sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-compact.*\"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-compact.*\"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-compact.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Compact Has Not Run\n                description: \"Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\"\n                query: '(time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~\".*thanos-compact.*\"}[24h]))) / 60 / 60 > 24'\n                
severity: warning\n\n          - name: Thanos Query\n            slug: thanos-query\n            rules:\n              - name: Thanos Query Http Request Query Error Rate High\n                description: 'Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.'\n                query: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-query.*\", handler=\"query\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Query Http Request Query Range Error Rate High\n                description: 'Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.'\n                query: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Query Grpc Server Error Rate\n                description: \"Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\"\n                query: '(sum by (job) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-query.*\"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-query.*\"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Thanos Query Grpc Client Error Rate\n    
            description: \"Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\"\n                query: '(sum by (job) (rate(grpc_client_handled_total{grpc_code!=\"OK\", job=~\".*thanos-query.*\"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~\".*thanos-query.*\"}[5m]))) * 100 > 5 and sum by (job) (rate(grpc_client_started_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Thanos Query High D N S Failures\n                description: \"Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints.\"\n                query: '(sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~\".*thanos-query.*\"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~\".*thanos-query.*\"}[5m]))) * 100 > 1 and sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Query Instant Latency High\n                description: \"Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\"\n                query: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query\"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query\"}[5m])) > 0)'\n                severity: critical\n                for: 10m\n              - name: Thanos Query Range Latency High\n                description: \"Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\"\n                query: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))) > 90 and sum by (job) 
(rate(http_request_duration_seconds_count{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m])) > 0)'\n                severity: critical\n                for: 10m\n              - name: Thanos Query Overload\n                description: \"Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connected Prometheus instances, look for potential senders of these requests and then contact support.\"\n                query: \"(max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1)\"\n                severity: warning\n                for: 15m\n          - name: Thanos Receiver\n            slug: thanos-receiver\n            rules:\n              - name: Thanos Receive Http Request Error Rate High\n                description: \"Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\"\n                query: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Receive Http Request Latency High\n                description: \"Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\"\n                query: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~\".*thanos-receive.*\", 
handler=\"receive\"}[5m])) > 0)'\n                severity: critical\n                for: 10m\n              - name: Thanos Receive High Replication Failures\n                description: \"Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\"\n                query: 'thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result=\"error\", job=~\".*thanos-receive.*\"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~\".*thanos-receive.*\"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~\".*thanos-receive.*\"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~\".*thanos-receive.*\"}))) * 100'\n                severity: warning\n                for: 5m\n              - name: Thanos Receive High Forward Request Failures\n                description: \"Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\"\n                query: '(sum by (job) (rate(thanos_receive_forward_requests_total{result=\"error\", job=~\".*thanos-receive.*\"}[5m]))/  sum by (job) (rate(thanos_receive_forward_requests_total{job=~\".*thanos-receive.*\"}[5m]))) * 100 > 20 and sum by (job) (rate(thanos_receive_forward_requests_total{job=~\".*thanos-receive.*\"}[5m])) > 0'\n                severity: info\n                for: 5m\n              - name: Thanos Receive High Hashring File Refresh Failures\n                description: \"Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.\"\n                query: '(sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~\".*thanos-receive.*\"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~\".*thanos-receive.*\"}[5m])) > 0) and sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~\".*thanos-receive.*\"}[5m])) > 0'\n                severity: warning\n                for: 
15m\n              - name: Thanos Receive Config Reload Failure\n                description: \"Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\"\n                query: 'avg by (job) (thanos_receive_config_last_reload_successful{job=~\".*thanos-receive.*\"}) != 1'\n                severity: warning\n                for: 5m\n              - name: Thanos Receive No Upload\n                description: \"Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.\"\n                query: '(up{job=~\".*thanos-receive.*\"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~\".*thanos-receive.*\"}[3h])) == 0)'\n                severity: critical\n                for: 3h\n          - name: Thanos Sidecar\n            slug: thanos-sidecar\n            rules:\n              - name: Thanos Sidecar Bucket Operations Failed\n                description: \"Thanos Sidecar {{$labels.instance}} bucket operations are failing ({{ $value | humanize }}/s).\"\n                query: 'sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-sidecar.*\"}[5m])) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: critical\n                for: 5m\n              - name: Thanos Sidecar No Connection To Started Prometheus\n                description: \"Thanos Sidecar {{$labels.instance}} is unhealthy.\"\n                query: 'thanos_sidecar_prometheus_up{job=~\".*thanos-sidecar.*\"} == 0 and on (namespace, pod)prometheus_tsdb_data_replay_duration_seconds != 0'\n                severity: critical\n                for: 5m\n          - name: Thanos Store\n            slug: thanos-store\n            rules:\n              - name: Thanos Store Grpc Error Rate\n                description: \"Thanos Store {{$labels.job}} is failing to handle {{$value | 
humanize}}% of requests.\"\n                query: '(sum by (job) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-store.*\"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-store.*\"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-store.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Thanos Store Series Gate Latency High\n                description: \"Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\"\n                query: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~\".*thanos-store.*\"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~\".*thanos-store.*\"}[5m])) > 0)'\n                severity: warning\n                for: 10m\n              - name: Thanos Store Bucket High Operation Failures\n                description: \"Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\"\n                query: '(sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-store.*\"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-store.*\"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-store.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Store Objstore Operation Latency High\n                description: \"Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\"\n                query: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~\".*thanos-store.*\"}[5m]))) > 2 and  sum by 
(job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~\".*thanos-store.*\"}[5m])) > 0)'\n                severity: warning\n                for: 10m\n          - name: Thanos Ruler\n            slug: thanos-ruler\n            rules:\n              - name: Thanos Rule Queue Is Dropping Alerts\n                description: \"Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s).\"\n                query: 'sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Rule Sender Is Failing Alerts\n                description: \"Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s).\"\n                query: 'sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Rule High Rule Evaluation Failures\n                description: \"Thanos Rule {{$labels.instance}} is failing to evaluate {{$value | humanize}}% of rules.\"\n                query: '(sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 5) and sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Rule High Rule Evaluation Warnings\n                description: \"Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s).\"\n                query: 'sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~\".*thanos-rule.*\"}[5m])) > 0.05'\n                comments: |\n                  
Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: info\n                for: 15m\n              - name: Thanos Rule Rule Evaluation Latency High\n                description: \"Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\"\n                query: '(sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~\".*thanos-rule.*\"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~\".*thanos-rule.*\"}))'\n                severity: warning\n                for: 5m\n              - name: Thanos Rule Grpc Error Rate\n                description: \"Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\"\n                query: '(sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-rule.*\"}[5m]))/  sum by (job, instance) (rate(grpc_server_started_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 5) and sum by (job, instance) (rate(grpc_server_started_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Thanos Rule Config Reload Failure\n                description: \"Thanos Rule {{$labels.job}} has not been able to reload its configuration.\"\n                query: 'avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~\".*thanos-rule.*\"}) != 1'\n                severity: info\n                for: 5m\n              - name: Thanos Rule Query High D N S Failures\n                description: \"Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\"\n                query: '(sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) 
(rate(thanos_rule_query_apis_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Rule Alertmanager High D N S Failures\n                description: \"Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\"\n                query: '(sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Thanos Rule No Evaluation For10 Intervals\n                description: \"Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\"\n                query: 'time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~\".*thanos-rule.*\"})>10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~\".*thanos-rule.*\"})'\n                severity: info\n                for: 5m\n              - name: Thanos No Rule Evaluations\n                description: \"Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\"\n                query: 'sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) <= 0  and sum by (job, instance) (thanos_rule_loaded_rules{job=~\".*thanos-rule.*\"}) > 0'\n                severity: critical\n                for: 5m\n          - name: Thanos Bucket Replicate\n            slug: thanos-bucket-replicate\n            rules:\n              - name: 
Thanos Bucket Replicate Error Rate\n                description: \"Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\"\n                query: '(sum by (job) (rate(thanos_replicate_replication_runs_total{result=\"error\", job=~\".*thanos-bucket-replicate.*\"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~\".*thanos-bucket-replicate.*\"}[5m]))) * 100 >= 10 and sum by (job) (rate(thanos_replicate_replication_runs_total{job=~\".*thanos-bucket-replicate.*\"}[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Thanos Bucket Replicate Run Latency\n                description: \"Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\"\n                query: '(histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~\".*thanos-bucket-replicate.*\"}[5m]))) > 20 and  sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~\".*thanos-bucket-replicate.*\"}[5m])) > 0)'\n                severity: critical\n                for: 5m\n          - name: Thanos Component Absent\n            slug: thanos-component-absent\n            rules:\n              - name: Thanos Compact Is Down\n                description: \"ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\"\n                query: 'absent(up{job=~\".*thanos-compact.*\"} == 1)'\n                severity: critical\n                for: 5m\n              - name: Thanos Query Is Down\n                description: \"ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\"\n                query: 'absent(up{job=~\".*thanos-query.*\"} == 1)'\n                severity: critical\n                for: 5m\n              - name: Thanos Receive Is Down\n                description: \"ThanosReceive has disappeared. 
Prometheus target for the component cannot be discovered.\"\n                query: 'absent(up{job=~\".*thanos-receive.*\"} == 1)'\n                severity: critical\n                for: 5m\n              - name: Thanos Rule Is Down\n                description: \"ThanosRule has disappeared. Prometheus target for the component cannot be discovered.\"\n                query: 'absent(up{job=~\".*thanos-rule.*\"} == 1)'\n                severity: critical\n                for: 5m\n              - name: Thanos Sidecar Is Down\n                description: \"ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\"\n                query: 'absent(up{job=~\".*thanos-sidecar.*\"} == 1)'\n                severity: critical\n                for: 5m\n              - name: Thanos Store Is Down\n                description: \"ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\"\n                query: absent(up{job=~\".*thanos-store.*\"} == 1)\n                severity: critical\n                for: 5m\n\n      - name: Loki\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Loki process too many restarts\n                description: A loki process had too many restarts (target {{ $labels.instance }})\n                query: changes(process_start_time_seconds{job=~\".*loki.*\"}[15m]) > 2\n                severity: warning\n              - name: Loki request errors\n                description: 'The {{ $labels.job }} and {{ $labels.route }} are experiencing {{ printf \"%.2f\" $value }}% errors.'\n                query: '100 * sum(rate(loki_request_duration_seconds_count{status_code=~\"5..\"}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 and sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 0'\n                severity: 
critical\n                for: 15m\n              - name: Loki request panic\n                description: The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }} panics over the last 10 minutes\n                query: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0\n                severity: critical\n                for: 5m\n              - name: Loki request latency\n                description: The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n                query: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~\"(?i).*tail.*\"}[5m])) by (namespace, job, route, le)))  > 1\n                severity: critical\n                for: 5m\n      - name: Promtail\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Promtail request errors\n                description: The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n                query: '100 * sum(rate(promtail_request_duration_seconds_count{status_code=~\"5..|failed\"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10 and sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 0'\n                severity: critical\n                for: 5m\n              - name: Promtail request latency\n                description: The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n                query: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (namespace, job, route, le)) > 1\n                severity: critical\n                for: 5m\n      - name: Cortex\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Cortex ruler 
configuration reload failure\n                description: Cortex ruler configuration reload failure (instance {{ $labels.instance }})\n                query: cortex_ruler_config_last_reload_successful != 1\n                severity: warning\n              - name: Cortex not connected to Alertmanager\n                description: Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n                query: cortex_prometheus_notifications_alertmanagers_discovered < 1\n                severity: critical\n              - name: Cortex notifications are being dropped\n                description: \"Cortex notifications are being dropped due to errors (instance {{ $labels.instance }}, {{ $value | humanize }}/s).\"\n                query: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0.05\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: critical\n              - name: Cortex notification error\n                description: \"Cortex is failing when sending alert notifications (instance {{ $labels.instance }}, {{ $value | humanize }}/s).\"\n                query: rate(cortex_prometheus_notifications_errors_total[5m]) > 0.05\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: critical\n              - name: Cortex ingester unhealthy\n                description: Cortex has an unhealthy ingester\n                query: cortex_ring_members{state=\"Unhealthy\", name=\"ingester\"} > 0\n                severity: critical\n              - name: Cortex frontend queries stuck\n                description: There are queued up queries in query-frontend.\n                query: sum by (job) (cortex_query_frontend_queue_length) > 0\n                severity: critical\n                for: 5m\n\n      - name: Grafana Tempo\n        exporters:\n          - name: Embedded 
exporter\n            slug: embedded-exporter\n            doc_url: https://grafana.com/docs/tempo/latest/operations/monitor/\n            rules:\n              - name: Tempo distributor unhealthy\n                description: Tempo has {{ $value }} unhealthy distributor(s).\n                query: max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"distributor\"}) > 0\n                severity: warning\n                for: 15m\n              - name: Tempo live store unhealthy\n                description: Tempo has {{ $value }} unhealthy live store(s).\n                query: max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"live-store\"}) > 0\n                severity: critical\n                for: 15m\n              - name: Tempo metrics generator unhealthy\n                description: Tempo has {{ $value }} unhealthy metrics generator(s).\n                query: max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"metrics-generator\"}) > 0\n                severity: critical\n                for: 15m\n              - name: Tempo compactions failing\n                description: \"{{ $value }} compactions have failed in the past hour.\"\n                query: sum by (job) (increase(tempodb_compaction_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_compaction_errors_total[5m])) > 0\n                severity: critical\n                for: 1h\n                comments: |\n                  Uses a two-window approach: 1h for historical count and 5m to confirm the issue is ongoing.\n              - name: Tempo polls failing\n                description: \"{{ $value }} blocklist polls have failed in the past hour.\"\n                query: sum by (job) (increase(tempodb_blocklist_poll_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_poll_errors_total[5m])) > 0\n                severity: critical\n              - name: Tempo tenant index failures\n                description: \"{{ $value }} tenant index 
failures in the past hour.\"\n                query: sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[5m])) > 0\n                severity: critical\n              - name: Tempo no tenant index builders\n                description: No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.\n                query: sum by (tenant) (tempodb_blocklist_tenant_index_builder) == 0 and on() max(tempodb_blocklist_length) > 0\n                severity: critical\n                for: 5m\n              - name: Tempo tenant index too old\n                description: Tenant index for {{ $labels.tenant }} is {{ $value }}s old.\n                query: max by (tenant) (tempodb_blocklist_tenant_index_age_seconds) > 600\n                severity: critical\n                for: 5m\n                comments: |\n                  Threshold of 600s (10 minutes). Adjust based on your tenant index build interval.\n              - name: Tempo block list rising quickly\n                description: Tempo blocklist length is up {{ printf \"%.0f\" $value }}% over the last 7 days. 
Consider scaling compactors.\n                query: (avg(tempodb_blocklist_length) / avg(tempodb_blocklist_length offset 7d) - 1) * 100 > 40 and avg(tempodb_blocklist_length offset 7d) > 0\n                severity: critical\n                for: 15m\n                comments: |\n                  Fires when the blocklist grows more than 40% over 7 days.\n              - name: Tempo bad overrides\n                description: '{{ $labels.job }} failed to reload runtime overrides.'\n                query: sum by (job) (tempo_runtime_config_last_reload_successful == 0) > 0\n                severity: critical\n                for: 15m\n              - name: Tempo user configurable overrides reload failing\n                description: \"{{ $value }} user-configurable overrides reloads have failed in the past hour.\"\n                query: sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[1h])) > 5 and sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[5m])) > 0\n                severity: critical\n              - name: Tempo compaction too many outstanding blocks warning\n                description: There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources.\n                query: sum by (instance) (tempodb_compaction_outstanding_blocks) > 100\n                severity: warning\n                for: 6h\n                comments: |\n                  Threshold of 100 blocks per compactor instance. Adjust based on your environment.\n              - name: Tempo compaction too many outstanding blocks critical\n                description: There are too many outstanding compaction blocks for {{ $labels.instance }}. 
Increase compactor resources immediately.\n                query: sum by (instance) (tempodb_compaction_outstanding_blocks) > 250\n                severity: critical\n                for: 24h\n                comments: |\n                  Official Tempo mixin normalizes by backend-worker count. Adjust threshold based on your compactor configuration.\n              - name: Tempo distributor usage tracker errors\n                description: \"Tempo distributor usage tracker errors for {{ $labels.job }} at {{ $value | humanize }}/s (reason {{ $labels.reason }}).\"\n                query: sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0\n                severity: critical\n                for: 30m\n              - name: Tempo metrics generator processor updates failing\n                description: \"Tempo metrics generator processor updates are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\"\n                query: sum by (job) (increase(tempo_metrics_generator_active_processors_update_failed_total[5m])) > 0\n                severity: critical\n                for: 15m\n              - name: Tempo metrics generator service graphs dropping spans\n                description: Tempo metrics generator is dropping {{ printf \"%.2f\" $value }}% of spans in service graphs for {{ $labels.job }}.\n                query: '100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 and sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0'\n                severity: warning\n                for: 15m\n              - name: Tempo metrics generator collections failing\n                description: \"Tempo metrics generator collections are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\"\n                query: sum by (job) 
(increase(tempo_metrics_generator_registry_collections_failed_total[5m])) > 2\n                severity: critical\n                for: 5m\n              - name: Tempo memcached errors elevated\n                description: 'Tempo memcached error rate is {{ printf \"%.2f\" $value }}% for {{ $labels.name }} in {{ $labels.job }}.'\n                query: '100 * sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count{status_code=\"500\"}[5m])) / sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 20 and sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 0'\n                severity: warning\n                for: 10m\n                comments: |\n                  Fires when the memcached error rate exceeds 20%. Only relevant if Tempo is configured with memcached caching.\n\n      - name: Grafana Mimir\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://grafana.com/docs/mimir/latest/manage/monitor-grafana-mimir/\n            comments: |\n              Mimir uses the `cortex_` metric prefix for backward compatibility with Cortex. 
This is intentional and expected.\n            rules:\n              # Core alerts\n              - name: Mimir ingester unhealthy\n                description: Mimir has {{ $value }} unhealthy ingester(s) in the ring.\n                query: min by (job) (cortex_ring_members{state=\"Unhealthy\", name=\"ingester\"}) > 0\n                severity: critical\n                for: 15m\n              - name: Mimir request errors\n                description: 'Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.'\n                query: '100 * sum by (job, route) (rate(cortex_request_duration_seconds_count{status_code=~\"5..\", route!~\"ready|debug_pprof\"}[5m])) / sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~\"ready|debug_pprof\"}[5m])) > 1 and sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~\"ready|debug_pprof\"}[5m])) > 0'\n                severity: critical\n                for: 15m\n              - name: Mimir inconsistent runtime config\n                description: An inconsistent runtime config file is used across Mimir instances.\n                query: count(count by (job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1\n                severity: critical\n                for: 1h\n              - name: Mimir bad runtime config\n                description: '{{ $labels.job }} failed to reload runtime config.'\n                query: sum by (job) (cortex_runtime_config_last_reload_successful == 0) > 0\n                severity: critical\n                for: 5m\n              - name: Mimir scheduler queries stuck\n                description: There are {{ $value }} queued up queries in {{ $labels.job }}.\n                query: sum by (job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0\n                severity: critical\n                for: 7m\n              - name: Mimir cache request errors\n                description: 'Mimir cache {{ 
$labels.name }} is experiencing {{ printf \"%.2f\" $value }}% errors for {{ $labels.operation }} operation.'\n                query: '(sum by (name, operation, job) (rate(thanos_cache_operation_failures_total[5m])) / sum by (name, operation, job) (rate(thanos_cache_operations_total[5m]))) * 100 > 5 and sum by (name, operation, job) (rate(thanos_cache_operations_total[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Mimir KV store failure\n                description: 'Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate.'\n                query: '(sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~\"2..\"}[5m])) / sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m]))) == 1 and sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir memory map areas too high\n                description: 'Mimir {{ $labels.job }} is using {{ printf \"%.0f\" $value }}% of its memory map area limit.'\n                query: 'process_memory_map_areas{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} / process_memory_map_areas_limit{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} * 100 > 80 and process_memory_map_areas_limit{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir ingester instance has no tenants\n                description: Mimir ingester {{ $labels.instance }} has no tenants assigned.\n                query: (cortex_ingester_memory_users == 0) and on (instance) (cortex_ingester_memory_users offset 1h > 0)\n                severity: warning\n                for: 1h\n              - name: Mimir ruler instance has no rule groups\n                description: Mimir ruler {{ $labels.instance }} has no rule groups assigned.\n               
 query: (cortex_ruler_managers_total == 0) and on (instance) (cortex_ruler_managers_total offset 1h > 0)\n                severity: warning\n                for: 1h\n              - name: Mimir ingested data too far in the future\n                description: Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future.\n                query: max by (job) (cortex_ingester_tsdb_head_max_timestamp_seconds - time() and cortex_ingester_tsdb_head_max_timestamp_seconds > 0) > 3600\n                severity: warning\n                for: 5m\n              - name: Mimir store gateway too many failed operations\n                description: Mimir store-gateway {{ $labels.job }} bucket operations are failing ({{ $value | humanize }}/s).\n                query: sum by (job) (rate(thanos_objstore_bucket_operation_failures_total[5m])) > 0.05\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: Mimir ring members mismatch\n                description: Mimir {{ $labels.name }} ring has inconsistent member counts across instances.\n                query: max by (name, job) (sum by (name, job, instance) (cortex_ring_members)) != min by (name, job) (sum by (name, job, instance) (cortex_ring_members))\n                severity: warning\n                for: 15m\n              # Instance limits\n              - name: Mimir ingester reaching series limit warning\n                description: 'Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.'\n                query: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_series\"} * 100 > 80) and cortex_ingester_instance_limits{limit=\"max_series\"} > 0'\n                severity: warning\n                for: 3h\n              - name: Mimir ingester reaching 
series limit critical\n                description: 'Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.'\n                query: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_series\"} * 100 > 90) and cortex_ingester_instance_limits{limit=\"max_series\"} > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir ingester reaching tenants limit warning\n                description: 'Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.'\n                query: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_tenants\"} * 100 > 70) and cortex_ingester_instance_limits{limit=\"max_tenants\"} > 0'\n                severity: warning\n                for: 5m\n              - name: Mimir ingester reaching tenants limit critical\n                description: 'Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.'\n                query: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_tenants\"} * 100 > 80) and cortex_ingester_instance_limits{limit=\"max_tenants\"} > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir reaching TCP connections limit\n                description: 'Mimir instance {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its TCP connections limit.'\n                query: cortex_tcp_connections / cortex_tcp_connections_limit * 100 > 80 and cortex_tcp_connections_limit > 0\n                severity: critical\n                for: 5m\n              - name: Mimir distributor inflight requests high\n                description: 'Mimir distributor {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its inflight push requests limit.'\n                query: 
'(cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit=\"max_inflight_push_requests\"} * 100 > 80) and cortex_distributor_instance_limits{limit=\"max_inflight_push_requests\"} > 0'\n                severity: critical\n                for: 5m\n              # Blocks and TSDB\n              - name: Mimir ingester TSDB head compaction failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to compact TSDB head ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0\n                severity: critical\n                for: 15m\n              - name: Mimir ingester TSDB head truncation failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0\n                severity: critical\n              - name: Mimir ingester TSDB checkpoint creation failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0\n                severity: critical\n              - name: Mimir ingester TSDB checkpoint deletion failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0\n                severity: critical\n              - name: Mimir ingester TSDB WAL truncation failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0\n                severity: warning\n         
     - name: Mimir ingester TSDB WAL writes failed\n                description: \"Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL ({{ $value | humanize }}/s).\"\n                query: rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0\n                severity: critical\n                for: 3m\n              - name: Mimir store gateway has not synced bucket\n                description: Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 30 minutes.\n                query: (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component=\"store-gateway\"} > 1800) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component=\"store-gateway\"} > 0\n                comments: |\n                  Threshold aligned with official Mimir mixin (30 minutes).\n                severity: critical\n                for: 5m\n              - name: Mimir store gateway no synced tenants\n                description: Mimir store-gateway {{ $labels.instance }} has no synced tenants.\n                query: (min by (instance, job) (cortex_bucket_stores_tenants_synced{component=\"store-gateway\"}) == 0) and on (instance) (cortex_bucket_stores_tenants_synced{component=\"store-gateway\"} offset 1h > 0)\n                severity: warning\n                for: 1h\n              - name: Mimir bucket index not updated\n                description: 'Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes.'\n                query: min by (user, job) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100\n                severity: critical\n              # Compactor\n              - name: Mimir compactor not cleaning up blocks\n                description: Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours.\n                query: (time() - 
cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 21600) and cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 0\n                severity: critical\n                for: 1h\n              - name: Mimir compactor not running compaction\n                description: Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours.\n                query: (time() - cortex_compactor_last_successful_run_timestamp_seconds > 86400) and cortex_compactor_last_successful_run_timestamp_seconds > 0\n                severity: critical\n                for: 15m\n              - name: Mimir compactor has consecutive failures\n                description: \"Mimir compactor {{ $labels.instance }} has had {{ $value }} compaction failures in the last 2 hours.\"\n                query: increase(cortex_compactor_runs_failed_total{reason!=\"shutdown\"}[2h]) > 1\n                severity: critical\n              - name: Mimir compactor has run out of disk space\n                description: Mimir compactor {{ $labels.instance }} has run out of disk space.\n                query: increase(cortex_compactor_disk_out_of_space_errors_total[24h]) >= 1\n                severity: critical\n              - name: Mimir compactor has not uploaded blocks\n                description: Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours.\n                query: (time() - thanos_objstore_bucket_last_successful_upload_time{component=\"compactor\"} > 86400) and thanos_objstore_bucket_last_successful_upload_time{component=\"compactor\"} > 0\n                severity: critical\n                for: 15m\n              - name: Mimir compactor skipped blocks\n                description: \"Mimir compactor has found {{ $value }} blocks that cannot be compacted (reason {{ $labels.reason }}).\"\n                query: increase(cortex_compactor_blocks_marked_for_no_compaction_total[24h]) > 0\n                comments: 
|\n                  Using 24h window per official mixin — compaction skips are rare events.\n                severity: warning\n                for: 5m\n              # Ruler\n              - name: Mimir ruler too many failed pushes\n                description: 'Mimir ruler {{ $labels.instance }} is failing to push {{ printf \"%.2f\" $value }}% of write requests.'\n                query: '100 * sum by (instance, job) (rate(cortex_ruler_write_requests_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir ruler too many failed queries\n                description: 'Mimir ruler {{ $labels.instance }} is failing {{ printf \"%.2f\" $value }}% of query evaluations.'\n                query: '100 * sum by (instance, job) (rate(cortex_ruler_queries_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 0'\n                severity: critical\n                for: 5m\n              - name: Mimir ruler missed evaluations\n                description: 'Mimir ruler {{ $labels.instance }} is missing {{ printf \"%.2f\" $value }}% of rule group evaluations.'\n                query: '100 * sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_missed_total[5m])) / sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 1 and sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 0'\n                severity: warning\n                for: 5m\n              - name: Mimir ruler failed ring check\n                description: Mimir ruler {{ $labels.job }} is failing ring checks ({{ $value | humanize }}/s).\n                query: sum by (job) (rate(cortex_ruler_ring_check_errors_total[5m])) > 0.05\n                
comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: critical\n                for: 5m\n              # Alertmanager\n              - name: Mimir alertmanager sync configs failing\n                description: \"Mimir alertmanager {{ $labels.job }} is failing to sync configs ({{ $value | humanize }}/s).\"\n                query: rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0\n                severity: critical\n                for: 30m\n              - name: Mimir alertmanager ring check failing\n                description: \"Mimir alertmanager {{ $labels.job }} is failing ring checks ({{ $value | humanize }}/s).\"\n                query: rate(cortex_alertmanager_ring_check_errors_total[5m]) > 0\n                severity: critical\n                for: 10m\n              - name: Mimir alertmanager state merge failing\n                description: \"Mimir alertmanager {{ $labels.job }} is failing to merge state updates ({{ $value | humanize }}/s).\"\n                query: rate(cortex_alertmanager_partial_state_merges_failed_total[5m]) > 0\n                severity: critical\n                for: 10m\n              - name: Mimir alertmanager replication failing\n                description: \"Mimir alertmanager {{ $labels.job }} is failing to replicate state ({{ $value | humanize }}/s).\"\n                query: rate(cortex_alertmanager_state_replication_failed_total[5m]) > 0\n                severity: critical\n                for: 10m\n              - name: Mimir alertmanager persist state failing\n                description: \"Mimir alertmanager {{ $labels.job }} is failing to persist state ({{ $value | humanize }}/s).\"\n                query: rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0\n                severity: critical\n                for: 1h\n              - name: Mimir alertmanager initial sync failed\n                description: Mimir alertmanager {{ 
$labels.job }} failed initial state sync.\n                query: increase(cortex_alertmanager_state_initial_sync_completed_total{outcome=\"failed\"}[1m]) > 0\n                severity: warning\n              - name: Mimir alertmanager instance has no tenants\n                description: Mimir alertmanager {{ $labels.instance }} has no tenants assigned.\n                query: (cortex_alertmanager_tenants_owned == 0) and on (instance) (cortex_alertmanager_tenants_owned offset 1h > 0)\n                severity: warning\n                for: 1h\n              # Gossip\n              - name: Mimir gossip members count too high\n                description: Mimir gossip cluster has more members than expected.\n                query: 'avg(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job) * 1.15 + 10 < max(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job)'\n                severity: warning\n                for: 20m\n              - name: Mimir gossip members count too low\n                description: Mimir gossip cluster has fewer members than expected.\n                query: 'avg(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job) * 0.5 > min(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job)'\n                severity: warning\n                for: 20m\n              # Go runtime\n              - name: Mimir go threads too high warning\n                description: 'Mimir {{ $labels.instance }} has {{ $value }} Go threads.'\n                query: 'go_threads{job=~\".*(mimir|cortex).*\"} > 5000'\n                severity: warning\n                for: 15m\n                comments: |\n                  A high number of Go threads may indicate a goroutine leak.\n              - name: Mimir go threads too high critical\n                description: 'Mimir {{ $labels.instance }} has {{ $value }} Go threads.'\n                query: 
'go_threads{job=~\".*(mimir|cortex).*\"} > 8000'\n                severity: critical\n                for: 15m\n\n      - name: Grafana Alloy\n        exporters:\n          - slug: embedded-exporter\n            rules:\n              - name: Grafana Alloy service down\n                description: \"Alloy on instance {{ $labels.instance }} is not responding or has stopped running.\"\n                query: \"count by (instance) (alloy_build_info offset 2h) unless count by (instance) (alloy_build_info)\"\n                severity: critical\n\n      - name: OpenTelemetry Collector\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://opentelemetry.io/docs/collector/internal-telemetry/\n            comments: |\n              OpenTelemetry Collector self-monitoring metrics are exposed on port 8888 by default at the /metrics endpoint.\n              These alerts monitor the collector's health when metrics are ingested via the Prometheus OTLP endpoint or scraped directly.\n              All collector internal metrics are prefixed with 'otelcol_'.\n            rules:\n              - name: OpenTelemetry Collector down\n                description: OpenTelemetry Collector instance has disappeared or is not being scraped\n                query: 'up{job=~\".*otel.*collector.*\"} == 0'\n                severity: critical\n                for: 1m\n              - name: OpenTelemetry Collector receiver refused spans\n                description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s spans on {{ $labels.receiver }}.\"\n                query: 'rate(otelcol_receiver_refused_spans[5m]) > 0'\n                severity: critical\n                for: 5m\n              - name: OpenTelemetry Collector receiver refused metric points\n                description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s metric points on {{ $labels.receiver }}.\"\n                query: 
'rate(otelcol_receiver_refused_metric_points[5m]) > 0'\n                severity: critical\n                for: 5m\n              - name: OpenTelemetry Collector receiver refused log records\n                description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s log records on {{ $labels.receiver }}.\"\n                query: 'rate(otelcol_receiver_refused_log_records[5m]) > 0'\n                severity: critical\n                for: 5m\n              - name: OpenTelemetry Collector exporter failed spans\n                description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s spans via {{ $labels.exporter }}.\"\n                query: 'rate(otelcol_exporter_send_failed_spans[5m]) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector exporter failed metric points\n                description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s metric points via {{ $labels.exporter }}.\"\n                query: 'rate(otelcol_exporter_send_failed_metric_points[5m]) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector exporter failed log records\n                description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s log records via {{ $labels.exporter }}.\"\n                query: 'rate(otelcol_exporter_send_failed_log_records[5m]) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector exporter queue nearly full\n                description: \"OpenTelemetry 
Collector exporter {{ $labels.exporter }} queue is over 80% full\"\n                query: '(otelcol_exporter_queue_size / on(instance, job, exporter) otelcol_exporter_queue_capacity) > 0.8 and otelcol_exporter_queue_capacity > 0'\n                severity: warning\n              - name: OpenTelemetry Collector processor refused spans\n                description: \"OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans ({{ $value | humanize }}/s), likely due to backpressure.\"\n                query: 'rate(otelcol_processor_refused_spans[5m]) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector processor refused metric points\n                description: \"OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points ({{ $value | humanize }}/s), likely due to backpressure.\"\n                query: 'rate(otelcol_processor_refused_metric_points[5m]) > 0.05'\n                comments: |\n                  Threshold of 0.05/s avoids firing on transient single-event spikes.\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector high memory usage\n                description: \"OpenTelemetry Collector memory usage is above 90%\"\n                query: '(otelcol_process_runtime_heap_alloc_bytes{job=~\".*otel.*collector.*\"} / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes{job=~\".*otel.*collector.*\"}) > 0.9'\n                severity: warning\n                for: 5m\n              - name: OpenTelemetry Collector OTLP receiver errors\n                description: \"OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused\"\n                query: 'rate(otelcol_receiver_accepted_spans{receiver=~\"otlp\"}[5m]) == 0 and 
rate(otelcol_receiver_refused_spans{receiver=~\"otlp\"}[5m]) > 0'\n                severity: critical\n                for: 2m\n\n      - name: Jaeger\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            doc_url: https://www.jaegertracing.io/docs/latest/monitoring/\n            rules:\n              - name: Jaeger agent HTTP server errors\n                description: \"Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\"\n                query: '100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger client RPC request errors\n                description: \"Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\"\n                query: '100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~\"4xx|5xx\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger client spans dropped\n                description: \"Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\"\n                query: '100 * sum(rate(jaeger_reporter_spans{result=~\"dropped|err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger agent spans 
dropped\n                description: \"Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\"\n                query: '100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger collector dropping spans\n                description: \"Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\"\n                query: '100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger sampling update failing\n                description: \"Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\"\n                query: '100 * sum(rate(jaeger_sampler_queries{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger throttling update failing\n                description: \"Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\"\n                query: '100 * sum(rate(jaeger_throttler_updates{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_throttler_updates[1m])) by (instance, 
job, namespace) > 0'\n                severity: warning\n                for: 15m\n              - name: Jaeger query request failures\n                description: \"Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\"\n                query: '100 * sum(rate(jaeger_query_requests_total{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 0'\n                severity: warning\n                for: 15m\n\n  - name: Other\n    services:\n      - name: APC UPS\n        exporters:\n          - name: mdlayher/apcupsd_exporter\n            slug: apcupsd_exporter\n            doc_url: https://github.com/mdlayher/apcupsd_exporter\n            rules:\n              - name: APC UPS Battery nearly empty\n                description: Battery is almost empty (< 10% left)\n                query: \"apcupsd_battery_charge_percent < 10\"\n                severity: critical\n              - name: APC UPS Less than 15 Minutes of battery time remaining\n                description: Battery is almost empty (< 15 Minutes remaining)\n                query: \"apcupsd_battery_time_left_seconds < 900\"\n                severity: critical\n              - name: APC UPS AC input outage\n                description: UPS now running on battery (since {{$value | humanizeDuration}})\n                query: \"apcupsd_battery_time_on_seconds > 0\"\n                severity: warning\n              - name: APC UPS low battery voltage\n                description: Battery voltage is lower than nominal (< 95%)\n                query: \"(apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95\"\n                severity: warning\n              - name: APC UPS high temperature\n                description: Internal temperature is high ({{$value}}°C)\n                query: 
\"apcupsd_internal_temperature_celsius >= 40\"\n                severity: warning\n                for: 2m\n              - name: APC UPS high load\n                description: UPS load is > 80%\n                query: \"apcupsd_ups_load_percent > 80\"\n                severity: warning\n\n      - name: Graph Node\n        exporters:\n          - name: Embedded exporter\n            slug: embedded-exporter\n            rules:\n              - name: Provider failed because net_version failed\n                description: \"Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\"\n                query: \"eth_rpc_status == 1\"\n                severity: critical\n              - name: Provider failed because get genesis failed\n                description: \"Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\"\n                query: \"eth_rpc_status == 2\"\n                severity: critical\n              - name: Provider failed because net_version timeout\n                description: \"net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\"\n                query: \"eth_rpc_status == 3\"\n                severity: critical\n              - name: Provider failed because get genesis timeout\n                description: \"Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\"\n                query: \"eth_rpc_status == 4\"\n                severity: critical\n              - name: Store connection slow\n                description: \"Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\"\n                query: \"store_connection_wait_time_ms > 10\"\n                severity: warning\n              - name: Store connection very slow\n                description: \"Store connection is very slow to `{{$labels.pool}}` pool, 
`{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\"\n                query: \"store_connection_wait_time_ms > 20\"\n                severity: critical\n"
  },
  {
    "path": "_layouts/default.html",
    "content": "<!DOCTYPE html>\n<html lang=\"{{ site.lang | default: \"en-US\" }}\">\n\n<head>\n  <meta charset=\"UTF-8\">\n  {% seo %}\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n  <meta name=\"theme-color\" content=\"#157878\">\n  <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black-translucent\">\n  <link rel=\"stylesheet\" href=\"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css\">\n  <link rel=\"stylesheet\" href=\"{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}\">\n  <link rel=\"stylesheet\" href=\"{{ '/assets/css/app.css?v=' | append: site.github.build_revision | relative_url }}\">\n  <link rel=\"icon\" type=\"image/x-icon\" href=\"{{ '/assets/favicon.ico' | relative_url }}\">\n\n  <script src=\"https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js\"></script>\n  <script src=\"https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js\"></script>\n  <script src=\"https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.4/clipboard.min.js\"></script>\n  <script src=\"{{ '/assets/js/app.js?v=' | append: site.github.build_revision | relative_url }}\"></script>\n\n  <!-- Global site tag (gtag.js) - Google Analytics -->\n  <script async src=\"https://www.googletagmanager.com/gtag/js?id=UA-118604063-2\"></script>\n  <script>\n    window.dataLayer = window.dataLayer || [];\n\n    function gtag() {\n      dataLayer.push(arguments);\n    }\n    gtag('js', new Date());\n\n    gtag('config', 'UA-118604063-2');\n  </script>\n\n</head>\n\n<body>\n  <style>\n    #skip-to-content {\n      height: 1px;\n      width: 1px;\n      position: absolute;\n      overflow: hidden;\n      top: -10px;\n\n      &:focus {\n        position: fixed;\n        top: 10px;\n        left: 10px;\n        height: auto;\n        width: auto;\n        background: invert($body-link-color);\n        outline: thick solid invert($body-link-color);\n      }\n    
}\n\n    ul.github-buttons-cta li {\n      display: inline-block;\n      height: 20px;\n      padding: 0px 15px;\n    }\n\n    ul.github-buttons-cta li a {\n      /* width: 100px; */\n      text-decoration: none;\n    }\n\n    .fa {\n      /* padding: 14px;\n      width: 50px;\n      height: 50px; */\n      font-size: 25px;\n      text-align: center;\n      text-decoration: none;\n      border-radius: 50%;\n    }\n\n    .fa:hover {\n      opacity: 0.8;\n    }\n\n    .fa-twitter,\n    .fa-linkedin {\n      /* background: #55ACEE; */\n      color: white;\n    }\n  </style>\n  <a id=\"skip-to-content\" href=\"#content\">Skip to the content.</a>\n\n  <header class=\"page-header\" role=\"banner\">\n    <h1 class=\"project-name\">\n      <a href=\"{{ '/' | relative_url }}\" style=\"color: white\">\n        {{ site.title | default: site.github.repository_name }}\n      </a>\n    </h1>\n    <h2 class=\"project-tagline\">{{ site.description | default: site.github.project_tagline }}</h2>\n    <a href=\"{{ '/alertmanager' | relative_url  }}\" class=\"btn\">Global configuration</a>\n    <a href=\"{{ '/rules' | relative_url }}\" class=\"btn\">Rules</a>\n    <a href=\"{{ '/sleep-peacefully' | relative_url }}\" class=\"btn\">Sleep peacefully</a>\n    <a href=\"{{ '/blackbox-exporter' | relative_url }}\" class=\"btn\">Blackbox</a>\n    <a href=\"https://github.com/samber/awesome-prometheus-alerts/blob/master/CONTRIBUTING.md\" class=\"btn\">\n      Contribute on GitHub\n    </a>\n\n    <ul class=\"github-buttons-cta\">\n      <li>\n        <a href=\"https://github.com/samber/awesome-prometheus-alerts\">\n          <img alt=\"GitHub Repo Watchers\" src=\"https://img.shields.io/github/watchers/samber/awesome-prometheus-alerts?style=social\">\n        </a>\n      </li>\n      <li>\n        <a href=\"https://github.com/samber/awesome-prometheus-alerts\">\n          <img alt=\"GitHub Repo stars\" 
src=\"https://img.shields.io/github/stars/samber/awesome-prometheus-alerts?style=social\">\n        </a>\n      </li>\n      <li>\n        <a href=\"https://github.com/samber/awesome-prometheus-alerts\">\n          <img alt=\"GitHub Repo forks\" src=\"https://img.shields.io/github/forks/samber/awesome-prometheus-alerts?style=social\">\n        </a>\n      </li>\n      <li>\n        <a href=\"https://twitter.com/share?via=samuelberthe&related=samuelberthe&text=🚨 📊 Here is a collection of Awesome Prometheus Alerts&url=https://samber.github.io/awesome-prometheus-alerts\"\n          class=\"fa fa-twitter\" target=\"_blank\"></a>\n      </li>\n      <li>\n        <a href=\"http://www.linkedin.com/shareArticle?mini=true&url=https://samber.github.io/awesome-prometheus-alerts/\"\n          class=\"fa fa-linkedin\" target=\"_blank\"></a>\n      </li>\n    </ul>\n\n\n    <ul id=\"sponsoring\">\n      <li>\n        Kindly supported by&nbsp; 👉\n      </li>\n      <li>\n        <a href=\"https://cast.ai/samuel\">\n          <img width=\"\" src=\"assets/sponsor-cast-ai.png\" />\n        </a>\n      </li>\n      <li>\n        <a href=\"https://betterstack.com/\">\n          <img width=\"\" src=\"assets/sponsor-betterstack.png\" />\n        </a>\n      </li>\n    </ul>\n  </header>\n\n  <main id=\"content\" class=\"main-content\" role=\"main\">\n    {{ content }}\n\n    <footer class=\"site-footer\">\n      {% if site.github.is_project_page %}\n        <span class=\"site-footer-owner\">\n          <a href=\"{{ site.github.repository_url }}\">{{ site.title }}</a> is maintained by\n          <a href=\"{{ site.github.owner_url }}\">{{ site.github.owner_name }}</a>.\n        </span>\n      {% endif %}\n    </footer>\n  </main>\n\n</body>\n\n</html>\n"
  },
  {
    "path": "alertmanager.md",
    "content": "<h1 style=\"text-align: center;\">\n  Global configuration\n</h1>\n\nIf you notice a delay between an event and the first notification, read the following blog post => [https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html).\n\n## Prometheus configuration\n\n{% highlight yaml %}\n# prometheus.yml\n\nglobal:\n  scrape_interval: 20s\n\n  # A short evaluation_interval will check alerting rules very often.\n  # It can be costly if you run Prometheus with 100+ alerts.\n  evaluation_interval: 20s\n  ...\n\nrule_files:\n  - 'alerts/*.yml'\n\nscrape_configs:\n  ...\n\n{% endhighlight %}\n\n{% highlight yaml %}\n# alerts/example-redis.yml\n\ngroups:\n\n- name: ExampleRedisGroup\n  rules:\n  - alert: ExampleRedisDown\n    expr: redis_up{} == 0\n    for: 2m\n    labels:\n      severity: critical\n    annotations:\n      summary: \"Redis instance down\"\n      description: \"Whatever\"\n\n{% endhighlight %}\n\n## AlertManager configuration\n\n{% highlight yaml %}\n{% raw %}\n# alertmanager.yml\n\nroute:\n  # When a new group of alerts is created by an incoming alert, wait at\n  # least 'group_wait' to send the initial notification.\n  # This ensures that multiple alerts for the same group that start\n  # firing shortly after one another are batched together on the first\n  # notification.\n  group_wait: 10s\n\n  # When the first notification was sent, wait 'group_interval' to send a batch\n  # of new alerts that started firing for that group.\n  group_interval: 30s\n\n  # If an alert has successfully been sent, wait 'repeat_interval' to\n  # resend it.\n  repeat_interval: 30m\n\n  # A default receiver\n  receiver: \"slack\"\n\n  # All the above attributes are inherited by all child routes and can be\n  # overwritten on each.\n  routes:\n    - receiver: \"slack\"\n      group_wait: 10s\n      match_re:\n        severity: critical|warning\n      continue: 
true\n\n    - receiver: \"pager\"\n      group_wait: 10s\n      match_re:\n        severity: critical\n      continue: true\n\nreceivers:\n  - name: \"slack\"\n    slack_configs:\n      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'\n        send_resolved: true\n        channel: 'monitoring'\n        text: \"{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\\n{{ .Annotations.description }}\\n{{ end }}\"\n\n  - name: \"pager\"\n    webhook_configs:\n      - url: http://a.b.c.d:8080/send/sms\n        send_resolved: true\n\n{% endraw %}\n{% endhighlight %}\n\n## Reduce Prometheus server load\n\nFor expensive or frequent PromQL queries, Prometheus can precompute results with recording rules.\n\n{% highlight yaml %}\n{% raw %}\ngroups:\n\n  # first define the recording rule\n  - name: ExampleRecordedGroup\n    rules:\n    - record: job:rabbitmq_queue_messages_delivered_total:rate:5m\n      expr: rate(rabbitmq_queue_messages_delivered_total[5m])\n\n  # then use it in alerts\n  - name: ExampleAlertingGroup\n    rules:\n    - alert: ExampleRabbitmqLowMessageDelivery\n      expr: sum(job:rabbitmq_queue_messages_delivered_total:rate:5m) < 10\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: \"Low delivery rate in Rabbitmq queues\"\n{% endraw %}\n{% endhighlight %}\n\n## Troubleshooting\n\nIf a notification takes too long to be triggered, check the following delays:\n- `scrape_interval = 20s` (prometheus.yml)\n- `evaluation_interval = 20s` (prometheus.yml)\n- `increase(mysql_global_status_slow_queries[1m]) > 0` (alerts/example-mysql.yml)\n- `for: 5m` (alerts/example-mysql.yml)\n- `group_wait = 10s` (alertmanager.yml)\n\nAlso read:\n- [https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html).\n- 
[https://hodovi.cc/blog/creating-awesome-alertmanager-templates-for-slack/](https://hodovi.cc/blog/creating-awesome-alertmanager-templates-for-slack/)\n- [https://grafana.com/blog/2024/10/03/how-to-use-prometheus-to-efficiently-detect-anomalies-at-scale/](https://grafana.com/blog/2024/10/03/how-to-use-prometheus-to-efficiently-detect-anomalies-at-scale/)\n"
  },
  {
    "path": "assets/css/app.css",
    "content": "a.anchor {\n    font-size: 15px;\n    vertical-align: middle;\n    color: darkblue;\n    display: inline-block;\n    padding-bottom: 5px;\n    margin-right: 5px;\n    opacity: 0;\n    transition: opacity 0.4s;\n}\n\nh2:hover a.anchor,\nh3:hover a.anchor,\nh4:hover a.anchor {\n    opacity: 1;\n}\n\nsummary {\n    position: relative;\n    padding-left: 60px;\n    padding-right: 50px;\n    margin-bottom: 15px;\n    font-size: 15px;\n}\n\nh2 {\n    position: relative;\n}\n\n.clipboard-single,\n.clipboard-multiple {\n    right: 0;\n    position: absolute;\n    cursor: pointer;\n    font-size: 14px;\n    color: #606c71;\n}\n\n/* NAVBAR */\n#rules-navbar.affix {\n    /* showed by JS */\n    display: none;\n\n    position: fixed;\n    overflow: auto;\n    top: 0;\n    right: 0;\n    max-width: 250px;\n    max-height: 100%;\n    padding-top: 20px;\n    padding-bottom: 20px;\n    padding-left: 20px;\n    padding-right: 10px;\n\n    background-color: #f3f6fa;\n}\n\n/* hide menu on small screens */\n@media screen and (max-width: 1350px) {\n    #rules-navbar.affix {\n        display: none !important;\n    }\n}\n\n/* hide menu scrollbar */\n#rules-navbar.affix::-webkit-scrollbar {\n    display: none;\n}\n\n#rules-navbar.affix {\n    -ms-overflow-style: none;\n    /* IE and Edge */\n    scrollbar-width: none;\n    /* Firefox */\n}\n\n#rules-navbar.affix h3 {\n    margin-bottom: 10px;\n}\n\n#rules-navbar.affix h4 {\n    margin: 0;\n    font-weight: bold;\n    font-size: 14px;\n    line-height: 14px;\n}\n\n#rules-navbar.affix ul,\n#rules-navbar.affix ul li {\n    margin: 0;\n    padding-top: 0;\n    padding-bottom: 0;\n    line-height: normal;\n}\n\n#rules-navbar.affix>ul {\n    padding-left: 0;\n    padding-right: 0;\n}\n\n#rules-navbar.affix>ul>li {\n    margin-bottom: 10px;\n    padding-left: 0;\n    padding-right: 0;\n}\n\n#rules-navbar.affix a {\n    font-size: 14px;\n    line-height: 14px;\n}\n\n/* https://github.com/samber/awesome-prometheus-alerts/issues/356 
*/\n@media screen and (min-width: 64em) {\n    .main-content {\n        max-width: 85rem;\n    }\n}\n\nul#sponsoring {\n    display: flex;\n    align-items: center;\n    justify-content: center;\n    margin-top: 50px;\n}\n\nul#sponsoring li {\n    display: flex;\n    padding: 0px 15px;\n    font-size: 16px;\n}\n\nul#sponsoring li a {\n    display: flex;\n}\n\nul#sponsoring li a img {\n    max-width: 180px;\n    max-height: 40px;\n}\n\n.page-header {\n    padding-bottom: 30px;\n}\n\n@media (prefers-color-scheme: dark) {\n\n    #rules-navbar.affix {\n        background-color: #2b2b2b;\n    }\n\n    /*********************** style.css overrides ******************************/\n    /* This should *probably* be its own theme instead. */\n\n    body {\n\tcolor: #a3b0b6;\n        background-color: #242424;\n    }\n    .page-header {\n\tcolor: #fff;\n\ttext-align: center;\n\tbackground-color: #006128;\n\tbackground-image: linear-gradient(120deg, #002968, #003c04);\n    }\n    .site-footer {\n\tborder-top: solid 1px #525354;\n    }\n    hr {\n        background-color: #525354!important;\n    }\n    a {\n        color: #3d86d6;\n    }\n    .main-content h1 ,\n    .main-content h2 ,\n    .main-content h3 ,\n    .main-content h4 ,\n    .main-content h5 ,\n    .main-content h6 {\n\tcolor: #55c883;\n    }\n\n    /* Syntax Highlighting from dark-plus of pygments-styles */\n    /* See: https://github.com/lepture/pygments-styles */\n    .main-content pre {\n        background: #1E1E1E;\n\tborder: solid 1px #272f36;\n    }\n    code, .highlight {\n        background: #1E1E1E;\n        color: #D4D4D4\n    }\n    .highlight .hll {\n        background-color: #ADD6FF26\n    }\n    .highlight .c   { color: #6A9955 }\n    .highlight .err { color: #F44747 }\n    .highlight .k   { color: #C586C0 }\n    .highlight .l   { color: #CE9178 }\n    .highlight .ch  { color: #6A9955 }\n    .highlight .cm  { color: #6A9955 }\n    .highlight .cp  { color: #C586C0 }\n    .highlight .cpf { color: #CE9178 
}\n    .highlight .c1  { color: #6A9955 }\n    .highlight .cs  { color: #6A9955 }\n    .highlight .gd  { color: #CE9178 }\n    .highlight .ge  { font-style: italic }\n    .highlight .gr  { color: #F44747 }\n    .highlight .gh  { color: #569CD6 }\n    .highlight .gi  { color: #B5CEA8 }\n    .highlight .go  { color: #CE9178 }\n    .highlight .gp  { color: #C8C8C8 }\n    .highlight .gs  { color: #569CD6; font-weight: bold }\n    .highlight .gu  { color: #569CD6 }\n    .highlight .gt  { color: #F44747 }\n    .highlight .kc  { color: #CE9178 }\n    .highlight .kd  { color: #C586C0 }\n    .highlight .kn  { color: #C586C0 }\n    .highlight .kp  { color: #D7BA7D }\n    .highlight .kr  { color: #C586C0 }\n    .highlight .kt  { color: #569CD6 }\n    .highlight .ld  { color: #CE9178 }\n    .highlight .m   { color: #B5CEA8 }\n    .highlight .s   { color: #CE9178 }\n    .highlight .na  { color: #9CDCFE }\n    .highlight .nb  { color: #DCDCAA }\n    .highlight .nc  { color: #4EC9B0 }\n    .highlight .no  { color: #B5CEA8 }\n    .highlight .nd  { color: #DCDCAA }\n    .highlight .ne  { color: #4EC9B0 }\n    .highlight .nf  { color: #DCDCAA }\n    .highlight .nl  { color: #C8C8C8 }\n    .highlight .nx  { color: #D4D4D4 }\n    .highlight .nt  { color: #569CD6 }\n    .highlight .w   { color: #D4D4D4 }\n    .highlight .mb  { color: #B5CEA8 }\n    .highlight .mf  { color: #B5CEA8 }\n    .highlight .mh  { color: #B5CEA8 }\n    .highlight .mi  { color: #B5CEA8 }\n    .highlight .mo  { color: #B5CEA8 }\n    .highlight .sa  { color: #CE9178 }\n    .highlight .sb  { color: #CE9178 }\n    .highlight .sc  { color: #CE9178 }\n    .highlight .dl  { color: #CE9178 }\n    .highlight .sd  { color: #CE9178 }\n    .highlight .s2  { color: #CE9178 }\n    .highlight .se  { color: #CE9178 }\n    .highlight .sh  { color: #CE9178 }\n    .highlight .si  { color: #569CD6 }\n    .highlight .sx  { color: #CE9178 }\n    .highlight .sr  { color: #D16969 }\n    .highlight .s1  { color: #CE9178 }\n    
.highlight .ss  { color: #CE9178 }\n    .highlight .bp  { color: #D7BA7D }\n    .highlight .fm  { color: #DCDCAA }\n    .highlight .il  { color: #B5CEA8 }\n}\n"
  },
  {
    "path": "assets/js/app.js",
    "content": "$(function () {\n    var clipboardRules = new ClipboardJS('.clipboard-single', {\n        text: function (trigger) {\n            const id = trigger.getAttribute('data-clipboard-target-id');\n            const html = $(\"#\" + id + \" .highlight\");\n            return html.text() + '\\n';\n        },\n    });\n    var clipboardCategories = new ClipboardJS('.clipboard-multiple', {\n        text: function (trigger) {\n            const id = trigger.getAttribute('data-clipboard-target-id');\n            const html = $(\"[id^=\" + id + \"] .highlight\");\n            return Array.from(html.map((i, target) => $(target).text())).join('\\n\\n');\n        },\n    });\n});\n"
  },
  {
    "path": "blackbox-exporter.md",
    "content": "\n<h1 style=\"text-align: center;\">\n  Blackbox exporter\n</h1>\n\n## Worldwide probes\n\n<a href=\"https://github.com/prometheus/blackbox_exporter\" target=\"_blank\">Blackbox Exporter</a> gives you the ability to probe endpoints over HTTP, HTTPS, DNS, TCP and ICMP.\n\nYou should deploy blackbox exporters in multiple Points of Presence around the globe to monitor latency. Feel free to use the following endpoints for your own projects:\n\n- https://probe-<b>montreal</b>.cleverapps.io\n- https://probe-<b>paris</b>.cleverapps.io\n- https://probe-<b>jeddah</b>.cleverapps.io\n- https://probe-<b>singapore</b>.cleverapps.io\n- https://probe-<b>sydney</b>.cleverapps.io\n- https://probe-<b>warsaw</b>.cleverapps.io\n\n☝️ Logs have been disabled. More probes from the community would be appreciated; please contribute <a href=\"https://github.com/samber/awesome-prometheus-alerts/\" target=\"_blank\">here</a>! These blackbox exporters use the following <a href=\"https://github.com/samber/blackbox_exporter/blob/master/samber.yml\" target=\"_blank\">configuration</a>.\n\n## Prometheus Configuration\n\nBlackbox exporters and endpoints must be declared in Prometheus. 
Here is a simple configuration, inspired by [Hayk Davtyan's Medium post](https://medium.com/geekculture/single-prometheus-job-for-dozens-of-blackbox-exporters-2a7ba492d6c8):\n\n```yml\n# sd/blackbox.yml\n\n- targets:\n  #\n  # Montreal\n  #\n  # http\n  - probe-montreal.cleverapps.io:_:http_2xx:_:Montreal:_:f229cy:_:https://api.screeb.app\n  - probe-montreal.cleverapps.io:_:http_2xx:_:Montreal:_:f229cy:_:https://t.screeb.app/tag.js\n  # icmp\n  - probe-montreal.cleverapps.io:_:icmp_ipv4:_:Montreal:_:f229cy:_:api.screeb.app\n  - probe-montreal.cleverapps.io:_:icmp_ipv4:_:Montreal:_:f229cy:_:t.screeb.app\n\n\n  #\n  # Paris\n  #\n  # http\n  - probe-paris.cleverapps.io:_:http_2xx:_:Paris:_:u09tgy:_:https://api.screeb.app\n  - probe-paris.cleverapps.io:_:http_2xx:_:Paris:_:u09tgy:_:https://t.screeb.app/tag.js\n  # icmp\n  - probe-paris.cleverapps.io:_:icmp_ipv4:_:Paris:_:u09tgy:_:api.screeb.app\n  - probe-paris.cleverapps.io:_:icmp_ipv4:_:Paris:_:u09tgy:_:t.screeb.app\n\n\n  #\n  # Sydney\n  #\n  # http\n  - probe-sydney.cleverapps.io:_:http_2xx:_:Sydney:_:r3gpkn:_:https://api.screeb.app\n  - probe-sydney.cleverapps.io:_:http_2xx:_:Sydney:_:r3gpkn:_:https://t.screeb.app/tag.js\n  # icmp\n  - probe-sydney.cleverapps.io:_:icmp_ipv4:_:Sydney:_:r3gpkn:_:api.screeb.app\n  - probe-sydney.cleverapps.io:_:icmp_ipv4:_:Sydney:_:r3gpkn:_:t.screeb.app\n\n  # ...\n```\n\n```yml\n# prometheus.yml\n\nglobal:\n  # ...\n\nscrape_configs:\n\n  - job_name: 'blackbox'\n    metrics_path: /probe\n    scrape_interval: 30s\n    scheme: https\n    file_sd_configs:\n      - files:\n        - /etc/prometheus/sd/blackbox.yml\n    relabel_configs:\n      # adds \"module\" label in the final labelset\n      - source_labels: [__address__]\n        regex: '.*:_:(.*):_:.*:_:.*:_:.*'\n        target_label: module\n      # adds \"geohash\" label in the final labelset\n      - source_labels: [__address__]\n        regex: '.*:_:.*:_:.*:_:(.*):_:.*'\n        target_label: geohash\n      # rewrites 
\"instance\" label with corresponding URL\n      - source_labels: [__address__]\n        regex: '.*:_:.*:_:.*:_:.*:_:(.*)'\n        target_label: instance\n      # adds \"pop\" label with corresponding location name\n      - source_labels: [__address__]\n        regex: '.*:_:.*:_:(.*):_:.*:_:.*'\n        target_label: pop\n      # passes \"module\" parameter to Blackbox exporter\n      - source_labels: [module]\n        target_label: __param_module\n      # passes \"target\" parameter to Blackbox exporter\n      - source_labels: [instance]\n        target_label: __param_target\n      # the Blackbox exporter's real hostname:port\n      - source_labels: [__address__]\n        regex: '(.*):_:.*:_:.*:_:.*:_:.*'\n        target_label: __address__\n\n  # ...\n\n```\n\n## Geohash\n\n![](assets/grafana-map-panel.png)\n\nTo display nice maps in Grafana, you need to record the location of each blackbox exporter. The Grafana map panel reads locations in the \"geohash\" format. To get one:\n\n- go to Google Maps\n- extract the lat/long from the URL\n- convert lat/long to geohash here: http://geohash.co\n\n## Grafana\n\nSome great dashboards have been created by the community: https://grafana.com/grafana/dashboards/?search=blackbox\n\nSince Grafana v5.0.0, a map panel is available: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/geomap/\n"
  },
  {
    "path": "dist/rules/apache/lusitaniae-apache-exporter.yml",
    "content": "groups:\n\n- name: LusitaniaeApacheExporter\n\n  \n  rules:\n\n    - alert: ApacheDown\n      expr: 'apache_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Apache down (instance {{ $labels.instance }})\n        description: \"Apache down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApacheWorkersLoad\n      expr: '(sum by (instance) (apache_workers{state=\"busy\"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 and sum by (instance) (apache_scoreboard) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Apache workers load (instance {{ $labels.instance }})\n        description: \"Apache workers in busy state approach the max workers count 80% workers busy on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApacheRestart\n      expr: 'apache_uptime_seconds_total / 60 < 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Apache restart (instance {{ $labels.instance }})\n        description: \"Apache has just been restarted.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/apache-flink/flink-prometheus-reporter.yml",
    "content": "groups:\n\n- name: FlinkPrometheusReporter\n\n  \n  rules:\n\n    - alert: FlinkJobIsNotRunning\n      expr: 'flink_jobmanager_numRunningJobs == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Flink job is not running (instance {{ $labels.instance }})\n        description: \"No Flink jobs are currently running. All jobs may have failed or been cancelled.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FlinkNoTaskmanagersRegistered\n      expr: 'flink_jobmanager_numRegisteredTaskManagers == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Flink no TaskManagers registered (instance {{ $labels.instance }})\n        description: \"No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.\n    - alert: FlinkAllTaskSlotsUsed\n      expr: 'flink_jobmanager_taskSlotsAvailable == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink all task slots used (instance {{ $labels.instance }})\n        description: \"All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # A single restart may be normal during deployments. 
Adjust threshold based on restart tolerance.\n    - alert: FlinkJobRestartIncreasing\n      expr: 'increase(flink_jobmanager_job_numRestarts[5m]) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink job restart increasing (instance {{ $labels.instance }})\n        description: \"Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FlinkCheckpointFailures\n      expr: 'increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink checkpoint failures (instance {{ $labels.instance }})\n        description: \"Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Value is converted from milliseconds to seconds for correct humanizeDuration display.\n    # Threshold is 60 seconds. 
Adjust based on your checkpoint interval and state size.\n    - alert: FlinkCheckpointDurationHigh\n      expr: 'flink_jobmanager_job_lastCheckpointDuration / 1000 > 60'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink checkpoint duration high (instance {{ $labels.instance }})\n        description: \"Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FlinkTaskBackpressured\n      expr: 'flink_taskmanager_job_task_isBackPressured == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink task backpressured (instance {{ $labels.instance }})\n        description: \"Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.\n    - alert: FlinkTaskHighBackpressureTime\n      expr: 'flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink task high backpressure time (instance {{ $labels.instance }})\n        description: \"Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FlinkTaskmanagerHeapMemoryHigh\n      expr: 'flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink TaskManager heap memory high (instance {{ $labels.instance }})\n        description: \"Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - 
alert: FlinkJobmanagerHeapMemoryHigh\n      expr: 'flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink JobManager heap memory high (instance {{ $labels.instance }})\n        description: \"Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.\n    - alert: FlinkTaskmanagerGcTimeHigh\n      expr: 'rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink TaskManager GC time high (instance {{ $labels.instance }})\n        description: \"Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Only fires for tasks that have previously received records, to avoid false positives during startup.\n    - alert: FlinkNoRecordsProcessed\n      expr: 'rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flink no records processed (instance {{ $labels.instance }})\n        description: \"Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/apache-spark/spark-prometheus.yml",
    "content": "groups:\n\n- name: SparkPrometheus\n\n  # Spark exposes metrics via two built-in endpoints:\n  # - PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)\n  # - PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)\n  # Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.\n  # Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet\n  \n  rules:\n\n    - alert: SparkNoAliveWorkers\n      expr: 'metrics_master_aliveWorkers_Value == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Spark no alive workers (instance {{ $labels.instance }})\n        description: \"No Spark workers are alive. The cluster has no processing capacity.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Adjust the threshold based on your cluster's typical queuing behavior.\n    - alert: SparkTooManyWaitingApps\n      expr: 'metrics_master_waitingApps_Value > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark too many waiting apps (instance {{ $labels.instance }})\n        description: \"Spark has {{ $value }} applications waiting for resources.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SparkWorkerMemoryExhausted\n      expr: 'metrics_worker_memFree_MB_Value == 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark worker memory exhausted (instance {{ $labels.instance }})\n        description: \"Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when a worker has no free cores. 
This may be normal under high load but can indicate capacity issues.\n    - alert: SparkWorkerCoresExhausted\n      expr: 'metrics_worker_coresFree_Value == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark worker cores exhausted (instance {{ $labels.instance }})\n        description: \"Spark worker {{ $labels.instance }} has no free cores.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when more than 10% of executor time is spent in garbage collection.\n    # This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/).\n    - alert: SparkExecutorHighGcTime\n      expr: 'metrics_executor_totalGCTime_seconds_total / metrics_executor_totalDuration > 0.1 and metrics_executor_totalDuration > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark executor high GC time (instance {{ $labels.instance }})\n        description: \"Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SparkExecutorAllTasksFailing\n      expr: 'metrics_executor_failedTasks_total > 0 and metrics_executor_completedTasks_total == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Spark executor all tasks failing (instance {{ $labels.instance }})\n        description: \"Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SparkExecutorHighTaskFailureRate\n      expr: 'metrics_executor_failedTasks_total / metrics_executor_totalTasks_total > 0.1 and metrics_executor_totalTasks_total > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark executor high task failure rate (instance {{ $labels.instance }})\n        
description: \"Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # diskUsed is a gauge, not a counter — do not use rate(). Threshold of 1GB is a rough default.\n    # Disk spilling indicates insufficient memory for the workload.\n    - alert: SparkExecutorHighDiskSpill\n      expr: 'metrics_executor_diskUsed_bytes > 1e9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spark executor high disk spill (instance {{ $labels.instance }})\n        description: \"Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/apc-ups/apcupsd_exporter.yml",
    "content": "groups:\n\n- name: Apcupsd_exporter\n\n  \n  rules:\n\n    - alert: ApcUpsBatteryNearlyEmpty\n      expr: 'apcupsd_battery_charge_percent < 10'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: APC UPS Battery nearly empty (instance {{ $labels.instance }})\n        description: \"Battery is almost empty (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining\n      expr: 'apcupsd_battery_time_left_seconds < 900'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }})\n        description: \"Battery is almost empty (< 15 Minutes remaining)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApcUpsAcInputOutage\n      expr: 'apcupsd_battery_time_on_seconds > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: APC UPS AC input outage (instance {{ $labels.instance }})\n        description: \"UPS now running on battery (since {{$value | humanizeDuration}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApcUpsLowBatteryVoltage\n      expr: '(apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: APC UPS low battery voltage (instance {{ $labels.instance }})\n        description: \"Battery voltage is lower than nominal (< 95%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ApcUpsHighTemperature\n      expr: 'apcupsd_internal_temperature_celsius >= 40'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: APC UPS high temperature (instance {{ $labels.instance }})\n        description: \"Internal temperature is high ({{$value}}°C)\\n  VALUE = {{ $value }}\\n  LABELS 
= {{ $labels }}\"\n\n    - alert: ApcUpsHighLoad\n      expr: 'apcupsd_ups_load_percent > 80'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: APC UPS high load (instance {{ $labels.instance }})\n        description: \"UPS load is > 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/argocd/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: ArgocdServiceNotSynced\n      expr: 'argocd_app_info{sync_status!=\"Synced\"} != 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: ArgoCD service not synced (instance {{ $labels.instance }})\n        description: \"Service {{ $labels.name }} run by argo is currently not in sync.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ArgocdServiceUnhealthy\n      expr: 'argocd_app_info{health_status!=\"Healthy\"} != 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: ArgoCD service unhealthy (instance {{ $labels.instance }})\n        description: \"Service {{ $labels.name }} run by argo is currently not healthy.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/aws-cloudwatch/prometheus-cloudwatch-exporter.yml",
    "content": "groups:\n\n- name: PrometheusCloudwatchExporter\n\n  # CloudWatch metrics are exported as aws_{namespace}_{metric_name}_{statistic} gauges.\n  # The rules below cover both exporter health and common AWS service alerts.\n  # Adjust thresholds and label filters to match your CloudWatch exporter configuration.\n  \n  rules:\n\n    - alert: CloudwatchExporterScrapeError\n      expr: 'cloudwatch_exporter_scrape_error > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CloudWatch exporter scrape error (instance {{ $labels.instance }})\n        description: \"CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CloudwatchExporterSlowScrape\n      expr: 'cloudwatch_exporter_scrape_duration_seconds > 300'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CloudWatch exporter slow scrape (instance {{ $labels.instance }})\n        description: \"CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # CloudWatch API calls cost money (~$0.01 per 1000 GetMetricData requests).\n    # 100 requests/minute ≈ $45/month. Adjust the threshold based on your budget.\n    - alert: CloudwatchApiHighRequestRate\n      expr: 'sum by (instance, namespace) (rate(cloudwatch_requests_total[5m])) * 60 > 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: CloudWatch API high request rate (instance {{ $labels.instance }})\n        description: \"CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. 
This can lead to high AWS costs.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires EC2 CPUUtilization metric configured in the CloudWatch exporter.\n    - alert: AwsEc2HighCpuUtilization\n      expr: 'aws_ec2_cpuutilization_average > 90'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS EC2 high CPU utilization (instance {{ $labels.instance }})\n        description: \"EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires RDS FreeStorageSpace metric. The threshold of 2GB is a rough default.\n    # Adjust based on your database size.\n    - alert: AwsRdsLowFreeStorageSpace\n      expr: 'aws_rds_free_storage_space_average < 2000000000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS RDS low free storage space (instance {{ $labels.instance }})\n        description: \"RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires RDS CPUUtilization metric configured in the CloudWatch exporter.\n    - alert: AwsRdsHighCpuUtilization\n      expr: 'aws_rds_cpuutilization_average > 90'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS RDS high CPU utilization (instance {{ $labels.instance }})\n        description: \"RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The threshold depends on the RDS instance class. 
Adjust based on your\n    # instance type's max_connections parameter.\n    - alert: AwsRdsHighDatabaseConnections\n      expr: 'aws_rds_database_connections_average > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS RDS high database connections (instance {{ $labels.instance }})\n        description: \"RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires SQS ApproximateNumberOfMessagesVisible metric. The threshold of 1000\n    # is a rough default. Adjust based on your expected queue depth.\n    - alert: AwsSqsQueueMessagesVisible\n      expr: 'aws_sqs_approximate_number_of_messages_visible_average > 1000'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS SQS queue messages visible (instance {{ $labels.instance }})\n        description: \"SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires SQS ApproximateAgeOfOldestMessage metric.\n    - alert: AwsSqsMessageAgeTooOld\n      expr: 'aws_sqs_approximate_age_of_oldest_message_maximum > 3600'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS SQS message age too old (instance {{ $labels.instance }})\n        description: \"SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires ApplicationELB UnHealthyHostCount metric.\n    - alert: AwsAlbUnhealthyTargets\n      expr: 'aws_applicationelb_unhealthy_host_count_average > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: AWS ALB unhealthy targets (instance {{ $labels.instance }})\n        description: \"ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) 
in target group {{ $labels.target_group }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires ApplicationELB HTTPCode_ELB_5XX_Count and RequestCount metrics.\n    - alert: AwsAlbHigh5xxErrorRate\n      expr: '(aws_applicationelb_httpcode_elb_5_xx_count_sum / aws_applicationelb_request_count_sum) * 100 > 5 and aws_applicationelb_request_count_sum > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: AWS ALB high 5xx error rate (instance {{ $labels.instance }})\n        description: \"ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires ApplicationELB TargetResponseTime metric.\n    - alert: AwsAlbHighTargetResponseTime\n      expr: 'aws_applicationelb_target_response_time_average > 2'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS ALB high target response time (instance {{ $labels.instance }})\n        description: \"ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Requires Lambda Errors and Invocations metrics.\n    - alert: AwsLambdaHighErrorRate\n      expr: '(aws_lambda_errors_sum / aws_lambda_invocations_sum) * 100 > 5 and aws_lambda_invocations_sum > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: AWS Lambda high error rate (instance {{ $labels.instance }})\n        description: \"Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/azure/azure-metrics-exporter.yml",
    "content": "groups:\n\n- name: AzureMetricsExporter\n\n  # The exporter uses azurerm_resource_metric as the default metric name for forwarded Azure Monitor metrics.\n  # The metric name can be customized via the name parameter in probe configuration.\n  # Self-monitoring metrics use the azurerm_stats_* and azurerm_api_* prefixes.\n  \n  rules:\n\n    - alert: AzureExporterRequestErrors\n      expr: 'increase(azurerm_stats_metric_requests{result=\"error\"}[15m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Azure exporter request errors (instance {{ $labels.instance }})\n        description: \"Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: AzureExporterHighErrorRate\n      expr: 'sum by (instance) (rate(azurerm_stats_metric_requests{result=\"error\"}[5m])) / sum by (instance) (rate(azurerm_stats_metric_requests[5m])) * 100 > 10 and sum by (instance) (rate(azurerm_stats_metric_requests[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Azure exporter high error rate (instance {{ $labels.instance }})\n        description: \"Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Azure Resource Manager enforces rate limits per subscription.\n    # The threshold of 100 remaining calls is a rough default. 
Adjust based on your\n    # scrape interval and number of monitored resources.\n    - alert: AzureApiReadRateLimitApproaching\n      expr: 'azurerm_api_ratelimit{type=\"read\"} < 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Azure API read rate limit approaching (instance {{ $labels.instance }})\n        description: \"Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: AzureApiWriteRateLimitApproaching\n      expr: 'azurerm_api_ratelimit{type=\"write\"} < 50'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Azure API write rate limit approaching (instance {{ $labels.instance }})\n        description: \"Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: AzureExporterSlowCollection\n      expr: 'azurerm_stats_metric_collecttime > 300'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Azure exporter slow collection (instance {{ $labels.instance }})\n        description: \"Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/blackbox/blackbox-exporter.yml",
    "content": "groups:\n\n- name: BlackboxExporter\n\n  \n  rules:\n\n    - alert: BlackboxProbeFailed\n      expr: 'probe_success == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Blackbox probe failed (instance {{ $labels.instance }})\n        description: \"Probe failed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxConfigurationReloadFailure\n      expr: 'blackbox_exporter_config_last_reload_successful != 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Blackbox configuration reload failure (instance {{ $labels.instance }})\n        description: \"Blackbox configuration reload failure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxSlowProbe\n      expr: 'probe_duration_seconds > 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Blackbox slow probe (instance {{ $labels.instance }})\n        description: \"Blackbox probe took more than 1s to complete\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxProbeHttpFailure\n      expr: 'probe_http_status_code <= 199 OR probe_http_status_code >= 400'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})\n        description: \"HTTP status code is not 200-399\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxSslCertificateWillExpireSoon\n      expr: '3 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 20'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})\n        description: \"SSL certificate expires in less than 20 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
BlackboxSslCertificateWillExpireVerySoon\n      expr: '0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})\n        description: \"SSL certificate expires in less than 3 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # For probe_ssl_earliest_cert_expiry to be exposed after expiration, you\n    # need to enable insecure_skip_verify. Note that this will disable\n    # certificate validation.\n    # See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config\n    - alert: BlackboxSslCertificateExpired\n      expr: 'round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})\n        description: \"SSL certificate has expired already\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxProbeSlowHttp\n      expr: 'probe_http_duration_seconds > 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})\n        description: \"HTTP request took more than 1s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: BlackboxProbeSlowPing\n      expr: 'probe_icmp_duration_seconds > 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Blackbox probe slow ping (instance {{ $labels.instance }})\n        description: \"Blackbox ping took more than 1s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/caddy/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: CaddyReverseProxyDown\n      expr: 'count(caddy_reverse_proxy_upstreams_healthy) by (upstream) == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Caddy Reverse Proxy Down (instance {{ $labels.instance }})\n        description: \"All Caddy reverse proxies are down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CaddyHighHttp4xxErrorRateService\n      expr: 'sum(rate(caddy_http_request_duration_seconds_count{code=~\"4..\"}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Caddy high HTTP 4xx error rate service (instance {{ $labels.instance }})\n        description: \"Caddy service 4xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CaddyHighHttp5xxErrorRateService\n      expr: 'sum(rate(caddy_http_request_duration_seconds_count{code=~\"5..\"}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Caddy high HTTP 5xx error rate service (instance {{ $labels.instance }})\n        description: \"Caddy service 5xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cassandra/criteo-cassandra-exporter.yml",
    "content": "groups:\n\n- name: CriteoCassandraExporter\n\n  \n  rules:\n\n    - alert: CassandraHintsCount\n      expr: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:storage:totalhints:count\"}[1m]) > 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra hints count (instance {{ $labels.instance }})\n        description: \"Cassandra hints count has changed on {{ $labels.instance }} some nodes may go down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCompactionTaskPending\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:compaction:pendingtasks:value\"} > 100'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra compaction task pending (instance {{ $labels.instance }})\n        description: \"Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraViewwriteLatency\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile\"} > 100000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra viewwrite latency (instance {{ $labels.instance }})\n        description: \"High viewwrite latency on {{ $labels.instance }} cassandra node\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraAuthenticationFailures\n      expr: 'rate(cassandra_stats{name=\"org:apache:cassandra:metrics:client:authfailure:count\"}[1m]) > 5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra authentication failures (instance {{ $labels.instance }})\n        description: \"Increase of Cassandra authentication failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a 
restart without triggering an alert.\n    - alert: CassandraNodeDown\n      expr: 'sum(cassandra_stats{name=\"org:apache:cassandra:net:failuredetector:downendpointcount\"}) by (service,group,cluster,env) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra node down (instance {{ $labels.instance }})\n        description: \"Cassandra node down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCommitlogPendingTasks(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:commitlog:pendingtasks:value\"} > 15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra commitlog pending tasks (Criteo) (instance {{ $labels.instance }})\n        description: \"Unexpected number of Cassandra commitlog pending tasks\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCompactionExecutorBlockedTasks(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count\"} > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra compaction executor blocked tasks (Criteo) (instance {{ $labels.instance }})\n        description: \"Some Cassandra compaction executor tasks are blocked\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraFlushWriterBlockedTasks(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count\"} > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra flush writer blocked tasks (Criteo) (instance {{ $labels.instance }})\n        description: \"Some Cassandra flush writer tasks are blocked\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraRepairPendingTasks\n      expr: 
'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value\"} > 2'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra repair pending tasks (instance {{ $labels.instance }})\n        description: \"Some Cassandra repair tasks are pending\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraRepairBlockedTasks\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count\"} > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})\n        description: \"Some Cassandra repair tasks are blocked\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraConnectionTimeoutsTotal(criteo)\n      expr: 'rate(cassandra_stats{name=\"org:apache:cassandra:metrics:connection:totaltimeouts:count\"}[1m]) > 5'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra connection timeouts total (Criteo) (instance {{ $labels.instance }})\n        description: \"Some connections between nodes are timing out\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraStorageExceptions(criteo)\n      expr: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:storage:exceptions:count\"}[1m]) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra storage exceptions (Criteo) (instance {{ $labels.instance }})\n        description: \"Something is going wrong with Cassandra storage\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraTombstoneDump(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile\"} > 1000'\n      for: 0m\n      labels:\n      
  severity: critical\n      annotations:\n        summary: Cassandra tombstone dump (Criteo) (instance {{ $labels.instance }})\n        description: \"Too many tombstones scanned in queries\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestUnavailableWrite(criteo)\n      expr: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:write:unavailables:count\"}[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request unavailable write (Criteo) (instance {{ $labels.instance }})\n        description: \"Write failures have occurred because too many nodes are unavailable\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestUnavailableRead(criteo)\n      expr: 'changes(cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:read:unavailables:count\"}[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request unavailable read (Criteo) (instance {{ $labels.instance }})\n        description: \"Read failures have occurred because too many nodes are unavailable\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestWriteFailure(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate\"} > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request write failure (Criteo) (instance {{ $labels.instance }})\n        description: \"A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. 
The most common cause for this type of error is when batch sizes are too large.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestReadFailure(criteo)\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate\"} > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request read failure (Criteo) (instance {{ $labels.instance }})\n        description: \"A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCacheHitRateKeyCache\n      expr: 'cassandra_stats{name=\"org:apache:cassandra:metrics:cache:keycache:hitrate:value\"} < .85'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})\n        description: \"Key cache hit rate is below 85%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cassandra/instaclustr-cassandra-exporter.yml",
    "content": "groups:\n\n- name: InstaclustrCassandraExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: CassandraNodeIsUnavailable\n      expr: 'cassandra_endpoint_active < 1'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra Node is unavailable (instance {{ $labels.instance }})\n        description: \"Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraManyCompactionTasksArePending\n      expr: 'cassandra_table_estimated_pending_compactions > 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }})\n        description: \"Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCommitlogPendingTasks(instaclustr)\n      expr: 'cassandra_commit_log_pending_tasks > 15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra commitlog pending tasks (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraCompactionExecutorBlockedTasks(instaclustr)\n      expr: 'cassandra_thread_pool_blocked_tasks{pool=\"CompactionExecutor\"} > 15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra compaction executor blocked tasks (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
CassandraFlushWriterBlockedTasks(instaclustr)\n      expr: 'cassandra_thread_pool_blocked_tasks{pool=\"MemtableFlushWriter\"} > 15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cassandra flush writer blocked tasks (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraConnectionTimeoutsTotal(instaclustr)\n      expr: 'sum by (cassandra_cluster,instance) (rate(cassandra_client_request_timeouts_total[5m])) > 5'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra connection timeouts total (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Some connections between nodes are timing out - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraStorageExceptions(instaclustr)\n      expr: 'changes(cassandra_storage_exceptions_total[1m]) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra storage exceptions (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraTombstoneDump(instaclustr)\n      expr: 'avg(cassandra_table_tombstones_scanned{quantile=\"0.99\"}) by (instance,cassandra_cluster,keyspace) > 100'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra tombstone dump (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestUnavailableWrite(instaclustr)\n      expr: 
'changes(cassandra_client_request_unavailable_exceptions_total{operation=\"write\"}[1m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request unavailable write (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Some Cassandra client requests are unavailable to write - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestUnavailableRead(instaclustr)\n      expr: 'changes(cassandra_client_request_unavailable_exceptions_total{operation=\"read\"}[1m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request unavailable read (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Some Cassandra client requests are unavailable to read - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestWriteFailure(instaclustr)\n      expr: 'increase(cassandra_client_request_failures_total{operation=\"write\"}[1m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request write failure (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CassandraClientRequestReadFailure(instaclustr)\n      expr: 'increase(cassandra_client_request_failures_total{operation=\"read\"}[1m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cassandra client request read failure (Instaclustr) (instance {{ $labels.instance }})\n        description: \"Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\\n  VALUE = {{ 
$value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/ceph/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: CephState\n      expr: 'ceph_health_status != 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Ceph State (instance {{ $labels.instance }})\n        description: \"Ceph instance unhealthy\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephMonitorClockSkew\n      expr: 'abs(ceph_monitor_clock_skew_seconds) > 0.2'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph monitor clock skew (instance {{ $labels.instance }})\n        description: \"Ceph monitor clock skew detected. Please check ntp and hardware clock settings\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephMonitorLowSpace\n      expr: 'ceph_monitor_avail_percent < 10'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph monitor low space (instance {{ $labels.instance }})\n        description: \"Ceph monitor storage is low.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephOsdDown\n      expr: 'ceph_osd_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Ceph OSD Down (instance {{ $labels.instance }})\n        description: \"Ceph Object Storage Daemon Down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephHighOsdLatency\n      expr: 'ceph_osd_perf_apply_latency_seconds > 5'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph high OSD latency (instance {{ $labels.instance }})\n        description: \"Ceph Object Storage Daemon latency is high. 
Please check whether it is stuck in a weird state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephOsdLowSpace\n      expr: 'ceph_osd_utilization > 90'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph OSD low space (instance {{ $labels.instance }})\n        description: \"Ceph Object Storage Daemon is running out of space. Please add more disks.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephOsdReweighted\n      expr: 'ceph_osd_weight < 1'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph OSD reweighted (instance {{ $labels.instance }})\n        description: \"Ceph Object Storage Daemon takes too much time to resize.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgDown\n      expr: 'ceph_pg_down > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Ceph PG down (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are down. Please ensure that all the data are available.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgIncomplete\n      expr: 'ceph_pg_incomplete > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Ceph PG incomplete (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are incomplete. Please ensure that all the data are available.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgInconsistent\n      expr: 'ceph_pg_inconsistent > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph PG inconsistent (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are inconsistent. 
Data is available but inconsistent across nodes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgActivationLong\n      expr: 'ceph_pg_activating > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph PG activation long (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are taking too long to activate.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgBackfillFull\n      expr: 'ceph_pg_backfill_toofull > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ceph PG backfill full (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are located on a full Object Storage Daemon in the cluster. Those PGs can be unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CephPgUnavailable\n      expr: 'ceph_pg_total - ceph_pg_active > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Ceph PG unavailable (instance {{ $labels.instance }})\n        description: \"Some Ceph placement groups are unavailable.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cert-manager/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: Cert-managerAbsent\n      expr: 'absent(up{job=\"cert-manager\"})'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cert-Manager absent (instance {{ $labels.instance }})\n        description: \"Cert-Manager has disappeared from Prometheus service discovery. New certificates will not be able to be minted, and existing ones can't be renewed until cert-manager is back.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 21 days is a rough default. ACME certificates are typically renewed 30 days before expiry, so expiring within 21 days may indicate issuer misconfiguration.\n    - alert: Cert-managerCertificateExpiringSoon\n      expr: 'avg by (exported_namespace, namespace, name) (certmanager_certificate_expiration_timestamp_seconds - time()) < (21 * 24 * 3600)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Cert-Manager certificate expiring soon (instance {{ $labels.instance }})\n        description: \"The certificate {{ $labels.name }} is expiring in less than 21 days.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: Cert-managerCertificateNotReady\n      expr: 'max by (name, exported_namespace, namespace, condition) (certmanager_certificate_ready_status{condition!=\"True\"} == 1)'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cert-Manager certificate not ready (instance {{ $labels.instance }})\n        description: \"The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # In cert-manager 1.19+, the metric was renamed (dropped http_ prefix). 
Verify metric name against your version.\n    - alert: Cert-managerHittingAcmeRateLimits\n      expr: 'sum by (host) (rate(certmanager_http_acme_client_request_count{status=\"429\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cert-Manager hitting ACME rate limits (instance {{ $labels.instance }})\n        description: \"Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cilium/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    # Metric name depends on Cilium version. Use cilium_unreachable_nodes (older) or cilium_node_connectivity_status (1.14+).\n    - alert: CiliumAgentUnreachableNodes\n      expr: 'sum(cilium_unreachable_nodes{}) by (pod) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent unreachable nodes (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Metric name depends on Cilium version. Use cilium_unreachable_health_endpoints (older) or cilium_node_connectivity_status (1.14+).\n    - alert: CiliumAgentUnreachableHealthEndpoints\n      expr: 'sum(cilium_unreachable_health_endpoints{}) by (pod) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent unreachable health endpoints (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Metric name depends on Cilium version. Use cilium_controllers_failing (older) or cilium_controllers_runs_total (1.14+).\n    - alert: CiliumAgentFailingControllers\n      expr: 'sum(cilium_controllers_failing{}) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent failing controllers (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). 
Check cilium-agent logs for details.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentEndpointFailures\n      expr: 'sum(cilium_endpoint_state{endpoint_state=\"invalid\"}) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent endpoint failures (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentEndpointRegenerationFailures\n      expr: 'sum(rate(cilium_endpoint_regenerations_total{outcome=\"fail\"}[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent endpoint regeneration failures (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentEndpointUpdateFailure\n      expr: 'sum(rate(cilium_k8s_client_api_calls_total{method=~\"(PUT|POST|PATCH)\", endpoint=\"endpoint\", return_code!~\"2[0-9][0-9]\"}[5m])) by (pod, method, return_code) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent endpoint update failure (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentEndpointCreateFailure\n      expr: 'sum(rate(cilium_api_limiter_processed_requests_total{api_call=~\"endpoint-create\", outcome=\"fail\"}[1m])) by (pod, api_call) > 0'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Cilium agent endpoint create failure (instance {{ $labels.instance 
}})\n        description: \"Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentMapOperationFailures\n      expr: 'sum(rate(cilium_bpf_map_ops_total{outcome=\"fail\"}[5m])) by (map_name, pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent map operation failures (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped.\n    - alert: CiliumAgentBpfMapPressure\n      expr: 'cilium_bpf_map_pressure{} > 0.9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent BPF map pressure (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentConntrackTableFull\n      expr: 'sum(rate(cilium_drop_count_total{reason=\"CT: Map insertion failed\"}[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium agent conntrack table full (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. 
Increase CT map size or investigate connection leaks.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentConntrackFailedGarbageCollection\n      expr: 'sum(rate(cilium_datapath_conntrack_gc_runs_total{status=\"uncompleted\"}[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent conntrack failed garbage collection (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentNatTableFull\n      expr: 'sum(rate(cilium_drop_count_total{reason=\"No mapping for NAT masquerade\"}[1m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium agent NAT table full (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked.\n    - alert: CiliumAgentHighDeniedRate\n      expr: 'sum(rate(cilium_drop_count_total{reason=\"Policy denied\"}[1m])) by (pod) > 0'\n      for: 10m\n      labels:\n        severity: info\n      annotations:\n        summary: Cilium agent high denied rate (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. 
Verify network policies are correct.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentHighDropRate\n      expr: 'sum(rate(cilium_drop_count_total{reason!~\"Policy denied\"}[5m])) by (pod, reason) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent high drop rate (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentPolicyMapPressure\n      expr: 'sum(cilium_bpf_map_pressure{map_name=~\"cilium_policy_.*\"}) by (pod) > 0.9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent policy map pressure (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentPolicyImportErrors\n      expr: 'sum(rate(cilium_policy_change_total{outcome=\"fail\"}[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent policy import errors (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 60s is a rough default. 
Adjust based on cluster size and policy complexity.\n    - alert: CiliumAgentPolicyImplementationDelay\n      expr: 'histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay[5m])) by (le, pod)) > 60'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent policy implementation delay (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumNode-localHighIdentityAllocation\n      expr: '(sum(cilium_identity{type=\"node_local\"}) by (pod) / (2^16-1)) > 0.8'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium node-local high identity allocation (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumClusterHighIdentityAllocation\n      expr: '(sum(cilium_identity{type=\"cluster_local\"}) by () / (2^16-256)) > 0.8'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium cluster high identity allocation (instance {{ $labels.instance }})\n        description: \"Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumOperatorExhaustedIpamIps\n      expr: 'sum(cilium_operator_ipam_ips{type=\"available\"}) by () <= 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium operator exhausted IPAM IPs (instance {{ $labels.instance }})\n        description: \"Cilium operator has no available IPAM IPs. 
New pods will fail to schedule networking.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size.\n    - alert: CiliumOperatorLowAvailableIpamIps\n      expr: 'sum(cilium_operator_ipam_ips{type!=\"available\"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium operator low available IPAM IPs (instance {{ $labels.instance }})\n        description: \"Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Some Cilium versions may not have a status label on this metric. Verify against your Cilium version.\n    - alert: CiliumOperatorIpamInterfaceCreationFailures\n      expr: 'sum(rate(cilium_operator_ipam_interface_creation_ops{status!=\"success\"}[5m])) by () > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium operator IPAM interface creation failures (instance {{ $labels.instance }})\n        description: \"Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentApiErrors\n      expr: 'sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~\"5[0-9][0-9]\"}[5m])) by (pod, return_code) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium agent API errors (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). 
Agent may be unhealthy.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumAgentKubernetesClientErrors\n      expr: 'sum(rate(cilium_k8s_client_api_calls_total{endpoint!=\"metrics\", return_code!~\"2[0-9][0-9]\"}[5m])) by (pod, endpoint, return_code) > 0'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Cilium agent Kubernetes client errors (instance {{ $labels.instance }})\n        description: \"Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumClustermeshRemoteClusterNotReady\n      expr: 'count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium ClusterMesh remote cluster not ready (instance {{ $labels.instance }})\n        description: \"Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumClustermeshRemoteClusterFailing\n      expr: 'sum(rate(cilium_clustermesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium ClusterMesh remote cluster failing (instance {{ $labels.instance }})\n        description: \"Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumKvstoremeshRemoteClusterNotReady\n      expr: 'count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: 
Cilium KVStoreMesh remote cluster not ready (instance {{ $labels.instance }})\n        description: \"Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumKvstoremeshRemoteClusterFailing\n      expr: 'sum(rate(cilium_kvstoremesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium KVStoreMesh remote cluster failing (instance {{ $labels.instance }})\n        description: \"Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumKvstoremeshSyncErrors\n      expr: 'sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cilium KVStoreMesh sync errors (instance {{ $labels.instance }})\n        description: \"Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CiliumHubbleLostEvents\n      expr: 'sum(rate(hubble_lost_events_total[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium Hubble lost events (instance {{ $labels.instance }})\n        description: \"Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 10% is a rough default. 
Some DNS errors may be normal depending on your workload.\n    - alert: CiliumHubbleHighDnsErrorRate\n      expr: 'sum(rate(hubble_dns_responses_total{rcode!=\"No Error\"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cilium Hubble high DNS error rate (instance {{ $labels.instance }})\n        description: \"Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/clickhouse/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    # Adjust the job label to match your Prometheus configuration.\n    - alert: ClickhouseNodeDown\n      expr: 'up{job=\"clickhouse\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse node down (instance {{ $labels.instance }})\n        description: \"No metrics received from ClickHouse exporter for over 2 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseMemoryUsageCritical\n      expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse Memory Usage Critical (instance {{ $labels.instance }})\n        description: \"Memory usage is critically high, over 90%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseMemoryUsageWarning\n      expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse Memory Usage Warning (instance {{ $labels.instance }})\n        description: \"Memory usage is over 80%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseDiskSpaceLowOnDefault\n      expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse Disk Space Low on Default (instance {{ $labels.instance }})\n        description: \"Disk space on default is below 20%.\\n 
 VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseDiskSpaceCriticalOnDefault\n      expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse Disk Space Critical on Default (instance {{ $labels.instance }})\n        description: \"Disk space on default disk is critically low, below 10%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseDiskSpaceLowOnBackups\n      expr: 'ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse Disk Space Low on Backups (instance {{ $labels.instance }})\n        description: \"Disk space on backups is below 20%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseReplicaErrors\n      expr: 'ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse Replica Errors (instance {{ $labels.instance }})\n        description: \"Critical replica errors detected, either all replicas are stale or lost.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseNoAvailableReplicas\n      expr: 'ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse No Available Replicas (instance {{ $labels.instance }})\n        
description: \"No available replicas in ClickHouse.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseNoLiveReplicas\n      expr: 'ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse No Live Replicas (instance {{ $labels.instance }})\n        description: \"There are too few live replicas available, risking data loss and service disruption.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Please replace the threshold with an appropriate value\n    - alert: ClickhouseHighTcpConnections\n      expr: 'ClickHouseMetrics_TCPConnection > 400'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse High TCP Connections (instance {{ $labels.instance }})\n        description: \"High number of TCP connections, indicating heavy client or inter-cluster communication.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Adjust the threshold based on your cluster size and expected replication traffic.\n    - alert: ClickhouseInterserverConnectionIssues\n      expr: 'ClickHouseMetrics_InterserverConnection > 50'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse Interserver Connection Issues (instance {{ $labels.instance }})\n        description: \"High number of interserver connections may indicate replication or distributed query handling issues.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseZookeeperConnectionIssues\n      expr: 'ClickHouseMetrics_ZooKeeperSession != 1'\n      for: 3m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse ZooKeeper Connection Issues (instance {{ $labels.instance }})\n        description: \"ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\\n  VALUE = {{ 
$value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseAuthenticationFailures\n      expr: 'increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 3'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: ClickHouse Authentication Failures (instance {{ $labels.instance }})\n        description: \"Authentication failures detected, indicating potential security issues or misconfiguration.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseAccessDeniedErrors\n      expr: 'increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 3'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: ClickHouse Access Denied Errors (instance {{ $labels.instance }})\n        description: \"Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseRejectedInsertQueries\n      expr: 'increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse rejected insert queries (instance {{ $labels.instance }})\n        description: \"INSERTs rejected due to too many active data parts. 
Reduce insert frequency.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseDelayedInsertQueries\n      expr: 'increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse delayed insert queries (instance {{ $labels.instance }})\n        description: \"INSERTs delayed due to a high number of active parts.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseZookeeperHardwareException\n      expr: 'increase(ClickHouseProfileEvents_ZooKeeperHardwareExceptions[1m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse ZooKeeper hardware exception (instance {{ $labels.instance }})\n        description: \"ZooKeeper hardware exception: network issues while communicating with ZooKeeper.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Please replace the threshold with an appropriate value\n    - alert: ClickhouseHighNetworkUsage\n      expr: 'rate(ClickHouseProfileEvents_NetworkSendBytes[1m]) > 100*1024*1024 or rate(ClickHouseProfileEvents_NetworkReceiveBytes[1m]) > 100*1024*1024'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: ClickHouse high network usage (instance {{ $labels.instance }})\n        description: \"ClickHouse network send or receive rate exceeds 100 MiB/s.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ClickhouseDistributedRejectedInserts\n      expr: 'increase(ClickHouseProfileEvents_DistributedRejectedInserts[5m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: ClickHouse distributed rejected inserts (instance {{ $labels.instance }})\n        description: \"INSERTs into Distributed tables rejected due to pending bytes limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cloudflare/lablabs-cloudflare-exporter.yml",
    "content": "groups:\n\n- name: LablabsCloudflareExporter\n\n  \n  rules:\n\n    - alert: CloudflareHttp4xxErrorRate\n      expr: '(sum by(zone) (rate(cloudflare_zone_requests_status{status=~\"^4..\"}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[15m])) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cloudflare http 4xx error rate (instance {{ $labels.instance }})\n        description: \"Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CloudflareHttp5xxErrorRate\n      expr: '(sum by (zone) (rate(cloudflare_zone_requests_status{status=~\"^5..\"}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[5m])) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cloudflare http 5xx error rate (instance {{ $labels.instance }})\n        description: \"Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/consul/consul-exporter.yml",
    "content": "groups:\n\n- name: ConsulExporter\n\n  \n  rules:\n\n    - alert: ConsulServiceHealthcheckFailed\n      expr: 'consul_catalog_service_node_healthy == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Consul service healthcheck failed (instance {{ $labels.instance }})\n        description: \"Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ConsulMissingMasterNode\n      expr: 'consul_raft_peers < 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Consul missing master node (instance {{ $labels.instance }})\n        description: \"Numbers of consul raft peers should be 3, in order to preserve quorum.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ConsulAgentUnhealthy\n      expr: 'consul_health_node_status{status=\"critical\"} == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Consul agent unhealthy (instance {{ $labels.instance }})\n        description: \"A Consul agent is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/coredns/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: CorednsPanicCount\n      expr: 'increase(coredns_panics_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: CoreDNS Panic Count (instance {{ $labels.instance }})\n        description: \"Number of CoreDNS panics encountered\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/cortex/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: CortexRulerConfigurationReloadFailure\n      expr: 'cortex_ruler_config_last_reload_successful != 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Cortex ruler configuration reload failure (instance {{ $labels.instance }})\n        description: \"Cortex ruler configuration reload failure (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CortexNotConnectedToAlertmanager\n      expr: 'cortex_prometheus_notifications_alertmanagers_discovered < 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n        description: \"Cortex not connected to Alertmanager (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: CortexNotificationAreBeingDropped\n      expr: 'rate(cortex_prometheus_notifications_dropped_total[5m]) > 0.05'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cortex notification are being dropped (instance {{ $labels.instance }})\n        description: \"Cortex notification are being dropped due to errors (instance {{ $labels.instance }}, {{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: CortexNotificationError\n      expr: 'rate(cortex_prometheus_notifications_errors_total[5m]) > 0.05'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cortex notification error (instance {{ $labels.instance }})\n        description: \"Cortex is failing when sending alert notifications (instance {{ $labels.instance }}, {{ $value | humanize 
}}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CortexIngesterUnhealthy\n      expr: 'cortex_ring_members{state=\"Unhealthy\", name=\"ingester\"} > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cortex ingester unhealthy (instance {{ $labels.instance }})\n        description: \"Cortex has an unhealthy ingester\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CortexFrontendQueriesStuck\n      expr: 'sum by (job) (cortex_query_frontend_queue_length) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Cortex frontend queries stuck (instance {{ $labels.instance }})\n        description: \"There are queued up queries in query-frontend.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/couchdb/gesellix-couchdb-prometheus-exporter.yml",
    "content": "groups:\n\n- name: GesellixCouchdbPrometheusExporter\n\n  \n  rules:\n\n    - alert: CouchdbNodeDown\n      expr: 'couchdb_httpd_node_up == 0 or couchdb_httpd_up == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB node down (instance {{ $labels.instance }})\n        description: \"CouchDB node is not responding (node_up metric is 0) for more than 2 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbAtomMemoryUsageCritical\n      expr: 'couchdb_erlang_memory_atom_used > 0.9 * couchdb_erlang_memory_atom'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB atom memory usage critical (instance {{ $labels.instance }})\n        description: \"Atom memory usage is above 90% of limit\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbOpenDatabasesCritical\n      expr: 'couchdb_httpd_open_databases > 0.9 * 1000'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB open databases critical (instance {{ $labels.instance }})\n        description: \"Number of open databases exceeds 90% of node capacity\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbOpenOsFilesCritical\n      expr: 'couchdb_httpd_open_os_files > 0.9 * 65535'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB open OS files critical (instance {{ $labels.instance }})\n        description: \"CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: Couchdb5xxErrorRatioHigh\n      expr: 'rate(couchdb_httpd_status_codes{code=~\"5..\"}[5m]) / rate(couchdb_httpd_requests[5m]) > 0.05 and rate(couchdb_httpd_requests[5m]) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n      
  summary: CouchDB 5xx error ratio high (instance {{ $labels.instance }})\n        description: \"More than 5% of HTTP requests are returning 5xx errors\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbTemporaryViewReadRateCritical\n      expr: 'rate(couchdb_httpd_temporary_view_reads[5m]) > 100'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB temporary view read rate critical (instance {{ $labels.instance }})\n        description: \"Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbMangoQueriesScanningTooManyDocs\n      expr: 'rate(couchdb_mango_too_many_docs_scanned[5m]) > 50'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CouchDB Mango queries scanning too many docs (instance {{ $labels.instance }})\n        description: \"Some Mango queries are scanning too many documents, consider adding indexes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbMangoQueriesFailedDueToInvalidIndex\n      expr: 'rate(couchdb_mango_query_invalid_index[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CouchDB Mango queries failed due to invalid index (instance {{ $labels.instance }})\n        description: \"Some Mango queries failed to execute because the index was missing or invalid\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbMangoDocsExaminedHigh\n      expr: 'rate(couchdb_mango_docs_examined[5m]) > 1000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CouchDB Mango docs examined high (instance {{ $labels.instance }})\n        description: \"High number of documents examined by Mango queries, consider adding indexes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
CouchdbReplicatorManagerDied\n      expr: 'increase(couchdb_replicator_changes_manager_deaths[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB Replicator manager died (instance {{ $labels.instance }})\n        description: \"Replication manager process has crashed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbReplicatorQueueProcessDied\n      expr: 'increase(couchdb_replicator_changes_queue_deaths[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB Replicator queue process died (instance {{ $labels.instance }})\n        description: \"Replication queue process has crashed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbReplicatorReaderProcessDied\n      expr: 'increase(couchdb_replicator_changes_reader_deaths[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB Replicator reader process died (instance {{ $labels.instance }})\n        description: \"Replication reader process has crashed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbReplicatorFailedToStart\n      expr: 'increase(couchdb_replicator_failed_starts[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB Replicator failed to start (instance {{ $labels.instance }})\n        description: \"One or more replication tasks failed to start\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbReplicationClusterUnstable\n      expr: 'couchdb_replicator_cluster_is_stable == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB replication cluster unstable (instance {{ $labels.instance }})\n        description: \"The replication cluster is unstable, replication may be interrupted\\n  VALUE = {{ $value }}\\n  
LABELS = {{ $labels }}\"\n\n    - alert: CouchdbReplicationReadFailures\n      expr: 'increase(couchdb_replicator_changes_read_failures[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CouchDB replication read failures (instance {{ $labels.instance }})\n        description: \"Replication changes feed has failed reads more than 5 times in 5 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbFileDescriptorsHigh\n      expr: 'process_open_fds / process_max_fds > 0.85 and process_max_fds > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: CouchDB file descriptors high (instance {{ $labels.instance }})\n        description: \"Process is using more than 85% of allowed file descriptors\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbProcessRestarted\n      expr: 'changes(process_start_time_seconds[1h]) > 0'\n      for: 1m\n      labels:\n        severity: info\n      annotations:\n        summary: CouchDB process restarted (instance {{ $labels.instance }})\n        description: \"CouchDB process has restarted recently\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: CouchdbCriticalLogEntries\n      expr: 'increase(couchdb_server_couch_log{level=~\"error|critical\"}[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: CouchDB critical log entries (instance {{ $labels.instance }})\n        description: \"Critical or error log entries detected in the last 5 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/digitalocean/digitalocean-exporter.yml",
    "content": "groups:\n\n- name: DigitaloceanExporter\n\n  \n  rules:\n\n    - alert: DigitaloceanDropletDown\n      expr: 'digitalocean_droplet_up == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: DigitalOcean droplet down (instance {{ $labels.instance }})\n        description: \"DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanAccountNotActive\n      expr: 'digitalocean_account_active != 1'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: DigitalOcean account not active (instance {{ $labels.instance }})\n        description: \"DigitalOcean account is not active. It may be suspended or locked.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanDatabaseDown\n      expr: 'digitalocean_database_status == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: DigitalOcean database down (instance {{ $labels.instance }})\n        description: \"DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanKubernetesClusterDown\n      expr: 'digitalocean_kubernetes_cluster_up == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: DigitalOcean Kubernetes cluster down (instance {{ $labels.instance }})\n        description: \"DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanLoadBalancerDown\n      expr: 'digitalocean_loadbalancer_status == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: DigitalOcean load 
balancer down (instance {{ $labels.instance }})\n        description: \"DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanLoadBalancerNoBackends\n      expr: 'digitalocean_loadbalancer_droplets == 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: DigitalOcean load balancer no backends (instance {{ $labels.instance }})\n        description: \"DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanFloatingIpNotAssigned\n      expr: 'digitalocean_floating_ipv4_active == 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: DigitalOcean floating IP not assigned (instance {{ $labels.instance }})\n        description: \"DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanActiveIncidents\n      expr: 'digitalocean_incidents_total > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: DigitalOcean active incidents (instance {{ $labels.instance }})\n        description: \"DigitalOcean platform has {{ $value }} active incident(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: DigitaloceanExporterCollectionErrors\n      expr: 'increase(digitalocean_errors_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: DigitalOcean exporter collection errors (instance {{ $labels.instance }})\n        description: \"DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when more than 80% of the account's droplet limit is in 
use.\n    - alert: DigitaloceanDropletLimitApproaching\n      expr: '(count(digitalocean_droplet_up) / digitalocean_account_droplet_limit) * 100 > 80 and digitalocean_account_droplet_limit > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: DigitalOcean droplet limit approaching (instance {{ $labels.instance }})\n        description: \"DigitalOcean account is using {{ $value }}% of its droplet quota.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/docker-containers/google-cadvisor.yml",
    "content": "groups:\n\n- name: GoogleCadvisor\n\n  \n  rules:\n\n    # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.\n    - alert: ContainerKilled\n      expr: 'time() - container_last_seen > 60'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container killed (instance {{ $labels.instance }})\n        description: \"A container has disappeared\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.\n    - alert: ContainerAbsent\n      expr: 'absent(container_last_seen)'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container absent (instance {{ $labels.instance }})\n        description: \"A container is absent for 5 min\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Only fires for containers with explicit CPU limits. Containers without limits have cpu_quota=0, which is filtered out by the guard.\n    - alert: ContainerHighCpuUtilization\n      expr: '(sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) * 100) > 80 and sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container High CPU utilization (instance {{ $labels.instance }})\n        description: \"Container CPU utilization is above 80% (current: {{ $value | printf \\\"%.2f\\\" }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d\n    - alert: ContainerHighMemoryUsage\n      expr: '(sum(container_memory_working_set_bytes{name!=\"\"}) BY 
(instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container High Memory usage (instance {{ $labels.instance }})\n        description: \"Container Memory usage is above 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ContainerVolumeUsage\n      expr: '(1 - (sum(container_fs_inodes_free{name!=\"\"}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80 and sum(container_fs_inodes_total) BY (instance) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container Volume usage (instance {{ $labels.instance }})\n        description: \"Container Volume usage is above 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ContainerHighThrottleRate\n      expr: 'sum(rate(container_cpu_cfs_throttled_periods_total{container!=\"\"}[5m])) by (container, pod, namespace) / sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 ) and sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Container high throttle rate (instance {{ $labels.instance }})\n        description: \"Container is being throttled ({{ $value | humanizePercentage }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ContainerHighLowChangeCpuUsage\n      expr: '(abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[1m] offset 1m)) * 100)) or abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[5m] offset 1m)) * 100))) > 25'\n      for: 0m\n      
labels:\n        severity: info\n      annotations:\n        summary: Container high low change CPU usage (instance {{ $labels.instance }})\n        description: \"This alert rule monitors the absolute change in CPU usage within a time window and triggers an alert when the change exceeds 25%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ContainerLowCpuUtilization\n      expr: '(sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=\"\"}/container_spec_cpu_period{container!=\"\"}) by (pod, container) * 100) < 20'\n      for: 7d\n      labels:\n        severity: info\n      annotations:\n        summary: Container Low CPU utilization (instance {{ $labels.instance }})\n        description: \"Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU. (current: {{ $value | printf \\\"%.2f\\\" }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ContainerLowMemoryUsage\n      expr: '(sum(container_memory_working_set_bytes{name!=\"\"}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20'\n      for: 7d\n      labels:\n        severity: info\n      annotations:\n        summary: Container Low Memory usage (instance {{ $labels.instance }})\n        description: \"Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/ebpf/ebpf-exporter.yml",
    "content": "groups:\n\n- name: EbpfExporter\n\n  \n  rules:\n\n    # The exporter uses loose attachment: if a program fails to load (missing BTF, kernel incompatibility), it sets this metric to 0 and continues running.\n    - alert: EbpfExporterProgramNotAttached\n      expr: 'ebpf_exporter_ebpf_program_attached == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: eBPF exporter program not attached (instance {{ $labels.instance }})\n        description: \"eBPF program {{ $labels.id }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EbpfExporterDecoderErrors\n      expr: 'rate(ebpf_exporter_decoder_errors_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: eBPF exporter decoder errors (instance {{ $labels.instance }})\n        description: \"eBPF exporter is experiencing decoder errors for config {{ $labels.config }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EbpfExporterNoEnabledConfigs\n      expr: 'ebpf_exporter_enabled_configs == 0 or absent(ebpf_exporter_enabled_configs)'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: eBPF exporter no enabled configs (instance {{ $labels.instance }})\n        description: \"eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/elasticsearch/prometheus-community-elasticsearch-exporter.yml",
    "content": "groups:\n\n- name: PrometheusCommunityElasticsearchExporter\n\n  \n  rules:\n\n    - alert: ElasticsearchHeapUsageTooHigh\n      expr: '(elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"}) * 100 > 90 and elasticsearch_jvm_memory_max_bytes{area=\"heap\"} > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})\n        description: \"The heap usage is over 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchHeapUsageWarning\n      expr: '(elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"}) * 100 > 80 and elasticsearch_jvm_memory_max_bytes{area=\"heap\"} > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }})\n        description: \"The heap usage is over 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchDiskOutOfSpace\n      expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 and elasticsearch_filesystem_data_size_bytes > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch disk out of space (instance {{ $labels.instance }})\n        description: \"The disk usage is over 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchDiskSpaceLow\n      expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 and elasticsearch_filesystem_data_size_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch disk space low (instance {{ $labels.instance }})\n        description: \"The disk usage is over 80%\\n  VALUE = 
{{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchClusterRed\n      expr: 'elasticsearch_cluster_health_status{color=\"red\"} == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})\n        description: \"Elastic Cluster Red status\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchClusterYellow\n      expr: 'elasticsearch_cluster_health_status{color=\"yellow\"} == 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})\n        description: \"Elastic Cluster Yellow status\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: ElasticsearchHealthyNodes\n      expr: 'elasticsearch_cluster_health_number_of_nodes < 3'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})\n        description: \"Missing node in Elasticsearch cluster\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: ElasticsearchHealthyDataNodes\n      expr: 'elasticsearch_cluster_health_number_of_data_nodes < 3'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }})\n        description: \"Missing data node in Elasticsearch cluster\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchRelocatingShards\n      expr: 'elasticsearch_cluster_health_relocating_shards > 0'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Elasticsearch relocating shards (instance {{ $labels.instance }})\n        description: 
\"Elasticsearch is relocating shards\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchRelocatingShardsTooLong\n      expr: 'elasticsearch_cluster_health_relocating_shards > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }})\n        description: \"Elasticsearch has been relocating shards for 15min\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchInitializingShards\n      expr: 'elasticsearch_cluster_health_initializing_shards > 0'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Elasticsearch initializing shards (instance {{ $labels.instance }})\n        description: \"Elasticsearch is initializing shards\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchInitializingShardsTooLong\n      expr: 'elasticsearch_cluster_health_initializing_shards > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }})\n        description: \"Elasticsearch has been initializing shards for 15 min\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchUnassignedShards\n      expr: 'elasticsearch_cluster_health_unassigned_shards > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})\n        description: \"Elasticsearch has unassigned shards\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchPendingTasks\n      expr: 'elasticsearch_cluster_health_number_of_pending_tasks > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch pending tasks (instance {{ $labels.instance }})\n        
description: \"Elasticsearch has pending tasks. Cluster works slowly.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchNoNewDocuments\n      expr: 'increase(elasticsearch_indices_indexing_index_total{es_data_node=\"true\"}[10m]) < 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch no new documents (instance {{ $labels.instance }})\n        description: \"No new documents for 10 min!\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchHighIndexingLatency\n      expr: 'rate(elasticsearch_indices_indexing_index_time_seconds_total[1m]) / rate(elasticsearch_indices_indexing_index_total[1m]) > 0.0005 and rate(elasticsearch_indices_indexing_index_total[1m]) > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch High Indexing Latency (instance {{ $labels.instance }})\n        description: \"The indexing latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchHighIndexingRate\n      expr: 'sum(rate(elasticsearch_indices_indexing_index_total[1m]))> 10000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch High Indexing Rate (instance {{ $labels.instance }})\n        description: \"The indexing rate on Elasticsearch cluster is higher than the threshold.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ElasticsearchHighQueryRate\n      expr: 'sum(rate(elasticsearch_indices_search_query_total[1m])) > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch High Query Rate (instance {{ $labels.instance }})\n        description: \"The query rate on Elasticsearch cluster is higher than the threshold.\\n  VALUE = {{ $value }}\\n  LABELS = {{ 
$labels }}\"\n\n    - alert: ElasticsearchHighQueryLatency\n      expr: 'rate(elasticsearch_indices_search_query_time_seconds[1m]) / rate(elasticsearch_indices_search_query_total[1m]) > 1 and rate(elasticsearch_indices_search_query_total[1m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Elasticsearch High Query Latency (instance {{ $labels.instance }})\n        description: \"The query latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/envoy/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: EnvoyServerNotLive\n      expr: 'envoy_server_live != 1'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy server not live (instance {{ $labels.instance }})\n        description: \"Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighMemoryUsage\n      expr: 'envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 and envoy_server_memory_heap_size > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy high memory usage (instance {{ $labels.instance }})\n        description: \"Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighDownstreamHttp5xxErrorRate\n      expr: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class=\"5\"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy high downstream HTTP 5xx error rate (instance {{ $labels.instance }})\n        description: \"More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf \\\"%.1f\\\" }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighDownstreamHttp4xxErrorRate\n      expr: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class=\"4\"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: 
Envoy high downstream HTTP 4xx error rate (instance {{ $labels.instance }})\n        description: \"More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \\\"%.1f\\\" }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyDownstreamConnectionsOverflowing\n      expr: 'increase(envoy_listener_downstream_cx_overflow[5m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy downstream connections overflowing (instance {{ $labels.instance }})\n        description: \"Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyClusterMembershipEmpty\n      expr: 'envoy_cluster_membership_healthy == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy cluster membership empty (instance {{ $labels.instance }})\n        description: \"Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyClusterMembershipDegraded\n      expr: 'envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy cluster membership degraded (instance {{ $labels.instance }})\n        description: \"More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighClusterUpstreamConnectionFailures\n      expr: 'increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy high cluster upstream connection failures 
(instance {{ $labels.instance }})\n        description: \"High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighClusterUpstreamRequestTimeoutRate\n      expr: 'rate(envoy_cluster_upstream_rq_timeout[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy high cluster upstream request timeout rate (instance {{ $labels.instance }})\n        description: \"More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighClusterUpstream5xxErrorRate\n      expr: 'rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class=\"5\"}[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy high cluster upstream 5xx error rate (instance {{ $labels.instance }})\n        description: \"More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyClusterHealthCheckFailures\n      expr: 'increase(envoy_cluster_health_check_failure[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy cluster health check failures (instance {{ $labels.instance }})\n        description: \"Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
EnvoyClusterOutlierDetectionEjectionsActive\n      expr: 'envoy_cluster_outlier_detection_ejections_active > 0'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Envoy cluster outlier detection ejections active (instance {{ $labels.instance }})\n        description: \"There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyListenerSslConnectionErrors\n      expr: 'increase(envoy_listener_ssl_connection_error[5m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy listener SSL connection errors (instance {{ $labels.instance }})\n        description: \"Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyGlobalDownstreamConnectionsOverflowing\n      expr: 'increase(envoy_listener_downstream_global_cx_overflow[5m]) > 5'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy global downstream connections overflowing (instance {{ $labels.instance }})\n        description: \"Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoySslCertificateExpiringSoon\n      expr: 'envoy_server_days_until_first_cert_expiring < 7'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy SSL certificate expiring soon (instance {{ $labels.instance }})\n        description: \"SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoySslCertificateExpired\n      expr: 
'envoy_server_days_until_first_cert_expiring < 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy SSL certificate expired (instance {{ $labels.instance }})\n        description: \"SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyClusterCircuitBreakerTripped\n      expr: 'envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy cluster circuit breaker tripped (instance {{ $labels.instance }})\n        description: \"Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyNoHealthyUpstream\n      expr: 'increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Envoy no healthy upstream (instance {{ $labels.instance }})\n        description: \"Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EnvoyHighDownstreamRequestTimeoutRate\n      expr: 'increase(envoy_http_downstream_rq_timeout[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Envoy high downstream request timeout rate (instance {{ $labels.instance }})\n        description: \"Downstream requests are timing out on {{ $labels.instance }} ({{ $value }} in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/etcd/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: EtcdInsufficientMembers\n      expr: 'count(etcd_server_id) % 2 == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Etcd insufficient Members (instance {{ $labels.instance }})\n        description: \"Etcd cluster should have an odd number of members\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdNoLeader\n      expr: 'etcd_server_has_leader == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Etcd no Leader (instance {{ $labels.instance }})\n        description: \"Etcd cluster has no leader\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighNumberOfLeaderChanges\n      expr: 'increase(etcd_server_leader_changes_seen_total[10m]) > 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high number of leader changes (instance {{ $labels.instance }})\n        description: \"Etcd leader changed {{ $value }} times during 10 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Filters to actual error codes. 
grpc_code!=\"OK\" includes benign codes like NotFound, AlreadyExists, and Cancelled.\n    - alert: EtcdHighNumberOfFailedGrpcRequestsWarning\n      expr: 'sum(rate(grpc_server_handled_total{grpc_code=~\"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown\"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high number of failed GRPC requests warning (instance {{ $labels.instance }})\n        description: \"More than 1% GRPC request failure detected in Etcd\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Filters to actual error codes. grpc_code!=\"OK\" includes benign codes like NotFound, AlreadyExists, and Cancelled.\n    - alert: EtcdHighNumberOfFailedGrpcRequestsCritical\n      expr: 'sum(rate(grpc_server_handled_total{grpc_code=~\"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown\"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Etcd high number of failed GRPC requests critical (instance {{ $labels.instance }})\n        description: \"More than 5% GRPC request failure detected in Etcd\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdGrpcRequestsSlow\n      expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type=\"unary\"}[1m])) by (grpc_service, grpc_method, le)) > 0.15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd GRPC requests slow (instance {{ $labels.instance }})\n        description: \"GRPC requests 
slowing down, 99th percentile is over 0.15s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighNumberOfFailedHttpRequestsWarning\n      expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high number of failed HTTP requests warning (instance {{ $labels.instance }})\n        description: \"More than 1% HTTP failure detected in Etcd\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighNumberOfFailedHttpRequestsCritical\n      expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Etcd high number of failed HTTP requests critical (instance {{ $labels.instance }})\n        description: \"More than 5% HTTP failure detected in Etcd\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHttpRequestsSlow\n      expr: 'histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd HTTP requests slow (instance {{ $labels.instance }})\n        description: \"HTTP requests slowing down, 99th percentile is over 0.15s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdMemberCommunicationSlow\n      expr: 'histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd member communication slow (instance {{ $labels.instance }})\n        description: \"Etcd member communication slowing down, 99th 
percentile is over 0.15s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighNumberOfFailedProposals\n      expr: 'increase(etcd_server_proposals_failed_total[1h]) > 5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high number of failed proposals (instance {{ $labels.instance }})\n        description: \"Etcd server got {{ $value }} failed proposals in the past hour\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighFsyncDurations\n      expr: 'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high fsync durations (instance {{ $labels.instance }})\n        description: \"Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: EtcdHighCommitDurations\n      expr: 'histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Etcd high commit durations (instance {{ $labels.instance }})\n        description: \"Etcd commit duration increasing, 99th percentile is over 0.25s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/fluxcd/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: FluxKustomizationFailure\n      expr: 'gotk_resource_info{ready=\"False\", customresource_kind=\"Kustomization\"} > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flux Kustomization Failure (instance {{ $labels.instance }})\n        description: \"The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FluxHelmreleaseFailure\n      expr: 'gotk_resource_info{ready=\"False\", customresource_kind=\"HelmRelease\"} > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flux HelmRelease Failure (instance {{ $labels.instance }})\n        description: \"The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FluxSourceIssue\n      expr: 'gotk_resource_info{ready=\"False\", customresource_kind=~\"GitRepository|HelmRepository|Bucket|OCIRepository\"} > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flux Source Issue (instance {{ $labels.instance }})\n        description: \"Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FluxImageIssue\n      expr: 'gotk_resource_info{ready=\"False\", customresource_kind=~\"ImagePolicy|ImageRepository|ImageUpdateAutomation\"} > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Flux Image Issue (instance {{ $labels.instance }})\n        description: \"The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels 
}}\"\n"
  },
  {
    "path": "dist/rules/freeswitch/znerol-freeswitch-exporter.yml",
    "content": "groups:\n\n- name: ZnerolFreeswitchExporter\n\n  \n  rules:\n\n    - alert: FreeswitchDown\n      expr: 'freeswitch_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Freeswitch down (instance {{ $labels.instance }})\n        description: \"Freeswitch is unresponsive\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FreeswitchSessionsWarning\n      expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 80 and freeswitch_session_limit > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Freeswitch Sessions Warning (instance {{ $labels.instance }})\n        description: \"High sessions usage on {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: FreeswitchSessionsCritical\n      expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 90 and freeswitch_session_limit > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Freeswitch Sessions Critical (instance {{ $labels.instance }})\n        description: \"High sessions usage on {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/gitlab-ci/gitaly.yml",
    "content": "groups:\n\n- name: Gitaly\n\n  \n  rules:\n\n    # grpc_code!=\"OK\" includes non-error codes like NotFound, AlreadyExists. Consider filtering to specific error codes for less noise.\n    - alert: GitlabGitalyHighGrpcErrorRate\n      expr: 'sum(rate(grpc_server_handled_total{job=\"gitaly\",grpc_code!=\"OK\"}[5m])) / sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) * 100 > 5 and sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Gitaly high gRPC error rate (instance {{ $labels.instance }})\n        description: \"Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # ResourceExhausted errors from Gitaly mean Git operations are being rejected due to\n    # concurrency limits. This directly impacts users trying to push, pull, or clone.\n    # This alert is derived from the GitLab Omnibus default rules.\n    - alert: GitlabGitalyResourceExhausted\n      expr: 'sum(rate(grpc_server_handled_total{job=\"gitaly\",grpc_code=\"ResourceExhausted\"}[5m])) / sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) * 100 > 1 and sum(rate(grpc_server_handled_total{job=\"gitaly\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: GitLab Gitaly resource exhausted (instance {{ $labels.instance }})\n        description: \"Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabGitalyHighRpcLatency\n      expr: 'histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job=\"gitaly\",grpc_type=\"unary\"}[5m])) by (le)) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Gitaly high RPC latency (instance {{ $labels.instance 
}})\n        description: \"Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabGitalyCpuThrottled\n      expr: 'rate(gitaly_cgroup_cpu_cfs_throttled_seconds_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Gitaly CPU throttled (instance {{ $labels.instance }})\n        description: \"Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabGitalyAuthenticationFailures\n      expr: 'increase(gitaly_authentications_total{status=\"failed\"}[5m]) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Gitaly authentication failures (instance {{ $labels.instance }})\n        description: \"Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # When the circuit breaker trips to \"open\" state, Git operations (push, pull, clone) will fail.\n    # Check Gitaly service health and logs.\n    - alert: GitlabGitalyCircuitBreakerTripped\n      expr: 'increase(gitaly_circuit_breaker_transitions_total{to_state=\"open\"}[5m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: GitLab Gitaly circuit breaker tripped (instance {{ $labels.instance }})\n        description: \"Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/gitlab-ci/gitlab-built-in-exporter.yml",
    "content": "groups:\n\n- name: GitlabBuiltInExporter\n\n  \n  rules:\n\n    # Queued connections indicate Puma workers are saturated.\n    # Consider increasing puma['worker_processes'] or puma['max_threads'] in gitlab.rb.\n    - alert: GitlabPumaHighQueuedConnections\n      expr: 'puma_queued_connections > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Puma high queued connections (instance {{ $labels.instance }})\n        description: \"GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabPumaNoAvailablePoolCapacity\n      expr: 'puma_pool_capacity == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: GitLab Puma no available pool capacity (instance {{ $labels.instance }})\n        description: \"GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. 
All threads are busy.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabPumaWorkersNotRunning\n      expr: 'puma_running_workers < puma_workers'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Puma workers not running (instance {{ $labels.instance }})\n        description: \"GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is 5% of all requests returning server errors.\n    # Check GitLab logs at /var/log/gitlab/ for root cause.\n    - alert: GitlabHighHttpErrorRate\n      expr: 'sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 and sum(rate(http_requests_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: GitLab high HTTP error rate (instance {{ $labels.instance }})\n        description: \"GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 10s may need adjustment based on your instance size and workload.\n    - alert: GitlabHighHttpRequestLatency\n      expr: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab high HTTP request latency (instance {{ $labels.instance }})\n        description: \"GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.\n    # A sustained failure rate indicates background processing issues.\n    - alert: GitlabSidekiqJobsFailing\n      expr: 'rate(sidekiq_jobs_failed_total[5m]) > 0.1'\n      for: 10m\n      labels:\n        
severity: warning\n      annotations:\n        summary: GitLab Sidekiq jobs failing (instance {{ $labels.instance }})\n        description: \"GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # When running jobs approach the concurrency limit, new jobs will queue up.\n    # Consider scaling Sidekiq workers or increasing concurrency.\n    - alert: GitlabSidekiqQueueTooLarge\n      expr: 'sum(sidekiq_running_jobs) >= sum(sidekiq_concurrency) * 0.9'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Sidekiq queue too large (instance {{ $labels.instance }})\n        description: \"GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.\n    - alert: GitlabSidekiqHighJobCompletionTime\n      expr: 'histogram_quantile(0.95, sum(rate(sidekiq_jobs_completion_seconds_bucket[5m])) by (le, worker)) > 300'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Sidekiq high job completion time (instance {{ $labels.instance }})\n        description: \"GitLab Sidekiq job p95 completion time on {{ $labels.instance }} is above 5 minutes ({{ $value | humanizeDuration }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.\n    # High queue latency means jobs are stuck waiting. 
Check Sidekiq concurrency and queue sizes.\n    - alert: GitlabSidekiqHighQueueLatency\n      expr: 'histogram_quantile(0.95, sum(rate(sidekiq_jobs_queue_duration_seconds_bucket[5m])) by (le)) > 60'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Sidekiq high queue latency (instance {{ $labels.instance }})\n        description: \"GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # When the pool is near saturation, requests may block waiting for a connection.\n    # Increase db_pool_size in gitlab.rb or investigate slow queries.\n    - alert: GitlabDatabaseConnectionPoolSaturation\n      expr: 'gitlab_database_connection_pool_busy / gitlab_database_connection_pool_size * 100 > 90 and gitlab_database_connection_pool_size > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab database connection pool saturation (instance {{ $labels.instance }})\n        description: \"GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabDatabaseConnectionPoolDeadConnections\n      expr: 'gitlab_database_connection_pool_dead > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab database connection pool dead connections (instance {{ $labels.instance }})\n        description: \"GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabDatabaseConnectionPoolWaiting\n      expr: 'gitlab_database_connection_pool_waiting > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab database connection pool waiting (instance 
{{ $labels.instance }})\n        description: \"GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabCiPipelineCreationSlow\n      expr: 'histogram_quantile(0.95, sum(rate(gitlab_ci_pipeline_creation_duration_seconds_bucket[5m])) by (le)) > 30'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab CI pipeline creation slow (instance {{ $labels.instance }})\n        description: \"GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This metric may not exist in all GitLab versions. Verify against your GitLab installation.\n    - alert: GitlabCiPipelineFailuresIncreasing\n      expr: 'rate(gitlab_ci_pipeline_failure_reasons[5m]) > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab CI pipeline failures increasing (instance {{ $labels.instance }})\n        description: \"GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Frequent runner auth failures may indicate expired tokens or misconfigured runners.\n    - alert: GitlabCiRunnerAuthenticationFailures\n      expr: 'increase(gitlab_ci_runner_authentication_failure_total[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab CI runner authentication failures (instance {{ $labels.instance }})\n        description: \"GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 2GB may need adjustment based on your instance size.\n    # High memory usage can lead to OOM kills and service disruptions.\n    - alert: 
GitlabHighMemoryUsage\n      expr: 'process_resident_memory_bytes{job=~\".*gitlab.*\"} > 2e+9'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab high memory usage (instance {{ $labels.instance }})\n        description: \"GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Heap fragmentation above 50% means a significant amount of memory is wasted.\n    # A Puma worker restart may help reclaim memory.\n    - alert: GitlabRubyHeapFragmentation\n      expr: 'ruby_gc_stat_ext_heap_fragmentation{job=~\".*gitlab.*\"} > 0.5'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Ruby heap fragmentation (instance {{ $labels.instance }})\n        description: \"GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabRackUncaughtErrors\n      expr: 'rate(rack_uncaught_errors_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab rack uncaught errors (instance {{ $labels.instance }})\n        description: \"GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This may happen during a rolling deployment. 
If it persists, investigate incomplete upgrades.\n    - alert: GitlabVersionMismatch\n      expr: 'count(count by (version) (gitlab_build_info)) > 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab version mismatch (instance {{ $labels.instance }})\n        description: \"Multiple GitLab versions are running across the fleet.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabHighFileDescriptorUsage\n      expr: 'process_open_fds{job=~\".*gitlab.*\"} / process_max_fds * 100 > 80 and process_max_fds > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab high file descriptor usage (instance {{ $labels.instance }})\n        description: \"GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabRubyThreadsSaturated\n      expr: 'sum by (instance) (gitlab_ruby_threads_running_threads) > on(instance) gitlab_ruby_threads_max_expected_threads * 1.5'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Ruby threads saturated (instance {{ $labels.instance }})\n        description: \"GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/gitlab-ci/workhorse.yml",
    "content": "groups:\n\n- name: Workhorse\n\n  \n  rules:\n\n    # Workhorse sits in front of Puma and handles Git HTTP, file uploads, and proxying.\n    # Threshold from GitLab Omnibus default rules: 10% for high-traffic instances.\n    - alert: GitlabWorkhorseHighErrorRate\n      expr: 'sum(rate(gitlab_workhorse_http_request_duration_seconds_count{code=~\"5..\"}[5m])) / sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) * 100 > 10 and sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: GitLab Workhorse high error rate (instance {{ $labels.instance }})\n        description: \"GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GitlabWorkhorseHighLatency\n      expr: 'histogram_quantile(0.95, sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket[5m])) by (le)) > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Workhorse high latency (instance {{ $labels.instance }})\n        description: \"GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 100 may need adjustment based on instance size.\n    - alert: GitlabWorkhorseHighIn-flightRequests\n      expr: 'gitlab_workhorse_http_in_flight_requests > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: GitLab Workhorse high in-flight requests (instance {{ $labels.instance }})\n        description: \"GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/golang/golang-exporter.yml",
    "content": "groups:\n\n- name: GolangExporter\n\n  \n  rules:\n\n    # Threshold is a rough default. High-concurrency servers may legitimately run thousands of goroutines. Adjust to match your baseline.\n    - alert: GoGoroutineCountHigh\n      expr: 'go_goroutines > 1000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go goroutine count high (instance {{ $labels.instance }})\n        description: \"Go application has too many goroutines (> 1000), potential goroutine leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # quantile=\"1\" is the maximum observed GC pause in the current summary window, not p99.\n    # A single outlier pause can push this above 1s. The for: 5m ensures the max stays elevated.\n    - alert: GoGcDurationHigh\n      expr: 'go_gc_duration_seconds{quantile=\"1\"} > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go GC duration high (instance {{ $labels.instance }})\n        description: \"Go GC pause duration is too high (max > 1s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # go_memstats_sys_bytes is the total memory obtained from the OS by the Go runtime, not total host memory.\n    # This ratio measures Go-internal memory utilization, not system-level memory pressure.\n    - alert: GoMemoryUsageHigh\n      expr: '(go_memstats_heap_alloc_bytes / go_memstats_sys_bytes) * 100 > 90'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go memory usage high (instance {{ $labels.instance }})\n        description: \"Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is workload-dependent. Applications with heavy CGo or blocking I/O may legitimately use more OS threads. 
Adjust to match your baseline.\n    - alert: GoThreadCountHigh\n      expr: 'go_threads > 500'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go thread count high (instance {{ $labels.instance }})\n        description: \"Go OS thread count is high (> 500), potential blocking syscall or CGo leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. Adjust based on your application's normal object count.\n    - alert: GoHeapObjectsCountHigh\n      expr: 'go_memstats_heap_objects > 10000000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go heap objects count high (instance {{ $labels.instance }})\n        description: \"Go heap has too many live objects (> 10M), high GC pressure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # go_memstats_gc_cpu_fraction is deprecated since Go 1.20 and may return 0 in newer versions.\n    # Consider using runtime/metrics-based alternatives if running Go >= 1.20.\n    - alert: GoGcCpuFractionHigh\n      expr: 'go_memstats_gc_cpu_fraction > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go GC CPU fraction high (instance {{ $labels.instance }})\n        description: \"Go GC is consuming too much CPU (> 5%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GoGoroutineSpike\n      expr: 'deriv(go_goroutines[5m]) > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go goroutine spike (instance {{ $labels.instance }})\n        description: \"Go goroutine count is growing rapidly\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GoHeapFragmentation\n      expr: 'go_memstats_heap_idle_bytes / go_memstats_heap_sys_bytes > 0.9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go heap fragmentation (instance {{ 
$labels.instance }})\n        description: \"Go heap has high idle ratio (> 90%), indicating memory fragmentation\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GoMemoryLeak\n      expr: 'rate(go_memstats_alloc_bytes_total[5m]) > 1e9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go memory leak (instance {{ $labels.instance }})\n        description: \"Go application has sustained high allocation rate (> 1GB/s), potential memory leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: GoStackMemoryHigh\n      expr: 'go_memstats_stack_inuse_bytes > 1e9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Go stack memory high (instance {{ $labels.instance }})\n        description: \"Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/google-cloud-stackdriver/stackdriver-exporter.yml",
    "content": "groups:\n\n- name: StackdriverExporter\n\n  # Self-monitoring metrics use the stackdriver_monitoring_* prefix.\n  # All self-monitoring metrics include a project_id label.\n  \n  rules:\n\n    - alert: StackdriverExporterScrapeError\n      expr: 'stackdriver_monitoring_last_scrape_error > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Stackdriver exporter scrape error (instance {{ $labels.instance }})\n        description: \"Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StackdriverExporterSlowScrape\n      expr: 'stackdriver_monitoring_last_scrape_duration_seconds > 300'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Stackdriver exporter slow scrape (instance {{ $labels.instance }})\n        description: \"Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StackdriverExporterScrapeErrorsIncreasing\n      expr: 'increase(stackdriver_monitoring_scrape_errors_total[15m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Stackdriver exporter scrape errors increasing (instance {{ $labels.instance }})\n        description: \"Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StackdriverExporterHighApiCalls\n      expr: 'rate(stackdriver_monitoring_api_calls_total[5m]) * 60 > 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Stackdriver exporter high API calls (instance {{ $labels.instance }})\n        description: \"Stackdriver exporter is making {{ $value }} 
API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StackdriverExporterScrapeStale\n      expr: 'time() - stackdriver_monitoring_last_scrape_timestamp > 600'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Stackdriver exporter scrape stale (instance {{ $labels.instance }})\n        description: \"Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/grafana-alloy/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: GrafanaAlloyServiceDown\n      expr: 'count by (instance) (alloy_build_info offset 2h) unless count by (instance) (alloy_build_info)'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Grafana Alloy service down (instance {{ $labels.instance }})\n        description: \"Alloy on instance {{ $labels.instance }} is not responding or has stopped running.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/grafana-mimir/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  # Mimir uses the `cortex_` metric prefix for backward compatibility with Cortex. This is intentional and expected.\n  \n  rules:\n\n    - alert: MimirIngesterUnhealthy\n      expr: 'min by (job) (cortex_ring_members{state=\"Unhealthy\", name=\"ingester\"}) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester unhealthy (instance {{ $labels.instance }})\n        description: \"Mimir has {{ $value }} unhealthy ingester(s) in the ring.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRequestErrors\n      expr: '100 * sum by (job, route) (rate(cortex_request_duration_seconds_count{status_code=~\"5..\", route!~\"ready|debug_pprof\"}[5m])) / sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~\"ready|debug_pprof\"}[5m])) > 1 and sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~\"ready|debug_pprof\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir request errors (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \\\"%.2f\\\" $value }}% errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirInconsistentRuntimeConfig\n      expr: 'count(count by (job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir inconsistent runtime config (instance {{ $labels.instance }})\n        description: \"An inconsistent runtime config file is used across Mimir instances.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirBadRuntimeConfig\n      expr: 'sum by (job) (cortex_runtime_config_last_reload_successful == 0) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        
summary: Mimir bad runtime config (instance {{ $labels.instance }})\n        description: \"{{ $labels.job }} failed to reload runtime config.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirSchedulerQueriesStuck\n      expr: 'sum by (job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0'\n      for: 7m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir scheduler queries stuck (instance {{ $labels.instance }})\n        description: \"There are {{ $value }} queued up queries in {{ $labels.job }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCacheRequestErrors\n      expr: '(sum by (name, operation, job) (rate(thanos_cache_operation_failures_total[5m])) / sum by (name, operation, job) (rate(thanos_cache_operations_total[5m]))) * 100 > 5 and sum by (name, operation, job) (rate(thanos_cache_operations_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir cache request errors (instance {{ $labels.instance }})\n        description: \"Mimir cache {{ $labels.name }} is experiencing {{ printf \\\"%.2f\\\" $value }}% errors for {{ $labels.operation }} operation.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirKvStoreFailure\n      expr: '(sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~\"2..\"}[5m])) / sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m]))) == 1 and sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir KV store failure (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirMemoryMapAreasTooHigh\n      expr: 
'process_memory_map_areas{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} / process_memory_map_areas_limit{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} * 100 > 80 and process_memory_map_areas_limit{job=~\".*(ingester|cortex|mimir|store-gateway).*\"} > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir memory map areas too high (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.job }} is using {{ printf \\\"%.0f\\\" $value }}% of its memory map area limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterInstanceHasNoTenants\n      expr: '(cortex_ingester_memory_users == 0) and on (instance) (cortex_ingester_memory_users offset 1h > 0)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ingester instance has no tenants (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} has no tenants assigned.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRulerInstanceHasNoRuleGroups\n      expr: '(cortex_ruler_managers_total == 0) and on (instance) (cortex_ruler_managers_total offset 1h > 0)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ruler instance has no rule groups (instance {{ $labels.instance }})\n        description: \"Mimir ruler {{ $labels.instance }} has no rule groups assigned.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngestedDataTooFarInTheFuture\n      expr: 'max by (job) (cortex_ingester_tsdb_head_max_timestamp_seconds - time() and cortex_ingester_tsdb_head_max_timestamp_seconds > 0) > 3600'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ingested data too far in the future (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.job }} has ingested samples 
with timestamps more than 1 hour in the future.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: MimirStoreGatewayTooManyFailedOperations\n      expr: 'sum by (job) (rate(thanos_objstore_bucket_operation_failures_total[5m])) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir store gateway too many failed operations (instance {{ $labels.instance }})\n        description: \"Mimir store-gateway {{ $labels.job }} bucket operations are failing ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRingMembersMismatch\n      expr: 'max by (name, job) (sum by (name, job, instance) (cortex_ring_members)) != min by (name, job) (sum by (name, job, instance) (cortex_ring_members))'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ring members mismatch (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.name }} ring has inconsistent member counts across instances.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterReachingSeriesLimitWarning\n      expr: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_series\"} * 100 > 80) and cortex_ingester_instance_limits{limit=\"max_series\"} > 0'\n      for: 3h\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ingester reaching series limit warning (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} has reached {{ printf \\\"%.0f\\\" $value }}% of its series limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterReachingSeriesLimitCritical\n      expr: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_series\"} * 100 > 90) and 
cortex_ingester_instance_limits{limit=\"max_series\"} > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester reaching series limit critical (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} has reached {{ printf \\\"%.0f\\\" $value }}% of its series limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterReachingTenantsLimitWarning\n      expr: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_tenants\"} * 100 > 70) and cortex_ingester_instance_limits{limit=\"max_tenants\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ingester reaching tenants limit warning (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} has reached {{ printf \\\"%.0f\\\" $value }}% of its tenants limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterReachingTenantsLimitCritical\n      expr: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit=\"max_tenants\"} * 100 > 80) and cortex_ingester_instance_limits{limit=\"max_tenants\"} > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester reaching tenants limit critical (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} has reached {{ printf \\\"%.0f\\\" $value }}% of its tenants limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirReachingTcpConnectionsLimit\n      expr: 'cortex_tcp_connections / cortex_tcp_connections_limit * 100 > 80 and cortex_tcp_connections_limit > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir reaching TCP connections limit (instance {{ $labels.instance }})\n        description: 
\"Mimir instance {{ $labels.instance }} is using {{ printf \\\"%.0f\\\" $value }}% of its TCP connections limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirDistributorInflightRequestsHigh\n      expr: '(cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit=\"max_inflight_push_requests\"} * 100 > 80) and cortex_distributor_instance_limits{limit=\"max_inflight_push_requests\"} > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir distributor inflight requests high (instance {{ $labels.instance }})\n        description: \"Mimir distributor {{ $labels.instance }} is using {{ printf \\\"%.0f\\\" $value }}% of its inflight push requests limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbHeadCompactionFailed\n      expr: 'rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester TSDB head compaction failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to compact TSDB head ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbHeadTruncationFailed\n      expr: 'rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester TSDB head truncation failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbCheckpointCreationFailed\n      expr: 'rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n  
    annotations:\n        summary: Mimir ingester TSDB checkpoint creation failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbCheckpointDeletionFailed\n      expr: 'rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester TSDB checkpoint deletion failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbWalTruncationFailed\n      expr: 'rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ingester TSDB WAL truncation failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirIngesterTsdbWalWritesFailed\n      expr: 'rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0'\n      for: 3m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ingester TSDB WAL writes failed (instance {{ $labels.instance }})\n        description: \"Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold aligned with official Mimir mixin (30 minutes).\n    - alert: MimirStoreGatewayHasNotSyncedBucket\n      expr: '(time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component=\"store-gateway\"} > 1800) and 
cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component=\"store-gateway\"} > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir store gateway has not synced bucket (instance {{ $labels.instance }})\n        description: \"Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 30 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirStoreGatewayNoSyncedTenants\n      expr: '(min by (instance, job) (cortex_bucket_stores_tenants_synced{component=\"store-gateway\"}) == 0) and on (instance) (cortex_bucket_stores_tenants_synced{component=\"store-gateway\"} offset 1h > 0)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir store gateway no synced tenants (instance {{ $labels.instance }})\n        description: \"Mimir store-gateway {{ $labels.instance }} has no synced tenants.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirBucketIndexNotUpdated\n      expr: 'min by (user, job) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir bucket index not updated (instance {{ $labels.instance }})\n        description: \"Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCompactorNotCleaningUpBlocks\n      expr: '(time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 21600) and cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 0'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir compactor not cleaning up blocks (instance {{ $labels.instance }})\n        description: \"Mimir compactor {{ $labels.instance }} has not cleaned up blocks in 
the last 6 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCompactorNotRunningCompaction\n      expr: '(time() - cortex_compactor_last_successful_run_timestamp_seconds > 86400) and cortex_compactor_last_successful_run_timestamp_seconds > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir compactor not running compaction (instance {{ $labels.instance }})\n        description: \"Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCompactorHasConsecutiveFailures\n      expr: 'increase(cortex_compactor_runs_failed_total{reason!=\"shutdown\"}[2h]) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir compactor has consecutive failures (instance {{ $labels.instance }})\n        description: \"Mimir compactor {{ $labels.instance }} has had {{ $value }} compaction failures in the last 2 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCompactorHasRunOutOfDiskSpace\n      expr: 'increase(cortex_compactor_disk_out_of_space_errors_total[24h]) >= 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir compactor has run out of disk space (instance {{ $labels.instance }})\n        description: \"Mimir compactor {{ $labels.instance }} has run out of disk space.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirCompactorHasNotUploadedBlocks\n      expr: '(time() - thanos_objstore_bucket_last_successful_upload_time{component=\"compactor\"} > 86400) and thanos_objstore_bucket_last_successful_upload_time{component=\"compactor\"} > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir compactor has not uploaded blocks (instance {{ $labels.instance }})\n        description: \"Mimir 
compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Using 24h window per official mixin — compaction skips are rare events.\n    - alert: MimirCompactorSkippedBlocks\n      expr: 'increase(cortex_compactor_blocks_marked_for_no_compaction_total[24h]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir compactor skipped blocks (instance {{ $labels.instance }})\n        description: \"Mimir compactor has found {{ $value }} blocks that cannot be compacted (reason {{ $labels.reason }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRulerTooManyFailedPushes\n      expr: '100 * sum by (instance, job) (rate(cortex_ruler_write_requests_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ruler too many failed pushes (instance {{ $labels.instance }})\n        description: \"Mimir ruler {{ $labels.instance }} is failing to push {{ printf \\\"%.2f\\\" $value }}% of write requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRulerTooManyFailedQueries\n      expr: '100 * sum by (instance, job) (rate(cortex_ruler_queries_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ruler too many failed queries (instance {{ $labels.instance }})\n        description: \"Mimir ruler {{ $labels.instance }} is failing {{ printf \\\"%.2f\\\" $value }}% of query evaluations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirRulerMissedEvaluations\n 
     expr: '100 * sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_missed_total[5m])) / sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 1 and sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir ruler missed evaluations (instance {{ $labels.instance }})\n        description: \"Mimir ruler {{ $labels.instance }} is missing {{ printf \\\"%.2f\\\" $value }}% of rule group evaluations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: MimirRulerFailedRingCheck\n      expr: 'sum by (job) (rate(cortex_ruler_ring_check_errors_total[5m])) > 0.05'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir ruler failed ring check (instance {{ $labels.instance }})\n        description: \"Mimir ruler {{ $labels.job }} is failing ring checks ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerSyncConfigsFailing\n      expr: 'rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0'\n      for: 30m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir alertmanager sync configs failing (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} is failing to sync configs ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerRingCheckFailing\n      expr: 'rate(cortex_alertmanager_ring_check_errors_total[5m]) > 0'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir alertmanager ring check failing (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} is failing ring checks ({{ 
$value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerStateMergeFailing\n      expr: 'rate(cortex_alertmanager_partial_state_merges_failed_total[5m]) > 0'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir alertmanager state merge failing (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} is failing to merge state updates ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerReplicationFailing\n      expr: 'rate(cortex_alertmanager_state_replication_failed_total[5m]) > 0'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir alertmanager replication failing (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} is failing to replicate state ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerPersistStateFailing\n      expr: 'rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir alertmanager persist state failing (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} is failing to persist state ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirAlertmanagerInitialSyncFailed\n      expr: 'increase(cortex_alertmanager_state_initial_sync_completed_total{outcome=\"failed\"}[1m]) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir alertmanager initial sync failed (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.job }} failed initial state sync.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
MimirAlertmanagerInstanceHasNoTenants\n      expr: '(cortex_alertmanager_tenants_owned == 0) and on (instance) (cortex_alertmanager_tenants_owned offset 1h > 0)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir alertmanager instance has no tenants (instance {{ $labels.instance }})\n        description: \"Mimir alertmanager {{ $labels.instance }} has no tenants assigned.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirGossipMembersCountTooHigh\n      expr: 'avg(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job) * 1.15 + 10 < max(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job)'\n      for: 20m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir gossip members count too high (instance {{ $labels.instance }})\n        description: \"Mimir gossip cluster has more members than expected.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MimirGossipMembersCountTooLow\n      expr: 'avg(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job) * 0.5 > min(memberlist_client_cluster_members_count{job=~\".*(mimir|cortex).*\"}) by (job)'\n      for: 20m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir gossip members count too low (instance {{ $labels.instance }})\n        description: \"Mimir gossip cluster has fewer members than expected.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # go_threads counts OS threads; a high value may indicate many goroutines blocked in system calls or cgo.\n    - alert: MimirGoThreadsTooHighWarning\n      expr: 'go_threads{job=~\".*(mimir|cortex).*\"} > 5000'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Mimir go threads too high warning (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.instance }} has {{ $value }} Go threads.\\n  VALUE = {{ $value }}\\n  
LABELS = {{ $labels }}\"\n\n    - alert: MimirGoThreadsTooHighCritical\n      expr: 'go_threads{job=~\".*(mimir|cortex).*\"} > 8000'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mimir go threads too high critical (instance {{ $labels.instance }})\n        description: \"Mimir {{ $labels.instance }} has {{ $value }} Go threads.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/grafana-tempo/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: TempoDistributorUnhealthy\n      expr: 'max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"distributor\"}) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Tempo distributor unhealthy (instance {{ $labels.instance }})\n        description: \"Tempo has {{ $value }} unhealthy distributor(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoLiveStoreUnhealthy\n      expr: 'max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"live-store\"}) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo live store unhealthy (instance {{ $labels.instance }})\n        description: \"Tempo has {{ $value }} unhealthy live store(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoMetricsGeneratorUnhealthy\n      expr: 'max by (job) (tempo_ring_members{state=\"Unhealthy\", name=\"metrics-generator\"}) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo metrics generator unhealthy (instance {{ $labels.instance }})\n        description: \"Tempo has {{ $value }} unhealthy metrics generator(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Uses a two-window approach: 1h for historical count and 5m to confirm the issue is ongoing.\n    - alert: TempoCompactionsFailing\n      expr: 'sum by (job) (increase(tempodb_compaction_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_compaction_errors_total[5m])) > 0'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo compactions failing (instance {{ $labels.instance }})\n        description: \"{{ $value }} compactions have failed in the past hour.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoPollsFailing\n      expr: 'sum by (job) 
(increase(tempodb_blocklist_poll_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_poll_errors_total[5m])) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo polls failing (instance {{ $labels.instance }})\n        description: \"{{ $value }} blocklist polls have failed in the past hour.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoTenantIndexFailures\n      expr: 'sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[5m])) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo tenant index failures (instance {{ $labels.instance }})\n        description: \"{{ $value }} tenant index failures in the past hour.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoNoTenantIndexBuilders\n      expr: 'sum by (tenant) (tempodb_blocklist_tenant_index_builder) == 0 and on() max(tempodb_blocklist_length) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo no tenant index builders (instance {{ $labels.instance }})\n        description: \"No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 600s (10 minutes). 
Adjust based on your tenant index build interval.\n    - alert: TempoTenantIndexTooOld\n      expr: 'max by (tenant) (tempodb_blocklist_tenant_index_age_seconds) > 600'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo tenant index too old (instance {{ $labels.instance }})\n        description: \"Tenant index for {{ $labels.tenant }} is {{ $value }}s old.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when the blocklist grows more than 40% over 7 days.\n    - alert: TempoBlockListRisingQuickly\n      expr: '(avg(tempodb_blocklist_length) / avg(tempodb_blocklist_length offset 7d) - 1) * 100 > 40 and avg(tempodb_blocklist_length offset 7d) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo block list rising quickly (instance {{ $labels.instance }})\n        description: \"Tempo blocklist length is up {{ printf \\\"%.0f\\\" $value }}% over the last 7 days. Consider scaling compactors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoBadOverrides\n      expr: 'sum by (job) (tempo_runtime_config_last_reload_successful == 0) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo bad overrides (instance {{ $labels.instance }})\n        description: \"{{ $labels.job }} failed to reload runtime overrides.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoUserConfigurableOverridesReloadFailing\n      expr: 'sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[1h])) > 5 and sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[5m])) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo user configurable overrides reload failing (instance {{ $labels.instance }})\n        description: \"{{ $value }} user-configurable overrides 
reloads have failed in the past hour.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 100 blocks per compactor instance. Adjust based on your environment.\n    - alert: TempoCompactionTooManyOutstandingBlocksWarning\n      expr: 'sum by (instance) (tempodb_compaction_outstanding_blocks) > 100'\n      for: 6h\n      labels:\n        severity: warning\n      annotations:\n        summary: Tempo compaction too many outstanding blocks warning (instance {{ $labels.instance }})\n        description: \"There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Official Tempo mixin normalizes by backend-worker count. Adjust threshold based on your compactor configuration.\n    - alert: TempoCompactionTooManyOutstandingBlocksCritical\n      expr: 'sum by (instance) (tempodb_compaction_outstanding_blocks) > 250'\n      for: 24h\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo compaction too many outstanding blocks critical (instance {{ $labels.instance }})\n        description: \"There are too many outstanding compaction blocks for {{ $labels.instance }}. 
Increase compactor resources immediately.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoDistributorUsageTrackerErrors\n      expr: 'sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0'\n      for: 30m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo distributor usage tracker errors (instance {{ $labels.instance }})\n        description: \"Tempo distributor usage tracker errors for {{ $labels.job }} at {{ $value | humanize }}/s (reason {{ $labels.reason }}).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoMetricsGeneratorProcessorUpdatesFailing\n      expr: 'sum by (job) (increase(tempo_metrics_generator_active_processors_update_failed_total[5m])) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo metrics generator processor updates failing (instance {{ $labels.instance }})\n        description: \"Tempo metrics generator processor updates are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoMetricsGeneratorServiceGraphsDroppingSpans\n      expr: '100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 and sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Tempo metrics generator service graphs dropping spans (instance {{ $labels.instance }})\n        description: \"Tempo metrics generator is dropping {{ printf \\\"%.2f\\\" $value }}% of spans in service graphs for {{ $labels.job }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TempoMetricsGeneratorCollectionsFailing\n      expr: 'sum by (job) 
(increase(tempo_metrics_generator_registry_collections_failed_total[5m])) > 2'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Tempo metrics generator collections failing (instance {{ $labels.instance }})\n        description: \"Tempo metrics generator collections are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Fires when the memcached error rate exceeds 20%. Only relevant if Tempo is configured with memcached caching.\n    - alert: TempoMemcachedErrorsElevated\n      expr: '100 * sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count{status_code=\"500\"}[5m])) / sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 20 and sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Tempo memcached errors elevated (instance {{ $labels.instance }})\n        description: \"Tempo memcached error rate is {{ printf \\\"%.2f\\\" $value }}% for {{ $labels.name }} in {{ $labels.job }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/graph-node/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: ProviderFailedBecauseNet_versionFailed\n      expr: 'eth_rpc_status == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Provider failed because net_version failed (instance {{ $labels.instance }})\n        description: \"Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProviderFailedBecauseGetGenesisFailed\n      expr: 'eth_rpc_status == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Provider failed because get genesis failed (instance {{ $labels.instance }})\n        description: \"Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProviderFailedBecauseNet_versionTimeout\n      expr: 'eth_rpc_status == 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Provider failed because net_version timeout (instance {{ $labels.instance }})\n        description: \"net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProviderFailedBecauseGetGenesisTimeout\n      expr: 'eth_rpc_status == 4'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Provider failed because get genesis timeout (instance {{ $labels.instance }})\n        description: \"Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StoreConnectionSlow\n      expr: 'store_connection_wait_time_ms > 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Store 
connection slow (instance {{ $labels.instance }})\n        description: \"Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: StoreConnectionVerySlow\n      expr: 'store_connection_wait_time_ms > 20'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Store connection very slow (instance {{ $labels.instance }})\n        description: \"Store connection is very slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/hadoop/jmx_exporter.yml",
    "content": "groups:\n\n- name: Jmx_exporter\n\n  \n  rules:\n\n    - alert: HadoopNameNodeDown\n      expr: 'up{job=\"hadoop-namenode\"} == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Hadoop Name Node Down (instance {{ $labels.instance }})\n        description: \"The Hadoop NameNode service is unavailable.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopResourceManagerDown\n      expr: 'up{job=\"hadoop-resourcemanager\"} == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Hadoop Resource Manager Down (instance {{ $labels.instance }})\n        description: \"The Hadoop ResourceManager service is unavailable.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopDataNodeOutOfService\n      expr: 'hadoop_datanode_last_heartbeat == 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }})\n        description: \"The Hadoop DataNode is not sending heartbeats.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopHdfsDiskSpaceLow\n      expr: '(hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 and hadoop_hdfs_bytes_total > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }})\n        description: \"Available HDFS disk space is running low.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopMapReduceTaskFailures\n      expr: 'increase(hadoop_mapreduce_task_failures_total[1h]) > 100'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Hadoop Map Reduce Task Failures (instance {{ $labels.instance }})\n        description: \"There is an unusually high number of MapReduce task 
failures.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopResourceManagerMemoryHigh\n      expr: 'hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop Resource Manager Memory High (instance {{ $labels.instance }})\n        description: \"The Hadoop ResourceManager is approaching its memory limit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopYarnContainerAllocationFailures\n      expr: 'increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop YARN Container Allocation Failures (instance {{ $labels.instance }})\n        description: \"There is a significant number of YARN container allocation failures.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopHbaseRegionCountHigh\n      expr: 'hadoop_hbase_region_count > 5000'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop HBase Region Count High (instance {{ $labels.instance }})\n        description: \"The HBase cluster has an unusually high number of regions.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopHbaseRegionServerHeapLow\n      expr: 'hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }})\n        description: \"HBase Region Servers are running low on heap space.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HadoopHbaseWriteRequestsLatencyHigh\n      expr: 'hadoop_hbase_write_requests_latency_seconds > 0.5'\n      for: 10m\n      labels:\n        severity: warning\n      
annotations:\n        summary: Hadoop HBase Write Requests Latency High (instance {{ $labels.instance }})\n        description: \"HBase Write Requests are experiencing high latency.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/haproxy/embedded-exporter-v2.yml",
    "content": "groups:\n\n- name: EmbeddedExporterV2\n\n  \n  rules:\n\n    - alert: HaproxyHighHttp4xxErrorRateBackend\n      expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (proxy) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp5xxErrorRateBackend\n      expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (proxy) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp4xxErrorRateServer\n      expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
HaproxyHighHttp5xxErrorRateServer\n      expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerResponseErrors\n      expr: '(sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy server response errors (instance {{ $labels.instance }})\n        description: \"Too many response errors to {{ $labels.server }} server (> 5%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyBackendConnectionErrors\n      expr: '(sum by (proxy) (rate(haproxy_backend_connection_errors_total[1m]))) > 100'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy backend connection errors (instance {{ $labels.instance }})\n        description: \"Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). 
Request throughput may be too high.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerConnectionErrors\n      expr: '(sum by (proxy) (rate(haproxy_server_connection_errors_total[1m]))) > 100'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy server connection errors (instance {{ $labels.instance }})\n        description: \"Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyBackendMaxActiveSession>80%\n      expr: '((haproxy_backend_current_sessions >0) * 100) / (haproxy_backend_limit_sessions > 0) > 80'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy backend max active session > 80% (instance {{ $labels.instance }})\n        description: \"Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # haproxy_backend_current_queue is a gauge (current queue depth), not a counter.\n    - alert: HaproxyPendingRequests\n      expr: 'sum by (proxy) (haproxy_backend_current_queue) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy pending requests (instance {{ $labels.instance }})\n        description: \"Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf \\\"%.2f\\\"}}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHttpSlowingDown\n      expr: 'avg by (instance, proxy) (haproxy_backend_max_total_time_seconds) > 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})\n        description: \"Average request time is increasing - {{ $value | printf \\\"%.2f\\\"}}\\n  VALUE = {{ $value 
}}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyRetryHigh\n      expr: 'sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy retry high (instance {{ $labels.instance }})\n        description: \"High rate of retry on {{ $labels.proxy }} - {{ $value | printf \\\"%.2f\\\"}}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHasNoAliveBackends\n      expr: 'haproxy_backend_active_servers + haproxy_backend_backup_servers == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy has no alive backends (instance {{ $labels.instance }})\n        description: \"HAProxy has no alive active or backup backends for {{ $labels.proxy }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyFrontendSecurityBlockedRequests\n      expr: 'sum by (proxy) (rate(haproxy_frontend_denied_connections_total[2m])) > 10'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})\n        description: \"HAProxy is blocking requests for security reasons\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerHealthcheckFailure\n      expr: 'increase(haproxy_server_check_failures_total[1m]) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})\n        description: \"Some server health checks are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/haproxy/haproxy-exporter-v1.yml",
    "content": "groups:\n\n- name: HaproxyExporterV1\n\n  \n  rules:\n\n    - alert: HaproxyDown\n      expr: 'haproxy_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy down (instance {{ $labels.instance }})\n        description: \"HAProxy down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp4xxErrorRateBackend(v1)\n      expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 4xx error rate backend (v1) (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp5xxErrorRateBackend(v1)\n      expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 5xx error rate backend (v1) (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp4xxErrorRateServer(v1)\n      expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code=\"4xx\"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: 
critical\n      annotations:\n        summary: HAProxy high HTTP 4xx error rate server (v1) (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHighHttp5xxErrorRateServer(v1)\n      expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code=\"5xx\"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy high HTTP 5xx error rate server (v1) (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerResponseErrors(v1)\n      expr: 'sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy server response errors (v1) (instance {{ $labels.instance }})\n        description: \"Too many response errors to {{ $labels.server }} server (> 5%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyBackendConnectionErrors(v1)\n      expr: 'sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy backend connection errors (v1) (instance {{ $labels.instance }})\n        description: \"Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). 
Request throughput may be too high.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerConnectionErrors(v1)\n      expr: 'sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy server connection errors (v1) (instance {{ $labels.instance }})\n        description: \"Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyBackendMaxActiveSession\n      expr: '((sum by (backend) (haproxy_backend_current_sessions * 100) / sum by (backend) (haproxy_backend_limit_sessions))) > 80 and sum by (backend) (haproxy_backend_limit_sessions) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy backend max active session (instance {{ $labels.instance }})\n        description: \"HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyPendingRequests(v1)\n      expr: 'sum by (backend) (haproxy_backend_current_queue) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy pending requests (v1) (instance {{ $labels.instance }})\n        description: \"Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyHttpSlowingDown(v1)\n      expr: 'avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy HTTP slowing down (v1) (instance {{ $labels.instance }})\n        description: \"Average request time is increasing\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    
- alert: HaproxyRetryHigh(v1)\n      expr: 'sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy retry high (v1) (instance {{ $labels.instance }})\n        description: \"High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyBackendDown\n      expr: 'haproxy_backend_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy backend down (instance {{ $labels.instance }})\n        description: \"HAProxy backend is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerDown\n      expr: 'haproxy_server_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: HAProxy server down (instance {{ $labels.instance }})\n        description: \"HAProxy server is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyFrontendSecurityBlockedRequests(v1)\n      expr: 'sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy frontend security blocked requests (v1) (instance {{ $labels.instance }})\n        description: \"HAProxy is blocking requests for security reasons\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HaproxyServerHealthcheckFailure(v1)\n      expr: 'increase(haproxy_server_check_failures_total[1m]) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: HAProxy server healthcheck failure (v1) (instance {{ $labels.instance }})\n        description: \"Some server health checks are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/hashicorp-vault/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: VaultSealed\n      expr: 'vault_core_unsealed == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Vault sealed (instance {{ $labels.instance }})\n        description: \"Vault instance is sealed on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: VaultTooManyPendingTokens\n      expr: 'avg(vault_token_create_count - vault_token_store_count) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Vault too many pending tokens (instance {{ $labels.instance }})\n        description: \"Too many pending tokens {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: VaultTooManyInfinityTokens\n      expr: 'vault_token_count_by_ttl{creation_ttl=\"+Inf\"} > 3'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Vault too many infinity tokens (instance {{ $labels.instance }})\n        description: \"Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: VaultClusterHealth\n      expr: 'sum(vault_core_active) / count(vault_core_active) <= 0.5'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Vault cluster health (instance {{ $labels.instance }})\n        description: \"Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/host-and-hardware/node-exporter.yml",
    "content": "groups:\n\n- name: NodeExporter\n\n  \n  rules:\n\n    - alert: HostOutOfMemory\n      expr: '(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < .10)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host out of memory (instance {{ $labels.instance }})\n        description: \"Node memory is filling up (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostMemoryUnderMemoryPressure\n      expr: '(rate(node_vmstat_pgmajfault[5m]) > 1000)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host memory under memory pressure (instance {{ $labels.instance }})\n        description: \"The node is under heavy memory pressure. High rate of major page faults ({{ $value }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly\n    - alert: HostMemoryIsUnderutilized\n      expr: 'min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Host Memory is underutilized (instance {{ $labels.instance }})\n        description: \"Node memory usage is < 20% for 1 week. Consider reducing memory space. 
(instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostUnusualNetworkThroughputIn\n      expr: '((rate(node_network_receive_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host unusual network throughput in (instance {{ $labels.instance }})\n        description: \"Host receive bandwidth is high (>80%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostUnusualNetworkThroughputOut\n      expr: '((rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host unusual network throughput out (instance {{ $labels.instance }})\n        description: \"Host transmit bandwidth is high (>80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostDiskIoUtilizationHigh\n      expr: '(rate(node_disk_io_time_seconds_total[5m]) > .80)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host disk IO utilization high (instance {{ $labels.instance }})\n        description: \"Disk utilization is high (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Please add ignored mountpoints in node_exporter parameters like\n    # \"--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)\".\n    # Same rule using \"node_filesystem_free_bytes\" will fire when disk fills for non-root users.\n    - alert: HostOutOfDiskSpace\n      expr: '(node_filesystem_avail_bytes{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"} / node_filesystem_size_bytes < .10 and on (instance, device, mountpoint) node_filesystem_readonly == 0)'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Host out of disk space (instance {{ 
$labels.instance }})\n        description: \"Disk is almost full (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Please add ignored mountpoints in node_exporter parameters like\n    # \"--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)\".\n    # Same rule using \"node_filesystem_free_bytes\" will fire when disk fills for non-root users.\n    - alert: HostDiskMayFillIn24Hours\n      expr: 'predict_linear(node_filesystem_avail_bytes{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"}[3h], 86400) <= 0 and node_filesystem_avail_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host disk may fill in 24 hours (instance {{ $labels.instance }})\n        description: \"Filesystem will likely run out of space within the next 24 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostOutOfInodes\n      expr: '(node_filesystem_files_free / node_filesystem_files < .10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) and node_filesystem_files > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Host out of inodes (instance {{ $labels.instance }})\n        description: \"Disk is almost running out of available inodes (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostFilesystemDeviceError\n      expr: 'node_filesystem_device_error{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"} == 1'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Host filesystem device error (instance {{ $labels.instance }})\n        description: \"Error stat-ing the {{ $labels.mountpoint }} filesystem\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostInodesMayFillIn24Hours\n      expr: 'predict_linear(node_filesystem_files_free{fstype!~\"^(fuse.*|tmpfs|cifs|nfs)\"}[1h], 86400) <= 0 and node_filesystem_files_free > 0'\n      for: 2m\n   
   labels:\n        severity: warning\n      annotations:\n        summary: Host inodes may fill in 24 hours (instance {{ $labels.instance }})\n        description: \"Filesystem will likely run out of inodes within the next 24 hours at current write rate\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostUnusualDiskReadLatency\n      expr: '(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host unusual disk read latency (instance {{ $labels.instance }})\n        description: \"Disk latency is growing (read operations > 100ms)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostUnusualDiskWriteLatency\n      expr: '(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host unusual disk write latency (instance {{ $labels.instance }})\n        description: \"Disk latency is growing (write operations > 100ms)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostHighCpuLoad\n      expr: '1 - (avg without (cpu) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) > .80'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host high CPU load (instance {{ $labels.instance }})\n        description: \"CPU load is > 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly\n    - alert: HostCpuIsUnderutilized\n      expr: '(min without (cpu) (rate(node_cpu_seconds_total{mode=\"idle\"}[1h]))) > 0.8'\n      for: 1w\n      labels:\n        severity: info\n      annotations:\n        
summary: Host CPU is underutilized (instance {{ $labels.instance }})\n        description: \"CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostCpuStealNoisyNeighbor\n      expr: 'avg without (cpu) (rate(node_cpu_seconds_total{mode=\"steal\"}[5m])) * 100 > 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})\n        description: \"CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostCpuHighIowait\n      expr: 'avg without (cpu) (rate(node_cpu_seconds_total{mode=\"iowait\"}[5m])) > .10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host CPU high iowait (instance {{ $labels.instance }})\n        description: \"CPU iowait > 10%. Your CPU is idling waiting for storage to respond.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostUnusualDiskIo\n      expr: 'rate(node_disk_io_time_seconds_total[5m]) > 0.8'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host unusual disk IO (instance {{ $labels.instance }})\n        description: \"Disk usage >80%. Check storage for issues or increase IOPS capabilities. 
\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # x2 context switches is an arbitrary number.\n    # The alert threshold depends on the nature of the application.\n    # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58\n    - alert: HostContextSwitchingHigh\n      expr: '(rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode=\"idle\"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode=\"idle\"})) > 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host context switching high (instance {{ $labels.instance }})\n        description: \"Context switching is growing on the node (twice the daily average during the last 15m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostSwapIsFillingUp\n      expr: '((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) and node_memory_SwapTotal_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host swap is filling up (instance {{ $labels.instance }})\n        description: \"Swap is filling up (>80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostSystemdServiceCrashed\n      expr: '(node_systemd_unit_state{state=\"failed\"} == 1)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host systemd service crashed (instance {{ $labels.instance }})\n        description: \"systemd service {{ $labels.name }} crashed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostPhysicalComponentTooHot\n      expr: 'node_hwmon_temp_celsius > node_hwmon_temp_max_celsius'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host physical component too hot (instance {{ $labels.instance }})\n        description: \"Physical hardware 
component too hot\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostNodeOvertemperatureAlarm\n      expr: '((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Host node overtemperature alarm (instance {{ $labels.instance }})\n        description: \"Physical node temperature alarm triggered\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Uses ignoring(state) to handle additional labels on node_md_disks. Matches the official node-exporter mixin.\n    - alert: HostSoftwareRaidInsufficientDrives\n      expr: '((node_md_disks_required - ignoring(state) node_md_disks{state=\"active\"}) > 0)'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Host software RAID insufficient drives (instance {{ $labels.instance }})\n        description: \"MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostSoftwareRaidDiskFailure\n      expr: '(node_md_disks{state=\"failed\"} > 0)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host software RAID disk failure (instance {{ $labels.instance }})\n        description: \"MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostKernelVersionDeviations\n      expr: 'changes(node_uname_info[1h]) > 0'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Host kernel version deviations (instance {{ $labels.instance }})\n        description: \"Kernel version for {{ $labels.instance }} has changed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # When a machine runs out of memory, the node exporter can become unresponsive for several minutes. 
Even if the system takes 15–20 minutes to recover, the alert should still trigger.\n    - alert: HostOomKillDetected\n      expr: '(increase(node_vmstat_oom_kill[30m]) > 0)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host OOM kill detected (instance {{ $labels.instance }})\n        description: \"OOM kill detected\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostEdacCorrectableErrorsDetected\n      expr: '(increase(node_edac_correctable_errors_total[1m]) > 0)'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})\n        description: \"Host {{ $labels.instance }} has had {{ printf \\\"%.0f\\\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostEdacUncorrectableErrorsDetected\n      expr: '(node_edac_uncorrectable_errors_total > 0)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})\n        description: \"Host {{ $labels.instance }} has had {{ printf \\\"%.0f\\\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostNetworkReceiveErrors\n      expr: '(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) and rate(node_network_receive_packets_total[2m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host Network Receive Errors (instance {{ $labels.instance }})\n        description: \"Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \\\"%.0f\\\" $value }} receive errors in the last two minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels 
}}\"\n\n    - alert: HostNetworkTransmitErrors\n      expr: '(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) and rate(node_network_transmit_packets_total[2m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host Network Transmit Errors (instance {{ $labels.instance }})\n        description: \"Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \\\"%.0f\\\" $value }} transmit errors in the last two minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostNetworkBondDegraded\n      expr: '((node_bonding_active - node_bonding_slaves) != 0)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host Network Bond Degraded (instance {{ $labels.instance }})\n        description: \"Bond \\\"{{ $labels.device }}\\\" degraded on \\\"{{ $labels.instance }}\\\".\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostConntrackLimit\n      expr: '(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) and node_nf_conntrack_entries_limit > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host conntrack limit (instance {{ $labels.instance }})\n        description: \"The number of conntrack is approaching limit\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostClockSkew\n      expr: '((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0))'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host clock skew (instance {{ $labels.instance }})\n        description: \"Clock skew detected. Clock is out of sync. 
Ensure NTP is configured correctly on this host.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HostClockNotSynchronising\n      expr: '(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Host clock not synchronising (instance {{ $labels.instance }})\n        description: \"Clock not synchronising. Ensure NTP is configured on this host.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/ipmi/ipmi-exporter.yml",
    "content": "groups:\n\n- name: IpmiExporter\n\n  \n  rules:\n\n    # The ipmi_up metric is per-collector. A value of 0 means the collector could not retrieve data from the BMC.\n    - alert: IpmiCollectorDown\n      expr: 'ipmi_up == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI collector down (instance {{ $labels.instance }})\n        description: \"IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # State values: 0=nominal, 1=warning, 2=critical. Thresholds are defined in the BMC firmware.\n    - alert: IpmiTemperatureSensorWarning\n      expr: 'ipmi_temperature_state == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI temperature sensor warning (instance {{ $labels.instance }})\n        description: \"IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiTemperatureSensorCritical\n      expr: 'ipmi_temperature_state == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI temperature sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. 
Immediate attention required to prevent hardware damage.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiFanSpeedSensorWarning\n      expr: 'ipmi_fan_speed_state == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI fan speed sensor warning (instance {{ $labels.instance }})\n        description: \"IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiFanSpeedSensorCritical\n      expr: 'ipmi_fan_speed_state == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI fan speed sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiFanSpeedZero\n      expr: 'ipmi_fan_speed_rpm == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI fan speed zero (instance {{ $labels.instance }})\n        description: \"IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. 
The fan may have failed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiVoltageSensorWarning\n      expr: 'ipmi_voltage_state == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI voltage sensor warning (instance {{ $labels.instance }})\n        description: \"IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiVoltageSensorCritical\n      expr: 'ipmi_voltage_state == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI voltage sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiCurrentSensorWarning\n      expr: 'ipmi_current_state == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI current sensor warning (instance {{ $labels.instance }})\n        description: \"IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiCurrentSensorCritical\n      expr: 'ipmi_current_state == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI current sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiPowerSensorWarning\n      expr: 'ipmi_power_state == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI power sensor warning (instance {{ $labels.instance }})\n        description: \"IPMI 
power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiPowerSensorCritical\n      expr: 'ipmi_power_state == 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI power sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Catches any sensor type not covered by the specific temperature/fan/voltage/current/power alerts.\n    - alert: IpmiGenericSensorCritical\n      expr: 'ipmi_sensor_state == 2'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI generic sensor critical (instance {{ $labels.instance }})\n        description: \"IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IpmiChassisPowerOff\n      expr: 'ipmi_chassis_power_state == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI chassis power off (instance {{ $labels.instance }})\n        description: \"IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The metric uses inverted logic: 1=no fault, 0=fault detected.\n    - alert: IpmiChassisDriveFault\n      expr: 'ipmi_chassis_drive_fault_state == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI chassis drive fault (instance {{ $labels.instance }})\n        description: \"IPMI reports a drive fault on {{ $labels.instance }}. 
Check disk health.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The metric uses inverted logic: 1=no fault, 0=fault detected.\n    - alert: IpmiChassisCoolingFault\n      expr: 'ipmi_chassis_cooling_fault_state == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: IPMI chassis cooling fault (instance {{ $labels.instance }})\n        description: \"IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # SEL storage is typically very limited (e.g., 16KB). When full, new events may be dropped.\n    - alert: IpmiSelAlmostFull\n      expr: 'ipmi_sel_free_space_bytes < 512'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: IPMI SEL almost full (instance {{ $labels.instance }})\n        description: \"IPMI System Event Log on {{ $labels.instance }} has only {{ printf \\\"%.0f\\\" $value }} bytes free. Clear the SEL to prevent loss of new events.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/istio/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: IstioKubernetesGatewayAvailabilityDrop\n      expr: 'min(kube_deployment_status_replicas_available{deployment=\"istio-ingressgateway\", namespace=\"istio-system\"}) without (instance, pod) < 2'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio Kubernetes gateway availability drop (instance {{ $labels.instance }})\n        description: \"Gateway pods have dropped. Inbound traffic will likely be affected.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioPilotHighTotalRequestRate\n      expr: 'sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5 and sum(rate(pilot_xds_pushes[1m])) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio Pilot high total request rate (instance {{ $labels.instance }})\n        description: \"Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioMixerPrometheusDispatchesLow\n      expr: 'sum(rate(mixer_runtime_dispatches_total{adapter=~\"prometheus\"}[1m])) < 180'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio Mixer Prometheus dispatches low (instance {{ $labels.instance }})\n        description: \"Number of Mixer dispatches to Prometheus is too low. 
Istio metrics might not be being exported properly.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioHighTotalRequestRate\n      expr: 'sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 1000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio high total request rate (instance {{ $labels.instance }})\n        description: \"Global request rate in the service mesh is unusually high.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioLowTotalRequestRate\n      expr: 'sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) < 100'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio low total request rate (instance {{ $labels.instance }})\n        description: \"Global request rate in the service mesh is unusually low.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioHigh4xxErrorRate\n      expr: 'sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"4.*\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio high 4xx error rate (instance {{ $labels.instance }})\n        description: \"High percentage of HTTP 4xx responses in Istio (> 5%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioHigh5xxErrorRate\n      expr: 'sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5.*\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio high 5xx error rate (instance {{ $labels.instance }})\n        description: \"High percentage of 
HTTP 5xx responses in Istio (> 5%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioHighRequestLatency\n      expr: 'rate(istio_request_duration_milliseconds_sum{reporter=\"destination\"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter=\"destination\"}[1m]) > 100 and rate(istio_request_duration_milliseconds_count{reporter=\"destination\"}[1m]) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio high request latency (instance {{ $labels.instance }})\n        description: \"Istio average requests execution is longer than 100ms.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioLatency99Percentile\n      expr: 'histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Istio latency 99 percentile (instance {{ $labels.instance }})\n        description: \"Istio 1% slowest requests are longer than 1000ms.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: IstioPilotDuplicateEntry\n      expr: 'sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }})\n        description: \"Istio pilot duplicate entry error.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/jaeger/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: JaegerAgentHttpServerErrors\n      expr: '100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger agent HTTP server errors (instance {{ $labels.instance }})\n        description: \"Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerClientRpcRequestErrors\n      expr: '100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~\"4xx|5xx\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger client RPC request errors (instance {{ $labels.instance }})\n        description: \"Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerClientSpansDropped\n      expr: '100 * sum(rate(jaeger_reporter_spans{result=~\"dropped|err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger client spans dropped (instance {{ $labels.instance }})\n        description: \"Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\\n  VALUE 
= {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerAgentSpansDropped\n      expr: '100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger agent spans dropped (instance {{ $labels.instance }})\n        description: \"Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerCollectorDroppingSpans\n      expr: '100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger collector dropping spans (instance {{ $labels.instance }})\n        description: \"Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerSamplingUpdateFailing\n      expr: '100 * sum(rate(jaeger_sampler_queries{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger sampling update failing (instance {{ $labels.instance }})\n        description: \"Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\\n  VALUE = {{ $value }}\\n  LABELS = {{ 
$labels }}\"\n\n    - alert: JaegerThrottlingUpdateFailing\n      expr: '100 * sum(rate(jaeger_throttler_updates{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger throttling update failing (instance {{ $labels.instance }})\n        description: \"Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JaegerQueryRequestFailures\n      expr: '100 * sum(rate(jaeger_query_requests_total{result=\"err\"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jaeger query request failures (instance {{ $labels.instance }})\n        description: \"Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/jenkins/metric-plugin.yml",
    "content": "groups:\n\n- name: MetricPlugin\n\n  \n  rules:\n\n    - alert: JenkinsNodeOffline\n      expr: 'jenkins_node_offline_value > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Jenkins node offline (instance {{ $labels.instance }})\n        description: \"At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsNoNodeOnline\n      expr: 'jenkins_node_online_value == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Jenkins no node online (instance {{ $labels.instance }})\n        description: \"No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsHealthcheck\n      expr: 'jenkins_health_check_score < 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Jenkins healthcheck (instance {{ $labels.instance }})\n        description: \"Jenkins healthcheck score: {{$value}}. 
Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsOutdatedPlugins\n      expr: 'sum(jenkins_plugins_withUpdate) by (instance) > 3'\n      for: 1d\n      labels:\n        severity: warning\n      annotations:\n        summary: Jenkins outdated plugins (instance {{ $labels.instance }})\n        description: \"{{ $value }} plugins need update\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsBuildsHealthScore\n      expr: 'default_jenkins_builds_health_score < 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Jenkins builds health score (instance {{ $labels.instance }})\n        description: \"Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsRunFailureTotal\n      expr: 'delta(jenkins_runs_failure_total[1h]) > 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jenkins run failure total (instance {{ $labels.instance }})\n        description: \"Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JenkinsBuildTestsFailing\n      expr: 'default_jenkins_builds_last_build_tests_failing > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jenkins build tests failing (instance {{ $labels.instance }})\n        description: \"Last build tests failed: {{$labels.jenkins_job}}. 
Failed build tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # * RUNNING  -1 true  - The build had no errors.\n    # * SUCCESS   0 true  - The build had no errors.\n    # * UNSTABLE  1 true  - The build had some errors but they were not fatal. For example, some tests failed.\n    # * FAILURE   2 false - The build had a fatal error.\n    # * NOT_BUILT 3 false - The module was not built.\n    # * ABORTED   4 false - The build was manually aborted.\n    - alert: JenkinsLastBuildFailed\n      expr: 'default_jenkins_builds_last_build_result_ordinal == 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Jenkins last build failed (instance {{ $labels.instance }})\n        description: \"Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/juniper/czerwonk-junos-exporter.yml",
    "content": "groups:\n\n- name: CzerwonkJunosExporter\n\n  \n  rules:\n\n    - alert: JuniperSwitchDown\n      expr: 'junos_up == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Juniper switch down (instance {{ $labels.instance }})\n        description: \"The switch appears to be down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JuniperCriticalBandwidthUsage1gib\n      expr: 'rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Juniper critical Bandwidth Usage 1GiB (instance {{ $labels.instance }})\n        description: \"Interface is highly saturated (> 0.90 Gbit/s transmit on a 1 Gbit/s link).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JuniperWarningBandwidthUsage1gib\n      expr: 'rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Juniper warning Bandwidth Usage 1GiB (instance {{ $labels.instance }})\n        description: \"Interface is getting saturated (> 0.80 Gbit/s transmit on a 1 Gbit/s link).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/jvm/jvm-exporter.yml",
    "content": "groups:\n\n- name: JvmExporter\n\n  \n  rules:\n\n    - alert: JvmMemoryFillingUp\n      expr: '(sum by (instance)(jvm_memory_used_bytes{area=\"heap\"}) / sum by (instance)(jvm_memory_max_bytes{area=\"heap\"})) * 100 > 80 and sum by (instance)(jvm_memory_max_bytes{area=\"heap\"}) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM memory filling up (instance {{ $labels.instance }})\n        description: \"JVM memory is filling up (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Many JVM configurations leave metaspace unbounded, in which case jvm_memory_max_bytes{area=\"nonheap\"} is -1 and this alert will not fire.\n    # The query filters out max_bytes <= 0 to avoid false negatives.\n    - alert: JvmNon-heapMemoryFillingUp\n      expr: '(sum by (instance)(jvm_memory_used_bytes{area=\"nonheap\"}) / (sum by (instance)(jvm_memory_max_bytes{area=\"nonheap\"}) > 0)) * 100 > 80'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM non-heap memory filling up (instance {{ $labels.instance }})\n        description: \"JVM non-heap memory (metaspace/code cache) is filling up (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmGcTimeTooHigh\n      expr: 'sum by (instance)(rate(jvm_gc_collection_seconds_sum[5m])) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM GC time too high (instance {{ $labels.instance }})\n        description: \"JVM is spending too much time in garbage collection (> 5% of wall clock time)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmThreadsDeadlocked\n      expr: 'jvm_threads_deadlocked > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: JVM threads deadlocked (instance {{ $labels.instance }})\n        description: \"JVM has deadlocked threads\\n  VALUE = {{ 
$value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmThreadCountHigh\n      expr: 'jvm_threads_current > 300'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM thread count high (instance {{ $labels.instance }})\n        description: \"JVM thread count is high (> 300), potential thread leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmThreadsBlocked\n      expr: 'jvm_threads_state{state=\"BLOCKED\"} > 50'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM threads BLOCKED (instance {{ $labels.instance }})\n        description: \"JVM has high number of BLOCKED threads, indicating lock contention\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This regex matches CMS, G1, and Parallel collector names. It will not match ZGC or Shenandoah cycle names.\n    # Adjust the gc label filter if you use a different collector.\n    - alert: JvmOldGenGcFrequency\n      expr: 'rate(jvm_gc_collection_seconds_count{gc=~\".*old.*|.*major.*\"}[5m]) > 0.3'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM old gen GC frequency (instance {{ $labels.instance }})\n        description: \"Frequent old/major GC cycles, indicating memory pressure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmDirectBufferPoolFillingUp\n      expr: '(jvm_buffer_pool_used_bytes / jvm_buffer_pool_capacity_bytes) * 100 > 90 and jvm_buffer_pool_capacity_bytes > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM direct buffer pool filling up (instance {{ $labels.instance }})\n        description: \"JVM direct buffer pool is filling up (> 90%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmObjectsPendingFinalization\n      expr: 'jvm_memory_objects_pending_finalization > 1000'\n      for: 5m\n      labels:\n        severity: 
warning\n      annotations:\n        summary: JVM objects pending finalization (instance {{ $labels.instance }})\n        description: \"JVM has objects pending finalization, potential memory leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not JVM-specific.\n    # This alert will also fire for Go, Python, or any process exposing these metrics.\n    - alert: JvmFileDescriptorsExhaustion\n      expr: '(process_open_fds / process_max_fds) * 100 > 90 and process_max_fds > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM file descriptors exhaustion (instance {{ $labels.instance }})\n        description: \"JVM process is running out of file descriptors (> 90% used)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmClassLoadingAnomaly\n      expr: 'rate(jvm_classes_loaded_total[5m]) > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM class loading anomaly (instance {{ $labels.instance }})\n        description: \"Rapid class loading detected, potential classloader leak\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: JvmCompilationTimeSpike\n      expr: 'rate(jvm_compilation_time_seconds_total[5m]) > 0.1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: JVM compilation time spike (instance {{ $labels.instance }})\n        description: \"Excessive JIT compilation time consuming CPU\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/kafka/danielqsj-kafka-exporter.yml",
    "content": "groups:\n\n- name: DanielqsjKafkaExporter\n\n  \n  rules:\n\n    - alert: KafkaTopicsReplicas\n      expr: 'min(kafka_topic_partition_in_sync_replica) by (topic) < 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kafka topics replicas (instance {{ $labels.instance }})\n        description: \"Kafka topic {{ $labels.topic }} has a partition with fewer than 3 in-sync replicas\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KafkaConsumerGroupLag\n      expr: 'sum(kafka_consumergroup_lag) by (consumergroup) > 10000'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kafka consumer group lag (instance {{ $labels.instance }})\n        description: \"Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/kafka/linkedin-kafka-exporter.yml",
    "content": "groups:\n\n- name: LinkedinKafkaExporter\n\n  \n  rules:\n\n    - alert: KafkaTopicOffsetDecreased\n      expr: 'delta(kafka_burrow_partition_current_offset[1m]) < 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kafka topic offset decreased (instance {{ $labels.instance }})\n        description: \"Kafka topic offset has decreased\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KafkaConsumerLag\n      expr: 'kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset >= (kafka_burrow_topic_partition_offset offset 15m - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset offset 15m) AND kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kafka consumer lag (instance {{ $labels.instance }})\n        description: \"Kafka consumer lag has been non-zero and not decreasing for 30 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/keycloak/aerogear-keycloak-metrics-spi.yml",
    "content": "groups:\n\n- name: AerogearKeycloakMetricsSpi\n\n  \n  rules:\n\n    # Threshold of 5% is a rough default. Adjust based on your user base and expected error rates.\n    # A spike in failed logins may indicate a brute-force attack or misconfigured client.\n    - alert: KeycloakHighLoginFailureRate\n      expr: '(sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])) / (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])))) * 100 > 5 and (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m]))) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Keycloak high login failure rate (instance {{ $labels.instance }})\n        description: \"More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Only fires when login attempts exist but none succeed — may indicate an authentication outage.\n    - alert: KeycloakNoSuccessfulLogins\n      expr: 'sum by (realm) (rate(keycloak_logins_total[15m])) == 0 and (sum by (realm) (rate(keycloak_logins_total[15m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[15m]))) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Keycloak no successful logins (instance {{ $labels.instance }})\n        description: \"No successful logins in realm {{ $labels.realm }} for the last 15 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 10% is a rough default. 
High refresh token errors may indicate expired sessions or token store issues.\n    - alert: KeycloakHighTokenRefreshErrorRate\n      expr: '(sum by (realm) (rate(keycloak_refresh_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_refresh_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_refresh_tokens_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Keycloak high token refresh error rate (instance {{ $labels.instance }})\n        description: \"More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 10% is a rough default. Code-to-token failures may indicate misconfigured OAuth clients or replay attacks.\n    - alert: KeycloakHighCode-to-tokenExchangeErrorRate\n      expr: '(sum by (realm) (rate(keycloak_code_to_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_code_to_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_code_to_tokens_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Keycloak high code-to-token exchange error rate (instance {{ $labels.instance }})\n        description: \"More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 10% is a rough default.\n    - alert: KeycloakHighRegistrationFailureRate\n      expr: '(sum by (realm) (rate(keycloak_registrations_errors_total[5m])) / sum by (realm) (rate(keycloak_registrations_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_registrations_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Keycloak high registration failure rate (instance {{ $labels.instance }})\n    
    description: \"More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \\\"%.1f\\\" }}%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # keycloak_request_duration is in milliseconds. Threshold of 2000ms (2 seconds) is a rough default.\n    - alert: KeycloakSlowRequestResponseTime\n      expr: 'sum by (method) (rate(keycloak_request_duration_sum[5m])) / sum by (method) (rate(keycloak_request_duration_count[5m])) > 2000 and sum by (method) (rate(keycloak_request_duration_count[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Keycloak slow request response time (instance {{ $labels.instance }})\n        description: \"Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/kubernetes/kubestate-exporter.yml",
    "content": "groups:\n\n- name: KubestateExporter\n\n  \n  rules:\n\n    - alert: KubernetesNodeNotReady\n      expr: 'kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 0'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Node not ready (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} has been unready for a long time\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Kubernetes Node with disabled schedules are fine.\n    # This alarm can be useful to get warned if there are nodes which are longer unscheduled.\n    - alert: KubernetesNodeSchedulingDisabled\n      expr: 'kube_node_spec_taint{key=\"node.kubernetes.io/unschedulable\"} == 1'\n      for: 30m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Node scheduling disabled (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesNodeMemoryPressure\n      expr: 'kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\"} == 1'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Node memory pressure (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} has MemoryPressure condition\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesNodeDiskPressure\n      expr: 'kube_node_status_condition{condition=\"DiskPressure\",status=\"true\"} == 1'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Node disk pressure (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} has DiskPressure condition\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesNodeNetworkUnavailable\n 
     expr: 'kube_node_status_condition{condition=\"NetworkUnavailable\",status=\"true\"} == 1'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Node network unavailable (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} has NetworkUnavailable condition\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesNodeOutOfPodCapacity\n      expr: 'sum by (node) ((kube_pod_status_phase{phase=\"Running\"} == 1) + on(uid, instance) group_left(node) (0 * kube_pod_info{pod_template_hash=\"\"})) / sum by (node) (kube_node_status_allocatable{resource=\"pods\"}) * 100 > 90'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Node out of pod capacity (instance {{ $labels.instance }})\n        description: \"Node {{ $labels.node }} is out of pod capacity\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesContainerOomKiller\n      expr: '(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"}[10m]) == 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Container oom killer (instance {{ $labels.instance }})\n        description: \"Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesJobFailed\n      expr: 'kube_job_status_failed > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Job failed (instance {{ $labels.instance }})\n        description: \"Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\\n  VALUE = {{ $value }}\\n  LABELS = {{ 
$labels }}\"\n\n    - alert: KubernetesJobNotStarting\n      expr: 'kube_job_status_active == 0 and kube_job_status_failed == 0 and kube_job_status_succeeded == 0 and (time() - kube_job_status_start_time) > 600'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Job not starting (instance {{ $labels.instance }})\n        description: \"Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesCronjobFailing\n      expr: '(kube_cronjob_status_last_schedule_time > kube_cronjob_status_last_successful_time) AND (kube_cronjob_status_active == 0) AND (kube_cronjob_spec_suspend == 0)'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes CronJob failing (instance {{ $labels.instance }})\n        description: \"CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesCronjobSuspended\n      expr: 'kube_cronjob_spec_suspend != 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})\n        description: \"CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesPersistentvolumeclaimPending\n      expr: 'kube_persistentvolumeclaim_status_phase{phase=\"Pending\"} == 1'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})\n        description: \"PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesVolumeOutOfDiskSpace\n      expr: 
'kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 and kubelet_volume_stats_capacity_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})\n        description: \"Volume is almost full (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesVolumeFullInFourDays\n      expr: 'predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})\n        description: \"Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesPersistentvolumeError\n      expr: 'kube_persistentvolume_status_phase{phase=~\"Failed|Pending\", job=\"kube-state-metrics\"} > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})\n        description: \"Persistent volume {{ $labels.persistentvolume }} is in bad state\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesStatefulsetDown\n      expr: 'kube_statefulset_replicas != kube_statefulset_status_replicas_ready > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})\n        description: \"StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesHpaScaleInability\n      expr: '(kube_horizontalpodautoscaler_spec_max_replicas - 
kube_horizontalpodautoscaler_status_desired_replicas) * on (horizontalpodautoscaler,namespace) (kube_horizontalpodautoscaler_status_condition{condition=\"ScalingLimited\", status=\"true\"} == 1) == 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes HPA scale inability (instance {{ $labels.instance }})\n        description: \"HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesHpaMetricsUnavailability\n      expr: 'kube_horizontalpodautoscaler_status_condition{status=\"false\", condition=\"ScalingActive\"} == 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes HPA metrics unavailability (instance {{ $labels.instance }})\n        description: \"HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesHpaScaleMaximum\n      expr: '(kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas)'\n      for: 2m\n      labels:\n        severity: info\n      annotations:\n        summary: Kubernetes HPA scale maximum (instance {{ $labels.instance }})\n        description: \"HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesHpaUnderutilized\n      expr: 'max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        
summary: Kubernetes HPA underutilized (instance {{ $labels.instance }})\n        description: \"HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is constantly at minimum replicas for 50% of the time. Potential cost saving here.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesPodNotHealthy\n      expr: 'sum by (namespace, pod) (kube_pod_status_phase{phase=~\"Pending|Unknown|Failed\"}) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})\n        description: \"Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesPodCrashLooping\n      expr: 'increase(kube_pod_container_status_restarts_total[1m]) > 3'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes pod crash looping (instance {{ $labels.instance }})\n        description: \"Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesReplicasetReplicasMismatch\n      expr: 'kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes ReplicaSet replicas mismatch (instance {{ $labels.instance }})\n        description: \"ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesDeploymentReplicasMismatch\n      expr: 'kube_deployment_spec_replicas != kube_deployment_status_replicas_available'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})\n        description: 
\"Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesStatefulsetReplicasMismatch\n      expr: 'kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})\n        description: \"StatefulSet does not match the expected number of replicas.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesDeploymentGenerationMismatch\n      expr: 'kube_deployment_status_observed_generation != kube_deployment_metadata_generation'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})\n        description: \"Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesStatefulsetGenerationMismatch\n      expr: 'kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})\n        description: \"StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesStatefulsetUpdateNotRolledOut\n      expr: 'max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes StatefulSet update not rolled out 
(instance {{ $labels.instance }})\n        description: \"StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesDaemonsetRolloutStuck\n      expr: '(kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 and kube_daemonset_status_desired_number_scheduled > 0) or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})\n        description: \"Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesDaemonsetMisscheduled\n      expr: 'kube_daemonset_status_number_misscheduled > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})\n        description: \"Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold should be customized for each cronjob name.\n    - alert: KubernetesCronjobTooLong\n      expr: 'kube_job_status_start_time > 0 and absent(kube_job_status_completion_time) and (time() - kube_job_status_start_time) > 3600'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes CronJob too long (instance {{ $labels.instance }})\n        description: \"CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesJobSlowCompletion\n      expr: 'kube_job_spec_completions - 
kube_job_status_succeeded - kube_job_status_failed > 0'\n      for: 12h\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes Job slow completion (instance {{ $labels.instance }})\n        description: \"Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesApiServerErrors\n      expr: 'sum(rate(apiserver_request_total{job=\"apiserver\",code=~\"(?:5..)\"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job=\"apiserver\"}[1m])) by (instance, job) * 100 > 3 and sum(rate(apiserver_request_total{job=\"apiserver\"}[1m])) by (instance, job) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes API server errors (instance {{ $labels.instance }})\n        description: \"Kubernetes API server is experiencing {{ $value | humanize }}% error rate\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesApiClientErrors\n      expr: '(sum(rate(rest_client_requests_total{code=~\"(4|5)..\"}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 and sum(rate(rest_client_requests_total[1m])) by (instance, job) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes API client errors (instance {{ $labels.instance }})\n        description: \"Kubernetes API client is experiencing {{ $value | humanize }}% error rate\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesClientCertificateExpiresNextWeek\n      expr: 'apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 7*24*60*60'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        
summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }})\n        description: \"A client certificate used to authenticate to the apiserver expires in less than 7 days.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesClientCertificateExpiresSoon\n      expr: 'apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 24*60*60'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }})\n        description: \"A client certificate used to authenticate to the apiserver expires in less than 24 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: KubernetesApiServerLatency\n      expr: 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"(?:CONNECT|WATCHLIST|WATCH|PROXY)\"} [10m])) WITHOUT (subresource)) > 1'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Kubernetes API server latency (instance {{ $labels.instance }})\n        description: \"Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/linkerd/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    # Linkerd does not expose request_errors_total. Errors are tracked via response_total{classification=\"failure\"}.\n    - alert: LinkerdHighErrorRate\n      expr: 'sum(rate(response_total{classification=\"failure\"}[1m])) by (deployment, statefulset, daemonset) / sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10 and sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) > 0'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Linkerd high error rate (instance {{ $labels.instance }})\n        description: \"Linkerd error rate for {{ $labels.deployment }}{{ $labels.statefulset }}{{ $labels.daemonset }} is over 10%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/loki/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: LokiProcessTooManyRestarts\n      expr: 'changes(process_start_time_seconds{job=~\".*loki.*\"}[15m]) > 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Loki process too many restarts (instance {{ $labels.instance }})\n        description: \"A Loki process restarted more than twice in the last 15 minutes (target {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: LokiRequestErrors\n      expr: '100 * sum(rate(loki_request_duration_seconds_count{status_code=~\"5..\"}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 and sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 0'\n      for: 15m\n      labels:\n        severity: critical\n      annotations:\n        summary: Loki request errors (instance {{ $labels.instance }})\n        description: \"{{ $labels.job }} route {{ $labels.route }} is experiencing {{ printf \\\"%.2f\\\" $value }}% errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: LokiRequestPanic\n      expr: 'sum(increase(loki_panic_total[10m])) by (namespace, job) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Loki request panic (instance {{ $labels.instance }})\n        description: \"{{ $labels.job }} has recorded {{ printf \\\"%.2f\\\" $value }} panics in the last 10 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: LokiRequestLatency\n      expr: '(histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~\"(?i).*tail.*\"}[5m])) by (namespace, job, route, le))) > 1'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Loki request latency (instance {{ $labels.instance }})\n        description: \"The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf 
\\\"%.2f\\\" $value }}s 99th percentile latency\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/meilisearch/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: MeilisearchIndexIsEmpty\n      expr: 'meilisearch_index_docs_count == 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Meilisearch index is empty (instance {{ $labels.instance }})\n        description: \"Meilisearch index {{ $labels.index }} has zero documents\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MeilisearchHttpResponseTime\n      expr: 'meilisearch_http_response_time_seconds > 0.5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Meilisearch HTTP response time (instance {{ $labels.instance }})\n        description: \"Meilisearch HTTP response time is too high (> 0.5s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/memcached/memcached-exporter.yml",
    "content": "groups:\n\n- name: MemcachedExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MemcachedDown\n      expr: 'memcached_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Memcached down (instance {{ $labels.instance }})\n        description: \"Memcached instance is down on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MemcachedConnectionLimitApproaching(>80%)\n      expr: '(memcached_current_connections / memcached_max_connections * 100) > 80 and memcached_max_connections > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached connection limit approaching (> 80%) (instance {{ $labels.instance }})\n        description: \"Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MemcachedConnectionLimitApproaching(>95%)\n      expr: '(memcached_current_connections / memcached_max_connections * 100) > 95 and memcached_max_connections > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Memcached connection limit approaching (> 95%) (instance {{ $labels.instance }})\n        description: \"Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MemcachedOutOfMemoryErrors\n      expr: 'sum without (slab) (rate(memcached_slab_items_outofmemory_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached out of memory errors (instance {{ $labels.instance }})\n        description: \"Memcached is returning out-of-memory errors on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # High memory usage is 
expected if the cache is well-utilized. This alert fires when it approaches the configured limit, which may cause evictions.\n    - alert: MemcachedMemoryUsageHigh(>90%)\n      expr: '(memcached_current_bytes / memcached_limit_bytes * 100) > 90 and memcached_limit_bytes > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached memory usage high (> 90%) (instance {{ $labels.instance }})\n        description: \"Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # A sustained eviction rate indicates memory pressure. Consider increasing memcached memory limit or reducing cache usage. Threshold of 10 evictions/s is a rough default — adjust based on your workload.\n    - alert: MemcachedHighEvictionRate\n      expr: 'rate(memcached_items_evicted_total[5m]) > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached high eviction rate (instance {{ $labels.instance }})\n        description: \"Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # A low hit rate may indicate poor cache utilization, incorrect cache keys, or TTLs that are too short. 
Threshold of 80% is a rough default — adjust based on your workload and access patterns.\n    - alert: MemcachedLowCacheHitRate(<80%)\n      expr: '(rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) / (rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) + rate(memcached_commands_total{command=\"get\", status=\"miss\"}[5m])) * 100) < 80 and (rate(memcached_commands_total{command=\"get\", status=\"hit\"}[5m]) + rate(memcached_commands_total{command=\"get\", status=\"miss\"}[5m])) > 0'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached low cache hit rate (< 80%) (instance {{ $labels.instance }})\n        description: \"Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MemcachedConnectionsRejected\n      expr: 'increase(memcached_connections_rejected_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Memcached connections rejected (instance {{ $labels.instance }})\n        description: \"Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MemcachedItemsTooLarge\n      expr: 'increase(memcached_item_too_large_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Memcached items too large (instance {{ $labels.instance }})\n        description: \"Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/minio/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: MinioClusterDiskOffline\n      expr: 'minio_cluster_drive_offline_total > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Minio cluster disk offline (instance {{ $labels.instance }})\n        description: \"One or more MinIO cluster drives are offline\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MinioNodeDiskOffline\n      expr: 'minio_cluster_nodes_offline_total > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Minio node disk offline (instance {{ $labels.instance }})\n        description: \"One or more MinIO cluster nodes are offline\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MinioDiskSpaceUsage\n      expr: 'minio_cluster_capacity_raw_free_bytes / minio_cluster_capacity_raw_total_bytes * 100 < 10 and minio_cluster_capacity_raw_total_bytes > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Minio disk space usage (instance {{ $labels.instance }})\n        description: \"MinIO available free space is low (< 10%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/mongodb/dcu-mongodb-exporter.yml",
    "content": "groups:\n\n- name: DcuMongodbExporter\n\n  \n  rules:\n\n    - alert: MongodbReplicationLag(dcu)\n      expr: 'avg(mongodb_replset_member_optime_date{state=\"PRIMARY\"}) - avg(mongodb_replset_member_optime_date{state=\"SECONDARY\"}) > 10'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication lag (DCU) (instance {{ $labels.instance }})\n        description: \"MongoDB replication lag is more than 10s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationStatus3\n      expr: 'mongodb_replset_member_state == 3'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication Status 3 (instance {{ $labels.instance }})\n        description: \"MongoDB replica set member is in state 3 (RECOVERING): it is either performing startup self-checks or transitioning from completing a rollback or resync\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationStatus6\n      expr: 'mongodb_replset_member_state == 6'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication Status 6 (instance {{ $labels.instance }})\n        description: \"MongoDB replica set member is in state 6 (UNKNOWN): its state, as seen from another member of the set, is not yet known\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationStatus8\n      expr: 'mongodb_replset_member_state == 8'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication Status 8 (instance {{ $labels.instance }})\n        description: \"MongoDB replica set member is in state 8 (DOWN): as seen from another member of the set, it is unreachable\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationStatus9\n      expr: 'mongodb_replset_member_state == 9'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: 
MongoDB replication Status 9 (instance {{ $labels.instance }})\n        description: \"MongoDB replica set member is in state 9 (ROLLBACK): it is actively performing a rollback and its data is not available for reads\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationStatus10\n      expr: 'mongodb_replset_member_state == 10'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication Status 10 (instance {{ $labels.instance }})\n        description: \"MongoDB replica set member is in state 10 (REMOVED): it was once in a replica set but was subsequently removed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbNumberCursorsOpen(dcu)\n      expr: 'mongodb_metrics_cursor_open{state=\"total_open\"} > 10000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB number cursors open (DCU) (instance {{ $labels.instance }})\n        description: \"Too many cursors opened by MongoDB for clients (> 10k)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbCursorsTimeouts(dcu)\n      expr: 'increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB cursors timeouts (DCU) (instance {{ $labels.instance }})\n        description: \"Too many cursors are timing out ({{ $value }} in the last minute)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbTooManyConnections(dcu)\n      expr: 'mongodb_connections{state=\"current\"} / (mongodb_connections{state=\"current\"} + mongodb_connections{state=\"available\"}) * 100 > 80 and (mongodb_connections{state=\"current\"} + mongodb_connections{state=\"available\"}) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB too many connections (DCU) (instance {{ $labels.instance }})\n        description: \"Too many connections (> 
80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/mongodb/percona-mongodb-exporter.yml",
    "content": "groups:\n\n- name: PerconaMongodbExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MongodbDown\n      expr: 'mongodb_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB Down (instance {{ $labels.instance }})\n        description: \"MongoDB instance is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MongodbReplicaMemberUnhealthy\n      expr: 'mongodb_rs_members_health == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mongodb replica member unhealthy (instance {{ $labels.instance }})\n        description: \"MongoDB replica member is not healthy\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbReplicationLag(percona)\n      expr: '(mongodb_rs_members_optimeDate{member_state=\"PRIMARY\"} - on (set) group_right mongodb_rs_members_optimeDate{member_state=\"SECONDARY\"}) / 1000 > 10'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication lag (Percona) (instance {{ $labels.instance }})\n        description: \"Mongodb replication lag is more than 10s\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This query mixes old (mongodb_mongod_*) and new (mongodb_rs_*) metric names. 
It requires the Percona exporter to run with --compatible-mode to expose both.\n    - alert: MongodbReplicationHeadroom\n      expr: 'sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state=\"PRIMARY\"} - on (set) group_right mongodb_rs_members_optimeDate{member_state=\"SECONDARY\"})) <= 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: MongoDB replication headroom (instance {{ $labels.instance }})\n        description: \"MongoDB replication headroom is <= 0\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbNumberCursorsOpen(percona)\n      expr: 'mongodb_ss_metrics_cursor_open{csr_type=\"total\"} > 10 * 1000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB number cursors open (Percona) (instance {{ $labels.instance }})\n        description: \"Too many cursors opened by MongoDB for clients (> 10k)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbCursorsTimeouts(percona)\n      expr: 'increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB cursors timeouts (Percona) (instance {{ $labels.instance }})\n        description: \"Too many cursors are timing out ({{ $value }} in the last minute)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MongodbTooManyConnections(percona)\n      expr: 'mongodb_ss_connections{conn_type=\"current\"} / (mongodb_ss_connections{conn_type=\"current\"} + mongodb_ss_connections{conn_type=\"available\"}) * 100 > 80 and (mongodb_ss_connections{conn_type=\"current\"} + mongodb_ss_connections{conn_type=\"available\"}) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MongoDB too many connections (Percona) (instance {{ 
$labels.instance }})\n        description: \"Too many connections (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/mongodb/stefanprodan-mgob-exporter.yml",
    "content": "groups:\n\n- name: StefanprodanMgobExporter\n\n  \n  rules:\n\n    - alert: MgobBackupFailed\n      expr: 'changes(mgob_scheduler_backup_total{status=\"500\"}[1h]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Mgob backup failed (instance {{ $labels.instance }})\n        description: \"MongoDB backup has failed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/mysql/mysqld-exporter.yml",
    "content": "groups:\n\n- name: MysqldExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MysqlDown\n      expr: 'mysql_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: MySQL down (instance {{ $labels.instance }})\n        description: \"MySQL instance is down on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlTooManyConnections(>80%)\n      expr: 'max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80 and mysql_global_variables_max_connections > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})\n        description: \"More than 80% of MySQL connections are in use on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlHighPreparedStatementsUtilization(>80%)\n      expr: 'max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80 and mysql_global_variables_max_prepared_stmt_count > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL high prepared statements utilization (> 80%) (instance {{ $labels.instance }})\n        description: \"High utilization of prepared statements (>80%) on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlHighThreadsRunning\n      expr: 'max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60 and mysql_global_variables_max_connections > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL high threads running (instance {{ $labels.instance }})\n        description: \"More than 60% of MySQL connections are in 
running state on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MysqlSlaveIoThreadNotRunning\n      expr: '( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})\n        description: \"MySQL Slave IO thread not running on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: MysqlSlaveSqlThreadNotRunning\n      expr: '( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})\n        description: \"MySQL Slave SQL thread not running on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlSlaveReplicationLag\n      expr: '( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and ON (instance) mysql_slave_status_master_server_id > 0 ) > 30'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: MySQL Slave replication lag (instance {{ $labels.instance }})\n        description: \"MySQL replication lag on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlSlowQueries\n      expr: 'increase(mysql_global_status_slow_queries[1m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL slow queries (instance {{ $labels.instance }})\n        description: \"MySQL server on {{ $labels.instance }} has new slow queries ({{ $value }} in the last minute).\\n  VALUE = {{ 
$value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlInnodbLogWaits\n      expr: 'rate(mysql_global_status_innodb_log_waits[15m]) > 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL InnoDB log waits (instance {{ $labels.instance }})\n        description: \"MySQL InnoDB log writes are stalling ({{ $value }} waits/s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlRestarted\n      expr: 'mysql_global_status_uptime < 60'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: MySQL restarted (instance {{ $labels.instance }})\n        description: \"MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlHighQps\n      expr: 'irate(mysql_global_status_questions[1m]) > 10000'\n      for: 2m\n      labels:\n        severity: info\n      annotations:\n        summary: MySQL High QPS (instance {{ $labels.instance }})\n        description: \"MySQL is being overloaded with unusual QPS (> 10k QPS).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlTooManyOpenFiles\n      expr: 'mysql_global_status_innodb_num_open_files / mysql_global_variables_open_files_limit * 100 > 75 and mysql_global_variables_open_files_limit > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL too many open files (instance {{ $labels.instance }})\n        description: \"MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlInnodbForceRecoveryIsEnabled\n      expr: 'mysql_global_variables_innodb_force_recovery != 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL InnoDB Force Recovery is enabled (instance {{ $labels.instance }})\n        
description: \"MySQL InnoDB force recovery is enabled on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: MysqlInnodbHistory_lenTooLong\n      expr: 'mysql_info_schema_innodb_metrics_transaction_trx_rseg_history_len > 50000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: MySQL InnoDB history_len too long (instance {{ $labels.instance }})\n        description: \"MySQL history_len (undo log) too long on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/nats/nats-exporter.yml",
    "content": "groups:\n\n- name: NatsExporter\n\n  \n  rules:\n\n    - alert: NatsHighRoutesCount\n      expr: 'gnatsd_varz_routes > 10'\n      for: 3m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high routes count (instance {{ $labels.instance }})\n        description: \"High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsHighMemoryUsage\n      expr: 'gnatsd_varz_mem > 200 * 1024 * 1024'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high memory usage (instance {{ $labels.instance }})\n        description: \"NATS server memory usage is above 200MB for {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsSlowConsumers\n      expr: 'gnatsd_varz_slow_consumers > 0'\n      for: 3m\n      labels:\n        severity: critical\n      annotations:\n        summary: Nats slow consumers (instance {{ $labels.instance }})\n        description: \"There are slow consumers in NATS for {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsServerDown\n      expr: 'absent(up{job=\"nats\"})'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Nats server down (instance {{ $labels.instance }})\n        description: \"NATS server has been down for more than 5 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # gnatsd_varz_cpu is a gauge reporting CPU percentage (0-100 scale).\n    - alert: NatsHighCpuUsage\n      expr: 'gnatsd_varz_cpu > 80'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high CPU usage (instance {{ $labels.instance }})\n        description: \"NATS server is using more than 80% CPU for the last 5 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
NatsHighNumberOfConnections\n      expr: 'gnatsd_connz_num_connections > 1000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high number of connections (instance {{ $labels.instance }})\n        description: \"NATS server has more than 1000 active connections\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsHighJetstreamStoreUsage\n      expr: 'gnatsd_varz_jetstream_stats_storage / gnatsd_varz_jetstream_config_max_storage > 0.8 and gnatsd_varz_jetstream_config_max_storage > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high JetStream store usage (instance {{ $labels.instance }})\n        description: \"JetStream store usage is over 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsHighJetstreamMemoryUsage\n      expr: 'gnatsd_varz_jetstream_stats_memory / gnatsd_varz_jetstream_config_max_memory > 0.8 and gnatsd_varz_jetstream_config_max_memory > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high JetStream memory usage (instance {{ $labels.instance }})\n        description: \"JetStream memory usage is over 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsHighNumberOfSubscriptions\n      expr: 'gnatsd_connz_subscriptions > 1000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high number of subscriptions (instance {{ $labels.instance }})\n        description: \"NATS server has more than 1000 active subscriptions\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsHighPendingBytes\n      expr: 'gnatsd_connz_pending_bytes > 100000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats high pending bytes (instance {{ $labels.instance }})\n        description: \"NATS server has more than 100,000 pending 
bytes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsTooManyErrors\n      expr: 'increase(gnatsd_varz_jetstream_stats_api_errors[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats too many errors (instance {{ $labels.instance }})\n        description: \"NATS server has encountered {{ $value }} JetStream API errors in the last 5 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsJetstreamAccountsExceeded\n      expr: 'sum(gnatsd_varz_jetstream_stats_accounts) > 100'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats JetStream accounts exceeded (instance {{ $labels.instance }})\n        description: \"JetStream has more than 100 active accounts\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NatsLeafNodeConnectionIssue\n      expr: 'gnatsd_varz_leafnodes == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nats leaf node connection issue (instance {{ $labels.instance }})\n        description: \"No leaf node connections on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/netdata/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    # This is a gauge metric (not a counter). Checking idle < 20% means CPU usage > 80%.\n    - alert: NetdataHighCpuUsage\n      expr: 'netdata_cpu_cpu_percentage_average{dimension=\"idle\"} < 20'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata high CPU usage (instance {{ $labels.instance }})\n        description: \"Netdata high CPU usage (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataCpuStealNoisyNeighbor\n      expr: 'netdata_cpu_cpu_percentage_average{dimension=\"steal\"} > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata CPU steal noisy neighbor (instance {{ $labels.instance }})\n        description: \"CPU steal is > 10%. A noisy neighbor is degrading VM performance or a spot instance may be out of credit.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataHighMemoryUsage\n      expr: '100 / netdata_system_ram_MiB_average * netdata_system_ram_MiB_average{dimension=~\"free|cached\"} < 20 and netdata_system_ram_MiB_average > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata high memory usage (instance {{ $labels.instance }})\n        description: \"Netdata high memory usage (> 80%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataLowDiskSpace\n      expr: '100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~\"avail|cached\"} < 20 and netdata_disk_space_GB_average > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata low disk space (instance {{ $labels.instance }})\n        description: \"Netdata low disk space (< 20% available)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataPredictedDiskFull\n      expr: 
'predict_linear(netdata_disk_space_GB_average{dimension=~\"avail|cached\"}[3h], 24 * 3600) < 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata predicted disk full (instance {{ $labels.instance }})\n        description: \"Netdata predicted disk full in 24 hours\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataMdMismatchCntUnsynchronizedBlocks\n      expr: 'netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }})\n        description: \"RAID array has unsynchronized blocks\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataDiskReallocatedSectors\n      expr: 'increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Netdata disk reallocated sectors (instance {{ $labels.instance }})\n        description: \"Disk reallocated sectors detected ({{ $value }} sectors)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataDiskCurrentPendingSector\n      expr: 'netdata_smartd_log_current_pending_sector_count_sectors_average > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata disk current pending sector (instance {{ $labels.instance }})\n        description: \"Disk current pending sector\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NetdataReportedUncorrectableDiskSectors\n      expr: 'increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }})\n        description: \"Reported 
uncorrectable disk sectors ({{ $value }} sectors)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/nginx/knyar-nginx-exporter.yml",
    "content": "groups:\n\n- name: KnyarNginxExporter\n\n  \n  rules:\n\n    - alert: NginxHighHttp4xxErrorRate\n      expr: 'sum(rate(nginx_http_requests_total{status=~\"^4..\"}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 4xx (> 5%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NginxHighHttp5xxErrorRate\n      expr: 'sum(rate(nginx_http_requests_total{status=~\"^5..\"}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})\n        description: \"Too many HTTP requests with status 5xx (> 5%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NginxLatencyHigh\n      expr: 'histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nginx latency high (instance {{ $labels.instance }})\n        description: \"Nginx p99 latency is higher than 3 seconds\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/nomad/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: NomadJobFailed\n      expr: 'nomad_nomad_job_summary_failed > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nomad job failed (instance {{ $labels.instance }})\n        description: \"Nomad job failed\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NomadJobLost\n      expr: 'nomad_nomad_job_summary_lost > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nomad job lost (instance {{ $labels.instance }})\n        description: \"Nomad job lost\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NomadJobQueued\n      expr: 'nomad_nomad_job_summary_queued > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nomad job queued (instance {{ $labels.instance }})\n        description: \"Nomad job queued\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: NomadBlockedEvaluation\n      expr: 'nomad_nomad_blocked_evals_total_blocked > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Nomad blocked evaluation (instance {{ $labels.instance }})\n        description: \"Nomad blocked evaluation\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/openebs/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: OpenebsUsedPoolCapacity\n      expr: 'openebs_used_pool_capacity_percent > 80'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenEBS used pool capacity (instance {{ $labels.instance }})\n        description: \"OpenEBS pool uses more than 80% of its capacity\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/openstack/openstack-exporter.yml",
    "content": "groups:\n\n- name: OpenstackExporter\n\n  \n  rules:\n\n    - alert: OpenstackExporterDown\n      expr: 'up{job=~\".*openstack.*\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenStack exporter down (instance {{ $labels.instance }})\n        description: \"The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNovaAgentDown\n      expr: 'openstack_nova_agent_state{adminState=\"enabled\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenStack Nova agent down (instance {{ $labels.instance }})\n        description: \"Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNeutronAgentDown\n      expr: 'openstack_neutron_agent_state{adminState=\"up\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenStack Neutron agent down (instance {{ $labels.instance }})\n        description: \"Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackCinderAgentDown\n      expr: 'openstack_cinder_agent_state{adminState=\"enabled\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenStack Cinder agent down (instance {{ $labels.instance }})\n        description: \"Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The threshold of 90% is a rough default. 
Adjust based on your overcommit ratio and workload patterns.\n    - alert: OpenstackHypervisorHighVcpuUsage\n      expr: 'openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9 and openstack_nova_vcpus_available > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack hypervisor high vCPU usage (instance {{ $labels.instance }})\n        description: \"Hypervisor {{ $labels.hostname }} vCPU usage is above 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns.\n    - alert: OpenstackHypervisorHighMemoryUsage\n      expr: 'openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.9 and openstack_nova_memory_available_bytes > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack hypervisor high memory usage (instance {{ $labels.instance }})\n        description: \"Hypervisor {{ $labels.hostname }} memory usage is above 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackHypervisorHighDiskUsage\n      expr: 'openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.9 and openstack_nova_local_storage_available_bytes > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack hypervisor high disk usage (instance {{ $labels.instance }})\n        description: \"Hypervisor {{ $labels.hostname }} local disk usage is above 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # A value of -1 for limits_vcpus_max means unlimited quota (no limit set).\n    - alert: OpenstackNovaTenantVcpuQuotaNearlyExhausted\n      expr: 'openstack_nova_limits_vcpus_used / openstack_nova_limits_vcpus_max > 0.9 and openstack_nova_limits_vcpus_max > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n 
       summary: OpenStack Nova tenant vCPU quota nearly exhausted (instance {{ $labels.instance }})\n        description: \"Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNovaTenantMemoryQuotaNearlyExhausted\n      expr: 'openstack_nova_limits_memory_used / openstack_nova_limits_memory_max > 0.9 and openstack_nova_limits_memory_max > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Nova tenant memory quota nearly exhausted (instance {{ $labels.instance }})\n        description: \"Tenant {{ $labels.tenant }} has used over 90% of its memory quota\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNovaTenantInstanceQuotaNearlyExhausted\n      expr: 'openstack_nova_limits_instances_used / openstack_nova_limits_instances_max > 0.9 and openstack_nova_limits_instances_max > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Nova tenant instance quota nearly exhausted (instance {{ $labels.instance }})\n        description: \"Tenant {{ $labels.tenant }} has used over 90% of its instance quota\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackCinderTenantVolumeQuotaNearlyExhausted\n      expr: 'openstack_cinder_limits_volume_used_gb / openstack_cinder_limits_volume_max_gb > 0.9 and openstack_cinder_limits_volume_max_gb > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Cinder tenant volume quota nearly exhausted (instance {{ $labels.instance }})\n        description: \"Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackCinderPoolLowFreeCapacity\n      expr: 'openstack_cinder_pool_capacity_free_gb / openstack_cinder_pool_capacity_total_gb < 0.1 and 
openstack_cinder_pool_capacity_total_gb > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Cinder pool low free capacity (instance {{ $labels.instance }})\n        description: \"Cinder storage pool {{ $labels.name }} has less than 10% free capacity\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNeutronFloatingIpsAssociatedButNotActive\n      expr: 'openstack_neutron_floating_ips_associated_not_active > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Neutron floating IPs associated but not active (instance {{ $labels.instance }})\n        description: \"{{ $value }} floating IPs are associated to a private IP but are not in ACTIVE state\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNeutronRoutersNotActive\n      expr: 'openstack_neutron_routers_not_active > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Neutron routers not active (instance {{ $labels.instance }})\n        description: \"{{ $value }} Neutron routers are not in ACTIVE state\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNeutronSubnetIpPoolExhaustion\n      expr: 'openstack_neutron_network_ip_availabilities_used / openstack_neutron_network_ip_availabilities_total > 0.9 and openstack_neutron_network_ip_availabilities_total > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Neutron subnet IP pool exhaustion (instance {{ $labels.instance }})\n        description: \"Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNeutronPortsWithoutIps\n      expr: 'openstack_neutron_ports_no_ips > 0'\n      for: 5m\n      labels:\n        severity: warning\n  
    annotations:\n        summary: OpenStack Neutron ports without IPs (instance {{ $labels.instance }})\n        description: \"{{ $value }} active ports have no IP addresses assigned\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackLoadBalancerNotOnline\n      expr: 'openstack_loadbalancer_loadbalancer_status{operating_status!=\"ONLINE\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack load balancer not online (instance {{ $labels.instance }})\n        description: \"Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackNovaInstancesInErrorState\n      expr: 'sum(openstack_nova_server_status{status=\"ERROR\"}) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Nova instances in ERROR state (instance {{ $labels.instance }})\n        description: \"{{ $value }} Nova instances are in ERROR state\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpenstackCinderVolumesInErrorState\n      expr: 'openstack_cinder_volume_status_counter{status=~\"error.*\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack Cinder volumes in error state (instance {{ $labels.instance }})\n        description: \"{{ $value }} Cinder volumes are in an error state\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This alert factors in the allocation ratio to compute effective capacity.\n    # The threshold of 90% is a rough default. 
Adjust based on your allocation ratios and workload patterns.\n    - alert: OpenstackPlacementResourceHighUsage\n      expr: 'openstack_placement_resource_usage / (openstack_placement_resource_total * openstack_placement_resource_allocation_ratio) > 0.9 and openstack_placement_resource_total > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenStack placement resource high usage (instance {{ $labels.instance }})\n        description: \"Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/opentelemetry-collector/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  # OpenTelemetry Collector self-monitoring metrics are exposed on port 8888 by default at the /metrics endpoint.\n  # These alerts monitor the collector's health when metrics are ingested via the Prometheus OTLP endpoint or scraped directly.\n  # All collector internal metrics are prefixed with 'otelcol_'.\n  \n  rules:\n\n    - alert: OpentelemetryCollectorDown\n      expr: 'up{job=~\".*otel.*collector.*\"} == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenTelemetry Collector down (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector instance has disappeared or is not being scraped\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpentelemetryCollectorReceiverRefusedSpans\n      expr: 'rate(otelcol_receiver_refused_spans[5m]) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenTelemetry Collector receiver refused spans (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s spans on {{ $labels.receiver }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpentelemetryCollectorReceiverRefusedMetricPoints\n      expr: 'rate(otelcol_receiver_refused_metric_points[5m]) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenTelemetry Collector receiver refused metric points (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s metric points on {{ $labels.receiver }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpentelemetryCollectorReceiverRefusedLogRecords\n      expr: 'rate(otelcol_receiver_refused_log_records[5m]) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenTelemetry 
Collector receiver refused log records (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector is refusing {{ $value | humanize }}/s log records on {{ $labels.receiver }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: OpentelemetryCollectorExporterFailedSpans\n      expr: 'rate(otelcol_exporter_send_failed_spans[5m]) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector exporter failed spans (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s spans via {{ $labels.exporter }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: OpentelemetryCollectorExporterFailedMetricPoints\n      expr: 'rate(otelcol_exporter_send_failed_metric_points[5m]) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector exporter failed metric points (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s metric points via {{ $labels.exporter }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: OpentelemetryCollectorExporterFailedLogRecords\n      expr: 'rate(otelcol_exporter_send_failed_log_records[5m]) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector exporter failed log records (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector failing to send {{ $value | humanize }}/s log records via {{ $labels.exporter }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: 
OpentelemetryCollectorExporterQueueNearlyFull\n      expr: '(otelcol_exporter_queue_size / on(instance, job, exporter) otelcol_exporter_queue_capacity) > 0.8 and otelcol_exporter_queue_capacity > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector exporter queue nearly full (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: OpentelemetryCollectorProcessorRefusedSpans\n      expr: 'rate(otelcol_processor_refused_spans[5m]) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector processor refused spans (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans ({{ $value | humanize }}/s), likely due to backpressure.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: OpentelemetryCollectorProcessorRefusedMetricPoints\n      expr: 'rate(otelcol_processor_refused_metric_points[5m]) > 0.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector processor refused metric points (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points ({{ $value | humanize }}/s), likely due to backpressure.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpentelemetryCollectorHighMemoryUsage\n      expr: '(otelcol_process_runtime_heap_alloc_bytes{job=~\".*otel.*collector.*\"} / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes{job=~\".*otel.*collector.*\"}) > 0.9'\n      for: 
5m\n      labels:\n        severity: warning\n      annotations:\n        summary: OpenTelemetry Collector high memory usage (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector memory usage is above 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OpentelemetryCollectorOtlpReceiverErrors\n      expr: 'rate(otelcol_receiver_accepted_spans{receiver=~\"otlp\"}[5m]) == 0 and rate(otelcol_receiver_refused_spans{receiver=~\"otlp\"}[5m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: OpenTelemetry Collector OTLP receiver errors (instance {{ $labels.instance }})\n        description: \"OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/oracle-database/iamseth-oracledb-exporter.yml",
    "content": "groups:\n\n- name: IamsethOracledbExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: OracleDbDown\n      expr: 'oracledb_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Oracle DB down (instance {{ $labels.instance }})\n        description: \"Oracle Database instance is down on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is workload-dependent. Adjust 85% to suit your environment.\n    - alert: OracleDbSessionsReachingLimit(>85%)\n      expr: 'oracledb_resource_current_utilization{resource_name=\"sessions\"} / oracledb_resource_limit_value{resource_name=\"sessions\"} * 100 > 85 and oracledb_resource_limit_value{resource_name=\"sessions\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB sessions reaching limit (> 85%) (instance {{ $labels.instance }})\n        description: \"Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is workload-dependent. 
Adjust 85% to suit your environment.\n    - alert: OracleDbProcessesReachingLimit(>85%)\n      expr: 'oracledb_resource_current_utilization{resource_name=\"processes\"} / oracledb_resource_limit_value{resource_name=\"processes\"} * 100 > 85 and oracledb_resource_limit_value{resource_name=\"processes\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB processes reaching limit (> 85%) (instance {{ $labels.instance }})\n        description: \"Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OracleDbTablespaceReachingCapacity(>85%)\n      expr: 'oracledb_tablespace_used_percent > 85'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB tablespace reaching capacity (> 85%) (instance {{ $labels.instance }})\n        description: \"Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OracleDbTablespaceFull(>95%)\n      expr: 'oracledb_tablespace_used_percent > 95'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Oracle DB tablespace full (> 95%) (instance {{ $labels.instance }})\n        description: \"Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # A high rollback rate (>20%) often indicates application-level issues such as deadlocks, constraint violations, or poorly designed transactions.\n    - alert: OracleDbHighUserRollbacks\n      expr: 'rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100 > 20 and 
(rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB high user rollbacks (instance {{ $labels.instance }})\n        description: \"Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is highly workload-dependent. Adjust 200 to suit your environment.\n    - alert: OracleDbTooManyActiveSessions\n      expr: 'oracledb_sessions_value{status=\"ACTIVE\", type=\"USER\"} > 200'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB too many active sessions (instance {{ $labels.instance }})\n        description: \"Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The metric from v$waitclassmetric is already a normalized rate (centiseconds per second). Threshold 300 means 3 seconds of I/O wait per second of wall time.\n    - alert: OracleDbHighWaitTime(userI/o)\n      expr: 'oracledb_wait_time_user_io > 300'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Oracle DB high wait time (user I/O) (instance {{ $labels.instance }})\n        description: \"Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/patroni/embedded-exporter-patroni.yml",
    "content": "groups:\n\n- name: EmbeddedExporterPatroni\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: PatroniHasNoLeader\n      expr: '(max by (scope) (patroni_primary) < 1) and (max by (scope) (patroni_standby_leader) < 1)'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Patroni has no Leader (instance {{ $labels.instance }})\n        description: \"A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/pgbouncer/spreaker-pgbouncer-exporter.yml",
    "content": "groups:\n\n- name: SpreakerPgbouncerExporter\n\n  \n  rules:\n\n    - alert: PgbouncerActiveConnections\n      expr: 'pgbouncer_pools_server_active_connections > 200'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: PGBouncer active connections (instance {{ $labels.instance }})\n        description: \"PGBouncer pools are filling up\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PgbouncerErrors\n      expr: 'increase(pgbouncer_errors_count{errmsg!=\"server conn crashed?\"}[1m]) > 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: PGBouncer errors (instance {{ $labels.instance }})\n        description: \"PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PgbouncerMaxConnections\n      expr: 'increase(pgbouncer_errors_count{errmsg=\"no more connections allowed (max_client_conn)\"}[2m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: PGBouncer max connections (instance {{ $labels.instance }})\n        description: \"The number of PGBouncer client connections has reached max_client_conn.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/php-fpm/bakins-fpm-exporter.yml",
    "content": "groups:\n\n- name: BakinsFpmExporter\n\n  \n  rules:\n\n    - alert: Php-fpmMax-childrenReached\n      expr: 'sum(increase(phpfpm_max_children_reached_total[5m])) by (instance) > 3'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: PHP-FPM max-children reached (instance {{ $labels.instance }})\n        description: \"PHP-FPM reached max children on {{ $labels.instance }} ({{ $value }} times in the last 5m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/postgresql/postgres-exporter.yml",
    "content": "groups:\n\n- name: PostgresExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: PostgresqlDown\n      expr: 'pg_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql down (instance {{ $labels.instance }})\n        description: \"Postgresql instance is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlRestarted\n      expr: 'time() - pg_postmaster_start_time_seconds < 60'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql restarted (instance {{ $labels.instance }})\n        description: \"Postgresql restarted\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlExporterError\n      expr: 'pg_exporter_last_scrape_error > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql exporter error (instance {{ $labels.instance }})\n        description: \"Postgresql exporter is showing errors. 
A query may be buggy in query.yaml\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlTableNotAutoVacuumed\n      expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }})\n        description: \"Table {{ $labels.relname }} has not been auto vacuumed for 10 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlTableNotAutoAnalyzed\n      expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql table not auto analyzed (instance {{ $labels.instance }})\n        description: \"Table {{ $labels.relname }} has not been auto analyzed for 10 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlTooManyConnections\n      expr: 'sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8)'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql too many connections (instance {{ $labels.instance }})\n        description: \"PostgreSQL instance has too many connections (> 80%).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlNotEnoughConnections\n      expr: 'sum by (datname) (pg_stat_activity_count{datname!~\"template.*|postgres\"}) < 5'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql not 
enough connections (instance {{ $labels.instance }})\n        description: \"PostgreSQL instance should have more connections (> 5)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlDeadLocks\n      expr: 'increase(pg_stat_database_deadlocks{datname!~\"template.*|postgres\"}[1m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql dead locks (instance {{ $labels.instance }})\n        description: \"PostgreSQL has deadlocks ({{ $value }} in the last minute)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlHighRollbackRate\n      expr: 'sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~\"template.*|postgres\",datid!=\"0\"}[3m])))) > 0.02'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql high rollback rate (instance {{ $labels.instance }})\n        description: \"Ratio of transactions being aborted compared to committed is > 2 %\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlCommitRateLow\n      expr: 'increase(pg_stat_database_xact_commit{datname!~\"template.*|postgres\",datid!=\"0\"}[5m]) < 5'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql commit rate low (instance {{ $labels.instance }})\n        description: \"Postgresql seems to be processing very few transactions\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlLowXidConsumption\n      expr: 'rate(pg_txid_current[1m]) < 5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql low XID consumption (instance {{ $labels.instance }})\n        description: 
\"Postgresql seems to be consuming transaction IDs very slowly\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlUnusedReplicationSlot\n      expr: '(pg_replication_slots_active == 0) and (pg_replication_is_replica == 0)'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql unused replication slot (instance {{ $labels.instance }})\n        description: \"Unused Replication Slots\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlTooManyDeadTuples\n      expr: '((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 and (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql too many dead tuples (instance {{ $labels.instance }})\n        description: \"The number of PostgreSQL dead tuples is too large\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlConfigurationChanged\n      expr: '{__name__=~\"pg_settings_.*\",__name__!=\"pg_settings_transaction_read_only\"} != ON(__name__, instance) {__name__=~\"pg_settings_.*\",__name__!=\"pg_settings_transaction_read_only\"} OFFSET 5m'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Postgresql configuration changed (instance {{ $labels.instance }})\n        description: \"Postgres Database configuration change has occurred\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlSslCompressionActive\n      expr: 'sum by (instance) (pg_stat_ssl_compression) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql SSL compression active (instance {{ $labels.instance }})\n        description: \"Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. 
Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlTooManyLocksAcquired\n      expr: '((sum by (instance) (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 and (pg_settings_max_locks_per_transaction * pg_settings_max_connections) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Postgresql too many locks acquired (instance {{ $labels.instance }})\n        description: \"Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n    - alert: PostgresqlBloatIndexHigh(>80%)\n      expr: 'pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }})\n        description: \"The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n    - alert: PostgresqlBloatTableHigh(>80%)\n      expr: 'pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }})\n        description: \"The table {{ $labels.relname }} is bloated. 
You should execute `VACUUM {{ $labels.relname }};`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737\n    - alert: PostgresqlInvalidIndex\n      expr: 'pg_general_index_info_pg_relation_size{indexrelname=~\".*ccnew.*\"}'\n      for: 6h\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql invalid index (instance {{ $labels.instance }})\n        description: \"The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PostgresqlReplicationLag\n      expr: 'pg_replication_lag_seconds > 5'\n      for: 30s\n      labels:\n        severity: warning\n      annotations:\n        summary: Postgresql replication lag (instance {{ $labels.instance }})\n        description: \"The PostgreSQL replication lag is high (> 5s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/process-exporter/process-exporter.yml",
    "content": "groups:\n\n- name: ProcessExporter\n\n  \n  rules:\n\n    - alert: ProcessExporterGroupDown\n      expr: 'namedprocess_namegroup_num_procs == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter group down (instance {{ $labels.instance }})\n        description: \"No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 4GB is arbitrary and depends on the process being monitored. Adjust per group.\n    - alert: ProcessExporterHighMemoryUsage\n      expr: 'namedprocess_namegroup_memory_bytes{memtype=\"resident\"} > 4e+09'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high memory usage (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Value is core-equivalent %: 100% = 1 full core, 200% = 2 cores, etc. Threshold of 80% is per-core. Adjust based on expected workload.\n    - alert: ProcessExporterHighCpuUsage\n      expr: 'rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high CPU usage (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). 
(instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProcessExporterHighFileDescriptorUsage\n      expr: 'namedprocess_namegroup_worst_fd_ratio > 0.8'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high file descriptor usage (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProcessExporterFileDescriptorsExhausted\n      expr: 'namedprocess_namegroup_worst_fd_ratio > 0.95'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Process exporter file descriptors exhausted (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 512MB is arbitrary. Adjust per group and environment.\n    - alert: ProcessExporterHighSwapUsage\n      expr: 'namedprocess_namegroup_memory_bytes{memtype=\"swapped\"} > 512e+06'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high swap usage (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ProcessExporterZombieProcesses\n      expr: 'namedprocess_namegroup_states{state=\"Zombie\"} > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter zombie processes (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} has {{ $value }} zombie processes. 
(instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Filters to voluntary switches only — involuntary switches are normal under CPU contention. Threshold of 50000/s is a rough default. Adjust based on workload.\n    - alert: ProcessExporterHighContextSwitching\n      expr: 'rate(namedprocess_namegroup_context_switches_total{ctxswitchtype=\"voluntary\"}[5m]) > 50000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high context switching (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 100MB/s is arbitrary. Adjust per group.\n    - alert: ProcessExporterHighDiskWriteIo\n      expr: 'rate(namedprocess_namegroup_write_bytes_total[5m]) > 100e+06'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Process exporter high disk write IO (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Detects restarts by watching for changes in the oldest process start time within the group.\n    - alert: ProcessExporterProcessRestarting\n      expr: 'changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) > 0 and namedprocess_namegroup_num_procs > 0'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: Process exporter process restarting (instance {{ $labels.instance }})\n        description: \"Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/prometheus-self-monitoring/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: PrometheusJobMissing\n      expr: 'absent(up{job=\"prometheus\"})'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus job missing (instance {{ $labels.instance }})\n        description: \"A Prometheus job has disappeared\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Only fire if at least one target in the job is still up.\n    # If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.\n    - alert: PrometheusTargetMissing\n      expr: 'up == 0 unless on(job) (sum by (job) (up) == 0)'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus target missing (instance {{ $labels.instance }})\n        description: \"A Prometheus target has disappeared. An exporter might have crashed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusAllTargetsMissing\n      expr: 'sum by (job) (up) == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus all targets missing (instance {{ $labels.instance }})\n        description: \"A Prometheus job no longer has any living targets.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTargetMissingWithWarmupTime\n      expr: 'sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})\n        description: \"Allow a job time to start up (10 minutes) before alerting that it's down.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusConfigurationReloadFailure\n      expr: 'prometheus_config_last_reload_successful != 1'\n      
for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus configuration reload failure (instance {{ $labels.instance }})\n        description: \"Prometheus configuration reload error\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTooManyRestarts\n      expr: 'changes(process_start_time_seconds{job=~\"prometheus|pushgateway|alertmanager\"}[15m]) > 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus too many restarts (instance {{ $labels.instance }})\n        description: \"Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusAlertmanagerJobMissing\n      expr: 'absent(up{job=\"alertmanager\"})'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})\n        description: \"A Prometheus AlertManager job has disappeared\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusAlertmanagerConfigurationReloadFailure\n      expr: 'alertmanager_config_last_reload_successful != 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})\n        description: \"AlertManager configuration reload error\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusAlertmanagerConfigNotSynced\n      expr: 'count(count_values(\"config_hash\", alertmanager_config_hash)) > 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})\n        description: \"Configurations of AlertManager cluster instances are out of sync\\n  VALUE = {{ $value }}\\n  LABELS 
= {{ $labels }}\"\n\n    - alert: PrometheusAlertmanagerE2eDeadManSwitch\n      expr: 'vector(1)'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})\n        description: \"Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusNotConnectedToAlertmanager\n      expr: 'prometheus_notifications_alertmanagers_discovered < 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})\n        description: \"Prometheus cannot connect to the Alertmanager\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusRuleEvaluationFailures\n      expr: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTemplateTextExpansionFailures\n      expr: 'increase(prometheus_template_text_expansion_failures_total[3m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus template text expansion failures (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} template text expansion failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusRuleEvaluationSlow\n      expr: 'prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds'\n      for: 5m\n      labels:\n       
 severity: warning\n      annotations:\n        summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})\n        description: \"Prometheus rule evaluation took more time than the scheduled interval. This indicates slow storage backend access or an overly complex query.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusNotificationsBacklog\n      expr: 'min_over_time(prometheus_notifications_queue_length[10m]) > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus notifications backlog (instance {{ $labels.instance }})\n        description: \"The Prometheus notification queue has not been empty for 10 minutes\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusAlertmanagerNotificationFailing\n      expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})\n        description: \"Alertmanager is failing to send notifications ({{ $value }} notifications/s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTargetEmpty\n      expr: 'prometheus_sd_discovered_targets == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus target empty (instance {{ $labels.instance }})\n        description: \"Prometheus has no target in service discovery\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTargetScrapingSlow\n      expr: 'prometheus_target_interval_length_seconds{quantile=\"0.9\"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile=\"0.5\"} > 1.05'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus target scraping slow (instance {{ $labels.instance }})\n        description: 
\"Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusLargeScrape\n      expr: 'increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus large scrape (instance {{ $labels.instance }})\n        description: \"Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTargetScrapeDuplicate\n      expr: 'increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})\n        description: \"Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbCheckpointCreationFailures\n      expr: 'increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} checkpoint creation failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbCheckpointDeletionFailures\n      expr: 'increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} checkpoint deletion failures\\n  VALUE = {{ 
$value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbCompactionsFailed\n      expr: 'increase(prometheus_tsdb_compactions_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} TSDB compaction failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbHeadTruncationsFailed\n      expr: 'increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} TSDB head truncation failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbReloadFailures\n      expr: 'increase(prometheus_tsdb_reloads_failures_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} TSDB reload failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbWalCorruptions\n      expr: 'increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} TSDB WAL corruptions\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTsdbWalTruncationsFailed\n      expr: 'increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Prometheus TSDB WAL truncations 
failed (instance {{ $labels.instance }})\n        description: \"Prometheus encountered {{ $value }} TSDB WAL truncation failures\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PrometheusTimeseriesCardinality\n      expr: 'label_replace(count by(__name__) ({__name__=~\".+\"}), \"name\", \"$1\", \"__name__\", \"(.+)\") > 10000'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Prometheus timeseries cardinality (instance {{ $labels.instance }})\n        description: \"The \\\"{{ $labels.name }}\\\" timeseries cardinality is getting very high: {{ $value }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/promtail/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: PromtailRequestErrors\n      expr: '100 * sum(rate(promtail_request_duration_seconds_count{status_code=~\"5..|failed\"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10 and sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Promtail request errors (instance {{ $labels.instance }})\n        description: \"The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \\\"%.2f\\\" $value }}% errors.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PromtailRequestLatency\n      expr: 'histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (namespace, job, route, instance, le)) > 1'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Promtail request latency (instance {{ $labels.instance }})\n        description: \"The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \\\"%.2f\\\" $value }}s 99th percentile latency.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/proxmox-ve/prometheus-pve-exporter.yml",
    "content": "groups:\n\n- name: PrometheusPveExporter\n\n  \n  rules:\n\n    - alert: PveNodeDown\n      expr: 'pve_up{id=~\"node/.*\"} == 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: PVE node down (instance {{ $labels.instance }})\n        description: \"Proxmox VE node {{ $labels.id }} is down.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This alert triggers for all VMs and containers that are not running.\n    # You may want to filter by specific guests using the `id` label, or exclude\n    # intentionally stopped guests with additional label matchers.\n    - alert: PveVm/ctDown\n      expr: 'pve_up{id=~\"(qemu|lxc)/.*\"} == 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE VM/CT down (instance {{ $labels.instance }})\n        description: \"Proxmox VE guest {{ $labels.id }} is not running.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveHighCpuUsage\n      expr: 'pve_cpu_usage_ratio * 100 > 90'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE high CPU usage (instance {{ $labels.instance }})\n        description: \"Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \\\"%.2f\\\" }}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveHighMemoryUsage\n      expr: 'pve_memory_usage_bytes / pve_memory_size_bytes * 100 > 90 and pve_memory_size_bytes > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE high memory usage (instance {{ $labels.instance }})\n        description: \"Proxmox VE memory usage is above 90% on {{ $labels.id }}. 
Current value: {{ $value | printf \\\"%.2f\\\" }}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveStorageFillingUp\n      expr: 'pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100 > 80 and pve_disk_size_bytes{id=~\"storage/.*\"} > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE storage filling up (instance {{ $labels.instance }})\n        description: \"Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf \\\"%.2f\\\" }}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveStorageAlmostFull\n      expr: 'pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100 > 95 and pve_disk_size_bytes{id=~\"storage/.*\"} > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: PVE storage almost full (instance {{ $labels.instance }})\n        description: \"Proxmox VE storage {{ $labels.id }} is above 95% used. 
Current value: {{ $value | printf \\\"%.2f\\\" }}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveGuestNotBackedUp\n      expr: 'pve_not_backed_up_total > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE guest not backed up (instance {{ $labels.instance }})\n        description: \"{{ $value }} Proxmox VE guest(s) are not covered by any backup job.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PveReplicationFailed\n      expr: 'pve_replication_failed_syncs > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: PVE replication failed (instance {{ $labels.instance }})\n        description: \"Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Loss of quorum means the cluster cannot make decisions about VM placement\n    # and fencing. This requires immediate attention.\n    - alert: PveClusterNotQuorate\n      expr: 'pve_cluster_info{quorate=\"0\"} == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: PVE cluster not quorate (instance {{ $labels.instance }})\n        description: \"Proxmox VE cluster has lost quorum.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/pulsar/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: PulsarSubscriptionHighNumberOfBacklogEntries\n      expr: 'sum(pulsar_subscription_back_log) by (subscription) > 5000'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }})\n        description: \"The number of subscription backlog entries is over 5k\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries\n      expr: 'sum(pulsar_subscription_back_log) by (subscription) > 100000'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }})\n        description: \"The number of subscription backlog entries is over 100k\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarTopicLargeBacklogStorageSize\n      expr: 'sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }})\n        description: \"The topic backlog storage size is over 5 GB\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarTopicVeryLargeBacklogStorageSize\n      expr: 'sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})\n        description: \"The topic backlog storage size is over 20 GB\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarHighWriteLatency\n      expr: 'sum(pulsar_storage_write_latency_overflow > 0) by (topic)'\n      for: 1h\n      labels:\n        severity: critical\n      
annotations:\n        summary: Pulsar high write latency (instance {{ $labels.instance }})\n        description: \"Messages cannot be written in a timely fashion\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarLargeMessagePayload\n      expr: 'sum(pulsar_entry_size_overflow > 0) by (topic)'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: Pulsar large message payload (instance {{ $labels.instance }})\n        description: \"Observing large message payload (> 1MB)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarHighLedgerDiskUsage\n      expr: 'sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75'\n      for: 1h\n      labels:\n        severity: critical\n      annotations:\n        summary: Pulsar high ledger disk usage (instance {{ $labels.instance }})\n        description: \"Observing Ledger Disk Usage (> 75%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarReadOnlyBookies\n      expr: 'count(bookie_SERVER_STATUS{} == 0) by (pod)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Pulsar read only bookies (instance {{ $labels.instance }})\n        description: \"Observing Readonly Bookies\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarHighNumberOfFunctionErrors\n      expr: 'sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Pulsar high number of function errors (instance {{ $labels.instance }})\n        description: \"Observing more than 10 Function errors per minute\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PulsarHighNumberOfSinkErrors\n      expr: 'sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10'\n      for: 1m\n     
 labels:\n        severity: critical\n      annotations:\n        summary: Pulsar high number of sink errors (instance {{ $labels.instance }})\n        description: \"Observing more than 10 Sink errors per minute\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/python/python-exporter.yml",
    "content": "groups:\n\n- name: PythonExporter\n\n  \n  rules:\n\n    - alert: PythonGcObjectsUncollectable\n      expr: 'increase(python_gc_objects_uncollectable_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Python GC objects uncollectable (instance {{ $labels.instance }})\n        description: \"Python has uncollectable objects, potential memory leak via reference cycles\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: PythonGcCollectionsHigh\n      expr: 'rate(python_gc_objects_collected_total[5m]) > 10000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Python GC collections high (instance {{ $labels.instance }})\n        description: \"Python GC is collecting too many objects (> 10k/s), high allocation pressure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not Python-specific.\n    - alert: PythonFileDescriptorsExhaustion\n      expr: '(process_open_fds / process_max_fds) * 100 > 90 and process_max_fds > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Python file descriptors exhaustion (instance {{ $labels.instance }})\n        description: \"Python process is running out of file descriptors (> 90% used)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Gen2 collection rate > 1/s is very high. In most applications, gen2 runs are infrequent. 
Adjust threshold based on your workload.\n    - alert: PythonGcGeneration2CollectionsHigh\n      expr: 'rate(python_gc_collections_total{generation=\"2\"}[5m]) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Python GC generation 2 collections high (instance {{ $labels.instance }})\n        description: \"Python full GC (generation 2) is running too frequently, indicating memory pressure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. Adjust based on your application's expected memory footprint.\n    - alert: PythonVirtualMemoryHigh\n      expr: 'process_virtual_memory_bytes > 4e9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Python virtual memory high (instance {{ $labels.instance }})\n        description: \"Python process virtual memory is high (> 4GB)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/rabbitmq/kbudde-rabbitmq-exporter.yml",
    "content": "groups:\n\n- name: KbuddeRabbitmqExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RabbitmqDown\n      expr: 'rabbitmq_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ down (instance {{ $labels.instance }})\n        description: \"RabbitMQ node down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RabbitmqClusterDown\n      expr: 'sum(rabbitmq_running) < 3'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ cluster down (instance {{ $labels.instance }})\n        description: \"Less than 3 nodes running in RabbitMQ cluster\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqClusterPartition\n      expr: 'rabbitmq_partitions > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ cluster partition (instance {{ $labels.instance }})\n        description: \"Cluster partition\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqOutOfMemory\n      expr: 'rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 and rabbitmq_node_mem_limit > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ out of memory (instance {{ $labels.instance }})\n        description: \"Memory available for RabbitMQ is low (< 10%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqInstanceTooManyConnections\n      expr: 'rabbitmq_connectionsTotal > 1000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ instance too many connections (instance {{ $labels.instance }})\n        description: \"RabbitMQ instance has too many connections (> 1000)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels 
}}\"\n\n    # Indicate the queue name in dedicated label.\n    - alert: RabbitmqDeadLetterQueueFillingUp\n      expr: 'rabbitmq_queue_messages{queue=\"my-dead-letter-queue\"} > 10'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ dead letter queue filling up (instance {{ $labels.instance }})\n        description: \"Dead letter queue is filling up (> 10 msgs)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Indicate the queue name in dedicated label.\n    - alert: RabbitmqTooManyMessagesInQueue\n      expr: 'rabbitmq_queue_messages_ready{queue=\"my-queue\"} > 1000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ too many messages in queue (instance {{ $labels.instance }})\n        description: \"Queue is filling up (> 1000 msgs)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Indicate the queue name in dedicated label.\n    - alert: RabbitmqSlowQueueConsuming\n      expr: 'time() - rabbitmq_queue_head_message_timestamp{queue=\"my-queue\"} > 60'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ slow queue consuming (instance {{ $labels.instance }})\n        description: \"Queue messages are consumed slowly (> 60s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Allows a short service restart.\n    - alert: RabbitmqNoConsumer\n      expr: 'rabbitmq_queue_consumers == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ no consumer (instance {{ $labels.instance }})\n        description: \"Queue has no consumer\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Indicate the queue name in dedicated label.\n    - alert: RabbitmqTooManyConsumers\n      expr: 'rabbitmq_queue_consumers{queue=\"my-queue\"} > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: 
RabbitMQ too many consumers (instance {{ $labels.instance }})\n        description: \"Queue should have only 1 consumer\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Indicate the exchange name in dedicated label.\n    - alert: RabbitmqInactiveExchange\n      expr: 'rate(rabbitmq_exchange_messages_published_in_total{exchange=\"my-exchange\"}[1m]) < 5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ inactive exchange (instance {{ $labels.instance }})\n        description: \"Exchange receives less than 5 msgs per second\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/rabbitmq/rabbitmq-exporter.yml",
    "content": "groups:\n\n- name: RabbitmqExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RabbitmqNodeDown\n      expr: 'sum(rabbitmq_build_info) < 3'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ node down (instance {{ $labels.instance }})\n        description: \"Less than 3 nodes running in RabbitMQ cluster\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RabbitmqNodeNotDistributed\n      expr: 'erlang_vm_dist_node_state < 3'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: RabbitMQ node not distributed (instance {{ $labels.instance }})\n        description: \"Distribution link state is not 'up'\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqInstancesDifferentVersions\n      expr: 'count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1'\n      for: 1h\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ instances different versions (instance {{ $labels.instance }})\n        description: \"Running different version of RabbitMQ in the same cluster, can lead to failure.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqMemoryHigh\n      expr: 'rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 and rabbitmq_resident_memory_limit_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ memory high (instance {{ $labels.instance }})\n        description: \"A node use more than 90% of allocated RAM\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqFileDescriptorsUsage\n      expr: 'rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 and rabbitmq_process_max_fds > 0'\n      for: 2m\n      labels:\n   
     severity: warning\n      annotations:\n        summary: RabbitMQ file descriptors usage (instance {{ $labels.instance }})\n        description: \"A node use more than 90% of file descriptors\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqTooManyReadyMessages\n      expr: 'sum(rabbitmq_queue_messages_ready) BY (queue) > 1000'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ too many ready messages (instance {{ $labels.instance }})\n        description: \"RabbitMQ too many ready messages on {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqTooManyUnackMessages\n      expr: 'sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ too many unack messages (instance {{ $labels.instance }})\n        description: \"Too many unacknowledged messages\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqTooManyConnections\n      expr: 'rabbitmq_connections > 1000'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ too many connections (instance {{ $labels.instance }})\n        description: \"The total connections of a node is too high\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqNoQueueConsumer\n      expr: 'rabbitmq_queue_consumers < 1'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: RabbitMQ no queue consumer (instance {{ $labels.instance }})\n        description: \"A queue has less than 1 consumer\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RabbitmqUnroutableMessages\n      expr: 'increase(rabbitmq_channel_messages_unroutable_returned_total[1m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[1m]) > 0'\n      for: 2m\n      labels:\n 
       severity: warning\n      annotations:\n        summary: RabbitMQ unroutable messages (instance {{ $labels.instance }})\n        description: \"A queue has unroutable messages ({{ $value }} in the last 1m)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/redis/oliver006-redis-exporter.yml",
    "content": "groups:\n\n- name: Oliver006RedisExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RedisDown\n      expr: 'redis_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis down (instance {{ $labels.instance }})\n        description: \"Redis instance is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisMissingMaster\n      expr: '(count(redis_instance_info{role=\"master\"}) or vector(0)) < 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis missing master (instance {{ $labels.instance }})\n        description: \"Redis cluster has no node marked as master.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: RedisTooManyMasters\n      expr: 'count(redis_instance_info{role=\"master\"}) > 1'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis too many masters (instance {{ $labels.instance }})\n        description: \"Redis cluster has too many nodes marked as master.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisDisconnectedSlaves\n      expr: 'count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis disconnected slaves (instance {{ $labels.instance }})\n        description: \"Redis not replicating for all slaves. 
Consider reviewing the redis replication status.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisReplicationBroken\n      expr: 'delta(redis_connected_slaves[1m]) < 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis replication broken (instance {{ $labels.instance }})\n        description: \"Redis instance lost a slave\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisClusterFlapping\n      expr: 'changes(redis_connected_slaves[1m]) > 1'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis cluster flapping (instance {{ $labels.instance }})\n        description: \"Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisMissingBackup\n      expr: 'time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 48'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Redis missing backup (instance {{ $labels.instance }})\n        description: \"Redis has not been backed up for 48 hours\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable.\n    - alert: RedisOutOfSystemMemory\n      expr: 'redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 and redis_total_system_memory_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Redis out of system memory (instance {{ $labels.instance }})\n        description: \"Redis is running out of system memory (> 90%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisOutOfConfiguredMaxmemory\n      expr: 'redis_memory_used_bytes / 
redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Redis out of configured maxmemory (instance {{ $labels.instance }})\n        description: \"Redis is running out of configured maxmemory (> 90%)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisTooManyConnections\n      expr: 'redis_connected_clients / redis_config_maxclients * 100 > 90 and redis_config_maxclients > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Redis too many connections (instance {{ $labels.instance }})\n        description: \"Redis is running out of connections (> 90% used)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisNotEnoughConnections\n      expr: 'redis_connected_clients < 5'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Redis not enough connections (instance {{ $labels.instance }})\n        description: \"Redis instance should have more connections (> 5)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RedisRejectedConnections\n      expr: 'increase(redis_rejected_connections_total[1m]) > 5'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Redis rejected connections (instance {{ $labels.instance }})\n        description: \"Some connections to Redis have been rejected\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/ruby/ruby-exporter.yml",
    "content": "groups:\n\n- name: RubyExporter\n\n  \n  rules:\n\n    # Threshold is a rough default. Adjust based on your application's normal heap size.\n    - alert: RubyHeapLiveSlotsHigh\n      expr: 'ruby_heap_live_slots > 500000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ruby heap live slots high (instance {{ $labels.instance }})\n        description: \"Ruby heap has too many live slots (> 500k), heap bloat\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RubyHeapFreeSlotsHigh\n      expr: 'ruby_heap_free_slots > 500000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ruby heap free slots high (instance {{ $labels.instance }})\n        description: \"Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Major GC rate > 5/s is extremely high. Consider lowering to > 1 or > 2 for earlier detection.\n    - alert: RubyMajorGcRateHigh\n      expr: 'rate(ruby_major_gc_ops_total[5m]) > 5'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ruby major GC rate high (instance {{ $labels.instance }})\n        description: \"Ruby is performing too many major GC cycles, indicating memory pressure\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RubyRssHigh\n      expr: 'ruby_rss > 1e9'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ruby RSS high (instance {{ $labels.instance }})\n        description: \"Ruby process RSS is high (> 1GB)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: RubyAllocatedObjectsSpike\n      expr: 'rate(ruby_allocated_objects_total[5m]) > 100000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Ruby allocated objects spike (instance {{ $labels.instance }})\n 
       description: \"Ruby is allocating objects at a high rate\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/s.m.a.r.t-device-monitoring/smartctl-exporter.yml",
    "content": "groups:\n\n- name: SmartctlExporter\n\n  \n  rules:\n\n    - alert: SmartDeviceTemperatureWarning\n      expr: '(avg_over_time(smartctl_device_temperature{temperature_type=\"current\"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type=\"drive_trip\"}) > 60'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SMART device temperature warning (instance {{ $labels.instance }})\n        description: \"Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartDeviceTemperatureCritical\n      expr: '(max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type=\"drive_trip\"}) > 70'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART device temperature critical (instance {{ $labels.instance }})\n        description: \"Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartDeviceTemperatureOverTripValue\n      expr: 'max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [10m]) >= on(device, instance) smartctl_device_temperature{temperature_type=\"drive_trip\"}'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART device temperature over trip value (instance {{ $labels.instance }})\n        description: \"Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartDeviceTemperatureNearingTripValue\n      expr: 'max_over_time(smartctl_device_temperature{temperature_type=\"current\"} [10m]) >= on(device, instance) 
(smartctl_device_temperature{temperature_type=\"drive_trip\"} * .80)'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SMART device temperature nearing trip value (instance {{ $labels.instance }})\n        description: \"Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartStatus\n      expr: 'smartctl_device_smart_status != 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART status (instance {{ $labels.instance }})\n        description: \"Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartCriticalWarning\n      expr: 'smartctl_device_critical_warning > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART critical warning (instance {{ $labels.instance }})\n        description: \"Disk controller has a critical warning on {{ $labels.instance }} drive {{ $labels.device }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartMediaErrors\n      expr: 'smartctl_device_media_errors > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART media errors (instance {{ $labels.instance }})\n        description: \"Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SmartWearoutIndicator\n      expr: 'smartctl_device_available_spare < smartctl_device_available_spare_threshold'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SMART Wearout Indicator (instance {{ $labels.instance }})\n        description: \"Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}\\n  
VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/sidekiq/strech-sidekiq-exporter.yml",
    "content": "groups:\n\n- name: StrechSidekiqExporter\n\n  \n  rules:\n\n    - alert: SidekiqQueueSize\n      expr: 'sidekiq_queue_size > 100'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: Sidekiq queue size (instance {{ $labels.instance }})\n        description: \"Sidekiq queue {{ $labels.name }} is growing\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SidekiqSchedulingLatencyTooHigh\n      expr: 'max(sidekiq_queue_latency) > 60'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})\n        description: \"Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/snmp/snmp-exporter.yml",
    "content": "groups:\n\n- name: SnmpExporter\n\n  # These rules use standard IF-MIB and SNMPv2-MIB metrics. Metric names depend on your snmp.yml module configuration.\n  # Thresholds for bandwidth and error rates are rough defaults - adjust to your environment.\n  \n  rules:\n\n    # From the official snmp-mixin.\n    - alert: SnmpTargetDown\n      expr: 'up{job=~\"snmp.*\"} == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: SNMP target down (instance {{ $labels.instance }})\n        description: \"SNMP device {{ $labels.instance }} is unreachable.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SnmpInterfaceDown\n      expr: '(ifOperStatus{job=~\"snmp.*\"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~\"snmp.*\"} == 1)'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: SNMP interface down (instance {{ $labels.instance }})\n        description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. 
Adjust based on your network environment.\n    - alert: SnmpInterfaceHighInboundErrorRate\n      expr: 'rate(ifInErrors{job=~\"snmp.*\"}[5m]) / (rate(ifHCInUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInMulticastPkts{job=~\"snmp.*\"}[5m])) > 0.05 and (rate(ifHCInUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCInMulticastPkts{job=~\"snmp.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: SNMP interface high inbound error rate (instance {{ $labels.instance }})\n        description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. Adjust based on your network environment.\n    - alert: SnmpInterfaceHighOutboundErrorRate\n      expr: 'rate(ifOutErrors{job=~\"snmp.*\"}[5m]) / (rate(ifHCOutUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutMulticastPkts{job=~\"snmp.*\"}[5m])) > 0.05 and (rate(ifHCOutUcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutBroadcastPkts{job=~\"snmp.*\"}[5m]) + rate(ifHCOutMulticastPkts{job=~\"snmp.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: SNMP interface high outbound error rate (instance {{ $labels.instance }})\n        description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. ifSpeed is a Gauge32 that maxes out at ~4.29 Gbps. 
For 10G+ interfaces, use ifHighSpeed (in Mbps) instead.\n    - alert: SnmpInterfaceHighBandwidthUsageInbound\n      expr: 'rate(ifHCInOctets{job=~\"snmp.*\"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: SNMP interface high bandwidth usage inbound (instance {{ $labels.instance }})\n        description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold is a rough default. ifSpeed is a Gauge32 that maxes out at ~4.29 Gbps. For 10G+ interfaces, use ifHighSpeed (in Mbps) instead.\n    - alert: SnmpInterfaceHighBandwidthUsageOutbound\n      expr: 'rate(ifHCOutOctets{job=~\"snmp.*\"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: SNMP interface high bandwidth usage outbound (instance {{ $labels.instance }})\n        description: \"Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # sysUpTime is in centiseconds (hundredths of a second).\n    - alert: SnmpDeviceRestarted\n      expr: 'sysUpTime / 100 < 300'\n      for: 0m\n      labels:\n        severity: info\n      annotations:\n        summary: SNMP device restarted (instance {{ $labels.instance }})\n        description: \"SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/solr/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: SolrUpdateErrors\n      expr: 'increase(solr_metrics_core_update_handler_errors_total[1m]) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Solr update errors (instance {{ $labels.instance }})\n        description: \"Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SolrQueryErrors\n      expr: 'increase(solr_metrics_core_errors_total{category=\"QUERY\"}[1m]) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Solr query errors (instance {{ $labels.instance }})\n        description: \"Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SolrReplicationErrors\n      expr: 'increase(solr_metrics_core_errors_total{category=\"REPLICATION\"}[1m]) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Solr replication errors (instance {{ $labels.instance }})\n        description: \"Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SolrLowLiveNodeCount\n      expr: 'solr_collections_live_nodes < 2'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Solr low live node count (instance {{ $labels.instance }})\n        description: \"Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/speedtest/nlamirault-speedtest-exporter.yml",
    "content": "groups:\n\n- name: NlamiraultSpeedtestExporter\n\n  \n  rules:\n\n    - alert: SpeedtestSlowInternetDownload\n      expr: 'avg_over_time(speedtest_download[10m]) < 100'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }})\n        description: \"Internet download speed is currently {{humanize $value}} Mbps.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpeedtestSlowInternetUpload\n      expr: 'avg_over_time(speedtest_upload[10m]) < 20'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }})\n        description: \"Internet upload speed is currently {{humanize $value}} Mbps.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/spinnaker/embedded-exporter.yml",
    "content": "groups:\n\n- name: EmbeddedExporter\n\n  \n  rules:\n\n    - alert: SpinnakerCircuitBreakerOpen\n      expr: 'resilience4j_circuitbreaker_state{state=\"open\"} == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker circuit breaker open (instance {{ $labels.instance }})\n        description: \"Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # In a healthy Spinnaker, queue_ready_depth should stay at or near 0.\n    # Sustained non-zero values indicate Orca cannot keep up with incoming work.\n    - alert: SpinnakerOrcaQueueBackingUp\n      expr: 'queue_ready_depth > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker Orca queue backing up (instance {{ $labels.instance }})\n        description: \"Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The 30s threshold is a rough default. Adjust based on your pipeline SLOs.\n    - alert: SpinnakerOrcaQueueMessageLagHigh\n      expr: 'rate(queue_message_lag_seconds_sum[5m]) / rate(queue_message_lag_seconds_count[5m]) > 30 and rate(queue_message_lag_seconds_count[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker Orca queue message lag high (instance {{ $labels.instance }})\n        description: \"Orca queue message lag is {{ $value }}s. 
Pipeline stages are waiting too long before being processed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpinnakerDeadMessages\n      expr: 'rate(queue_dead_messages_total[5m]) > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Spinnaker dead messages (instance {{ $labels.instance }})\n        description: \"Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Zombies are pipeline executions that are running but have lost their queue entry.\n    # See https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/\n    - alert: SpinnakerZombieExecutions\n      expr: 'rate(queue_zombies_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker zombie executions (instance {{ $labels.instance }})\n        description: \"{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpinnakerThreadPoolExhaustion\n      expr: 'threadpool_blockingQueueSize > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker thread pool exhaustion (instance {{ $labels.instance }})\n        description: \"Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. 
Pipeline execution throughput is degraded.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # When this threshold is exceeded, Igor stops triggering pipelines for the affected monitor.\n    # See https://kb.armory.io/s/article/Hitting-Igor-s-caching-thresholds\n    - alert: SpinnakerPollingMonitorItemsOverThreshold\n      expr: 'sum by (monitor, partition) (pollingMonitor_itemsOverThreshold) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Spinnaker polling monitor items over threshold (instance {{ $labels.instance }})\n        description: \"Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpinnakerPollingMonitorFailures\n      expr: 'rate(pollingMonitor_failed_total[5m]) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker polling monitor failures (instance {{ $labels.instance }})\n        description: \"Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # The 5% threshold is a rough default. 
Adjust based on your traffic patterns.\n    - alert: SpinnakerHighApiErrorRate\n      expr: 'sum by (instance) (rate(controller_invocations_total{status=\"5xx\"}[5m])) / sum by (instance) (rate(controller_invocations_total[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker high API error rate (instance {{ $labels.instance }})\n        description: \"Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpinnakerApiRateLimitThrottling\n      expr: 'rate(rateLimitThrottling_total[5m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker API rate limit throttling (instance {{ $labels.instance }})\n        description: \"Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SpinnakerClouddriverHighErrorRate\n      expr: 'sum by (instance) (rate(controller_invocations_total{status=\"5xx\", job=~\".*clouddriver.*\"}[5m])) / sum by (instance) (rate(controller_invocations_total{job=~\".*clouddriver.*\"}[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total{job=~\".*clouddriver.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker Clouddriver high error rate (instance {{ $labels.instance }})\n        description: \"Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This metric is specific to AWS cloud providers in Clouddriver.\n    # The 1000ms threshold is a rough default. 
Adjust based on your AWS usage patterns.\n    - alert: SpinnakerAwsRateLimiting\n      expr: 'amazonClientProvider_rateLimitDelayMil > 1000'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Spinnaker AWS rate limiting (instance {{ $labels.instance }})\n        description: \"Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/sql-server/ozarklake-mssql-exporter.yml",
    "content": "groups:\n\n- name: OzarklakeMssqlExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: SqlServerDown\n      expr: 'mssql_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: SQL Server down (instance {{ $labels.instance }})\n        description: \"SQL server instance is down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SqlServerDeadlock\n      expr: 'mssql_deadlocks > 5'\n      for: 1m\n      labels:\n        severity: warning\n      annotations:\n        summary: SQL Server deadlock (instance {{ $labels.instance }})\n        description: \"SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/ssl/tls/ribbybibby-ssl-exporter.yml",
    "content": "groups:\n\n- name: RibbybibbySslExporter\n\n  \n  rules:\n\n    - alert: SslCertificateProbeFailed\n      expr: 'ssl_probe_success == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SSL certificate probe failed (instance {{ $labels.instance }})\n        description: \"Failed to fetch SSL information {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SslCertificateOscpStatusUnknown\n      expr: 'ssl_ocsp_response_status == 2'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SSL certificate OSCP status unknown (instance {{ $labels.instance }})\n        description: \"Failed to get the OSCP status {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SslCertificateRevoked\n      expr: 'ssl_ocsp_response_status == 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: SSL certificate revoked (instance {{ $labels.instance }})\n        description: \"SSL certificate revoked {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SslCertificateExpiry(<7Days)\n      expr: 'ssl_verified_cert_not_after{chain_no=\"0\"} - time() < 86400 * 7'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: SSL certificate expiry (< 7 days) (instance {{ $labels.instance }})\n        description: \"{{ $labels.instance }} Certificate is expiring in 7 days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/systemd/systemd-exporter.yml",
    "content": "groups:\n\n- name: SystemdExporter\n\n  \n  rules:\n\n    - alert: SystemdUnitFailed\n      expr: 'systemd_unit_state{state=\"failed\"} == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd unit failed (instance {{ $labels.instance }})\n        description: \"Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Many units are legitimately inactive. You must adjust the name=~ filter to match your critical services.\n    - alert: SystemdUnitInactive\n      expr: 'systemd_unit_state{state=\"inactive\", type=\"service\", name=~\"your-critical-service.+\"} == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd unit inactive (instance {{ $labels.instance }})\n        description: \"Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SystemdServiceCrashLooping\n      expr: 'increase(systemd_service_restart_total[1h]) > 5'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Systemd service crash looping (instance {{ $labels.instance }})\n        description: \"Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SystemdUnitTasksNearLimit\n      expr: 'systemd_unit_tasks_current / ignoring(type) systemd_unit_tasks_max > 0.9 and ignoring(type) systemd_unit_tasks_max > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd unit tasks near limit (instance {{ $labels.instance }})\n        description: \"Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. 
(instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: SystemdSocketRefusedConnections\n      expr: 'increase(systemd_socket_refused_connections_total[5m]) > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd socket refused connections (instance {{ $labels.instance }})\n        description: \"Systemd socket {{ $labels.name }} is refusing connections. ({{ $value }} refused in last 5m, instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 100 connections is arbitrary. Adjust to your workload.\n    - alert: SystemdSocketHighConnections\n      expr: 'systemd_socket_current_connections > 100'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd socket high connections (instance {{ $labels.instance }})\n        description: \"Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Triggers if timer hasn't fired in 24 hours. Adjust threshold per timer schedule.\n    - alert: SystemdTimerMissedTrigger\n      expr: '(time() - systemd_timer_last_trigger_seconds) / 3600 > 24 and systemd_timer_last_trigger_seconds > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Systemd timer missed trigger (instance {{ $labels.instance }})\n        description: \"Systemd timer {{ $labels.name }} has not triggered for over 24 hours. (instance {{ $labels.instance }})\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-bucket-replicate.yml",
    "content": "groups:\n\n- name: ThanosBucketReplicate\n\n  \n  rules:\n\n    - alert: ThanosBucketReplicateErrorRate\n      expr: '(sum by (job) (rate(thanos_replicate_replication_runs_total{result=\"error\", job=~\".*thanos-bucket-replicate.*\"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~\".*thanos-bucket-replicate.*\"}[5m]))) * 100 >= 10 and sum by (job) (rate(thanos_replicate_replication_runs_total{job=~\".*thanos-bucket-replicate.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }})\n        description: \"Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosBucketReplicateRunLatency\n      expr: '(histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~\".*thanos-bucket-replicate.*\"}[5m]))) > 20 and  sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~\".*thanos-bucket-replicate.*\"}[5m])) > 0)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }})\n        description: \"Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-compactor.yml",
    "content": "groups:\n\n- name: ThanosCompactor\n\n  \n  rules:\n\n    - alert: ThanosCompactorMultipleRunning\n      expr: 'sum by (job) (up{job=~\".*thanos-compact.*\"}) > 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Compactor Multiple Running (instance {{ $labels.instance }})\n        description: \"No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosCompactorHalted\n      expr: 'thanos_compact_halted{job=~\".*thanos-compact.*\"} == 1'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Compactor Halted (instance {{ $labels.instance }})\n        description: \"Thanos Compact {{$labels.job}} has failed to run and now is halted.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosCompactorHighCompactionFailures\n      expr: '(sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~\".*thanos-compact.*\"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~\".*thanos-compact.*\"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_compact_group_compactions_total{job=~\".*thanos-compact.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }})\n        description: \"Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosCompactBucketHighOperationFailures\n      expr: '(sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-compact.*\"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-compact.*\"}[5m])) * 100 > 5) and sum by (job) 
(rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-compact.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }})\n        description: \"Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosCompactHasNotRun\n      expr: '(time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~\".*thanos-compact.*\"}[24h]))) / 60 / 60 > 24'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Compact Has Not Run (instance {{ $labels.instance }})\n        description: \"Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-component-absent.yml",
    "content": "groups:\n\n- name: ThanosComponentAbsent\n\n  \n  rules:\n\n    - alert: ThanosCompactIsDown\n      expr: 'absent(up{job=~\".*thanos-compact.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Compact Is Down (instance {{ $labels.instance }})\n        description: \"ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryIsDown\n      expr: 'absent(up{job=~\".*thanos-query.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Query Is Down (instance {{ $labels.instance }})\n        description: \"ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveIsDown\n      expr: 'absent(up{job=~\".*thanos-receive.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Receive Is Down (instance {{ $labels.instance }})\n        description: \"ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleIsDown\n      expr: 'absent(up{job=~\".*thanos-rule.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Rule Is Down (instance {{ $labels.instance }})\n        description: \"ThanosRule has disappeared. 
Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosSidecarIsDown\n      expr: 'absent(up{job=~\".*thanos-sidecar.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Sidecar Is Down (instance {{ $labels.instance }})\n        description: \"ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosStoreIsDown\n      expr: 'absent(up{job=~\".*thanos-store.*\"} == 1)'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Store Is Down (instance {{ $labels.instance }})\n        description: \"ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-query.yml",
    "content": "groups:\n\n- name: ThanosQuery\n\n  \n  rules:\n\n    - alert: ThanosQueryHttpRequestQueryErrorRateHigh\n      expr: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-query.*\", handler=\"query\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \\\"query\\\" requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh\n      expr: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \\\"query_range\\\" requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryGrpcServerErrorRate\n      expr: '(sum by (job) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-query.*\"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-query.*\"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n      for: 5m\n   
   labels:\n        severity: warning\n      annotations:\n        summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryGrpcClientErrorRate\n      expr: '(sum by (job) (rate(grpc_client_handled_total{grpc_code!=\"OK\", job=~\".*thanos-query.*\"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~\".*thanos-query.*\"}[5m]))) * 100 > 5 and sum by (job) (rate(grpc_client_started_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryHighDNSFailures\n      expr: '(sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~\".*thanos-query.*\"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~\".*thanos-query.*\"}[5m]))) * 100 > 1 and sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~\".*thanos-query.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Query High DNS Failures (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} is failing {{$value | humanize}}% of DNS queries for store endpoints.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryInstantLatencyHigh\n      expr: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query\"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query\"}[5m])) > 
0)'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Query Instant Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryRangeLatencyHigh\n      expr: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~\".*thanos-query.*\", handler=\"query_range\"}[5m])) > 0)'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Query Range Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosQueryOverload\n      expr: '(max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1)'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Query Overload (instance {{ $labels.instance }})\n        description: \"Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connected Prometheus instances, look for potential senders of these requests and then contact support.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-receiver.yml",
    "content": "groups:\n\n- name: ThanosReceiver\n\n  \n  rules:\n\n    - alert: ThanosReceiveHttpRequestErrorRateHigh\n      expr: '(sum by (job) (rate(http_requests_total{code=~\"5..\", job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))/  sum by (job) (rate(http_requests_total{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveHttpRequestLatencyHigh\n      expr: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~\".*thanos-receive.*\", handler=\"receive\"}[5m])) > 0)'\n      for: 10m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveHighReplicationFailures\n      expr: 'thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result=\"error\", job=~\".*thanos-receive.*\"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~\".*thanos-receive.*\"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~\".*thanos-receive.*\"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~\".*thanos-receive.*\"}))) * 100'\n      for: 5m\n      
labels:\n        severity: warning\n      annotations:\n        summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveHighForwardRequestFailures\n      expr: '(sum by (job) (rate(thanos_receive_forward_requests_total{result=\"error\", job=~\".*thanos-receive.*\"}[5m]))/  sum by (job) (rate(thanos_receive_forward_requests_total{job=~\".*thanos-receive.*\"}[5m]))) * 100 > 20 and sum by (job) (rate(thanos_receive_forward_requests_total{job=~\".*thanos-receive.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveHighHashringFileRefreshFailures\n      expr: '(sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~\".*thanos-receive.*\"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~\".*thanos-receive.*\"}[5m])) > 0) and sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~\".*thanos-receive.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveConfigReloadFailure\n      expr: 'avg by (job) (thanos_receive_config_last_reload_successful{job=~\".*thanos-receive.*\"}) != 1'\n      for: 5m\n      labels:\n        
severity: warning\n      annotations:\n        summary: Thanos Receive Config Reload Failure (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosReceiveNoUpload\n      expr: '(up{job=~\".*thanos-receive.*\"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~\".*thanos-receive.*\"}[3h])) == 0)'\n      for: 3h\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Receive No Upload (instance {{ $labels.instance }})\n        description: \"Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-ruler.yml",
    "content": "groups:\n\n- name: ThanosRuler\n\n  \n  rules:\n\n    - alert: ThanosRuleQueueIsDroppingAlerts\n      expr: 'sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleSenderIsFailingAlerts\n      expr: 'sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleHighRuleEvaluationFailures\n      expr: '(sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 5) and sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} is failing to evaluate {{$value | humanize}}% of rules.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: ThanosRuleHighRuleEvaluationWarnings\n      expr: 'sum by (job, instance) 
(rate(thanos_rule_evaluation_with_warnings_total{job=~\".*thanos-rule.*\"}[5m])) > 0.05'\n      for: 15m\n      labels:\n        severity: info\n      annotations:\n        summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleRuleEvaluationLatencyHigh\n      expr: '(sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~\".*thanos-rule.*\"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~\".*thanos-rule.*\"}))'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleGrpcErrorRate\n      expr: '(sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-rule.*\"}[5m]))/  sum by (job, instance) (rate(grpc_server_started_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 5) and sum by (job, instance) (rate(grpc_server_started_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleConfigReloadFailure\n      expr: 'avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~\".*thanos-rule.*\"}) != 1'\n      for: 5m\n      
labels:\n        severity: info\n      annotations:\n        summary: Thanos Rule Config Reload Failure (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.job}} has not been able to reload its configuration.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleQueryHighDNSFailures\n      expr: '(sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Rule Query High D N S Failures (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleAlertmanagerHighDNSFailures\n      expr: '(sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~\".*thanos-rule.*\"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~\".*thanos-rule.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Rule Alertmanager High D N S Failures (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosRuleNoEvaluationFor10Intervals\n      expr: 'time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~\".*thanos-rule.*\"})>10 * max 
by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~\".*thanos-rule.*\"})'\n      for: 5m\n      labels:\n        severity: info\n      annotations:\n        summary: Thanos Rule No Evaluation For10 Intervals (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosNoRuleEvaluations\n      expr: 'sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~\".*thanos-rule.*\"}[5m])) <= 0  and sum by (job, instance) (thanos_rule_loaded_rules{job=~\".*thanos-rule.*\"}) > 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos No Rule Evaluations (instance {{ $labels.instance }})\n        description: \"Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-sidecar.yml",
    "content": "groups:\n\n- name: ThanosSidecar\n\n  \n  rules:\n\n    # Threshold of 0.05/s avoids firing on transient single-event spikes.\n    - alert: ThanosSidecarBucketOperationsFailed\n      expr: 'sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-sidecar.*\"}[5m])) > 0.05'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Sidecar Bucket Operations Failed (instance {{ $labels.instance }})\n        description: \"Thanos Sidecar {{$labels.instance}} bucket operations are failing ({{ $value | humanize }}/s).\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosSidecarNoConnectionToStartedPrometheus\n      expr: 'thanos_sidecar_prometheus_up{job=~\".*thanos-sidecar.*\"} == 0 and on (namespace, pod)prometheus_tsdb_data_replay_duration_seconds != 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }})\n        description: \"Thanos Sidecar {{$labels.instance}} is unhealthy.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/thanos/thanos-store.yml",
    "content": "groups:\n\n- name: ThanosStore\n\n  \n  rules:\n\n    - alert: ThanosStoreGrpcErrorRate\n      expr: '(sum by (job) (rate(grpc_server_handled_total{grpc_code=~\"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\", job=~\".*thanos-store.*\"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-store.*\"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~\".*thanos-store.*\"}[5m])) > 0'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }})\n        description: \"Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosStoreSeriesGateLatencyHigh\n      expr: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~\".*thanos-store.*\"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~\".*thanos-store.*\"}[5m])) > 0)'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosStoreBucketHighOperationFailures\n      expr: '(sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~\".*thanos-store.*\"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-store.*\"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~\".*thanos-store.*\"}[5m])) > 0'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Store Bucket High Operation Failures (instance {{ 
$labels.instance }})\n        description: \"Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ThanosStoreObjstoreOperationLatencyHigh\n      expr: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~\".*thanos-store.*\"}[5m]))) > 2 and  sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~\".*thanos-store.*\"}[5m])) > 0)'\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }})\n        description: \"Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/traefik/embedded-exporter-v1.yml",
    "content": "groups:\n\n- name: EmbeddedExporterV1\n\n  \n  rules:\n\n    - alert: TraefikBackendDown\n      expr: 'count(traefik_backend_server_up) by (backend) == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik backend down (instance {{ $labels.instance }})\n        description: \"All Traefik backends are down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TraefikHighHttp4xxErrorRateBackend\n      expr: 'sum(rate(traefik_backend_requests_total{code=~\"4.*\"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 and sum(rate(traefik_backend_requests_total[3m])) by (backend) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})\n        description: \"Traefik backend 4xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TraefikHighHttp5xxErrorRateBackend\n      expr: 'sum(rate(traefik_backend_requests_total{code=~\"5.*\"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 and sum(rate(traefik_backend_requests_total[3m])) by (backend) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})\n        description: \"Traefik backend 5xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/traefik/embedded-exporter-v2.yml",
    "content": "groups:\n\n- name: EmbeddedExporterV2\n\n  \n  rules:\n\n    - alert: TraefikServiceDown\n      expr: 'count(traefik_service_server_up) by (service) == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik service down (instance {{ $labels.instance }})\n        description: \"All Traefik services are down\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TraefikHighHttp4xxErrorRateService\n      expr: 'sum(rate(traefik_service_requests_total{code=~\"4.*\"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 and sum(rate(traefik_service_requests_total[3m])) by (service) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})\n        description: \"Traefik service 4xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: TraefikHighHttp5xxErrorRateService\n      expr: 'sum(rate(traefik_service_requests_total{code=~\"5.*\"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 and sum(rate(traefik_service_requests_total[3m])) by (service) > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }})\n        description: \"Traefik service 5xx error rate is above 5%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/vmware/pryorda-vmware-exporter.yml",
    "content": "groups:\n\n- name: PryordaVmwareExporter\n\n  \n  rules:\n\n    - alert: VirtualMachineMemoryWarning\n      expr: 'vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})\n        description: \"High memory usage on {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: VirtualMachineMemoryCritical\n      expr: 'vmware_vm_mem_usage_average / 100 >= 90'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})\n        description: \"High memory usage on {{ $labels.instance }}: {{ $value | printf \\\"%.2f\\\"}}%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: HighNumberOfSnapshots\n      expr: 'vmware_vm_snapshots > 3'\n      for: 30m\n      labels:\n        severity: warning\n      annotations:\n        summary: High Number of Snapshots (instance {{ $labels.instance }})\n        description: \"High snapshots number on {{ $labels.instance }}: {{ $value }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: OutdatedSnapshots\n      expr: '(time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3'\n      for: 5m\n      labels:\n        severity: warning\n      annotations:\n        summary: Outdated Snapshots (instance {{ $labels.instance }})\n        description: \"Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \\\"%.0f\\\"}} days\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/windows-server/windows-exporter.yml",
    "content": "groups:\n\n- name: WindowsExporter\n\n  \n  rules:\n\n    - alert: WindowsServerCollectorError\n      expr: 'windows_exporter_collector_success == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Windows Server collector Error (instance {{ $labels.instance }})\n        description: \"Collector {{ $labels.collector }} was not successful\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: WindowsServerServiceStatus\n      expr: 'windows_service_status{status=\"ok\"} != 1'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Windows Server service Status (instance {{ $labels.instance }})\n        description: \"Windows Service state is not OK\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: WindowsServerCpuUsage\n      expr: '100 - (avg by (instance) (rate(windows_cpu_time_total{mode=\"idle\"}[2m])) * 100) > 80'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: Windows Server CPU Usage (instance {{ $labels.instance }})\n        description: \"CPU Usage is more than 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: WindowsServerMemoryUsage\n      expr: '100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: Windows Server memory Usage (instance {{ $labels.instance }})\n        description: \"Memory usage is more than 90%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: WindowsServerDiskSpaceUsage\n      expr: '100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80 and windows_logical_disk_size_bytes > 0'\n      for: 2m\n      labels:\n        severity: critical\n      annotations:\n        summary: Windows Server disk Space Usage (instance {{ $labels.instance }})\n        
description: \"Disk usage is more than 80%\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/wireguard/mindflavor-prometheus-wireguard-exporter.yml",
    "content": "groups:\n\n- name: MindflavorPrometheusWireguardExporter\n\n  \n  rules:\n\n    # The threshold of 300 seconds (5 minutes) is a rough default. WireGuard peers that are idle but reachable\n    # typically re-handshake every 2 minutes. Adjust based on your keepalive interval.\n    # The `> 0` guard excludes peers that have never completed a handshake (covered by a separate rule).\n    - alert: WireguardPeerHandshakeTooOld\n      expr: 'time() - wireguard_latest_handshake_seconds > 300 and wireguard_latest_handshake_seconds > 0'\n      for: 2m\n      labels:\n        severity: warning\n      annotations:\n        summary: WireGuard peer handshake too old (instance {{ $labels.instance }})\n        description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This alert will fire for all offline mobile/laptop peers. Consider filtering by expected-online peers.\n    - alert: WireguardPeerHandshakeNeverEstablished\n      expr: 'wireguard_latest_handshake_seconds == 0'\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: WireGuard peer handshake never established (instance {{ $labels.instance }})\n        description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. 
Check peer configuration and network connectivity.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # This alert fires when a peer has a recent handshake but zero traffic flow.\n    # May indicate routing issues or a misconfigured allowed-ips.\n    # Only useful if you expect continuous traffic on all peers.\n    - alert: WireguardNoTrafficOnPeer\n      expr: '(rate(wireguard_sent_bytes_total[15m]) + rate(wireguard_received_bytes_total[15m])) == 0 and wireguard_latest_handshake_seconds > 0 and (time() - wireguard_latest_handshake_seconds) < 300'\n      for: 15m\n      labels:\n        severity: warning\n      annotations:\n        summary: WireGuard no traffic on peer (instance {{ $labels.instance }})\n        description: \"WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/zfs/node-exporter.yml",
    "content": "groups:\n\n- name: NodeExporter\n\n  \n  rules:\n\n    - alert: ZfsOfflinePool\n      expr: 'node_zfs_zpool_state{state!=\"online\"} > 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: ZFS offline pool (instance {{ $labels.instance }})\n        description: \"A ZFS zpool is in a unexpected state: {{ $labels.state }}.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/zfs/zfs_exporter.yml",
    "content": "groups:\n\n- name: Zfs_exporter\n\n  \n  rules:\n\n    - alert: ZfsPoolOutOfSpace\n      expr: 'zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0 and zfs_pool_size_bytes > 0'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: ZFS pool out of space (instance {{ $labels.instance }})\n        description: \"Disk is almost full (< 10% left)\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    # 0: ONLINE\n    # 1: DEGRADED\n    # 2: FAULTED\n    # 3: OFFLINE\n    # 4: UNAVAIL\n    # 5: REMOVED\n    # 6: SUSPENDED\n    - alert: ZfsPoolUnhealthy\n      expr: 'zfs_pool_health > 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: ZFS pool unhealthy (instance {{ $labels.instance }})\n        description: \"ZFS pool state is {{ $value }}. See comments for more information.\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ZfsCollectorFailed\n      expr: 'zfs_scrape_collector_success != 1'\n      for: 0m\n      labels:\n        severity: warning\n      annotations:\n        summary: ZFS collector failed (instance {{ $labels.instance }})\n        description: \"ZFS collector for {{ $labels.instance }} has failed to collect information\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/rules/zookeeper/cloudflare-kafka-zookeeper-exporter.yml",
    "content": "groups:\n\n- name: CloudflareKafkaZookeeperExporter\n\n  \n  rules:\n"
  },
  {
    "path": "dist/rules/zookeeper/dabealu-zookeeper-exporter.yml",
    "content": "groups:\n\n- name: DabealuZookeeperExporter\n\n  \n  rules:\n\n    # 1m delay allows a restart without triggering an alert.\n    - alert: ZookeeperDown\n      expr: 'zk_up == 0'\n      for: 1m\n      labels:\n        severity: critical\n      annotations:\n        summary: Zookeeper Down (instance {{ $labels.instance }})\n        description: \"Zookeeper down on instance {{ $labels.instance }}\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ZookeeperMissingLeader\n      expr: 'sum(zk_server_leader) == 0'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Zookeeper missing leader (instance {{ $labels.instance }})\n        description: \"Zookeeper cluster has no node marked as leader\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ZookeeperTooManyLeaders\n      expr: 'sum(zk_server_leader) > 1'\n      for: 0m\n      labels:\n        severity: critical\n      annotations:\n        summary: Zookeeper Too Many Leaders (instance {{ $labels.instance }})\n        description: \"Zookeeper cluster has too many nodes marked as leader\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n\n    - alert: ZookeeperNotOk\n      expr: 'zk_ruok == 0'\n      for: 3m\n      labels:\n        severity: warning\n      annotations:\n        summary: Zookeeper Not Ok (instance {{ $labels.instance }})\n        description: \"Zookeeper instance is not ok\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}\"\n"
  },
  {
    "path": "dist/template.yml",
    "content": "groups:\n{% assign groupName = slug | split: '-' %}{% capture groupNameCamelcase %}{% for word in groupName %}{{ word | capitalize }} {% endfor %}{% endcapture %}\n- name: {{ groupNameCamelcase | remove: ' ' | remove: '-' }}\n\n  {% assign lines = comments | split: \"\n\" %}{% for line in lines %}# {{ line | strip }}\n  {% endfor %}\n  rules:\n{% for rule in rules %}{% assign ruleName = rule.name | split: ' ' %}{% capture ruleNameCamelcase %}{% for word in ruleName %}{{ word | capitalize }} {% endfor %}{% endcapture %}\n    {% assign lines = rule.comments | split: \"\n\" %}{% for line in lines %}# {{ line | strip }}\n    {% endfor %}- alert: {{ ruleNameCamelcase | remove: ' ' }}\n      expr: '{{ rule.query }}'\n      for: {% if rule.for %}{{ rule.for }}{% else %}0m{% endif %}\n      labels:\n        severity: {{ rule.severity }}\n      annotations:\n        summary: {% if rule.summary %}{{ rule.summary }}{% else %}{{ rule.name }} (instance {% raw %}{{ $labels.instance }}{% endraw %}){% endif %}\n        description: \"{{ rule.description | replace: '\"', '\\\"' }}\\n  VALUE = {% raw %}{{ $value }}{% endraw %}\\n  LABELS = {% raw %}{{ $labels }}{% endraw %}\"\n{% endfor %}"
  },
  {
    "path": "docker-compose.yml",
    "content": "version: '3'\n\nservices:\n\n  jekyll:\n    image: jekyll/jekyll:latest\n    command: jekyll serve\n    volumes:\n      - ./:/srv/jekyll\n    ports:\n      - 4000:4000\n"
  },
  {
    "path": "index.md",
    "content": "\n<style>\n.center-image\n{\n    margin: 0 auto;\n    display: block;\n}\n</style>\n\n\n![Prometheus logo](/assets/prometheus-logo.png){: .center-image }\n\n\n<h2>\n  Hello world\n</h2>\n\n<a href=\"/awesome-prometheus-alerts/alertmanager\">\n  AlertManager configuration\n</a>\n\n<a href=\"/awesome-prometheus-alerts/sleep-peacefully\">\n  Alerting time window\n</a>\n\n<h2>\n  Out of the box prometheus alerting rules\n</h2>\n\n<ul>\n  {% for group in site.data.rules.groups %}\n    <li style=\"margin-top: 30px;\">\n      {% assign nbrRules = 0 %}\n      {% for service in group.services %}\n        {% for exporter in service.exporters %}\n          {% for rule in exporter.rules %}\n            {% assign nbrRules = nbrRules | plus: 1 %}\n          {% endfor %}\n        {% endfor %}\n      {% endfor %}\n\n      <h3>{{ group.name }} <small style=\"margin-left: 20px;\">({{ nbrRules }} rules)</small></h3>\n      <ul>\n        {% for service in group.services %}\n        <li>\n          <a href=\"/awesome-prometheus-alerts/rules#{{ service.name | replace: \" \", \"-\" | downcase }}\">\n            {{ service.name }}\n          </a>\n        </li>\n        {% endfor %}\n      </ul>\n    </li>\n  {% endfor %}\n</ul>\n"
  },
  {
    "path": "package.json",
    "content": "{\n\t  \"scripts\": {\n\t\t    \"test\": \"awesome-lint\"\n\t  },\n\t  \"devDependencies\": {\n\t\t    \"awesome-lint\": \"*\"\n\t  }\n}\n"
  },
  {
    "path": "rules.md",
    "content": "<style>\n  ul {\n    list-style: none;\n  }\n</style>\n\n<!-- CAUTIONS -->\n<div style=\"padding: 20px 20px 10px 20px; border: solid grey 1px; border-radius: 10px;\">\n  <h2 style=\"text-align:center;\">⚠️ Caution ⚠️</h2>\n\n  <p style=\"text-align:center;\">\n    Alert thresholds depend on nature of applications.\n    <br>\n    Some queries in this page may have arbitrary tolerance threshold.\n    <br><br>\n    Building an efficient and battle-tested monitoring platform takes time. 😉\n  </p>\n</div>\n\n<br>\n<br>\n\n<h1></h1>\n\n<!-- RULES -->\n<ul>\n  {% for group in site.data.rules.groups %}\n  {% assign groupIndex = forloop.index %}\n    {% for service in group.services %}\n    {% assign serviceIndex = forloop.index %}\n    {% assign nbrExporters = service.exporters | size %}\n      {% for exporter in service.exporters %}\n      {% assign exporterIndex = forloop.index %}\n      {% assign nbrRules = exporter.rules | size %}\n      <li>\n        {% assign serviceId = service.name | replace: \" \", \"-\" | downcase %}\n        <h2 id=\"{{ serviceId }}\">\n          <span id=\"{{ serviceId }}-{{ exporterIndex }}\"></span>\n          <a class=\"anchor\" href=\"#{{ serviceId }}-{{ exporterIndex }}\">#</a>\n          {{ groupIndex }}.{{ serviceIndex }}.{% if nbrExporters > 1 %}{{ exporterIndex }}.{% endif %}\n          {{ service.name }}\n          {% if exporter.name %}:\n          {% if exporter.doc_url %}\n          <a href=\"{{ exporter.doc_url }}\">\n            {{ exporter.name }}\n          </a>\n          {% else %}\n          {{ exporter.name }}\n          {% endif %}\n          {% endif %}\n\n          {% if nbrRules > 0 %}\n            <small style=\"font-size: 60%; vertical-align: middle; margin-left: 10px;\">\n              ({{ nbrRules }} rules)\n            </small>\n            <span class=\"clipboard-multiple\" data-clipboard-target-id=\"group-{{ groupIndex }}-service-{{ serviceIndex }}-exporter-{{ exporterIndex }}\">[copy 
section]</span>\n          {% endif %}\n        </h2>\n\n        {% if nbrRules == 0 %}\n{% highlight javascript %}\n// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋\n{% endhighlight %}\n        {% else %}\n{{ exporter.comments | strip | newline_to_br }}\n{% highlight bash %}\n$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/{{ service.name | replace: \" \", \"-\" | downcase }}/{{ exporter.slug }}.yml\n{% endhighlight %}\n        {% endif %}\n\n        <ul>\n          {% for rule in exporter.rules %}\n          {% assign ruleIndex = forloop.index %}\n          {% assign comments = rule.comments | strip | newline_to_br | split: '<br />' %}\n          <li>\n            <h4 id=\"rule-{{ serviceId }}-{{ exporterIndex }}-{{ ruleIndex }}\">\n              <span id=\"rule-{{ serviceId }}-{{ ruleIndex }}\"></span><!-- @deprecated -->\n              <a class=\"anchor\" href=\"#rule-{{ serviceId }}-{{ exporterIndex }}-{{ ruleIndex }}\">#</a>\n              {{ groupIndex}}.{{ serviceIndex }}.{% if nbrExporters > 1 %}{{ exporterIndex }}.{% endif %}{{ ruleIndex }}.\n              {{ rule.name }}\n            </h4>\n            <summary>\n              {{ rule.description }}\n              <span class=\"clipboard-single\" data-clipboard-target-id=\"group-{{ groupIndex }}-service-{{ serviceIndex }}-exporter-{{ exporterIndex }}-rule-{{ ruleIndex }}\" onclick=\"event.preventDefault();\">[copy]</span>\n            </summary>\n            <div id=\"group-{{ groupIndex }}-service-{{ serviceIndex }}-exporter-{{ exporterIndex }}-rule-{{ ruleIndex }}\">\n              {% assign ruleName = rule.name | split: ' ' %}\n              {% capture ruleNameCamelcase %}{% for word in ruleName %}{{ word | capitalize }} {% endfor %}{% endcapture %}\n\n  {% highlight yaml %}\n  {% for comment in comments %}# {{ comment | strip }}\n  {% endfor %}- alert: {{ ruleNameCamelcase | remove: ' ' }}\n    expr: {{ 
rule.query }}\n    for: {% if rule.for %}{{ rule.for }}{% else %}0m{% endif %}\n    labels:\n      severity: {{ rule.severity }}\n    annotations:\n      summary: {{ rule.name }} (instance {% raw %}{{ $labels.instance }}{% endraw %})\n      description: \"{{ rule.description | replace: '\"', '\\\"' }}\\n  VALUE = {% raw %}{{ $value }}{% endraw %}\\n  LABELS = {% raw %}{{ $labels }}{% endraw %}\"\n\n{% endhighlight %}\n\n            </div>\n            <br/>\n          </li>\n          {% endfor %}\n        </ul>\n\n      <hr/>\n      </li>\n    {% endfor %}\n    {% endfor %}\n  {% endfor %}\n</ul>\n\n\n\n<!-- NAVBAR -->\n<div id=\"rules-navbar\" class=\"affix\">\n  <h3>Menu</h3>\n  <ul>\n    {% for group in site.data.rules.groups %}\n      <li>\n        <h4>{{ group.name }}</h4>\n        <ul>\n          {% for service in group.services %}\n            <li>\n              <a href=\"#{{ service.name | replace: \" \", \"-\" | downcase }}\">\n                👉 {{ service.name }}\n              </a>\n            </li>\n          {% endfor %}\n        </ul>\n      </li>\n    {% endfor %}\n  </ul>\n\n  <script>\n    $('#rules-navbar').affix({offset: {top: 750} }).css('display', 'block');\n  </script>\n</div>\n"
  },
  {
    "path": "sleep-peacefully.md",
    "content": "<h1 style=\"text-align: center;\">\n  Sleep Peacefully\n</h1>\n\n## Alerting time window\n\nIn some applications, load and activity can vary over the day/week/year.\n\nIn order to prevent alarm fatigue and busy pager, alerts can be disabled during a period of time (such as night or weekend).\n\nExample:\n\n- Weekday: `node_load5 > 10 and ON() (0 < day_of_week() < 6)`\n- Day time: `node_load5 > 10 and ON() (8 < hour() < 18)`\n- Exclude December: `node_load5 > 10 and ON() (month() != 12)`\n\n## Advanced time windows and timezones\n\n```yml\n# rules.yml\n\ngroups:\n  - name: timezones\n    rules:\n    - record: european_summer_time_offset\n      expr: |\n          (vector(1) and (month() > 3 and month() < 10))\n          or\n          (vector(1) and (month() == 3 and (day_of_month() - day_of_week()) >= 25) and absent((day_of_month() >= 25) and (day_of_week() == 0)))\n          or\n          (vector(1) and (month() == 10 and (day_of_month() - day_of_week()) < 25) and absent((day_of_month() >= 25) and (day_of_week() == 0)))\n          or\n          (vector(1) and ((month() == 10 and hour() < 1) or (month() == 3 and hour() > 0)) and ((day_of_month() >= 25) and (day_of_week() == 0)))\n          or\n          vector(0)\n\n    - record: europe_london_time\n      expr: time() + 3600 * european_summer_time_offset\n    - record: europe_paris_time\n      expr: time() + 3600 * (1 + european_summer_time_offset)\n\n    - record: europe_london_hour\n      expr: hour(europe_london_time)\n    - record: europe_paris_hour\n      expr: hour(europe_paris_time)\n\n    - record: europe_london_weekday\n      expr: 0 < day_of_week(europe_london_time) < 6\n    - record: europe_paris_weekday\n      expr: 0 < day_of_week(europe_paris_time) < 6\n    # opposite\n    - record: not_europe_london_weekday\n      expr: absent(europe_london_weekday)\n    - record: not_europe_paris_weekday\n      expr: absent(europe_paris_weekday)\n\n    - record: europe_london_business_hours\n      
expr: 9 <= europe_london_hour < 18\n    - record: europe_paris_business_hours\n      expr: 9 <= europe_paris_hour < 18\n    # opposite\n    - record: not_europe_london_business_hours\n      expr: absent(europe_london_business_hours)\n    - record: not_europe_paris_business_hours\n      expr: absent(europe_paris_business_hours)\n\n    # new year's day / xmas / labor day / all saints' day / ...\n    - record: europe_french_public_holidays\n      expr: |\n          (vector(1) and month(europe_paris_time) == 1 and day_of_month(europe_paris_time) == 1)\n          or\n          (vector(1) and month(europe_paris_time) == 12 and day_of_month(europe_paris_time) == 25)\n          or\n          (vector(1) and month(europe_paris_time) == 5 and day_of_month(europe_paris_time) == 1)\n          or\n          (vector(1) and month(europe_paris_time) == 11 and day_of_month(europe_paris_time) == 1)\n          or\n          vector(0)\n    # opposite\n    - record: not_europe_french_public_holidays\n      expr: absent(europe_french_public_holidays)\n```\n\n```yml\n# alerts.yml\n\ngroups:\n  - name: CPU Load\n    rules:\n      - alert: HighLoadQuietDuringWeekendAndNight\n        expr: node_load5 > 10 and ON() (europe_london_weekday and europe_paris_weekday)\n\n      - alert: HighLoadQuietDuringBackup\n        expr: node_load5 > 10 and ON() absent(hour() == 2)\n\n      - alert: HighLoad\n        expr: |\n            node_load5 > 20 and ON() (europe_london_weekday and europe_paris_weekday)\n            or\n            node_load5 > 10\n```\n\n## Sources\n\n- [https://medium.com/@tom.fawcett/time-of-day-based-notifications-with-prometheus-and-alertmanager-1bf7a23b7695](https://medium.com/@tom.fawcett/time-of-day-based-notifications-with-prometheus-and-alertmanager-1bf7a23b7695)\n- [https://promcon.io/2019-munich/slides/improved-alerting-with-prometheus-and-alertmanager.pdf](https://promcon.io/2019-munich/slides/improved-alerting-with-prometheus-and-alertmanager.pdf)\n"
  }
]