External services

The easiest way to monitor an external service with Cloudify’s Status Reporter is to add scraping targets to the existing Prometheus installation (e.g. the one running with Cloudify Manager). It is not the only way: an alternative is to set up an additional, external Prometheus instance (along with the appropriate exporter) which handles the service, and federate it with the existing Prometheus. The latter is more complicated, but has its advantages (residency, load distribution, etc.). This guide focuses on the former approach, which is the simplest and, in most cases, the best one.
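
The federated setup is not described further in this guide, but for orientation, a federation scrape job added to the Cloudify Manager’s Prometheus could look roughly like the sketch below; the target hostname and the match[] selector are placeholders, not values taken from an actual installation:

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"postgresql|rabbitmq"}'
    static_configs:
      - targets: ["external-prometheus.example.com:9090"]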

Prerequisites

In order to add metrics of external services to Cloudify’s Status Reporter, additional scraping targets have to be configured. Prometheus’s file-based service discovery, for example, can be used to do that, as shown in the sketch below.
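
On a Cloudify Manager the relevant scrape jobs are typically already defined; the sketch below only illustrates how file-based service discovery ties a scrape job to target files. The file paths match the Examples section at the end of this page, while the exact job layout of an existing installation may differ:

scrape_configs:
  - job_name: postgresql
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/other_postgres.yml
  - job_name: rabbitmq
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/other_rabbits.yml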

Database (PostgreSQL compliant) monitoring

Requirements

These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):

  - postgres_exporter, providing metrics of the PostgreSQL server (listening on port 9187 by default),
  - node_exporter, providing metrics of the database host (listening on port 9100 by default).

PostgreSQL recording rules

Status Reporter queries Prometheus with the expression (postgres_healthy) or (postgres_service). These metrics are provided by Prometheus recording rules configured in the /etc/prometheus/alerts/postgres.yml file:

- record: postgres_healthy
  expr: pg_up{job="postgresql"} == 1 and up{job="postgresql"} == 1
This rule combines two metrics related to the postgres_exporter: up reports the status of the postgres_exporter itself, while pg_up reports the status of the PostgreSQL server behind it. Both metrics should have a value of 1.
- record: postgres_service
  expr: postgres_service_supervisord
- record: postgres_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx)"})
  labels:
    process_manager: supervisord
This rule returns the list of running services (as reported by node_exporter) that are relevant for determining the database node’s status. The rules above are meant for a system managed with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: postgres_service
  expr: postgres_service_systemd
- record: postgres_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx).service", state="active"})
  labels:
    process_manager: systemd
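
The snippets above show only the rules themselves. In a Prometheus rules file they have to be placed inside a rule group; a minimal sketch of how /etc/prometheus/alerts/postgres.yml could be laid out (the group name is an arbitrary choice for this illustration):

groups:
  - name: postgres
    rules:
      - record: postgres_healthy
        expr: pg_up{job="postgresql"} == 1 and up{job="postgresql"} == 1
      - record: postgres_service
        expr: postgres_service_supervisord
      - record: postgres_service_supervisord
        expr: sum by (host, name) (node_supervisord_up{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx)"})
        labels:
          process_manager: supervisord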

Expected results

The following method might be used to test PostgreSQL metrics (for other methods look here):

curl "http://localhost:9090/monitoring/api/v1/query?query=postgres_healthy%20or%20postgres_service"

For an environment with a single database, the output might look similar to the one below. Notice that all metrics have a value of 1, which suggests (but does not guarantee) that the database service is healthy.

{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "postgres_healthy",
                    "host": "172.20.0.3",
                    "instance": "localhost:9187",
                    "job": "postgresql"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "nginx",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "node_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "postgres_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "postgresql-14",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "prometheus",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}
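
If postgres_healthy is missing from the result, it can help to query the metrics it is built from directly, using the same endpoint as above (pg_up comes from postgres_exporter, while up is Prometheus’s own scrape-status metric):

curl "http://localhost:9090/monitoring/api/v1/query?query=pg_up%7Bjob%3D%22postgresql%22%7D%20or%20up%7Bjob%3D%22postgresql%22%7D"

A value of 0 for pg_up points at the PostgreSQL server itself, while a missing or 0 up series points at the postgres_exporter target or its scrape configuration.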

Message Queue (RabbitMQ compliant) monitoring

Requirements

These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):

  - the rabbitmq_prometheus plugin, providing metrics of the RabbitMQ server (listening on port 15692 by default),
  - node_exporter, providing metrics of the message queue host (listening on port 9100 by default).

RabbitMQ recording rules

Status Reporter queries Prometheus with the expression (rabbitmq_healthy) or (rabbitmq_service). These metrics are provided by Prometheus recording rules configured in the /etc/prometheus/alerts/rabbitmq.yml file:

- record: rabbitmq_healthy
  expr: sum by(host, instance, job, monitor) (up{job="rabbitmq"}) == 1
This rule is based on the up metric for the rabbitmq_prometheus plugin’s endpoint, which reflects the status of the RabbitMQ service. It should have a value of 1.
- record: rabbitmq_service
  expr: rabbitmq_service_supervisord
- record: rabbitmq_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx)"})
  labels:
    process_manager: supervisord
This rule returns the list of running services (as reported by node_exporter) that are relevant for determining the message queue node’s status. The rules above are meant for a system managed with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: rabbitmq_service
  expr: rabbitmq_service_systemd
- record: rabbitmq_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx).service", state="active"})
  labels:
    process_manager: systemd
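
Both variants of the service rules (here and in the PostgreSQL section above) rely on optional node_exporter collectors: node_supervisord_up comes from the supervisord collector and node_systemd_unit_state from the systemd collector, and neither collector is enabled by default. If the service metrics come back empty, check how node_exporter is started; a hedged example invocation (the supervisord XML-RPC URL is a placeholder):

# supervisord-managed node
node_exporter --collector.supervisord --collector.supervisord.url="http://localhost:9001/RPC2"
# systemd-managed node
node_exporter --collector.systemd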

Expected results

The following method might be used to test RabbitMQ metrics (for other methods look here):

curl "http://localhost:9090/monitoring/api/v1/query?query=rabbitmq_healthy%20or%20rabbitmq_service"
{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "rabbitmq_healthy",
                    "host": "172.20.0.3",
                    "instance": "localhost:15692",
                    "job": "rabbitmq"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "cloudify-rabbitmq",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "nginx",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "node_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "prometheus",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}

Examples

Assume there are:

  - an external PostgreSQL database running on the host 172.20.0.5, with postgres_exporter listening on port 9187 and node_exporter on port 9100,
  - an external RabbitMQ broker running on the host 172.20.0.6, with the rabbitmq_prometheus plugin listening on port 15692 and node_exporter on port 9100.

In order to make the metrics available for Status Reporter:

  1. Make sure the PostgreSQL recording rules described above are configured in Prometheus (they rely on metrics from postgres_exporter and node_exporter).

  2. Make sure the RabbitMQ recording rules described above are configured in Prometheus (they rely on metrics from the rabbitmq_prometheus plugin and node_exporter).

  3. Consider adding the following contents to your configuration files:

/etc/prometheus/targets/other_postgres.yml
- targets: ["172.20.0.5:9187", "172.20.0.5:9100"]
  labels: {"host": "172.20.0.5"}
/etc/prometheus/targets/other_rabbits.yml
- targets: ["172.20.0.6:15692", "172.20.0.6:9100"]
  labels: {"host": "172.20.0.6"}