External services

The easiest way to monitor an external service with Cloudify’s Status Reporter is to add scraping targets to the existing Prometheus installation (e.g. the one running with Cloudify Manager). It is not the only way: an alternative is to set up an additional, external Prometheus instance (along with the appropriate exporter) which handles the service, and federate it with the existing Prometheus. The latter is more complicated, but has its advantages (residency, load distribution, etc.). This guide focuses on the former approach, which is the simplest and, in most cases, the best one.
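
The federated setup is not described further in this guide, but for orientation, a federation scrape job added to the Cloudify Manager’s Prometheus could look roughly like the sketch below; the target hostname and the match[] selector are placeholders, not values taken from an actual installation:

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"postgresql|rabbitmq"}'
    static_configs:
      - targets: ["external-prometheus.example.com:9090"]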

Prerequisites

In order to add metrics of external services to Cloudify’s Status Reporter, additional scraping targets have to be configured. Prometheus’s file-based service discovery, for example, can be used to do that, as shown in the sketch below.
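
On a Cloudify Manager the relevant scrape jobs are typically already defined; the sketch below only illustrates how file-based service discovery ties a scrape job to target files. The file paths match the Examples section at the end of this page, while the exact job layout of an existing installation may differ:

scrape_configs:
  - job_name: postgresql
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/other_postgres.yml
  - job_name: rabbitmq
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/other_rabbits.yml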

Database (PostgreSQL compliant) monitoring

Requirements

These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):

  - postgres_exporter, providing metrics of the PostgreSQL server (listening on port 9187 by default),
  - node_exporter, providing metrics of the database host (listening on port 9100 by default).

PostgreSQL recording rules

Status Reporter queries Prometheus with the expression (postgres_healthy) or (postgres_service). These metrics are provided by Prometheus recording rules configured in the /etc/prometheus/alerts/postgres.yml file:

- record: postgres_healthy
  expr: pg_up{job="postgresql"} == 1 and up{job="postgresql"} == 1
This rule combines two metrics related to the postgres_exporter: up reports the status of the postgres_exporter itself, while pg_up reports the status of the PostgreSQL server behind it. Both metrics should have a value of 1.
- record: postgres_service
  expr: postgres_service_supervisord
- record: postgres_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx)"})
  labels:
    process_manager: supervisord
This rule returns the list of running services (as reported by node_exporter) that are relevant for determining the database node’s status. The rules above are meant for a system managed with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: postgres_service
  expr: postgres_service_systemd
- record: postgres_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx).service", state="active"})
  labels:
    process_manager: systemd
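
The snippets above show only the rules themselves. In a Prometheus rules file they have to be placed inside a rule group; a minimal sketch of how /etc/prometheus/alerts/postgres.yml could be laid out (the group name is an arbitrary choice for this illustration):

groups:
  - name: postgres
    rules:
      - record: postgres_healthy
        expr: pg_up{job="postgresql"} == 1 and up{job="postgresql"} == 1
      - record: postgres_service
        expr: postgres_service_supervisord
      - record: postgres_service_supervisord
        expr: sum by (host, name) (node_supervisord_up{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx)"})
        labels:
          process_manager: supervisord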

Expected results

The following method might be used to test PostgreSQL metrics (for other methods look here):

curl "http://localhost:9090/monitoring/api/v1/query?query=postgres_healthy%20or%20postgres_service"

For an environment with a single database, the output might look similar to the one below. Notice that all metrics have a value of 1, which suggests (but does not guarantee) that the database service is healthy.

{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "postgres_healthy",
                    "host": "172.20.0.3",
                    "instance": "localhost:9187",
                    "job": "postgresql"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "nginx",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "node_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "postgres_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "postgresql-14",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "postgres_service",
                    "host": "172.20.0.3",
                    "name": "prometheus",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674055762.262,
                    "1"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}
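
If postgres_healthy is missing from the result, it can help to query the metrics it is built from directly, using the same endpoint as above (pg_up comes from postgres_exporter, while up is Prometheus’s own scrape-status metric):

curl "http://localhost:9090/monitoring/api/v1/query?query=pg_up%7Bjob%3D%22postgresql%22%7D%20or%20up%7Bjob%3D%22postgresql%22%7D"

A value of 0 for pg_up points at the PostgreSQL server itself, while a missing or 0 up series points at the postgres_exporter target or its scrape configuration.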

Message Queue (RabbitMQ compliant) monitoring

Requirements

These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):

  - the rabbitmq_prometheus plugin, providing metrics of the RabbitMQ server (listening on port 15692 by default),
  - node_exporter, providing metrics of the message queue host (listening on port 9100 by default).

RabbitMQ recording rules

Status Reporter queries Prometheus with the expression (rabbitmq_healthy) or (rabbitmq_service). These metrics are provided by Prometheus recording rules configured in the /etc/prometheus/alerts/rabbitmq.yml file:

- record: rabbitmq_healthy
  expr: sum by(host, instance, job, monitor) (up{job="rabbitmq"}) == 1
This rule is based on the up metric for the rabbitmq_prometheus plugin’s endpoint, which reflects the status of the RabbitMQ service. It should have a value of 1.
- record: rabbitmq_service
  expr: rabbitmq_service_supervisord
- record: rabbitmq_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx)"})
  labels:
    process_manager: supervisord
This rule returns the list of running services (as reported by node_exporter) that are relevant for determining the message queue node’s status. The rules above are meant for a system managed with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: rabbitmq_service
  expr: rabbitmq_service_systemd
- record: rabbitmq_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx).service", state="active"})
  labels:
    process_manager: systemd
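
Both variants of the service rules (here and in the PostgreSQL section above) rely on optional node_exporter collectors: node_supervisord_up comes from the supervisord collector and node_systemd_unit_state from the systemd collector, and neither collector is enabled by default. If the service metrics come back empty, check how node_exporter is started; a hedged example invocation (the supervisord XML-RPC URL is a placeholder):

# supervisord-managed node
node_exporter --collector.supervisord --collector.supervisord.url="http://localhost:9001/RPC2"
# systemd-managed node
node_exporter --collector.systemd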

Expected results

The following method might be used to test RabbitMQ metrics (for other methods look here):

curl "http://localhost:9090/monitoring/api/v1/query?query=rabbitmq_healthy%20or%20rabbitmq_service"
{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "rabbitmq_healthy",
                    "host": "172.20.0.3",
                    "instance": "localhost:15692",
                    "job": "rabbitmq"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "cloudify-rabbitmq",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "nginx",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "node_exporter",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            },
            {
                "metric": {
                    "__name__": "rabbitmq_service",
                    "host": "172.20.0.3",
                    "name": "prometheus",
                    "process_manager": "supervisord"
                },
                "value": [
                    1674062821.859,
                    "1"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}

Examples

Assume there are:

  - an external PostgreSQL database running on the host 172.20.0.5, with postgres_exporter listening on port 9187 and node_exporter on port 9100,
  - an external RabbitMQ broker running on the host 172.20.0.6, with the rabbitmq_prometheus plugin listening on port 15692 and node_exporter on port 9100.

In order to make the metrics available for Status Reporter:

  1. Make sure the PostgreSQL recording rules described above are configured in Prometheus (they rely on metrics from postgres_exporter and node_exporter).

  2. Make sure the RabbitMQ recording rules described above are configured in Prometheus (they rely on metrics from the rabbitmq_prometheus plugin and node_exporter).

  3. Consider adding the following contents to your configuration files:

/etc/prometheus/targets/other_postgres.yml
- targets: ["172.20.0.5:9187", "172.20.0.5:9100"]
  labels: {"host": "172.20.0.5"}
/etc/prometheus/targets/other_rabbits.yml
- targets: ["172.20.0.6:15692", "172.20.0.6:9100"]
  labels: {"host": "172.20.0.6"}