External services
The easiest way to monitor an external service with Cloudify’s Status Reporter is to add extra scraping targets to the existing Prometheus installation (e.g. the one running with Cloudify Manager). It is not the only way: an alternative is to configure an additional, external Prometheus instance (along with the appropriate exporter) that handles the service, and have it federated with the existing Prometheus. The latter is more complicated, but has its advantages (data residency, load distribution, etc.). This guide focuses on the former approach, which is the simplest and, in most cases, the best one.
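The federation approach is not covered further in this guide, but for orientation: the Prometheus instance running with Cloudify Manager would scrape the external instance’s /federate endpoint. A minimal sketch, in which the job name, the match[] selector and the target address are placeholders, not values used anywhere else in this guide:

# Illustrative only: pull selected series from an external Prometheus.
scrape_configs:
  - job_name: federate_external
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="postgresql"}'
    static_configs:
      - targets: ["external-prometheus.example.com:9090"]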
Prerequisites
To add metrics of external services to Cloudify’s Status Reporter, additional scraping targets must be configured. Prometheus’s file-based service discovery, for example, can be used for that.
All metrics parsed by Cloudify’s Status Reporter must carry a host label; this is how the reporter distinguishes between different nodes. The label’s value should match the host column of the db-nodes or brokers lists, or the private_ip of the managers list.
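For example, with file-based service discovery the host label can be attached directly in the targets file. A minimal sketch, in which 10.0.0.10 is a hypothetical database node listed in the host column of db-nodes:

# Hypothetical targets file entry; 10.0.0.10 stands in for a host
# that appears in the db-nodes list. Port 9187 is postgres_exporter,
# port 9100 is node_exporter.
- targets: ["10.0.0.10:9187", "10.0.0.10:9100"]
  labels: {"host": "10.0.0.10"}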
Database (PostgreSQL compliant) monitoring
Requirements
These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):
* Prometheus’s postgres_exporter, which monitors the external database, must be deployed. The exporter listens on port 9187 by default.
* Prometheus’s node_exporter, which monitors the status of system services (programs running on the node, managed by a process supervisor such as supervisord or systemd), must be deployed. The exporter listens on port 9100 by default.
PostgreSQL recording rules
Status Reporter queries Prometheus for the (postgres_healthy) or (postgres_service) expression. These metrics are configured as Prometheus recording rules in the /etc/prometheus/alerts/postgres.yml file:
- record: postgres_healthy
  expr: pg_up{job="postgresql"} == 1 and up{job="postgresql"} == 1
This rule combines two metrics from the postgres_exporter scrape: up monitors the status of the postgres_exporter itself, and pg_up that of the PostgreSQL server. Both metrics should have a value of 1.
- record: postgres_service
  expr: postgres_service_supervisord
- record: postgres_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx)"})
  labels:
    process_manager: supervisord
These rules return a list of running services (as reported by the node_exporter) which could be relevant for determining the database node’s status. The rules above are meant for a system operated with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: postgres_service
  expr: postgres_service_systemd
- record: postgres_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(etcd|patroni|node_exporter|prometheus|postgres_exporter|postgresql-14|nginx).service", state="active"})
  labels:
    process_manager: systemd
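Note that these rules only take effect if Prometheus actually loads the file. If your Prometheus configuration does not already pick up rule files from this directory, a rule_files entry along these lines is needed (the wildcard path below is an assumption based on the file location used above):

# Assumed location of the rule files; adjust to your installation.
rule_files:
  - /etc/prometheus/alerts/*.yml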
Expected results
The following method might be used to test PostgreSQL metrics (for other methods look here):
curl "http://localhost:9090/monitoring/api/v1/query?query=postgres_healthy%20or%20postgres_service"
For an environment with a single database, the output might be similar to the one below. Notice that all metrics have a value of 1, which suggests (but is not synonymous with) a healthy database service.
{
"data": {
"result": [
{
"metric": {
"__name__": "postgres_healthy",
"host": "172.20.0.3",
"instance": "localhost:9187",
"job": "postgresql"
},
"value": [
1674055762.262,
"1"
]
},
{
"metric": {
"__name__": "postgres_service",
"host": "172.20.0.3",
"name": "nginx",
"process_manager": "supervisord"
},
"value": [
1674055762.262,
"1"
]
},
{
"metric": {
"__name__": "postgres_service",
"host": "172.20.0.3",
"name": "node_exporter",
"process_manager": "supervisord"
},
"value": [
1674055762.262,
"1"
]
},
{
"metric": {
"__name__": "postgres_service",
"host": "172.20.0.3",
"name": "postgres_exporter",
"process_manager": "supervisord"
},
"value": [
1674055762.262,
"1"
]
},
{
"metric": {
"__name__": "postgres_service",
"host": "172.20.0.3",
"name": "postgresql-14",
"process_manager": "supervisord"
},
"value": [
1674055762.262,
"1"
]
},
{
"metric": {
"__name__": "postgres_service",
"host": "172.20.0.3",
"name": "prometheus",
"process_manager": "supervisord"
},
"value": [
1674055762.262,
"1"
]
}
],
"resultType": "vector"
},
"status": "success"
}
Message Queue (RabbitMQ compliant) monitoring
Requirements
These two services should be accessible from the node on which Prometheus is running (e.g. the node with Cloudify Manager installed):
* RabbitMQ’s rabbitmq_prometheus plugin must be enabled on the message queue nodes. The plugin/exporter listens on port 15692 by default.
* Prometheus’s node_exporter, which monitors the status of system services (programs running on the node, managed by a process supervisor such as supervisord or systemd), must be deployed. The exporter listens on port 9100 by default.
RabbitMQ recording rules
Status Reporter queries Prometheus for the (rabbitmq_healthy) or (rabbitmq_service) expression. These metrics are configured as Prometheus recording rules in the /etc/prometheus/alerts/rabbitmq.yml file:
- record: rabbitmq_healthy
  expr: sum by(host, instance, job, monitor) (up{job="rabbitmq"}) == 1
This rule is based on a simple metric from the rabbitmq_prometheus scrape, which monitors the status of the RabbitMQ service. It should have a value of 1.
- record: rabbitmq_service
  expr: rabbitmq_service_supervisord
- record: rabbitmq_service_supervisord
  expr: sum by (host, name) (node_supervisord_up{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx)"})
  labels:
    process_manager: supervisord
These rules return a list of running services (as reported by the node_exporter) which could be relevant for determining the message queue node’s status. The rules above are meant for a system operated with supervisord. If systemd is the process supervisor of choice, update the rules accordingly, e.g.:
- record: rabbitmq_service
  expr: rabbitmq_service_systemd
- record: rabbitmq_service_systemd
  expr: sum by (host, name) (node_systemd_unit_state{name=~"(node_exporter|prometheus|cloudify-rabbitmq|nginx).service", state="active"})
  labels:
    process_manager: systemd
Expected results
The following method might be used to test RabbitMQ metrics (for other methods look here):
curl "http://localhost:9090/monitoring/api/v1/query?query=rabbitmq_healthy%20or%20rabbitmq_service"
{
"data": {
"result": [
{
"metric": {
"__name__": "rabbitmq_healthy",
"host": "172.20.0.3",
"instance": "localhost:15692",
"job": "rabbitmq"
},
"value": [
1674062821.859,
"1"
]
},
{
"metric": {
"__name__": "rabbitmq_service",
"host": "172.20.0.3",
"name": "cloudify-rabbitmq",
"process_manager": "supervisord"
},
"value": [
1674062821.859,
"1"
]
},
{
"metric": {
"__name__": "rabbitmq_service",
"host": "172.20.0.3",
"name": "nginx",
"process_manager": "supervisord"
},
"value": [
1674062821.859,
"1"
]
},
{
"metric": {
"__name__": "rabbitmq_service",
"host": "172.20.0.3",
"name": "node_exporter",
"process_manager": "supervisord"
},
"value": [
1674062821.859,
"1"
]
},
{
"metric": {
"__name__": "rabbitmq_service",
"host": "172.20.0.3",
"name": "prometheus",
"process_manager": "supervisord"
},
"value": [
1674062821.859,
"1"
]
}
],
"resultType": "vector"
},
"status": "success"
}
Examples
Assume there are:
* an external PostgreSQL-compatible database running on host 172.20.0.5, which has a corresponding postgres_exporter monitoring it, running on the same host (172.20.0.5) and listening on its default port (9187), accompanied by a node_exporter listening on port 9100,
* an external RabbitMQ service with the rabbitmq_prometheus plugin enabled on host 172.20.0.6, serving metrics on the plugin’s default port (15692), with a node_exporter listening on port 9100.
In order to make the metrics available for Status Reporter:
1. Make sure the PostgreSQL recording rules (based on postgres_exporter metrics, see above) are configured.
2. Make sure the RabbitMQ recording rules (based on rabbitmq_prometheus metrics, see above) are configured.
3. Consider adding the following contents to your Prometheus configuration files:
/etc/prometheus/targets/other_postgres.yml
- targets: ["172.20.0.5:9187", "172.20.0.5:9100"] labels: {"host": "172.20.0.5"}
/etc/prometheus/targets/other_rabbits.yml
- targets: ["172.20.0.6:15692", "172.20.0.6:9100"] labels: {"host": "172.20.0.6"}