Manager Policies

Overview

Policies provide a way of analyzing a stream of events that correspond to a group of nodes (and their instances). The analysis occurs in real time and enables actions to be triggered based on its outcome.

The stream of events for each node instance is read from a deployment-specific RabbitMQ queue by the policy engine (Riemann). It can come from any source but is typically generated by a monitoring agent installed on the node’s host. (See: Diamond Plugin).

The analysis is performed using Riemann. Cloudify starts a Riemann core for each deployment, with its own Riemann configuration. The Riemann configuration is generated based on the Blueprint's groups section, which is described later in this guide.

Part of the API that Cloudify provides on top of Riemann enables policy triggers to be activated. Usually, that entails executing a workflow in response to a state; for example, if Tomcat CPU usage was above 90% for the last five minutes, the workflow might increase the number of node instances to handle the increased load.

The logic of the policies is written in Clojure using Riemann's API, on top of which Cloudify adds a thin API layer.
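
To give a sense of what this logic looks like, following is a minimal sketch of a Riemann stream written in Clojure, using only the plain Riemann API. The service name and threshold are illustrative; the exact functions that Cloudify's layer adds are described in the policies authoring guide.

;; A minimal Riemann stream (illustrative only): watch events whose
;; service field matches "cpu" and log when the metric exceeds a
;; threshold. Cloudify's generated configuration wires streams like
;; this to policy triggers.
(where (service #"cpu")
  (where (> metric 90)
    #(info "CPU threshold exceeded:" %)))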

The Nodecellar Example provides a great demonstration of how to apply Cloudify policies.

Using Policies

Policies are configured in the groups section of a Blueprint, as shown in the following example.

groups:
  # arbitrary group name
  my_group:

    # All nodes this group applies to
    # in this case, assume node_vm and mongo_vm
    # were defined in the node_templates section.
    # All events that belong to node instances of
    # these nodes are processed by all policies specified
    # within the group.
    members: [node_vm, mongo_vm]

    # Each group specifies a set of policies
    policies:

      # arbitrary policy name
      my_policy:

        # The policy type to use. Here, a built-in
        # policy type that identifies host failures
        # based on a keep-alive mechanism is used.
        type: cloudify.policies.types.host_failure

        # policy specific configuration
        properties:
          # This is not really a host_failure property, it only serves as an example.
          some_prop: some_value

        # This section specifies the triggers to activate
        # when the policy is triggered
        # (more than one trigger can be specified)
        triggers:
          # arbitrary trigger name
          execute_heal_workflow:

            # Using the built-in 'execute_workflow' trigger
            type: cloudify.policies.triggers.execute_workflow

            # Trigger-specific configuration
            parameters:
              workflow: heal
              workflow_parameters:
                # The failing node instance's ID is exposed by the
                # host_failure policy on the event that triggered the
                # execute_workflow trigger. Properties on that event
                # are accessed as demonstrated below.
                node_instance_id: { get_property: [SELF, node_id] }

Built-in Policies

Cloudify includes a number of built-in policies that are declared in types.yaml, which is usually imported either directly or indirectly via other imports. See Built-in Policies for more information.
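
For example, a typical blueprint makes these policy types available with an import at the top of the file, similar to the following. The URL shown is for the Cloudify 3.2 spec and is only illustrative; use the one that matches your Cloudify version.

imports:
  - http://www.getcloudify.org/spec/cloudify/3.2/types.yaml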

Writing a Custom Policy

If you are an advanced user, you might want to create custom policies. For more information, see policies authoring guide.
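
As a taste of what such a policy looks like, following is a minimal sketch. It assumes that policy properties are injected into the Clojure source as template parameters (here, {{service}} and {{threshold}}) and that a process-policy-triggers stream is available for activating the policy's configured triggers; verify both assumptions against the authoring guide for your Cloudify version.

;; Sketch of a custom policy (the template parameters and the
;; process-policy-triggers stream are assumptions; see the note
;; above). When a watched metric rises above the configured
;; threshold, activate the policy's configured triggers.
(where (service #"{{service}}")
  (where (> metric {{threshold}})
    process-policy-triggers))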

Auto Healing

Auto healing is the process of automatically identifying when a certain component of your system is failing and fixing the system state without any user intervention. Cloudify includes built-in support for auto healing different components.

Limitation

Currently, auto healing can be configured only for compute nodes. If it were configured for other node types, the failure of a compute node instance would likely cause the node instances contained in it to fail as well, which in turn could cause the heal workflow to be triggered multiple times.

There are several items that must be configured for auto healing to work, as described in this topic. The process involves:

- Monitoring
- Workflow Configuration
- Groups Configuration

Monitoring

To enable auto healing, you must add monitoring to the relevant compute nodes.

This example uses the built-in Diamond support that is bundled with Cloudify. See the Diamond Plugin documentation for more information about configuring Diamond-based monitoring.

node_templates:
  some_vm:
    interfaces:
      cloudify.interfaces.monitoring:
        start:
          implementation: diamond.diamond_agent.tasks.add_collectors
          inputs:
            collectors_config:
              ExampleCollector: {}
        stop:
          implementation: diamond.diamond_agent.tasks.del_collectors
          inputs:
            collectors_config:
              ExampleCollector: {}

The ExampleCollector generates a single metric that constantly has the value 42. This collector is used as a “heartbeat” collector, together with the host_failure policy described below.

Workflow Configuration

The heal workflow executes a sequence of tasks that is similar to calling the uninstall workflow, followed by the install workflow. The main difference is that the heal workflow operates on the subset of node instances that are contained within the failing node instance’s compute, and on the relationships these node instances have with other node instances. The workflow reinstalls the entire compute that contains the failing node and handles all appropriate relationships between the node instances inside the compute and the other node instances.
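
The heal workflow can also be started manually from the CLI. The following invocation is illustrative; the deployment name and node instance ID are placeholders, and the exact flags can vary between CLI versions:

cfy executions start -w heal -d my_deployment -p '{"node_instance_id": "some_vm_afd34"}'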

Groups Configuration

After monitoring is configured, groups must be configured. The following example contains a number of inline comments; it is recommended that you read them to ensure that you have a good understanding of the process.

groups:
  some_vm_group:
    # Adding the some_vm node template that was previously configured
    members: [some_vm]
    policies:
      host_failure_policy:
        # Using the 'host_failure' policy type
        type: cloudify.policies.types.host_failure

        # The name of the service to watch, shortlisted using regular
        # expressions. Every Diamond event has its service field set to
        # some value; in this case, the ExampleCollector sends events
        # with this value set to "example".
        properties:
          service:
            - example

        triggers:
          heal_trigger:
            # Using the 'execute_workflow' policy trigger
            type: cloudify.policies.triggers.execute_workflow
            parameters:
              # Configuring this trigger to execute the heal workflow
              workflow: heal

              # The heal workflow gets its parameters 
              # from the event that triggered its execution              
              workflow_parameters:
                # 'node_id' is the node instance ID
                # of the node that failed. In this example it is
                # something similar to 'some_vm_afd34'
                node_instance_id: { get_property: [SELF, node_id] }

                # Contextual information added by the triggering policy
                diagnose_value: { get_property: [SELF, diagnose] }

In this example, a group is configured that consists of the some_vm node specified earlier, and a single host_failure policy is configured for the group. The policy has one execute_workflow policy trigger configured, which is mapped to execute the heal workflow.