Failure Recovery

Failure Recovery

The Whole Cluster is Down or Working Wrong

  1. Save /etc/cloudify/ssl/* files
  2. Teardown managers
  3. Install fresh managers with existing certificates in /etc/cloudify/config.yaml
  4. Create and join cluster
  5. Apply the latest working version snapshot on the active manager

One Manager Cluster Node Down

  1. Remove the manager from the cluster
  2. Destroy manager
  3. Bootstrap fresh manager
  4. Join existing cluster

Effect: Healthy manager cluster

Active Manager Node Down

  1. Another healthy manager from the cluster automatically becomes an active manager.
  2. Investigate error:
  3. Either:
    • Fix problem
    • Destroy manager
      1. Install manager
      2. Join cluster

Effect: Healthy manager cluster

Split Brain

This situation happens when for a while there is no connectivity between managers. Then each of them thinks that other managers are unhealthy and becomes a master. After connectivity is back master becomes the only one in the cluster. It’s chosen based on the newest version of the PostgreSQL database. All data from other managers will be synced with the active one and others will become standbys. All data/ installed deployments/ plugins will get lost.