Failure Recovery

The Whole Cluster is Down or Working Wrong

Save /etc/cloudify/ssl/* files
Teardown managers
Install fresh managers with existing certificates in /etc/cloudify/config.yaml
Create and join cluster
Apply the latest working version snapshot on the active manager

One Manager Cluster Node Down

Remove the manager from the cluster
Destroy manager
Bootstrap fresh manager
Join existing cluster

Effect: Healthy manager cluster

Active Manager Node Down

Another healthy manager from the cluster automatically becomes an active manager.
Investigate error:
Either:
- Fix problem
- Destroy manager
  1. Install manager
  2. Join cluster

Effect: Healthy manager cluster

Split Brain

This situation happens when for a while there is no connectivity between managers. Then each of them thinks that other managers are unhealthy and becomes a master. After connectivity is back master becomes the only one in the cluster. It’s chosen based on the newest version of the PostgreSQL database. All data from other managers will be synced with the active one and others will become standbys. All data/ installed deployments/ plugins will get lost.