Failure Recovery
The Whole Cluster Is Down or Malfunctioning
- Save the /etc/cloudify/ssl/* files
- Tear down the managers
- Install fresh managers, pointing /etc/cloudify/config.yaml at the existing certificates
- Create and join the cluster
- Apply the latest working snapshot on the active manager
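As a sketch of the third step, the certificate files saved earlier can be referenced from /etc/cloudify/config.yaml during the fresh install. The key names and paths below are assumptions; verify them against the config.yaml shipped with your Cloudify version:

```yaml
# Hypothetical fragment of /etc/cloudify/config.yaml for a fresh install
# that reuses the certificates saved from the old cluster.
# The ssl_inputs key names and the backup paths are assumptions --
# check your version's config.yaml template before relying on them.
ssl_inputs:
  internal_cert_path: /backup/cloudify-ssl/cloudify_internal_cert.pem
  internal_key_path: /backup/cloudify-ssl/cloudify_internal_key.pem
  external_cert_path: /backup/cloudify-ssl/cloudify_external_cert.pem
  external_key_path: /backup/cloudify-ssl/cloudify_external_key.pem
  ca_cert_path: /backup/cloudify-ssl/cloudify_internal_ca_cert.pem
```

Reusing the original certificates is what lets agents and other cluster members keep trusting the rebuilt managers without being reconfigured.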
One Manager Cluster Node Down
- Remove the manager from the cluster
- Destroy the manager
- Bootstrap a fresh manager
- Join it to the existing cluster
Effect: Healthy manager cluster
Active Manager Node Down
- Another healthy manager from the cluster automatically becomes the active manager.
- Investigate the error and either:
  - Fix the problem, or
  - Destroy the manager, install a fresh one, and join it to the cluster
Effect: Healthy manager cluster
Split Brain
This situation occurs when connectivity between the managers is lost for a while. Each manager then assumes the others are unhealthy and promotes itself to master. Once connectivity is restored, a single master remains in the cluster; it is chosen based on which node has the newest version of the PostgreSQL database. The other managers are resynced from the active one and become standbys, so any data, installed deployments, or plugins written only to the demoted managers during the partition is lost.
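The resolution rule above can be illustrated with a small sketch. This is not Cloudify's actual implementation; the `db_lsn` counter is a stand-in for "how far this node's PostgreSQL database has advanced":

```python
# Illustrative sketch of split-brain resolution (not Cloudify's real code):
# after the partition heals, the node with the newest database state wins
# and every other node is demoted to a standby.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    db_lsn: int  # hypothetical counter for how advanced the node's DB is


def elect_master(nodes):
    """Pick the node whose database is newest; the rest become standbys."""
    master = max(nodes, key=lambda n: n.db_lsn)
    standbys = [n for n in nodes if n is not master]
    return master, standbys


nodes = [Node("manager-1", 120), Node("manager-2", 340), Node("manager-3", 200)]
master, standbys = elect_master(nodes)
print(master.name)  # prints: manager-2
# Writes that landed only on manager-1 or manager-3 during the partition
# are discarded when those nodes resync from manager-2.
```

The key takeaway is that the election keeps exactly one copy of the data: whatever was written only to the losing nodes while the cluster was partitioned does not survive the resync.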