Workflow Error Handling
When an error is raised from the workflow itself, the workflow execution will fail - it will end with
failed status, and should have an error message under its
error field. There is no built-in retry mechanism for the entire workflow.
However, there’s a retry mechanism for task execution within a workflow. Two types of errors can occur during task execution: Recoverable and NonRecoverable. By default, all errors originating from tasks are *Recoverable*. The maximum number of retries for workflow operations is 60, with retries occuring at 15 second intervals for a maximum of 15 minutes.
If a NonRecoverable error occurs, the workflow execution will fail, similarly to the way described for when an error is raised from the workflow itself.
If a Recoverable error occurs, the task execution might be attempted again from its start. This depends on the configuration of the
max_retries parameters, which determines how many retry attempts will be given by default to any failed task execution.
max_retries parameters can be set in one of the following manners:
If the operation
max_retriesparameter has been set for a certain operation, it will be used.
When installing the manager, the
task_retriesparameter is a configuration parameter in the
provider_contextsection of the config.yaml file.
If the parameter is not set, it will default to the value of
-1, which means maximum retries (i.e. 60).
In addition to the
task_retries parameter, there’s also the
retry_interval parameter, which determines the minimum amount of wait time (in seconds) after a task execution fails before it is retried. It can be set in the very same way
max_retries are set. If it isn’t set, it will default to the value of
Lifecycle Retries (Experimental)
In addition to task retries, there is a mechanism that allows retrying a group of operations. This mechanism is used by the built-in
heal workflows. By default it is turned off. To enable it, set the
subgraph_retries parameter in the manager blueprint
manager_configuration node template under the
workflows property to some positive value (or
-1 for infinite subgraph retries). The parameter is named
subgraph_retries because the mechanism is implemented using the subgraphs feature of the workflow framework.
The following example demonstrates how this feature is used by the aforementioned built-in workflows.
Consider the case where some
cloudify.nodes.Compute node template is used in a blueprint to create a VM. The sequence of operations used to create, configure and start the VM will most likely be mapped using the node type’s
start operations, respectively; mapping the operations to some IaaS plugin implementation.
create operation may be implemented in such way, that it makes an API call to the relevant IaaS to create the VM. The
start operation may be implemented in such way, that it waits for the the VM to be in some started state and have a private IP assigned to it. In such implementation, it is possible that the API call to create the VM was successful but the VM itself started in some malformed manner (e.g. no IP was assigned to it).
The task retries mechanism alone, may not be sufficient to fix this problem, as simply retrying the
start operation will not change the VM’s corrupted state. A possible solution in this case, is to run the
delete operations of the
cloudify.interfaces.lifecycle interface and then re-run the
start again in hope that the new VM will be created in a valid state.
This is exactly what the lifecycle retry mechanism does. Once the number of attempts to execute a lifecycle operation (
start in the example above) exceeds
1 + task_retries, the lifecycle retry mechanism kicks in. If
subgraph_retries is set to a positive number (or
-1 for infinity), a lifecycle retry is performed, which in essence means: run “uninstall” on the relevant node instance and then run “install” on it.
Similarly to the
task_retries parameters, the
subgraph_retries parameter affects the number of lifecycle retries attempted before failing the entire workflow.
The lifecycle retry API is marked as experimental. This is mostly because there is no differentiation between NonRecoverable errors that are raise due to some bad configuration, i.e. there is no point of retrying the node lifecycle, to NonRecoverable errors that are raise due to some corrupted state that may be fixed by the lifecycle retry mechanism. In both cases, a lifecycle retry will be attempted.
Due to that, its semantics may change in the future, once an API to make that differentiation is introduced.