Recovery

Errors can happen at different levels, hence you want to be able to recover from different points. If the job contains objects with invalid syntax, e.g. one column has a mapping but the formula is wrong or missing, the job simply does not start. The first thing the jobserver does is read all required objects and perform a syntax check on the programs, so if there is a problem you will know right away.
Another kind of error is that a Workflow or Dataflow could not start, e.g. because the source database is not available.
Another kind of error could be that the dataflow started, processed half the data, loaded those rows into the target and then an exception was raised, e.g. the target database said that no more space is available.
The goal of the recovery feature is to restart at exactly the point where the problem occurred. While it is no problem to start an entire dataflow again, recovering from within a dataflow is almost impossible. If we could guarantee that the source database returned the exact same rows again, in the exact same order, and if we saved all internal memory structures to disk, then yes, recovery would be possible - and every single dataflow would be unbearably slow. If we processed the data like the old ETL tools did - read the source and save the intermediate result to a file, read that file, perform the first transformation step and save the result to another file, read that file, and so on until the data is loaded into the target - then yes, a recovery would be possible as well.
So instead of building something that is either unreliable or slow or both, with our recovery feature it is the end user's responsibility to design the flow so that it deals with half-loaded tables. Often the dataflow will be built that way already anyway, e.g. it truncates the entire target table before loading it, or the table is loaded via a Table Comparison transform (or similar). In all those cases the dataflow is built "recovery aware" - something we asked for in the Rapid Mart Development Guidelines (Supports the Recovery Feature).
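To make the idea of a "recovery aware" dataflow concrete, here is a minimal sketch of the delete-then-insert pattern in plain Python - this is not Data Services script, and the table and column names (SALES_TARGET, LOAD_DATE, ...) are made up for illustration. The point is that rerunning the step after a partial load leaves the target in the same state as a clean run.

```python
# Conceptual sketch of an idempotent ("recovery aware") load step.
# Table and column names are hypothetical; sqlite3 stands in for the target database.
import sqlite3

def load_partition(conn: sqlite3.Connection, load_date: str, rows: list[tuple]) -> None:
    """Delete-then-insert makes this step safe to rerun after a partial load."""
    cur = conn.cursor()
    # Remove whatever a previous, possibly failed, run already loaded for this date.
    cur.execute("DELETE FROM SALES_TARGET WHERE LOAD_DATE = ?", (load_date,))
    # Insert the full set again; running this twice yields the same end state.
    cur.executemany(
        "INSERT INTO SALES_TARGET (LOAD_DATE, PRODUCT, AMOUNT) VALUES (?, ?, ?)",
        [(load_date, product, amount) for (product, amount) in rows],
    )
    conn.commit()
```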
In order to use the recovery feature, all you have to do is execute the job with the "Enable Recovery" flag turned on.

With that, for each object we store in the repository whether it was started and whether it completed successfully (tables AL_ROLLFORWARD and AL_RF_INFO).

If the job fails somewhere, all you have to do is execute it again. You will see that the "recover from last failed execution (point)" checkbox is already selected.

When you start it, all objects that have been successfully executed will be skipped and all others will be started from scratch.
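Conceptually, the engine keeps a per-object checkpoint and consults it on the recovery run. The sketch below illustrates that behaviour with a plain Python set; it is not the actual schema of AL_ROLLFORWARD/AL_RF_INFO, and all names in it are made up.

```python
# Conceptual sketch of the bookkeeping behind "Enable Recovery".
# The real engine persists this in the repository; a set stands in for it here.
completed: set[str] = set()  # names of objects that finished successfully last run

def run_object(name: str, action, recover: bool) -> None:
    if recover and name in completed:
        print(f"skipping {name}: already completed in the previous run")
        return
    action()                 # actually run the workflow/dataflow
    completed.add(name)      # checkpoint only after success

def run_job(objects, recover: bool = False) -> None:
    for name, action in objects:
        run_object(name, action, recover)
```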

Some not-so-obvious things:
  • Values of variables: What should be the values of the variables in case a job is restarted? As we want to complete the job, the values have to be set to what they were at that point in time. So all variable values are saved and automatically reloaded when the job is executed in "recover from last failed execution" mode.
  • Recover as a unit: Imagine a scenario where multiple objects form one unit of work that has to be executed together. For example, a first dataflow deletes the target table rows that will then be loaded by a second dataflow with plain inserts. If the delete succeeds but the insert dataflow fails partway, the delete would not be started again on recovery and rerunning the inserts would load duplicates. To deal with these kinds of cases you can place both dataflows in a workflow and mark that workflow as "recover as a unit". No matter where inside the workflow the failure happened, all objects inside it will be executed again. So the only way this workflow is skipped in recovery mode is if all objects inside it have been successfully executed (see the sketch after this list).
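The sketch below extends the previous one to show the "recover as a unit" semantics under the same assumptions (an in-memory set as the checkpoint store, made-up function names): the unit is skipped only when every member object completed, otherwise all members run again.

```python
# Conceptual sketch of "recover as a unit": the whole workflow is either skipped
# (all members completed in the previous run) or re-executed from the beginning.
def run_unit(unit_name: str, objects, completed: set[str], recover: bool) -> None:
    members = [name for name, _ in objects]
    if recover and all(name in completed for name in members):
        print(f"skipping unit {unit_name}: all members completed in the previous run")
        return
    for name, action in objects:
        action()              # rerun every member, even previously completed ones
        completed.add(name)
```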
