Guaranteed jBPM Failure Handling

jBPM 3.x allows you to define exception handlers in .jpdl. You can declare the exceptions you want to catch and either recover from them or allow the execution to fail. These exception handlers are functionally incomplete, though. In my last blog, I talked about how asynchronous continuations give you the ability to have guaranteed transitions. If you dive into the asynchronous code, though, you see that exception handlers are executed within the same transaction as the command service that triggers the node execution. What does this mean? What are the consequences?

Let’s say that your action logic runs into an error condition and needs to roll back any database changes it has made. Since the transaction is marked for rollback, you cannot make any lasting changes to the execution context of the business process, such as transitioning to a failure state. Any failure changes you make to the execution context would be rolled back along with everything else. Another scenario is the guaranteed transitions I talked about in my last blog. On a rollback, what if you want your business process to decide whether the node execution should be tried again, instead of relying on the simple logic of the MDB’s retry count? We talked a bit about these problems on the jBPM forum. Let me summarize some of the solutions that were discussed.

Execute in a separate transaction

One approach is to invoke the non-jBPM code in its own transaction. Your jBPM action could just delegate to an EJB or Spring bean that uses REQUIRES_NEW semantics. If the invocation succeeds, populate the appropriate jBPM variables and move on. If it fails, transition to a failure state or abort and retry the action.

I don’t think this is a good approach because you wouldn’t be able to use frameworks like Seam, which make integrating business processes and business logic much easier. Seam uses annotations to biject (push data to and from) jBPM context data into your beans, freeing you from writing a lot of jBPM-specific code.

More importantly, I don’t think this is a solid solution. For instance, what if your non-jbpm code is successful and commits, but the machine crashes before the jbpm context commits? Then your system ends up in an inconsistent state and you are screwed.
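
To make that crash window concrete, here is a minimal plain-Java simulation. It is not jBPM or EJB code: the `Store` class and the `run` method are hypothetical stand-ins for two resources that commit independently, namely the business database (committed by the REQUIRES_NEW call) and the jBPM context (committed by the outer transaction).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transactional resource: writes only
// become durable when commit() is called.
final class Store {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    void write(String value) {
        pending.add(value);
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    List<String> committed() {
        return committed;
    }
}

final class SeparateTransactionDemo {
    // Simulates one action execution. When crashBeforeOuterCommit is true,
    // the "machine" dies after the inner commit but before the outer one.
    // Returns true when both resources ended up in agreement.
    static boolean run(boolean crashBeforeOuterCommit) {
        Store business = new Store(); // committed by the REQUIRES_NEW call
        Store process = new Store();  // committed by the jBPM transaction

        // Inner transaction: the business work commits on its own.
        business.write("business-work-done");
        business.commit();

        // Outer transaction: record the outcome in the process state.
        process.write("moved-to-next-state");
        if (!crashBeforeOuterCommit) {
            process.commit();
        }

        // Consistent only if both resources made their write durable.
        return business.committed().size() == process.committed().size();
    }
}
```

Calling `SeparateTransactionDemo.run(true)` leaves the business write committed while the process write is lost, which is exactly the inconsistent state described above.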

Have a Failure State

A better way to solve this is to define an explicit transition to a failure state in your process definition:

<process-definition>
   <start-state>
     <transition to='business-state-1'/>
   </start-state>
   <state name='business-state-1' async='true'>
     ...
     <transition name="failure" to='failure-state-1'/>
   </state>
   <state name='failure-state-1' async='true'>
     ...
   </state>

...
</process-definition>

Your failure processing would be encapsulated within actions of the failure state instead of within exception handlers in the original business state. The failure state should be asynchronous because we want guaranteed processing of this state. We need to ensure that either a) the business-state-1 state gets processed, or b) the failure state gets processed. I already discussed in a previous blog how to guarantee (a), but (b) needs some more thought.

Guaranteed failure handling: Dead Letter Queue Processing

One way to do guaranteed failure handling would be to take advantage of dead letter queues. Most MDB containers allow you to specify how many times you want a message redelivered. Once the redelivery count is reached and processing still fails, the container reroutes the message to a dead letter queue (DLQ). This queue is a normal queue that you can listen on. Although it is not a standard JMS feature, I looked into it a little, and at least JBoss, ActiveMQ, and MQSeries have the notion of a DLQ.
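
The redelivery behavior described above can be sketched in a few lines of plain Java. This is only a model, not any container’s actual implementation: `RedeliverySimulation` and `deliver` are hypothetical names, and the assumption that the limit counts redeliveries after the first attempt (rather than total deliveries) varies between providers.

```java
import java.util.Queue;
import java.util.function.Consumer;

final class RedeliverySimulation {
    // Delivers a message to the listener. A RuntimeException stands in for
    // a rolled-back transaction and triggers a redelivery. Once the first
    // attempt plus maxRedeliveries retries have all failed, the message is
    // moved to the dead letter queue. Returns the number of attempts made.
    static int deliver(String message,
                       Consumer<String> listener,
                       int maxRedeliveries,
                       Queue<String> deadLetterQueue) {
        int attempts = 0;
        while (attempts <= maxRedeliveries) {
            attempts++;
            try {
                listener.accept(message);
                return attempts; // processed and acknowledged
            } catch (RuntimeException rollback) {
                // Transaction rolled back: fall through and redeliver.
            }
        }
        deadLetterQueue.add(message);
        return attempts;
    }
}
```

With `maxRedeliveries` set to 3, a listener that always fails sees the message four times before it lands on the dead letter queue, where a second listener can pick it up.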

So, to do jBPM failure handling, you could just register a new MDB to listen on the dead letter queue. Asynchronous continuations post instances of org.jbpm.job.ExecuteNodeJob. The MDB would check whether the message refers to an org.jbpm.job.ExecuteNodeJob and, if it does, check the node for a “failure” transition. If one exists, it signals the process instance to traverse that “failure” transition.

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;

import org.jbpm.JbpmConfiguration;
import org.jbpm.JbpmContext;
import org.jbpm.db.JobSession;
import org.jbpm.job.ExecuteNodeJob;
import org.jbpm.job.Job;

@MessageDriven(activationConfig = {
      @ActivationConfigProperty(propertyName="destinationType", propertyValue="javax.jms.Queue"),
      // only pick up dead messages that carry a jBPM job id
      @ActivationConfigProperty(propertyName="messageSelector", propertyValue="jobId IS NOT NULL"),
      @ActivationConfigProperty(propertyName="destination", propertyValue="queue/DLQ")})
public class DLQFailureHandler implements MessageListener
{
    JbpmConfiguration jbpmConfiguration = null;
    @Resource(name="JbpmCfgResource") String jbpmCfgResource = null;

    @PostConstruct
    public void initJbpm() {
      jbpmConfiguration = JbpmConfiguration.getInstance(jbpmCfgResource);
    }

    public void onMessage(Message message) {
        long jobId = 0;
        try {
            jobId = message.getLongProperty("jobId");
        } catch (JMSException ignored) {
            // message selector confirms existence
        }

        JbpmContext jbpmContext = jbpmConfiguration.createJbpmContext();
        try {
            JobSession jobSession = jbpmContext.getJobSession();
            Job job = jobSession.loadJob(jobId);
            if (!(job instanceof ExecuteNodeJob)) {
                // do some other failure handling;
                // the finally block closes the context
                return;
            }
            ExecuteNodeJob executeNodeJob = (ExecuteNodeJob)job;
            if (executeNodeJob.getNode().hasLeavingTransition("failure")) {
                // take the "failure" transition and remove the dead job
                executeNodeJob.getProcessInstance().signal("failure");
                jobSession.deleteJob(job);
            }
            else {
                // push to a different DLQ?
            }
        } finally {
            jbpmContext.close();
        }
    }
}

To make sure we only process dead jBPM jobs, we use the message selector configured in the messageSelector activation property. The initJbpm() method looks up and configures jBPM. In onMessage(), we read the jobId property from the message and load the Job. If the Job is not an ExecuteNodeJob, you’ll want to add specific handling there or forward the message to another DLQ. Next, we check whether the node has a “failure” transition. If it does, we take the transition and delete the job. If there is no “failure” transition, you might want to repost the message to a different DLQ that you create. If this handler itself fails, the DLQ message is redelivered, so we are still guaranteed to reprocess the failure logic.

The advantage of this approach is that it is totally seamless to your application, both in code and in your process diagram. You are guaranteed to end up in either a successful or a failure jBPM state (at least when using the JBoss JMS inflow adapter, but you should probably be OK with other implementations). The disadvantage of this approach is that you have no knowledge of the error condition and can’t report it to the “failure” state. Your failure logic will not know the cause of the failure. This isn’t always an issue because sometimes, specifically in a machine crash, you may not know why there is a failure anyway.

Guaranteed failure handling: Transaction Synchronization

Another idea that came up in the forum discussion was to register a javax.transaction.Synchronization with the Transaction Manager. In afterCompletion(), do the transition to a failure state if there was a rollback. The advantage of such an approach over DLQ is that you can programmatically add the behavior in action code or write an action that performs the TM synchronization. Because of this, you can propagate any exception or failure reason to the failure state. I’m not going to get into details of this approach as I am not sure it will work, specifically:

You must set your redelivery count to 0. The synchronization has no way of knowing whether or not the message had been redelivered or whether or not you should transition the token to the failure state with jBPM code out-of-the-box. Since you don’t want the original JMS message redelivered after you have transitioned to the failure state, you need a retry count of zero.
JMS providers may not be able to guarantee that a message isn’t redelivered. Since we require a redelivery count of zero, the approach is not an option with such providers.
Your machine could crash before transitioning to the failure state. Since we cannot redeliver the message, the node will never transition to the failure state. You may say, “well, can’t the same thing happen with the DLQ option?”. With the JBoss Inflow adapter, no. The JBoss adapter performs the DLQ check within the same transaction as the delivered method. Since both the sending to the DLQ and acknowledgement of the message are within the same transaction, you are guaranteed of one or the other succeeding.
Probably the most important thing is that Synchronization.afterCompletion() is not guaranteed to be executed on transaction recovery after a machine crash.
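
For reference, the synchronization idea discussed above could look roughly like the sketch below. To keep it self-contained, `TxSynchronization` is a local stand-in with the same shape as javax.transaction.Synchronization, and the two status constants copy the values of javax.transaction.Status. The `FailureTransitionSync` class and its `failureTransition` callback are hypothetical. In a real deployment the callback would presumably have to run in a new transaction, since the original one has already rolled back.

```java
import java.util.function.Consumer;

// Local stand-in mirroring javax.transaction.Synchronization.
interface TxSynchronization {
    void beforeCompletion();
    void afterCompletion(int status);
}

// Hypothetical synchronization that fires a failure transition on rollback
// and hands the original failure cause to it.
final class FailureTransitionSync implements TxSynchronization {
    // Same values as javax.transaction.Status.
    static final int STATUS_COMMITTED = 3;
    static final int STATUS_ROLLEDBACK = 4;

    private final Throwable cause;
    private final Consumer<Throwable> failureTransition;

    FailureTransitionSync(Throwable cause, Consumer<Throwable> failureTransition) {
        this.cause = cause;
        this.failureTransition = failureTransition;
    }

    @Override
    public void beforeCompletion() {
        // Nothing to do before the transaction completes.
    }

    @Override
    public void afterCompletion(int status) {
        // Only a rollback should move the process to the failure state.
        if (status == STATUS_ROLLEDBACK) {
            failureTransition.accept(cause);
        }
    }
}
```

The sketch shows why this approach can propagate the failure reason, which the DLQ handler cannot. It does nothing to address the crash and redelivery problems listed above.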

So, the winner is DLQ! Of course I may have missed something. This approach would be better if we could work out the issues.

Side Effect: Redelivery handling

In my previous blog on Guaranteed Execution in jBPM, I discussed at the end how jBPM actions currently have no way of knowing whether or not they have been retried by a new transaction. An indirect consequence of the DLQ approach to failure handling is that we now have some ability to recognize and keep track of redeliveries.

<process-definition>
   <start-state>
     <transition to='business-state-1'/>
   </start-state>
   <state name='business-state-1' async='true'>
     ...
     <transition name="failure" to='failure-state-1'/>
   </state>
   <decision name='failure-state-1' async='true'>
     <transition name="recover" to='recovery-state-1'>
        <condition expression="#{retryCount >= 3}"/>
     </transition>
     <transition name="retry" to='business-state-1'>
        <condition expression="#{retryCount &lt; 3}"/>
     </transition>
   </decision>

...
</process-definition>

Since failures result in a transition to a failure state, we can increment and store a redelivery count within the token. We can then programmatically decide whether or not we are allowed to retry the message.
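
As a rough illustration of that bookkeeping, the plain-Java sketch below increments a counter and picks the transition name using the same thresholds as the decision node above. The `RetryDecision` class is hypothetical, and the `Map` is only a stand-in for the token’s process variables. In jBPM the increment would live in an action on the failure transition or node.

```java
import java.util.Map;

final class RetryDecision {
    static final int MAX_RETRIES = 3;

    // Increments the "retryCount" variable and returns the name of the
    // transition the decision node would take: "retry" while the count
    // is below MAX_RETRIES, "recover" once it reaches it.
    static String onFailure(Map<String, Object> variables) {
        Object current = variables.get("retryCount");
        int retryCount = (current == null ? 0 : (Integer) current) + 1;
        variables.put("retryCount", retryCount);
        return retryCount < MAX_RETRIES ? "retry" : "recover";
    }
}
```

Starting from an empty variable map, the first two failures yield "retry" and the third yields "recover".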

Conclusion

So that’s how you can have guaranteed failure handling. I do have a disclaimer though: I’m too lazy to actually try out these ideas. When I do, I’ll get back to you in a follow-up blog. Please try it out yourself and provide feedback on the jBPM forum topic. We’re still working out the issues here and your thoughts would be most appreciated.


Brad Davis
Mar 24, 2009 @ 23:23:42

Hey Bill; was the MDB approach settled to be the proper solution in 3.2.3GA+ environments?

Thanks for the great article too. Do you know if there is good documentation available for the different job types so that they can be handled in the DLQ MDB?
