[TAV-709] T2 Enactment Error with Paul's workflow Created: 2008-01-23  Updated: 2010-02-01  Resolved: 2008-01-31

Status: Resolved
Project: myGrid
Component/s: None
Affects Version/s: 1.7
Fix Version/s: 1.7.1

Type: Bug Priority: Critical
Reporter: Stuart Owen Assignee: David Withers (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to T3-512 Timer already cancelled from Monitor Resolved

 Description   

There is a workflow of Paul's attached to TAV-706 (phenotype_to_pubmed.xml, which takes the input "african trypanosomiasis AND mouse") that fails to run in T2, although its processor types and construction indicate it should. The workflow "sticks" when the queue size of the nested workflow is 89. I've found this is consistent when first running the workflow, but not necessarily after clicking Reset and re-running.

It is very difficult to determine what the problem is because of the lack of decent error reporting and monitoring.



 Comments   
Comment by David Withers (Inactive) [ 2008-01-24 ]

This seems to be a problem with the monitor. I'm getting lots of stack traces similar to the one below.

Exception in thread "net.sf.taverna.t2.workflowmodel.processor.dispatch.events.DispatchJobEvent@3f0f86" 
  java.lang.IllegalStateException: Timer already cancelled. 
  at java.util.Timer.sched(Timer.java:354) 
  at java.util.Timer.schedule(Timer.java:170) 
  at net.sf.taverna.t2.monitor.impl.MonitorImpl.deregisterNode(MonitorImpl.java:136) 
  at net.sf.taverna.t2.workflowmodel.processor.dispatch.layers.Invoke$1.receiveResult(Invoke.java:196) 
  at net.sf.taverna.t2.activities.wsdl.WSDLActivity$1.run(WSDLActivity.java:142) 
  at java.lang.Thread.run(Thread.java:613) 

If I turn monitoring off, this workflow completes and I get the same results as in Taverna 1.

Comment by David Withers (Inactive) [ 2008-01-24 ]

The sequence of events that causes this is:

  1. The invoke layer calls MonitorImpl.deregisterNode() which schedules nodeRemovalTimer to call monitorTree.removeNodeFromParent(nodeToRemove).
  2. monitorTree.removeNodeFromParent() throws IllegalArgumentException: node does not have a parent; this kills the timer thread.
  3. The next call to MonitorImpl.deregisterNode() results in nodeRemovalTimer.schedule() throwing IllegalStateException: Timer already cancelled.
  4. This exception propagates back to the invoke layer so the activity invocation doesn't happen.
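
The sequence above can be reproduced outside Taverna with a plain java.util.Timer. The following is a minimal sketch, not the MonitorImpl code: the class name TimerAlreadyCancelledDemo and the method reproduce() are made up for illustration, and the first task simply throws the same exception type that monitorTree.removeNodeFromParent() is reported to throw.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-alone demo; none of these names come from Taverna.
class TimerAlreadyCancelledDemo {

    /**
     * Returns the message of the exception thrown by the second
     * schedule() call, or null if that call unexpectedly succeeded.
     */
    static String reproduce() {
        Timer nodeRemovalTimer = new Timer("nodeRemovalTimer", true);
        AtomicReference<Thread> timerThread = new AtomicReference<>();
        CountDownLatch firstTaskStarted = new CountDownLatch(1);

        // Steps 1 and 2: the scheduled task throws an unchecked exception,
        // which terminates the timer's single background thread.
        nodeRemovalTimer.schedule(new TimerTask() {
            @Override
            public void run() {
                timerThread.set(Thread.currentThread());
                firstTaskStarted.countDown();
                throw new IllegalArgumentException("node does not have a parent");
            }
        }, 0);

        try {
            if (!firstTaskStarted.await(5, TimeUnit.SECONDS)) {
                return null; // the first task never ran
            }
            // Wait until the timer thread has actually died.
            timerThread.get().join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }

        // Step 3: the timer now behaves as if cancel() had been called.
        try {
            nodeRemovalTimer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // never reached
                }
            }, 0);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
    }
}
```

The stack trace of the first exception also goes to stderr, because nothing on the timer thread catches it. Step 4 follows from the fact that the IllegalStateException is thrown on the caller's thread, which in the report is the thread doing the activity invocation.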

There are several problems here:

  1. nodeToRemove doesn't have a parent. Not too sure why but I think it's a timing issue: the parent gets removed before the child? The parent node always seems to be DataflowActivity.
  2. The scheduled TimerTask shouldn't allow an exception to kill the timer thread.
  3. Calls to MonitorImpl.deregisterNode() shouldn't allow monitoring exceptions to stop the activity invocation.
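
Problems 2 and 3 can be handled defensively even before the root cause is known. The sketch below is one possible shape for that, assuming a simplified monitor that schedules arbitrary Runnable removals; GuardedDeregistration, its deregisterNode(Runnable, long) signature and the failure counter are hypothetical and are not the actual MonitorImpl API or the fix that was checked in.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the Taverna MonitorImpl.
class GuardedDeregistration {
    private final Timer nodeRemovalTimer = new Timer("nodeRemovalTimer", true);
    private final AtomicInteger failedRemovals = new AtomicInteger();

    /**
     * Schedules a removal and never throws to the caller.
     * Returns false if the removal could not be scheduled.
     */
    boolean deregisterNode(Runnable removal, long delayMillis) {
        TimerTask guarded = new TimerTask() {
            @Override
            public void run() {
                try {
                    removal.run();
                } catch (RuntimeException e) {
                    // Problem 2: count the failure instead of letting it
                    // kill the timer thread.
                    failedRemovals.incrementAndGet();
                }
            }
        };
        try {
            nodeRemovalTimer.schedule(guarded, delayMillis);
            return true;
        } catch (IllegalStateException e) {
            // Problem 3: a broken monitor must not abort the invocation.
            return false;
        }
    }

    int failedRemovals() {
        return failedRemovals.get();
    }

    /** True if a removal scheduled after a failing one still runs. */
    static boolean laterRemovalStillRuns() {
        GuardedDeregistration monitor = new GuardedDeregistration();
        CountDownLatch secondRan = new CountDownLatch(1);
        monitor.deregisterNode(() -> {
            throw new IllegalArgumentException("node does not have a parent");
        }, 0);
        monitor.deregisterNode(secondRan::countDown, 0);
        try {
            return secondRan.await(5, TimeUnit.SECONDS)
                    && monitor.failedRemovals() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Real code would log the swallowed exceptions rather than only counting them, since silent failures are part of what made this bug hard to diagnose.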

I think a solution would be to separate the monitoring and invocation code; perhaps by adding a monitor layer before the invoke layer in the dispatch stack.
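
A rough illustration of that separation, assuming a much simplified dispatch stack in which each layer hands a job to the next one. The Layer interface, InvokeLayer, MonitorLayer and runWithBrokenMonitor() are all invented for this sketch and do not correspond to the real Taverna dispatch layer interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a monitor layer placed before the invoke layer.
class MonitorLayerSketch {

    /** Stand-in for a dispatch layer; not the Taverna interface. */
    interface Layer {
        String receiveJob(String job);
    }

    /** Stand-in invoke layer: performs the actual work. */
    static class InvokeLayer implements Layer {
        @Override
        public String receiveJob(String job) {
            return "result of " + job;
        }
    }

    /** Runs the monitoring hook, then always passes the job on. */
    static class MonitorLayer implements Layer {
        private final Layer next;
        private final Runnable monitorHook;
        final List<String> monitorErrors = new ArrayList<>();

        MonitorLayer(Layer next, Runnable monitorHook) {
            this.next = next;
            this.monitorHook = monitorHook;
        }

        @Override
        public String receiveJob(String job) {
            try {
                monitorHook.run();
            } catch (RuntimeException e) {
                // Monitoring failures are recorded here and go no further.
                monitorErrors.add(e.getMessage());
            }
            return next.receiveJob(job);
        }
    }

    /** Invokes a job through a stack whose monitor hook always fails. */
    static String runWithBrokenMonitor(String job) {
        Layer stack = new MonitorLayer(new InvokeLayer(), () -> {
            throw new IllegalStateException("Timer already cancelled.");
        });
        return stack.receiveJob(job);
    }
}
```

With this arrangement the invoke layer contains no monitoring calls at all, so a monitoring exception cannot reach the invocation path.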

Comment by David Withers (Inactive) [ 2008-01-31 ]

The root of this bug is child nodes being removed from the monitor tree after their parents have already been removed. I've checked in changes to DataflowActivity and WorkflowInstanceFacadeImpl to fix this.

Generated at Sat Sep 19 21:31:03 BST 2020 using JIRA 6.1.2#6157-sha1:98c729218aad6de1537eb8e98889ee5562c90d96.