Skype meeting 2009-05-06 16:00 GMT
Participants: Wei Tan (Univ. of Chicago/caGrid), Paolo Missier, Ian Dunlop, Stian Soiland-Reyes
Here's the short(er) summary (from taverna-hackers):
It would be great if you could have a partial rerun of a workflow in Taverna. A partial rerun can be seen as a kind of 'stop' and 'continue' mechanism - typically if a service failed yesterday - rerun a workflow from a certain point today to rerun the services from the failure and below - but using the old data from the services 'above'.
Another way to look at it is to do a kind of service caching, i.e. if a processor in the previous successful run got inputs A,B and produced C,D,E - if on the rerun we receive A and B again we can simply return the cached C,D,E right away - either without invoking the service - or if an invocation fails. There are thoughts about bundling a workflow with the cached data (a pack/research object?) so that you could run a workflow for the first time - and still
There are issues with doing this full scale, for instance in a workflow there could be stateful services that work in coordination - so you might get into trouble if you cache the output of the 'submit job' and 'check status' services but re-run the 'get results' service. For these cases you would need to mark a whole section of a workflow as something to cache or not - bringing in thoughts about transactions and checkpoints.
As a first approach it's probably best to go for simple caching at the service level - just checking some hash of the input data and retrieving the old outputs from the provenance store. This can be achieved by adding a caching layer to the dispatch stack of the processors the user enables caching for - so a simple UI extension would be needed as well.
What we'll do next is that Wei will think of some scenarios with real or invented workflows, and we'll see how he could do this with an initial approach using a simple service-level caching. The myGrid team will help with finding and possibly extending the APIs needed for this. If this looks promising we'll look into getting more time to do a deeper approach.
This Skype meeting was kicked off from the thread "rescue workflow in Taverna" on the taverna-hackers list during 2009-04. The main idea that was discussed was the ability to partially 'rerun' a workflow - using previous values at some processors, but re-executing other processors - typically either because they failed the first time, or because the workflow designer has changed some parameters.
Wei explains the starting point for this idea: He's building 'complex' workflows that consist of multiple steps, dealing with large data (10-15 MB). Some services might fail - but it is annoying to have to restart the whole workflow as completion can take a while. Also while debugging, one is often changing a parameter and really wants to rerun the workflow using the old data from the 'previous step'. DAGMan in Condor already has a similar concept of a "rescue workflow" to rerun parts of a workflow.
The participants all agree that the main principles here make sense, and that it would be valuable to have such functionality. About how easy it is to add, Ian gives a rough estimate of perhaps a month for a quick version. One way this can be done using existing libraries in Taverna is to re-use the provenance data that has been captured, and inject some kind of Caching or Rerun layer into the dispatch stack of the processors that are to use the old values. This layer will return the old data fetched from the provenance store, instead of passing down the inputs to invoke the activity/service. There will need to be a UI bit as well for the user to select which processors to use the old values from. So the API is there, and some code is there as well - such a dispatch layer would be similar in its workings to the existing extension for the Loop functionality.
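To make the dispatch-layer idea concrete, here is a minimal sketch in Python. It is not Taverna's dispatch stack API: the names Layer, RerunLayer and DispatchStack are invented for illustration, the provenance store is stood in for by a plain dictionary of old outputs per processor, and iteration (several invocations of the same processor) is ignored.

```python
from typing import Callable, Dict, List, Set

Data = Dict[str, object]  # port name -> value
Next = Callable[[str, Data], Data]


class Layer:
    """One layer in a hypothetical dispatch stack; the default just passes down."""

    def handle(self, processor: str, inputs: Data, next_layer: Next) -> Data:
        return next_layer(processor, inputs)


class RerunLayer(Layer):
    """Returns outputs recorded in an earlier run for the selected processors."""

    def __init__(self, old_outputs: Dict[str, Data], selected: Set[str]):
        self.old_outputs = old_outputs  # stand-in for the provenance store
        self.selected = selected        # processors the user picked in the UI

    def handle(self, processor: str, inputs: Data, next_layer: Next) -> Data:
        if processor in self.selected and processor in self.old_outputs:
            return self.old_outputs[processor]  # short-circuit: no invocation
        return next_layer(processor, inputs)


class DispatchStack:
    """Runs the layers top to bottom; the bottom of the stack invokes the service."""

    def __init__(self, layers: List[Layer], invoke: Next):
        self.layers = layers
        self.invoke = invoke

    def dispatch(self, processor: str, inputs: Data) -> Data:
        def run(index: int, proc: str, data: Data) -> Data:
            if index == len(self.layers):
                return self.invoke(proc, data)
            layer = self.layers[index]
            return layer.handle(proc, data, lambda p, d: run(index + 1, p, d))

        return run(0, processor, inputs)


calls: List[str] = []


def invoke(processor: str, inputs: Data) -> Data:
    """Pretend service invocation that records which processors really ran."""
    calls.append(processor)
    return {"out": "fresh:" + processor}


yesterday = {"fetch": {"out": "value from yesterday"}}
stack = DispatchStack([RerunLayer(yesterday, {"fetch"})], invoke)
old = stack.dispatch("fetch", {"id": "42"})       # answered from the old run
fresh = stack.dispatch("align", {"seq": "ACGT"})  # really invoked
```

In this sketch only 'align' reaches the invoke function; 'fetch' is answered by the rerun layer without any invocation.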
Some fundamentals we'll need to get straight. It's difficult to really say "Restart from here" in Taverna - what is here? A Taverna workflow is not a linear thing, there could be parallel datapaths, in addition to iterations and pipelining going on. The meeting participants realised that this was potentially a scientific challenge to get 100% right and automatic - for instance what about stateful services that change something in the outside world - like a database update or a WSRF service.
It was also mentioned that if a workflow changes, one has to be quite careful about using old result values. If an upstream processor has been changed, for instance a beanshell script has been reconfigured, or a WSDL service points to a new location, the 'old' inputs to a downstream processor are no longer valid. Detecting what has changed is a complex task, and there could also be changed or moved data links, etc.
From a provenance point of view there's also an issue in that the provenance data is stored as a particular workflow run (identified with a UUID) of a particular workflow (identified with another UUID). Running the exact workflow again should work, but whenever a workflow is edited in the workbench - the workflow UUID is changed - and the link to retrieve the old data is in a way lost. The intended purpose of this update is that a changed workflow is 'really' a new workflow - from a provenance point of view it will have to build a new graph of all the processors, etc.
In Taverna 1 we had an LSID assigned to a workflow, but typically it was never changed unless the user pushed the 'new LSID' button - meaning that today we have many workflows in myExperiment that have duplicate LSIDs - simply because they have evolved from a common ancestor. One way to resolve this conflict is to keep a list of all UUIDs - so that the workflow contains all the UUIDs it has had in the past as well. Ian and Stian discussed this outside the meeting - although the list might grow for every (saved) change, it will probably never grow much longer than say 10000, so it's not too bad to include it in the workflow serialisation - if you consider the complexity of the current serialisation.
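A rough sketch of the UUID-list idea, in Python. The Workflow class, its field names and the find_runs helper are hypothetical; the real workflow serialisation and provenance store look different. It only shows that keeping the earlier identifiers lets an edited workflow still find the runs recorded under its old identifier.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workflow:
    """Hypothetical workflow record that remembers every identifier it has had."""

    current_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    previous_ids: List[str] = field(default_factory=list)

    def mark_edited(self) -> None:
        """Assign a fresh identifier, remembering the one being replaced."""
        self.previous_ids.append(self.current_id)
        self.current_id = str(uuid.uuid4())

    def all_ids(self) -> List[str]:
        """Current identifier first, then ancestors from newest to oldest."""
        return [self.current_id] + self.previous_ids[::-1]


def find_runs(workflow: Workflow, runs_by_workflow: Dict[str, List[str]]) -> List[str]:
    """Collect run identifiers stored under any identifier the workflow has had."""
    found: List[str] = []
    for workflow_id in workflow.all_ids():
        found.extend(runs_by_workflow.get(workflow_id, []))
    return found


wf = Workflow()
original_id = wf.current_id
provenance = {original_id: ["run-1"]}  # stand-in for the provenance store
wf.mark_edited()                       # editing in the workbench changes the UUID
runs = find_runs(wf, provenance)       # the old run is still reachable
```

Whether data from an ancestor run is still valid after the edit is a separate question; the list only restores the link.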
The more 'complete' way to do this is to analyse the change to see what has been affected - if it's only 'below' the selected rerun point it's OK - but what if it's 'above'? Issues with changes in topology and configuration - also that the 'rerun point' would be blurry - there could be half an iteration loop finished from above - and there could be other processors that are disjoint that you can't tell if they are 'above' or 'below'.
So what about a simpler approach: just use service-based caching? If a processor/activity receives a certain input that it has received before, and caching is enabled, it simply returns the previously returned data instead of invoking the service. We can let the workflow designer be in charge of deciding if caching makes sense for a processor or not - so she could enable it for a service that does a fixed analysis algorithm, and not for one that does a database lookup (if she wants to retrieve fresh values).
This caching mechanism can also be useful when parameters have changed - it can be set on 'slow' services. Say a database lookup service is not cached, and it returns mainly the same results as before, but its list also contains a few new results. The following analysis services (working on individual items) would not have to rerun for the old results, but would execute on the new ones as they are not recognized as old inputs by the caching.
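The database lookup example can be sketched like this in Python. The function names and the dictionary cache are made up for illustration; in Taverna the per-item calls would come from the implicit iteration over the list rather than an explicit loop.

```python
from typing import Callable, Dict, List


def analyse_all(items: List[str], analyse: Callable[[str], str],
                cache: Dict[str, str]) -> List[str]:
    """Run `analyse` on each item, reusing results for items seen in an earlier run."""
    results = []
    for item in items:
        if item not in cache:
            cache[item] = analyse(item)  # only unseen items reach the slow service
        results.append(cache[item])
    return results


invoked: List[str] = []


def slow_analysis(item: str) -> str:
    """Pretend analysis service that records which items it was really called with."""
    invoked.append(item)
    return item.upper()


cache: Dict[str, str] = {}
first = analyse_all(["a", "b"], slow_analysis, cache)        # first run
second = analyse_all(["a", "b", "c"], slow_analysis, cache)  # lookup now returns one more item
```

On the second run only the new item "c" is sent to the analysis service; "a" and "b" come from the cache.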
The caching can use hashing (like sha1, md5) of the input data to be able to recognise old data - these additional hashes could be stored either in the provenance store or the reference service. This would avoid the problem of changed workflow UUIDs.
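As an illustration of the hashing, here is a small Python sketch using the standard hashlib module. How inputs are actually represented in the reference service is not covered here: the sketch assumes each input port already holds raw bytes, and the input_digest and cache_key helpers and the service identifier are invented names.

```python
import hashlib
from typing import Dict


def input_digest(inputs: Dict[str, bytes]) -> str:
    """SHA-1 over port names and values, independent of dictionary ordering."""
    digest = hashlib.sha1()
    for port in sorted(inputs):
        name = port.encode("utf-8")
        value = inputs[port]
        # Length prefixes keep port "ab" with value "c" distinct from "a" with "bc".
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(value).to_bytes(8, "big"))
        digest.update(value)
    return digest.hexdigest()


def cache_key(service_id: str, inputs: Dict[str, bytes]) -> str:
    """Key that depends on the service and its inputs, not on any workflow UUID."""
    return service_id + ":" + input_digest(inputs)


key = cache_key("blast", {"sequence": b"ACGT", "database": b"swissprot"})
```

Because the key is built from the service and the input data only, editing the workflow (and thereby changing its UUID) does not invalidate it.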
Paolo says there are different scenarios to address here. One is a kind of service invocation history (the cache) - on input X the service returned Y in the past, so it can return it again this time.
There's also the idea that the cached value could be used only on service failure - after exhausting retries and failovers the cache would return the old value instead.
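A possible shape for this "cache only on failure" behaviour, sketched in Python. The invoke_with_fallback helper is hypothetical, failover to alternative services is left out, and the input values are assumed to be hashable (plain strings here) so they can be used directly as a cache key.

```python
from typing import Callable, Dict, Optional, Tuple

Inputs = Dict[str, str]
Outputs = Dict[str, str]
Key = Tuple[Tuple[str, str], ...]


def invoke_with_fallback(service: Callable[[Inputs], Outputs], inputs: Inputs,
                         cache: Dict[Key, Outputs], retries: int = 2) -> Tuple[Outputs, bool]:
    """Invoke the service; return (outputs, from_cache).

    The cached value is used only once all attempts have failed.
    """
    key: Key = tuple(sorted(inputs.items()))
    last_error: Optional[Exception] = None
    for _ in range(max(1, retries + 1)):
        try:
            outputs = service(inputs)
        except Exception as error:  # any failure counts as a failed attempt
            last_error = error
            continue
        cache[key] = outputs  # remember the fresh result for later runs
        return outputs, False
    if key in cache:
        return cache[key], True
    assert last_error is not None
    raise last_error


state = {"up": True, "attempts": 0}


def flaky_service(inputs: Inputs) -> Outputs:
    """Pretend service that can be switched off to simulate an outage."""
    state["attempts"] += 1
    if not state["up"]:
        raise RuntimeError("service unavailable")
    return {"out": inputs["x"] + "!"}


cache: Dict[Key, Outputs] = {}
first = invoke_with_fallback(flaky_service, {"x": "hello"}, cache)   # service is up
state["up"] = False
second = invoke_with_fallback(flaky_service, {"x": "hello"}, cache)  # falls back to the cache
```

Returning the from_cache flag lets the caller (or the provenance capture) record that the value was not freshly computed.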
Scenario 1: Save time on the second rerun by using cached values
Scenario 2: After a failure, restart workflow. More complex, "transactional" - would another reinvocation be OK? Reminiscent of checkpointing in database transactions. Checkpoints could be done on processor or workflow level - typically a checkpoint is defined as "Save all the current data - if anything goes wrong - go back and re-execute from this point." - which comes back to the problem of what a "point" is.
Wei brings up the idea to bundle a workflow with previous data - so that a user who gets the bundle can run it, even if some of the services are down or have stopped working. Paolo says that this bundle could contain not just previous data, but also data hand-crafted by the designer. However this functionality could be done later - at first it would be interesting enough just to be able to reuse locally cached values - thus avoiding the issue of what this serialized bundle looks like - one probably doesn't want to munge many of these 15 MB data files into the workflow XML - but they could fit into a future Research Object bundle.
We agree that this is interesting. Are we going for simple caching or the more "complete solution"? Wei has done similar stuff for BPEL, so this is not a new idea, but still it's interesting.
WSRF introduces stateful services and other problems - might need caching on a wider level than just services/processors - or could we use nested workflows for this?
Would Wei participate in this effort? He's looking at combining this with myGrid's research agenda. Perhaps we can make some scenarios and even a prototype (estimate 2-4 weeks) to show to caGrid that this is something we want.
The myGrid team certainly believes this is something that can be done, but from their point of view it's mainly a lack of time that keeps them from doing the work themselves. However, the myGrid team can help Wei in doing the work by providing pointers to the APIs, etc.
We can start from the simple cache. Stian and Wei will meet in Boston and Chicago, and will look at this next week.