Data referencing in action in “Integrating ARC Grid with Taverna”

Daniel Bayer at the University of Lübeck has implemented an ARC Grid plugin for Taverna for submitting jobs to the ARC grid system, in particular for use with KnowARC. Bayer, together with Steffen Möller and Hajo N. Krabbenhöft, was also recently published in Bioinformatics.

The work seems very promising, particularly because it shares much of the inspiration behind t2 with regard to security and referenced data, even though their plugin targets Taverna 1.6.2, which predates even the inclusion of the t2 plugin in Taverna.

They have two different solutions for running ARC grid jobs from Taverna. The first can be compared to SOAPlab: a command-line execution (in this case on the grid) is described in a simple XML description, for example for aligning protein sequences with ClustalW:

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<program name="clustalw_protein"
     description="call clustalw"
     command="clustalw -infile input -outfile output">
  <output name="alignment">
    <fromfile path="output" />
  </output>
  <input name="multiple_sequences_FASTA">
    <file path="input" />
  </input>
  <RE name="APPS/BIO/CLUSTALW-1.8.3" />
</program>

Note the command and the input and output descriptions in the example above. These service descriptions, called use cases, are then put into a repository that can be read by a special ARC scavenger and browsed like services in Taverna. There’s an ARC gateway service that, based on these descriptions, is able to dynamically generate WSDL-described web services, which Taverna and other clients can use normally, provided a grid proxy certificate is used. The gateway reads the annotations from the use case repository, and when a request is submitted to a web service, it submits the grid jobs and returns the results to the client.
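To make the idea of reading such a description a bit more concrete, here is a minimal sketch that parses the ClustalW use case with Python’s standard library and pulls out the command and the port-to-file mappings. The function name `parse_use_case` and my reading of the elements (that `file` and `fromfile` name the staged files, and that `RE` is a required runtime environment) are assumptions drawn from this one example, not the gateway’s actual code.

```python
import xml.etree.ElementTree as ET

# The use case description from the post, unchanged.
USE_CASE_XML = """<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<program name="clustalw_protein"
     description="call clustalw"
     command="clustalw -infile input -outfile output">
  <output name="alignment">
    <fromfile path="output" />
  </output>
  <input name="multiple_sequences_FASTA">
    <file path="input" />
  </input>
  <RE name="APPS/BIO/CLUSTALW-1.8.3" />
</program>
"""


def parse_use_case(xml_text):
    """Extract the command and the port-to-file mappings from a use case."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        "name": root.get("name"),
        "description": root.get("description"),
        "command": root.get("command"),
        # Assumption: <file path> names the file an input port is written to.
        "inputs": {
            port.get("name"): port.find("file").get("path")
            for port in root.findall("input")
        },
        # Assumption: <fromfile path> names the file an output port is read from.
        "outputs": {
            port.get("name"): port.find("fromfile").get("path")
            for port in root.findall("output")
        },
        # Assumption: <RE> lists required runtime environments.
        "runtime_environments": [env.get("name") for env in root.findall("RE")],
    }


use_case = parse_use_case(USE_CASE_XML)
print(use_case["command"])
print(use_case["inputs"])
print(use_case["outputs"])
```

A gateway or scavenger would presumably do something similar before exposing the ports `multiple_sequences_FASTA` and `alignment` to the workflow designer.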

This solution should enable any WSDL-capable client to use grid resources without worrying about the command line.

The authors realised that sending potentially large data back and forth over the WSDL interface to Taverna is a big waste of resources and a limitation, especially when using several ARC services on the same grid in succession. They therefore developed a second solution, which is a Taverna plugin.

The plugin includes a use case scavenger for browsing the annotations from Taverna’s Available services panel and adding them to the workflow. There’s a GUI to select the proxy certificate. The services are added as ARC grid processors instead of WSDL processors as in the first solution. This solution does not require the gateway service, as the ARC processor does the grid submission and result retrieval under the hood, similar to the SOAPlab processor already built into Taverna.

The crucial difference here is that the plugin allows passing data by reference. If you chain two such processors together, only the reference to the result is passed along (I assume basically a filename), and the big data can stay on the grid. The data is fetched on demand if it is passed to a non-ARC processor, so the workflow designer is able to combine grid and non-grid resources.
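The pass-by-reference behaviour can be sketched roughly as follows. This is a toy model and not the plugin’s code: `GridRef`, `FakeGrid`, `grid_processor` and `local_processor` are all hypothetical names, and the grid is just an in-memory dictionary that counts how often data is actually transferred to the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridRef:
    """A reference to data that lives on the grid (assumed to be a filename)."""
    filename: str


class FakeGrid:
    """In-memory stand-in for grid storage; counts transfers to the client."""

    def __init__(self):
        self._files = {}
        self.downloads = 0

    def write(self, filename, content):
        self._files[filename] = content
        return GridRef(filename)

    def read_on_grid(self, ref):
        # Data read by a job running on the grid never leaves the grid.
        return self._files[ref.filename]

    def download(self, ref):
        # Data is transferred to the workflow engine only here.
        self.downloads += 1
        return self._files[ref.filename]


def grid_processor(grid, ref, out_filename, transform):
    """Hypothetical grid-side processor: takes a reference, returns a reference."""
    result = transform(grid.read_on_grid(ref))
    return grid.write(out_filename, result)


def local_processor(grid, value):
    """Hypothetical non-grid processor: resolves a reference only if it gets one."""
    if isinstance(value, GridRef):
        value = grid.download(value)
    return len(value)


grid = FakeGrid()
sequences = grid.write("input.fasta", ">seq1\nacgt\n")
aligned = grid_processor(grid, sequences, "aligned.txt", str.upper)
trimmed = grid_processor(grid, aligned, "trimmed.txt", str.strip)
downloads_while_chaining = grid.downloads   # still 0: only references moved
length = local_processor(grid, trimmed)     # triggers the single download
```

Chaining the two grid-side steps moves only `GridRef` objects around; the content is pulled down once, at the point where a non-grid step actually needs it.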

These are two main features we want to achieve in a more general sense with t2. A service should be able to return data by reference, and we fetch the data only when needed, otherwise just pass along the reference. There should be built-in support for secured services, with a GUI to select certificates (no requirement for “magic” files).

We met Möller at ISMB 2007 in Vienna, and we’re currently talking with him and his coders to collaborate on porting their code to t2 and to get more requirements for us, as this is both an interesting and real use case.

Note that it would also be possible to do the pass-by-reference solution with Taverna 1 even without a plugin. All you need to do is have your service return an xsd:anyURI instead of the result value, and accept such URIs as input parameters. If you want to support both, you can have mirrored methods, say a blast() (using values) and a blastRef() (using references).
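As a rough sketch of what such mirrored methods might look like on the service side, here is a toy in-memory service. Everything except the method names `blast` and `blastRef` is made up for illustration: the class `BlastService`, the `_store` helper and the placeholder "report" stand in for whatever the real service does.

```python
import uuid


class BlastService:
    """Toy service exposing mirrored by-value and by-reference operations."""

    def __init__(self):
        self._data = {}  # URI -> stored value, kept on the service side

    def _store(self, value):
        uri = "urn:uuid:" + str(uuid.uuid4())
        self._data[uri] = value
        return uri

    def blast(self, sequence):
        """By value: the sequence comes in, the report goes out."""
        # Placeholder for the real search; only the data flow matters here.
        return "report for " + sequence

    def blastRef(self, sequence_uri):
        """By reference: a URI comes in, a URI for the report goes out."""
        report = self.blast(self._data[sequence_uri])
        return self._store(report)


service = BlastService()
by_value = service.blast("ACGT")

# Seed the store directly for the demo; a real service needs an upload operation.
input_uri = service._store("ACGT")
report_uri = service.blastRef(input_uri)
```

The by-reference variant does the same work as the by-value one, but the client only ever sees the two URIs.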

The URIs themselves can be in the style of:

  • urn:uuid:815BBCE0-ED5A-4ED9-8768-22EED2793EC4
    (internal URI based on a UUID)
  • http://myproject.com/2008/myservice/data/815BBCE0-ED5A-4ED9-8768-22EED2793EC4
    (http-namespaced, but only internally resolvable URI)
  • https://myservice.mygrid.org:8081/data.msf?id=815BBCE0-ED5A-4ED9-8768-22EED2793EC4
    (http-resolvable URL, given valid security credentials)

The disadvantage of this in Taverna 1 is that when you want to pass data to a non-grid resource, you’ll need to introduce a shim that downloads (resolves) the content. For the first two cases this would be a special getData(uri) method on the service (together with a putData(data) method for uploading large data and getting back a URI that is valid with the *Ref services). In the last case it could just be the local worker Get web page from URL (for text) or Get image from URL. (OK, that’s another little secret hack, but these two workers should work with more than just HTML and JPEGs; the only real difference is whether you get text or binary data out.)
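The logic of such a shim could look something like the sketch below, which picks the route from the style of the URI. The function `resolve`, the `INTERNAL_PREFIX` constant and the two fake callables are assumptions for illustration; in a real workflow `get_data` would wrap the service’s getData(uri) operation and `http_get` would do an authenticated download.

```python
from urllib.parse import urlparse

# Assumption: the service declares which http-namespaced URIs are internal only.
INTERNAL_PREFIX = "http://myproject.com/2008/myservice/data/"


def resolve(uri, get_data, http_get):
    """Turn a reference into content, choosing the route from the URI style.

    get_data: callable wrapping the service's getData(uri) operation.
    http_get: callable that downloads a URL (credentials handled by the caller).
    """
    if uri.startswith("urn:uuid:") or uri.startswith(INTERNAL_PREFIX):
        return get_data(uri)
    if urlparse(uri).scheme in ("http", "https"):
        return http_get(uri)
    raise ValueError("unsupported reference: " + uri)


# Fakes standing in for the service call and the network download.
calls = []


def fake_get_data(uri):
    calls.append(("getData", uri))
    return "content from service"


def fake_http_get(url):
    calls.append(("GET", url))
    return "content from URL"


UUID = "815BBCE0-ED5A-4ED9-8768-22EED2793EC4"
a = resolve("urn:uuid:" + UUID, fake_get_data, fake_http_get)
b = resolve(INTERNAL_PREFIX + UUID, fake_get_data, fake_http_get)
c = resolve("https://myservice.mygrid.org:8081/data.msf?id=" + UUID,
            fake_get_data, fake_http_get)
```

The internal prefix has to be checked before the generic http case, since the second URI style looks like a URL but is not resolvable from outside.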

What t2 should be able to do is handle this under the hood of Taverna and its activities (processors), so that the only requirement is for the service to declare that something it expects or returns is a reference. One of the problems we face is that there’s currently no way to do so in WSDL, but we could start by simply saying that xsd:anyURI values can be used as references.

2 thoughts on “Data referencing in action in “Integrating ARC Grid with Taverna””

  1. Discovery. This is a very interesting case. It will help me get an insight into the development of a T2 plugin for gLite in order to interact with the EGEE infrastructure.

  2. Hi, I just happened to learn about this blog’s existence and found this entry about our work. Many thanks, Stian. I just wanted to add that our code is available not only to Stian and his colleagues but publicly inspectable on http://svn.nordugrid.org/trac/workarea/browser/T2.6/janitor-taverna-processor

    The work was prepared within the KnowARC EU project that is about helping the grid middleware (Globus pre-ws) behind the NorduGrid (www.nordugrid.org/monitor) to find a larger user base. We are ourselves talking to the group of Luciano Milanesi in Milano to see the development ported to other middlewares. And hey, if you are interested to play around, (as all the grid communities) the NorduGridders are very open for new participants and a site can easily be accessible by multiple middlewares.
