Taverna PROV Data Bundle (Taverna 2.x)

Taverna 2.4 with the Taverna-PROV plugin 2.1.5 or later can export Taverna workflow runs as a Data Bundle. The bundle can be saved from within the Workbench results (Save All) or from the command line. The Data Bundle contains the workflow input and output values, intermediate values, a provenance trace and a copy of the executed workflow definition.

Structure of exported provenance

The .bundle.zip file is an RO Bundle, which specifies a structured ZIP file with a manifest (.ro/manifest.json).

MIME type:

application/vnd.wf4ever.robundle+zip

File extension:

.bundle.zip

An RO Bundle is effectively a structured ZIP file with a JSON-LD manifest that follows the Research Object data model, with provisions for provenance and annotations of resources. These resources can be embedded within the ZIP file or aggregated from external sources by using URL references.

You can explore the bundle by unzipping it or by browsing it with a program like 7-Zip.

The Taverna-PROV source code includes an example bundle, both as a ZIP file and unpacked as a folder. This data bundle was saved after running a simple hello world workflow.

The remaining text of this section describes the content of the RO bundle as if it were unpacked to a folder. Note that many programming frameworks include support for working with ZIP files, so complete unpacking might not be necessary for your application. For Java, the Data Bundle API gives a programmatic way to inspect and generate data bundles.
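
As an illustration of inspecting a bundle without unpacking it, the following Python sketch reads the mimetype entry and the manifest using only the standard library. The function name describe_bundle is hypothetical, and the sketch assumes the manifest has a top-level "aggregates" list whose entries are either plain strings or objects with a "file" or "uri" key; check this against the manifest in your own bundle.

```python
import json
import zipfile

MANIFEST = ".ro/manifest.json"


def describe_bundle(bundle_path):
    """Return (mimetype, aggregated resources) for an RO bundle ZIP file.

    Assumption: the manifest has an "aggregates" list whose entries are
    strings or objects carrying a "file" or "uri" key.
    """
    with zipfile.ZipFile(bundle_path) as bundle:
        names = set(bundle.namelist())
        mimetype = None
        if "mimetype" in names:
            # The mimetype entry is a small plain-text file.
            mimetype = bundle.read("mimetype").decode("ascii").strip()
        manifest = json.loads(bundle.read(MANIFEST).decode("utf-8"))

    resources = []
    for entry in manifest.get("aggregates", []):
        if isinstance(entry, str):
            resources.append(entry)
        else:
            resources.append(entry.get("file") or entry.get("uri"))
    return mimetype, resources
```

Calling describe_bundle("helloanyone.bundle.zip") would return the declared media type and the list of aggregated paths and URIs, which is often enough to decide which entries to read next.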

Inputs and outputs

The folders inputs/ and outputs/ contain files and folders corresponding to the input and output values of the executed workflow. Ports with multiple values are stored as a folder with numbered entries, starting from 0. Values representing errors have the extension .err; other values have an extension guessed by inspecting the value structure, e.g. .png. External references have the extension .url. These files can often be opened as an "Internet shortcut" or similar, depending on your operating system.
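
The naming rules above can be sketched in Python for a bundle that has been unpacked to a folder. The helper names find_port and load_port are hypothetical, and the sketch assumes that list entries are named by their position (such as 0.txt, or a folder 0 for a nested list); it returns raw file content and does not interpret .err or .url files further.

```python
from pathlib import Path


def find_port(folder, name):
    """Find the file or folder for a port, whatever its extension."""
    for entry in sorted(Path(folder).iterdir()):
        if entry.stem == name:
            return entry
    return None


def load_port(path):
    """Load a port value; folders become lists ordered by position."""
    path = Path(path)
    if path.is_dir():
        # Assumption: entries are named by index, e.g. "0.txt" or "0".
        items = [p for p in path.iterdir() if p.stem.isdigit()]
        items.sort(key=lambda p: int(p.stem))
        return [load_port(p) for p in items]
    if path.suffix == ".err":
        return ("error", path.read_text(encoding="utf-8"))
    if path.suffix == ".url":
        return ("reference", path.read_text(encoding="utf-8"))
    return ("value", path.read_bytes())
```

For the example bundle, load_port(find_port("outputs", "greeting")) would return a ("value", bytes) pair holding the content of outputs/greeting.txt.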

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls
inputs intermediates mimetype outputs workflow.wfbundle workflowrun.prov.ttl
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls outputs
greeting.txt
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat outputs/greeting.txt
Hello, John Doe

Workflow run provenance

The file workflowrun.prov.ttl contains the PROV-O export of the workflow run provenance (including nested workflows) in RDF Turtle format.

This log details every intermediate processor invocation in the workflow execution, and relates them to inputs, outputs and intermediate values.

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat workflowrun.prov.ttl | head -n 40 | tail -n 8
<#taverna-prov-export>
rdf:type prov:Activity ;
prov:startedAtTime "2013-11-22T14:01:02.436Z"^^xsd:dateTime ;
prov:qualifiedCommunication _:b1 ;
prov:endedAtTime "2013-11-22T14:01:03.223Z"^^xsd:dateTime ;
rdfs:label "taverna-prov export of workflow run provenance"@en ;
prov:wasInformedBy <http://ns.taverna.org.uk/2011/run/385c794c-ba11-4007-a5b5-502ba8d14263/> ;

See the provenance graph for a complete example. The provenance uses the vocabularies W3C PROV-O, wfprov and tavernaprov.

ns.taverna.org.uk URIs

Note that the URIs starting with 

http://ns.taverna.org.uk/2011/run/

http://ns.taverna.org.uk/2011/data/

http://ns.taverna.org.uk/2010/workflowBundle/

are not meant to be clickable (HTTP resolvable) and would currently give 404 Not Found.

The reason for this is that myGrid does not (and will not) centrally store any workflow run information, data values or workflow definitions. It is still useful for each workflow definition, each workflow run and each produced data value to be uniquely identifiable, so we build these URIs using UUIDs that are generated within Taverna. It is possible that in the future these URIs could redirect to public search results, e.g. on myExperiment.
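
Because these URIs are identifiers rather than links, a client would typically parse them instead of fetching them. The Python sketch below extracts the UUID from a run URI; the function name run_uuid is hypothetical, and it assumes the UUID is the first path segment after /2011/run/, as in the example listing.

```python
import uuid
from urllib.parse import urlparse

RUN_PREFIX = "/2011/run/"


def run_uuid(uri):
    """Extract the workflow run UUID from an ns.taverna.org.uk run URI."""
    parsed = urlparse(uri)
    if parsed.netloc != "ns.taverna.org.uk" or not parsed.path.startswith(RUN_PREFIX):
        raise ValueError("not a Taverna run URI: %s" % uri)
    # Assumption: the UUID is the first path segment after the prefix.
    first_segment = parsed.path[len(RUN_PREFIX):].split("/", 1)[0]
    return uuid.UUID(first_segment)
```

The same approach should work for the data and workflowBundle prefixes by changing the prefix constant.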

Intermediate values

Intermediate values are stored in the intermediates/ folder and referenced from workflowrun.prov.ttl.

Intermediate value from the example provenance:

Here we see that the bundle file intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt contains the output from the "hello" processor, which was also the input to the "Concatenate_two_strings" processor. Details about processor, ports and parameters can be found in the workflow definition.

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls intermediates/d5
d588f6ab-122e-4788-ab12-8b6b66a67354.txt
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat intermediates/d5/d58*
Hello,

Note that "small" textual values are also included as cnt:chars in the graph, while the referenced intermediate file within the data bundle is always present.

<intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt>
rdf:type cnt:ContentAsText ;
cnt:characterEncoding "UTF-8"^^xsd:string ;
cnt:chars "Hello, "^^xsd:string ;
tavernaprov:byteCount "7"^^xsd:long ;
tavernaprov:sha512 "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"^^xsd:string ;
tavernaprov:sha1 "f52ab57fa51dfa714505294444463ae5a009ae34"^^xsd:string ;
rdf:type tavernaprov:Content .
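
The tavernaprov:byteCount, tavernaprov:sha1 and tavernaprov:sha512 properties can be recomputed from the file content to check a value. The Python sketch below shows one way to do that; the function name describe_content is hypothetical, and it assumes the digests are computed over the raw bytes of the file.

```python
import hashlib


def describe_content(data):
    """Compute byte count and hex digests for raw file content."""
    return {
        "byteCount": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha512": hashlib.sha512(data).hexdigest(),
    }
```

For the example value, "Hello, " encoded as UTF-8 is 7 bytes, which matches the byteCount in the listing. Note that the sha512 value shown in the listing is the SHA-512 digest of empty input, so a mismatch on that property for this example should not by itself be read as a corrupted file.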

Workflow definition

The file workflow.wfbundle is a copy of the executed workflow in SCUFL2 workflow bundle format. This is the format which will be used by Taverna 3.

You can use the SCUFL2 API to inspect the workflow definition in detail.

The file .ro/annotations/workflow.wfdesc.ttl contains the abstract structure (but not all the implementation details) of the executed workflow, in RDF Turtle according to the wfdesc ontology.

Taverna 3 Data bundle

Taverna 3 uses the same Data Bundle format as the Taverna-PROV plugin. Currently there are some differences due to the two different implementations for capturing provenance.

Differences from Taverna 2

Taverna 3 adds the workflow report workflowrun.json (see below).

Taverna 3 does not yet export a provenance trace to workflowrun.prov.ttl (Taverna-PROV issue 3), but as the workflow run report captures the same or similar information, a provenance trace equivalent to the Taverna 2 output can in theory be generated from the workflow report (T3-829, T3-970).

Taverna 3 can also open an existing data bundle and display it in the Results perspective. Opening existing data bundles created with the Taverna-PROV plugin is currently not supported in Taverna 3, as the implementation assumes the workflow report is present in the bundle (T3-971).

Workflow report (workflowrun.json)

Taverna 3 introduces a new resource in the data bundle, workflowrun.json, which is more Taverna-centric and mirrors the actual execution state while running a workflow. This example shows an excerpt of a workflow run report (see also the full workflowrun.json):

Subject to change

The structure of the workflowrun.json is still subject to change until the official release of Taverna 3.0.

JSON structure, where optional means the property might not be present, and properties marked final should be present after the workflow has finished:

workflow report (top-level JSON Object)

    • subject the URI identifying the executed workflow, as identified within the SCUFL2 workflow.wfbundle
    • state of the last workflow run; one of CREATED, RUNNING, COMPLETED, CANCELLED, FAILED
    • createdDate Date/time (in ISO 8601 dateTime format) of creation of the workflow report, e.g. when execution of the top-level workflow was started.
    • startedDate (final) Date/time this workflow initially executed
    • pausedDate (optional) Date/time this workflow last entered the PAUSED state
    • pausedDates (optional) A chronological JSON list of Date/times of each time a workflow has entered the PAUSED state
    • resumedDate (optional) Date/time this workflow last resumed from the PAUSED state
    • resumedDates (optional) A chronological JSON List of Date/times of each time a workflow has resumed from the PAUSED state.
    • cancelledDate (optional) Date/time this workflow entered the CANCELLED state
    • failedDate (optional) Date/time this workflow entered the FAILED state. Note that a workflow does not normally fail in this way even though some of its outputs could be errors. A FAILED state indicates a workflow execution problem within the Taverna Platform.
    • completedDate (optional) Date/time this workflow entered the COMPLETED state
    • invocations JSON List of workflow invocations. For the top-level workflow, this list always contains exactly 1 item, which mirrors the information above.
      • id An identifier for this workflow invocation, unique within this workflow report. 
      • parent (optional) The identifier (id) of the parent activity invocation. This property is only provided if this invocation was a nested workflow run, in which case it will be the identifier of the corresponding activity invocation within the parent workflow. 
      • name A name for this invocation, unique within this list of invocations. By convention the invocation of the top-level workflow has the same name as the Workflow within the Workflow Bundle, e.g. "Hello_anyone", but this is subject to change. 
      • state of this workflow invocation; one of CREATED, RUNNING, COMPLETED, CANCELLED, FAILED
      • startedDate Date/time when this invocation started. 
      • completedDate (final) Date/time when this invocation ended. 
      • inputs A JSON Object of the input port values. The keys are port names, e.g. "name", the values are relative URI references to resources within the Data Bundle, e.g. "/inputs/name.txt".
        Workflow inputs will normally be identified with the same relative URI reference where they are used as processor inputs.
        In some cases the input might be an absolute URI, which should nevertheless be aggregated by the Data Bundle's manifest.
      • outputs A JSON Object of the output port values. The keys are port names, e.g. "greeting", the values are relative URI references to resources within the Data Bundle, e.g. "/outputs/greeting.txt" or "intermediates/16/160b64ba-c2b4-435b-8699-465e2d190994".
        Workflow outputs will normally be identified with the same relative URI reference as where they were generated as processor outputs. 
        In some cases the output might be an absolute URI, which should nevertheless be aggregated by the Data Bundle's manifest.
    • processorReports A list of processor reports, one per processor in the current workflow
      • subject the URI identifying the executed processor, as identified within the SCUFL2 workflow.wfbundle
      • as in workflow report (above): state, createdDate, pausedDate, pausedDates, resumedDate, resumedDates, cancelledDate, failedDate, completedDate
      • invocations JSON List of processor invocations. The content of this list corresponds to iterations over this processor (or its containing nested workflow), and so might contain 0, 1 or more invocations depending on the workflow structure and execution.
        • id An identifier for this processor invocation, unique within this workflow report. By convention this identifier is composed by concatenation of the parent, "/" and the name (e.g. "Hello_Anyone/Concatenate_two_strings"), but this is subject to change.
        • parent The identifier (id) of the corresponding parent workflow invocation. When this processor is within a nested workflow, this will identify the particular invocation of the nested workflow. 
        • name A name for this invocation, unique within this list of invocations. Note that although this name might in some cases match the actual processor name, this will not be the case when there are iterations over this processor.
        • index (optional) List of JSON integers, indicating the iteration index within the executed workflow invocation (parent), e.g. [0] (first position within a single list) or [3,7] (fourth position within outer list and eighth position within inner list)
        • as in workflow invocations (above): state, startedDate, completedDate
        • inputs A JSON Object of the input port values. The keys are port names, e.g. "name", the values are relative URI references to resources within the Data Bundle, e.g. "intermediates/16/160b64ba-c2b4-435b-8699-465e2d190994".
          Processor inputs will normally be identified with the same relative URI reference as the corresponding workflow input or processor output where the processor input port is connected from, however in some cases the input has been created by Taverna's iteration system and would first appear at this processor. 
          In some cases the input might be an absolute URI, which should nevertheless be aggregated by the Data Bundle's manifest.
        • outputs A JSON Object of the output port values. The keys are port names, e.g. "greeting", the values are relative URI references to resources within the Data Bundle, e.g. "intermediates/16/160b64ba-c2b4-435b-8699-465e2d190994".
          Processor outputs will normally be identified with the same relative URI reference as where they were generated within the corresponding activity invocation. 
          In some cases the output might be an absolute URI, which should nevertheless be aggregated by the Data Bundle's manifest.
      • activityReports JSON List of activity reports. This list usually contains only 1 item, but might contain several reports if the workflow uses Looping, Retry or Failover.
        • subject the URI identifying the executed activity, as identified within the SCUFL2 workflow.wfbundle
        • as in workflow report (above): state, createdDate, pausedDate, pausedDates, resumedDate, resumedDates, cancelledDate, failedDate, completedDate
        • invocations JSON List of activity invocations. The content of this list corresponds to each invocation of the activity, and so may contain multiple invocations due to nested workflow invocations, processor iterations, looping, retry or failover.
          • id An identifier for this activity invocation
        • nestedWorkflowReport (optional) A nested workflow report, if this activity is a nested workflow. 
          • properties are as in top-level workflow report
          • invocations as in top-level workflow, with their parent matching the corresponding activity invocation id
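
The nesting described above can be walked recursively. The following Python sketch is based only on the property names in the list (invocations, processorReports, activityReports, nestedWorkflowReport); since the structure is still subject to change, treat it as an illustration rather than a stable reader. The helper names iter_invocations and outputs_of are hypothetical.

```python
def iter_invocations(report):
    """Yield (kind, subject, invocation) for every invocation in a report."""
    for inv in report.get("invocations", []):
        yield "workflow", report.get("subject"), inv
    for proc in report.get("processorReports", []):
        for inv in proc.get("invocations", []):
            yield "processor", proc.get("subject"), inv
        for act in proc.get("activityReports", []):
            for inv in act.get("invocations", []):
                yield "activity", act.get("subject"), inv
            nested = act.get("nestedWorkflowReport")
            if nested:
                # A nested workflow report has the same shape as the top level.
                yield from iter_invocations(nested)


def outputs_of(report, invocation_id):
    """Return the outputs object of the invocation with the given id."""
    for _kind, _subject, inv in iter_invocations(report):
        if inv.get("id") == invocation_id:
            return inv.get("outputs", {})
    return None
```

With the report loaded through json.load from workflowrun.json, outputs_of(report, "Hello_Anyone/Concatenate_two_strings") would return the port-name-to-path mapping for that processor invocation, provided the identifier follows the convention described above.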

Working with data bundles

The Data Bundle can be processed using normal ZIP support, such as the command line Info-ZIP tool unzip, built-in operating system support or third-party programs like 7-Zip.

Additionally, programming languages will typically have API support or libraries for working with ZIP files, such as the Java 7 zipfs and Apache Commons Compress API, or Ruby's rubyzip gem.
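
As a sketch of this generic ZIP approach, the Python standard library can read a single entry from a bundle, or stream a larger one to disk, without unpacking the whole archive. The function names read_text_entry and copy_entry are hypothetical, and the entry name is assumed to be known already, for instance from the manifest or the workflow report.

```python
import shutil
import zipfile


def read_text_entry(bundle_path, entry, encoding="utf-8"):
    """Read one small textual entry of the bundle into a string."""
    with zipfile.ZipFile(bundle_path) as bundle:
        return bundle.read(entry).decode(encoding)


def copy_entry(bundle_path, entry, target, chunk_size=64 * 1024):
    """Stream one entry to a file, for binary or large values."""
    with zipfile.ZipFile(bundle_path) as bundle:
        with bundle.open(entry) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
```

For the example bundle, read_text_entry("helloanyone.bundle.zip", "outputs/greeting.txt") would return the greeting text, while copy_entry avoids holding a large value in memory.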

To facilitate tighter integration with the Data Bundle format, we have developed the Java Data Bundle API, which provides higher-level access for reading, creating and modifying data bundles. Example:

The above code will print out the content of outputs/greeting.txt. Regular Java 7 NIO Files operations can also be used with these Paths, for instance for binary content or larger values that won't fit in memory.

The Data Bundle API also ties into the SCUFL2 API to inspect the executed workflow definition:

In addition, you may retrieve the workflow run report as a Jackson JsonNode:

The subject can then be resolved to the corresponding SCUFL2 Processor using URITools and Scufl2Tools:

And printing the intermediate outputs of a particular processor invocation by looking up its bundle Path:
