Taverna PROV Data Bundle (Taverna 2.x)

Taverna 2.4 with the Taverna-PROV plugin 2.1.5 or later can export Taverna workflow runs as a Data Bundle. The bundle can be saved from within the Workbench results (Save All) or from the command line. The Data Bundle contains the workflow input and output values, intermediate values, a provenance trace and a copy of the executed workflow definition.

Structure of exported provenance

The .bundle.zip file is an RO Bundle, which specifies a structured ZIP file with a manifest (.ro/manifest.json). 

Mime type: 

application/vnd.wf4ever.robundle+zip

File extension:

.bundle.zip

An RO Bundle is effectively a structured ZIP file with a JSON-LD manifest that follows the Research Object data model, adding provisions for annotations and provenance of the aggregated resources. These resources can be embedded within the ZIP file or aggregated from external sources by URL references.

You can explore the bundle by unzipping it or browse it with a program like 7-Zip.

The Taverna-PROV source code includes an example bundle, and the same bundle unzipped as a folder. This data bundle was saved after running a simple hello world workflow.

The remaining text of this section describes the content of the RO bundle, as if it were unpacked to a folder. Note that many programming frameworks include support for working with ZIP files, so complete unpacking might not be necessary for your application. For Java, the Data Bundle API gives a programmatic way to inspect and generate data bundles.
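For instance, a single value can be read straight out of the bundle ZIP with Java 7 NIO's zip file system, without unpacking. A minimal sketch; the helper name is ours, and the bundle file name and the outputs/greeting.txt entry are assumed from the example bundle:

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadBundle {
    /** Read a textual value straight out of the bundle ZIP, without unpacking it. */
    static String readEntry(Path bundleZip, String entry) throws Exception {
        // Mount the ZIP as a virtual NIO file system
        try (FileSystem fs = FileSystems.newFileSystem(bundleZip, (ClassLoader) null)) {
            return new String(Files.readAllBytes(fs.getPath(entry)), "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        // File name assumed from the example bundle in the Taverna-PROV source
        System.out.println(readEntry(Paths.get("helloanyone.bundle.zip"), "outputs/greeting.txt"));
    }
}
```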

Inputs and outputs

The folders inputs/ and outputs/ contain files and folders corresponding to the input and output values of the executed workflow. A port with multiple values is stored as a folder of numbered entries, starting from 0. Values representing errors have the extension .err; other values have an extension guessed by inspecting the value structure, e.g. .png. External references have the extension .url - these files can often be opened as an "Internet shortcut" or similar, depending on your operating system.
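A consumer can rely on these naming conventions when walking the unpacked folders. A sketch of the classification rules listed above; the helper name and the outputs/ path are our assumptions:

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PortValues {
    /** Classify a port value file by the naming conventions described above. */
    static String classify(Path value) {
        if (Files.isDirectory(value))
            return "list";                 // numbered entries 0, 1, ...
        String name = value.getFileName().toString();
        if (name.endsWith(".err"))
            return "error";
        if (name.endsWith(".url"))
            return "external reference";
        return "inline value";             // extension guessed from content, e.g. .png
    }

    public static void main(String[] args) throws Exception {
        // Walk an unpacked outputs/ folder (path is an assumption)
        try (DirectoryStream<Path> ports = Files.newDirectoryStream(Paths.get("outputs"))) {
            for (Path port : ports)
                System.out.println(port.getFileName() + ": " + classify(port));
        }
    }
}
```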

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls
inputs intermediates mimetype outputs workflow.wfbundle workflowrun.prov.ttl
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls outputs
greeting.txt
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat outputs/greeting.txt
Hello, John Doe

Workflow run provenance

The file workflowrun.prov.ttl contains the PROV-O export of the workflow run provenance (including nested workflows) in RDF Turtle format.

This log details every intermediate processor invocation in the workflow execution, and relates them to inputs, outputs and intermediate values.

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat workflowrun.prov.ttl | head -n 40 | tail -n 8
<#taverna-prov-export>
rdf:type prov:Activity ;
prov:startedAtTime "2013-11-22T14:01:02.436Z"^^xsd:dateTime ;
prov:qualifiedCommunication _:b1 ;
prov:endedAtTime "2013-11-22T14:01:03.223Z"^^xsd:dateTime ;
rdfs:label "taverna-prov export of workflow run provenance"@en ;
prov:wasInformedBy <http://ns.taverna.org.uk/2011/run/385c794c-ba11-4007-a5b5-502ba8d14263/> ;

See the provenance graph for a complete example. The provenance uses the vocabularies W3C PROV-O, wfprov and tavernaprov.

Note that the URIs starting with 

http://ns.taverna.org.uk/2011/run/

http://ns.taverna.org.uk/2011/data/

http://ns.taverna.org.uk/2010/workflowBundle/

are not meant to be clickable (HTTP resolvable) and would currently give 404 Not Found.

The reason for this is that myGrid does not (and will not) store centrally any workflow run information, data values or workflow definitions. It is however still useful that each workflow definition, each workflow run and each produced data value can be uniquely identified, therefore we build these URIs using UUIDs that are generated within Taverna. It is possible that in the future these URIs could redirect to public search results, e.g. on myExperiment.
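Because each identifier embeds a plain UUID, consumers can still extract and compare identifiers locally without resolving them. A small sketch; the helper name is ours, and the run URI is taken from the provenance excerpt above:

```java
import java.util.UUID;

public class RunUri {
    /** Extract the UUID that follows a known namespace prefix. */
    static UUID extractUuid(String uri, String prefix) {
        // The UUID is the first path segment after the prefix
        return UUID.fromString(uri.substring(prefix.length()).split("/")[0]);
    }

    public static void main(String[] args) {
        String run = "http://ns.taverna.org.uk/2011/run/385c794c-ba11-4007-a5b5-502ba8d14263/";
        System.out.println(extractUuid(run, "http://ns.taverna.org.uk/2011/run/"));
        // → 385c794c-ba11-4007-a5b5-502ba8d14263
    }
}
```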

Intermediate values

Intermediate values are stored in the intermediates/ folder and referenced from workflowrun.prov.ttl.

Intermediate value from the example provenance:

<http://ns.taverna.org.uk/2011/data/385c794c-ba11-4007-a5b5-502ba8d14263/ref/d588f6ab-122e-4788-ab12-8b6b66a67354>
        tavernaprov:content          <intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt> ;
        wfprov:describedByParameter  <http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/processor/Concatenate_two_strings/in/string1> ;
        wfprov:describedByParameter  <http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/processor/hello/out/value> ;
        wfprov:wasOutputFrom         <http://ns.taverna.org.uk/2011/run/385c794c-ba11-4007-a5b5-502ba8d14263/process/bbaedc02-896f-491e-88bc-8dd350fcc73b/> .

Here we see that the bundle file intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt contains the output from the "hello" processor, which was also the input to the "Concatenate_two_strings" processor. Details about processors, ports and parameters can be found in the workflow definition.
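The path itself follows a simple convention, visible in the listing below: the value lives under intermediates/ in a sub-folder named after the first two characters of its UUID. A sketch of that mapping; the helper name is ours, and the extension depends on the guessed media type:

```java
public class IntermediatePath {
    /** Map a data UUID to its path inside the bundle, per the layout above. */
    static String intermediatePath(String uuid, String extension) {
        // Sub-folder named after the first two characters of the UUID
        return "intermediates/" + uuid.substring(0, 2) + "/" + uuid + extension;
    }

    public static void main(String[] args) {
        System.out.println(intermediatePath("d588f6ab-122e-4788-ab12-8b6b66a67354", ".txt"));
        // → intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt
    }
}
```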

Example listing:

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>ls intermediates/d5
d588f6ab-122e-4788-ab12-8b6b66a67354.txt
c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat intermediates/d5/d58*
Hello,

Note that "small" textual values are also included as cnt:chars in the graph, while the referenced intermediate file within the workflow bundle is always present.

<intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt>
rdf:type cnt:ContentAsText ;
cnt:characterEncoding "UTF-8"^^xsd:string ;
cnt:chars "Hello, "^^xsd:string ;
tavernaprov:byteCount "7"^^xsd:long ;
tavernaprov:sha512 "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"^^xsd:string ;
tavernaprov:sha1 "f52ab57fa51dfa714505294444463ae5a009ae34"^^xsd:string ;
rdf:type tavernaprov:Content .
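
The byte count and digests can be recomputed locally to verify a file in the bundle. A sketch using the standard java.security.MessageDigest; the helper name is ours, and the file path is from the example above:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Checksums {
    /** Hex-encoded digest of some content, e.g. algorithm "SHA-1" or "SHA-512". */
    static String digest(byte[] content, String algorithm) throws Exception {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance(algorithm).digest(content))
            hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = Files.readAllBytes(
                Paths.get("intermediates/d5/d588f6ab-122e-4788-ab12-8b6b66a67354.txt"));
        System.out.println("tavernaprov:byteCount " + content.length);
        System.out.println("tavernaprov:sha1 " + digest(content, "SHA-1"));
        System.out.println("tavernaprov:sha512 " + digest(content, "SHA-512"));
    }
}
```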

Workflow definition

The file workflow.wfbundle is a copy of the executed workflow in SCUFL2 workflow bundle format. This is the format that will be used by Taverna 3.

You can use the SCUFL2 API to inspect the workflow definition in detail.

The file .ro/annotations/workflow.wfdesc.ttl contains the abstract structure (but not all the implementation details) of the executed workflow, in RDF Turtle according to the wfdesc ontology.

c:\Users\stain\workspace\taverna-prov\example\helloanyone.bundle>cat .ro/annotations/workflow.wfdesc.ttl | head -n 20
@base <http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix wf4ever: <http://purl.org/wf4ever/wf4ever#> .
@prefix roterms: <http://purl.org/wf4ever/roterms#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix comp: <http://purl.org/DP/components#> .
@prefix dep: <http://scape.keep.pt/vocab/dependencies#> .
@prefix biocat: <http://biocatalogue.org/attribute/> .
@prefix : <#> .
<processor/Concatenate_two_strings/> a wfdesc:Process , wfdesc:Description , owl:Thing , wf4ever:BeanshellScript ;
        rdfs:label "Concatenate_two_strings" ;
        wfdesc:hasInput <processor/Concatenate_two_strings/in/string1> , <processor/Concatenate_two_strings/in/string2> ;
        wfdesc:hasOutput <processor/Concatenate_two_strings/out/output> ;
        wf4ever:script "output = string1 + string2;" .

Taverna 3 Data bundle

Taverna 3 uses the same Data Bundle format as the Taverna-PROV plugin. Currently there are some differences due to the two different implementations of provenance capture.

Differences from Taverna 2

Taverna 3 adds the workflow report workflowrun.json (see below).

Taverna 3 does not yet export the provenance trace to workflowrun.prov.ttl (Taverna-PROV issue 3), but as the workflow run report captures the same or similar information, a provenance trace equivalent to the Taverna 2 output can in theory be generated from the workflow report (T3-829, T3-970).

Taverna 3 can also open an existing data bundle and display it in the Results perspective. However, opening data bundles created with the Taverna-PROV plugin is currently not supported, as the implementation assumes the workflow report is present in the bundle (T3-971).

Workflow report (workflowrun.json)

Taverna 3 introduces a new resource in the data bundle, workflowrun.json, which is more Taverna-centric and mirrors the actual execution state while running a workflow. This example shows an excerpt of a workflow run report (see also the full workflowrun.json):

{
  "subject" : "http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/",
  "state" : "COMPLETED",
  "createdDate" : "2013-11-27T19:04:02.016+0000",
  "startedDate" : "2013-11-27T19:04:02.023+0000",
  "completedDate" : "2013-11-27T19:04:02.054+0000",
  "invocations" : [ {
    "id" : "Hello_Anyone",
    "state" : "COMPLETED",
    "inputs" : {
      "name" : "/inputs/name"
    },
    "outputs" : {
      "greeting" : "/outputs/greeting"
    }
  } ],
  "processorReports" : [ {
    "subject" : "http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/processor/Concatenate_two_strings/",
    "state" : "COMPLETED",
    "completedDate" : "2013-11-27T19:04:02.054+0000",
    "jobsCompleted" : 1,
    "invocations" : [ {
      "id" : "Hello_Anyone/Concatenate_two_strings",
      "parent" : "Hello_Anyone",
      "name" : "Concatenate_two_strings",
      "state" : "COMPLETED",
      "startedDate" : "2013-11-27T19:04:02.033+0000",
      "inputs" : {
        "string1" : "/intermediates/64/64140288-cf8b-4a47-99ae-b76cb4c531ad",
        "string2" : "/intermediates/3d/3d548b58-ec18-44ab-aeb6-7d9d5999ad21"
      },
      "outputs" : {
        "output" : "/intermediates/92/92721f0a-4fac-4aba-9a09-b2651f303577"
      }
    } ],
    "activityReports" : [ {
      "subject" : "http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/profile/taverna-2.4.0/activity/Concatenate_two_strings/",
      "state" : "COMPLETED",
      "completedDate" : "2013-11-27T19:04:02.049+0000",
      "invocations" : [ {
        "id" : "Hello_Anyone/Concatenate_two_strings/invocation80",
        /* .. */
        "inputs" : {
          "string1" : "/intermediates/64/64140288-cf8b-4a47-99ae-b76cb4c531ad",
          "string2" : "/intermediates/3d/3d548b58-ec18-44ab-aeb6-7d9d5999ad21"
        },
        "outputs" : {
          "output" : "/intermediates/92/92721f0a-4fac-4aba-9a09-b2651f303577"
        }
      } ]
    } ]
  }, {
    "subject" : "http://ns.taverna.org.uk/2010/workflowBundle/01348671-5aaa-4cc2-84cc-477329b70b0d/workflow/Hello_Anyone/processor/hello/",
    "state" : "COMPLETED"
    /* .... */
  } 
  ]
} 

The structure of the workflowrun.json is still subject to change until the official release of Taverna 3.0.

JSON structure, where optional means the property might not be present, and properties marked final should be present after the workflow has finished:

workflow report (top-level JSON Object)

Working with data bundles

The Data Bundle can be processed using normal ZIP tools, such as the command-line Info-ZIP unzip, built-in operating system support, or third-party programs like 7-Zip.

Additionally, programming languages typically have API support or libraries for working with ZIP files, such as Java 7's zipfs and the Apache Commons Compress API, or Ruby's rubyzip gem.

In order to facilitate tighter integration with the Data Bundle format, we have developed the Java Data Bundle API, which provides higher-level access for reading, creating and modifying data bundles. Example:

try (Bundle dataBundle = DataBundles.openBundle(zip)) {
    Path outputs = DataBundles.getOutputs(dataBundle);
    Path greeting = DataBundles.getPort(outputs, "greeting");
    System.out.println(DataBundles.getStringValue(greeting));
}

The above code will print out the content of outputs/greeting.txt. Regular Java 7 NIO Files operations can also be used with these Paths, for instance for binary content or larger values that won't fit in memory.

The Data Bundle API also ties into the SCUFL2 API to inspect the executed workflow definition:

WorkflowBundle wfBundle = DataBundles.getWorkflowBundle(dataBundle);
for (Processor processor : wfBundle.getMainWorkflow().getProcessors()) {
    System.out.println("Processor " + processor);
}

In addition, you may retrieve the workflow run report as a Jackson JsonNode:

JsonNode runReport = DataBundles.getWorkflowRunReport(dataBundle);
for (JsonNode procReport : runReport.path("processorReports")) {
    URI subject = URI.create(procReport.path("subject").asText());
    for (JsonNode invocation : procReport.path("invocations")) {
        System.out.println("Invocation started: " + invocation.path("startedDate").asText());
    }
}

The subject can then be looked up to find the corresponding SCUFL2 Processor, using URITools and Scufl2Tools:

URITools uriTools = new URITools();
Processor proc = (Processor)uriTools.resolveBean(wfBundle, subject);    
System.out.println("Execution of " + proc);
 
Scufl2Tools scufl2Tools = new Scufl2Tools();
Configuration activityConfig = scufl2Tools.configurationForActivityBoundToProcessor(proc);
System.out.println("Activity: " + activityConfig.getJsonAsString());

And printing the intermediate outputs of a particular processor invocation by looking up its bundle Path:

for (Port outputPort : proc.getOutputs()) {
    System.out.println("Output " + outputPort);
    String output = invocation.path(outputPort.getName()).asText();
    Path outputPath = dataBundle.getRoot().resolve(output);
    System.out.println("Value: " + DataBundles.getStringValue(outputPath)); 
}