
Scufl2 has moved to Apache (incubator) 

Information in this section is out of date!

A data bundle is an archive of Taverna workflow data. A data bundle can contain workflow inputs, outputs and intermediate values.

Outdated page

This is an outdated requirements page.

See 2013-02 Data bundle requirements instead.

Taverna workflow/processor inputs and outputs are in the form of a map from the port name to the value.

This page describes a way to make a single file containing such a map of data values, which can be used as an input when running a workflow, or produced as an output from a workflow run. This will be used by the Taverna Server and the Taverna Command line, and will also provide a way to archive or exchange interesting data from within the Taverna Workbench.

Taverna data structures

Since Taverna 2, Taverna has 3 different types of data values:

  • Individual data values (String, bytes, URL)
  • Lists
  • Error documents

Lists can contain other lists, errors and values of the expected depth.

Depth

Taverna 2 enforces that lists have a uniform granularity. A list containing individual values is said to be of depth 1, and cannot contain lists. A list of depth 2 can only contain items of depth 1, etc. Individual values always have depth 0, lists are always depth 1 or higher. Errors can be of any non-negative depth, as they replace a value or list that could have been produced, instead including a stack trace, message or a link to another error document.
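
The depth rule can be illustrated with a small sketch. It models Taverna lists as plain Python lists and uses a hypothetical ErrorDoc class (not part of Taverna) to stand in for error documents; treating an empty list as depth 1 is an assumption made only for this example.

```python
class ErrorDoc:
    """Hypothetical stand-in for a Taverna error document of a given depth."""

    def __init__(self, depth=0, message=""):
        self.depth = depth
        self.message = message


def depth_of(item):
    """Return the depth of a value, error or list; raise ValueError if a list is not uniform."""
    if isinstance(item, ErrorDoc):
        return item.depth  # errors carry the depth of whatever they replace
    if not isinstance(item, list):
        return 0  # individual values always have depth 0
    child_depths = {depth_of(child) for child in item}
    if len(child_depths) > 1:
        raise ValueError(f"mixed depths in list: {sorted(child_depths)}")
    # An empty list gives no hint about its children; assume depth 1 here.
    return 1 + (child_depths.pop() if child_depths else 0)
```

With this sketch, depth_of([["a"], ErrorDoc(depth=1)]) is 2, because the error of depth 1 takes the place of an inner list, while a list mixing a plain value and a nested list is rejected.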

References

Each data item in Taverna is assigned a globally unique reference when it's produced. Lists contain only references to the other values, meaning that data items can potentially occur in several lists (and even in several workflow runs) at the same time. Generally this is however not the case, as it would require specific support from the implementing service plugin.

Data values can also be external references, such as a URL, a local file name or something specific to a particular service plugin. For instance the API consumer produces references to live Java objects within the VM (which hence can't be exported), while other potential references could be GridFTP URIs or Amazon snapshots. Taverna only requires that the reference itself can be serialised. The most commonly used references are however the inline string and the inline byte array, which simply contain the actual data values.

Former work: Baclava

Taverna 1 had a data format called Baclava, which is a single XML file containing the base64-serialised bytes of the data values. Strings were initially serialised in the character encoding of the operating system, but from Taverna 1.7 (?) this encoding was enforced to be UTF-8. As Taverna 1 does not have error documents, they don't have a proper representation in Baclava.

Taverna 2.x has support for loading and saving Baclava files. They can also be viewed using the standalone DataViewer tool.

Example of legacy Baclava XML format:

This format is unfortunately quite verbose and cryptic to deal with. For instance, syntactictype="l(l('application/octet-stream'))" means it's a list of lists - l(l( - of values of the mime type application/octet-stream, i.e. binaries. It is not possible to have different mime types across the values, although it is possible to have multiple types under <s:mimeTypes>.

Values are included inline in the XML as Base64, meaning that they'll take about a third more bytes than needed (more once line breaks are added), and also making it tricky to deal with the XML without running out of memory.
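
Base64 encodes every 3 bytes as 4 ASCII characters, so the size cost of inlining can be checked with a few lines of Python. The payload below is arbitrary test data, not taken from a real Baclava file.

```python
import base64

raw = bytes(range(256)) * 12  # 3072 bytes of arbitrary binary data
encoded = base64.b64encode(raw)  # no line breaks are inserted by b64encode
overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} bytes -> {len(encoded)} Base64 characters ({overhead:.0%} larger)")
```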

Some legacy leftovers like partialOrder and relationList simply complicate the picture.

Data identifiers can be given as LSIDs - but these days URIs are preferred.

The Baclava format also does not support error documents, references to external data, or any kind of description of where the data came from.

Proposed data bundle format

As for SCUFL2 workflow bundles, the data bundle is a single ZIP archive with an ODF-compatible manifest. Each value is a file in the ZIP archive, and each list is a folder. Exposed by the (RESTful) Taverna Server, the data bundle can be viewed as a directory structure, and by inspecting the manifest the client can determine the mime types and sizes of the individual values before choosing to download them or give the user a link.

In addition to the manifest, metadata can be provided in the bundle, describing which workflow run produced the values, and what official URIs the data has been assigned. In theory this can be expanded to include the full provenance of individual values and processors in the workflow.

Archive directory structure

Path | Type | Description | Depth
blah-outputs.t2data | ZIP | Taverna data bundle |
blah-outputs.t2data/mimetype | Text | Mime type of bundle, ie. application/vnd.taverna.data-bundle |
blah-outputs.t2data/META-INF/ | Folder | Reserved folder for manifest |
blah-outputs.t2data/META-INF/manifest.xml | XML | ODF 1.3-like manifest, listing each file, mime-type and file size |
blah-outputs.t2data/META-INF/container.xml | XML | Adobe UCF/OEBPS list of root file |
blah-outputs.t2data/outputs.rdf | RDF/XML | Structure and metadata about outputs |
blah-outputs.t2data/outputs/ | Folder | Outputs from a workflow run |
blah-outputs.t2data/outputs/fish/ | Folder | List output at port fish | 1
blah-outputs.t2data/outputs/fish/0.txt | Text (UTF-8) | Single value at position 0 | 0
blah-outputs.t2data/outputs/fish/1.uri | URI | Single URI reference at position 1 | 0
blah-outputs.t2data/outputs/soup/ | Folder | List output at port soup | 2
blah-outputs.t2data/outputs/soup/0/ | Folder | List output (depth 1) at position 0 | 1
blah-outputs.t2data/outputs/soup/0/0.txt | Text | Single value at position 0 |
blah-outputs.t2data/outputs/soup/0/1.err | Error document | Error at position 1 | 0
blah-outputs.t2data/outputs/soup/1/ | Folder | Empty list output at position 1 | 1
blah-outputs.t2data/outputs/soup/2.err | Error document | Error at position 2 | 1
blah-outputs.t2data/outputs/results | Binary | Single output at port results | 0
blah-outputs.t2data/inputs.rdf | RDF/XML | Structure and metadata about workflow inputs |
blah-outputs.t2data/inputs/ | Folder | Inputs from a workflow run |
blah-outputs.t2data/inputs/* | ... | Same structure as outputs/ |

The archive must be a ZIP file, and should have the file extension .t2data. Some situations might require treating the data bundle as an unpacked set of folders. In this case the top folder should still have the file extension .t2data.

According to the Adobe UCF specifications, the mimetype file must be the first file in the archive, and must be stored without compression, encryption or permission attributes, to support detection by mimemagic and similar.

The file META-INF/manifest.xml - if present - must list every non-META-INF file and folder in the archive, including the root folder. It should provide the mime-type - if known - for individual files. The root folder should have the same mime type as in the mimetype file - application/vnd.taverna.data-bundle.

The file META-INF/container.xml - if present - should point to the entry point for the 'main' data of the bundle: one and only one entry, which must be of the mime type application/rdf+xml. For a bundle with workflow outputs this should be outputs.rdf, describing outputs/*, while for a bundle representing workflow inputs this should be inputs.rdf, describing inputs/*. If the implementation does not know whether the data is inputs or outputs, data.rdf, describing data/*, must be used.

outputs.rdf contains the structural information about the individual lists, values and errors in the folder outputs. Each structural metadata file pairs with a folder containing the data. The names outputs/ and outputs.rdf are reserved for workflow outputs, and inputs/ and inputs.rdf for workflow inputs.

outputs contains the data lists and values. Each direct sub-folder or file represents a port with the given name, so outputs/fish/ is a list at the port fish, and outputs/results is a single value at the port results.
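
As a rough illustration of this layout, the sketch below flattens a port map into archive paths. The function name bundle_paths is hypothetical, strings are assumed to be stored as .txt files and bytes without an extension, and error documents and URI references are left out for brevity.

```python
def bundle_paths(ports, root="outputs"):
    """Flatten a port-name -> value map into (archive path, payload) pairs."""
    entries = [(root + "/", None)]

    def walk(path, item):
        if isinstance(item, list):
            entries.append((path + "/", None))  # a list becomes a folder
            for position, child in enumerate(item):
                walk(f"{path}/{position}", child)  # positions are zero-based
        elif isinstance(item, str):
            entries.append((path + ".txt", item.encode("utf-8")))
        else:
            entries.append((path, item))  # bytes are stored without an extension

    for port, value in ports.items():
        walk(f"{root}/{port}", value)
    return entries


for path, _ in bundle_paths({"fish": ["cod", "haddock"], "results": b"\x00\x01"}):
    print(path)
```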

Which data structure?

If more than one folder except META-INF/ exists in the root of the data bundle, the file META-INF/container.xml must define what is the root data structure (typically outputs.rdf), so that tools can know which data to prefer, say to show in a viewer or use as workflow inputs.

If a workflow execution environment is fed a data bundle for the workflow inputs, it should generally pick the root data structure, allowing workflow outputs to be used as input for a second workflow with matching port names and depths. If the bundle mimetype is different from application/vnd.taverna.data-bundle, an execution tool should instead use the inputs/inputs.rdf as workflow inputs. This could typically be the case where the data-bundle has been provided as example input as part of a SCUFL2 workflow bundle of the mime type application/vnd.taverna.scufl2.workflow-bundle.

Similarly, a data viewing tool should prefer the root data structure, but if the bundle mime-type is not application/vnd.taverna.data-bundle, it should primarily show the outputs/outputs.rdf structure.

mimetype

This file is required, as a guide for mime magic and similar tools that guess the type of the archive. Therefore it must be added as the first file to the archive, uncompressed, so that its content is available in cleartext in the first bytes of the ZIP archive.

The file must be in ASCII and not contain any line feeds. If the archive is a Taverna Data Bundle, the mime type should be application/vnd.taverna.data-bundle. If META-INF/manifest.xml is present, this mime type must match the mime type of "/" in the manifest.
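
The ordering and compression rules above can also be sketched with Python's standard zipfile module. This is only an illustration: write_bundle and looks_like_bundle are hypothetical helper names, and the byte offsets in the check assume the first entry has no extra field in its local file header.

```python
import io
import zipfile

MIME = "application/vnd.taverna.data-bundle"


def write_bundle(target, files):
    """Write a ZIP with an uncompressed `mimetype` entry first, then the other files."""
    with zipfile.ZipFile(target, "w") as bundle:
        # ZIP_STORED keeps the mime type readable as clear text near the start of the archive.
        bundle.writestr(zipfile.ZipInfo("mimetype"), MIME, compress_type=zipfile.ZIP_STORED)
        for name, payload in files.items():
            bundle.writestr(name, payload, compress_type=zipfile.ZIP_DEFLATED)


def looks_like_bundle(data):
    """Check raw bytes: a 30-byte local file header, the entry name, then the content."""
    return data[30:38] == b"mimetype" and data[38:38 + len(MIME)] == MIME.encode("ascii")


buffer = io.BytesIO()
write_bundle(buffer, {"outputs/results": b"\x00\x01", "outputs/fish/0.txt": "cod"})
print(looks_like_bundle(buffer.getvalue()))
```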

To add the file mimetype as the first uncompressed file, followed by the rest of the bundle (excluding the mimetype file), try using InfoZip:

To verify:

META-INF/manifest.xml

This file, if it exists, should follow the OpenDocument container format, and list every file in the bundle (except for the META-INF files). The main functionality provided by the manifest is to give the mime-type of individual data items, which are not required to have extensions. As a minimum the mime-type should distinguish between text/plain (UTF-8 text) and application/octet-stream (binary), but if the workflow definition or a mime-magic-like tool has guessed a more detailed mime type, it can be provided here.

Additionally the manifest may specify the file sizes. This can be useful when inspecting a larger data bundle remotely (exposed as a RESTful folder or similar).

The folder / represents the bundle itself, and must have the same mime type as in the file mimetype, ie. application/vnd.taverna.data-bundle. A different mime type might be used if the primary purpose of the archive is different from being a data bundle, for instance being a SCUFL2 workflow bundle.

Error documents must have the mime type application/vnd.taverna.error.

The other folders are not required to have a mimetype, but if desired these mime types can be used:

  • application/vnd.taverna.port-data for the outputs, inputs and other top-level data structure folders
  • application/vnd.taverna.list for folders which are lists, like outputs/fish and outputs/soup/0
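
A manifest along these lines could be generated as sketched below. The namespace and the file-entry, full-path, media-type and size names follow the OpenDocument manifest vocabulary; build_manifest is a hypothetical helper, and writing an empty media-type for unknown types is an assumption of this sketch rather than a requirement stated on this page.

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ET.register_namespace("manifest", NS)


def build_manifest(entries):
    """Build manifest XML from (path, media type or None, size or None) tuples."""
    root = ET.Element(f"{{{NS}}}manifest")
    for path, media_type, size in entries:
        attrs = {
            f"{{{NS}}}full-path": path,
            f"{{{NS}}}media-type": media_type or "",  # assumption: empty when unknown
        }
        if size is not None:
            attrs[f"{{{NS}}}size"] = str(size)
        ET.SubElement(root, f"{{{NS}}}file-entry", attrs)
    return ET.tostring(root, encoding="unicode")


print(build_manifest([
    ("/", "application/vnd.taverna.data-bundle", None),
    ("outputs/", "application/vnd.taverna.port-data", None),
    ("outputs/fish/", "application/vnd.taverna.list", None),
    ("outputs/fish/0.txt", "text/plain", 3),
]))
```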

If there is no manifest in the bundle, all data value files should be treated as binary (application/octet-stream), unless they have one of these file extensions:

  • *.txt is text/plain in UTF-8 character set
  • *.err is application/vnd.taverna.error (RDF/XML in UTF-8)
  • outputs.rdf and similar files in the root folder are application/rdf+xml
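
The fallback rules for a bundle without a manifest amount to a small lookup, sketched here. The function name guess_mime_type is hypothetical, and paths are assumed to be relative to the bundle root.

```python
def guess_mime_type(path):
    """Fallback mime type for a file in a bundle that has no manifest."""
    if "/" not in path and path.endswith(".rdf"):
        return "application/rdf+xml"  # structure files such as outputs.rdf in the root
    if path.endswith(".txt"):
        return "text/plain"  # to be read as UTF-8
    if path.endswith(".err"):
        return "application/vnd.taverna.error"
    return "application/octet-stream"  # everything else is treated as binary
```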

Example manifest:

META-INF/container.xml

This file, if present, should point to the root data structure, typically outputs.rdf. Alternative representations of the same file are permitted, but tools will generally only use application/rdf+xml.

If the container file does not exist, and the bundle is of the mime type application/vnd.taverna.data-bundle, there must be only one folder except META-INF/ in the archive, which would be used together with the corresponding structure file. So if the folder contains fish/, fish.rdf and soup.rdf, then fish/ together with fish.rdf will be used.
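
A tool could apply this rule roughly as follows. pick_root_structure is a hypothetical helper, and the input is assumed to be the list of names directly in the bundle root, with folders marked by a trailing slash.

```python
def pick_root_structure(root_entries):
    """Return (data folder, structure file or None) for a bundle without container.xml."""
    folders = [e for e in root_entries if e.endswith("/") and e != "META-INF/"]
    if len(folders) != 1:
        raise ValueError("without META-INF/container.xml exactly one data folder is expected")
    folder = folders[0]
    structure = folder.rstrip("/") + ".rdf"
    # The structure file is optional, so report None when it is absent.
    return folder, (structure if structure in root_entries else None)


print(pick_root_structure(["mimetype", "META-INF/", "fish/", "fish.rdf", "soup.rdf"]))
```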

The data structure file is optional, so the container file can instead list the folder itself as a rootfile, which must then be of the mime-type application/vnd.taverna.port-data. It is generally not required to list the folder if a rootfile in the required application/rdf+xml format is already given.

All rootfiles must be equivalent and describe the same data structure, although additional formats can include more or less information than the required format. There should be only one rootfile per media-type, and there must be only one rootfile for the media types application/rdf+xml and application/vnd.taverna.port-data.

Example:

Port data folders: outputs/ inputs/ data/ */

The folder outputs/ contains the data for the workflow output ports. If the output port returned a list, a folder with the port's name will be present. Output ports of depth 0 (single values) will on the other hand be represented directly. So in this example outputs/fish is the list at port fish, while outputs/results is a single output for the port results.

This folder structure is required for a data-bundle, even if there are no ports (in which case the folder is empty).

If a mimetype is given for the data folders in the manifest, it must be application/vnd.taverna.port-data.

Several port data folders can be present in the data bundle, but only the root data structure will generally be used, see META-INF/container.xml.

The folder inputs/ is the corresponding folder for workflow inputs. If this is present in the bundle together with the outputs/ folder, it represents the inputs used in a run that produced the given outputs. In this case the bundle must either provide details about which workflow was run in outputs.rdf, or must be a SCUFL2 workflow bundle as well, in which case workflowBundle.rdf should be the workflow bundle that was run.

If the tool creating the data does not consider whether the data is input or outputs, it may use data/ as the root folder.

If the data bundle is also a workflow bundle (indicated by the presence of workflowBundle.rdf), the mime-type of the archive can be application/vnd.taverna.data-bundle, application/vnd.taverna.scufl2.workflow-bundle or a third party mime-type. The mime-type gives an indication of the primary role of the bundle, but tools are not required to support dual-natured bundles and can treat the bundle purely by the given type.

Examples of dual-natured bundles (suggestions):

  • application/vnd.taverna.scufl2.workflow-bundle with inputs/ and outputs/ - a workflow with example inputs and outputs.
  • application/vnd.taverna.scufl2.workflow-bundle with inputs/ - a workflow to be run with given inputs
  • application/vnd.taverna.scufl2.workflow-bundle with data/ - a workflow with associated (reference) data
  • application/vnd.taverna.data-bundle with outputs/, inputs/ and workflowBundle.rdf - workflow outputs produced by running given workflow with given inputs
  • application/vnd.taverna.data-bundle with inputs/ and workflowBundle.rdf - a dataset that can optionally be further processed with the given workflow

If the mime type of the bundle is different from application/vnd.taverna.data-bundle, or a data folder is not the primary folder, the folder must have the mime-type application/vnd.taverna.port-data in the manifest to enable discovery.

outputs/fish

A folder representing a list can only contain files with numeric filenames (ignoring extension).

The lowest filename allowed is 0, representing the first element in the list.

It is not allowed to have several files with the same number, so if the folder contains 2.txt it can't also contain 2.jpg.

Gaps in the sequence are only allowed if the list represents a snapshot of an incomplete run. In a complete list, if the folder contains 0.txt and 4, one should also be able to find files for 1, 2 and 3.

File extensions are optional for individual values, and are ignored if the manifest declares the mime type, except for error documents which must have the file extension .err.

Nested lists are represented as nested folders without file extensions, like outputs/soup/1/. As Taverna lists must be of uniform depth, a folder can't contain both a folder and a file that is not an .err error document. An empty folder represents an empty list.
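
The naming rules for list folders can be collected into a small checker, sketched below. check_list_folder is a hypothetical helper, and child names are assumed to be given relative to the list folder, with sub-folders marked by a trailing slash.

```python
import re


def check_list_folder(entries, complete=True):
    """Return a list of problems found among the direct children of one list folder."""
    problems, seen, kinds = [], set(), set()
    for entry in entries:
        is_folder = entry.endswith("/")
        stem = entry.rstrip("/").split(".", 1)[0]  # file name without extension
        if not re.fullmatch(r"[0-9]+", stem):
            problems.append(f"{entry}: name is not a list position")
            continue
        position = int(stem)
        if position in seen:
            problems.append(f"{entry}: position {position} is used twice")
        seen.add(position)
        if not entry.endswith(".err"):  # error documents may replace values or lists
            kinds.add("folder" if is_folder else "file")
    if len(kinds) > 1:
        problems.append("folder mixes nested lists and plain values")
    if complete and seen != set(range(len(seen))):
        problems.append("positions have gaps")
    return problems
```

For a snapshot of an incomplete run, passing complete=False skips the gap check while still reporting duplicate positions and mixed content.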

What about the depth of empty lists? A special file name like outputs/soup/2.1, or a magic file like outputs/soup/2/.depth containing 1? Although the depth is in the structural data file, if we make the structural file optional, an alternative way to indicate the depth is needed, in particular for cases where there are no data files that can help determine the top-level depth of the port.

outputs.rdf (and other data structure files)

This structural data file describes the data items in the corresponding folder. So outputs/ would be described by outputs.rdf, and equivalently for other data structure folders.

This file is optional, as the pure data structure should be evident in the file structure alone. The main purpose of this file is to give further information about the data, if available, such as data identifiers, how the data was produced, etc. As the data is assigned global identifiers, this also provides the hooks for adding further provenance annotations which can be included as separate files in the data bundle.

This file allows additional metadata like <outputFrom> to indicate how/when/where the data was produced. Detailed provenance information in the bundle could be linked to from this file. Official data identifiers can be provided using owl:sameAs.

This ontology should probably subclass the Ordered list ontology

Indication of list depths and positions must correspond with the file and folder names as described in the previous section, so:

would not be valid, as outputs/fish/2/4 means outputs/fish is of depth 2 or more, and must be in position 4, not 3.

Note - as several entries could share the same identifier via owl:sameAs - when using OWL inference it could be possible that a value or list is present in several parent lists, ports or even in different data bundles.

The main format is application/rdf+xml, but it will come with an XML schema, so that clients can read or generate the file without general RDF knowledge. This means that clients should write the file as plain XML following the schema, rather than with a generic RDF/XML serialiser whose output might not comply with the schema.

Example outputs.rdf (application/rdf+xml):

Alternative, secondary output formats might be included in the bundle, like HTML, JSON or Turtle. They should have a similar filename relating them to the folder, with an extension to indicate the type. If such files are included, the manifest must declare their actual mime type.

Example outputs.ttl (text/turtle):

The explanations below use the RDF/XML format for examples, but Turtle for any inline snippets.

Bundle and bundle identifier

This statement describes the bundle itself. The data bundle should have a global bundle identifier, but only the root data file (in this case outputs.rdf) should assign this using owl:sameAs.

Anyone is allowed to mint a non-information resource at http://ns.taverna.org.uk/2010/data/bundle/$uuid/ as long as they generate a fresh, random v4 UUID. There is no promise that any information will be available at that URI, as the data in the bundle is not generally publicised. It will however be a common anchor-point as the identifier of the data bundle for third-party annotations.

The bundle URI here ends in / - so that one could talk about components of the data bundle, for instance http://ns.taverna.org.uk/2010/data/bundle/1495ca3a-f61a-437b-83ad-c6437c92a3d0/outputs/results would talk about the output results as it is in the data bundle 1495..d0.
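
Minting such an identifier, and resolving a component against it, could look like the sketch below. mint_bundle_uri is a hypothetical helper name; only the URI pattern itself comes from this page.

```python
import uuid
from urllib.parse import urljoin

BUNDLE_NS = "http://ns.taverna.org.uk/2010/data/bundle/"


def mint_bundle_uri():
    """Mint a new bundle identifier from a fresh, random (version 4) UUID."""
    return f"{BUNDLE_NS}{uuid.uuid4()}/"


bundle_uri = mint_bundle_uri()
# The trailing slash lets relative paths resolve to components of the bundle.
results_uri = urljoin(bundle_uri, "outputs/results")
print(results_uri)
```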

The :contains bit says that the bundle contains the folder outputs/ - which is what this file is describing. inputs.rdf would similarly contain <.> :contains <inputs/>, but should not include/repeat the owl:sameAs statement.

Should this use scufl:sameBaseAs instead? Do we need a bundle ontology shared with SCUFL2?

Workflow outputs

This says that outputs/ contains workflow outputs. Other types are workflowInputs, processorInputs and processorOutputs. If you are not sure (like in data.rdf), use workflowData.

Then the three outputs are included. These could be lists, errors or values.

Should IRIs for the folders include the trailing / or not? The reasoning is that values have the file extension, so lists can have the slash as they are folders - but this means there are no annotations about <outputs/soup> - only outputs/soup/.

Lists

This defines outputs/fish/ as a list of depth 1. In this case (directly below outputs/) the list is also a workflowOutput.

Two list entries are included, outputs/fish/0.txt and outputs/fish/1.uri - their listPosition must match the filename (excluding extension, if present) - and only one entry per position is allowed.

An empty list would simply not have any hasListEntry elements - but must still have a depth - see outputs/soup/1/.

The optional owl:sameAs defines a global URI identifying this list. This identifier is typically generated by Taverna when the list is created, and should be of the form http://ns.taverna.org.uk/2010/data/list/$uuid/ using a unique, random UUID v4. The trailing / allows for items (as they are in the list) to be described using zero-based indexes, like http://ns.taverna.org.uk/2010/data/list/$uuid/5.

  • Should the URI include the depth of the list?
  • Should the URI include the namespace of the list?

There should not generally be gaps in the list position, so the first list position should be 0, second 1, etc. Gaps would only occur if the list is incomplete, say a snapshot of a workflow output port before a workflow has finished. In this case the list should not have an owl:sameAs identifier assigned, as this identifier is assigned once the list is complete and immutable.

The list outputs/soup/ is of depth 2, and defines nested lists outputs/soup/0/ and outputs/soup/1/ which are both included in the same way - but without the workflowOutput specific annotations.

Run provenance for workflowOutput

The optional producedBy indicates which workflow run produced this value. This annotation can be added to any of the workflowOutput elements together with the following outputFrom. The run identifier should be of the form http://ns.taverna.org.uk/2010/run/$uuid/ with a unique, random UUID v4, assigned for each new workflow run (across nested workflows).

The optional outputFrom tells us which port produced this value, identified using SCUFL2 URIs. In this case it is the workflow output port fish in the workflow HelloWorld in the workflow bundle 00626652-55ae-4a9e-80d4-c8e9ac84e2ca. It is important that this reference is to the port as it is in a workflow bundle, not directly in a workflow (http://ns.taverna.org.uk/2010/workflow/00626652-55ae-4a9e-80d4-c8e9ac84e2ca/out/fish) - as the latter could be included in several workflow bundles as a nested workflow.

This level of provenance is typically annotated on just the top-level workflow output ports. Provenance on the individual items inside lists, indicating which processor output of which run of which processor produced them, is out of scope for this specification - but such annotations should reuse the owl:sameAs identifiers and the SCUFL2 URIs for describing workflow components.

Information about the workflow run should be included as well, as a minimum linking the run to the executed workflow bundle:

Further run information should be included in a separate resource in the run bundle under run/$uuid.rdf, linked as run:b9455363-5624-4744-901b-3d6c7ec273d7 rdfs:seeAlso <run/b9455363-5624-4744-901b-3d6c7ec273d7.rdf>. The structure of a run bundle is out of scope for this specification, but should reuse the same identifiers as in the data and workflow bundles.

Values

There are not many annotations on values (unless they are also workflow outputs, like outputs/results). The data identifier for values is of the form http://ns.taverna.org.uk/2010/data/value/$uuid using a unique, random UUID v4. Note that several values could have the same content but different identifiers (the same output produced by two service calls, for instance).

  • Should the URI include the namespace of the value?
  • Should the mime type and file size from the manifest also be included here?
  • Should we include an optional (sha1 or similar) checksum of the value? (can describe content equality, but also integrity of data bundle)

External references

External references are stored in a file with the extension .uri and the mime type text/uri-list. These URIs could also be included in the metadata:

  • Which predicate to use to indicate URI links? xlink:href? (owl:sameAs is probably not appropriate)

Errors

Error documents are produced by Taverna when a service fails, and are returned instead of the expected output type. Therefore errors have depth: an error of depth 0 replaces a value, and an error of depth 1 or above replaces an expected list. This allows for returning a complete list even if just a single service call or output failed, with some of the list's values being error documents.

Errors must have a filename ending in .err in the data bundle.

The internal format of the .err file is not yet finalised - but should include a stack trace and/or a message and could include a 'caused by' link to another error document.

  • Should the URI include the namespace of the list?
  • Should the URI include the depth of the error? (Taverna can occasionally (when forced to iterate over an error) make a derived error with the same UUID, but a different depth)
  • Should the internal details of the error - like 'caused by' and 'message' be included here as well?