Scufl2 has moved to Apache (incubator)
Information in this section is out of date!
A data bundle is an archive of Taverna workflow data. A data bundle can contain workflow inputs, outputs and intermediate values.
This page is an outdated requirements page.
See 2013-02 Data bundle requirements instead.
Taverna workflow/processor inputs and outputs are in the form of a map from the port name to the value.
This page describes a way to make a single file containing such a map of data values, which can be used as an input when running a workflow, or produced as an output from a workflow run. This will be used by the Taverna Server and the Taverna Command line, and will also provide a way to archive or exchange interesting data from within the Taverna Workbench.
Taverna data structures
Since Taverna 2, Taverna has 3 different types of data values:
- Individual data values (String, bytes, URL)
- Error documents
- Lists
Lists can contain other lists, errors and values of the expected depth.
Taverna 2 enforces that lists have a uniform granularity. A list containing individual values is said to be of depth 1, and cannot contain lists. A list of depth 2 can only contain items of depth 1, etc. Individual values always have depth 0, lists are always depth 1 or higher. Errors can be of any non-negative depth, as they replace a value or list that could have been produced, instead including a stack trace, message or a link to another error document.
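The depth rules above can be sketched as a small recursive check. This is only an illustration in Python: the Value and Error classes and the has_depth function are hypothetical names invented for this example, not part of Taverna.

```python
from dataclasses import dataclass


@dataclass
class Value:
    """An individual data value; always depth 0 (hypothetical class)."""
    content: bytes


@dataclass
class Error:
    """An error document replacing a value or list of the given depth."""
    depth: int
    message: str = ""


def has_depth(item, depth: int) -> bool:
    """Return True if item is a valid Taverna data item of exactly this depth."""
    if isinstance(item, Value):
        return depth == 0
    if isinstance(item, Error):
        # Errors may stand in for a value (depth 0) or for a list (depth >= 1).
        return item.depth == depth
    if isinstance(item, list):
        # A list of depth n may only contain items of depth n - 1.
        return depth >= 1 and all(has_depth(child, depth - 1) for child in item)
    return False
```

Under these assumptions, a list mixing values and depth-0 errors passes as depth 1, while a list mixing values and nested lists is rejected. An empty list is accepted at any depth of 1 or higher, since its contents cannot determine its depth.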
Each data item in Taverna is assigned a globally unique reference when it's produced. Lists contain only references to the other values, meaning that data items can potentially occur in several lists (and even in several workflow runs) at the same time. Generally this is however not the case, as it would require specific support from the implementing service plugin.
Data values can also be external references, such as a URL, a local file name or something specific to a particular service plugin. For instance the API consumer produces references to live Java objects within the VM (and hence can't be exported), while other potential references could be GridFTP URIs or Amazon snapshots. Taverna only requires that the reference itself can be serialised. The most commonly used references are however the inline string and the inline byte array, which simply contain the actual data values.
Former work: Baclava
Taverna 1 had a data format called Baclava, which is a single XML file containing the base64-serialised bytes of the data values. Strings were initially serialised in the character encoding of the operating system, but from Taverna 1.7 (?) this encoding was enforced to be UTF-8. As Taverna 1 does not have error documents, they don't have a proper representation in Baclava.
Taverna 2.x has support for loading and saving Baclava files; they can also be viewed using the standalone DataViewer tool.
Example of legacy Baclava XML format:
This format is unfortunately quite verbose and cryptic to deal with. For instance syntactictype="l(l('application/octet-stream'))"> means it's a list of a list - l(l( - of values of the mime type application/octet-stream - i.e. binaries. It is not possible to have different mime types across the values - although it is possible to have multiple types under
Values are included inline in the XML as Base64, meaning that they'll take at least a third more bytes than needed (Base64 encodes every 3 bytes as 4 characters), and also making it tricky to deal with the XML without running out of memory.
Some legacy leftovers like relationList simply complicate the picture.
Data identifiers can be given as LSIDs - but these days URIs are preferred.
The Baclava format also does not support error documents, references to external data, or any kind of description of where the data came from.
Proposed data bundle format
As for SCUFL2 workflow bundles, the data bundle is a single ZIP archive with an ODF-compatible manifest. Each value is a file in the ZIP archive, and each list is a folder. Exposed by the (RESTful) Taverna Server, the data bundle can be viewed as a directory structure, and by inspecting the manifest the client can determine the mime types and sizes of the individual values before choosing to download or give the user a link.
In addition to the manifest, metadata can be provided in the bundle, describing which workflow run produced the values, and what official URIs the data has been assigned. In theory this can be expanded to include the full provenance of individual values and processors in the workflow.
Archive directory structure
Taverna data bundle
Mime type of bundle, ie.
Reserved folder for manifest
ODF 1.3-like manifest, listing each file, mime-type and file size
Adobe UCF/OEBPS list of root file
Structure and metadata about outputs
Outputs from a workflow run
List output at port
Single value at position 0
Single URI reference at position 1
List output at port
List output (depth 1) at position 0
Single value at position 0
Error at position 1
Empty list output at position 1
Error at position 2
Single output at port
Structure and metadata about workflow inputs
Inputs from a workflow run
Same structure as outputs/
The archive must be a ZIP file, and should have the file extension
.t2data. Some situations might require treating the data bundle as an unpacked set of folders. In this case the top folder should still have the file extension
According to the Adobe UCF specifications, the mimetype file must be the first file in the folder, and must be stored without compression, encryption or permission attributes, to support detection by mimemagic and similar.
META-INF/manifest.xml - if present - must list every non-META-INF file and folder in the archive, including the root folder. It should provide the mime-type - if known - for individual files. The root folder should have the same mime type as in the mimetype file -
META-INF/container.xml - if present - should point to the entry point for the 'main' data of the bundle, one and only one entry which must be of the mime type application/rdf+xml. For a bundle with workflow outputs this should be outputs/*, while for a bundle representing workflow inputs this should be inputs. If the implementation does not know if the data is inputs or outputs, data must be used.
outputs.rdf contains the structural information about the individual lists, values and errors in the folder outputs. Each structural metadata file pairs with a folder containing the data. The names outputs/outputs.rdf are reserved for workflow outputs and inputs/inputs.rdf for workflow inputs.
outputs contains the data lists and values. Each direct sub-folder or file represents a port with the given name, so
outputs/fish/ is a list at the port
outputs/results is a single value at the port
Which data structure?
If more than one folder other than META-INF/ exists in the root of the data bundle, the file META-INF/container.xml must define which is the root data structure (typically outputs.rdf), so that tools know which data to prefer, say to show in a viewer or use as workflow inputs.
If a workflow execution environment is fed a data bundle for the workflow inputs, it should generally pick the root data structure, allowing workflow outputs to be used as input for a second workflow with matching port names and depths. If the bundle mimetype is different from application/vnd.taverna.data-bundle, an execution tool should instead use the inputs/inputs.rdf as workflow inputs. This could typically be the case where the data-bundle has been provided as example input as part of a SCUFL2 workflow bundle of the mime type
Similarly, a data viewing tool should prefer the root data structure, but if the bundle mime-type is not
application/vnd.taverna.data-bundle, it should primarily show the
This file is required, as a guide for mime magic and similar tools that guess the type of the archive. Therefore it must be added as the first file to the archive, uncompressed, so that its content is available in cleartext in the first bytes of the ZIP archive.
The file must be in ASCII and not contain any line feeds. If the archive is a Taverna Data Bundle, the mime type should be
META-INF/manifest.xml is present, this mime type must match the mime type of
"/" in the manifest.
To add the file mimetype as the first uncompressed file, followed by the rest of the bundle (excluding the mimetype file), try using InfoZip:
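The same ordering can also be sketched with Python's standard zipfile module, as an alternative to a command-line tool. This is a minimal illustration, not a reference implementation: the function name write_bundle and the shape of the entries argument are assumptions made for this example.

```python
import zipfile

BUNDLE_MIME = "application/vnd.taverna.data-bundle"


def write_bundle(target, entries, mime=BUNDLE_MIME):
    """Write a ZIP archive whose first entry is an uncompressed 'mimetype' file.

    target is a file name or a binary file-like object; entries maps archive
    names (for example "outputs/results") to their content as bytes.
    """
    with zipfile.ZipFile(target, "w") as archive:
        # First entry: stored without compression, ASCII, no line feed.
        # A bare ZipInfo carries no permission attributes or extra fields.
        archive.writestr(zipfile.ZipInfo("mimetype"), mime.encode("ascii"),
                         compress_type=zipfile.ZIP_STORED)
        for name, data in entries.items():
            if name == "mimetype":
                continue  # already written as the first entry
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

For example, write_bundle("example.t2data", {"outputs/results": b"Hello"}) would produce an archive where mimetype precedes every other entry.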
This file, if it exists, should follow the OpenDocument container format, and list every file in the bundle (except for the META-INF files). The main functionality provided by the manifest is to give the mime-type of individual data items, which are not required to have extensions. As a minimum the mime-type should distinguish between text/plain (UTF-8 text) and application/octet-stream (binary), but if the workflow definition or a mime-magic-like tool has guessed a more detailed mime type, it can be provided here.
Additionally the manifest may specify the file sizes. This can be useful when inspecting a larger data bundle remotely (exposed as a RESTful folder or similar).
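As a rough sketch of what generating such a manifest could look like, the following Python builds an ODF-style manifest with the standard library. The namespace and the full-path, media-type and size attribute names are those of the OpenDocument manifest; the function name build_manifest and its argument layout are assumptions for this example, and the exact set of attributes a Taverna tool would write is not defined here.

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def build_manifest(bundle_mime, files):
    """Return manifest XML as a string.

    files is an iterable of (path, media_type, size) tuples, where size may
    be None if unknown. The root entry "/" carries the bundle's mime type.
    """
    ET.register_namespace("manifest", MANIFEST_NS)

    def q(name):
        # Qualify a local name with the manifest namespace.
        return "{%s}%s" % (MANIFEST_NS, name)

    root = ET.Element(q("manifest"))
    ET.SubElement(root, q("file-entry"),
                  {q("full-path"): "/", q("media-type"): bundle_mime})
    for path, media_type, size in files:
        attrs = {q("full-path"): path, q("media-type"): media_type}
        if size is not None:
            attrs[q("size")] = str(size)
        ET.SubElement(root, q("file-entry"), attrs)
    return ET.tostring(root, encoding="unicode")
```

A caller would pass one tuple per non-META-INF file or folder, for example ("outputs/results", "text/plain", 5).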
/ represents the bundle itself, and must have the same mime type as in the file
application/vnd.taverna.data-bundle. A different mime type might be used if the primary purpose of the archive is different from being a data bundle, for instance being a SCUFL2 workflow bundle.
Error documents must have the mime type
The other folders are not required to have a mimetype, but if desired these mime types can be used:
inputs and other top-level data structure folders
application/vnd.taverna.list for folders which are lists, like
If there is no manifest in the bundle, all data value files should be treated as binary application/octet-stream, unless they have one of these file extensions:
text/plain in UTF-8 character set
application/vnd.taverna.error (RDF/XML in UTF-8)
outputs.rdf and similar in the root file is
This file, if present, should point to the root data structure, typically
outputs.rdf. Alternative representation of the same file are permitted, but tools will generally only use
If the container file does not exist, and the bundle is of the mime type
application/vnd.taverna.data-bundle, there must be only one folder except META-INF/ in the archive, which would be used together with the corresponding structure file. So if the folder contains
fish/ together with
fish.rdf will be used.
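The fallback rule for a bundle without a container file can be sketched as follows. This is an illustrative Python function with a hypothetical name (default_root); it works on the list of entry names in the archive and only encodes the single-folder rule described above.

```python
def default_root(names):
    """Pick the data folder and structure file when container.xml is absent.

    names is an iterable of archive entry names such as "fish/0.txt".
    Returns (folder, structure_file), e.g. ("fish/", "fish.rdf"). The
    structure file is optional, so it may not actually exist in the archive.
    """
    top_folders = set()
    for name in names:
        first, slash, _rest = name.partition("/")
        # Only entries containing "/" live inside a top-level folder.
        if slash and first != "META-INF":
            top_folders.add(first)
    if len(top_folders) != 1:
        raise ValueError(
            "expected exactly one folder besides META-INF/, found: %s"
            % sorted(top_folders))
    folder = top_folders.pop()
    return folder + "/", folder + ".rdf"
```

Root-level files such as mimetype or fish.rdf contain no slash and are therefore not counted as folders.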
The data structure file is optional, so the rootfile can contain an entry for the folder itself, which must be of the mime-type application/vnd.taverna.port-data. It is generally not required to list the folder if a rootfile in the required application/rdf+xml format is already given.
All rootfiles must be equivalent and describe the same data structure, although additional formats can include more or less information than the required format. There should be only one rootfile per media-type, and there must be only one rootfile for the media types
Port data folders: outputs/ inputs/ data/ */
outputs/ contains the data for the workflow output ports. If the output port returned a list, a folder with the port's name will be present. Output ports of depth 0 (single values) will on the other hand be represented directly. So in this example
outputs/fish is the list at port
outputs/results is a single output for the port
This folder structure is required for a data-bundle, even if there are no ports (in which case the folder is empty).
If a mimetype is given for the data folders in the manifest, it must be
Several port data folders can be present in the data bundle, but only the root data structure will generally be used, see META-INF/container.xml.
inputs/ is the corresponding folder for workflow inputs. If this is present in the bundle together with the outputs/ folder, it represents the inputs used in a run that produced the given outputs. In this case the bundle must either include details about which workflow was run in outputs.rdf, or must be a SCUFL2 workflow bundle as well, in which case workflowBundle.rdf should be the workflow bundle that was run.
If the tool creating the data does not consider whether the data is inputs or outputs, it may use data/ as the root folder.
If the data bundle is also a workflow bundle (indicated by the presence of workflowBundle.rdf), the mime-type of the archive can be application/vnd.taverna.scufl2.workflow-bundle or a third party mime-type. The mime-type gives an indication of what is the primary role of the bundle, but tools are not required to support dual-natured bundles and can treat the bundle purely by the given type.
Examples of dual-natured bundles (suggestions):
outputs/ - a workflow with example inputs and outputs.
inputs/ - a workflow to be run with given inputs
data/ - a workflow with associated (reference) data
workflowBundle.rdf - workflow outputs produced by running given workflow with given inputs
workflowBundle.rdf - a dataset that can optionally be further processed with the given workflow
If the mime type of the bundle is different from application/vnd.taverna.data-bundle, or a data folder is not the primary folder, the folder must have the mime-type application/vnd.taverna.port-data in the manifest to enable discovery.
A folder representing a list can only contain files with numeric filenames (ignoring extension).
The lowest filename allowed is 0, representing the first element in the list.
It is not allowed to have several files with the same number, so if the folder contains
2.txt it can't also contain
Gaps in the sequence are only allowed if the list represents a snapshot of an incomplete run, meaning that if the folder contains 4, one should also be able to find files for 0, 1, 2 and 3.
File extensions are optional for individual values, and are ignored if the manifest declares the mime type, except for error documents which must have the file extension
Nested lists are represented as nested folders without file extensions, like outputs/soup/1/. As Taverna lists must be of uniform cardinality, a folder can't contain both a folder and a file that is not an .err error document. An empty folder represents an empty list.
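The naming rules for list folders could be checked along these lines. This is a sketch under stated assumptions: the function name check_list_folder, the way folder contents are passed in (a mapping from child name to a flag saying whether the child is a folder) and the wording of the problem messages are all invented for this example.

```python
import re


def check_list_folder(entries, incomplete=False):
    """Return a list of problems found in one list folder (empty if valid).

    entries maps each child name (e.g. "0.txt", "1") to True if the child
    is a folder. Set incomplete=True for a snapshot of an unfinished run,
    in which case gaps in the positions are tolerated.
    """
    problems = []
    seen = {}  # list position -> first child name using it
    folders = plain_files = 0
    for name, is_folder in sorted(entries.items()):
        stem, _dot, ext = name.partition(".")
        if not re.fullmatch(r"[0-9]+", stem):
            problems.append("%s: filename is not numeric" % name)
            continue
        position = int(stem)
        if position in seen:
            problems.append("%s: position %d already used by %s"
                            % (name, position, seen[position]))
        seen.setdefault(position, name)
        if is_folder:
            folders += 1
            if ext:
                problems.append("%s: list folders must not have an extension" % name)
        elif ext != "err":
            plain_files += 1  # error documents may sit next to nested lists
    if folders and plain_files:
        problems.append("folder mixes nested lists with non-error values")
    missing = sorted(set(range(max(seen, default=-1) + 1)) - set(seen))
    if missing and not incomplete:
        problems.append("missing positions: %s" % missing)
    return problems
```

An empty mapping yields no problems, matching the rule that an empty folder represents an empty list.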
What about depth of empty lists? Special file name
outputs.rdf (and other data structure files)
This structural data file describes the data items in the corresponding folder. So outputs/ would be described by outputs.rdf and equivalent for other data structure folders.
This file is optional, as the pure data structure should be evident in the file structure alone. The main purpose of this file is to give further information about the data, if available, such as data identifiers, how the data was produced, etc. As the data is assigned global identifiers, this also provides the hooks for adding further provenance annotations which can be included as separate files in the data bundle.
This file allows additional metadata like
<outputFrom> to indicate how/when/where the data was produced. Detailed provenance information in the bundle could be linked to from this file. Official data identifiers can be provided using
This ontology should probably subclass the Ordered list ontology
Indication of list depths and positions must correspond with the file and folder names as described in the previous section, so:
would not be valid, as
outputs/fish is of depth 2 or more, and must be in position
Note - as several entries could share the same identifier via
The main format is application/rdf+xml, but will come with an XML schema, so that clients can read or generate the file without general RDF knowledge. This means that clients should write the file using XML instead of pure RDF/XML serialisations, whose output might not comply with the schema.
Example outputs.rdf (application/rdf+xml):
Alternative, secondary output formats might be included in the bundle, like HTML, JSON, Turtle. They should have a similar filename relating them to the folder, with an extension to indicate the type. If such files are included, their mime types must be declared in the manifest.
Example outputs.ttl (text/turtle):
The explanations below use the RDF/XML format for examples, but Turtle for any inline snippets.
Bundle and bundle identifier
This statement describes the bundle itself. The data bundle should have a global bundle identifier, but only the root data file (in this case
outputs.rdf) should assign this using
Anyone is allowed to mint a non-information resource at http://ns.taverna.org.uk/2010/data/bundle/$uuid/ as long as they generate a fresh, random v4 UUID. There is no promise that any information will be available at that URI, as the data in the bundle is not generally publicised. It will however be a common anchor-point as the identifier of the data bundle for third-party annotations.
The bundle URI here ends in
/ - so that one could talk about components of the data bundle, for instance
http://ns.taverna.org.uk/2010/data/bundle/1495ca3a-f61a-437b-83ad-c6437c92a3d0/outputs/results would talk about the output
results as it is in the data bundle
:contains bit says that the bundle contains the folder
outputs/ - which is what this file is describing.
inputs.rdf would similarly contain
<.> :contains <inputs/>, but should not include/repeat the
Should this use
This says that
outputs/ contains workflow outputs. Other types are
processorOutputs. If you are not sure (like in
data.rdf - use
Then the three outputs are included. These could be lists, errors or values.
Should IRIs for the folders include the trailing
outputs/fish/ as a list of depth 1. In this case (directly below
outputs/) the list is also a
Two list entries are included,
outputs/fish/1.uri - their
listPosition must match the filename (excluding extension, if present) - and only one entry per position is allowed.
An empty list would simply not have any
hasListEntry elements - but must still have a depth - see
owl:sameAs defines a global URI identifying this list. This identifier is typically generated by Taverna when the list is created, and should be on the form
http://ns.taverna.org.uk/2010/data/list/$uuid/ using a unique, random UUID v4. The trailing
/ allows for items (as they are in the list) to be described using zero-based indexes, like
There should not generally be gaps in the list position, so the first list position should be 0, second 1, etc. Gaps would only occur if the list is incomplete, say a snapshot of a workflow output port before a workflow has finished. In this case the list should not have an owl:sameAs identifier assigned, as this identifier is assigned once the list is complete and immutable.
outputs/soup/ is of depth 2, and defines nested lists
outputs/soup/1/ which are both included in the same way - but without the
workflowOutput specific annotations.
Run provenance for workflowOutput
producedBy indicates which workflow run produced this value. This annotation can be added to any of the workflowOutput elements together with the following outputFrom. The run identifier should be of the form http://ns.taverna.org.uk/2010/run/$uuid/ with a unique, random UUID v4, assigned for each new workflow run (across nested workflows).
outputFrom tells us which port produced this value, identified using SCUFL2 URIs. In this case it is the workflow output port fish in the workflow HelloWorld in the workflow bundle 00626652-55ae-4a9e-80d4-c8e9ac84e2ca. It is important that this reference is to the port as it is in a workflow bundle, not directly in a workflow (http://ns.taverna.org.uk/2010/workflow/00626652-55ae-4a9e-80d4-c8e9ac84e2ca/out/fish) - as the latter could be included in several workflow bundles as a nested workflow.
This level of provenance is typically annotated on just the top-level workflow output ports. Provenance on the individual items inside lists, indicating which processor output of which run of which processor, is out of scope for this specification - but such annotations should reuse the owl:sameAs identifiers and the SCUFL2 URIs for describing workflow components.
Information about the workflow run should be included as well, as a minimum linking the run to the executed workflow bundle:
Further run information should be included in a separate resource in the run bundle under run/$uuid.rdf, linked as run:b9455363-5624-4744-901b-3d6c7ec273d7 rdfs:seeAlso <run/b9455363-5624-4744-901b-3d6c7ec273d7.rdf>. The structure of a run bundle is out of scope for this specification, but should reuse the same identifiers as in the data and workflow bundles.
There are not many annotations on values (unless they are also workflow outputs - like for outputs/results). The data identifier for values is of the form http://ns.taverna.org.uk/2010/data/value/$uuid using a unique, random UUID v4. Note that several values could have the same content, but different identifiers (the same output produced by two service calls, for instance).
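Minting identifiers of the forms given in this document (bundle, list, run and value) only needs a random UUID. A minimal Python sketch follows; the function name mint and the kind keys are assumptions made for this example, while the URI patterns are the ones quoted in the text.

```python
import uuid

BASE = "http://ns.taverna.org.uk/2010/"

# URI patterns as given in the text; note that value URIs have no trailing slash.
PATTERNS = {
    "bundle": "data/bundle/{}/",
    "list": "data/list/{}/",
    "run": "run/{}/",
    "value": "data/value/{}",
}


def mint(kind):
    """Return a fresh identifier URI of the given kind, using a random v4 UUID."""
    return BASE + PATTERNS[kind].format(uuid.uuid4())
```

Each call produces a new identifier, so two values with identical content still receive different URIs.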
External references are in a file of extension .uri, mime type text/uri-list. These URIs could also be included in the metadata:
Error documents are produced by Taverna when a service fails, and are returned instead of the expected output type. Therefore errors have depth, as depth 0 replaces values, and depth 1 and above replaces an expected list. This allows for returning a complete list even if just a single service call or output failed, by some of the list's values being error documents.
Errors must have a filename ending in .err in the data bundle.
The internal format of the