
This is an attempt to describe what identifiers to assign to workflow components (such as processors, ports, etc.).


Introduction

We should have a unified view of these identifiers, as they are needed for:

  • 2010-05 Semantification of Taverna (statements about workflow components)
  • Provenance, in particular exported as RDF
  • SCUFL2
  • RDF-view of workflow (from SCUFL2)

Assumptions

  • Every workflow has a unique id, a UUID. The Taverna Workbench takes care of assigning a new UUID every time a workflow is changed. (However, two workflows built in exactly the same way from scratch would not get the same UUID, i.e. the identifier is not a hash.)
  • Nested workflows have their own UUIDs. If a workflow is included as a nested workflow twice in a mother workflow (or indeed in several workflows), it will still have the same UUID, unless it is later modified. However, the processors it is included in will have different identifiers, as they have different names or are in different mother workflows.
  • A processor ("service") in a workflow has a given processor name, similar to a filename, which uniquely identifies the processor, but only within that particular workflow. (I.e. not across nested workflows.)
  • An input port has a given input port name, similar to processor names, which uniquely identifies the input port within the processor or workflow.
  • An output port similarly has a unique output port name. Note that an output port and an input port could have the same name, even if they are not the same port. (Typically WSDL services have the input "parameter" and the output "parameter".)
  • A workflow can also have a workflow name, similar to processor names, but there are no uniqueness requirements on this name. It is typically the non-informative, non-unique "Workflow5" (the 5th workflow created that day) or a translation of the given name from Taverna 1 workflows.
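The sketch below (plain Python, with made-up class and field names) simply restates these uniqueness scopes as a data model; it is an illustration of the assumptions, not the Taverna or SCUFL2 API.

    # A rough sketch of the naming model implied by the assumptions above.
    # Class and field names are illustrative; this is not the Taverna/SCUFL2 API.
    from dataclasses import dataclass, field
    from typing import List
    from uuid import UUID, uuid4

    @dataclass
    class Port:
        name: str                  # unique among the input (or output) ports of its owner

    @dataclass
    class Processor:
        name: str                  # unique within its owning workflow only
        inputs: List[Port] = field(default_factory=list)
        outputs: List[Port] = field(default_factory=list)

    @dataclass
    class Workflow:
        uuid: UUID                 # globally unique; replaced on every edit
        name: str = "Workflow5"    # human-readable, not required to be unique
        processors: List[Processor] = field(default_factory=list)

        def edited(self) -> "Workflow":
            # Any change produces a workflow identified by a fresh UUID.
            return Workflow(uuid=uuid4(), name=self.name, processors=self.processors)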

Approach used by Paolo's provenance RDF export

An approach for workflow element identifiers can be found in Paolo's experimental RDF export of the Taverna 2 provenance.

See the example run1-mmu-chr17-Paul-provenance.rdf, which uses identifiers like:

Although a good approximation, these identifiers don't clearly identify what is a processor, port, etc. A good "Cool URI" should be slightly informative and suggest the type of what we are talking about. The example also does not clearly separate input and output ports, or runs/data from workflows.

This example does, however, also suggest a more open way to identify data references, which are currently referenced internally in Taverna with a home-brewed URI syntax like:

  • t2:ref//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test53 is a reference to a single value
  • t2:error//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test595/1 is a reference to an error document of depth 1 (taking the place of a list)
  • t2:list//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test23/false/1 is the list test23 with depth 1 (meaning it contains t2:ref's); containsErrors=false means that none of the items are errors or contain errors

As this URI scheme is just an internal detail that is only reflected in internal databases, we should be able to modify it if needed.
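As an illustration of the three forms listed above, here is a rough Python sketch of pulling such a reference apart. The field names (namespace, local id, depth, containsErrors) are simply labels for the parts described in the text, not official Taverna terminology.

    # A rough sketch of parsing the internal reference URIs listed above.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class T2Reference:
        kind: str                       # "ref", "error" or "list"
        namespace: str                  # the UUID-like part before '?'
        local_id: str                   # e.g. "test53"
        depth: Optional[int] = None     # errors and lists carry a depth
        contains_errors: Optional[bool] = None   # lists only

    def parse_t2_uri(uri: str) -> T2Reference:
        scheme, rest = uri.split("//", 1)       # "t2:ref" + "uuid?localpart[/...]"
        kind = scheme.split(":", 1)[1]          # "ref", "error" or "list"
        namespace, local = rest.split("?", 1)
        parts = local.split("/")
        if kind == "ref":                       # t2:ref//ns?id
            return T2Reference(kind, namespace, parts[0])
        if kind == "error":                     # t2:error//ns?id/depth
            return T2Reference(kind, namespace, parts[0], depth=int(parts[1]))
        if kind == "list":                      # t2:list//ns?id/containsErrors/depth
            return T2Reference(kind, namespace, parts[0],
                               depth=int(parts[2]),
                               contains_errors=parts[1] == "true")
        raise ValueError("Unrecognised reference URI: " + uri)

    # e.g. parse_t2_uri("t2:list//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test23/false/1")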

SCUFL2/semantification suggestion

This is a suggestion for the URI scheme to be used for the semantification of Taverna (specifically, to support annotations on specific workflow elements); it will also be used for SCUFL2.

We'll use the unique UUID of the workflow to define our 'namespace', from which we can resolve the other units. This is useful because once you have found the definition of a given workflow UUID, you will also have the definitions of its processors and ports. As the UUID changes with every edit, this is guaranteed to be unique, so a statement about a processor in a particular workflow will not automatically apply to a newer, forked version of that workflow.

(We should, however, keep the chain of these previous UUIDs in Taverna .t2flow and SCUFL2, so that you can trace the evolution of the workflow and suggest whether a statement might or might not still be true.)

Obviously there is no central registry of all known workflows, but we know they are Taverna workflows, so we'll use taverna.org.uk as the base hostname. I suggest http://ns.taverna.org.uk/2010/ as the base URI, but am open to suggestions. We don't really want to make it too long either.

Note that there would generally not be anything resolvable at these URIs; the http:// prefix is only used to form a namespace where workflow component identifiers live. As workflow UUIDs would practically never conflict, anyone can mint such URIs based on a workflow ID, even if they don't have control over http://ns.taverna.org.uk/2010/workflows.
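A small sketch of that point: given a workflow UUID, anyone can mint the namespace URI locally, without any lookup or resolution. The helper name and exact layout are illustrative, following the base URI proposed above.

    # Minting the proposed namespace URI from a workflow UUID alone.
    BASE = "http://ns.taverna.org.uk/2010/workflows/"

    def workflow_namespace(wf_uuid: str) -> str:
        return BASE + wf_uuid

    print(workflow_namespace("7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b"))
    # http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b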

Suggested resolution

We could provide a minimal RDF document that can be resolved, based on the information in the URI - for instance, assuming that http://www.myexperiment.org/workflows/?id=7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b was a URI that could look up workflows with a given UUID.
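As a rough illustration (not a decided format), such a document might do no more than point the workflow URI at that lookup, e.g. via rdfs:seeAlso; the use of rdfs:seeAlso is an assumption, and the sketch below uses Python and rdflib purely for convenience.

    # Illustrative only: a minimal RDF document pointing the workflow URI at the
    # myExperiment lookup. rdfs:seeAlso is an assumption, not a decided vocabulary.
    from rdflib import Graph, URIRef
    from rdflib.namespace import RDFS

    wf = URIRef("http://ns.taverna.org.uk/2010/workflows/"
                "7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b")
    lookup = URIRef("http://www.myexperiment.org/workflows/"
                    "?id=7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b")

    g = Graph()
    g.add((wf, RDFS.seeAlso, lookup))
    print(g.serialize(format="turtle"))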

So:

  • Workflow input port: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/inputs/database specifies the workflow input port "database" for the given workflow. Note that if this is an input port of a nested workflow, you will also find a port, typically (but not always) with the same name, on the processor that holds the nested workflow. The difference is that this URI defines the input port as it is inside the nested workflow (where it is linked to processors of the nested workflow), while the outer processor input port specifies how the nested workflow is connected in the parent workflow, like any other processor.
  • Merges are represented indirectly when stating the link. A merge occurs in Taverna when more than one output port is connected to a particular input port, in which case those links are ordered. (The input port will receive a list of items, one from each link.) So in this case a third parameter is added to ?datalink: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/datalink?from=processors/A/outputs/result&to=processors/B/inputs/db&mergePosition=2 - which means this link will be in merge position 2 (the third link) connected to processors/B/inputs/db. Merge positions are 0-based and no gaps are allowed, with the first link always being in position 0. Note that a single link with merge position 0 is still different from a single link without a merge, as it means the input will be wrapped in a list with a single element. (See the sketch below.)
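A sketch of minting the port and datalink URIs used above, assuming the from/to/mergePosition query layout described in the text; the helper names and encoding choices are mine, and the query values keep their slashes unescaped, as in the example above.

    # Illustrative helpers for the component URIs used above; the from/to/
    # mergePosition layout follows the text, everything else is an assumption.
    from typing import Optional
    from urllib.parse import urlencode

    BASE = "http://ns.taverna.org.uk/2010/workflows/"

    def workflow_input_uri(wf_uuid: str, port: str) -> str:
        # e.g. .../<uuid>/inputs/database
        return f"{BASE}{wf_uuid}/inputs/{port}"

    def datalink_uri(wf_uuid: str, from_port: str, to_port: str,
                     merge_position: Optional[int] = None) -> str:
        params = [("from", from_port), ("to", to_port)]
        if merge_position is not None:      # only merged links carry a position
            params.append(("mergePosition", str(merge_position)))
        # safe="/" keeps the relative port paths readable, as in the example above
        return f"{BASE}{wf_uuid}/datalink?{urlencode(params, safe='/')}"

    print(datalink_uri("7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b",
                       "processors/A/outputs/result",
                       "processors/B/inputs/db",
                       merge_position=2))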

Not yet identified:

  • Iteration strategies
  • Merges themselves (needed?)
  • T2 references
  • Workflow executions and their elements

Future resolutions

http://ns.taverna.org.uk could do a search in myExperiment (and elsewhere) for matching workflows by their ID, and return some kind of RDF view of the workflow.

URI Considerations

Looking in particular for views from the RESTful community and Scott Marshall on these:

  • What base URL to use? http://purl.taverna.org.uk/ ? http://purl.taverna.org.uk/2010/ ? http://scufl2.taverna.org.uk/v1/ ? http://purl.org/something?
  • Are the hierarchical URIs good, or should we use # fragments? Hierarchical fragments quickly end up messy (#processors/A/inputs/B), but would at least be easy to make relative.
  • Plural or singular collections? I.e. /processors/A or /processor/A? /processors/ in a RESTful interface would be a list of all the processors, but we generally only talk about a single processor, /processor/P.
  • Shortened versions? /wf/ instead of /workflows/, /proc/, /in/, etc.
  • Query parameters for links, or some key-value pair style, like /datalink/from:processors/A/inputs/B;to:processors/B/outputs/B?
  • Shorter, more cryptic versions for datalinks? /datalink/proc:B:b/proc:A:a - we know proc:B:b must be an output port, but in /datalink/wf:b/proc:A:a, b is a workflow input port.