This is an attempt to describe what identifiers to assign to workflow components (such as processors, ports, etc).
We should have a unified view on these identifiers, as they are needed for:
- 2010-05 Semantification of Taverna (statements about workflow components)
- Provenance, in particular exported as RDF
- RDF-view of workflow (from SCUFL2)
- Every workflow has a unique id, a UUID. The Taverna Workbench takes care of assigning a new UUID every time a change is made to a workflow. (However, two workflows built the exact same way from scratch would not get the same UUID, i.e. the identifier is not a hash.)
- Nested workflows have their own UUIDs. If a workflow is included as a nested workflow twice in a mother workflow (or indeed in several workflows), it will still have the same UUID, unless it is later modified. However, the processor they are included in will have different identifiers as they have different names or are in different mother workflows.
- A processor ("service") in a workflow has a given processor name, similar to a filename, which uniquely identifies the processor, but only within that particular workflow (i.e. not across nested workflows).
- An input port has a given input port name, similar to processor names, which uniquely identifies the input port within the processor or workflow.
- An output port similarly has a unique output port name. Note that an output port and input port could have the same name, even if they are not the same port. (Typically WSDL services have the input "parameter" and the output "parameter")
- A workflow can also have a workflow name, similar to processor names, but there are no requirements for uniqueness on this name. It is typically the non-informative non-unique "Workflow5" (5th workflow created that day) or a translation of the given name from Taverna 1 workflows.
Approach used by Paolo's provenance RDF export
An approach for workflow element identifiers can be found in Paolo's experimental RDF export of the Taverna 2 provenance.
See the example run1-mmu-chr17-Paul-provenance.rdf, which uses identifiers like:
- http://purl.org/net/taverna/janus/e589d90b-01f2-4de6-87c6-684e8d6e9781/merge_genes_and_pathways_2/concatenated is the input (or output) port "concatenated" on the processor "merge_genes_and_pathways_2" in the workflow
- http://purl.org/net/taverna/janus/f1cc0d5d-b911-4eea-ac93-b7aba0d4952f?test549 is the data value test549 in the workflow run
- http://purl.org/net/taverna/janus/f1cc0d5d-b911-4eea-ac93-b7aba0d4952f?test1527/false/1 is the list test1527 (depth 1, no errors) in the run
- http://purl.org/net/taverna/janus/f1cc0d5d-b911-4eea-ac93-b7aba0d4952f/getcurrentdatabase identifies the execution(s) of the processor getcurrentdatabase in workflow run f1cc0d5d-b911-4eea-ac93-b7aba0d4952f, but not which iteration or in which nested workflow.
Although a good approximation, these identifiers don't clearly identify what is a processor, port, etc. A good "Cool URI" should be slightly informative and suggest the type of what we are talking about. The example also does not clearly separate input and output ports, or runs/data from workflows.
This example does however also specify a more open way to identify data references, which are internally in Taverna currently referenced with a home-brew URI syntax like:
- t2:ref//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test53 is a reference to a single value
- t2:error//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test595/1 is a reference to an error document of depth 1 (taking the place of a list)
- t2:list//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test23/false/1 is the list test23 with depth 1 (meaning it contains t2:ref's) and containsErrors=false (meaning that none of the items are errors or contain errors)
As this URI scheme is an internal detail which is only reflected in internal databases, we should be able to modify it if needed.
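To make the structure of these internal references concrete, here is a minimal parsing sketch. The pattern is inferred only from the three examples above and is not an official grammar; the names T2Reference and parse_t2_reference are hypothetical, and treating a plain t2:ref as depth 0 and accepting "true" for containsErrors are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Inferred from the three example references in the text; not an official grammar.
_T2_PATTERN = re.compile(
    r"^t2:(?P<kind>ref|error|list)//(?P<namespace>[^?]+)"
    r"\?(?P<local>[^/]+)(?P<rest>(?:/[^/]+)*)$"
)


@dataclass(frozen=True)
class T2Reference:
    kind: str                        # "ref", "error" or "list"
    namespace: str                   # the UUID-like part after "//"
    local_id: str                    # e.g. "test23"
    depth: int                       # assumption: 0 for a single value
    contains_errors: Optional[bool]  # only present on lists


def parse_t2_reference(uri: str) -> T2Reference:
    match = _T2_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a t2 reference: {uri!r}")
    kind = match["kind"]
    parts = [p for p in match["rest"].split("/") if p]
    if kind == "ref" and not parts:
        depth, contains_errors = 0, None
    elif kind == "error" and len(parts) == 1:
        depth, contains_errors = int(parts[0]), None
    elif kind == "list" and len(parts) == 2 and parts[0] in ("true", "false"):
        contains_errors = parts[0] == "true"
        depth = int(parts[1])
    else:
        raise ValueError(f"unexpected suffix for t2:{kind}: {uri!r}")
    return T2Reference(kind, match["namespace"], match["local"], depth, contains_errors)


example = parse_t2_reference("t2:list//563ed252-7a0b-4ac2-8a00-7a4f9805dd97?test23/false/1")
print(example)
```

Any replacement URI scheme would need to carry the same four pieces of information: the kind of reference, the namespace, the local identifier and the depth (plus the containsErrors flag for lists).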
This is a suggestion for the URI scheme to be used for semantification of Taverna (specifically to support annotations on specific workflow elements), and will also be used for SCUFL2.
We'll use the unique UUID of the workflow to define our 'namespace' from where we can resolve the other units. This is useful, because once you have found the definition of a given workflow UUID, you will also have the definition of its processors and ports. As the UUID changes with every edit, this will be guaranteed unique, so that your statement about a processor in a particular workflow will not automatically apply to a newer, forked version of the workflow.
(We should in Taverna .t2flow and scufl2 keep the chain of these previous UUIDs, though, so that you can trace the evolution of the workflow, and suggest if a statement might or might not still be true)
Obviously there is no central registry for all known workflows, but we know they are Taverna workflows, so we'll use taverna.org.uk as the base hostname. I suggest http://ns.taverna.org.uk/2010/ as the base URI, but I am open to suggestions. We don't really want to make it too long either.
Note that there would generally not be anything resolvable at those URIs; the http:// prefix is only used to form a namespace in which workflow component identifiers live. As workflow UUIDs would never practically conflict, anyone can mint such URIs based on a workflow ID, even if they don't have control over that domain.
We could provide a minimal RDF document that can be resolved, based on the information in the URI. For instance:
- http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/ identifies a workflow which has the identifier 7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b in its definition. Nested workflows have their own identifiers, which are not directly linked to the parent.
- http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/ identifies the processor called "Get_page_from_URL" in the workflow above. Normal URI escaping applies to funny characters, so http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/50%25%20must%20go%21/ is the processor called "50% must go!" (note that Taverna normally prevents such funny characters in these names).
- Processor input port: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/inputs/url identifies the input port "url" for the processor "Get_page_from_URL". Same quoting and naming rules apply.
- Processor output port: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/outputs/result identifies the output port "result" in the processor "Get_page_from_URL", analogous to the input port above.
- Workflow input port: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/inputs/database specifies the workflow input port "database" for the given workflow. Note that if this is the input port in a nested workflow, you will also find a processor port that typically (but not always) will have the same name in the processor for the nested workflow. The difference is that this defines the input port as it is inside the nested workflow (which is linked to processors in the nested workflow), while the outer processor input port specifies how the nested workflow is connected in the parent workflow, like any other processor.
- Workflow output port: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/outputs/results is similarly the workflow output port called results. As with processor ports, there could also be a workflow input port in this workflow with the same name.
- http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/datalink?from=processors/A/outputs/result&to=processors/B/inputs/db identifies a datalink from the output port http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/A/outputs/result to the input port http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/B/inputs/db. from is always a processor output port or a workflow input port, and to is always a processor input port or a workflow output port. A workflow ports example: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/datalink?from=inputs/database&to=outputs/results is a boring link directly from the workflow input port database to the workflow output port results. Note that you can't make links to items outside your workflow, so the from and to parameters are always resolved relative to the workflow. (Hence there is no / at the end of the path before the query.)
- Merges are represented indirectly when stating the link. A merge occurs in Taverna when more than one output port is connected to a particular input port, in which case those links are ordered. (The input port will receive a list of items, one from each link.) So in this case a third parameter, mergePosition, is added to the datalink identifier: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/datalink?from=processors/A/outputs/result&to=processors/B/inputs/db&mergePosition=2 means this link is in merge position 2 (the third link) connected to processors/B/inputs/db. Merge positions are 0-based and no gaps are allowed, with the first merge always being in position 0. Note that a single link with merge position 0 is still different from a single link without a merge, as this still means the input will be wrapped in a list with a single element.
- Conditional links are represented similarly to the datalinks, but without the ports: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/condition?start=processors/B&after=processors/A means there is a conditional link saying that processors/B may start only after processors/A has finished. (No other keywords are currently supported.)
- Services: Services are normally declared in the t2flow, but might also be dynamically resolved by dispatch stack layers. This identifies declared activities: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/activities/0/ is the first activity of "Get_page_from_URL". Activities don't (currently) have names or their own UUIDs. SCUFL2 talks about service bindings, so this identifier might be revised.
- Dispatch stack: http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/dispatch/0/ identifies the first (and for the moment only) dispatch stack. http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/dispatch/0/0/ defines the top-most layer in the dispatch stack, typically the 'Top layer'; http://ns.taverna.org.uk/2010/workflows/7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/processors/Get_page_from_URL/dispatch/0/1/ is the next layer, etc.
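The identifier patterns listed above could be minted with a few small helpers. This is only a sketch of the proposal, assuming the suggested (and still undecided) base URI; the function names are hypothetical and not part of Taverna or SCUFL2.

```python
from typing import Optional
from urllib.parse import quote

# Base URI as suggested in the text; it is still open for discussion.
BASE = "http://ns.taverna.org.uk/2010/"


def _segment(name: str) -> str:
    # Percent-encode everything that is not an unreserved character,
    # so that names cannot introduce extra "/" or "?" into the path.
    return quote(name, safe="")


def workflow_uri(workflow_id: str) -> str:
    return f"{BASE}workflows/{workflow_id}/"


def processor_uri(workflow_id: str, processor: str) -> str:
    return f"{workflow_uri(workflow_id)}processors/{_segment(processor)}/"


def port_uri(workflow_id: str, direction: str, port: str,
             processor: Optional[str] = None) -> str:
    # direction is "inputs" or "outputs"; without a processor this is a workflow port.
    if direction not in ("inputs", "outputs"):
        raise ValueError("direction must be 'inputs' or 'outputs'")
    parent = (processor_uri(workflow_id, processor)
              if processor is not None else workflow_uri(workflow_id))
    return f"{parent}{direction}/{_segment(port)}"


def datalink_uri(workflow_id: str, source: str, target: str,
                 merge_position: Optional[int] = None) -> str:
    # source and target are paths relative to the workflow,
    # e.g. "processors/A/outputs/result" or "inputs/database".
    query = f"from={source}&to={target}"
    if merge_position is not None:
        query += f"&mergePosition={merge_position}"
    # No trailing "/" after "datalink", so the relative paths resolve
    # against the workflow URI.
    return f"{workflow_uri(workflow_id)}datalink?{query}"


WF = "7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b"
print(processor_uri(WF, "50% must go!"))
print(port_uri(WF, "inputs", "url", processor="Get_page_from_URL"))
print(datalink_uri(WF, "processors/A/outputs/result", "processors/B/inputs/db", 2))
```

Escaping each name as a single path segment keeps the hierarchy unambiguous even if a processor or port name were to contain a slash or question mark.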
Not yet identified:
- Iteration strategies
- Merges themselves (needed?)
- T2 references
- Workflow executions and their elements
http://ns.taverna.org.uk could do a search in myExperiment (and elsewhere) for matching workflows by their ID, and return some kind of RDF view of the workflow.
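As a rough illustration of what such a lookup would first need to do, the sketch below pulls the workflow UUID out of an identifier and resolves the from and to parameters of a datalink relative to it, using standard relative-reference resolution. The function names are hypothetical, the base URI is the suggested one, and no actual myExperiment search is performed.

```python
import re
from typing import Tuple
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumes the base URI suggested above.
WORKFLOWS = "http://ns.taverna.org.uk/2010/workflows/"
_UUID = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE)


def workflow_id_of(uri: str) -> str:
    """Return the workflow UUID that a component identifier belongs to."""
    if not uri.startswith(WORKFLOWS):
        raise ValueError(f"not a workflow component identifier: {uri!r}")
    candidate = uri[len(WORKFLOWS):].split("/", 1)[0]
    if not _UUID.match(candidate):
        raise ValueError(f"no workflow UUID in: {uri!r}")
    return candidate


def datalink_ends(uri: str) -> Tuple[str, str]:
    """Resolve the from/to parameters of a datalink identifier to absolute port URIs."""
    query = parse_qs(urlsplit(uri).query)
    # The path ends in ".../<uuid>/datalink" without a trailing slash, so
    # relative resolution replaces "datalink" with the port path.
    return urljoin(uri, query["from"][0]), urljoin(uri, query["to"][0])


link = (WORKFLOWS + "7cbda4a8-21ca-4d22-83d5-9d0959ab1e5b/datalink"
        "?from=processors/A/outputs/result&to=processors/B/inputs/db")
print(workflow_id_of(link))
print(datalink_ends(link))
```

The extracted UUID is the only key a resolver would need for searching; everything after it can be interpreted once the workflow definition has been found.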
Looking in particular for views from the RESTful community and Scott Marshall on these:
- What base URL to use? http://purl.taverna.org.uk/ ? http://purl.taverna.org.uk/2010/ ? http://scufl2.taverna.org.uk/v1/ ? http://purl.org/something?
- Are the hierarchical URIs good, or should we use #hashtags? Hierarchical hashtags quickly end up messy (#processors/A/inputs/B), but would at least easily be relative.
- Plural or singular collections? I.e. /processors/ in a RESTful interface would be a list of all the processors, but we generally only talk about a single processor
- Shortened versions?
- Query parameters for links, or some key-value pair style, like
- Shorter, more cryptic version for datalinks? /datalink/proc:B:b/proc:A:a - we know proc:B:b must be output port, but in b is a workflow input port..