Skip to end of metadata
Go to start of metadata

Current provenance schema (Taverna 2.1.2)

The current provenance is stored in either a Derby or MySQL database. Their schema definitions are very similar. This is based on the schema from MySQLProvenanceConnector

From this we've reconstructed the schema as UML (also available in XMI format):

Table/column explanations

When Provenance first encounters a workflow definition, differentiated by their workflow identifier (uuid), Provenance will perform a static analysis of the workflow structure (recursively including nested workflows), and populate these tables:

  • Workflow - the workflow definition
    • wfNameworkflow identifier from the t2flow (not the defined workflow name)
    • parentWfName - NULL or a reference to the last known workflow where this workflow appeared as a nested workflow
    • externalName - either the workflow name from the t2flow definition, or the last known processor name where this workflow was used as a nested workflow
    • dataflow -  serialised XML of the workflow (t2flow)
  • Processor - processors in workflow
    • pName - name of the processor in the workflow, or if isTopLevel is 1, the workflow name from the t2flow definition.
    • wfInstanceRef -  Workflow where this processor occurs, wfName from above (not WFInstance)
    • type is the Java classname for the first activity defined in this processor, or if isTopLevel is 1, always net.sf.taverna.t2.activities.dataflow.DataflowActivity (aliasing as a nested workflow)
    • isTopLevel - 0 by default, or 1 if this is a virtual processor representing the workflow itself.
  • Var - ports in processors/workflows
    • varName - port name, describing either a workflow or processor port
    • type - unknown, seems to always be NULL
    • inputOrOutput - 1 if this is an input port, 0 if it is an output port
    • pNameRef - references the Processor pName from above. (workflow ports in the top-level workflow are represented with pNameRef equal to the workflow name from the t2flow definition, )
    • wfInstanceRef - references the Workflow wfName (not the WFInstance). For processors inside nested workflows this will reference the ID of that nested workflow, for workflow ports of nested workflows there will be one entry for the nested workflow's ID and one for the parent's ID. (In both cases PNameRef will refer to the name in the parent)
    • nestingLevel - indicates the calculated depth of the port
    • actualNestingLevel - always 0 or null - even when the actual depth is 1 (like for "echo list")
    • anlSet - (unknown purpose) - values NULL, 0 or 1
    • reorder (or the reserved word order in mySQL) - (unknown purpose, but probably related to back patching) - values NULL, 0, 1
  • Arc - links between ports
    • sourceVarNameRef - reference to Var varName for the source of the link
    • sinkVarNameRef - reference to Var varName for the destination of the link
    • sourcePNameRef - reference to Var pNameRef (and Processor{{ pName}}) for the source of the link
    • sinkPNameRef - reference to Var pNameRef (and Processor pName) for the destination of the link
    • wfInstanceRef - references the Workflow wfName (not the WFInstance). For ports of nested workflows there will be two entries, one for the nested workflow and one for the parent workflow - in both cases the workflow port will have a PName referencing the nested workflow's processor's name in the parent

For each workflow run, which are identified by their workflow instance UUID (assigned by the facade, also used as the t2ref namespace), provenance will be populating these tables:

  • WfInstance - a particular run
    • instanceId - the UUID given by the WorkflowInstanceFacade when imitating the run. (This is also the namespace used in generated t2references). Nested workflow runs are part of the instanceId of the running of the parent.
    • wfNameRef - the id of a Workflow that is involved in this run. All nested workflows are also included here, linked to the same instanceId.
    • timestamp - an automatically generated field, indicating when this workflow run started
  • ProcBinding - activity used for a particular iteration in a particular run
    • pNameRef - a reference to the Processor this binding is for
    • wfNameRef - the ID of the (possibly nested) Workflow this processor is in
    • execIDRef - a reference to the WfInstance instanceId this binding was used for
    • iteration - the index of the job this binding was used for, typically a string like [] (no implicit iterations) or [0] (first of depth-1 implicit iteration)
    • actName - an UUID identifying the activity. (Can't find any other reference to these IDs, which are always unique)
  • VarBinding - data seen at a port for a workflow run
    • varNameRef - reference to the Var (the port) the data was seen at
    • pNameRef - reference to the Processor the data was seen at
    • wfNameRef - reference to the Workflow (potentially nested) this processor is in
    • wfInstanceRef - reference to the WfInstance of the run
    • iteration - the 0-based index this data was seen at. For services expecting or returning lists, this index will include the position(s) in the list(s), so iteration for a job with index [2,5] for a service taking inputs at depth 2 would have one varbinding per value in that list, with indexes like [2,5,3,7]
    • value - t2reference identifier of the data. Only depth 0 data are shown, so if a service accepts or returns a list, these will be the individual values of the list
    • collIdRef - t2reference identifier for the first list this value was seen in. NULL if there is no such parent list, or if the list is unknown (not yet backpatched)
    • positionInColl - 1-based position in the list referenced in collIdRef, or 1 the collection is unknown
    • valueType - type of value, typically referenceSet or error (as lists themselves are stored in Collection). Sometimes null, unknown for what reason
    • ref - unknown purpose, seems to be NULL, TEXT or UNKNOWN.
  • Collection - lists observed at a port
    • collId - the t2reference identifier of the list
    • parentCollIDRef - parent list for nested lists, otherwise TOP, which can indicate both that this list is not in a parent collection (if iteration is []), or that it is unknown what parent collection it is in (if iteration contains an index).
    • wfInstanceRef - the WfInstance of the run this list was observed in
    • varNameRef - the Var (port) this list was observed at
    • pNameRef - the Processor pName of the Var this list was observed at (could be in a nested workflow, but there is no wfNameRef to say which)
    • iteration the 0-based index (including list index) this list was observed at, [] if this is the top-level list at that port (in which case parentCollIdRef is TOP)

This table is no longer used, as data is stored using the t2 reference service, which data should be in the same database. This was originally meant for preserving the data beyond the t2 reference service:

  • Data
    • wfInstanceId - the wfInstance this data was seen at
    • dataReference - the (t2reference) identifier for this data
    • data - a Blob for containing the data value itself.

Issues with current schema

  • Unnecessary/unused tables and columns
    • Data (always empty)
    • ProcBinding (actName references nothing)
    • VarBinding.ref, Var.type
  • Terminology does not match Taverna
    • Var -> Port
    • Arc -> Datalink
    • Collection -> List
    • nestingLevel -> depth
    • Workflow.externalName -> processorName or workflowName
    • Processor.type -> Processor.activityClassName (but should be removed)
  • Misleading names
    • wfName -> workflowId
    • wfInstanceRef that points to wfName -> wfNameRef -> workflowIdRef
  • Large composite primary keys
    • Arc, VarBinding, Collection, WfInstance, Workflow, Var, Processor, ProcBinding
  • Foreign keys without using complete candidate key
    • Generally: Var.varName referenced without including inputOrOutput or wfName (but typically do include pName)
    • Processor.pName referenced some places without wfName (not indicating in which nested workflow)
  • Structure does not match Taverna behaviour
    • wfName could be in conflict with processor names in the workflow, and should not be used to identifiy workflow ports
    • Assumes there is no output port with same name as input port on a processor, although Taverna allows this. (In particular, WSDL services often have input:parameter and output:parameter)
    • Nested workflows can be included as nested workflows several places (and even in several processors in the same workflow), meaning that Workflow.parentWfName and externalName are fragile, and seem to be overwritten for every run of the workflow.
    • Processors can have more than one activity (alternates), but only one of them are used for a particular execution (ProcBinding.actName and Processor.type)
    • VarBinding records individual items, even when a service returned a list. Empty lists are not currently recorded. (But could probably be recorded in Collection)
    • Workflows recorded as processors with magic pName
  • Intermediate state kept in workflow
    • Some intermediate data for 'backpatching' seems to be kept in the workflow, for instance Var.reorder. It is unclear what the values in many fields are before they have been determined.
  • Database design guidelines
    • Foreign keys are not stated in SQL schema
    • Column names should be consistent (ie. execIDRef vs. wfInstanceRef)
    • Unused tables/columns should be removed
    • Should not use default values like positionInColl=1 and Collection.parentCollIDRef="TOP" instead of NULL when the value is unknown
    • Should introduce identifier columns for often referenced tables with composite primary keys, like Var.
    • Some fields are 1-based (positionInColl and other 0-based iteration, should all be 0-based.
  • Peculiarities
    • Most of the tables use utf8 but T2provenance.data is utf8
    • VarBinding.iteration sometimes include positionInColl in the beginning of the iteration string [$position,2], and sometimes in the end [2,1,$position].
Labels
  • None
  1. 2010-04-22

    There's also nothing there to record additional information about what a processor did for a particular iteration. For example, if there's a choice of several addresses which provide a service, it would be really helpful if it was recorded which one was used at each step.

    1. 2010-04-22

      Yes, those are the details about the execution which we'll have to add to record execution times, retries, etc. I'll try to come up with a 'suggested schema' on a separate page. First thing is to get those column names right.. we won't have time to do a proper refactoring for 2.2, unfortunately.