Taverna 2.0 state
Note: This page is not about detailed problems with the Taverna 2.0 data storage implementation.
In Taverna 2.0, the workbench (and the enactment engine) registers data with, and retrieves data from, a ReferenceService. When data is registered with a ReferenceService, a T2Reference is returned. The T2Reference can be mapped to a (supposed) URI. The actual data storage and retrieval is handled for the ReferenceService by a DAO (data access object). There are currently two sets of DAOs (in-memory and Hibernate), with each set having three separate DAOs to handle error documents, lists and ordinary data.
If provenance capture is turned on, then information about a run of a workflow is captured within a provenance database, including information about the data passed to enactments of processors (roughly equivalent to service calls). The data itself is not stored within the provenance database; only the URI representation of each T2Reference is stored there. The provenance information for a particular workflow run is organized so that it "knows" which run and which workflow it belongs to. It also holds additional information about the type of data that is referenced.
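The register/retrieve flow described above can be sketched roughly as follows. This is a toy model and not the Taverna API: `ToyDao`, `InMemoryDao`, `ToyReferenceService` and the `toyref://` string form are hypothetical names invented for illustration. Only the overall shape comes from the description: the service delegates storage to a DAO, and registration returns a reference that can be rendered as a URI-like string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a DAO: the service delegates storage to it.
interface ToyDao {
    void store(String id, Object value);
    Object load(String id);
}

// In-memory variant: its contents vanish when the JVM exits.
class InMemoryDao implements ToyDao {
    private final Map<String, Object> values = new HashMap<>();
    public void store(String id, Object value) { values.put(id, value); }
    public Object load(String id) { return values.get(id); }
}

// Hypothetical stand-in for a reference service (not the real ReferenceService).
class ToyReferenceService {
    private final ToyDao dao;
    private int counter = 0; // assumption: references are numbered locally

    ToyReferenceService(ToyDao dao) { this.dao = dao; }

    // Registering data returns a reference, rendered here as a URI-like string.
    String register(Object value) {
        String ref = "toyref://local/data/" + (counter++);
        dao.store(ref, value);
        return ref;
    }

    // The caller has to name the type it expects to get back.
    <T> T retrieve(String ref, Class<T> type) {
        return type.cast(dao.load(ref));
    }
}

class ReferenceDemo {
    public static void main(String[] args) {
        ToyReferenceService service = new ToyReferenceService(new InMemoryDao());
        String ref = service.register("ACGT");
        System.out.println(ref + " -> " + service.retrieve(ref, String.class));
    }
}
```

Swapping `InMemoryDao` for a persistent implementation of `ToyDao` would change where the values live without changing how callers register or retrieve them, which is the point of putting a DAO behind the service.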
The provenance database is not the same database as that which may be used to store data values.
Problems with the current state:
- The T2Reference (when represented as a URI) is only unique within a particular running Taverna. When a Taverna is restarted, an equivalent T2Reference can be used to refer to different data. Two different running Tavernas may also use equivalent T2References.
- The URIs for the T2Reference do not comply with IANA formatting norms.
- If in-memory storage is used, then when the current Taverna is closed, the data is also deleted. No warning is given to the user that the data will be lost.
- In the absence of the currently running Taverna, there is no easy way to access data, even if it has been stored via Hibernate.
- It is impossible to load a previous run and examine it.
- In order to retrieve data from the ReferenceService it is necessary to know a data type to ask for. (This is hacked for workflow outputs in Taverna 2.0 Workbench. The provenance keeps the information.) This requirement could prevent Taverna-independent browsing of data and/or provenance.
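The first problem in the list above (equivalent references pointing at different data) can be illustrated with a small sketch. It assumes, purely for illustration, that references are numbered by a per-instance counter; the actual identifier scheme is not described on this page, and `LocalStore` and the `toyref://` form are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one running instance that numbers its references locally.
class LocalStore {
    private final Map<String, String> data = new HashMap<>();
    private int next = 0; // starts again from zero in every new instance

    String register(String value) {
        String ref = "toyref://local/data/" + (next++);
        data.put(ref, value);
        return ref;
    }

    String lookup(String ref) { return data.get(ref); }
}

class CollisionDemo {
    public static void main(String[] args) {
        LocalStore firstSession = new LocalStore();
        LocalStore secondSession = new LocalStore(); // models a restart or a second installation

        String a = firstSession.register("result of run A");
        String b = secondSession.register("result of run B");

        // The reference strings are equal, but the data behind them differs.
        System.out.println("same reference: " + a.equals(b));
        System.out.println("same data: " + firstSession.lookup(a).equals(secondSession.lookup(b)));
    }
}
```

Any provenance record that stores only such a string cannot tell, once the session has ended, which of the two values it was meant to identify.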
Prerequisites for future work:
1. The string representation of a T2Reference (currently a supposed URI) must be unique.
2. The URIs (if they are kept) for the T2Reference must comply with IANA norms.
3. Data must not be lost when Taverna is closed (except by explicit user choice). If the data for a run is not saved, then the provenance data (if generated) for that run can, and probably should, be deleted. Data could be saved by default, or the user could archive it explicitly.
4. The string representation of a T2Reference (currently a URI) should preferably be in a form from which the workflow and the run can be readily determined. (This does not necessarily imply they are within the representation.)
5. The data for a run must be accessible in the absence of a running Taverna.
   - The ReferenceService could be available stand-alone outside Taverna.
   - The data could be kept (or archived) in a way that is not dependent upon ReferenceService (LSID?).
6. What about loading/viewing a previous run?
7. The preferred/original data type must be determinable, possibly from the T2Reference.
   - It could be stored with the data.
   - It could be part of the URI.
Note that 4-7 are not immediate killers for Taverna 2.1 but need to be sorted out preferably as soon as possible.
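As one possible way of meeting the uniqueness, run-identification and data-type prerequisites together, the sketch below attaches a globally unique run identifier and a type hint to each reference. Everything in it is an assumption made for illustration: the class `RunScopedReference`, the use of a random UUID per run, and the `<runId>/<index>?type=<typeHint>` syntax are hypothetical, and the representation of T2References is still to be decided.

```java
import java.util.UUID;

// Hypothetical reference carrying a globally unique run identifier and a type hint.
final class RunScopedReference {
    final UUID runId;      // unique per workflow run, across restarts and installations
    final int index;       // position of the value within the run
    final String typeHint; // preferred/original data type, kept with the reference

    RunScopedReference(UUID runId, int index, String typeHint) {
        this.runId = runId;
        this.index = index;
        this.typeHint = typeHint;
    }

    // Illustrative string form only: <runId>/<index>?type=<typeHint>
    String render() {
        return runId + "/" + index + "?type=" + typeHint;
    }

    // Recovers run, index and type hint from the string form alone.
    static RunScopedReference parse(String text) {
        int slash = text.indexOf('/');      // a UUID string contains no '/'
        int query = text.indexOf("?type=");
        UUID runId = UUID.fromString(text.substring(0, slash));
        int index = Integer.parseInt(text.substring(slash + 1, query));
        String typeHint = text.substring(query + "?type=".length());
        return new RunScopedReference(runId, index, typeHint);
    }
}

class UniqueReferenceDemo {
    public static void main(String[] args) {
        // Two runs each register their first value; the run ID keeps the references apart.
        RunScopedReference first = new RunScopedReference(UUID.randomUUID(), 0, "text/plain");
        RunScopedReference second = new RunScopedReference(UUID.randomUUID(), 0, "text/plain");
        System.out.println(first.render());
        System.out.println(second.render());

        RunScopedReference back = RunScopedReference.parse(first.render());
        System.out.println("run recovered: " + back.runId.equals(first.runId));
        System.out.println("type recovered: " + back.typeHint);
    }
}
```

Here the run is readable directly from the string, while the workflow would have to be looked up from the run identifier (for example in the provenance database). That is one way of reading the remark that the workflow and run need not be within the representation itself. Storing the type hint alongside the data instead of in the string would be an equally valid reading of the last prerequisite.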
Proposed steps:
1. Describe some use cases for provenance and data use, both directly relating to Taverna and also in a wider context - needs non-myGrid/Manchester input
2. Guided by (1), decide how T2References should be represented (e.g. Taverna-specific URI, just string, LSID, HTTP address, something else) - needs non-myGrid/Manchester input
3. Decide exact syntax of T2Reference representation - needs non-myGrid/Manchester input
4. Alter Taverna (including provenance) code for new T2Reference representation
5. Decide way of saving run data
6. Implement way of saving run data
1-4 can possibly be done at the same time as 5-6.