This presentation about Taverna's provenance support shows its architecture and outputs.

These pages are outdated.

Users are advised to install and use the updated Taverna-PROV plugin, which produces PROV-O traces and includes the data values. Taverna-PROV traces are more complete and more "correct", and they address many of the known issues in OPM/Janus.

What is the provenance of a workflow data product?

Taverna workflows essentially specify data generation, retrieval, and transformation pipelines. The provenance trace associated with one execution of such a workflow is a detailed account of all the data transformations that occurred during that execution, from the initial inputs to the final outputs. Consider, for instance, the following bioinformatics workflow, which maps genomic QTL regions to mouse genes and pathways.

Workflow execution results in a provenance trace like the one sketched below.

In practice, a trace contains all dependencies amongst all intermediate data products that have been generated by the workflow execution. These dependencies are structured as a causal graph, where the nodes represent data items, and edges denote dependencies from one data item to another, for instance:
path:mmu04010->derives_from->mmu:26416
Graphs for realistic workflow executions may contain hundreds or thousands of nodes. In Taverna, provenance traces are stored persistently so that they can be inspected after the workflow execution has completed. Traces for different executions accumulate in a provenance database, where they can be queried either in isolation or collectively (see below for details on the Taverna provenance query model).
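
The dependency graph described above can be illustrated with a short sketch. It is not Taverna code: it is a minimal Python example that assumes dependencies are given as (derived item, relation, source item) triples. Only the first edge comes from the text; the second edge and the function names are hypothetical and added for illustration.

```python
from collections import defaultdict, deque

# Each edge reads "derived item -> derives_from -> source item".
# Only the first edge appears in the text; the second is hypothetical.
EDGES = [
    ("path:mmu04010", "derives_from", "mmu:26416"),
    ("mmu:26416", "derives_from", "input:qtl_region"),  # hypothetical
]


def build_graph(edges):
    """Map every derived item to the set of items it directly depends on."""
    graph = defaultdict(set)
    for derived, _relation, source in edges:
        graph[derived].add(source)
    return graph


def lineage(graph, item):
    """Return every item that `item` depends on, directly or transitively."""
    seen = set()
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for source in graph.get(current, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


if __name__ == "__main__":
    print(sorted(lineage(build_graph(EDGES), "path:mmu04010")))
```

A breadth-first traversal is used here because it visits each node at most once, which matters for graphs with thousands of nodes.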

What is provenance useful for?

Interest in provenance has been growing over the past few years across the scientific data management community, following the realisation that provenance traces are a form of metadata that can be captured inexpensively (no human intervention is required) and may potentially yield important insight into the workflow data products.

Janus conceptual provenance model

We have defined a conceptual model for workflow-based provenance, code-named Janus. The model has both a relational and an RDF(S) realisation. The RDF(S) ontology is publicly available here.

Provenance management architecture

We have designed and implemented a first version of the provenance management architecture, based on the Janus model. It consists of a relational database (MySQL) and Java code for capturing and querying provenance traces. The implementation provides a Java API for provenance query and analysis.
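
To make the idea of a relational provenance store concrete, here is a minimal sketch. It is illustrative only: it uses Python with the standard-library sqlite3 module as a stand-in for MySQL and the Java API, and the single-table schema, the column names, and the function names are assumptions, not the Janus relational schema.

```python
import sqlite3

# Hypothetical one-table schema: one row per dependency edge, tagged with
# the workflow run that produced it. This is NOT the Janus schema.
SCHEMA = """
CREATE TABLE dependency (
    run_id  TEXT NOT NULL,
    derived TEXT NOT NULL,
    source  TEXT NOT NULL
)
"""

# A recursive query that follows dependency edges within a single run.
LINEAGE_SQL = """
WITH RECURSIVE ancestors(item) AS (
    SELECT source FROM dependency
    WHERE run_id = :run AND derived = :item
    UNION
    SELECT d.source
    FROM dependency AS d JOIN ancestors AS a ON d.derived = a.item
    WHERE d.run_id = :run
)
SELECT item FROM ancestors ORDER BY item
"""


def open_store():
    """Create an in-memory database holding the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record(conn, run_id, derived, source):
    """Store one dependency edge observed during a workflow run."""
    conn.execute("INSERT INTO dependency VALUES (?, ?, ?)",
                 (run_id, derived, source))


def lineage(conn, run_id, item):
    """Query one trace in isolation: all ancestors of `item` in `run_id`."""
    rows = conn.execute(LINEAGE_SQL, {"run": run_id, "item": item})
    return [row[0] for row in rows]


def runs_using(conn, source):
    """Query all traces collectively: runs in which `source` was used."""
    rows = conn.execute(
        "SELECT DISTINCT run_id FROM dependency WHERE source = ? "
        "ORDER BY run_id", (source,))
    return [row[0] for row in rows]
```

The two query functions mirror the two modes mentioned earlier: inspecting a single trace, and querying across the traces accumulated from several runs.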

Using the API, provenance traces can be exported:

  • as Janus-compliant RDF graphs, and
  • as OPM graphs
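
As a rough illustration of what exporting a trace as an RDF graph involves, the sketch below serialises dependency edges as N-Triples. It is a Python example under stated assumptions, not the actual export code: the example.org namespace, the derivesFrom property, and the function names are hypothetical and are not terms from the Janus ontology or from OPM.

```python
from urllib.parse import quote

# Hypothetical namespace and property, used for illustration only.
BASE = "http://example.org/provenance-sketch/"
DERIVES_FROM = BASE + "vocab#derivesFrom"


def item_uri(identifier):
    """Turn a data item identifier into a URI, percent-encoding it."""
    return BASE + "item/" + quote(identifier, safe="")


def to_ntriples(edges):
    """Serialise (derived, source) pairs as one N-Triples line per edge."""
    lines = []
    for derived, source in edges:
        lines.append(
            f"<{item_uri(derived)}> <{DERIVES_FROM}> <{item_uri(source)}> ."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples([("path:mmu04010", "mmu:26416")]))
```

N-Triples is chosen here only because it can be written without an RDF library; a real export would use the vocabulary of the target model.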

Below is a sketch of the overall architecture.

A number of clients to the provenance manager have been developed. These are documented in the technical section.

Extensions to semantic provenance

The initial design has been extended to include Semantic provenance, in two separate research threads:

  1. Providing and exploiting semantic annotations to provenance graphs. In collaboration with Indiana University (Prof. Beth Plale) and Eli Lilly, IN, USA:
    1. Cao, B., Plale, B., Subramanian, G., Missier, P., Goble, C., & Simmhan, Y. (2009). Semantically Annotated Provenance in the Life Science Grid. In J. Freire, Paolo Missier, & S. S. Sahoo (Eds.), 1st International Workshop on the Role of Semantic Web in Provenance Management. CEUR Proceedings
  2. Exploring the query capabilities afforded by semantically annotated provenance, as well as interoperability with Linked Data clouds. In collaboration with Knoesis Center at Wright State University (Satya Sahoo, Amit Sheth) and Jun Zhao, University of Oxford:
    1. Missier, P., Sahoo, S. S., Zhao, J., Sheth, A., & Goble, C. (2010). Janus: from Workflows to Semantic Provenance and Linked Open Data. Procs. IPAW 2010. Troy, NY (not yet online)
    2. Sahoo, S. S., Zhao, J., & Missier, P. (2011). Extending Semantic provenance into the Web of Data. Internet Computing, special issue on Provenance in Web Applications, to appear

Provenance interoperability

We have developed a prototype to show interoperability amongst provenance traces generated by different workflow systems. This work was carried out in the context of the DataONE Summer of Code initiative and is funded by the DataONE NSF project, in collaboration with the Kepler group at UC Davis and UC SDSC:

  • Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Procs. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).

What is left to be done?

Plans beyond the current state include:

  • Graphical front-end for interactive provenance query formulation, and for presentation of provenance query results to users. This may include experimenting with visualisation of limited-size provenance graphs.
  • Provenance mining (more to come on this)

What else have we done - workshops, community involvement

Standardisation of provenance: we have been actively involved from the beginning in the specification of the Open Provenance Model.

We are also involved in the W3C Incubator Group on Provenance on the Web (2009-2010), which is currently in the process of evolving into a W3C standards group.

We have been involved in the organisation of the first and second Workshops on "Role of Semantics in Provenance Management".

We have also been involved in the Third Provenance Challenge.
