Some performance overhead figures for provenance collection in T2.
Taverna 2 supports the collection and querying of provenance metadata, i.e., of data about the dependencies of outputs from inputs, for each processor in the workflow and for each data product (either final or intermediate).
Provenance metadata collection involves (1) generating provenance events that describe elementary data transformations through workflow processors, and (2) storing those events in a dedicated provenance database (this is different from the workflow data database managed by the Taverna Data Manager).
As both of these actions take place during workflow execution, it is reasonable to expect an overhead in the execution time. This page reports on preliminary experiments to measure such overhead.
The overhead is measured by repeatedly executing a number of test Taverna workflows, selectively enabling and disabling provenance collection.
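The measurement procedure can be sketched as a small timing harness. This is only an illustration in Python, not the harness actually used for these experiments: `time_runs` and `fake_workflow` are hypothetical names, the callable stands in for whatever launches a real Taverna run, and alternating the two settings within each repetition is an assumption.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(run_workflow: Callable[[bool], None],
              repeats: int = 3) -> Dict[bool, List[float]]:
    """Time `repeats` executions with provenance off (False) and on (True)."""
    timings: Dict[bool, List[float]] = {False: [], True: []}
    for _ in range(repeats):
        # Alternate the two settings so slow drift affects both equally
        # (an assumption, not something the experiments are known to do).
        for provenance_enabled in (False, True):
            start = time.perf_counter()
            run_workflow(provenance_enabled)
            timings[provenance_enabled].append(time.perf_counter() - start)
    return timings


def fake_workflow(provenance_enabled: bool) -> None:
    # Hypothetical stand-in for launching a real workflow run.
    time.sleep(0.02 if provenance_enabled else 0.01)


if __name__ == "__main__":
    results = time_runs(fake_workflow, repeats=3)
    for enabled, samples in results.items():
        print(f"provenance={enabled}: mean {statistics.mean(samples):.3f} s")
```

Keeping the raw per-run samples, rather than only the mean, makes it possible to check how uniform the runs are before averaging.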
The workflows are all generated programmatically, using the template shown in the following figure:
In this template, the input data consists of a list whose size d is set by one of the inputs, and each processor (a beanshell) in a linear chain of length l simply propagates its input to its output, essentially unchanged. As each beanshell is designed to accept a single string, each one iterates over the input list, once per list element. The final processor computes a cross product of the two lists produced at the ends of the linear chains, which yields a 2-deep list with d^2 elements as the final output.
Please note that the figure does not show additional control links that have been used to serialize the execution of each linear chain. This is done to reduce thread concurrency and thus make the results more uniform over multiple runs.
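To make the shape of the template concrete, here is a minimal Python model of it. It is a sketch under stated assumptions, not the generated Taverna workflow: it assumes two linear chains fed with the same generated list, and it assumes the final processor concatenates each pair of strings. All function names are hypothetical.

```python
from typing import List, Tuple


def pass_through(item: str) -> str:
    # Stand-in for one beanshell: accepts a single string, returns it unchanged.
    return item


def run_chain(items: List[str], l: int) -> Tuple[List[str], int]:
    """Push a list through l processors; each one iterates once per element."""
    invocations = 0
    for _ in range(l):
        items = [pass_through(x) for x in items]
        invocations += len(items)
    return items, invocations


def run_template(d: int, l: int) -> Tuple[List[List[str]], int]:
    """Model the template for list size d and chain length l."""
    inputs = [f"item{i}" for i in range(d)]
    # Chains run one after the other, mirroring the serializing control links.
    left, n_left = run_chain(inputs, l)
    right, n_right = run_chain(inputs, l)
    # Cross product of the two chain outputs, kept as a 2-deep (nested) list.
    # Concatenating each pair is an assumption about the final processor.
    crossed = [[a + b for b in right] for a in left]
    return crossed, n_left + n_right


if __name__ == "__main__":
    output, invocations = run_template(d=3, l=4)
    print(len(output), len(output[0]), invocations)
```

In this model each chain performs l × d processor invocations, and the nested output has d rows of d elements each.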
We generate a family of test workflows with this shape, by varying l and d.
(Note: in the results, d is called LS (List Size) and l is called PL (Path Length).)
This allows us to test provenance overhead (as well as other performance properties of the Taverna engine) for complex workflows that sit at the boundaries of common e-science practice. In fact, in these workflows the speed of provenance collection becomes a dominant factor, since the beanshells themselves do not really perform any computation. In order to simulate a family of fast processors, a random delay of up to 1 sec is introduced into each beanshell.
Here we provide test results for four data points, representing moderate-to-long linear chains and moderate-to-large input lists (and therefore numbers of iterations):
1- PL = 50, LS = 10
2- PL = 50, LS = 50
3- PL = 100, LS = 10
4- PL = 100, LS = 50
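The four data points above form a 2 × 2 grid over PL and LS, which can be enumerated as follows. This is a minimal Python sketch: the constant and function names are hypothetical, and the per-chain invocation count is derived from the template description (each of the PL processors iterates LS times), not from measured data.

```python
from itertools import product
from typing import Dict, List

PATH_LENGTHS = (50, 100)  # PL, the chain length l
LIST_SIZES = (10, 50)     # LS, the input list size d


def test_configurations() -> List[Dict[str, int]]:
    """Enumerate every (PL, LS) combination in the order listed above."""
    return [
        {"PL": pl, "LS": ls, "invocations_per_chain": pl * ls}
        for pl, ls in product(PATH_LENGTHS, LIST_SIZES)
    ]


if __name__ == "__main__":
    for number, config in enumerate(test_configurations(), start=1):
        print(number, config)
```

Under this model the smallest configuration involves 500 processor invocations per chain and the largest 5000, a tenfold range.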
In the figures below, each result is the average over 3 (or 5) runs, taken separately with and without provenance collection. Results for PL=50 are expressed in seconds, while results for PL=100 are expressed in minutes.
The overall result is that the overhead is below 20% in all cases, with a maximum difference of 40 seconds, in absolute terms, over a 5-minute run.
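For reference, the relative overhead can be computed from the averaged run times as sketched below in Python. The function name is hypothetical, and the sample timings are illustrative only: they assume that the 5-minute figure refers to the run without provenance, and they are not the measured values.

```python
from statistics import mean
from typing import Sequence


def relative_overhead(with_provenance: Sequence[float],
                      without_provenance: Sequence[float]) -> float:
    """Return (mean with - mean without) / mean without, as a fraction."""
    baseline = mean(without_provenance)
    return (mean(with_provenance) - baseline) / baseline


if __name__ == "__main__":
    # Illustrative timings in seconds, not measured data: a 300 s baseline
    # and a 40 s absolute difference, matching the worst case quoted above.
    without = [300.0, 300.0, 300.0]
    with_prov = [340.0, 340.0, 340.0]
    print(f"{relative_overhead(with_prov, without):.1%}")
```

Under that reading, a 40-second difference on a 300-second baseline is an overhead of about 13%, consistent with the stated bound of 20%.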
This should give confidence to users that enabling provenance collection will not add substantially to their execution times.