This covers only the major server features; the server issues also need to be considered.
This is the primary target for even a first alpha version of 3.0. It depends on having support libraries for working with scufl2; the current server does some restricted reading of t2flow documents in order to describe, to a limited extent, what the expected inputs and outputs are: in particular their names, depths, and datatypes (i.e., MIME types, though currently these are determined solely by reading the actual produced data). Providing this information would be extremely valuable to clients, as it would let them work with the server without understanding workflow documents at all; they could treat workflows as just a bunch of bytes to ship around.
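As a sketch of the kind of port description the introspection could hand to clients, so they never parse the workflow document themselves (the field names here are illustrative, not the actual server API):

```python
from dataclasses import dataclass, field

@dataclass
class PortDescription:
    """What a client needs to know about a workflow port
    without parsing the workflow document itself."""
    name: str
    depth: int  # 0 = single value, 1 = list, 2 = list of lists, ...
    # MIME types may be unknown until actual data has been produced.
    mime_types: list = field(default_factory=list)

# Hypothetical result of introspecting a workflow's input ports.
inputs = [
    PortDescription(name="sequence", depth=0, mime_types=["text/plain"]),
    PortDescription(name="databases", depth=1),
]

# A client can now upload inputs by name and depth alone,
# treating the workflow document as opaque bytes.
for port in inputs:
    print(port.name, port.depth)
```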
Question: should we support t2flow as well? (Can distinguish by MIME type, which we can insist on the client providing correctly.)
Answer: yes, and it is done too. It turns out that it also simplifies the introspection code, which is a big bonus.
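A minimal sketch of dispatching on the client-supplied media type; the two media types shown are believed to be the registered ones for t2flow and scufl2 bundles (worth double-checking), and the parser functions are stand-ins for the real readers:

```python
# Stand-in parsers; the real server would delegate to the t2flow
# reader or the scufl2 support library respectively.
def parse_t2flow(data: bytes) -> str:
    return "t2flow workflow"

def parse_scufl2(data: bytes) -> str:
    return "scufl2 workflow bundle"

# Dispatch table keyed on the media type the client must supply correctly.
PARSERS = {
    "application/vnd.taverna.t2flow+xml": parse_t2flow,
    "application/vnd.taverna.scufl2.workflow-bundle": parse_scufl2,
}

def read_workflow(content_type: str, data: bytes) -> str:
    try:
        return PARSERS[content_type](data)
    except KeyError:
        raise ValueError(f"unsupported workflow media type: {content_type}")
```

Insisting that the client states the media type means the server never has to sniff the document to decide which reader to use.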
We want to be able to provide provenance/intermediate values, but from the server's perspective these could be in any format and conform to any ontology.
Level 1: Dumped Provenance
With this, we just turn on whatever provenance dumping the command-line tool (or its successor) supports and make sure that what comes out can be downloaded. Clients are responsible for figuring out what to do with it.
Question: would we want to be able to set up an auto-push of the provenance data elsewhere?
Level 2: Queryable Provenance
Provide mechanism for actual querying against the provenance (SPARQL endpoints, etc.)
NB: Not sure if the server should do this or not.
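Whether the server hosts it or not, a queryable endpoint would presumably speak the standard SPARQL protocol over HTTP; a sketch of building such a request with the standard library (the endpoint URL and query are made up for illustration):

```python
from urllib.parse import urlencode

def sparql_query_url(endpoint: str, query: str) -> str:
    """Build a SPARQL-protocol GET request URL for a query string."""
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical per-run provenance endpoint on the server.
url = sparql_query_url(
    "https://example.org/taverna/runs/1234/sparql",
    "SELECT ?value WHERE { ?port ?p ?value } LIMIT 10",
)
```

If the server only does Level 1 (dumped provenance), the same request shape still works; it would just be aimed at whatever external triple store the dump was pushed to.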
I've become reluctant to significantly change the execution model. Yes, it is somewhat slower than it might otherwise be, but it also means that a problem with one workflow run won't hugely impact others (other than in execution time); memory hogs won't cause wider destruction. It's also important to continue to maintain user separation, though that increases the impact on system resources and execution time.
NB: the current server allows mapping multiple server users to one system user, which gives some resource sharing and speed at the cost of reducing the enforced separation between users; i.e., workflows can poke around the filesystem to find other runs.
Update/clarification: the basic execution model will not change at all. We do intend to provide a more efficient messaging channel though (as a plugged-in module?).
We need to sort out IDs so that we can have a uniform view of messaging in the server; the IDs in the engine need to match up with those on the outside so that everything knows what it's talking about. (The current plethora of IDs is wholly untenable.) The server needs to have an ID for the run before the engine starts, because it needs that ID while handling input data uploads.
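One way to get a single uniform ID: the server mints it before the engine process exists and hands that same ID to the engine, so input uploads, engine events, and client requests all refer to one identifier. A sketch under those assumptions (all names here are illustrative):

```python
import uuid

class RunRegistry:
    """Server-side registry that allocates the run ID up front,
    before the engine starts, so input uploads can target it."""

    def __init__(self):
        self.runs = {}

    def create_run(self, workflow: bytes) -> str:
        run_id = str(uuid.uuid4())
        self.runs[run_id] = {"workflow": workflow, "inputs": {}, "engine": None}
        return run_id

    def upload_input(self, run_id: str, port: str, data: bytes) -> None:
        # Works even though the engine has not started yet.
        self.runs[run_id]["inputs"][port] = data

    def start_engine(self, run_id: str) -> None:
        # The engine is told to use the server's ID rather than
        # inventing its own, so its events match what clients know.
        self.runs[run_id]["engine"] = f"engine-for-{run_id}"

registry = RunRegistry()
rid = registry.create_run(b"<workflow/>")
registry.upload_input(rid, "sequence", b"ACGT")
registry.start_engine(rid)
```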
With that sorted out, it should be possible to get events out of the monitoring system (Q: what events are “interesting”?) and push those into the server's publication mechanism. (NB: current implementation assumes that all events are termination events, so this is quite a bit of work.)
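The publication side then has to treat events other than termination as first-class; a minimal sketch of fanning monitoring events out to subscribers, with illustrative event names (which events are "interesting" is still the open question):

```python
class EventBus:
    """Fans monitoring events out to the server's publication
    mechanism. Unlike the current implementation, events other
    than termination are first-class."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, run_id: str, event: str, detail: str = ""):
        for callback in self.subscribers:
            callback(run_id, event, detail)

bus = EventBus()
seen = []
bus.subscribe(lambda run_id, event, detail: seen.append((run_id, event)))

# A run emits several kinds of events, not just termination.
bus.publish("run-1", "started")
bus.publish("run-1", "iteration-completed", "processor X, item 3")
bus.publish("run-1", "finished")
```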
Usability as Executor
Want the workbench to be able to execute workflows “transparently” within a server instance. Open questions:
- How does the workbench discover the servers?
- How does the workbench select between the servers it finds?
- What does the workbench need apart from execution and monitoring (i.e., what it gets out of messaging and provenance)?