The Biomart system (http://www.biomart.org/) is a flexible data warehouse aimed at complex, interlinked biological data sets. It enables the retrieval of large amounts of genomic data, for example from Ensembl and Sanger, as well as from Uniprot, MSD and many other datasets.
The Biomart service is one of the default Taverna services and can be found in the service panel.
Creating a new Biomart query
The query is created by dragging a dataset (shown in the figure above) into the workflow diagram panel as with any other service type. A configuration panel is displayed. All Biomart services require configuration before they are of any use within the workflow.
The configuration panel is also accessible from the 'Configure biomart query...' option in the right-click context menu for any Biomart service in the workflow diagram panel.
As an example, let's create a Biomart query for human genes (Homo sapiens genes) from Ensembl.
Biomart initial configuration screen for Homo sapiens genomic data in Ensembl
Biomart services have two sets of configuration: filters and attributes. Filters define restrictions on the query and are particularly important if the user wishes the query to return anything other than an entire genome's worth of data. Attributes, on the other hand, define the values in which the user is interested. Conceptually, filters are inputs (although not all filters appear as input ports) and attributes are outputs.
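The two roles can be pictured with a small Python sketch. It is only an illustration of the concept, not how Taverna or Biomart execute a query: the record fields (gene_id, chromosome, start), the sample values and the run_query helper are all hypothetical.

```python
# Illustrative model only: filters restrict WHICH records are returned,
# attributes choose WHAT is returned for each surviving record.
# All field names and values below are made up for the example.

RECORDS = [
    {"gene_id": "G1", "chromosome": "1", "start": 100},
    {"gene_id": "G2", "chromosome": "2", "start": 250},
    {"gene_id": "G3", "chromosome": "1", "start": 900},
]


def run_query(records, filters, attributes):
    """Return the requested attributes for every record passing all filters."""
    selected = [
        record for record in records
        if all(record[name] == value for name, value in filters.items())
    ]
    return [[record[name] for name in attributes] for record in selected]


# With no filters, every record in the data set comes back.
everything = run_query(RECORDS, {}, ["gene_id"])

# A chromosome filter narrows the result; two attributes are returned per record.
chromosome_1 = run_query(RECORDS, {"chromosome": "1"}, ["gene_id", "start"])
```

In this toy model an empty filter set returns the whole data set, which is why an unfiltered query against a real genome-sized data set is so large.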
Filters are critical to almost all queries. If no filter is defined the query will return all records within the selected data set. As data sets generally correspond to entire genomes or databases these queries are therefore substantial. Filters are configured by selecting the 'Filters' button on the summary panel (on the left of the Biomart configuration panel).
They are shown in groups (REGION, GENE, GENE ONTOLOGY, EXPRESSION, etc.). Clicking the '+' next to the filter group name will expand the group; clicking the '-' will collapse it.
Filters are added by selecting the check box on the left. Note that the summary box on the left shows the filters that have been selected.
The image below shows two distinct kinds of filters. The drop-down lists (e.g. Chromosome number) represent filters over controlled vocabularies, whereas the text entry boxes (e.g. Gene Start(bp)) represent arbitrary textual inputs. Some filter values change when other filters are configured.
The image below shows two more filter types, both based on boolean expressions. The pair of filters at the end of the page are simple boolean filters: they allow the user to specify whether a particular constraint must be satisfied, must not be satisfied, or is ignored (filter not selected, the default). The filters at the top of the page are similar, but the condition is configurable: the entire filter is constructed by combining the subject chosen from the drop-down list with the predicate and object specified by the boolean selection.
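The three states of a simple boolean filter can be sketched in Python as follows. This is a conceptual model only; the BooleanFilter enum, its member names and the passes function are hypothetical and do not correspond to any Taverna or Biomart API.

```python
from enum import Enum


class BooleanFilter(Enum):
    """Hypothetical names for the three states of a simple boolean filter."""
    ONLY = "only"          # the constraint must be satisfied
    EXCLUDED = "excluded"  # the constraint must not be satisfied
    IGNORED = "ignored"    # filter not selected (the default)


def passes(has_property, mode=BooleanFilter.IGNORED):
    """Decide whether a record passes, given whether it has the property."""
    if mode is BooleanFilter.IGNORED:
        return True
    if mode is BooleanFilter.ONLY:
        return has_property
    return not has_property


# An unselected filter lets every record through.
ignored_results = [passes(True), passes(False)]
# A selected filter keeps only one side of the condition.
only_results = [passes(True, BooleanFilter.ONLY), passes(False, BooleanFilter.ONLY)]
excluded_results = [passes(True, BooleanFilter.EXCLUDED), passes(False, BooleanFilter.EXCLUDED)]
```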
The image below shows an ID List based filter. These are used to constrain the query to only those results matching an explicitly stated list of values. The drop down list at the top of the filter selects the type of ID to filter on and the text entry area accepts IDs, one per line, to be used as values in the filter. Selecting the 'Browse' button allows the ID values to be read from a file - the file must have one ID per line.
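A file suitable for the 'Browse' option simply holds one ID per line. The Python sketch below shows how such a file might be written and read back outside Taverna. The read_id_list helper and the placeholder IDs are hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behaviour of the configuration panel.

```python
import os
import tempfile


def read_id_list(path):
    """Read one ID per line, dropping surrounding whitespace.

    Blank lines are skipped; this is an assumption of the sketch.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small example file. The IDs are placeholders, not real identifiers.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ID_0001\nID_0002\n\nID_0003\n")

ids = read_id_list(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
```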
Some types of filters may manifest as inputs to the query processor. These inputs are always optional: if no upstream processor is connected to the input, the query will proceed exactly as configured. If, however, a data link is connected, the incoming data will override the value configured for that filter. For example, if the user wishes to construct a workflow where Ensembl Gene IDs are used to fetch the corresponding sequences, he or she specifies a filter based on Ensembl Gene ID (as with the ID list filter above) and overrides the specified values by connecting a string list to the appropriate input. When the query is run by the enactor, the ID list configuration will be overridden by the input data.
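The override rule can be summarised in a few lines of Python. This is a sketch of the behaviour described above, not Taverna code; the effective_filter_values function and the placeholder IDs are hypothetical, and None stands in for "no data link connected".

```python
def effective_filter_values(configured_values, input_values=None):
    """Return the values the query would use for one filter.

    configured_values: values entered in the configuration panel.
    input_values: values arriving on the input port, or None when
                  no data link is connected (hypothetical convention).
    """
    if input_values is None:
        # Nothing connected: the query proceeds exactly as configured.
        return list(configured_values)
    # A connected input overrides the configured values.
    return list(input_values)


configured = ["ID_0001", "ID_0002"]  # placeholder IDs
unconnected = effective_filter_values(configured)
connected = effective_filter_values(configured, ["ID_0042"])
```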
Attributes are configured by clicking the 'Attributes' button on the summary panel. Attributes are themselves divided into pages. The required page is selected from the available pages, in this case 'Features', 'Structures', 'Variations', 'Homologs' and 'Sequences'.
Attributes can only be selected from one page. Any attributes selected from an attribute page will be removed from the query when another page is selected.
Selecting an attribute in one of the subpages states that the attribute should be returned for each record passing all defined filters.
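The one-page-at-a-time rule can be modelled as below. The AttributeSelection class and the attribute names are hypothetical; only the page names 'Features' and 'Sequences' come from the text above. The sketch shows the selection being cleared when a different page is chosen, as a conceptual model rather than the actual implementation.

```python
class AttributeSelection:
    """Toy model: attributes may come from only one page at a time."""

    def __init__(self):
        self.page = None
        self.attributes = []

    def select_page(self, page):
        # Switching to a different page discards the current selection.
        if page != self.page:
            self.page = page
            self.attributes = []

    def add(self, attribute):
        if self.page is None:
            raise ValueError("select an attribute page first")
        if attribute not in self.attributes:
            self.attributes.append(attribute)


selection = AttributeSelection()
selection.select_page("Features")
selection.add("example_attribute_a")  # hypothetical attribute names
selection.add("example_attribute_b")
before_switch = list(selection.attributes)

selection.select_page("Sequences")
after_switch = list(selection.attributes)
```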
There are two modes for returning values from a Biomart query: multiple outputs and a single formatted output (as shown above). The formats available for the single output depend on which attribute page is chosen.
- Multiple Outputs
Each attribute maps directly to an output on the processor. Where possible, sensible names are chosen for the processor outputs so that it is reasonably obvious which output corresponds to which attribute.
The image below shows the corresponding processor in the advanced model explorer for the above attributes.
- Single formatted output
When a single output is chosen there is a single output port on the processor. A single value will be returned in the format chosen.
The image below shows the corresponding processor in the advanced model explorer when single output mode is chosen.
Switching between modes will cause output ports (and any links connected to them) to be removed from the workflow.
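The difference between the two modes can be illustrated with the same result rows shaped in two ways. The Python sketch below is conceptual: the attribute names, the sample rows and both helper functions are hypothetical, and tab-separated text is used only as an assumed example of a single-output format.

```python
ATTRIBUTES = ["gene_id", "chromosome"]  # hypothetical attribute names
ROWS = [["G1", "1"], ["G2", "2"]]       # made-up result rows


def as_multiple_outputs(attributes, rows):
    """One named output per attribute, each holding a list of values."""
    return {
        name: [row[index] for row in rows]
        for index, name in enumerate(attributes)
    }


def as_single_output(rows, separator="\t"):
    """One output holding every row in a single formatted string."""
    return "\n".join(separator.join(row) for row in rows)


multiple = as_multiple_outputs(ATTRIBUTES, ROWS)
single = as_single_output(ROWS)
```

In the first shape a downstream service can be linked to one attribute's values directly; in the second it receives one string and has to parse it.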