Defining Workflows¶
Once you have defined your steps, you are ready to build workflows out of them. A workflow builds up a graph of individual data processing steps. To that end, a workflow file (workflow.yml) defines a series of "stages", each of which adds nodes to the graph.
stages:
  - name: stage1
    ... rest of stage spec ...
  - name: stage2
    ... rest of stage spec ...
  ...
Example¶
Before going into details, here is an example of a workflow:
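stages:
  - name: eventselection
    dependencies: [init]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        name: recast_sample
        did: {step: init, output: did}
        xsec_in_pb: {step: init, output: xsec_in_pb}
        dxaod_file: {step: init, output: dxaod_file}
        lumi_in_ifb: 30.0
        submitDir: '{workdir}/submitDir'
      step: {$ref: 'steps.yml#/eventselection'}
  - name: statanalysis
    dependencies: [eventselection]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        data_file: /code/data/data.root
        signal_file: {step: eventselection, output: histfile}
        background_file: /code/data/background.root
        resultdir: '{workdir}/fitresults'
      step: {$ref: 'steps.yml#/statanalysis'}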
- name: statanalysis
dependencies: [eventselection]
scheduler:
scheduler_type: singlestep-stage
parameters:
data_file: /code/data/data.root
signal_file: {step: eventselection, output: histfile}
background_file: /code/data/background.root
resultdir: '{workdir}/fitresults'
step: {$ref: 'steps.yml#/statanalysis'}
Stage Structure¶
The basic structure of a stage is the following:
- name: <name>
  dependencies: [<dep1>, <dep2>, ...]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      <par name>: <par value or reference>
      <par name>: <par value or reference>
      ...
    step: {$ref: 'steps.yml#/<step name>'}
That is, a stage has a name, a list of dependencies, and a recipe for how new nodes should be added to the graph (the scheduler). The dependencies control at which point in the workflow execution the scheduler is applied.
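The role of the dependencies can be illustrated with a small Python sketch. It is not yadage's scheduling code. It assumes that a stage becomes applicable once every stage it depends on has finished, and that init counts as finished from the start. The helper name application_order is hypothetical.

```python
# Illustrative only: not yadage's actual scheduling logic.
# Each stage lists the stages it depends on, as in the example workflow.
stages = [
    {"name": "statanalysis", "dependencies": ["eventselection"]},
    {"name": "eventselection", "dependencies": ["init"]},
]


def application_order(stages, already_done=("init",)):
    """Return stage names in an order where each stage follows its dependencies."""
    done = set(already_done)
    pending = list(stages)
    order = []
    while pending:
        # A stage is ready once all of its dependencies have finished.
        ready = [s for s in pending if set(s["dependencies"]) <= done]
        if not ready:
            raise ValueError("unsatisfiable or circular dependencies")
        for stage in ready:
            order.append(stage["name"])
            done.add(stage["name"])
            pending.remove(stage)
    return order


print(application_order(stages))  # ['eventselection', 'statanalysis']
```

Note that the order in which stages appear in the file does not matter in this sketch; only the dependencies field decides when a stage is applied.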
Scheduler Structure¶
While there are more complicated schedulers, the most common one is scheduler_type: singlestep-stage. It adds a single new task to the graph. The task is identified using the step reference syntax, for example 'steps.yml#/myname'.
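To make the shape of a step reference concrete, the following Python sketch splits one into its two parts. It assumes that the text before the # names a file and the text after the #/ names an entry inside that file. The helper split_step_ref is hypothetical and not part of any library.

```python
def split_step_ref(ref):
    """Split a reference like 'file#/name' into (file, name). Hypothetical helper."""
    filename, sep, pointer = ref.partition("#")
    if not sep or not pointer.startswith("/"):
        raise ValueError(f"not a step reference: {ref!r}")
    # Drop the leading '/' of the fragment to get the step name.
    return filename, pointer[1:]


print(split_step_ref("steps.yml#/myname"))  # ('steps.yml', 'myname')
```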
Parameter References¶
The main body of the scheduler deals with setting the parameters of the task. While some parameters can be constant (typical example: outputfile: '{workdir}/output.root'), other parameter values are only defined by a previous step (typical example: input files).
The syntax for parameter references is
<par name>: {step: <stage name>, output: <published output name>}
where <par name> is the name of the parameter whose value should be set, <stage name> is the stage from which you want to take an output, and <published output name> is the name under which the output was published (as defined in the publisher section of the step definition).
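As a rough mental model, a parameter reference is a lookup into the outputs that an upstream stage has published. The Python sketch below is not yadage's implementation. The published dictionary, its file path, and the resolve helper are made up for illustration.

```python
# Outputs published by stages that have already run (the path is made up).
published = {
    "eventselection": {"histfile": "/work/eventselection/hist.root"},
}


def resolve(value, published):
    """Replace a {step, output} reference by the published value.

    Constant values are returned unchanged.
    """
    if isinstance(value, dict) and value.keys() >= {"step", "output"}:
        return published[value["step"]][value["output"]]
    return value


parameters = {
    "data_file": "/code/data/data.root",  # constant
    "signal_file": {"step": "eventselection", "output": "histfile"},  # reference
}
resolved = {name: resolve(value, published) for name, value in parameters.items()}
print(resolved["signal_file"])  # /work/eventselection/hist.root
```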
Stage Dependencies¶
When using parameter references, it is important to add each upstream stage used in a parameter reference to the stage's list of dependencies (collected under the dependencies field), as seen in this example:
- name: statanalysis
  dependencies: [eventselection]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      data_file: /code/data/data.root
      signal_file: {step: eventselection, output: histfile}
      background_file: /code/data/background.root
      resultdir: '{workdir}/fitresults'
    step: {$ref: 'steps.yml#/statanalysis'}
Here, the new signal to enter the fit is produced by an upstream stage and is bound to the signal_file parameter using the parameter reference {step: eventselection, output: histfile}. The eventselection stage is therefore a dependency and is consequently listed in the dependencies field of the stage definition.
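A forgotten dependency is easy to overlook, so a small consistency check can help. The Python sketch below works on a stage that has already been loaded into a plain dictionary. The function missing_dependencies is a hypothetical helper, not something yadage provides.

```python
def missing_dependencies(stage):
    """Return stages referenced in parameters but absent from 'dependencies'."""
    parameters = stage["scheduler"]["parameters"]
    # Collect the stage name of every {step: ..., output: ...} reference.
    referenced = {
        value["step"]
        for value in parameters.values()
        if isinstance(value, dict) and "step" in value
    }
    return sorted(referenced - set(stage.get("dependencies", [])))


stage = {
    "name": "statanalysis",
    "dependencies": [],  # 'eventselection' deliberately left out
    "scheduler": {
        "scheduler_type": "singlestep-stage",
        "parameters": {
            "data_file": "/code/data/data.root",
            "signal_file": {"step": "eventselection", "output": "histfile"},
        },
        "step": {"$ref": "steps.yml#/statanalysis"},
    },
}
print(missing_dependencies(stage))  # ['eventselection']
```

Once eventselection is added to the dependencies list, the function returns an empty list.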
The init stage¶
The workflow as a whole is also parametrized. That is, it can be called with different sets of input parameters (such as new RECAST signals). These "initial" parameters are accessible from an implicitly defined init stage. We can see this in action in the eventselection stage above:
stages:
  - name: eventselection
    dependencies: [init]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        name: recast_sample
        did: {step: init, output: did}
        xsec_in_pb: {step: init, output: xsec_in_pb}
        dxaod_file: {step: init, output: dxaod_file}
        lumi_in_ifb: 30.0
        submitDir: '{workdir}/submitDir'
      step: {$ref: 'steps.yml#/eventselection'}
Here, the overall workflow is called with the did, xsec_in_pb, and dxaod_file parameters, and the eventselection stage uses them in a few places using the same parameter reference syntax as explained above. Other parameters of the eventselection stage are set to constant values. Also, as expected, the init stage is added to the dependencies, since it is used in the parameter references.
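One way to picture the init stage is as a stage whose published outputs are simply the workflow's input parameters. The Python sketch below follows that picture. It is an illustration, not yadage's implementation, and the input values are made up.

```python
# Hypothetical input parameters for one workflow invocation (values are made up).
init_parameters = {
    "did": 123456,
    "xsec_in_pb": 0.5,
    "dxaod_file": "/data/sample.root",
}

# Treat the inputs as the published outputs of the implicit 'init' stage.
published = {"init": dict(init_parameters)}

stage_parameters = {
    "name": "recast_sample",                                 # constant
    "did": {"step": "init", "output": "did"},                # from init
    "xsec_in_pb": {"step": "init", "output": "xsec_in_pb"},  # from init
    "lumi_in_ifb": 30.0,                                     # constant
}

# References are looked up in 'published'; constants pass through unchanged.
resolved = {
    name: published[value["step"]][value["output"]] if isinstance(value, dict) else value
    for name, value in stage_parameters.items()
}
print(resolved)
```

Calling the workflow with a different set of input parameters only changes init_parameters in this picture; the stage definition itself stays the same.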