
Defining Workflows

Once you have defined your steps, you are ready to build workflows out of them. A workflow is a graph of individual data processing steps. The workflow file workflow.yml defines a series of "stages", each of which adds nodes to that graph.

stages:
- name: stage1
  ... rest of stage spec ...
- name: stage2
  ... rest of stage spec ...
...

Example

Before going into details, here is an example of a workflow:

stages:
  - name: eventselection
    dependencies: [init]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        name: recast_sample
        did: {step: init, output: did}
        xsec_in_pb: {step: init, output: xsec_in_pb}
        dxaod_file: {step: init, output: dxaod_file}
        lumi_in_ifb: 30.0
        submitDir: '{workdir}/submitDir'
      step: {$ref: 'steps.yml#/eventselection'}
  - name: statanalysis
    dependencies: [eventselection]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        data_file: /code/data/data.root
        signal_file: {step: eventselection, output: histfile}
        background_file: /code/data/background.root
        resultdir: '{workdir}/fitresults'
      step: {$ref: 'steps.yml#/statanalysis'}

Stage Structure

The basic structure of a stage is the following:

  - name: <name>
    dependencies: [<dep1>, <dep2>, ...]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        <par name>: <par value or reference>
        <par name>: <par value or reference>
        ...
      step: {$ref: 'steps.yml#/<step name>'}

That is, a stage has a name, a list of dependencies, and a recipe for how new nodes should be added to the graph (the scheduler). The dependencies control at which point in the workflow execution the scheduler is applied.
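As a rough illustration of how dependencies constrain ordering, the sketch below sorts the stages of the example workflow so that every stage comes after the stages it depends on. It is a simplification for intuition only and not how the workflow engine is implemented; it assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage name -> declared dependencies, copied from the example workflow.
dependencies = {
    "eventselection": ["init"],
    "statanalysis": ["eventselection"],
}

# static_order() yields each stage only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

For this linear chain the only valid order is init, then eventselection, then statanalysis.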

Scheduler Structure

While there are more complicated schedulers, the most common one is scheduler_type: singlestep-stage. It adds a single new task to the graph. The task is identified by a step reference of the form 'steps.yml#/myname'.

Parameter References

The main body of the scheduler deals with setting the parameters of the task. While some parameter values can be constant (typical example: outputfile: '{workdir}/output.root'), others are only defined by a previous step (typical example: input files).

The syntax for parameter references is:

<par name>: {step: <stage name>, output: <published output name>}

where <par name> is the name of the parameter whose value should be set, <stage name> is the stage from which you want to take an output, and <published output name> is the name under which the output was published (as defined in the publisher section of the step definition).
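To make the reference syntax concrete, here is a minimal Python sketch of how such references could be resolved once the upstream outputs are known. It is illustrative only and not the workflow engine's own code: the function resolve_parameters, the published dictionary, and the output path are all hypothetical.

```python
# Illustrative sketch only; names and paths below are hypothetical.

def resolve_parameters(parameters, published):
    """Replace {step: ..., output: ...} references with published values.

    `published` maps a stage name to the dict of outputs it published.
    Constant parameter values are passed through unchanged.
    """
    resolved = {}
    for name, value in parameters.items():
        if isinstance(value, dict) and "step" in value and "output" in value:
            resolved[name] = published[value["step"]][value["output"]]
        else:
            resolved[name] = value
    return resolved


# Hypothetical output published by the eventselection stage.
published = {"eventselection": {"histfile": "/work/eventselection/hist.root"}}

# Parameters as they would appear after loading the stage definition.
parameters = {
    "data_file": "/code/data/data.root",
    "signal_file": {"step": "eventselection", "output": "histfile"},
}

resolved = resolve_parameters(parameters, published)
print(resolved)
```

The constant data_file is kept as written, while signal_file is replaced by whatever the upstream stage published under histfile.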

Stage Dependencies

When using parameter references, it is important to add every upstream stage that appears in a parameter reference to the stage's list of dependencies (the dependencies field), as in this example:

  - name: statanalysis
    dependencies: [eventselection]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        data_file: /code/data/data.root
        signal_file: {step: eventselection, output: histfile}
        background_file: /code/data/background.root
        resultdir: '{workdir}/fitresults'
      step: {$ref: 'steps.yml#/statanalysis'}

Here, the new signal to enter the fit is produced by an upstream stage and is bound to the signal_file parameter using the parameter reference {step: eventselection, output: histfile}. The eventselection stage is therefore a dependency and is listed in the dependencies field of the stage definition.
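This rule is easy to check mechanically. The following is a minimal sketch, assuming the stage definition has already been loaded into a Python dictionary (for example with a YAML parser); the helper missing_dependencies is hypothetical and not part of any library.

```python
# Illustrative sketch only; missing_dependencies is a hypothetical helper.

def missing_dependencies(stage):
    """Return stages used in parameter references but not declared."""
    declared = set(stage.get("dependencies", []))
    referenced = {
        value["step"]
        for value in stage["scheduler"]["parameters"].values()
        if isinstance(value, dict) and "step" in value
    }
    return sorted(referenced - declared)


# The statanalysis stage from the example, as a plain dictionary.
statanalysis = {
    "name": "statanalysis",
    "dependencies": ["eventselection"],
    "scheduler": {
        "scheduler_type": "singlestep-stage",
        "parameters": {
            "data_file": "/code/data/data.root",
            "signal_file": {"step": "eventselection", "output": "histfile"},
            "background_file": "/code/data/background.root",
            "resultdir": "{workdir}/fitresults",
        },
    },
}

# A copy in which the dependency was forgotten.
broken = dict(statanalysis, dependencies=[])

print(missing_dependencies(statanalysis))
print(missing_dependencies(broken))
```

An empty result means every referenced stage is declared; any name that is returned should be added to the dependencies field.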

The init stage

The workflow as a whole is also parametrized. That is, it can be called with different sets of input parameters (such as new RECAST signals). These "initial" parameters are accessible from an implicitly defined init stage. We can see this in action in the eventselection stage above:

stages:
  - name: eventselection
    dependencies: [init]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        name: recast_sample
        did: {step: init, output: did}
        xsec_in_pb: {step: init, output: xsec_in_pb}
        dxaod_file: {step: init, output: dxaod_file}
        lumi_in_ifb: 30.0
        submitDir: '{workdir}/submitDir'
      step: {$ref: 'steps.yml#/eventselection'}

Here, the overall workflow is called with did, xsec_in_pb, and dxaod_file parameters and the eventselection stage uses them in a few places using the same parameter reference syntax as explained above. Other parameters of the eventselection stage are set to constant values. Also, as expected, the init stage is added to the dependencies, since it is used in the parameter references.


Last update: October 5, 2021