Defining Workflow Steps

Before defining the full workflow, we first need to capture individual data processing steps in the steps.yml file.

Overview

The steps.yml file consists of a dictionary of named steps, each of which has three sections (process, publisher, and environment):

stepname:
  process:
    ...
  publisher:
    ...
  environment:
    ...

The step definition acts as a parametrized job template:

  1. process -- describes the script template of the job to run.

  2. publisher -- identifies the relevant items produced by the process such as output files.

  3. environment -- specifies the computational environment, essentially the Docker image to use.

Parameters are declared in a way that closely follows Python's formatting syntax. That is, you can use a parameter by writing {myparameter}. To escape braces, use double braces: {{ or }}.
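
For instance, a script template might combine interpolation and escaping as follows (a minimal sketch; the parameter name and the awk invocation are purely illustrative):

process:
  process_type: interpolated-script-cmd
  script: |
    # {myparameter} is replaced by the parameter value when the job is run
    echo "Processing {myparameter}"
    # double braces are passed through literally, so the job runs: awk '{print $1}' inputs.txt
    awk '{{print $1}}' inputs.txt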

Example

To give you a feel, this is what a typical process definition looks like:

process:
  process_type: interpolated-script-cmd
  script: |
    /recast_auth/getkrb.sh
    source /home/atlas/release_setup.sh
    source x86*/setup.sh
    MBJ_run.py \
    --files {inputdata_xrootd} \
    --inputSource xrootd \
    --submitDir {outputdir} \
    --dataSource 2 \
    --triggerLists trigger_menu_2015,trigger_menu_2016,trigger_menu_2017 \
    --doSyst 1 \
    --config MultibjetsAnalysis/SUSYTools_rel21.conf
publisher:
  publisher_type: interpolated-pub
  publish:
    selected_signal: ['{outputdir}/data-output_histfitter/*{mc16a_pattern}*.root','{outputdir}/data-output_histfitter/*{mc16d_pattern}*.root']
  glob: true
environment:
  environment_type: docker-encapsulated
  image: gitlab-registry.cern.ch/recast-atlas/susy/atlas-conf-2018-041/mbj_analysis
  imagetag: ATLAS-CONF-2018-041
  resources:
    - kerberos: true

Process

The process section defines the script for this step. Typically, you will first set up the shell environment for your analysis (release, etc.) and then execute some processing, such as the event selection or the statistical analysis. While different process_types are possible, you will probably only need interpolated-script-cmd.

process_type: interpolated-script-cmd
script: |
    <your script here>

Publisher

The publisher section identifies the outputs of the script. This identification will be used later when you connect the outputs and inputs of different steps in the workflow definition.
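
A minimal publisher definition, following the interpolated-pub pattern from the example above, maps each output name to a path template that may reference the same {parameters} as the script (the names here are placeholders):

publisher:
  publisher_type: interpolated-pub
  publish:
    <output name>: '<path template, e.g. {outputdir}/result.root>'

If the published paths contain wildcard patterns, also set glob: true, as in the full example above.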

Environment

The environment section defines the computational environment in which the script from process will be run. In practice, this is done by referencing an appropriate Docker image.

environment:
  environment_type: docker-encapsulated
  image: <your image here>
  imagetag: <your tag here>

Environment Resources

In some cases, the script needs resources beyond the Docker image itself. These can be specified using the resources field in the environment definition.

environment:
  environment_type: docker-encapsulated
  image: <your image here>
  imagetag: <your tag here>
  resources:
  - <resource name>
  - <resource name>

The two main use-cases are authentication and CVMFS access.

Authentication

For some jobs you need to be able to authenticate as an ATLAS member, e.g. if you need to access ATLAS data from EOS. Authentication information should never be stored within the Docker image. Rather, you can request authentication data to be mounted into the container.

resources:
- kerberos: true

This will result in a directory /recast_auth being mounted into your container. You can use getkrb.sh in this directory to acquire Kerberos tokens:

process:
  process_type: interpolated-script-cmd
  script: |
    /recast_auth/getkrb.sh
    .. <remaining script here> ...
publisher:
  ...
environment:
  environment_type: docker-encapsulated
  ...
  resources:
  - kerberos: true

Note

For backwards-compatibility reasons, the historical term GRIDProxy can be used in place of kerberos: true. This is not recommended, though; for consistency, kerberos: true should be used. (KRB5Auth is also valid, but mixed casing can lead to typos, so use it at your own risk.)
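
For reference, a legacy step definition might declare the resource like this (not recommended):

resources:
- GRIDProxy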

CVMFS Access

For older analyses (e.g. release 20.7), it might be necessary to access CVMFS. (Note: for release 21 and onwards, this should not be necessary -- please get in touch if you have a use-case.) Adding - CVMFS to the resources section makes /cvmfs/atlas.cern.ch, /cvmfs/atlas-condb.cern.ch, and /cvmfs/sft.cern.ch available.

process:
  process_type: interpolated-script-cmd
  script: |
    ls -lrt /cvmfs/atlas.cern.ch
publisher:
  ...
environment:
  environment_type: docker-encapsulated
  ...
  resources:
  - CVMFS

Testing

Once the steps are defined, it is time to test them with concrete test values for the parameters used in the job scripts.

Adding Tests

Tests can be added to the "catalogue entry" file of the analysis.

Let's take the example of this step definition:

eventselection:
  process:
    process_type: interpolated-script-cmd
    interpreter: bash
    script: |
      source /home/atlas/release_setup.sh
      source /analysis/build/x86*/setup.sh
      cat << 'EOF' > recast_xsecs.txt
      id/I:name/C:xsec/F:kfac/F:eff/F:relunc/F
      {did} {name} {xsec_in_pb} 1.0 1.0 1.0
      EOF
      echo {dxaod_file} > recast_inputs.txt
      myEventSelection {submitDir} recast_inputs.txt recast_xsecs.txt {lumi_in_ifb}
  publisher:
    publisher_type: interpolated-pub
    publish:
      histfile: '{submitDir}/hist-sample.root'
  environment:
    environment_type: 'docker-encapsulated'
    image: reanahub/reana-demo-atlas-recast-eventselection

The step has six parameters:

  1. {did} the dataset id
  2. {name} the sample name
  3. {xsec_in_pb} the sample cross-section
  4. {lumi_in_ifb} the overall luminosity
  5. {submitDir} the work directory for EventLoop
  6. {dxaod_file} the input DxAOD file

To test this step, we will add the following test case (named test_eventselection) to recast.yml:

tests:
- name: test_eventselection
  spec: steps.yml#/eventselection
  parameters:
    did: 404958
    name: recast
    submitDir: '{workdir}/submitDir'
    dxaod_file: https://recastwww.web.cern.ch/recastwww/data/reana-recast-demo/mc15_13TeV.123456.cap_recast_demo_signal_one.root
    xsec_in_pb: 0.00122
    lumi_in_ifb: 30
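
With this test case in place, and assuming the analysis has been registered in the catalogue under examples/testdemo (as in the examples below), the test can be run locally with:

recast tests run examples/testdemo test_eventselection --backend docker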

Using Test Data

Some steps require data to be present (possibly the data that would normally be produced by an upstream step within a workflow, e.g. the input to a statistical analysis). In this case you can provide test inputs in the test definition using the data: field like this:

- name: test_statanalysis
  spec: steps.yml#/statanalysis
  parameters:
    signal_file: '{readdir0}/hist-sample.root'
    resultdir: '{workdir}/fitresults'
    data_file: /code/data/data.root
    background_file: /code/data/background.root
  data:
  - testdata

Here:

  • testdata refers to a relative directory testdata. Alternatively this may also be an absolute path. If using the latter, note that other people checking out your repository may not have access to the test data.

  • {readdir0} is an internal variable which evaluates to testdata (or whatever directory is specified in the data: field).

Note that you can specify multiple data directories, e.g.:

data:
- testdata1
- testdata2

In this case, {readdir0} will evaluate to testdata1 and {readdir1} will evaluate to testdata2, and so on for additional data directories.
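
For instance, a hypothetical variant of the statistical-analysis test above could read the signal histogram from the first directory and the background file from the second:

- name: test_statanalysis
  spec: steps.yml#/statanalysis
  parameters:
    signal_file: '{readdir0}/hist-sample.root'
    background_file: '{readdir1}/background.root'
    resultdir: '{workdir}/fitresults'
  data:
  - testdata1
  - testdata2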

Running tests

Prerequisites (Authentication, etc.):

Make sure you have followed the steps in the Introduction. Most likely your steps will require running over both private images (e.g. built from your private analysis repositories) and private data (e.g. DxAOD files), so make sure you have authentication set up before running the recast tests commands:

eval "$(recast auth setup -a $RECAST_USER -a $RECAST_PASS -a $RECAST_TOKEN -a default)"
eval "$(recast auth write --basedir authdir)"

Local non-interactive test of Scripts

$> recast tests run examples/testdemo test_eventselection --backend docker
2019-06-03 03:36:33,743 |  recastatlas.testing |   INFO | running test recast-test-050bc7f8
<TypedLeafs: {u'histfile': u'/Users/lukasheinrich/Code/recast/recast-romerepo/recast-test-050bc7f8/submitDir/hist-sample.root'}> (prepublished)
2019-06-03 01:36:35,254 | pack.packtivity.step |   INFO | starting file logging for topic: step
<TypedLeafs: {u'histfile': u'/Users/lukasheinrich/Code/recast/recast-romerepo/recast-test-050bc7f8/submitDir/hist-sample.root'}> (post-run)
test test_eventselection succeeded

This creates a directory recast-test-[tag] (e.g. recast-test-050bc7f8 in the example above) which contains the testing output. Useful debugging information can be found in the log files located under recast-test-[tag]/_packtivity.
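
For example, to inspect the test workspace and its logs (the tag will differ for each run):

ls recast-test-050bc7f8/
ls recast-test-050bc7f8/_packtivity/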

Local interactive tests in the step runtime environment

For troubleshooting, it is also possible to test the runtime environment interactively. This drops you into exactly the same shell that the non-interactive script runs in, so you can experiment with commands until you are comfortable putting them into the script template.

$(recast tests shell examples/testdemo test_eventselection --backend docker)
2019-06-05 15:58:06,468 |  recastatlas.testing |   INFO | running test recast-testshell-f199802e
2019-06-05 13:58:07,567 |       pack.test.step |   INFO | starting file logging for topic: step
sh-4.1# source /home/atlas/release_setup.sh
Configured GCC from: /opt/lcg/gcc/6.2.0binutils/x86_64-slc6
Configured AnalysisBase from: /usr/AnalysisBase/21.2.32/InstallArea/x86_64-slc6-gcc62-opt
[bash][root AnalysisBase-21.2.32]:build >

Non-interactive Within CI

The recommended recast/recastatlas image has all the necessary software installed to work with the local backend. Adding something like the following to your .gitlab-ci.yml should run the test:

testing:
  tags:
  - docker-privileged
  services:
  - docker:stable-dind
  stage: build
  image: "recast/recastatlas:v0.3.0"
  script:
  - eval "$(recast auth setup -a $RECAST_USER -a $RECAST_PASS -a $RECAST_TOKEN -a default)"
  - eval "$(recast auth write --basedir authdir)"

  # add my workflow
  - $(recast catalogue add $PWD)
  - recast tests run examples/testdemo test_eventselection --tag firsttest
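
Note that for the authentication step to work in CI, RECAST_USER, RECAST_PASS, and RECAST_TOKEN must be defined as CI/CD variables in your GitLab project settings.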
