# Defining Workflow Steps
Before defining the full workflow, we first need to capture individual data processing steps in the `steps.yml` file.
## Overview
The `steps.yml` file consists of a dictionary of named steps, each of which has three sections (`process`, `publisher`, and `environment`):
```yaml
stepname:
  process:
    ...
  publisher:
    ...
  environment:
    ...
```
The step definition acts as a parametrized job template:

- `process` -- describes the script template of the job to run.
- `publisher` -- identifies the relevant items produced by the `process`, such as output files.
- `environment` -- specifies the computational environment, essentially the Docker image.
Parameters are declared in a way that closely follows Python's formatting syntax. That is, you can use a parameter by writing `{myparameter}`. To escape braces, use double braces: `{{` or `}}`.
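For example, a minimal (hypothetical) script template showing both interpolation and escaping; `{myparameter}` is filled in from the workflow or test parameters, while doubled braces survive as literal braces:

```yaml
process_type: interpolated-script-cmd
script: |
  # {myparameter} is replaced with the parameter's value before the script runs
  echo "parameter value: {myparameter}"
  # doubled braces are passed through literally: this prints {}
  echo "literal braces: {{}}"
```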
## Example
To give you a feel, this is what a typical step definition looks like:
```yaml
process:
  process_type: interpolated-script-cmd
  script: |
    /recast_auth/getkrb.sh
    source /home/atlas/release_setup.sh
    source x86*/setup.sh
    MBJ_run.py \
      --files {inputdata_xrootd} \
      --inputSource xrootd \
      --submitDir {outputdir} \
      --dataSource 2 \
      --triggerLists trigger_menu_2015,trigger_menu_2016,trigger_menu_2017 \
      --doSyst 1 \
      --config MultibjetsAnalysis/SUSYTools_rel21.conf
publisher:
  publisher_type: interpolated-pub
  publish:
    selected_signal: ['{outputdir}/data-output_histfitter/*{mc16a_pattern}*.root','{outputdir}/data-output_histfitter/*{mc16d_pattern}*.root']
  glob: true
environment:
  environment_type: docker-encapsulated
  image: gitlab-registry.cern.ch/recast-atlas/susy/atlas-conf-2018-041/mbj_analysis
  imagetag: ATLAS-CONF-2018-041
  resources:
    - kerberos: true
```
## Process
The `process` section defines the script for this step. Typically, you will first set up the shell environment for your analysis (release, etc.) and then execute some processing, such as the event selection or the statistical analysis.
While different `process_type`s are possible, you will probably only need the `interpolated-script-cmd`:
```yaml
process_type: interpolated-script-cmd
script: |
  <your script here>
```
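As a sketch of the typical setup-then-process pattern (the setup script path mirrors the examples on this page; `run_my_selection`, `{inputfile}`, and `{outputdir}` are placeholders for your own analysis):

```yaml
process_type: interpolated-script-cmd
script: |
  # first, set up the shell environment (release, build, ...)
  source /home/atlas/release_setup.sh
  # then, run the actual processing with interpolated parameters
  run_my_selection --input {inputfile} --output {outputdir}/out.root
```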
## Publisher
The `publisher` section identifies the outputs of the script. This identification will be used later, when you connect the outputs and inputs of different steps in the workflow definition.
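As a minimal sketch, mirroring the `interpolated-pub` publisher from the example above (the output name `myoutput` and the path are placeholders):

```yaml
publisher_type: interpolated-pub
publish:
  myoutput: '{outputdir}/out.root'
```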
## Environment
The `environment` section defines the computational environment in which the script from `process` will be run. In practice, this is done by referencing an appropriate Docker image.
```yaml
environment:
  environment_type: docker-encapsulated
  image: <your image here>
  imagetag: <your tag here>
```
### Environment Resources
In some cases, the script needs resources beyond the Docker image itself. These can be specified using the `resources` field in the environment definition.
```yaml
environment:
  environment_type: docker-encapsulated
  image: <your image here>
  imagetag: <your tag here>
  resources:
    - <resource name>
    - <resource name>
```
The two main use-cases are authentication and CVMFS access.
### Authentication
For some jobs you need to be able to authenticate as an ATLAS member, e.g., if you need to access ATLAS data from EOS. Authentication information should never be stored within the Docker image. Rather, you can request authentication data to be mounted into the image.
```yaml
resources:
  - kerberos: true
```
This will result in a directory `/recast_auth` being mounted into your container. You can use `getkrb.sh` in this directory to acquire Kerberos tokens:
```yaml
process:
  process_type: interpolated-script-cmd
  script: |
    /recast_auth/getkrb.sh
    .. <remaining script here> ...
publisher:
  ...
environment:
  environment_type: docker-encapsulated
  ...
  resources:
    - kerberos: true
```
**Note**

For backwards-compatibility reasons, the historical term `GRIDProxy` can be used in place of `kerberos: true`. This is not recommended, though; for consistency, `kerberos: true` should be used. (`KRB5Auth` is also valid, but mixed casing can lead to typos, so use at your own risk.)
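That is, the legacy form looks like this:

```yaml
resources:
  - GRIDProxy  # historical alias; prefer `- kerberos: true`
```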
### CVMFS Access
For older analyses (e.g. release 20.7), it might be necessary to access CVMFS. (Note: for release 21 and onwards, this should not be necessary -- please get in touch if you have a use-case.) Adding `- CVMFS` to the `resources` section makes `/cvmfs/atlas.cern.ch`, `/cvmfs/atlas-condb.cern.ch`, and `/cvmfs/sft.cern.ch` available:
```yaml
process:
  process_type: interpolated-script-cmd
  script: |
    ls -lrt /cvmfs/atlas.cern.ch
publisher:
  ...
environment:
  environment_type: docker-encapsulated
  ...
  resources:
    - CVMFS
```
## Testing
Once the steps are defined, it is time to test them with some test values for the parameters defined in the job script.
### Adding Tests
Tests can be added to the "catalogue entry" file of the analysis.
Let's take the example of this step definition:
```yaml
eventselection:
  process:
    process_type: interpolated-script-cmd
    interpreter: bash
    script: |
      source /home/atlas/release_setup.sh
      source /analysis/build/x86*/setup.sh
      cat << 'EOF' > recast_xsecs.txt
      id/I:name/C:xsec/F:kfac/F:eff/F:relunc/F
      {did} {name} {xsec_in_pb} 1.0 1.0 1.0
      EOF
      echo {dxaod_file} > recast_inputs.txt
      myEventSelection {submitDir} recast_inputs.txt recast_xsecs.txt {lumi_in_ifb}
  publisher:
    publisher_type: interpolated-pub
    publish:
      histfile: '{submitDir}/hist-sample.root'
  environment:
    environment_type: 'docker-encapsulated'
    image: reanahub/reana-demo-atlas-recast-eventselection
```
The step has six parameters:

- `{did}` -- the dataset id
- `{name}` -- the sample name
- `{xsec_in_pb}` -- the sample cross-section
- `{lumi_in_ifb}` -- the overall luminosity
- `{submitDir}` -- the work directory for EventLoop
- `{dxaod_file}` -- the input DxAOD file
To test this step, we will add the following test case (named `test_eventselection`) to `recast.yml`:
```yaml
tests:
  - name: test_eventselection
    spec: steps.yml#/eventselection
    parameters:
      did: 404958
      name: recast
      submitDir: '{workdir}/submitDir'
      dxaod_file: https://recastwww.web.cern.ch/recastwww/data/reana-recast-demo/mc15_13TeV.123456.cap_recast_demo_signal_one.root
      xsec_in_pb: 0.00122
      lumi_in_ifb: 30
```
### Using Test Data
Some steps require data to be present (possibly the data that would normally be produced by an upstream step within a workflow, e.g. the input to a statistical analysis). In this case, you can provide test inputs in the test definition using the `data:` field like this:
```yaml
- name: test_statanalysis
  spec: steps.yml#/statanalysis
  parameters:
    signal_file: '{readdir0}/hist-sample.root'
    resultdir: '{workdir}/fitresults'
    data_file: /code/data/data.root
    background_file: /code/data/background.root
  data:
    - testdata
```
Here:

- `testdata` refers to a relative directory `testdata`. Alternatively, this may also be an absolute path; if using the latter, note that other people checking out your repository may not have access to the test data.
- `{readdir0}` is an internal variable which evaluates to `testdata` (or whatever directory is specified in the `data:` field).
Note that you can specify multiple data directories, e.g.:

```yaml
data:
  - testdata1
  - testdata2
```

In this case, `{readdir0}` will evaluate to `testdata1`, `{readdir1}` will evaluate to `testdata2`, and so on for additional data directories.
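For example, a hypothetical test entry wiring both directories into its parameters could look like this (the test name and file names are illustrative):

```yaml
- name: test_statanalysis_multi
  spec: steps.yml#/statanalysis
  parameters:
    signal_file: '{readdir0}/hist-sample.root'      # resolved inside testdata1
    background_file: '{readdir1}/background.root'   # resolved inside testdata2
  data:
    - testdata1
    - testdata2
```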
## Running tests
### Prerequisites (Authentication, etc.)
Make sure you have followed the steps in the Introduction. Most likely your steps will require running over both private images (e.g. built from your private analysis repos) and private data (e.g. DxAOD files), so make sure you have authentication set up before running the `recast tests` commands:

```bash
eval "$(recast auth setup -a $RECAST_USER -a $RECAST_PASS -a $RECAST_TOKEN -a default)"
eval "$(recast auth write --basedir authdir)"
```
### Local non-interactive test of scripts
```
$> recast tests run examples/testdemo test_eventselection --backend docker
2019-06-03 03:36:33,743 | recastatlas.testing | INFO | running test recast-test-050bc7f8
<TypedLeafs: {u'histfile': u'/Users/lukasheinrich/Code/recast/recast-romerepo/recast-test-050bc7f8/submitDir/hist-sample.root'}> (prepublished)
2019-06-03 01:36:35,254 | pack.packtivity.step | INFO | starting file logging for topic: step
<TypedLeafs: {u'histfile': u'/Users/lukasheinrich/Code/recast/recast-romerepo/recast-test-050bc7f8/submitDir/hist-sample.root'}> (post-run)
test test_eventselection succeeded
```
This creates a directory `recast-test-[tag]` (e.g. `recast-test-050bc7f8` in the above example) which contains the testing output. Useful debugging information can be found in the log files located under `recast-test-[tag]/_packtivity`.
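For example, to inspect the logs after a test run (the directory name follows the pattern above; the exact log file names depend on your step):

```bash
# list the log files produced for the test's step
ls recast-test-050bc7f8/_packtivity
# then open any of them with your pager of choice, e.g.:
# less recast-test-050bc7f8/_packtivity/<logfile>
```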
### Local interactive tests in the step runtime environment
For troubleshooting, it is also possible to test the runtime environment interactively. This drops you into exactly the same shell that the non-interactive script runs in, so you can try things out until you are comfortable putting them into the script template.
```
$(recast tests shell examples/testdemo test_eventselection --backend docker)
2019-06-05 15:58:06,468 | recastatlas.testing | INFO | running test recast-testshell-f199802e
2019-06-05 13:58:07,567 | pack.test.step | INFO | starting file logging for topic: step
sh-4.1# source /home/atlas/release_setup.sh
Configured GCC from: /opt/lcg/gcc/6.2.0binutils/x86_64-slc6
Configured AnalysisBase from: /usr/AnalysisBase/21.2.32/InstallArea/x86_64-slc6-gcc62-opt
[bash][root AnalysisBase-21.2.32]:build >
```
### Non-interactive Within CI
The recommended `recast/recastatlas` image has all the necessary software installed to work with the local backend. Adding something like this to your `.gitlab-ci.yml` should run the test:
```yaml
testing:
  tags:
    - docker-privileged
  services:
    - docker:stable-dind
  stage: build
  image: "recast/recastatlas:v0.3.0"
  script:
    - eval "$(recast auth setup -a $RECAST_USER -a $RECAST_PASS -a $RECAST_TOKEN -a default)"
    - eval "$(recast auth write --basedir authdir)"
    # add my workflow
    - $(recast catalogue add $PWD)
    - recast tests run examples/testdemo test_eventselection --tag firsttest
```
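Note that this assumes `$RECAST_USER`, `$RECAST_PASS`, and `$RECAST_TOKEN` are made available to the job, e.g. as CI/CD variables configured in the project settings.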