1) Message boards : ATLAS application : ATLAS native_mt fail (Message 35471)
Posted 9 Jun 2018 by PoppaGeek
Post:
PyJobTransforms.trfExe.validate 2018-06-09 14:23:35,700 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (64) (Error code 65)

***********************pilot_error_report.json*********************
{
    "3957346728": {
        "2": [
            {
                "pilotErrorCode": 0,
                "pilotErrorDiag": "Job failed: Non-zero failed job return code: 65"
            }
        ]
    }
}
*****************The last 100 lines of the pilot log******************


6 work units all completed and validated with runtime less than 500 seconds.
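
For anyone who wants to pull the diagnostic string out of a pilot_error_report.json like the one quoted above, here is a minimal Python sketch. The function name `summarize_pilot_errors` is made up for illustration, and the meaning of the inner key "2" is not stated anywhere in the log, so the code only passes it through as `inner_key`.

```python
import json

# Report text copied from the excerpt quoted in this post.
REPORT = """
{
    "3957346728": {
        "2": [
            {
                "pilotErrorCode": 0,
                "pilotErrorDiag": "Job failed: Non-zero failed job return code: 65"
            }
        ]
    }
}
"""


def summarize_pilot_errors(report_text):
    """Return (panda_id, inner_key, code, diag) tuples from a pilot error report.

    Hypothetical helper: the nesting is inferred from the excerpt only.
    """
    report = json.loads(report_text)
    rows = []
    for panda_id, sections in report.items():
        for inner_key, entries in sections.items():
            for entry in entries:
                rows.append(
                    (
                        panda_id,
                        inner_key,
                        entry["pilotErrorCode"],
                        entry["pilotErrorDiag"],
                    )
                )
    return rows


for row in summarize_pilot_errors(REPORT):
    print(row)
```

Note that pilotErrorCode is 0 here even though the job failed, so the useful information is in pilotErrorDiag (return code 65).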
2) Message boards : ATLAS application : ATLAS native_mt fail (Message 35469)
Posted 9 Jun 2018 by PoppaGeek
Post:
Found this:

This is the error that we have seen before: "No events to process: 4050 (skipEvents) >= 2000 (inputEvents of EVNT)"

It happens when the WU tries to process events which do not exist in the input file and is a bug in our ATLAS systems. I have changed the validation logic to pass these results so that the real error gets propagated upstream and so the WU does not get retried, since it will never succeed.


https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4179&postid=33433
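
As I understand the quoted message, the job is told to skip more events than the input file contains, so nothing is left to process. A rough Python sketch of that condition follows. It is not the actual ATLAS code; the function name `has_events_to_process` is invented for illustration, and the numbers are the ones from the quoted error.

```python
def has_events_to_process(skip_events, input_events):
    """Return True if at least one event remains after skipping.

    Hypothetical check, modelled only on the wording of the quoted error.
    """
    return skip_events < input_events


skip_events = 4050   # from the quoted error message
input_events = 2000  # events in the EVNT input file, per the same message

if not has_events_to_process(skip_events, input_events):
    print(
        f"No events to process: {skip_events} (skipEvents) "
        f">= {input_events} (inputEvents of EVNT)"
    )
```

With these values no retry can succeed, which matches the explanation that the WU "will never succeed".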

So case closed?
3) Message boards : ATLAS application : ATLAS native_mt fail (Message 35468)
Posted 9 Jun 2018 by PoppaGeek
Post:
I cannot for the life of me find where to enable "Show Computers", so I do not know if you can see them. :-/

Why are the tasks failing on this setup?

Thanks!

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
14:18:54 (20335): wrapper (7.7.26015): starting
14:18:54 (20335): wrapper: running run_atlas (--nthreads 2)
singularity image is /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-slc6.img
sys.argv = ['run_atlas', '--nthreads', '2']
THREADS=2
Checking for CVMFS
CVMFS is installed
OS:cat: /etc/redhat-release: No such file or directory

This is not SLC6, need to run with Singularity....
Checking Singularity...
Singularity is installed
copy /var/lib/boinc-client/slots/2/shared/start_atlas.sh
copy /var/lib/boinc-client/slots/2/shared/RTE.tar.gz
copy /var/lib/boinc-client/slots/2/shared/input.tar.gz
copy /var/lib/boinc-client/slots/2/shared/ATLAS.root_0
export ATHENA_PROC_NUMBER=2;start atlas job with PandaID=3957346728
Testing the function of Singularity...
check singularity with cmd:singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-slc6.img hostname
Singularity Works...
cmd = singularity exec --pwd /var/lib/boinc-client/slots/2 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-slc6.img sh start_atlas.sh > runtime_log 2> runtime_log.err
running cmd return value is 0

***********************log_extracts.txt*************************
- Last 10 lines from /var/lib/boinc-client/slots/2/Panda_Pilot_20784_1528571937/PandaJob/athena_stdout.txt -
PyJobTransforms.trfExe.preExecute 2018-06-09 14:19:36,673 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
PyJobTransforms.trfExe.preExecute 2018-06-09 14:19:36,675 INFO Now writing wrapper for substep executor EVNTtoHITS
PyJobTransforms.trfExe._writeAthenaWrapper 2018-06-09 14:19:36,676 INFO Valgrind not engaged
PyJobTransforms.trfExe.preExecute 2018-06-09 14:19:36,676 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
PyJobTransforms.trfExe.execute 2018-06-09 14:19:36,676 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
PyJobTransforms.trfExe.execute 2018-06-09 14:23:34,791 INFO EVNTtoHITS executor returns 64
PyJobTransforms.trfExe.validate 2018-06-09 14:23:35,700 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (64) (Error code 65)
PyJobTransforms.trfExe.validate 2018-06-09 14:23:35,732 INFO Scanning logfile log.EVNTtoHITS for errors
PyJobTransforms.transform.execute 2018-06-09 14:23:36,121 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (64)
PyJobTransforms.transform.execute 2018-06-09 14:23:39,295 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (64))

- Walltime -
JobRetrival=3, StageIn=10, Execution=273, StageOut=0, CleanUp=14

***********************pilot_error_report.json*********************
{
    "3957346728": {
        "2": [
            {
                "pilotErrorCode": 0,
                "pilotErrorDiag": "Job failed: Non-zero failed job return code: 65"
            }
        ]
    }
}
*****************The last 100 lines of the pilot log******************
    "seopt": "token:ATLASDATADISK:srm://srm.ndgf.org:8443/srm/managerv2?SFN=", 
    "sepath": "/atlas/disk/atlasdatadisk/rucio", 
    "seprodpath": "/atlas/disk/atlasdatadisk/rucio", 
    "setokens": "ATLASDATADISK", 
    "site": "BOINC", 
    "siteid": "BOINC_MCORE", 
    "sitershare": null, 
    "space": 0, 
    "special_par": null, 
    "stageinretry": 2, 
    "stageoutretry": 2, 
    "status": "brokeroff", 
    "statusoverride": "offline", 
    "sysconfig": "manual", 
    "system": "arc", 
    "tags": "arc", 
    "tier": "T3", 
    "timefloor": 0, 
    "tmpdir": null, 
    "transferringlimit": 20000, 
    "tspace": "2070-01-01T00:00:00", 
    "use_newmover": "True", 
    "validatedreleases": "True", 
    "version": null, 
    "wansinklimit": null, 
    "wansourcelimit": null, 
    "wnconnectivity": "full", 
    "wntmpdir": null
}

2018-06-09 19:18:57|20784|SiteInformat| Queuedata was successfully downloaded by pilot wrapper script
2018-06-09 19:18:57|20784|ATLASSiteInf| curl command returned valid queuedata
2018-06-09 19:18:57|20784|ATLASSiteInf| Site BOINC_MCORE is currently in brokeroff mode
2018-06-09 19:18:57|20784|ATLASSiteInf| Job recovery turned off
2018-06-09 19:18:57|20784|ATLASSiteInf| Confirmed correctly formatted rucio sepath
2018-06-09 19:18:57|20784|ATLASSiteInf| Confirmed correctly formatted rucio seprodpath
2018-06-09 19:18:57|20784|SiteInformat| Evaluating queuedata
2018-06-09 19:18:57|20784|SiteInformat| Setting unset pilot variables using queuedata
2018-06-09 19:18:57|20784|SiteInformat| appdir: 
2018-06-09 19:18:57|20784|pUtil.py    | File registration will be done by server
2018-06-09 19:18:57|20784|pUtil.py    | Updated stage-in retry number to 2
2018-06-09 19:18:57|20784|pUtil.py    | Updated stage-out retry number to 2
2018-06-09 19:18:57|20784|pUtil.py    | Detected unset (NULL) release/homepackage string
2018-06-09 19:18:57|20784|ATLASExperim| Application dir confirmed: /var/lib/boinc-client/slots/2/
2018-06-09 19:18:57|20784|pilot.py    | Pilot will serve experiment: Nordugrid-ATLAS
2018-06-09 19:18:57|20784|ATLASExperim| Architecture information:
2018-06-09 19:18:57|20784|ATLASExperim| Excuting command: lsb_release -a
2018-06-09 19:18:57|20784|ATLASExperim| 
sh: lsb_release: command not found
2018-06-09 19:18:57|20784|pUtil.py    | getSiteInformation: got experiment=ATLAS
2018-06-09 19:18:57|20784|ATLASExperim| appdirs = ['/cvmfs/atlas.cern.ch/repo/sw']
2018-06-09 19:18:57|20784|ATLASExperim| head of /cvmfs/atlas.cern.ch/repo/sw/ChangeLog: 
--------------------------------------------------------------------------------
2018-06-09 21:00:23 Alessandro De Salvo
	* + AGISData 20180609210023

2018-06-09 20:01:16 Alessandro De Salvo
  * + GroupData 201806092001

2018-06-09 20:00:27 Alessandro De Salvo
	* + AGISData 20180609200027

2018-06-09 19:00:17 Alessandro De Salvo
--------------------------------------------------------------------------------
2018-06-09 19:18:57|20784|ATLASExperim| ATLAS_PYTHON_PILOT set to /usr/bin/python
2018-06-09 19:18:57|20784|pUtil.py    | getSiteInformation: got experiment=ATLAS
2018-06-09 19:18:57|20784|ATLASExperim| Executing command: export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase;$ATLAS_LOCAL_ROOT_BASE/utilities/checkValidity.sh (time-out: 300)
2018-06-09 19:18:57|20784|pUtil.py    | Executing command: export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase;$ATLAS_LOCAL_ROOT_BASE/utilities/checkValidity.sh (protected by timed_command, timeout: 300 s)
2018-06-09 19:18:58|20784|pUtil.py    | Elapsed time: 0
2018-06-09 19:18:58|20784|ATLASExperim| Diagnostics tool has verified CVMFS
2018-06-09 19:18:58|20784|Node.py     | Collecting machine features
2018-06-09 19:18:58|20784|Node.py     | $MACHINEFEATURES not defined locally
2018-06-09 19:18:58|20784|Node.py     | $JOBFEATURES not defined locally
2018-06-09 19:18:58|20784|Node.py     | Executing command: hostname -i
2018-06-09 19:18:58|20784|Node.py     | IP number of worker node: 127.0.1.1
2018-06-09 19:18:58|20784|pUtil.py    | getSiteInformation: got experiment=Nordugrid-ATLAS
2018-06-09 19:18:58|20784|pilot.py    | Using site information for experiment: Nordugrid-ATLAS
2018-06-09 19:18:58|20784|pilot.py    | Will attempt to create workdir: /var/lib/boinc-client/slots/2/Panda_Pilot_20784_1528571937
2018-06-09 19:18:58|20784|pilot.py    | Creating file: /var/lib/boinc-client/slots/2/CURRENT_SITEWORKDIR
2018-06-09 19:18:58|20784|pUtil.py    | Wrote string "/var/lib/boinc-client/slots/2/Panda_Pilot_20784_1528571937" to file: /var/lib/boinc-client/slots/2/CURRENT_SITEWORKDIR
2018-06-09 19:18:58|20784|ATLASExperim| ATLAS_POOLCOND_PATH not set by wrapper
2018-06-09 19:18:58|20784|pilot.py    | Preparing to execute Cleaner
2018-06-09 19:18:58|20784|pilot.py    | Cleaning /var/lib/boinc-client/slots/2
2018-06-09 19:18:58|20784|Cleaner.py  | Cleaner initialized with clean-up limit: 2 hours
2018-06-09 19:18:58|20784|Cleaner.py  | Cleaner will scan for lost directories in verified path: /var/lib/boinc-client/slots/2
2018-06-09 19:18:58|20784|Cleaner.py  | Executing empty dirs clean-up, stage 1/5
2018-06-09 19:18:58|20784|Cleaner.py  | Purged 0 empty directories
2018-06-09 19:18:58|20784|Cleaner.py  | Executing work dir clean-up, stage 2/5
2018-06-09 19:18:58|20784|Cleaner.py  | Purged 0 single workDirs directories
2018-06-09 19:18:58|20784|Cleaner.py  | Executing maxed-out dirs clean-up, stage 3/5
2018-06-09 19:18:58|20784|Cleaner.py  | Purged 0 empty directories
2018-06-09 19:18:58|20784|Cleaner.py  | Executing AthenaMP clean-up, stage 4/5 <SKIPPED>
2018-06-09 19:18:58|20784|Cleaner.py  | Executing PanDA Pilot dir clean-up, stage 5/5
2018-06-09 19:18:58|20784|Cleaner.py  | Number of found job state files: 0
2018-06-09 19:18:58|20784|Cleaner.py  | No job state files were found, aborting clean-up
2018-06-09 19:18:58|20784|pilot.py    | Update frequencies:
2018-06-09 19:18:58|20784|pilot.py    | ...Processes: 300 s
2018-06-09 19:18:58|20784|pilot.py    | .......Space: 600 s
2018-06-09 19:18:58|20784|pilot.py    | ......Server: 1800 s
2018-06-09 19:18:58|20784|pUtil.py    | Timefloor set to zero in queuedata (multi-jobs disabled)
***************diag file************
runtimeenvironments=APPS/HEP/ATLAS-SITE;
Processors=1
WallTime=411.32s
KernelTime=18.39s
UserTime=252.80s
CPUUsage=65%
MaxResidentMemory=1807372kB
AverageResidentMemory=0kB
AverageTotalMemory=0kB
AverageUnsharedMemory=0kB
AverageUnsharedStack=0kB
AverageSharedMemory=0kB
PageSize=4096B
MajorPageFaults=6937
MinorPageFaults=2270894
Swaps=0
ForcedSwitches=24219
WaitSwitches=487507
Inputs=2706816
Outputs=65056
SocketReceived=0
SocketSent=0
Signals=0

nodename=PoppaGeek@Dev9400
exitcode=0
******************************WorkDir***********************
total 263632
drwxrwx--x 6 boinc boinc      4096 Jun  9 14:25 .
drwxrwx--x 5 boinc boinc      4096 Jun  9 13:30 ..
-rw------- 1 boinc boinc   6739364 Jun  9 14:19 agis_ddmendpoints.cvmfs.json
-rw------- 1 boinc boinc   5359206 Jun  9 14:19 agis_schedconf.cvmfs.json
drwx------ 2 boinc boinc      4096 Jun  9 14:19 .alrb
drwxr-xr-x 3 boinc boinc      4096 Jun  9 14:18 APPS
-rwx------ 1 boinc boinc      2435 Jun  9 10:31 ARCpilot
-rw------- 1 boinc boinc       549 Jun  9 14:19 .asetup
-rw------- 1 boinc boinc     10994 Jun  9 14:19 .asetup.save
-rw-r--r-- 1 boinc boinc         0 Jun  9 14:18 boinc_lockfile
-rw-r--r-- 1 boinc boinc      8192 Jun  9 14:25 boinc_mmap_file
-rw-r--r-- 1 boinc boinc       526 Jun  9 14:23 boinc_task_state.xml
-rw------- 1 boinc boinc        58 Jun  9 14:18 CURRENT_SITEWORKDIR
-rw-r--r-- 1 boinc boinc 256192482 Jun  9 14:18 EVNT.13837267._001172.pool.root.1
-rw-r--r-- 1 boinc boinc      5744 Jun  9 14:18 init_data.xml
-rw-r--r-- 1 boinc boinc   1091389 Jun  9 14:18 input.tar.gz
-rw------- 1 boinc boinc       488 Jun  9 14:25 IUWLDmW1ulsnlyackoJh5iwnABFKDmABFKDmqz7XDmABFKDmOvp3Fm.diag
-rw------- 1 boinc boinc      3467 Jun  9 14:25 jobSmallFiles.tgz
-rw-r--r-- 1 boinc boinc       105 Jun  9 14:18 job.xml
-rw------- 1 boinc boinc    170277 Jun  9 14:25 log.14322886._074314.job.log.1
-rw------- 1 boinc boinc    152071 Jun  9 14:24 log.14322886._074314.job.log.tgz.1
-rw------- 1 boinc boinc      1490 Jun  9 14:24 log_extracts.txt
-rw------- 1 boinc boinc       306 Jun  9 14:23 memory_monitor_summary.json
-rw------- 1 boinc boinc       599 Jun  9 14:25 metadata-surl.xml
-rw------- 1 boinc boinc       241 Jun  9 14:24 output.list
-rw------- 1 boinc boinc        11 Jun  9 14:19 pandaIDs.out
-rw------- 1 boinc boinc      2951 Jun  9 14:19 pandaJobData_1.out
-rw------- 1 boinc boinc      2951 Jun  9 14:18 pandaJobData.out
-rw------- 1 boinc boinc      8158 Jun  9 14:24 panda_node_struct.pickle
-rw------- 1 boinc boinc       203 Jun  9 14:24 pilot_error_report.json
-rw------- 1 boinc boinc        29 Jun  9 14:18 PILOT_INITDIR
-rw------- 1 boinc boinc       139 Jun  9 14:25 pilotlog-last.txt
-rw------- 1 boinc boinc     11387 Jun  9 14:18 pilotlog.txt
drwx------ 3 boinc boinc      4096 Jun  9 14:19 .pki
-rw------- 1 boinc boinc      3751 Jun  9 14:19 queuedata.json
-rw-r--r-- 1 boinc boinc      4376 Jun  9 10:32 queuedata.pilot.json
-rw-r--r-- 1 boinc boinc       606 Jun  9 14:18 RTE.tar.gz
-rwxr-xr-x 1 boinc boinc      8356 Jun  9 14:18 run_atlas
-rw-r--r-- 1 boinc boinc       604 Jun  9 14:25 runtime_log
-rw-r--r-- 1 boinc boinc     10385 Jun  9 14:25 runtime_log.err
drwxrwx--x 2 boinc boinc      4096 Jun  9 14:25 shared
-rw-r--r-- 1 boinc boinc     14425 Jun  9 14:18 start_atlas.sh
-rw------- 1 boinc boinc        19 Jun  9 14:19 START_TIME_3957346728
-rw------- 1 boinc boinc         1 Jun  9 14:18 STATUSCODE
-rw-r--r-- 1 boinc boinc      9737 Jun  9 14:25 stderr.txt
-rw------- 1 boinc boinc        47 Jun  9 14:24 workdir_size-3957346728.json
-rw-r--r-- 1 boinc boinc       100 Jun  9 14:18 wrapper_26015_x86_64-pc-linux-gnu
-rw-r--r-- 1 boinc boinc        24 Jun  9 14:25 wrapper_checkpoint.txt
running start_atlas return value is 0
Parent exit 0
child process exit 0
14:25:47 (20335): run_atlas exited; CPU time 253.180000
14:25:47 (20335): called boinc_finish(0)

</stderr_txt>
]]>
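
A dump like the one above is long, so a small filter helps to find the lines that matter. This is a minimal Python sketch that keeps only ERROR and CRITICAL lines; the regular expression is an assumption based on the PyJobTransforms lines shown in log_extracts.txt, and `failure_lines` is a made-up helper name.

```python
import re

# Assumed line layout, taken from the log_extracts.txt section above:
#   <logger> <YYYY-MM-DD HH:MM:SS,mmm> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<logger>\S+) "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def failure_lines(text, levels=("ERROR", "CRITICAL")):
    """Return (level, message) pairs for log lines at the given levels."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            hits.append((match.group("level"), match.group("message")))
    return hits


# Three lines copied from the log above.
SAMPLE = """\
PyJobTransforms.trfExe.execute 2018-06-09 14:23:34,791 INFO EVNTtoHITS executor returns 64
PyJobTransforms.trfExe.validate 2018-06-09 14:23:35,700 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (64) (Error code 65)
PyJobTransforms.transform.execute 2018-06-09 14:23:36,121 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (64)
"""

for level, message in failure_lines(SAMPLE):
    print(level, "|", message)
```

Lines in other formats (the pipe-separated pilot log, for example) simply do not match and are skipped.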


