Thread 'CMS jobs failing at the LogArchive stage'

Author	Message
ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) Joined: 29 Aug 05, Posts: 1153, Credit: 11,734,920, RAC: 269	Message 45115 - Posted: 11 Jul 2021, 10:10:31 UTC, last modified: 11 Jul 2021, 10:15:01 UTC

Some of you may have noticed that since Friday night the CMS@Home job graphs have been showing a 100% failure rate. This seems to be a failure in sending the job logs to storage at the CERN DataBridge:

    <@========== WMException Start ==========@>
    Exception Class: StageOutError
    Message: Command exited non-zero, ExitCode:112
    Output: stdout: Sat Jul 10 13:41:48 UTC 2021
    Copying 327892 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2021/7/10/ireid_TC_SLC7_IDR_CMS_Home_210708_103444_8611/SinglePiE50HCAL_pythia8_2018_GenSimFull/0002/2/14aed0d0-6a1b-4993-a85a-cac211d51285-112-2-logArchive.tar.gz
    gfal-copy exit status: 112
    ERROR: gfal-copy exited with 112

I've been unable to contact my colleagues at CERN (it's high holiday season in that part of Europe...), and I cannot connect to the DataBridge with my browser, so I've opened a problem ticket with CERN IT support.
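For anyone who wants to pull the relevant fields out of a stage-out log like the one quoted above, here is a minimal sketch in Python. The helper name `parse_stageout_log` and the two regular expressions are hypothetical; they are not part of WMAgent or gfal2, and they only pattern-match the text layout seen in this one excerpt, so other log formats may well need different patterns.

```python
import re

# Hypothetical patterns, written only against the excerpt quoted in this thread.
COPY_RE = re.compile(r"Copying\s+(\d+)\s+bytes\s+(\S+)\s+=>\s+(\S+)")
EXIT_RE = re.compile(r"gfal-copy exit status:\s*(\d+)")


def parse_stageout_log(text):
    """Extract size, source, destination and exit status from a log excerpt.

    Any field that cannot be found in the text is left as None.
    """
    result = {
        "size_bytes": None,
        "source": None,
        "destination": None,
        "exit_status": None,
    }
    copy = COPY_RE.search(text)
    if copy:
        result["size_bytes"] = int(copy.group(1))
        result["source"] = copy.group(2)
        result["destination"] = copy.group(3)
    status = EXIT_RE.search(text)
    if status:
        result["exit_status"] = int(status.group(1))
    return result


# The excerpt from the post above, with whitespace flattened to single spaces.
SAMPLE = (
    "Output: stdout: Sat Jul 10 13:41:48 UTC 2021 "
    "Copying 327892 bytes "
    "file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => "
    "https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/"
    "2021/7/10/ireid_TC_SLC7_IDR_CMS_Home_210708_103444_8611/"
    "SinglePiE50HCAL_pythia8_2018_GenSimFull/0002/2/"
    "14aed0d0-6a1b-4993-a85a-cac211d51285-112-2-logArchive.tar.gz "
    "gfal-copy exit status: 112 "
    "ERROR: gfal-copy exited with 112"
)

if __name__ == "__main__":
    info = parse_stageout_log(SAMPLE)
    # A non-zero exit status means the copy to storage did not succeed.
    failed = info["exit_status"] not in (None, 0)
    print(info["exit_status"], info["size_bytes"], "failed" if failed else "ok")
```

Run over a batch of job logs, something along these lines would show whether every failure carries the same exit status and points at the same destination host, which is what the excerpt suggests here.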

Harri Liljeroos Joined: 28 Sep 04, Posts: 802, Credit: 65,119,812, RAC: 26,242	Message 45119 - Posted: 11 Jul 2021, 15:50:12 UTC, last modified: 11 Jul 2021, 15:51:35 UTC

I have noticed that there are long pauses in CPU activity during crunching. This happens when one set of jobs has finished and the task should download a new set. The result is a difference of 2 to 3 hours between runtime and CPU time. It gives the CPU a breather while the summer temperatures are higher, so all in all not a bad thing.

All my CMS tasks are getting the usual credits though, so no errors in BOINC.

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) Joined: 29 Aug 05, Posts: 1153, Credit: 11,734,920, RAC: 269	Message 45120 - Posted: 11 Jul 2021, 17:37:32 UTC - in response to Message 45119.

Yes, we don't pass all HTCondor exit codes through to BOINC, so even if condor/WMAgent thinks a job has failed, BOINC will still give you credit. The current failures occur when archiving the .log.gz files. I'm not sure whether the .root result files are getting stored; I cannot log into DataBridge to check. Still no response from CERN IT.

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) Joined: 29 Aug 05, Posts: 1153, Credit: 11,734,920, RAC: 269	Message 45121 - Posted: 12 Jul 2021, 13:36:24 UTC - in response to Message 45120.

OK, the problem appears to have been resolved. I'll keep an eye on it for a day or two to be sure.