Message boards : CMS Application : CMS jobs failing at the LogArchive stage
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 45115 - Posted: 11 Jul 2021, 10:10:31 UTC
Last modified: 11 Jul 2021, 10:15:01 UTC

Some of you may have noticed that since Friday night the CMS@Home job graphs have been showing a 100% failure rate. This seems to be a failure in sending the job logs to storage at the CERN DataBridge:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:112
Output: stdout: Sat Jul 10 13:41:48 UTC 2021
Copying 327892 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2021/7/10/ireid_TC_SLC7_IDR_CMS_Home_210708_103444_8611/SinglePiE50HCAL_pythia8_2018_GenSimFull/0002/2/14aed0d0-6a1b-4993-a85a-cac211d51285-112-2-logArchive.tar.gz
gfal-copy exit status: 112
ERROR: gfal-copy exited with 112

I've been unable to contact my colleagues at CERN (it's high holiday season in that part of Europe...), and I cannot connect to the DataBridge with my browser, so I've opened a problem ticket with CERN IT support.
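For anyone who wants to check their own job logs for the same symptom, here is a minimal sketch in Python. It assumes the stage-out lines look exactly like the excerpt quoted above ("Copying N bytes SRC => DST" and "gfal-copy exit status: N"); the function name `summarize_stageout` and the `SAMPLE` text are made up for illustration and are not part of WMAgent or gfal2.

```python
import re

# Patterns modelled on the two stage-out lines quoted in the post above.
# Assumption: other WMAgent versions may format these lines differently.
COPY_RE = re.compile(r"^Copying (\d+) bytes (\S+) => (\S+)$", re.MULTILINE)
EXIT_RE = re.compile(r"^gfal-copy exit status: (\d+)$", re.MULTILINE)


def summarize_stageout(log_text):
    """Return a dict describing the first stage-out attempt, or None if
    the expected lines are not present in log_text."""
    copy = COPY_RE.search(log_text)
    status = EXIT_RE.search(log_text)
    if copy is None or status is None:
        return None
    exit_code = int(status.group(1))
    return {
        "bytes": int(copy.group(1)),
        "source": copy.group(2),
        "destination": copy.group(3),
        "exit_code": exit_code,
        "failed": exit_code != 0,  # any non-zero exit counts as a failure
    }


# Hypothetical, shortened sample in the same shape as the quoted log.
SAMPLE = """Output: stdout: Sat Jul 10 13:41:48 UTC 2021
Copying 327892 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/example-logArchive.tar.gz
gfal-copy exit status: 112
ERROR: gfal-copy exited with 112
"""

if __name__ == "__main__":
    print(summarize_stageout(SAMPLE))
```

This only reports what the log already says (the destination and the exit status); it does not try to interpret what exit code 112 means on the DataBridge side.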
ID: 45115
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,161,776
RAC: 15,924
Message 45119 - Posted: 11 Jul 2021, 15:50:12 UTC
Last modified: 11 Jul 2021, 15:51:35 UTC

I have noticed that there are long pauses in CPU activity during crunching. This happens when one set of jobs has finished and it should download a new set. This results in a difference of 2 to 3 hours between run time and CPU time. It gives the CPU a breather while the summer temperatures are higher, so all in all not a bad thing.

All my CMS tasks are getting the usual credits though, so no errors in BOINC.
ID: 45119
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 45120 - Posted: 11 Jul 2021, 17:37:32 UTC - in response to Message 45119.  

Yes, we don't pass all HTCondor exit codes through to BOINC, so even if condor/WMAgent thinks a job has failed, BOINC will still give you credit. The current failures occur when archiving the .log.gz files. I'm not sure whether the .root result files are getting stored; I cannot log in to the DataBridge to check.
Still no response from CERN IT.
ID: 45120
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 45121 - Posted: 12 Jul 2021, 13:36:24 UTC - in response to Message 45120.  

OK, the problem appears to have been resolved. I'll keep an eye on it for a day or two to be sure.
ID: 45121


