Message boards : News : Problem writing CMS job results; please avoid CMS tasks until we find the reason
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38576 - Posted: 18 Apr 2019, 15:44:45 UTC

Since some time last night, CMS jobs appear to have had problems writing results to CERN storage (DataBridge). As far as I can see this is not affecting BOINC tasks: they keep running and credit is given. However, Dashboard does see the jobs as failing, hence the large red areas on the job plots.
Until we find out where the problem lies, it's best to set No New Tasks or otherwise avoid CMS jobs. I'll let you know when things are back to normal again.
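
For anyone who prefers the command line to the BOINC Manager, here is a minimal sketch of setting No New Tasks with boinccmd. The project URL below is an assumption; use whatever URL your client reports for this project. Note that No New Tasks applies to the whole project, not only to CMS, so it also stops work from the other applications.

```python
import subprocess

# Assumption: replace with the exact project URL your BOINC client lists
# (see `boinccmd --get_project_status`).
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_new_tasks_command(project_url, allow=False):
    """Build the boinccmd call that toggles 'No New Tasks' for one project."""
    op = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, op]


def run(cmd):
    """Execute the command; needs boinccmd on PATH and a running client."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so nothing changes by accident.
    print(" ".join(no_new_tasks_command(PROJECT_URL)))
```

Calling `run(no_new_tasks_command(PROJECT_URL, allow=True))` later would allow new work again.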
ID: 38576
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38577 - Posted: 18 Apr 2019, 16:18:08 UTC - in response to Message 38576.  

I'm starting to find some "Permission denied" messages; perhaps a certificate has expired somewhere.
ID: 38577
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 38596 - Posted: 21 Apr 2019, 12:38:24 UTC - in response to Message 38576.  

So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?
ID: 38596
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,671,037
RAC: 143,151
Message 38597 - Posted: 21 Apr 2019, 13:28:09 UTC - in response to Message 38596.  

So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?

CMS tasks that are downloaded but not yet started are like an empty box.
From the project's perspective it doesn't matter whether to suspend or to cancel them.

From the BOINC client's perspective it may be better to cancel them to free some resources.
ID: 38597
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38599 - Posted: 21 Apr 2019, 17:25:29 UTC - in response to Message 38597.  

So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?

CMS tasks that are downloaded but not yet started are like an empty box.
From the project's perspective it doesn't matter whether to suspend or to cancel them.

From the BOINC client's perspective it may be better to cancel them to free some resources.

Indeed. As far as I can tell, the job failures are not counting as BOINC task failures, so you are getting BOINC credits any-road-up. For the project, however, the results are lost. Please redirect your computing resources to another project where they might count while we try to identify the problem.
ID: 38599
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 38601 - Posted: 22 Apr 2019, 6:34:11 UTC

If the problem lies in moving data from the upload server into whatever processes that data, could you just disable the CMS assimilator, so that the uploaded results become a backlog to process once the problem is fixed instead of getting lost?
ID: 38601
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38604 - Posted: 22 Apr 2019, 10:29:14 UTC - in response to Message 38601.  

If the problem lies in moving data from the upload server into whatever processes that data, could you just disable the CMS assimilator, so that the uploaded results become a backlog to process once the problem is fixed instead of getting lost?

No, I don't think so. There's a separation of duties here. What BOINC sees, and what we call a task, is an instantiation of a virtual machine (VM) that runs under VirtualBox. These VMs are all that BOINC and its associated machinery see. Within the VM we run an HTCondor instance, which looks for CMS jobs on a Condor server; if it finds one, it runs it. It's the staging out of the results and logs from those jobs that is failing, and BOINC knows nothing about them. When the VM has run enough jobs, or there are no new jobs to run, or there is a severe error, it terminates, and only then does BOINC become aware that the task has ended.
What I'm not sure about at the moment is why the stage-out errors are not causing the VM to stop with an error; I'm sure they used to.
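
To illustrate the separation described above, here is a toy model in Python. It is only a sketch: the class and function names (`VmTask`, `JobResult`, `failing_stage_out`) are made up for illustration and do not correspond to any real BOINC or HTCondor code. The point is that the task reports a normal finish even when every job inside it failed to stage out.

```python
from dataclasses import dataclass, field


@dataclass
class JobResult:
    ran_ok: bool       # the CMS job itself finished
    staged_out: bool   # results and logs reached storage


@dataclass
class VmTask:
    """Stand-in for one BOINC task: a VM that pulls jobs until it stops."""
    max_jobs: int
    results: list = field(default_factory=list)

    def run(self, job_queue):
        # Keep fetching jobs until enough have run or the queue is empty.
        for job in job_queue:
            if len(self.results) >= self.max_jobs:
                break
            self.results.append(job())
        # BOINC only learns that the VM ended; it never looks at self.results.
        return "task finished"


def failing_stage_out():
    """Hypothetical job: runs fine but cannot write its output."""
    return JobResult(ran_ok=True, staged_out=False)


task = VmTask(max_jobs=3)
boinc_view = task.run([failing_stage_out] * 5)
project_view = sum(r.staged_out for r in task.results)
print(boinc_view, "| jobs run:", len(task.results), "| results stored:", project_view)
```

In this model the task ends cleanly after three jobs while zero results are stored, which mirrors why credit is still granted during the outage.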
ID: 38604
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38629 - Posted: 25 Apr 2019, 6:37:52 UTC

We seem to be running jobs successfully again. Unfortunately I'm at a conference today so can't verify all details until tonight. Resume tasks with care.
ID: 38629
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38632 - Posted: 25 Apr 2019, 7:39:47 UTC - in response to Message 38629.  

Received confirmation that it was a quota problem which should now be fixed.
ID: 38632
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,671,037
RAC: 143,151
Message 38633 - Posted: 25 Apr 2019, 7:47:33 UTC

Result upload succeeded (status: 200; size: 50 MB):
[25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" 200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
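
If you want to check your own proxy log for successful uploads, something like the following Python sketch could pull the status and size out of such a line. The regular expression is an assumption based only on the layout of the line above, not on a format specification, and the function names are mine.

```python
import re

# Assumed layout: [time] "METHOD URL PROTOCOL" STATUS SIZE ...
LOG_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

# The log line from this post, split only to keep the source readable.
LINE = (
    '[25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch'
    '/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/'
    'JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11'
    '/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root'
    '?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" '
    '200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT'
)


def parse_upload(line):
    """Return the interesting fields of one log line, or None if it does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "time": m.group("time"),
        "method": m.group("method"),
        "path": m.group("url").split("?", 1)[0],  # drop the signed query string
        "status": int(m.group("status")),
        "size_bytes": int(m.group("size")),
    }


def upload_succeeded(line):
    """True for a PUT request answered with a 2xx status."""
    rec = parse_upload(line)
    return rec is not None and rec["method"] == "PUT" and 200 <= rec["status"] < 300


rec = parse_upload(LINE)
print(rec["status"], round(rec["size_bytes"] / 1e6, 1), "MB", upload_succeeded(LINE))
```

A failed stage-out would presumably show up as a PUT with a 4xx status such as 403, though I have not included such a line here because none was posted.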
ID: 38633
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38636 - Posted: 25 Apr 2019, 9:47:56 UTC - in response to Message 38633.  

Result upload succeeded (status: 200; size: 50 MB):
[25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" 200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT

Good, thanks. The job charts are starting to look better.
ID: 38636
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 38662 - Posted: 29 Apr 2019, 10:39:04 UTC
Last modified: 29 Apr 2019, 10:39:24 UTC

VirtualBox downloads are not possible since the certificate is not secure. This happens on both Windows and Linux.
Tullio
ID: 38662
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38666 - Posted: 29 Apr 2019, 18:40:01 UTC - in response to Message 38662.  

VirtualBox downloads are not possible since the certificate is not secure. This happens on both Windows and Linux.
Tullio

Which certificate exactly? I was running jobs up until a short time ago -- I see a few still running -- but I'm also seeing (in the Alt-F5 window) errors from trying to write log files to EOS at CERN. It's not supposed to do that; it's supposed to write to DataBridge, and then Laurence's cluster takes care of the log collection to EOS. There are authorisation failures when "we" try to write to EOS. However, I'm seeing log files on EOS time-stamped within the last few minutes. There are also merged result files on EOS from within the last ten minutes.
I'll do some more digging around.
ID: 38666
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 38667 - Posted: 29 Apr 2019, 19:00:35 UTC - in response to Message 38666.  

I managed to catch one job in the middle of stage-out. It looks like it does successfully write the log to DataBridge but then tries a fallback to EOS. This is probably a red herring.
ID: 38667

©2024 CERN