Problem writing CMS job results; please avoid CMS tasks until we find the reason
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 929 Credit: 6,104,659 RAC: 976 |
Since some time last night, CMS jobs appear to have had problems writing results to CERN storage (DataBridge). It's not affecting BOINC tasks as far as I can see; they keep running and credit is given. However, Dashboard does see the jobs as failing, hence the large red areas on the job plots. Until we find out where the problem lies, it's best to set No New Tasks or otherwise avoid CMS jobs. I'll let you know when things are back to normal again. |
Joined: 12 Feb 14 Posts: 72 Credit: 4,322,742 RAC: 6,496 |
So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet? |
Joined: 15 Jun 08 Posts: 2147 Credit: 175,745,373 RAC: 110,055 |
"So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?" CMS tasks that are downloaded but not yet started are like an empty box. From the project's perspective, it doesn't matter whether you suspend or cancel them. From the BOINC client's perspective, it may be better to cancel them to free some resources. |
Joined: 29 Aug 05 Posts: 929 Credit: 6,104,659 RAC: 976 |
"So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?" Indeed. As far as I can tell, the job failures are not counting as BOINC task failures, so you are getting BOINC credits any-road-up. But for the project, the results are lost. Please redirect your computing resources to another project where they might count while we are trying to identify the problem. |
Joined: 12 Feb 14 Posts: 72 Credit: 4,322,742 RAC: 6,496 |
If the problem is related to the intake of data that was uploaded to the upload server and passed on to whatever is processing that data, could you just disable the CMS assimilator so that the uploaded results become a backlog to process once the problem is fixed, instead of getting lost? |
Joined: 29 Aug 05 Posts: 929 Credit: 6,104,659 RAC: 976 |
"If the problem is related to the intake of data that was uploaded to the upload server and passed on to whatever is processing that data, could you just disable the CMS assimilator so that the uploaded results become a backlog to process once the problem is fixed, instead of getting lost?" No, I don't think so. There's a separation of duties here. What BOINC sees, and what we call a task, is an instantiation of a virtual machine (VM) that runs under VirtualBox. All that BOINC and its associated machinery see is these VMs. Within the VM we run an HTCondor instance that looks for CMS jobs on a Condor server; if it finds one, it runs it. It's the staging out of the results and logs from those jobs that is failing, and BOINC knows nothing about them. When the VM has run enough jobs, or there are no new jobs to run, or there is a severe error, it terminates, and thus BOINC becomes aware that the task has ended. What I'm not sure about at the moment is why the stage-out errors are not causing the VM to stop with an error; I'm sure they used to. |
Joined: 15 Jun 08 Posts: 2147 Credit: 175,745,373 RAC: 110,055 |
Result upload succeeded (status: 200; size: 50 MB): `[25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" 200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT` |
Joined: 29 Aug 05 Posts: 929 Credit: 6,104,659 RAC: 976 |
"Result upload succeeded (status: 200; size: 50 MB):" Good, thanks. The job charts are starting to look better. |
Joined: 19 Feb 08 Posts: 707 Credit: 4,335,771 RAC: 11 |
VirtualBox downloads are not possible since the certificate is not secure. This happens in both Windows and Linux. Tullio |
Joined: 29 Aug 05 Posts: 929 Credit: 6,104,659 RAC: 976 |
"VirtualBox downloads are not possible since the certificate is not secure. This happens in both Windows and Linux." Which certificate exactly? I was running jobs up until a short time ago -- I see a few still running -- but I'm also seeing (in the Alt-F5 window) errors from trying to write log files to EOS at CERN. It's not supposed to do that; it's supposed to write to DataBridge, and then Laurence's cluster takes care of the log collection to EOS. There are authorisation failures when "we" try to write to EOS. However, I'm seeing log files on EOS time-stamped within the last few minutes. There are also merged result files on EOS from within the last ten minutes. I'll do some more digging around. |
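
The separation of duties described in the thread (BOINC sees only the VM, while the CMS jobs run inside it) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `JobOutcome`, `TaskReport`, `run_vm_task` and `max_jobs` are hypothetical names invented for illustration, not part of BOINC, HTCondor or the CMS software, and the behaviour on stage-out failure reflects what was observed during this incident rather than the intended design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class JobOutcome(Enum):
    """Hypothetical outcomes of one CMS job run inside the VM."""
    OK = "ok"
    STAGE_OUT_FAILED = "stage-out failed"
    SEVERE_ERROR = "severe error"


@dataclass
class TaskReport:
    """Summary of one VM run.

    BOINC only learns `vm_exit_ok`; the per-job counters are what a
    job dashboard would see and are invisible to BOINC.
    """
    vm_exit_ok: bool
    jobs_run: int
    jobs_with_results: int


def run_vm_task(job_queue: Iterable[JobOutcome], max_jobs: int) -> TaskReport:
    """Run jobs until the quota is reached, the queue is empty, or a severe error occurs."""
    jobs_run = 0
    jobs_with_results = 0
    for outcome in job_queue:
        if jobs_run >= max_jobs:
            break  # the VM has run enough jobs
        jobs_run += 1
        if outcome is JobOutcome.SEVERE_ERROR:
            # Only a severe error makes the VM, and hence the BOINC task, fail.
            return TaskReport(False, jobs_run, jobs_with_results)
        if outcome is JobOutcome.OK:
            jobs_with_results += 1
        # STAGE_OUT_FAILED: the result is lost to the project, but the VM
        # carries on (assumed here to mirror the behaviour seen in this thread).
    return TaskReport(True, jobs_run, jobs_with_results)


if __name__ == "__main__":
    report = run_vm_task([JobOutcome.STAGE_OUT_FAILED] * 3, max_jobs=5)
    print(report)
```

In this model, a task whose jobs all fail at stage-out still ends with `vm_exit_ok=True`, which matches the observation that credit was granted while the dashboard showed failures.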
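
The upload log line quoted in the thread carries the HTTP status and the byte count that were summarised as "status: 200; size: 50 MB". The following is a minimal sketch of how such a line could be checked by script. The regular expression and the helper `parse_upload_line` are assumptions made for illustration and fit only the layout of the quoted line; they are not an official log format definition.

```python
import re

# The log line as quoted in the thread (credentials already snipped there).
SAMPLE_LINE = (
    '[25/Apr/2019:09:36:09 +0200] "PUT '
    'http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/'
    'RelValJpsiMuMu_Pt-8/GEN-SIM/'
    'JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/'
    '00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root'
    '?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" '
    '200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" '
    'TCP_MISS:HIER_DIRECT'
)

# Assumed layout: [timestamp] "METHOD url HTTP/x.y" status size ...
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+HTTP/[\d.]+"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)'
)


def parse_upload_line(line: str) -> dict:
    """Extract timestamp, method, status and size (decimal MB) from one log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like an upload log entry")
    status = int(match.group("status"))
    size_bytes = int(match.group("size"))
    return {
        "timestamp": match.group("timestamp"),
        "method": match.group("method"),
        "status": status,
        "size_mb": round(size_bytes / 1_000_000, 1),
        # Treat any 2xx response to a PUT as a successful upload.
        "succeeded": match.group("method") == "PUT" and 200 <= status < 300,
    }


if __name__ == "__main__":
    print(parse_upload_line(SAMPLE_LINE))
```

For the quoted line, the size field of 50283515 bytes works out to roughly 50.3 MB in decimal units, consistent with the "50 MB" in the post.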
©2023 CERN