Thread 'Problem writing CMS job results; please avoid CMS tasks until we find the reason'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38576 - Posted: 18 Apr 2019, 15:44:45 UTC Since some time last night, CMS jobs appear to have problems writing results to CERN storage (DataBridge). It's not affecting BOINC tasks as far as I can see; they keep running and credit is given. However, Dashboard does see the jobs as failing, hence the large red areas on the job plots. Until we find out where the problem lies, it's best to set No New Tasks or otherwise avoid CMS jobs. I'll let you know when things are back to normal again.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38577 - Posted: 18 Apr 2019, 16:18:08 UTC - in response to Message 38576. I'm starting to find some "Permission denied" messages; perhaps a certificate has expired somewhere.

Jesse Viviano Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 38596 - Posted: 21 Apr 2019, 12:38:24 UTC - in response to Message 38576. So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2760 Credit: 305,761,827 RAC: 131,622	Message 38597 - Posted: 21 Apr 2019, 13:28:09 UTC - in response to Message 38596. [quote]So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?[/quote] CMS tasks that are downloaded but not yet started are like an empty box. From the project's perspective it doesn't matter whether to suspend or to cancel them. From the BOINC client's perspective it may be better to cancel them to free some resources.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38599 - Posted: 21 Apr 2019, 17:25:29 UTC - in response to Message 38597. [quote][quote]So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?[/quote] CMS tasks that are downloaded but not yet started are like an empty box. From the project's perspective it doesn't matter whether to suspend or to cancel them. From the BOINC client's perspective it may be better to cancel them to free some resources.[/quote] Indeed. As far as I can tell, the job failures are not counting as BOINC task failures, so you are getting BOINC credits any-road-up. But for the project, the results are lost. Please redirect your computing resources to another project where they might count while we are trying to identify the problem.

Jesse Viviano Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 38601 - Posted: 22 Apr 2019, 6:34:11 UTC If the problem is related to the intake of data that was uploaded into the upload server into whatever is processing that data, could you just disable the CMS assimilator so that the uploaded results just become a backlog to process once the problem is fixed instead of getting lost?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38604 - Posted: 22 Apr 2019, 10:29:14 UTC - in response to Message 38601. [quote]If the problem is related to the intake of data that was uploaded into the upload server into whatever is processing that data, could you just disable the CMS assimilator so that the uploaded results just become a backlog to process once the problem is fixed instead of getting lost?[/quote] No, I don't think so. There's a separation of duties here. What BOINC sees, and what we call a task, is an instantiation of a virtual machine (VM) that runs under VirtualBox. All that BOINC and its associated machinery see is these VMs. Within the VM we run an HTCondor instance that looks for CMS jobs on a Condor server; if it finds one, it runs it. It's the staging out of the results and logs from those jobs that is failing; BOINC knows nothing about them. When the VM has run enough jobs, or there are no new jobs to run, or there is a severe error, it terminates and thus BOINC becomes aware that the task has ended. What I'm not sure about at the moment is why the stage-out errors are not causing the VM to stop with an error; I'm sure they used to.
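A minimal sketch of the separation of duties described in the post above, written as a toy model rather than anything taken from the real BOINC, VirtualBox, or HTCondor code. All names here (`JobResult`, `VMRun`, `run_vm`, `boinc_view`, `max_jobs`) are hypothetical. It only illustrates why a stage-out failure inside the VM need not show up as a failed BOINC task.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class JobResult:
    """Outcome of one CMS job run inside the VM (hypothetical model)."""
    name: str
    staged_out: bool  # True if results and logs reached storage


@dataclass
class VMRun:
    """What happened inside one VM, i.e. one BOINC task."""
    results: List[JobResult] = field(default_factory=list)
    exit_reason: str = ""


def run_vm(
    fetch_job: Callable[[], Optional[str]],
    run_job: Callable[[str], JobResult],
    max_jobs: int = 3,  # assumed limit; the real threshold is not stated
) -> VMRun:
    """Pull jobs until enough have run, none are left, or a severe error occurs."""
    run = VMRun()
    while len(run.results) < max_jobs:
        job = fetch_job()
        if job is None:
            run.exit_reason = "no new jobs"
            return run
        try:
            run.results.append(run_job(job))
        except RuntimeError:
            run.exit_reason = "severe error"
            return run
    run.exit_reason = "ran enough jobs"
    return run


def boinc_view(run: VMRun) -> str:
    """BOINC only learns how the VM ended, not what each job did."""
    return "error" if run.exit_reason == "severe error" else "success"


if __name__ == "__main__":
    queue = iter(["job-a", "job-b"])
    # Every job fails to stage out, but none raises a severe error.
    run = run_vm(lambda: next(queue, None), lambda j: JobResult(j, staged_out=False))
    print(run.exit_reason, boinc_view(run))
```

In this model the per-job `staged_out` flag never reaches `boinc_view`, which mirrors the observation that credit is still granted while the job results are lost.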

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38629 - Posted: 25 Apr 2019, 6:37:52 UTC We seem to be running jobs successfully again. Unfortunately I'm at a conference today so can't verify all details until tonight. Resume tasks with care.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38632 - Posted: 25 Apr 2019, 7:39:47 UTC - in response to Message 38629. Received confirmation that it was a quota problem which should now be fixed.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2760 Credit: 305,761,827 RAC: 131,622	Message 38633 - Posted: 25 Apr 2019, 7:47:33 UTC Result upload succeeded (status: 200; size: 50 MB): [pre][25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" 200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT[/pre]
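For anyone who wants to check their own proxy logs for the same thing, here is a rough sketch that pulls the HTTP status and byte count out of a line shaped like the one above. The regular expression is an assumption based only on that single example, not on a documented log format, and the sample URL is a shortened, hypothetical stand-in for the real one. The function name `parse_upload_line` is made up for this illustration.

```python
import re
from typing import Any, Dict, Optional

# Pattern inferred from the one log line quoted in the post (an assumption):
# [timestamp] "METHOD URL PROTOCOL" STATUS SIZE ...
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size_bytes>\d+)'
)


def parse_upload_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the parsed fields, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields: Dict[str, Any] = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size_bytes"] = int(fields["size_bytes"])
    fields["size_mb"] = fields["size_bytes"] / 1_000_000  # decimal megabytes
    # Treat any 2xx answer to a PUT as a successful upload.
    fields["ok"] = fields["method"] == "PUT" and 200 <= fields["status"] < 300
    return fields


# Shortened, hypothetical URL; status and size are the values from the post.
SAMPLE = (
    '[25/Apr/2019:09:36:09 +0200] '
    '"PUT http://storage.example/store/file.root?Expires=1556184955 HTTP/1.1" '
    '200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" '
    'TCP_MISS:HIER_DIRECT'
)

if __name__ == "__main__":
    parsed = parse_upload_line(SAMPLE)
    if parsed is not None:
        print(parsed["status"], parsed["size_bytes"], parsed["ok"])
```

With the values from the post, 50283515 bytes works out to roughly 50 MB, which matches the summary given with the log line.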

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38636 - Posted: 25 Apr 2019, 9:47:56 UTC - in response to Message 38633. [quote]Result upload succeeded (status: 200; size: 50 MB): [pre][25/Apr/2019:09:36:09 +0200] "PUT http://vc-cms-output.s3.cern.ch/store/unmerged/CMSSW_10_4_0/RelValJpsiMuMu_Pt-8/GEN-SIM/JpsiMuMu_Pt_8_forSTEAM_13TeV_TuneCUETP8M1_2018_GenSimFull_TC_OneTask_CMS_Home_IDRv4i-v11/00006/279EE467-80EC-C947-A451-D7DBADCEF1D3.root?AWSAccessKeyId=<snip>&Signature=<snip>&Expires=1556184955 HTTP/1.1" 200 50283515 "-" "gfal2-util/1.5.1 gfal2/2.16.1 neon/0.0.29" TCP_MISS:HIER_DIRECT[/pre][/quote] Good, thanks. The job charts are starting to look better.

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 38662 - Posted: 29 Apr 2019, 10:39:04 UTC Last modified: 29 Apr 2019, 10:39:24 UTC VirtualBox downloads are not possible since the certificate is not secure. This happens on both Windows and Linux. Tullio

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38666 - Posted: 29 Apr 2019, 18:40:01 UTC - in response to Message 38662. [quote]VirtualBox downloads are not possible since the certificate is not secure. This both in Windows and Linux. Tullio[/quote] Which certificate exactly? I was running jobs up until a short time ago -- I see a few still running -- but I'm also seeing (in the Alt-F5 window) errors from trying to write log files to EOS at CERN. It's not supposed to do that; it's supposed to write to DataBridge, and then Laurence's cluster takes care of the log collection to EOS. There are authorisation failures when "we" try to write to EOS. However, I'm seeing log files on EOS time-stamped within the last few minutes. There are also merged result files on EOS from within the last ten minutes. I'll do some more digging around.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,966,376 RAC: 8,434	Message 38667 - Posted: 29 Apr 2019, 19:00:35 UTC - in response to Message 38666. I managed to catch one job in the middle of stage-out. It looks like it does successfully write the log to DataBridge but then tries a fallback to EOS. This is probably a red herring.