Thread 'CMS Tasks Failing'

Author	Message
maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,722,395 RAC: 27,537	Message 42436 - Posted: 12 May 2020, 17:59:29 UTC - in response to Message 42434. Thanks Ivan for your explanations. Welcome back for CMS. Is this a correct interpretation of the path from WM-Agent to Boinc?
Batch with jobs created -> WM-Agent -> HTCondor -> Boinc
Boinc server watches the HTCondor server:
if jobs on HTCondor -> Boinc creates tasks for CMS@Home (volunteers)
if NO jobs on HTCondor -> Boinc task queue runs dry
Take all the time you and your team need to find a solution for CMS in this difficult time. ID: 42436 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42485 - Posted: 14 May 2020, 15:51:57 UTC - in response to Message 42436. Last modified: 14 May 2020, 15:53:17 UTC Thanks Ivan for your explanations. Welcome back for CMS. Is this a correct interpretation of the path from WM-Agent to Boinc?
Batch with jobs created -> WM-Agent -> HTCondor -> Boinc
Boinc server watches the HTCondor server:
if jobs on HTCondor -> Boinc creates tasks for CMS@Home (volunteers)
if NO jobs on HTCondor -> Boinc task queue runs dry
Take all the time you and your team need to find a solution for CMS in this difficult time.
Yes, that's pretty much it. Just remember the difference between BOINC tasks and CMS jobs. CMS jobs are created by WMAgent, which maintains a queue (up to 2,000 IIRC) of jobs on the HTCondor server. The BOINC tasks query the condor server for CMS jobs when they need one, and report status back there.
You may have noticed jobs flowing again. We waited several days to see if the 200 pending jobs would start up again, and agreed at a meeting yesterday that we would give it a bit more time and then submit another batch. They say great minds think alike -- my Italian colleague and I submitted batches of 400 and 500 jobs, respectively, within 2-1/2 minutes of each other this morning! It takes time for them to show up on the monitor, so I didn't see that she had already made a batch. Hers are half the size of mine, so they'll run correspondingly faster. As of now, she has 134 running, 248 successful, and 22 pending (that includes post-production jobs that run on the CERN T3_CH_CMSAtHome VM cluster). All of my 500 are still pending. ID: 42485 · Reply Quote
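
To make the task/job distinction above concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the class and function names (CondorQueue, top_up, fetch_job, run_task) are invented for illustration and are not the real WMAgent, HTCondor, or BOINC interfaces. It shows a queue being topped up from a backlog, and tasks that succeed only if they can pull at least one job.

```python
from collections import deque

MAX_QUEUE = 2000  # queue depth quoted from memory ("IIRC") in the post above


class CondorQueue:
    """Toy stand-in for the pending-job queue on the HTCondor server."""

    def __init__(self):
        self.pending = deque()

    def top_up(self, backlog):
        """WMAgent's role: move jobs from its backlog until the queue is full."""
        while backlog and len(self.pending) < MAX_QUEUE:
            self.pending.append(backlog.popleft())

    def fetch_job(self):
        """A BOINC task asks for one CMS job; None means nothing is available."""
        return self.pending.popleft() if self.pending else None


def run_task(queue, jobs_wanted):
    """One BOINC task pulls CMS jobs until it has enough or the queue is empty."""
    done = []
    for _ in range(jobs_wanted):
        job = queue.fetch_job()
        if job is None:
            break
        done.append(job)
    # A task that never obtained a job ends in an error.
    status = "ok" if done else "error: no jobs available"
    return status, done


backlog = deque(f"job-{i}" for i in range(3))  # hypothetical job names
queue = CondorQueue()
queue.top_up(backlog)

first = run_task(queue, jobs_wanted=2)
second = run_task(queue, jobs_wanted=2)
third = run_task(queue, jobs_wanted=2)
print(first, second, third)
```

The third task finds the queue empty and ends with an error, which is the failure mode volunteers see when the queue is not kept topped up.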

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42489 - Posted: 14 May 2020, 21:15:41 UTC - in response to Message 42485. Last modified: 14 May 2020, 21:16:42 UTC ...and wouldn't you know it, suddenly the 200 pending jobs from the old batch started flowing again. I don't know if my Italian colleague was able to catch them in flagrante to see if (or how) they'd lost the "don't send to volunteers" requirement (she has access to our HTCondor server, I don't). Maybe we just didn't wait long enough for our hypothetical time-out on these jobs before submitting our new batches. Currently her batch of 400 shows 8 running, 398 successful, and 3 pending. My old batch of 500 is now 105 running, 411 successful, and 2 pending; the new batch has 45 running, 455 pending, and nothing completed yet. ID: 42489 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 303,702,179 RAC: 107,915	Message 42490 - Posted: 15 May 2020, 7:53:40 UTC @ Ivan, Federica Is it intended to upload the results/logs to different directory hierarchy levels? Are your upstream/downstream processes aware of that? (Links are shortened) Yesterday: PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_FFv14052020_test-v11/00000/33A6F9B1-186B-7748-9115-0DFF3CEF295F.root PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/f......_TC_SLC7_FF_CMS_Home_200514_111748_5368/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/0/0d76c908-b7a1-4c4c-98b8-c5f0223e4263-293-0-logArchive.tar.gz PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/660548A7-C117-9F49-A4B1-65D23EDF417C.root PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/i...._TC_SLC7_IDR_CMS_Home_200509_140603_2627/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/1/588c22ce-2f55-40d8-b3f6-352e30250b56-14-1-logArchive.tar.gz This morning: PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz ID: 42490 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 303,702,179 RAC: 107,915	Message 42491 - Posted: 15 May 2020, 10:46:39 UTC ... and all CMS tasks from today finished with an error. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559 ID: 42491 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 65,477,205 RAC: 13,590	Message 42492 - Posted: 15 May 2020, 13:05:08 UTC - in response to Message 42491. ... and all CMS tasks from today finished with an error. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385 Everything was working fine for me until about 4 am, then all the CMS tasks started failing. This one seems to have failed because there were no jobs available. https://lhcathome.cern.ch/lhcathome/result.php?resultid=273225488 ID: 42492 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 303,702,179 RAC: 107,915	Message 42494 - Posted: 15 May 2020, 14:02:40 UTC - in response to Message 42492. https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385
This has been written by your BOINC client:
Process still present 5 min after writing finish file; aborting
Regular shutdown might have been interrupted by a suspend command. At the next restart the VM was too slow to finish the restart procedure and BOINC killed it. Things like that are likely to happen if vbox tasks are suspended too often, especially on heavily used machines (disk I/O). See how often this VM had been restarted:
2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running') # first start
2020-05-14 14:45:45 (4420): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 15:34:11 (15348): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 17:34:15 (8272): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 19:39:57 (8288): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 21:00:01 (15852): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 00:51:05 (16032): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 01:30:42 (1780): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 03:05:57 (7940): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 04:50:02 (3084): VM state change detected. (old = 'PoweredOff', new = 'Running')
ID: 42494 · Reply Quote
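
For anyone who wants to check their own stderr output for the same pattern, a small Python sketch follows. It assumes the log lines look exactly like the ones quoted above; the function name vm_starts and the shortened sample (including its one filler line) are made up for illustration.

```python
import re

# Matches the "PoweredOff -> Running" transitions as they appear in the log above.
START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): VM state change detected\. "
    r"\(old = 'PoweredOff', new = 'Running'\)"
)


def vm_starts(stderr_text):
    """Return the timestamps of every VM start found in the stderr text."""
    stamps = []
    for line in stderr_text.splitlines():
        match = START.match(line.strip())
        if match:
            stamps.append(match.group(1))
    return stamps


# Shortened sample: three real lines from the post above plus one
# hypothetical filler line that must not be counted.
sample = """\
2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 14:45:45 (4420): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 14:45:50 (4420): some other log line (hypothetical filler)
2020-05-14 15:34:11 (15348): VM state change detected. (old = 'PoweredOff', new = 'Running')
"""

starts = vm_starts(sample)
restarts = max(len(starts) - 1, 0)  # the first start is not a restart
print(f"{len(starts)} starts, {restarts} restarts")
```

Run against the full list in the post above, this would report ten starts, i.e. nine restarts in roughly 14 hours.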

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42501 - Posted: 15 May 2020, 17:34:42 UTC - in response to Message 42492. ... and all CMS tasks from today finished with an error. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385 Everything was working fine for me until about 4 am, then all the CMS tasks started failing. This one seems to have failed because there were no jobs available. https://lhcathome.cern.ch/lhcathome/result.php?resultid=273225488
Yes, we ran out of available jobs somewhere around then, so tasks will fail because of that. Currently there are about 10 jobs in the "pending" queue (they probably have what we call the "strange requirement" that spuriously says they can't run on volunteer machines), and about 13 still "running", which are probably on machines that have been switched off and aren't reporting back to the condor server. ID: 42501 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42502 - Posted: 15 May 2020, 18:06:09 UTC - in response to Message 42490. @ Ivan, Federica Is it intended to upload the results/logs to different directory hierarchy levels? Are your upstream/downstream processes aware of that? (Links are shortened) Yesterday: PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_FFv14052020_test-v11/00000/33A6F9B1-186B-7748-9115-0DFF3CEF295F.root PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/f......_TC_SLC7_FF_CMS_Home_200514_111748_5368/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/0/0d76c908-b7a1-4c4c-98b8-c5f0223e4263-293-0-logArchive.tar.gz PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/660548A7-C117-9F49-A4B1-65D23EDF417C.root PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/i...._TC_SLC7_IDR_CMS_Home_200509_140603_2627/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/1/588c22ce-2f55-40d8-b3f6-352e30250b56-14-1-logArchive.tar.gz This morning: PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz A couple of interesting things there. I'd be expecting volunteer machines to write to https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/ (and they have been). I'm not sure where vc-cms-output.s3.cern.ch fits into things unless it's an alias (but nslookup gives them very different IP addresses). The IP addresses for the latter resolve to what seems to be ceph instances at CERN; I know DataBridge is a ceph instance so I guess there is some aliasing going on. 
It almost looks like you've picked up something that should have run as post-production on T3_CH_CMSAtHome. I'll make sure Laurence and Federica know about it. ID: 42502 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42521 - Posted: 16 May 2020, 16:11:52 UTC - in response to Message 42502. OK, there is a sort of aliasing going on. "The databridge does redirection to the S3 bucket. vc-cms-output.s3.cern.ch is the S3 bucket we use. You can't access that URL directly as you need to have the authentication key. The databridge does something called redirect and sign to give you temporary access after it authenticates you with your BOINC credentials." ID: 42521 · Reply Quote
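
The "redirect and sign" idea can be illustrated with a generic expiring-signature sketch in Python. This is emphatically not the DataBridge code and not the real S3 signing scheme: the secret, the host name storage.example, the query parameter names, and the functions sign_url and verify_url are all assumptions chosen to show the principle that a gateway holding a key can hand out a URL that works only for a limited time.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key known to the gateway and the storage


def _signature(base_url, expires):
    """HMAC over the URL and its expiry time."""
    message = f"{base_url}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(base_url, ttl_seconds, now=None):
    """Gateway side: after authenticating the client, issue a temporary URL."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def verify_url(signed_url, now=None):
    """Storage side: accept only an untampered URL that has not expired."""
    parts = urlparse(signed_url)
    query = parse_qs(parts.query)
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    valid = hmac.compare_digest(sig, _signature(base_url, expires))
    return valid and current <= expires


url = sign_url("http://storage.example/logs/example.tgz", ttl_seconds=300, now=1000)
print(verify_url(url, now=1100))   # inside the validity window
print(verify_url(url, now=2000))   # after expiry
```

A client that goes to the storage URL directly, without the signed query string, is rejected, which matches the remark above that the bucket cannot be accessed without the authentication key.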

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 42557 - Posted: 21 May 2020, 17:53:11 UTC Last modified: 22 May 2020, 11:14:05 UTC I finally got some of your jobs, Ivan. ireid_TC_SLC7_IDR_CMS_Home_200520_164749_5061 ireid_TC_SLC7_IDR_CMS_Home_200521_130530_953 and today ireid_TC_SLC7_IDR_CMS_Home_200521_232529_4572 ID: 42557 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42632 - Posted: 27 May 2020, 18:13:47 UTC Ivan, should CMS work now, or should we better not download tasks? ID: 42632 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 42640 - Posted: 28 May 2020, 0:58:20 UTC - in response to Message 42632. Erich, I just did one. It failed as usual. No need to waste your time. ID: 42640 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42646 - Posted: 28 May 2020, 6:15:15 UTC - in response to Message 42640. I've got two successful ones now. I guess the failures from yesterday evening were due to these server problems described by Nils in the News Thread. ID: 42646 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42650 - Posted: 28 May 2020, 9:12:28 UTC - in response to Message 42646. Well, this morning too I had a few tasks that failed after about 20 minutes, but also a few that have been running four hours now. So, all in all, CMS still does not seem to function the way it's supposed to. Really too bad. Since it has not been working well for such a long time now, I think it would make sense either to repair it as soon as possible or to delete it from the subprojects list. ID: 42650 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42655 - Posted: 28 May 2020, 18:24:08 UTC so, I just got a task finished after almost 10 hours with "unknown error code" - how nice: https://lhcathome.cern.ch/lhcathome/result.php?resultid=275487344 I really don't understand why CMS tasks are offered for download, if in reality they are junk :-( ID: 42655 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 42656 - Posted: 29 May 2020, 5:17:56 UTC - in response to Message 42655. Same here. Process cmsRun ended normally after 10,000 events and the 12 hours of elapsed time were over, so I expected a normal shutdown and a valid result. https://lhcathome.cern.ch/lhcathome/result.php?resultid=275589481
Instead of that:
Exit status 1 (0x00000001) Unknown error code
Last lines of stderr output:
2020-05-28 22:20:15 (9344): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds))
2020-05-28 22:40:35 (9344): Guest Log: [ERROR] Condor ended after 44281 seconds.
2020-05-28 22:40:35 (9344): Guest Log: [INFO] Shutting Down.
2020-05-28 22:40:35 (9344): VM Completion File Detected.
2020-05-28 22:40:35 (9344): VM Completion Message: Condor ended after 44281 seconds. .
2020-05-28 22:40:35 (9344): Powering off VM.
2020-05-28 22:40:36 (9344): Successfully stopped VM.
2020-05-28 22:40:36 (9344): Deregistering VM. (boinc_084c9fd56a0a55ba, slot#0)
2020-05-28 22:40:36 (9344): Removing network bandwidth throttle group from VM.
2020-05-28 22:40:37 (9344): Removing storage controller(s) from VM.
2020-05-28 22:40:37 (9344): Removing VM from VirtualBox.
2020-05-28 22:40:37 (9344): Removing virtual disk drive from VirtualBox.
22:40:43 (9344): called boinc_finish(1)
ID: 42656 · Reply Quote
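
As a quick check of the figure in that log, the 44281 seconds can be pulled out of the Guest Log line and converted to hours with a few lines of Python. The 12-hour limit used for comparison is taken from the post above; the variable names are only illustrative.

```python
import re
from datetime import timedelta

# Guest Log line copied from the stderr output quoted above.
line = "2020-05-28 22:40:35 (9344): Guest Log: [ERROR] Condor ended after 44281 seconds."

match = re.search(r"Condor ended after (\d+) seconds", line)
elapsed = timedelta(seconds=int(match.group(1)))
overrun = elapsed - timedelta(hours=12)  # 12-hour run time mentioned in the post

print(elapsed)  # 12:18:01
print(overrun)  # 0:18:01
```

So Condor stopped about 18 minutes after the 12-hour mark, consistent with the task reaching its time limit rather than dying early.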

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42657 - Posted: 29 May 2020, 5:28:26 UTC - in response to Message 42640. Well, I think Jim1348 was perfectly right when saying: Erich, I just did one. It failed as usual. No need to waste your time. So I'm abandoning CMS for now; it's a pity, but what can we do if they are not able to repair it after so many months :-( ID: 42657 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1158 Credit: 11,829,056 RAC: 6,376	Message 42674 - Posted: 30 May 2020, 13:05:23 UTC Hi. As you've noticed, we are still having difficulties...
One small step forward was to understand why jobs that were being held pending, because they somehow acquire a requirement not to run on volunteer machines, start running after several days. It turns out that WMAgent times out jobs that haven't run for five days and resubmits them -- without that restriction. However, we still don't know what it is about the handshaking back and forth between WMAgent and HTCondor that makes failed jobs get resubmitted with the "strange requirement".
On the gripping hand, some other problems have arisen. Since last week, WMAgent has not been keeping the pending queue on the condor server topped up. This means fewer jobs are available, leading to inevitable time-outs and task failures even though the agent has plenty of jobs in its queue.
Yet another problem arose last night, when an upgrade was made to the testbed server we use. Now I can no longer check the status of our jobs -- the WMStats page returns all nulls. Those responsible are aware of this but can't make a fix until Monday or Tuesday.
So, I'm afraid it's time to once again take Little River Band's advice, and shut down for the time being. Sorry. ID: 42674 · Reply Quote
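
The five-day time-out described above can be pictured with a small Python sketch. It is only a model of the behaviour as reported in this post, not WMAgent code: the Job fields, the volunteer_ok flag standing in for the "strange requirement", and the function resubmit_stale are all invented for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta

TIMEOUT = timedelta(days=5)  # figure given in the post above


@dataclass(frozen=True)
class Job:
    name: str
    submitted: datetime
    volunteer_ok: bool  # False models the "strange requirement"


def resubmit_stale(pending, now):
    """Resubmit pending jobs older than TIMEOUT, dropping the restriction."""
    result = []
    for job in pending:
        if now - job.submitted >= TIMEOUT:
            # Resubmission: fresh timestamp, restriction removed.
            result.append(replace(job, submitted=now, volunteer_ok=True))
        else:
            result.append(job)
    return result


now = datetime(2020, 5, 30)
pending = [
    Job("old-job", datetime(2020, 5, 24), volunteer_ok=False),     # 6 days old
    Job("recent-job", datetime(2020, 5, 28), volunteer_ok=False),  # 2 days old
]
after = resubmit_stale(pending, now)
for job in after:
    print(job.name, job.volunteer_ok)
```

In this model the six-day-old job becomes runnable on volunteer machines while the two-day-old one stays blocked, which is the pattern of pending jobs "suddenly flowing again" seen earlier in the thread.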

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 161,705,387 RAC: 76,557	Message 42707 - Posted: 30 May 2020, 18:07:40 UTC - in response to Message 42674. hello Ivan, once more many thanks for your thorough reply :-) So all we can do is to keep our fingers crossed that one day CMS may work again! ID: 42707 · Reply Quote