Message boards : CMS Application : CMS Tasks Failing
Joined: 2 May 07 Posts: 2246 Credit: 174,083,439 RAC: 8,284

Thanks, Ivan, for your explanations. Welcome back, CMS.

Is this a correct interpretation of the path from WMAgent to BOINC?

Batch with jobs created -> WMAgent -> HTCondor -> BOINC
The BOINC server watches the HTCondor server:
- if there are jobs on HTCondor -> BOINC creates tasks for CMS@Home (volunteers)
- if there are no jobs on HTCondor -> the BOINC task queue runs dry

Take all the time you and your team need to find a solution for CMS in these difficult times.
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

> Thanks, Ivan, for your explanations. Welcome back, CMS.

Yes, that's pretty much it. Just remember the difference between BOINC tasks and CMS jobs. CMS jobs are created by WMAgent, which maintains a queue (up to 2,000, IIRC) of jobs on the HTCondor server. The BOINC tasks query the condor server for CMS jobs when they need one, and report status back there.

You may have noticed jobs flowing again. We waited several days to see if the 200 pending jobs would start up again, and agreed at a meeting yesterday that we would give it a bit more time and then submit another batch. They say great minds think alike -- my Italian colleague and I submitted batches of 400 and 500 jobs, respectively, within two and a half minutes of each other this morning! It takes time for them to show up on the monitor, so I didn't see that she had already made a batch. Her jobs are half the size of mine, so they'll run correspondingly faster. As of now, she has 134 running, 248 successful, and 22 pending (that includes post-production jobs that run on the CERN T3_CH_CMSAtHome VM cluster). All of my 500 are still pending.
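The task/job relationship Ivan describes can be sketched in a few lines: WMAgent keeps a queue of CMS jobs on the HTCondor server, and each BOINC task asks that queue for a job when it needs one; if the queue is empty, the task has nothing to do and fails. This is an illustrative sketch only -- all names are hypothetical, and it is not the real WMAgent/HTCondor API.

```python
from collections import deque

CONDOR_QUEUE_LIMIT = 2000  # "up to 2,000 IIRC" jobs kept pending by WMAgent


class CondorServer:
    """Stand-in for the HTCondor pending-job queue (illustrative only)."""

    def __init__(self):
        self.pending = deque()

    def submit(self, jobs):
        # WMAgent tops the queue up, but never beyond the limit.
        for job in jobs:
            if len(self.pending) < CONDOR_QUEUE_LIMIT:
                self.pending.append(job)

    def fetch_job(self):
        # A BOINC task asks for the next CMS job, if any.
        return self.pending.popleft() if self.pending else None


def run_boinc_task(condor):
    """One BOINC task querying the condor server for a CMS job."""
    job = condor.fetch_job()
    if job is None:
        # No CMS jobs available: this is the drain/failure scenario
        # discussed throughout this thread.
        return "failed: no jobs available"
    return f"ran {job}"


condor = CondorServer()
condor.submit([f"job-{i}" for i in range(3)])
print(run_boinc_task(condor))  # ran job-0
print(run_boinc_task(condor))  # ran job-1
print(run_boinc_task(condor))  # ran job-2
print(run_boinc_task(condor))  # failed: no jobs available
```

The key point the sketch makes is that BOINC tasks and CMS jobs fail independently: a task can start fine and still fail later simply because the queue it polls has run dry.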
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

...and wouldn't you know it, suddenly the 200 pending jobs from the old batch started flowing again. I don't know if my Italian colleague was able to catch them in flagrante to see if (or how) they'd lost the "don't send to volunteers" requirement (she has access to our HTCondor server; I don't). Maybe we just didn't wait long enough for our hypothetical time-out on these jobs before submitting our new batches.

Currently her batch of 400 shows 8 running, 398 successful, and 3 pending. My old batch of 500 is now 105 running, 411 successful, and 2 pending; the new batch has 45 running, 455 pending, and nothing completed yet.
Joined: 15 Jun 08 Posts: 2560 Credit: 256,330,545 RAC: 94,675

@ Ivan, Federica

Is it intended that the results/logs are uploaded to different directory hierarchy levels? Are your upstream/downstream processes aware of that? (Links are shortened.)

Yesterday:
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_FFv14052020_test-v11/00000/33A6F9B1-186B-7748-9115-0DFF3CEF295F.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/f......_TC_SLC7_FF_CMS_Home_200514_111748_5368/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/0/0d76c908-b7a1-4c4c-98b8-c5f0223e4263-293-0-logArchive.tar.gz
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/660548A7-C117-9F49-A4B1-65D23EDF417C.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/i...._TC_SLC7_IDR_CMS_Home_200509_140603_2627/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/1/588c22ce-2f55-40d8-b3f6-352e30250b56-14-1-logArchive.tar.gz

This morning:
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz
Joined: 15 Jun 08 Posts: 2560 Credit: 256,330,545 RAC: 94,675

... and all CMS tasks from today finished with an error. Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559
Joined: 17 Oct 06 Posts: 89 Credit: 57,307,799 RAC: 7,931

> ... and all CMS tasks from today finished with an error.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385

Everything was working fine for me until about 4 am; then all the CMS tasks started failing. This one seems to have failed because there were no jobs available:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273225488
Joined: 15 Jun 08 Posts: 2560 Credit: 256,330,545 RAC: 94,675

> https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385

This has been written by your BOINC client:

Process still present 5 min after writing finish file; aborting

A regular shutdown might have been interrupted by a suspend command. At the next restart the VM was too slow to finish the restart procedure, and BOINC killed it. Things like that are likely to happen if vbox tasks are suspended too often, especially on heavily used machines (disk I/O).

See how often this VM had been restarted:
2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running') # first start
2020-05-14 14:45:45 (4420): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 15:34:11 (15348): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 17:34:15 (8272): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 19:39:57 (8288): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 21:00:01 (15852): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 00:51:05 (16032): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 01:30:42 (1780): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 03:05:57 (7940): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 04:50:02 (3084): VM state change detected. (old = 'PoweredOff', new = 'Running')
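The behaviour behind the log line "Process still present 5 min after writing finish file; aborting" can be sketched roughly as follows: the client expects the task process to exit shortly after its finish file appears, and aborts the task if the process is still alive after a grace period. This is an illustrative sketch of that idea, not actual BOINC client code; function names and the polling interval are assumptions.

```python
import time

FINISH_GRACE = 5 * 60  # "5 min after writing finish file"


def watch_after_finish_file(process_alive, grace=FINISH_GRACE, poll=10,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll the task process once the finish file has been written.

    Returns 'ok' if the process exits within the grace period, or
    'aborted' if it is still present when the grace period runs out
    (the situation seen in the log above).
    """
    deadline = clock() + grace
    while clock() < deadline:
        if not process_alive():
            return "ok"          # normal shutdown completed in time
        sleep(poll)
    return "aborted"             # still present -> task marked as error
```

This also illustrates why frequent suspends hurt vbox tasks: a VM that is slow to power off (heavy disk I/O, interrupted shutdown) can miss the grace window even though nothing is otherwise wrong with the job.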
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

> ... and all CMS tasks from today finished with an error.

Yes, we ran out of available jobs somewhere around then, so tasks will fail because of that. Currently there are about 10 jobs in the "pending" queue (they probably have what we call the "strange requirement" that spuriously says they can't run on volunteer machines), and about 13 still "running", which are probably on machines that have been switched off and aren't reporting back to the condor server.
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

> @ Ivan, Federica

A couple of interesting things there. I'd expect volunteer machines to write to https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/ (and they have been). I'm not sure where vc-cms-output.s3.cern.ch fits into things, unless it's an alias (but nslookup gives them very different IP addresses). The IP addresses for the latter resolve to what seem to be Ceph instances at CERN; I know DataBridge is a Ceph instance, so I guess there is some aliasing going on. It almost looks like you've picked up something that should have run as post-production on T3_CH_CMSAtHome. I'll make sure Laurence and Federica know about it.
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

OK, there is a sort of aliasing going on:

"The databridge does redirection to the S3 bucket. vc-cms-output.s3.cern.ch is the S3 bucket we use. You can't access that URL directly as you need to have the authentication key. The databridge does something called redirect and sign to give you temporary access after it authenticates you with your BOINC credentials."
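The "redirect and sign" pattern described above can be sketched with a simplified HMAC construction: after authenticating the volunteer's BOINC credentials, a gateway hands back a short-lived URL whose signature the storage backend can verify without a further login. This is a generic illustration of the pattern, not the real DataBridge/S3 protocol; the secret key and signature scheme are assumptions.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical shared secret, held only by the signing gateway
# (in the real setup this would be the S3 authentication key
# volunteers never see).
SECRET_KEY = b"example-bucket-secret"


def sign_url(bucket_url, key, expires_in=600, now=None):
    """Build a temporary signed URL for one object upload/download."""
    expires = int(now if now is not None else time.time()) + expires_in
    payload = f"{key}:{expires}".encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": sig})
    return f"{bucket_url}/{key}?{query}"


def verify_url(key, expires, signature, now=None):
    """The storage side checks the signature and the expiry time."""
    ts = int(now if now is not None else time.time())
    if ts > int(expires):
        return False  # temporary access has lapsed
    expected = hmac.new(SECRET_KEY, f"{key}:{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The design point is that the bucket URL (here, vc-cms-output.s3.cern.ch) never needs to know about BOINC accounts: the gateway authenticates the volunteer once and the signature carries that trust forward for a few minutes.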
Joined: 14 Jan 10 Posts: 1432 Credit: 9,594,942 RAC: 6,465

I finally got some of your jobs, Ivan:
ireid_TC_SLC7_IDR_CMS_Home_200520_164749_5061
ireid_TC_SLC7_IDR_CMS_Home_200521_130530_953
and today
ireid_TC_SLC7_IDR_CMS_Home_200521_232529_4572
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

Ivan, should CMS work now, or would it be better not to download tasks?
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0

Erich, I just did one. It failed as usual. No need to waste your time.
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

I've got two successful ones now. I guess the failures from yesterday evening were due to the server problems described by Nils in the News thread.
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

Well, this morning I again had a few tasks that failed after about 20 minutes, but also a few that have now been running for four hours. So, all in all, CMS still does not seem to function the way it's supposed to. Really too bad. After it has not been working well for such a long time now, I think it would make sense to either repair it as soon as possible or to remove it from the subprojects list.
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

So, I just got a task finished after almost 10 hours with "unknown error code" -- how nice:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275487344

I really don't understand why CMS tasks are offered for download if in reality they are junk :-(
Joined: 14 Jan 10 Posts: 1432 Credit: 9,594,942 RAC: 6,465

Same here. Process cmsRun ended normally after 10,000 events and the 12 hours had elapsed, so I expected a normal shutdown and a valid result.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275589481

Instead of that:
Exit status 1 (0x00000001) Unknown error code

Last lines of stderr output:
2020-05-28 22:20:15 (9344): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds))
2020-05-28 22:40:35 (9344): Guest Log: [ERROR] Condor ended after 44281 seconds.
2020-05-28 22:40:35 (9344): Guest Log: [INFO] Shutting Down.
2020-05-28 22:40:35 (9344): VM Completion File Detected.
2020-05-28 22:40:35 (9344): VM Completion Message: Condor ended after 44281 seconds.
2020-05-28 22:40:35 (9344): Powering off VM.
2020-05-28 22:40:36 (9344): Successfully stopped VM.
2020-05-28 22:40:36 (9344): Deregistering VM. (boinc_084c9fd56a0a55ba, slot#0)
2020-05-28 22:40:36 (9344): Removing network bandwidth throttle group from VM.
2020-05-28 22:40:37 (9344): Removing storage controller(s) from VM.
2020-05-28 22:40:37 (9344): Removing VM from VirtualBox.
2020-05-28 22:40:37 (9344): Removing virtual disk drive from VirtualBox.
22:40:43 (9344): called boinc_finish(1)
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

Well, I think Jim1348 was perfectly right when saying:

> Erich, I just did one. It failed as usual. No need to waste your time.

So I abandon CMS for now; it's a pity, but what can we do if they are not able to repair it after so many months :-(
Joined: 29 Aug 05 Posts: 1065 Credit: 8,081,833 RAC: 16,543

Hi. As you've noticed, we are still having difficulties...

One small step forward was to understand why jobs that were being held pending, because they somehow acquire a requirement not to run on volunteer machines, start running after several days. It turns out that WMAgent times out jobs that haven't run for five days and resubmits them -- without that restriction. However, we still don't know what it is about the handshaking back and forth between WMAgent and HTCondor that makes failed jobs get resubmitted with the "strange requirement".

On the gripping hand, some other problems have arisen. Since last week, WMAgent has not been keeping the pending queue on the condor server topped up. This means fewer jobs are available, leading to inevitable time-outs and task failures even though the agent has plenty of jobs in its queue.

Yet another problem arose last night, when an upgrade was made to the testbed server we use. Now I can no longer check the status of our jobs -- the WMStats page returns all nulls. The people responsible are aware of this but can't make a fix until Monday or Tuesday.

So, I'm afraid it's time to once again take Little River Band's advice and shut down for the time being. Sorry.
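The five-day time-out behaviour Ivan describes can be sketched as follows: jobs pending longer than the time-out are resubmitted as fresh copies, and the resubmitted copy no longer carries the spurious "don't run on volunteer machines" requirement. This is a toy model of the described behaviour, not WMAgent code; the class, function, and requirement names are all illustrative.

```python
from dataclasses import dataclass, field

TIMEOUT_DAYS = 5  # "WMAgent times out jobs that haven't run for five days"


@dataclass
class Job:
    name: str
    submitted_day: int
    # ClassAd-like requirements; the spurious one is modelled as a flag.
    requirements: set = field(default_factory=set)


def resubmit_stale_jobs(pending, today):
    """Replace jobs pending for more than TIMEOUT_DAYS with fresh copies.

    The fresh copy drops the hypothetical 'no_volunteer_machines'
    requirement, which is why held jobs eventually start flowing
    to volunteers after several days.
    """
    refreshed = []
    for job in pending:
        if today - job.submitted_day > TIMEOUT_DAYS:
            reqs = job.requirements - {"no_volunteer_machines"}
            refreshed.append(Job(job.name, today, reqs))
        else:
            refreshed.append(job)
    return refreshed
```

Under this model the mystery in the thread is not the time-out itself but the other direction: how failed jobs re-acquire the restriction on resubmission, which the sketch deliberately does not attempt to answer.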
Joined: 18 Dec 15 Posts: 1835 Credit: 120,276,590 RAC: 67,404

Hello Ivan, once more many thanks for your thorough reply :-) So all we can do is keep our fingers crossed that one day CMS will work again!