Message boards : CMS Application : CMS Tasks Failing

maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,088,272
RAC: 104,020
Message 42436 - Posted: 12 May 2020, 17:59:29 UTC - in response to Message 42434.  

Thanks Ivan for your explanations. Welcome back for CMS.
Is this a correct interpretation for the way from WM-Agent to Boinc?

Batch with Jobs created -> WM-Agent -> HTCondor -> Boinc
Boinc-Server watching HTCondor-Server
if Jobs on HTCondor -> Boinc creates Tasks for CMS@Home (Volunteer)
if NO Jobs on HTCondor -> Boinc Task queue is running drain

Take with your team all the time you need to find a solution for CMS in this not easy time.
ID: 42436
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42485 - Posted: 14 May 2020, 15:51:57 UTC - in response to Message 42436.  
Last modified: 14 May 2020, 15:53:17 UTC

Thanks Ivan for your explanations. Welcome back for CMS.
Is this a correct interpretation for the way from WM-Agent to Boinc?

Batch with Jobs created -> WM-Agent -> HTCondor -> Boinc
Boinc-Server watching HTCondor-Server
if Jobs on HTCondor -> Boinc creates Tasks for CMS@Home (Volunteer)
if NO Jobs on HTCondor -> Boinc Task queue is running drain

Take with your team all the time you need to find a solution for CMS in this not easy time.

Yes, that's pretty much it. Just remember the difference between BOINC tasks and CMS jobs. CMS jobs are created by WMAgent, which maintains a queue (up to 2,000, IIRC) of jobs on the HTCondor server. The BOINC tasks query the condor server for CMS jobs when they need one, and report status back there.
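
The task/job distinction can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names `Agent`, `CondorQueue`, `run_task` and the `max_jobs` limit are made up for illustration and are not WMAgent, HTCondor or BOINC APIs. The only figure taken from the explanation above is the queue size of roughly 2,000.

```python
from collections import deque

QUEUE_TARGET = 2000  # "up to 2,000 IIRC"; treated here as a target size


class CondorQueue:
    """Toy stand-in for the pending-job queue on the HTCondor server."""

    def __init__(self):
        self.pending = deque()

    def fetch_job(self):
        # A BOINC task asks for a CMS job; None means nothing is queued.
        return self.pending.popleft() if self.pending else None


class Agent:
    """Toy stand-in for WMAgent: holds a batch and tops up the condor queue."""

    def __init__(self, batch_size):
        self.backlog = deque(f"job-{i}" for i in range(batch_size))

    def top_up(self, queue, target=QUEUE_TARGET):
        # Move jobs from the agent's backlog until the queue reaches the target.
        while self.backlog and len(queue.pending) < target:
            queue.pending.append(self.backlog.popleft())


def run_task(queue, max_jobs):
    """One BOINC task: pull CMS jobs until the queue is empty or a limit is hit."""
    done = []
    while len(done) < max_jobs:
        job = queue.fetch_job()
        if job is None:
            break
        done.append(job)
    return done
```

In this model a task that gets an empty list back from `run_task` corresponds to a volunteer task that found no CMS job on the condor server.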

You may have noticed jobs flowing again. We waited several days to see if the 200 pending jobs would start up again, and agreed at a meeting yesterday that we would give it a bit more time and then submit another batch. They say great minds think alike -- both my Italian colleague and I submitted batches of 400 and 500 jobs, respectively, within 2-1/2 minutes of each other this morning! It takes time for them to show up on the monitor, so I didn't see that she had already made a batch. Hers are half the size of mine, so they'll run correspondingly faster. As of now, she has 134 running, 248 successful, and 22 pending (that includes post-production jobs that run on the CERN T3_CH_CMSAtHome VM cluster). All of my 500 are still pending.
ID: 42485
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42489 - Posted: 14 May 2020, 21:15:41 UTC - in response to Message 42485.  
Last modified: 14 May 2020, 21:16:42 UTC

...and wouldn't you know it, suddenly the 200 pending jobs from the old batch started flowing again. I don't know if my Italian colleague was able to catch them in flagrante to see if (or how) they'd lost the "don't send to volunteers" requirement (she has access to our HTCondor server, I don't). Maybe we just didn't wait long enough for our hypothetical time-out on these jobs before submitting our new batches. Currently her batch of 400 shows 8 running, 398 successful, and 3 pending. My old batch of 500 is now 105 running, 411 successful, and 2 pending; the new batch has 45 running, 455 pending, and nothing completed yet.
ID: 42489
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,291
RAC: 138,132
Message 42490 - Posted: 15 May 2020, 7:53:40 UTC

@ Ivan, Federica

Is it intended to upload the results/logs to different directory hierarchy levels?
Are your upstream/downstream processes aware of that?
(Links are shortened)

Yesterday:
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_FFv14052020_test-v11/00000/33A6F9B1-186B-7748-9115-0DFF3CEF295F.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/f......_TC_SLC7_FF_CMS_Home_200514_111748_5368/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/0/0d76c908-b7a1-4c4c-98b8-c5f0223e4263-293-0-logArchive.tar.gz

PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/660548A7-C117-9F49-A4B1-65D23EDF417C.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/i...._TC_SLC7_IDR_CMS_Home_200509_140603_2627/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/1/588c22ce-2f55-40d8-b3f6-352e30250b56-14-1-logArchive.tar.gz



This morning:
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz
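
The middle field of the two short log names looks like a Unix epoch timestamp. That is an assumption, not something the project has confirmed, and the meaning of the other two fields is left open here. Under that assumption, a small Python sketch (the function name `embedded_time` is hypothetical) decodes it:

```python
import re
from datetime import datetime, timezone

# Pattern for names like CMS_2333175_1589511258.251129_0.tgz
LOG_NAME = re.compile(r"CMS_(\d+)_(\d+\.\d+)_(\d+)\.tgz$")


def embedded_time(url):
    """Return the middle field as a UTC datetime, or None if the name does not match.

    Assumes the middle field is seconds since the Unix epoch.
    """
    m = LOG_NAME.search(url)
    if m is None:
        return None
    return datetime.fromtimestamp(float(m.group(2)), tz=timezone.utc)


for url in (
    "http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz",
    "http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz",
):
    print(embedded_time(url).strftime("%Y-%m-%d %H:%M:%S UTC"))
```

If the assumption holds, the first name decodes to 2020-05-15 02:54:18 UTC and the second to 2020-05-14 13:13:58 UTC.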
ID: 42490
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,291
RAC: 138,132
Message 42491 - Posted: 15 May 2020, 10:46:39 UTC

... and all CMS tasks from today finished with an error.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559
ID: 42491
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,495,691
RAC: 22,225
Message 42492 - Posted: 15 May 2020, 13:05:08 UTC - in response to Message 42491.  

... and all CMS tasks from today finished with an error.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559



https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385
Everything was working fine for me until about 4 am then all the CMS tasks started failing.

This one seems to have failed because there were no jobs available.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273225488
ID: 42492
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,291
RAC: 138,132
Message 42494 - Posted: 15 May 2020, 14:02:40 UTC - in response to Message 42492.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385
This has been written by your BOINC client.
Process still present 5 min after writing finish file; aborting


Regular shutdown might have been interrupted by a suspend command.
At the next restart the VM was too slow to finish the restart procedure and BOINC killed it.
Things like that are likely to happen if vbox tasks are suspended too often, especially on heavily used machines (disk I/O).

See how often this VM had been restarted:
2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running') # first start
2020-05-14 14:45:45 (4420): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 15:34:11 (15348): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 17:34:15 (8272): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 19:39:57 (8288): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 21:00:01 (15852): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 00:51:05 (16032): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 01:30:42 (1780): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 03:05:57 (7940): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 04:50:02 (3084): VM state change detected. (old = 'PoweredOff', new = 'Running')
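
Counting these restarts by hand gets tedious on long logs. A minimal Python sketch that extracts every PoweredOff -> Running transition from a task's stderr text follows; the function name `vm_starts` is made up for illustration, and the pattern only assumes the line format shown in the excerpt above.

```python
import re

# Matches lines such as:
# 2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running')
START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): VM state change detected\. "
    r"\(old = 'PoweredOff', new = 'Running'\)"
)


def vm_starts(log_text):
    """Return the timestamps of every PoweredOff -> Running transition."""
    starts = []
    for line in log_text.splitlines():
        m = START.match(line.strip())
        if m:
            starts.append(m.group(1))
    return starts


sample = """\
2020-05-14 14:37:54 (9132): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-14 14:45:45 (4420): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-15 04:50:02 (3084): VM state change detected. (old = 'PoweredOff', new = 'Running')
"""
print(len(vm_starts(sample)), "starts")
```

The first entry is the initial start, so the number of restarts is one less than the count.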
ID: 42494
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42501 - Posted: 15 May 2020, 17:34:42 UTC - in response to Message 42492.  

... and all CMS tasks from today finished with an error.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273333559



https://lhcathome.cern.ch/lhcathome/result.php?resultid=273134385
Everything was working fine for me until about 4 am then all the CMS tasks started failing.

This one seems to have failed because there were no jobs available.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273225488

Yes, we ran out of available jobs somewhere around then, so tasks will fail because of that. Currently there are about 10 jobs in the "pending" queue (they probably have what we call the "strange requirement" that spuriously says they can't run on volunteer machines), and about 13 still "running", which are probably on machines that have been switched off and aren't reporting back to the condor server.
ID: 42501
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42502 - Posted: 15 May 2020, 18:06:09 UTC - in response to Message 42490.  

@ Ivan, Federica

Is it intended to upload the results/logs to different directory hierarchy levels?
Are your upstream/downstream processes aware of that?
(Links are shortened)

Yesterday:
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_FFv14052020_test-v11/00000/33A6F9B1-186B-7748-9115-0DFF3CEF295F.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/f......_TC_SLC7_FF_CMS_Home_200514_111748_5368/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/0/0d76c908-b7a1-4c4c-98b8-c5f0223e4263-293-0-logArchive.tar.gz

PUT http://vc-cms-output.s3.cern.ch/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/660548A7-C117-9F49-A4B1-65D23EDF417C.root
PUT http://vc-cms-output.s3.cern.ch/store/unmerged/logs/prod/2020/5/14/i...._TC_SLC7_IDR_CMS_Home_200509_140603_2627/SinglePiE50HCAL_pythia8_2018_GenSimFull/0000/1/588c22ce-2f55-40d8-b3f6-352e30250b56-14-1-logArchive.tar.gz



This morning:
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2333175_1589511258.251129_0.tgz
PUT http://vc-cms-output.s3.cern.ch/logs/CMS_2018665_1589462038.267595_0.tgz

A couple of interesting things there. I'd be expecting volunteer machines to write to https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/ (and they have been). I'm not sure where vc-cms-output.s3.cern.ch fits into things unless it's an alias (but nslookup gives them very different IP addresses). The IP addresses for the latter resolve to what seems to be ceph instances at CERN; I know DataBridge is a ceph instance so I guess there is some aliasing going on.
It almost looks like you've picked up something that should have run as post-production on T3_CH_CMSAtHome. I'll make sure Laurence and Federica know about it.
ID: 42502
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42521 - Posted: 16 May 2020, 16:11:52 UTC - in response to Message 42502.  

OK, there is a sort of aliasing going on.
"The databridge does redirection to the S3 bucket. vc-cms-output.s3.cern.ch is the S3 bucket we use. You can't access that URL directly as you need to have the authentication key. The databridge does something called redirect and sign to give you temporary access after it authenticates you with your BOINC credentials."
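
The general "redirect and sign" idea can be illustrated with a generic HMAC-signed temporary URL. To be clear, this is not the DataBridge code and not the real S3 signature scheme: the function names, the `expires` and `signature` query parameters, and the example host are all assumptions made for the sketch.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _digest(secret, path, expires_at):
    # Sign the path together with the expiry so neither can be altered.
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(bucket_url, path, secret, expires_at):
    """What a bridge might hand back after authenticating the client."""
    query = urlencode({
        "expires": expires_at,
        "signature": _digest(secret, path, expires_at),
    })
    return f"{bucket_url}{path}?{query}"


def is_valid(url, secret, now):
    """What the storage side would check before accepting the request."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _digest(secret, parts.path, expires_at)
    return now <= expires_at and hmac.compare_digest(signature, expected)


url = sign_url("http://bucket.example", "/logs/a.tgz", b"shared-secret", 1000)
print(is_valid(url, b"shared-secret", now=999))   # still within the validity window
print(is_valid(url, b"shared-secret", now=1001))  # past the expiry
```

The point of the design is that the client never holds the long-lived key: it only receives a URL that is valid for one path and a limited time.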
ID: 42521
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42557 - Posted: 21 May 2020, 17:53:11 UTC
Last modified: 22 May 2020, 11:14:05 UTC

I finally got some of your jobs, Ivan.

ireid_TC_SLC7_IDR_CMS_Home_200520_164749_5061
ireid_TC_SLC7_IDR_CMS_Home_200521_130530_953
and today
ireid_TC_SLC7_IDR_CMS_Home_200521_232529_4572
ID: 42557
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42632 - Posted: 27 May 2020, 18:13:47 UTC

Ivan, should CMS work now, or would it be better not to download tasks?
ID: 42632
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 42640 - Posted: 28 May 2020, 0:58:20 UTC - in response to Message 42632.  

Erich,

I just did one. It failed as usual. No need to waste your time.
ID: 42640
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42646 - Posted: 28 May 2020, 6:15:15 UTC - in response to Message 42640.  

I've got two successful ones now. I guess the failures from yesterday evening were due to these server problems described by Nils in the News Thread.
ID: 42646
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42650 - Posted: 28 May 2020, 9:12:28 UTC - in response to Message 42646.  

Well, this morning too, I had a few tasks that failed after about 20 minutes, but also a few that have been running four hours now.

So, all in all, CMS still does not seem to function the way it's supposed to. Really too bad.
After it has not been working well for such a long time now, I think it would make sense to either repair it as soon as possible, or to delete it from the subprojects list.
ID: 42650
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42655 - Posted: 28 May 2020, 18:24:08 UTC

so, I just got a task finished after almost 10 hours with "unknown error code" - how nice:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=275487344

I really don't understand why CMS tasks are offered for download, if in reality they are junk :-(
ID: 42655
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42656 - Posted: 29 May 2020, 5:17:56 UTC - in response to Message 42655.  

Same here.
Process cmsRun ended normally after 10,000 events and the 12 hours elapsed were over, so I expected a normal shutdown and a valid result.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275589481
Instead of that: Exit status 1 (0x00000001) Unknown error code
Last lines of stderr output:
2020-05-28 22:20:15 (9344): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds))
2020-05-28 22:40:35 (9344): Guest Log: [ERROR] Condor ended after 44281 seconds.
2020-05-28 22:40:35 (9344): Guest Log: [INFO] Shutting Down.
2020-05-28 22:40:35 (9344): VM Completion File Detected.
2020-05-28 22:40:35 (9344): VM Completion Message: Condor ended after 44281 seconds.
.
2020-05-28 22:40:35 (9344): Powering off VM.
2020-05-28 22:40:36 (9344): Successfully stopped VM.
2020-05-28 22:40:36 (9344): Deregistering VM. (boinc_084c9fd56a0a55ba, slot#0)
2020-05-28 22:40:36 (9344): Removing network bandwidth throttle group from VM.
2020-05-28 22:40:37 (9344): Removing storage controller(s) from VM.
2020-05-28 22:40:37 (9344): Removing VM from VirtualBox.
2020-05-28 22:40:37 (9344): Removing virtual disk drive from VirtualBox.
22:40:43 (9344): called boinc_finish(1)
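
As a quick check on the figures in that log, the 44281 seconds reported by the guest can be converted to hours and minutes and subtracted from the shutdown time. The Python below is only arithmetic on the logged values; that Condor's counter measures wall-clock time from its own start is an assumption.

```python
from datetime import datetime, timedelta


def split_seconds(total):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


end = datetime(2020, 5, 28, 22, 40, 35)   # "Condor ended" line in the log
elapsed = timedelta(seconds=44281)        # value reported by the guest

print(split_seconds(44281))   # (12, 18, 1)
print(end - elapsed)          # 2020-05-28 10:22:34
```

So the run lasted 12 h 18 min, slightly over the 12-hour mark, which fits the expectation of a normal shutdown rather than an early abort.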
ID: 42656
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42657 - Posted: 29 May 2020, 5:28:26 UTC - in response to Message 42640.  

Well, I think Jim1348 was perfectly right when saying:
Erich,
I just did one. It failed as usual. No need to waste your time.
So I abandon CMS for now; it's a pity, but what can we do if they are not able to repair it after so many months :-(
ID: 42657
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42674 - Posted: 30 May 2020, 13:05:23 UTC

Hi. As you've noticed, we are still having difficulties...
One small step forward was to understand why jobs that were being held pending, because they somehow acquire a requirement not to run on volunteer machines, start running after several days. It turns out that WMAgent times out jobs that haven't run for five days and resubmits them -- without that restriction.
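
That time-out behaviour can be modelled roughly as follows. This is a sketch of the observed effect only, not WMAgent code: the `Job` class, the `no_volunteers` flag standing in for the "strange requirement", and the choice of `>=` at exactly five days are all assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta

PENDING_TIMEOUT = timedelta(days=5)  # "five days", as observed


@dataclass(frozen=True)
class Job:
    name: str
    submitted: datetime
    no_volunteers: bool  # hypothetical flag for the "strange requirement"


def resubmit_if_timed_out(job, now):
    """Return a fresh copy without the restriction once the job has waited too long."""
    if now - job.submitted >= PENDING_TIMEOUT:
        return replace(job, submitted=now, no_volunteers=False)
    return job


stuck = Job("job-1", datetime(2020, 5, 9), no_volunteers=True)
print(resubmit_if_timed_out(stuck, datetime(2020, 5, 12)).no_volunteers)  # still held
print(resubmit_if_timed_out(stuck, datetime(2020, 5, 14)).no_volunteers)  # released
```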
However, we still don't know what it is about the handshaking back and forth between WMAgent and HTCondor that makes failed jobs get resubmitted with the "strange requirement".
On the gripping hand, some other problems have arisen. Since last week, WMAgent has not been keeping the pending queue on the condor server topped up. This means fewer jobs are available, leading to inevitable time-outs and task failures even though the agent has plenty of jobs in its queue.
Yet another problem arose last night, when an upgrade was made to the testbed server we use. Now I can no longer check the status of our jobs -- the WMStats page returns all nulls. Those responsible are aware of this but can't make a fix until Monday or Tuesday.
So, I'm afraid it's time to once again take Little River Band's advice, and shut down for the time being. Sorry.
ID: 42674
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,439
RAC: 101,711
Message 42707 - Posted: 30 May 2020, 18:07:40 UTC - in response to Message 42674.  

hello Ivan,
once more many thanks for your thorough reply :-)

So all we can do is to keep our fingers crossed that one day CMS may work again!
ID: 42707