Possible disruption in the next several hours
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
Mea culpa! I realised today that I'd accidentally typed one zero too many in the WMAgent request for the current batch and launched ten times too many jobs! Alan tells me this could overload the agent, so I've submitted a "normal" batch and set this one to "force-complete". This will clear out its queue, but I don't know exactly what effect it will have on currently-running jobs. So some jobs may be reported as failed, or otherwise faulty, but once the tasks start picking up jobs from the new batch it should all clear up. My apologies, I hope it's not too traumatic. |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
|
Joined: 18 Dec 15 Posts: 1562 Credit: 58,039,437 RAC: 45,446 |
[quote]... This will clear out its queue, but I don't know exactly what effect it will have on currently-running jobs.[/quote] What it did was something like this:
2017-07-03 18:44:59 (6304): Guest Log: [INFO] CMS application starting. Check log files.
2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] HTCondor ping
2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] 0
2017-07-03 18:45:29 (6304): Guest Log: [INFO] New Job Starting in slot1
2017-07-03 18:45:29 (6304): Guest Log: [INFO] Condor JobID: 133232.17 in slot1
2017-07-03 18:46:59 (6304): Guest Log: [INFO] WMAgent_JobID = 62678 in slot1
2017-07-03 20:21:16 (6304): Guest Log: [ERROR] Condor exited after 5769s without running a job.
2017-07-03 20:21:16 (6304): Guest Log: [INFO] Shutting Down. |
Joined: 17 Sep 04 Posts: 89 Credit: 27,826,860 RAC: 11,662 |
Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry? Thanks! Regards, Bob P. |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
[quote]Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry?[/quote] Yes, basically. That's why I try to keep the pump primed, and also why I warn you when I know a drought is coming up, so you can set No New Tasks or transfer to another project. It's perhaps a little more onerous than it might look: I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out. Current batches run for about two days, depending on how many people are currently running. |
Joined: 18 Dec 15 Posts: 1562 Credit: 58,039,437 RAC: 45,446 |
[quote]... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote] Hm, is there no way to get this automated somehow? |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
[quote]... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote] Possibly. I've never bothered because I didn't expect to be on this project for so long... Keeping an eye on the monitors is only slightly more trouble than keeping an eye on my e-mail, anyway. |
Joined: 18 Dec 15 Posts: 1562 Credit: 58,039,437 RAC: 45,446 |
[quote]... because I didn't expect to be on this project for so long...[/quote] Which in a way is good, though; I think you are doing a perfect job :-) |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
|
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
|
Joined: 18 Dec 15 Posts: 1562 Credit: 58,039,437 RAC: 45,446 |
Ivan, any news on this? |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
|
Joined: 18 Dec 15 Posts: 1562 Credit: 58,039,437 RAC: 45,446 |
Thanks Ivan, as always, for your timely responses :-))) So I'll see tomorrow morning what's happening; I'm falling into bed right now. |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
|
Joined: 24 Oct 04 Posts: 1032 Credit: 48,541,507 RAC: 2,326 |
My first check of the day is not what I expected. About 60 errors:
VM Heartbeat file specified, but missing.
VM Heartbeat file specified, but missing file system status. (errno = '2') |
Joined: 29 Aug 05 Posts: 929 Credit: 6,103,434 RAC: 964 |
[quote]... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote] Well, after last night's problems (which essentially happened between my leaving work and arriving at home :-( ), I have implemented a cron job to check the queue every 30 minutes. It's probably got bugs, as I'm not a letter-perfect shell programmer... Since the check relies on my having a valid CMS proxy certificate, I've put in a check for it expiring -- at least I could test that bit. |
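
The script itself is not shown anywhere in the thread, so the following is only a rough sketch of the decision logic such a periodic check might contain, written in Python rather than shell. Everything in it is an assumption made for illustration: the `QueueStatus` fields, the `check` function, and both thresholds are hypothetical, and how the numbers would actually be read from the agent or the proxy certificate is deliberately left out. The only detail taken from the post is that the proxy is tested before the queue, since the queue check depends on a valid proxy.

```python
from __future__ import annotations

from dataclasses import dataclass

# From the post: the cron job runs every 30 minutes. A crontab entry for that
# schedule would start with "*/30 * * * *"; the command after it is not known.
CHECK_INTERVAL_MIN = 30


@dataclass
class QueueStatus:
    """Snapshot of the values a check would need (all fields are hypothetical)."""
    pending_jobs: int        # jobs still waiting in the current batch
    jobs_per_hour: float     # recent rate at which volunteers consume jobs
    proxy_seconds_left: int  # remaining lifetime of the proxy certificate


def check(status: QueueStatus,
          min_hours_of_work: float = 6.0,
          min_proxy_hours: float = 12.0) -> list[str]:
    """Return human-readable warnings; an empty list means nothing to do."""
    warnings: list[str] = []

    # Without a valid proxy the queue cannot be queried, so test it first.
    if status.proxy_seconds_left <= 0:
        warnings.append("proxy expired: queue status cannot be trusted")
        return warnings
    if status.proxy_seconds_left < min_proxy_hours * 3600:
        warnings.append("proxy expires soon: renew it")

    # Estimate how long the remaining jobs will last at the current rate.
    if status.jobs_per_hour > 0:
        hours_left = status.pending_jobs / status.jobs_per_hour
    else:
        hours_left = float("inf") if status.pending_jobs > 0 else 0.0
    if hours_left < min_hours_of_work:
        warnings.append(
            f"queue nearly empty: about {hours_left:.1f} h of work left"
        )

    return warnings


if __name__ == "__main__":
    example = QueueStatus(pending_jobs=300, jobs_per_hour=100.0,
                          proxy_seconds_left=3600)
    for message in check(example):
        print(message)
```

Keeping the decision logic in a pure function like this makes it possible to test the "queue nearly empty" branch with made-up numbers, which is harder to do when the logic is mixed in with live queries.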
©2023 CERN