Message boards : CMS Application : Possible disruption in the next several hours

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31260 - Posted: 3 Jul 2017, 18:13:24 UTC

Mea culpa! I realised today that I'd accidentally typed one zero too many in the WMAgent request for the current batch, launching ten times as many jobs as intended! Alan tells me this could overload the agent, so I've submitted a "normal" batch and set the oversized one to "force-complete". This will clear out its queue, but I don't know exactly what effect it will have on currently running jobs.
So some jobs may be reported as failed or otherwise faulty, but once the tasks start picking up jobs from the new batch it should all clear up. My apologies; I hope it's not too traumatic.
ID: 31260
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31261 - Posted: 3 Jul 2017, 18:47:23 UTC - in response to Message 31260.  

OK, the new batch has started queueing. There was a hiatus of about 35 minutes with no jobs in the queue.
ID: 31261
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,483
RAC: 102,014
Message 31265 - Posted: 3 Jul 2017, 19:50:07 UTC - in response to Message 31260.  

[quote]... This will clear out its queue, but I don't know exactly what effect it will have on currently running jobs.
So some jobs may be reported as failed or otherwise faulty[/quote]

What it did was something like this:

2017-07-03 18:44:59 (6304): Guest Log: [INFO] CMS application starting. Check log files.
2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] HTCondor ping
2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] 0
2017-07-03 18:45:29 (6304): Guest Log: [INFO] New Job Starting in slot1
2017-07-03 18:45:29 (6304): Guest Log: [INFO] Condor JobID: 133232.17 in slot1
2017-07-03 18:46:59 (6304): Guest Log: [INFO] WMAgent_JobID = 62678 in slot1
2017-07-03 20:21:16 (6304): Guest Log: [ERROR] Condor exited after 5769s without running a job.
2017-07-03 20:21:16 (6304): Guest Log: [INFO] Shutting Down.
ID: 31265
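
For anyone who wants to spot this pattern across many task logs, here is a minimal Python sketch. It assumes the log lines have exactly the shape shown in message 31265 above (timestamp, process id in parentheses, `Guest Log:`, a bracketed level, then the message). The names `parse_guest_log_line` and `idle_exit_seconds` are made up for illustration; they are not part of BOINC or the CMS application.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed line shape, copied from the excerpt in message 31265:
# 2017-07-03 20:21:16 (6304): Guest Log: [ERROR] Condor exited after 5769s without running a job.
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)
IDLE_EXIT_RE = re.compile(r"Condor exited after (\d+)s without running a job")


class GuestLogLine(NamedTuple):
    stamp: str
    pid: int
    level: str
    message: str


def parse_guest_log_line(line: str) -> Optional[GuestLogLine]:
    """Return the parsed fields, or None if the line has a different shape."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return GuestLogLine(
        match["stamp"], int(match["pid"]), match["level"], match["message"]
    )


def idle_exit_seconds(lines: Iterable[str]) -> Optional[int]:
    """Return the idle time reported by the first 'exited without running a job' error."""
    for line in lines:
        parsed = parse_guest_log_line(line)
        if parsed is None or parsed.level != "ERROR":
            continue
        found = IDLE_EXIT_RE.search(parsed.message)
        if found:
            return int(found.group(1))
    return None
```

Feeding it the eight lines quoted above should pick out the single `[ERROR]` line and return the number of seconds it reports. Lines that do not match the assumed shape are skipped rather than treated as errors.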
rbpeake

Joined: 17 Sep 04
Posts: 99
Credit: 30,618,118
RAC: 3,938
Message 31266 - Posted: 3 Jul 2017, 20:15:41 UTC

Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry?
Thanks!
Regards,
Bob P.
ID: 31266
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31268 - Posted: 3 Jul 2017, 20:41:56 UTC - in response to Message 31266.  

[quote]Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry?
Thanks![/quote]

Yes, basically. That's why I try to keep the pump primed, and also why I warn you when I know a drought is coming up, so you can set No New Tasks or transfer to another project. It's perhaps a little more onerous than it might look: I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out. Current batches run for about two days, depending on how many people are currently running.
ID: 31268
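
The arithmetic behind "before the old one runs out" in message 31268 can be sketched roughly as follows. This is only an illustration: the job counts, the completion rate and the lead time are invented numbers, and the function names are hypothetical, not anything taken from WMAgent or its monitors.

```python
import math


def hours_until_empty(jobs_remaining: int, jobs_per_hour: float) -> float:
    """Estimate how long the current batch lasts at the current completion rate."""
    if jobs_remaining < 0:
        raise ValueError("jobs_remaining must not be negative")
    if jobs_per_hour <= 0:
        # Nobody is completing jobs, so the queue never drains.
        return math.inf
    return jobs_remaining / jobs_per_hour


def needs_new_batch(jobs_remaining: int, jobs_per_hour: float,
                    lead_time_hours: float) -> bool:
    """True when the queue would empty within the time needed to get a new batch in."""
    return hours_until_empty(jobs_remaining, jobs_per_hour) <= lead_time_hours


if __name__ == "__main__":
    # Made-up figures: 12000 jobs left, volunteers finishing 250 jobs per hour.
    print(hours_until_empty(12000, 250.0))    # 48.0
    print(needs_new_batch(12000, 250.0, 8))   # False
    print(needs_new_batch(1500, 250.0, 8))    # True
```

Because the rate depends on how many volunteers are running at the time, the estimate would need recomputing at every check rather than once per batch.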
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,483
RAC: 102,014
Message 31274 - Posted: 4 Jul 2017, 4:54:32 UTC - in response to Message 31268.  
Last modified: 4 Jul 2017, 4:54:43 UTC

[quote] ... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote]

Hm, is there no way to automate this somehow?
ID: 31274
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31278 - Posted: 4 Jul 2017, 7:58:26 UTC - in response to Message 31274.  

[quote][quote] ... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote]

Hm, is there no way to automate this somehow?[/quote]

Possibly. I've never bothered because I didn't expect to be on this project for so long... Keeping an eye on the monitors is only slightly more trouble than keeping an eye on my e-mail, anyway.
ID: 31278
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,483
RAC: 102,014
Message 31282 - Posted: 4 Jul 2017, 9:14:44 UTC - in response to Message 31278.  

[quote]... because I didn't expect to be on this project for so long...[/quote]

Which in a way is good, though; I think you are doing a perfect job :-)
ID: 31282
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31286 - Posted: 4 Jul 2017, 12:45:53 UTC
Last modified: 4 Jul 2017, 13:25:33 UTC

There will be an upgrade to the HTCondor schedd this afternoon. I'm told it should cause no significant disturbance, but be warned...
[Added] Done, with no problems seen. [/Added]
ID: 31286
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31291 - Posted: 4 Jul 2017, 17:12:50 UTC

Oops, something has gone wrong now. There is a failure in the WMAgent -- it still says there are jobs available, but other monitors and Dashboard say that they have run out.
Suggest you set No New Tasks until I round up the CERN posse.
ID: 31291
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,483
RAC: 102,014
Message 31295 - Posted: 4 Jul 2017, 19:04:37 UTC - in response to Message 31291.  

Ivan, any news on this?
ID: 31295
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31296 - Posted: 4 Jul 2017, 19:19:35 UTC - in response to Message 31295.  
Last modified: 4 Jul 2017, 19:26:17 UTC

[quote]Ivan, any news on this?[/quote]

Yes, Alan fixed the problem and there are jobs in the queue again. Time to restart.
[Added] Hmm, but the LHC@Home CMS queue is showing no tasks and it's not giving me any. -dev has tasks. I'll ping Laurence, he may need to tickle the server queue. [/Added]
ID: 31296
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,483
RAC: 102,014
Message 31297 - Posted: 4 Jul 2017, 19:30:46 UTC - in response to Message 31296.  

Thanks Ivan, as always, for your timely responses :-)))

So I'll see tomorrow morning what's happening, falling to bed right now.
ID: 31297
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31298 - Posted: 4 Jul 2017, 19:34:06 UTC - in response to Message 31297.  
Last modified: 4 Jul 2017, 19:36:12 UTC

[quote]Thanks Ivan, as always, for your timely responses :-)))

So I'll see tomorrow morning what's happening, falling to bed right now.[/quote]

I've actually got some tasks on one of both my machines now, so the problem may just have been that requests are coming in faster than tasks can be created.
ID: 31298
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 31299 - Posted: 4 Jul 2017, 20:45:22 UTC

My first check of the day is not what I expected.

About 60 ERRORS

VM Heartbeat file specified, but missing.
VM Heartbeat file specified, but missing file system status. (errno = '2')
ID: 31299
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 31300 - Posted: 5 Jul 2017, 10:18:52 UTC - in response to Message 31278.  

[quote][quote][quote] ... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ...[/quote]

Hm, is there no way to automate this somehow?[/quote]

Possibly. I've never bothered because I didn't expect to be on this project for so long... Keeping an eye on the monitors is only slightly more trouble than keeping an eye on my e-mail, anyway.[/quote]

Well, after last night's problems (which essentially happened between my leaving work and arriving at home :-( ), I have implemented a cron job to check the queue every 30 minutes. It's probably got bugs, as I'm not a letter-perfect shell programmer... Since the check relies on my having a valid CMS proxy certificate, I've also put in a check for the certificate expiring -- at least I could test that bit.
ID: 31300
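
The decision logic of a periodic check like the one described in message 31300 might look roughly like the sketch below. It is not Ivan's script (which is a shell script and was not posted): the function name, the thresholds and the warning texts are all assumptions, and how the queue length and the proxy lifetime are actually obtained is deliberately left out, because the thread does not say.

```python
from typing import List


def check_queue(queued_jobs: int,
                proxy_seconds_left: int,
                min_queued_jobs: int = 500,          # hypothetical threshold
                min_proxy_seconds: int = 6 * 3600    # hypothetical threshold
                ) -> List[str]:
    """Return a list of warnings; an empty list means nothing needs attention."""
    warnings: List[str] = []

    # Without a valid proxy the queue query itself cannot be trusted,
    # so report that first and stop.
    if proxy_seconds_left <= 0:
        warnings.append("proxy certificate has expired; cannot check the queue")
        return warnings

    if proxy_seconds_left < min_proxy_seconds:
        hours_left = proxy_seconds_left / 3600
        warnings.append(f"proxy certificate expires in {hours_left:.1f} h")

    if queued_jobs < min_queued_jobs:
        warnings.append(f"only {queued_jobs} jobs left in the queue")

    return warnings


if __name__ == "__main__":
    # Placeholder inputs; a real check would fill these in from the
    # batch system and the proxy tooling.
    for warning in check_queue(queued_jobs=120, proxy_seconds_left=2 * 3600):
        print(warning)
```

Run from cron, a script built around this would be scheduled with a standard crontab entry such as `*/30 * * * * /path/to/check_queue.py` (the path is a placeholder), which fires every 30 minutes. Testing the proxy first mirrors the point made in the post: the queue check depends on the certificate being valid.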



©2024 CERN