Thread 'Possible disruption in the next several hours'

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) - Joined: 29 Aug 05, Posts: 1152, Credit: 11,734,920, RAC: 442
Message 31260 - Posted: 3 Jul 2017, 18:13:24 UTC
Mea culpa! I realised today that I'd accidentally typed one zero too many in the WMAgent request for the current batch, and launched ten times too many jobs! Alan tells me this could overload the agent, so I've submitted a "normal" batch and have set this one to "force-complete". This will clear out its queue, but I don't know exactly what effect it will have on currently-running jobs. So there may be some jobs reported as failed, or otherwise faulty, but once the tasks start picking up jobs from the new batch it should all clear up. My apologies, I hope it's not too traumatic.

ivan
Message 31261 - Posted: 3 Jul 2017, 18:47:23 UTC - in response to Message 31260.
OK, the new batch has started queueing. There was a hiatus of about 35 minutes with no jobs in the queue.

Erich56 - Joined: 18 Dec 15, Posts: 1971, Credit: 159,559,976, RAC: 48,131
Message 31265 - Posted: 3 Jul 2017, 19:50:07 UTC - in response to Message 31260.
ivan wrote: "... This will clear out its queue, but I don't know exactly what effect it will have on currently-running jobs. So there may be some jobs reported as failed, or otherwise faulty"
What it did was something like this:
    2017-07-03 18:44:59 (6304): Guest Log: [INFO] CMS application starting. Check log files.
    2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] HTCondor ping
    2017-07-03 18:45:09 (6304): Guest Log: [DEBUG] 0
    2017-07-03 18:45:29 (6304): Guest Log: [INFO] New Job Starting in slot1
    2017-07-03 18:45:29 (6304): Guest Log: [INFO] Condor JobID: 133232.17 in slot1
    2017-07-03 18:46:59 (6304): Guest Log: [INFO] WMAgent_JobID = 62678 in slot1
    2017-07-03 20:21:16 (6304): Guest Log: [ERROR] Condor exited after 5769s without running a job.
    2017-07-03 20:21:16 (6304): Guest Log: [INFO] Shutting Down.

rbpeake - Joined: 17 Sep 04, Posts: 106, Credit: 36,549,147, RAC: 0
Message 31266 - Posted: 3 Jul 2017, 20:15:41 UTC
Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry? Thanks!
Regards, Bob P.

ivan
Message 31268 - Posted: 3 Jul 2017, 20:41:56 UTC - in response to Message 31266.
rbpeake wrote: "Just curious, what happens if the CMS scientists have no tasks for us? Does the well run dry? Thanks!"
Yes, basically. That's why I try to keep the pump primed, and also why I warn you when I know a drought is coming up, so you can set No New Tasks, or transfer to another project. It's perhaps a little more onerous than it might look: I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out. Current batches run for about two days, depending on how many people are currently running.

Erich56
Message 31274 - Posted: 4 Jul 2017, 4:54:32 UTC - in response to Message 31268. Last modified: 4 Jul 2017, 4:54:43 UTC
ivan wrote: "... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ..."
Hm, is there no way to get this automated somehow?

ivan
Message 31278 - Posted: 4 Jul 2017, 7:58:26 UTC - in response to Message 31274.
ivan wrote: "... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ..."
Erich56 wrote: "hm, there is no way to get this automated somehow?"
Possibly. I've never bothered because I didn't expect to be on this project for so long... Keeping an eye on the monitors is only slightly more trouble than keeping an eye on my e-mail, anyway.

Erich56
Message 31282 - Posted: 4 Jul 2017, 9:14:44 UTC - in response to Message 31278.
ivan wrote: "... because I didn't expect to be on this project for so long..."
Which in a way is good, though; I think you are doing a perfect job :-)

ivan
Message 31286 - Posted: 4 Jul 2017, 12:45:53 UTC Last modified: 4 Jul 2017, 13:25:33 UTC
There will be an upgrade to the HTCondor schedd this afternoon. I'm told it should make no significant disturbance, but be warned...
[Added] Done, with no problems seen. [/Added]

ivan
Message 31291 - Posted: 4 Jul 2017, 17:12:50 UTC
Oops, something has gone wrong now. There is a failure in the WMAgent -- it still says there are jobs available, but other monitors and Dashboard say that they have run out. Suggest you set No New Tasks until I round up the CERN posse.

Erich56
Message 31295 - Posted: 4 Jul 2017, 19:04:37 UTC - in response to Message 31291.
Ivan, any news on this?

ivan
Message 31296 - Posted: 4 Jul 2017, 19:19:35 UTC - in response to Message 31295. Last modified: 4 Jul 2017, 19:26:17 UTC
Erich56 wrote: "Ivan, any news on this?"
Yes, Alan fixed the problem and there are jobs in the queue again. Time to restart.
[Added] Hmm, but the LHC@Home CMS queue is showing no tasks and it's not giving me any. -dev has tasks. I'll ping Laurence, he may need to tickle the server queue. [/Added]

Erich56
Message 31297 - Posted: 4 Jul 2017, 19:30:46 UTC - in response to Message 31296.
Thanks Ivan, as always, for your timely responses :-))) So I'll see tomorrow morning what's happening, falling to bed right now.

ivan
Message 31298 - Posted: 4 Jul 2017, 19:34:06 UTC - in response to Message 31297. Last modified: 4 Jul 2017, 19:36:12 UTC
Erich56 wrote: "Thanks Ivan, as always, for your timely responses :-))) So I'll see tomorrow morning what's happening, falling to bed right now."
I've actually got some tasks on [s]one of[/s] both my machines now, so the problem may just have been that requests are coming in faster than tasks can be created.

Magic Quantum Mechanic - Joined: 24 Oct 04, Posts: 1294, Credit: 95,324,689, RAC: 26,132
Message 31299 - Posted: 4 Jul 2017, 20:45:22 UTC
My first check of the day is not what I expected. About 60 ERRORS:
    VM Heartbeat file specified, but missing.
    VM Heartbeat file specified, but missing file system status. (errno = '2')

ivan
Message 31300 - Posted: 5 Jul 2017, 10:18:52 UTC - in response to Message 31278.
ivan wrote: "... I have to check every few hours (except when I'm asleep) to make sure the next batch is available before the old one runs out ..."
Erich56 wrote: "hm, there is no way to get this automated somehow?"
ivan wrote: "Possibly. I've never bothered because I didn't expect to be on this project for so long... Keeping an eye on the monitors is only slightly more trouble than keeping an eye on my e-mail, anyway."
Well, after last night's problems (which essentially happened between my leaving work and arriving at home :-( ), I have implemented a cron job to check the queue every 30 minutes. It's probably got bugs as I'm not a letter-perfect shell programmer... Since the check relies on my having a valid CMS proxy certificate I've put in a check for it expiring -- at least I could test that bit.
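
[Editor's sketch] The periodic check described in the post above might be structured roughly as follows. This is not ivan's script: his was a shell cron job and it is not shown in the thread. The function name, the thresholds, and the idea of passing in the queue depth and proxy lifetime as plain numbers are all assumptions made for illustration; how those two numbers are actually obtained from WMAgent and the proxy tools is left out.

```python
"""Hypothetical sketch of a periodic queue check; not the script from the thread."""

# Assumed thresholds, chosen only for illustration.
MIN_PENDING_JOBS = 500          # below this, the next batch should be submitted
MIN_PROXY_SECONDS = 24 * 3600   # warn when less than a day of proxy lifetime is left


def check(pending_jobs, proxy_seconds_left,
          min_jobs=MIN_PENDING_JOBS, min_proxy=MIN_PROXY_SECONDS):
    """Return a list of warning strings; an empty list means nothing needs doing."""
    warnings = []
    # Look at the proxy first: the post says the queue check depends on a
    # valid proxy, so with an expired one the queue figure is not usable.
    if proxy_seconds_left <= 0:
        warnings.append("proxy expired: queue status cannot be checked")
        return warnings
    if proxy_seconds_left < min_proxy:
        warnings.append(f"proxy expires in {proxy_seconds_left // 3600} h")
    if pending_jobs < min_jobs:
        warnings.append(f"only {pending_jobs} jobs pending: submit the next batch")
    return warnings
```

A crontab schedule field of `*/30 * * * *` would run a wrapper around this every 30 minutes, matching the interval mentioned in the post; cron normally mails any output to the job's owner, so printing the warnings would be enough to raise an alert.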