Thread 'CMS Tasks Failing'

Author	Message
Brummig Send message Joined: 9 Feb 16 Posts: 50 Credit: 546,878 RAC: 0	Message 29901 - Posted: 11 Apr 2017, 13:46:10 UTC - in response to Message 29898. Last modified: 11 Apr 2017, 14:05:53 UTC More problems connecting to the mother ship, this time on a Theory task: 2017-04-11 13:51:19 (11052): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125 (https://lhcathome.cern.ch/lhcathome/result.php?resultid=132873626) Given that this followed 6 hours 11 min 46 sec of CPU work, it would have been nice if it had tried again. No evidence of a network connectivity problem at my end (i.e. no problems with the Radio Paradise stream). ID: 29901 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 29903 - Posted: 11 Apr 2017, 16:21:48 UTC - in response to Message 29901. A bit strange, you had been getting through before. I looked at the throughput graphs for what I believe is that proxy [1] and didn't see any obvious glitches -- although the finest granularity is a five-minute average. [1] http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgall/CERN-PROD_lhchomeproxy.cern.ch_0/index.html -- but you may need CERN credentials to view it. ID: 29903 · Reply Quote

Brummig Send message Joined: 9 Feb 16 Posts: 50 Credit: 546,878 RAC: 0	Message 29914 - Posted: 12 Apr 2017, 12:27:46 UTC - in response to Message 29903. Last modified: 12 Apr 2017, 12:29:34 UTC Well of course the glitch could have been out on the net somewhere, and glitches can be very short. Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline. (That URL is public, BTW). ID: 29914 · Reply Quote
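
For illustration only, the retry behaviour asked for here could look roughly like the sketch below. It is not the code the Theory VM actually runs: the function name connect_with_retry, the five attempts and the 2-second starting delay are all assumptions. Only the host and port come from the error message quoted earlier in the thread.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=2.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try a TCP connection several times, doubling the wait after each failure.

    `connect` and `sleep` are injectable so the logic can be exercised
    without touching the network (a hypothetical design choice).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=10)
        except OSError as err:
            last_error = err
            # Wait 2 s, 4 s, 8 s, ... but do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(
        f"Could not connect to {host} on port {port} after {attempts} attempts"
    ) from last_error


# Example call (not executed here, since it needs network access):
# sock = connect_with_retry("lhchomeproxy.cern.ch", 3125)
```

With these assumed defaults a short glitch of up to roughly 30 seconds would be ridden out before the task gave up, which is small compared with the six hours of CPU time at stake.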

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 29926 - Posted: 13 Apr 2017, 10:06:09 UTC - in response to Message 29914. Well of course the glitch could have been out on the net somewhere, and glitches can be very short. Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline. I've no idea, myself. That's one for the experts. (That URL is public, BTW). Oh, good. It's nowhere near as useful as the famous "cricket graph" was for SETI@Home but now it's known I'm sure someone will let us know immediately there's a catastrophic failure. ID: 29926 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,942,380 RAC: 95,195	Message 30054 - Posted: 26 Apr 2017, 8:19:13 UTC Last modified: 26 Apr 2017, 8:20:39 UTC Since yesterday evening, all CMS jobs have failed after 10-12 minutes. Excerpt from stderr:
2017-04-26 09:54:19 (2908): Guest Log: [INFO] CMS application starting. Check log files.
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] HTCondor ping
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] 0
2017-04-26 10:04:30 (2908): Guest Log: [ERROR] Condor exited after 612s without running a job.
2017-04-26 10:04:30 (2908): Guest Log: [INFO] Shutting Down.
2017-04-26 10:04:30 (2908): VM Completion File Detected.
2017-04-26 10:04:30 (2908): VM Completion Message: Condor exited after 612s without running a job.
The complete content can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136735347 Any idea what's going wrong? ID: 30054 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30055 - Posted: 26 Apr 2017, 8:40:29 UTC - in response to Message 30054. Last modified: 26 Apr 2017, 9:16:44 UTC We're investigating, but at the moment Laurence and I are in a meeting... Recommend to set no new tasks until we work it out. Jobs are available, and WMAgent looks to be running so maybe it's a network problem. [Edit] Could also be a full disk... [/Edit] ID: 30055 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30058 - Posted: 26 Apr 2017, 10:25:13 UTC - in response to Message 30055. We've found the problem. Should be a quick fix, but keep your fingers crossed for a while longer... ID: 30058 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30060 - Posted: 26 Apr 2017, 13:08:24 UTC - in response to Message 30058. There are some jobs available now. The number of running jobs is picking up slowly -- not sure if that's limited supply (not all jobs in the queue can be sent to volunteers) or limited demand. ID: 30060 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,942,380 RAC: 95,195	Message 30304 - Posted: 12 May 2017, 16:36:16 UTC In the past half hour, I've got several cases where tasks failed after 10-12 minutes with "computation error". Excerpt from STDERR: 2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job. One such complete STDERR can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=139490166 Any idea what's going wrong? ID: 30304 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1559 Credit: 10,102,701 RAC: 699	Message 30305 - Posted: 12 May 2017, 17:34:30 UTC - in response to Message 30304. 2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job. . . Any idea what's going wrong? The well of jobs for CMS-VM's has run dry. Mostly Ivan is reacting like a goat on a corn-box even during weekends. Select another sub-project for the time being. ID: 30305 · Reply Quote
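
As a rough aid for spotting this failure in a saved stderr file, here is a minimal sketch that pulls the idle time out of the completion message. The helper name idle_exit_seconds is made up for this example, and the pattern is only inferred from the two log lines quoted in this thread, so other wordings would not match. The message by itself does not say why no job arrived (empty queue, WMAgent down, network trouble).

```python
import re

# Pattern inferred from the stderr excerpts quoted in this thread.
IDLE_EXIT = re.compile(r"Condor exited after (\d+)s without running a job")


def idle_exit_seconds(stderr_text):
    """Return how many seconds Condor waited before exiting idle, or None."""
    match = IDLE_EXIT.search(stderr_text)
    return int(match.group(1)) if match else None


sample = ("2017-05-12 18:13:03 (6820): VM Completion Message: "
          "Condor exited after 627s without running a job.")
print(idle_exit_seconds(sample))
```

A result of None simply means the task ended for some other reason.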

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,942,380 RAC: 95,195	Message 30306 - Posted: 12 May 2017, 17:59:58 UTC - in response to Message 30305. Select another sub-project for the time being. On one of my PCs, I switched to LHCb, with two jobs running. One job has now been running for 1 hour, the other one for half an hour. The strange thing though is that neither uses any CPU - the Windows task manager shows 2 Vbox.headless_exe with CPU usage exactly zero, although the progress bar in the BOINC Manager is growing. I had tried LHCb tasks successfully a short time ago, and of course they were using the CPU. So, what is going wrong this time? ID: 30306 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 952 Credit: 785,099,797 RAC: 121,808	Message 30307 - Posted: 12 May 2017, 18:28:10 UTC I think LHCb is out too? ID: 30307 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30309 - Posted: 12 May 2017, 19:07:52 UTC Last modified: 12 May 2017, 19:17:06 UTC Sorry 'bout that, I was catching up on the news and Great British Menu. Only just noticed it -- the WMAgent is down. I've messaged Alan. Please set No New Tasks or try another subproject until he can fix it. [Edit] I really should have noticed that sooner. I did see that the estimated time to completion hadn't fallen as much as I'd expected, but I guess at that time the queue hadn't drained so it didn't show up on the graphs -- it dried up about 1700 and I left work a bit earlier than usual today, at 1730. There are jobs created and pending but they are not transferring to the queue. [/Edit] ID: 30309 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,942,380 RAC: 95,195	Message 30310 - Posted: 12 May 2017, 19:18:27 UTC - in response to Message 30307. Last modified: 12 May 2017, 19:21:26 UTC I think LHCb is out too? yes, seems to be the case :-( whereas I am surprised that the tasks which have been running without any CPU usage for 1 1/2 hours now, are not terminating themselves ID: 30310 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30311 - Posted: 12 May 2017, 19:39:44 UTC - in response to Message 30310. I think LHCb is out too? yes, seems to be the case :-( whereas I am surprised that the tasks which have been running without any CPU usage for 1 1/2 hours now, are not terminating themselves There doesn't seem to be a decrease in the number of lhcb pilots on the server I can monitor -- slightly the opposite actually. Number of idle (i.e. queued) pilots seems to be falling slightly. ID: 30311 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,942,380 RAC: 95,195	Message 30312 - Posted: 12 May 2017, 20:06:24 UTC - in response to Message 30311. There doesn't seem to be a decrease in the number of lhcb pilots on the server I can monitor -- slightly the opposite actually. Number of idle (i.e. queued) pilots seems to be falling slightly. I tried to open this page: http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php However, it does not open - I only get a blank, white screen. ID: 30312 · Reply Quote

Juha Send message Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0	Message 30313 - Posted: 12 May 2017, 20:23:19 UTC - in response to Message 30311. There doesn't seem to be a decrease in the number of lhcb pilots on the server I can monitor -- slightly the opposite actually. Number of idle (i.e. queued) pilots seems to be falling slightly. From what I can tell from the logs, the server has no trouble handing out jobs. It's that the jobs get stuck before even starting.
05/12/17 21:59:41 (pid:4088) Job 3099277.20 set to execute immediately
05/12/17 21:59:41 (pid:4088) Starting a VANILLA universe job with ID: 3099277.20
05/12/17 21:59:41 (pid:4088) IWD: /var/lib/condor/execute/dir_4088
05/12/17 21:59:41 (pid:4088) Renice expr "10" evaluated to 10
05/12/17 21:59:41 (pid:4088) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4088/condor_exec.exe 309927720
05/12/17 21:59:41 (pid:4088) Running job as user nobody
05/12/17 21:59:41 (pid:4088) Create_Process succeeded, pid=4092
There's nothing in the running.log. The process tree starting from the 4092 above looks like this:
inner-wrapper job-wrapper sleep condor_exec.exe wget
The wget there is a bit surprising, considering there is hardly any network traffic going on. Is there maybe some server having problems? netstat shows two connections (two VMs) to lbvobox33.cern.ch in the SYN_SENT state. That server isn't responding to a web browser. ID: 30313 · Reply Quote
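
The netstat check described above can be sketched as a small filter over saved netstat output. This is an illustration, not a tool used by the project: the function name stuck_connections is hypothetical, it assumes the column layout printed by Linux `netstat -tn`, and the sample text uses made-up documentation addresses and ports rather than anything captured from the VMs in this thread.

```python
def stuck_connections(netstat_output):
    """Return the remote endpoints of TCP connections still in SYN_SENT.

    SYN_SENT means the first packet of the handshake went out and no
    reply has come back yet, which is what an unresponsive server looks like.
    """
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: Proto Recv-Q Send-Q Local Foreign State
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "SYN_SENT"):
            stuck.append(fields[4])
    return stuck


# Made-up sample for illustration (192.0.2.x is a documentation address range).
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      1 10.0.2.15:40112         192.0.2.33:443          SYN_SENT
tcp        0      0 10.0.2.15:51234         192.0.2.10:3125         ESTABLISHED
tcp        0      1 10.0.2.15:40114         192.0.2.33:443          SYN_SENT
"""
print(stuck_connections(sample))
```

Several entries pointing at the same remote host, as in the sample, suggest the problem is at the far end rather than in the local network.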

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30315 - Posted: 12 May 2017, 20:48:09 UTC - in response to Message 30313. Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect? The CMS WMAgent is down, but I'm not entirely sure it would have that much effect on the proxy. ID: 30315 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 30317 - Posted: 12 May 2017, 22:41:44 UTC - in response to Message 30315. Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect? The CMS WMAgent is down, but I'm not entirely sure it would have that much effect on the proxy. OK, it was effect, not cause. There was an "intervention" on the server that left it in a bad state. It's been fixed now and things are returning to something resembling normality. Should be OK to resume CMS tasks again now. ID: 30317 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 952 Credit: 785,099,797 RAC: 121,808	Message 30318 - Posted: 12 May 2017, 22:49:43 UTC LHCb is back too ID: 30318 · Reply Quote