Message boards :
CMS Application :
CMS Tasks Failing
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0
More problems connecting to the mother ship, this time on a Theory task:
2017-04-11 13:51:19 (11052): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125
(https://lhcathome.cern.ch/lhcathome/result.php?resultid=132873626)
Given that this followed 6 hours 11 min 46 sec of CPU work, it would have been nice if it had tried again. No evidence of a network connectivity problem at my end (i.e. no problems with the Radio Paradise stream).
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
A bit strange, you had been getting through before. I looked at the throughput graphs for what I believe is that proxy[*] and didn't see any obvious glitches -- although the finest granularity is a five-minute average.
[*] http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgall/CERN-PROD_lhchomeproxy.cern.ch_0/index.html -- but you may need CERN credentials to view it.
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0
Well of course the glitch could have been out on the net somewhere, and glitches can be very short. Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline. (That URL is public, BTW.)
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
"Well of course the glitch could have been out on the net somewhere, and glitches can be very short."
I've no idea, myself. That's one for the experts.
"(That URL is public, BTW.)"
Oh, good. It's nowhere near as useful as the famous "cricket graph" was for SETI@Home, but now that it's known I'm sure someone will let us know immediately there's a catastrophic failure.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,745 RAC: 29,779
Since yesterday evening, all CMS jobs have failed after 10-12 minutes. Excerpt from stderr:
2017-04-26 09:54:19 (2908): Guest Log: [INFO] CMS application starting. Check log files.
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] HTCondor ping
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] 0
2017-04-26 10:04:30 (2908): Guest Log: [ERROR] Condor exited after 612s without running a job.
2017-04-26 10:04:30 (2908): Guest Log: [INFO] Shutting Down.
2017-04-26 10:04:30 (2908): VM Completion File Detected.
2017-04-26 10:04:30 (2908): VM Completion Message: Condor exited after 612s without running a job.
The complete content can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136735347
Any idea what's going wrong?
Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,745 RAC: 29,779
In the past half hour, I've had several cases where tasks failed after 10-12 minutes with "computation error". Excerpt from stderr:
2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job.
One such complete stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=139490166
Any idea what's going wrong?
Joined: 14 Jan 10 Posts: 1418 Credit: 9,460,759 RAC: 2,399
"2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job."
The well of jobs for CMS VMs has run dry. Usually Ivan reacts quickly, even during weekends. Select another sub-project for the time being.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,745 RAC: 29,779
"Select another sub-project for the time being."
On one of my PCs, I switched to LHCb, with two jobs running. One job has now been running for an hour, the other for half an hour. The strange thing, though, is that neither uses any CPU - the Windows Task Manager shows two VBoxHeadless.exe processes with CPU usage at exactly zero, although the progress bar in the BOINC Manager is growing. I had run LHCb tasks successfully a short time ago, and of course they were using the CPU. So, what is going wrong this time?
Joined: 27 Sep 08 Posts: 847 Credit: 691,638,469 RAC: 113,003
I think LHCb is out too?
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
Sorry 'bout that, I was catching up on the news and Great British Menu. Only just noticed it -- the WMAgent is down. I've messaged Alan. Please set No New Tasks or try another subproject until he can fix it.
[Edit] I really should have noticed that sooner. I did see that the estimated time to completion hadn't fallen as much as I'd expected, but I guess at that time the queue hadn't drained, so it didn't show up on the graphs -- it dried up about 17:00 and I left work a bit earlier than usual today, at 17:30. There are jobs created and pending, but they are not transferring to the queue. [/Edit]
Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,745 RAC: 29,779
"I think LHCb is out too?"
Yes, that seems to be the case :-( I'm surprised, though, that the tasks which have been running without any CPU usage for an hour and a half now are not terminating themselves.
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
"I think LHCb is out too?"
There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually. The number of idle (i.e. queued) pilots seems to be falling slightly.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,745 RAC: 29,779
"There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually."
I tried to open this page: http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php
However, it does not open - I only get a blank, white screen.
Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0
"There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor."
From what I can tell from the logs, the server has no trouble handing out jobs. It's that the jobs get stuck before even starting:
05/12/17 21:59:41 (pid:4088) Job 3099277.20 set to execute immediately
05/12/17 21:59:41 (pid:4088) Starting a VANILLA universe job with ID: 3099277.20
05/12/17 21:59:41 (pid:4088) IWD: /var/lib/condor/execute/dir_4088
05/12/17 21:59:41 (pid:4088) Renice expr "10" evaluated to 10
05/12/17 21:59:41 (pid:4088) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4088/condor_exec.exe 309927720
05/12/17 21:59:41 (pid:4088) Running job as user nobody
05/12/17 21:59:41 (pid:4088) Create_Process succeeded, pid=4092
There's nothing in running.log. The process tree starting from pid 4092 above looks like this:
inner-wrapper
job-wrapper
sleep
condor_exec.exe
wget
wget
The wget there is a bit surprising, considering there is hardly any network traffic going on. Is there maybe some server having problems? netstat shows there are two connections (two VMs) to lbvobox33.cern.ch in the SYN_SENT state. That server isn't responding to a web browser either.
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect? The CMS WMAgent is down, but I'm not entirely sure it would have that much effect on the proxy.
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317
"Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect?"
OK, it was effect, not cause. There was an "intervention" on the server that left it in a bad state. It's been fixed now and things are returning to something resembling normality. Should be OK to resume CMS tasks again now.
Joined: 27 Sep 08 Posts: 847 Credit: 691,638,469 RAC: 113,003
LHCb is back too.
©2024 CERN