Message boards : CMS Application : CMS Tasks Failing

Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 535,540
RAC: 0
Message 29901 - Posted: 11 Apr 2017, 13:46:10 UTC - in response to Message 29898.  
Last modified: 11 Apr 2017, 14:05:53 UTC

More problems connecting to the mother ship, this time on a Theory task:

2017-04-11 13:51:19 (11052): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125


(https://lhcathome.cern.ch/lhcathome/result.php?resultid=132873626)

Given that that followed 6 hours 11 min 46 sec of CPU work, it would have been nice if it had tried again.

No evidence of a network connectivity problem at my end (i.e. no problems with the Radio Paradise stream).
ID: 29901
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 29903 - Posted: 11 Apr 2017, 16:21:48 UTC - in response to Message 29901.  

A bit strange, you had been getting through before. I looked at the throughput graphs for what I believe is that proxy[*] and didn't see any obvious glitches -- although the finest granularity is a five-minute average.

[*] http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgall/CERN-PROD_lhchomeproxy.cern.ch_0/index.html -- but you may need CERN credentials to view it.
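
To put a rough number on why a five-minute average can hide a short glitch, here is a minimal Python sketch. It assumes the traffic drops to zero for the whole outage, that the outage falls inside a single averaging window, and that the rate is otherwise constant; the function name and the figures are illustrative, not taken from the monitoring page.

```python
def averaged_throughput(normal_rate, outage_seconds, window_seconds=300):
    """Average rate over one window in which traffic is zero for outage_seconds.

    Assumes a constant normal_rate outside the outage (an idealisation).
    """
    if not 0 <= outage_seconds <= window_seconds:
        raise ValueError("outage must fit inside the averaging window")
    return normal_rate * (window_seconds - outage_seconds) / window_seconds


# A 15-second total outage in a 300-second window only lowers the average by 5%.
print(averaged_throughput(100.0, 15))
```

A dip of that size is easily lost in the normal variation of such a graph, so a clean plot does not rule out a brief interruption.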
ID: 29903
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 535,540
RAC: 0
Message 29914 - Posted: 12 Apr 2017, 12:27:46 UTC - in response to Message 29903.  
Last modified: 12 Apr 2017, 12:29:34 UTC

Well of course the glitch could have been out on the net somewhere, and glitches can be very short.

Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline.
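
For illustration, the kind of retry being asked for could look like the Python sketch below. This is not how the VM or the BOINC wrapper actually connects (that code isn't shown in this thread); the function name, the number of attempts and the delays are all assumptions. The `connect` and `sleep` parameters exist only so the logic can be exercised without a real network.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=1.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, backing off exponentially between failed attempts.

    Returns the connected socket, or re-raises the last OSError once every
    attempt has failed.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=10)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1 s, 2 s, 4 s, ... before trying again.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With the host and port from the log line, a call would be `connect_with_retry("lhchomeproxy.cern.ch", 3125)`. After six hours of CPU work, a few minutes of backoff would be a small price.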

(That URL is public, BTW).
ID: 29914
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 29926 - Posted: 13 Apr 2017, 10:06:09 UTC - in response to Message 29914.  

Well of course the glitch could have been out on the net somewhere, and glitches can be very short.

Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline.
I've no idea, myself. That's one for the experts.
(That URL is public, BTW).

Oh, good. It's nowhere near as useful as the famous "cricket graph" was for SETI@Home, but now that it's known I'm sure someone will let us know immediately if there's a catastrophic failure.
ID: 29926
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,525,922
RAC: 39,449
Message 30054 - Posted: 26 Apr 2017, 8:19:13 UTC
Last modified: 26 Apr 2017, 8:20:39 UTC

Since yesterday evening, all CMS jobs have been failing after 10-12 minutes.

Excerpt from stderr:

2017-04-26 09:54:19 (2908): Guest Log: [INFO] CMS application starting. Check log files.
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] HTCondor ping
2017-04-26 09:54:19 (2908): Guest Log: [DEBUG] 0
2017-04-26 10:04:30 (2908): Guest Log: [ERROR] Condor exited after 612s without running a job.
2017-04-26 10:04:30 (2908): Guest Log: [INFO] Shutting Down.
2017-04-26 10:04:30 (2908): VM Completion File Detected.
2017-04-26 10:04:30 (2908): VM Completion Message: Condor exited after 612s without running a job.
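
Anyone who wants to check many results for this failure could scan the stderr text with something like the Python sketch below. The pattern is written against the lines quoted above and assumes the wording of the completion message stays the same; `find_idle_exits` is a made-up helper name, not part of BOINC.

```python
import re

# Matches the "VM Completion Message" line in the form quoted in the excerpt.
PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM Completion Message: Condor exited after (?P<seconds>\d+)s "
    r"without running a job\."
)


def find_idle_exits(stderr_text):
    """Return (timestamp, seconds) for each 'exited without running a job' message."""
    results = []
    for line in stderr_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            results.append((match["timestamp"], int(match["seconds"])))
    return results
```

An empty list means the task did not end this way; a run of results all showing roughly 600 seconds suggests the job queue was empty or unreachable rather than a problem on the volunteer's machine.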

The complete output can be seen here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=136735347

Any idea what's going wrong?
ID: 30054
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30055 - Posted: 26 Apr 2017, 8:40:29 UTC - in response to Message 30054.  
Last modified: 26 Apr 2017, 9:16:44 UTC

We're investigating, but at the moment Laurence and I are in a meeting...
I recommend setting No New Tasks until we work it out. Jobs are available, and WMAgent looks to be running, so maybe it's a network problem.
[Edit] Could also be a full disk... [/Edit]
ID: 30055
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30058 - Posted: 26 Apr 2017, 10:25:13 UTC - in response to Message 30055.  

We've found the problem. Should be a quick fix, but keep your fingers crossed for a while longer...
ID: 30058
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30060 - Posted: 26 Apr 2017, 13:08:24 UTC - in response to Message 30058.  

There are some jobs available now. The number of running jobs is picking up slowly -- not sure if that's limited supply (not all jobs in the queue can be sent to volunteers) or limited demand.
ID: 30060
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,525,922
RAC: 39,449
Message 30304 - Posted: 12 May 2017, 16:36:16 UTC

In the past half hour, I've got several cases where tasks failed after 10-12 minutes with "computation error".

Excerpt from STDERR:

2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job.

One such complete STDERR can be seen here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=139490166

Any idea what's going wrong?
ID: 30304
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1094
Credit: 6,831,609
RAC: 769
Message 30305 - Posted: 12 May 2017, 17:34:30 UTC - in response to Message 30304.  

2017-05-12 18:13:03 (6820): VM Completion Message: Condor exited after 627s without running a job.
.
.
Any idea what's going wrong?

The well of jobs for CMS-VM's has run dry.
Ivan usually jumps on these things quickly, even during weekends.
Select another sub-project for the time being.
ID: 30305
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,525,922
RAC: 39,449
Message 30306 - Posted: 12 May 2017, 17:59:58 UTC - in response to Message 30305.  

Select another sub-project for the time being.

On one of my PCs, I switched to LHCb, with two jobs running.
One job has now been running for 1 hour, the other one for half an hour.

The strange thing, though, is that neither of them uses any CPU: the Windows Task Manager shows two VBoxHeadless.exe processes with CPU usage at exactly zero, although the progress bar in the BOINC Manager keeps growing.

I had tried LHCb tasks successfully a short time ago, and of course they were using the CPU.

So, what is going wrong this time?
ID: 30306
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 725
Credit: 471,293,121
RAC: 247,325
Message 30307 - Posted: 12 May 2017, 18:28:10 UTC

I think LHCb is out too?
ID: 30307
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30309 - Posted: 12 May 2017, 19:07:52 UTC
Last modified: 12 May 2017, 19:17:06 UTC

Sorry 'bout that, I was catching up on the news and Great British Menu. Only just noticed it -- the WMAgent is down. I've messaged Alan. Please set No New Tasks or try another subproject until he can fix it.

[Edit] I really should have noticed that sooner. I did see that the estimated time to completion hadn't fallen as much as I'd expected, but I guess at that time the queue hadn't drained so it didn't show up on the graphs -- it dried up about 1700 and I left work a bit earlier than usual today, at 1730. There are jobs created and pending but they are not transferring to the queue. [/Edit]
ID: 30309
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,525,922
RAC: 39,449
Message 30310 - Posted: 12 May 2017, 19:18:27 UTC - in response to Message 30307.  
Last modified: 12 May 2017, 19:21:26 UTC

I think LHCb is out too?

Yes, seems to be the case :-(
Still, I am surprised that the tasks, which have now been running without any CPU usage for 1 1/2 hours, are not terminating themselves.
ID: 30310
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30311 - Posted: 12 May 2017, 19:39:44 UTC - in response to Message 30310.  

I think LHCb is out too?

Yes, seems to be the case :-(
Still, I am surprised that the tasks, which have now been running without any CPU usage for 1 1/2 hours, are not terminating themselves.

There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually. The number of idle (i.e. queued) pilots seems to be falling slightly.
ID: 30311
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,525,922
RAC: 39,449
Message 30312 - Posted: 12 May 2017, 20:06:24 UTC - in response to Message 30311.  

There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually. The number of idle (i.e. queued) pilots seems to be falling slightly.


I tried to open this page:

http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php

however, it does not open - I only get a blank, white screen
ID: 30312
Juha

Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Message 30313 - Posted: 12 May 2017, 20:23:19 UTC - in response to Message 30311.  

There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually. The number of idle (i.e. queued) pilots seems to be falling slightly.


From what I can tell from the logs, the server has no trouble handing out jobs. It's that the jobs get stuck before even starting.

05/12/17 21:59:41 (pid:4088) Job 3099277.20 set to execute immediately
05/12/17 21:59:41 (pid:4088) Starting a VANILLA universe job with ID: 3099277.20
05/12/17 21:59:41 (pid:4088) IWD: /var/lib/condor/execute/dir_4088
05/12/17 21:59:41 (pid:4088) Renice expr "10" evaluated to 10
05/12/17 21:59:41 (pid:4088) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4088/condor_exec.exe 309927720
05/12/17 21:59:41 (pid:4088) Running job as user nobody
05/12/17 21:59:41 (pid:4088) Create_Process succeeded, pid=4092


There's nothing in the running.log.

The process tree starting from the 4092 above looks like this:

inner-wrapper
  job-wrapper
    sleep
  condor_exec.exe
    wget


Seeing wget there is a bit surprising, considering there is hardly any network traffic going on. Is there maybe some server having problems? netstat shows two connections (two VMs) to lbvobox33.cern.ch in the SYN_SENT state. That server isn't responding to a web browser either.
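
SYN_SENT means the client has sent its SYN and is still waiting for the server's SYN-ACK, which is what you would see if the server (or something in front of it) is silently dropping packets. A quick way to tell that apart from a refused connection is a probe like the Python sketch below. The `probe` helper is hypothetical, and the port wget was using isn't shown in the logs, so any port passed to it is a guess.

```python
import socket


def probe(host, port, timeout=5.0, connect=socket.create_connection):
    """Classify a TCP connection attempt as 'open', 'timeout', 'refused' or an error."""
    try:
        sock = connect((host, port), timeout=timeout)
    except socket.timeout:
        # No answer to the SYN within the timeout: the SYN_SENT situation.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network, and so on.
        return f"error: {exc}"
    sock.close()
    return "open"
```

A result of "timeout" from several unrelated networks would point at the server rather than at the volunteer's connection.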
ID: 30313
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30315 - Posted: 12 May 2017, 20:48:09 UTC - in response to Message 30313.  

Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect? The CMS WMAgent is down, but I'm not entirely sure it would have that much effect on the proxy.
ID: 30315
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 874
Credit: 5,805,977
RAC: 244
Message 30317 - Posted: 12 May 2017, 22:41:44 UTC - in response to Message 30315.  

Hmm, you're right. The squid proxy is taking a hit, but is that cause or effect? The CMS WMAgent is down, but I'm not entirely sure it would have that much effect on the proxy.

OK, it was effect, not cause. There was an "intervention" on the server that left it in a bad state. It's been fixed now and things are returning to something resembling normality. Should be OK to resume CMS tasks again now.
ID: 30317
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 725
Credit: 471,293,121
RAC: 247,325
Message 30318 - Posted: 12 May 2017, 22:49:43 UTC

LHCb is back too
ID: 30318

©2022 CERN