Message boards : Number crunching : General Work Shortage?

ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 28504 - Posted: 15 Jan 2017, 17:13:31 UTC

My hosts have been returning quite a few errors recently with the "...Condor exited after XXXs without running a job..." message. I'm used to seeing these occasionally, but not so many as in the last couple of days. This is happening mostly on Theory and CMS tasks, but I've had a few on LHCb, too.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 700
Credit: 444,767,087
RAC: 186,472
Message 28505 - Posted: 15 Jan 2017, 18:05:21 UTC

There seems to be an uptick in issues this afternoon.
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,027,000
RAC: 0
Message 28506 - Posted: 15 Jan 2017, 18:08:39 UTC

Same here.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 821
Credit: 5,717,880
RAC: 149
Message 28507 - Posted: 15 Jan 2017, 20:34:49 UTC - in response to Message 28504.  

ritterm wrote:
> My hosts have been returning quite a few errors recently with the "...Condor exited after XXXs without running a job..." message. I'm used to seeing these occasionally, but not so many as in the last couple of days. This is happening mostly on Theory and CMS tasks, but I've had a few on LHCb, too.

We're aware of it. Laurence is investigating, but I can't yet give a schedule for a solution.
Jesse Viviano

Joined: 12 Feb 14
Posts: 71
Credit: 1,807,236
RAC: 1,215
Message 28509 - Posted: 15 Jan 2017, 23:26:36 UTC

I have noticed that the make_work_app daemon is marked as "Not Running" on the new LHC@home server status page at https://lhcathome.cern.ch/lhcathome/server_status.php, which is a different page from the old LHC@home 1.0 server status page at http://lhcathomeclassic.cern.ch/sixtrack/server_status.php. The two pages list different daemons on different servers, so both are still useful for now. Is this related to the problems people are having?
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 700
Credit: 444,767,087
RAC: 186,472
Message 28515 - Posted: 16 Jan 2017, 7:25:50 UTC

Looks like it's fixed now.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 821
Credit: 5,717,880
RAC: 149
Message 28516 - Posted: 16 Jan 2017, 9:03:03 UTC - in response to Message 28515.  

Toby Broom wrote:
> Looks like it's fixed now.

Yes, the problem was that Laurence's cluster had stopped running CMS merge jobs (which merge the smaller 60 MB result files into 2 GB files), so the queue filled up. He's now running merge jobs again and the bottleneck is gone.
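
For a rough sense of scale, the merge arithmetic can be sketched as below. This is only an illustration: it assumes decimal units (1 GB = 1000 MB) and that every result file is exactly the nominal 60 MB, neither of which is stated above, and the function names are hypothetical, not part of any CMS or BOINC tooling.

```python
# Back-of-the-envelope sketch; sizes are the nominal figures quoted in the post.
SMALL_FILE_MB = 60      # nominal size of one result file
MERGED_FILE_MB = 2000   # 2 GB, ASSUMING decimal units (1 GB = 1000 MB)


def files_per_merge(small_mb: int = SMALL_FILE_MB,
                    merged_mb: int = MERGED_FILE_MB) -> int:
    """Number of whole small files that fit into one merged file."""
    return merged_mb // small_mb


def unmerged_backlog_mb(files_waiting: int,
                        small_mb: int = SMALL_FILE_MB) -> int:
    """Space occupied by result files still waiting for a merge job.

    Hypothetical helper: shows why a queue grows while merges are stopped.
    """
    return files_waiting * small_mb


if __name__ == "__main__":
    print(files_per_merge())          # about 33 small files per merged file
    print(unmerged_backlog_mb(1000))  # MB held by 1000 unmerged files
```

Under these assumptions, one merge job consumes roughly 33 result files, so every stalled merge leaves that many small files sitting in the queue.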



©2022 CERN