Message boards : Theory Application : Theory tasks failing

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 959
Credit: 6,331,938
RAC: 1,589
Message 32745 - Posted: 10 Oct 2017, 6:36:18 UTC

All Theory tasks are failing on my 32-bit Win10 host:

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10416365

Exit status: EXIT_NO_SUB_TASKS

Guest Log: 10/10/17 08:29:00 **** condor_startd (condor_STARTD) pid 3709 EXITING WITH STATUS 0
Guest Log: [ERROR] No jobs were available to run.
Guest Log: [INFO] Shutting Down.
ID: 32745
Geoff Harmer
Joined: 11 Aug 11
Posts: 8
Credit: 3,542,159
RAC: 232
Message 32751 - Posted: 10 Oct 2017, 10:40:45 UTC - in response to Message 32745.  
Last modified: 10 Oct 2017, 10:44:09 UTC

Exactly the same for me on my 32-bit Windows Vista machine on 9 and 10 Oct.

https://lhcathome.cern.ch/lhcathome/results.php?userid=456564


Hope that helps.
Kind regards
Geoff
ID: 32751
Mumak
Joined: 14 Feb 14
Posts: 5
Credit: 16,906,689
RAC: 72
Message 32753 - Posted: 10 Oct 2017, 11:16:10 UTC

Same here. First there were some problems with Condor, then no tasks available.
ID: 32753
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 943
Credit: 40,279,511
RAC: 11,515
Message 32767 - Posted: 10 Oct 2017, 20:15:51 UTC

I keep getting these on x64:

[ERROR] Condor exited after 2373s without running a job.

Each task runs for over an hour, so I used to assume they were going to run and that I wouldn't have to keep checking all of the tasks on all of my computers.

Now I can't trust them to be running, and I also can't leave them set to *Allow New Tasks* only to get these *Computer Error* tasks that are not actually computer errors.

I can't get much done this way, so I will have to suspend all of mine until something is done.
Volunteer Mad Scientist For Life
ID: 32767
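
For anyone who wants to check a saved guest log for the two error lines quoted in the posts above, here is a minimal Python sketch. The patterns are copied from the log lines in this thread only; the function name `summarize_guest_log` and the returned dictionary layout are made up for illustration and are not part of BOINC, HTCondor, or the Theory application.

```python
import re

# Patterns taken from the log lines quoted in this thread. Other messages
# that a real guest log may contain are not covered (assumption).
NO_JOBS = re.compile(r"\[ERROR\] No jobs were available to run\.")
CONDOR_EXIT = re.compile(
    r"\[ERROR\] Condor exited after (\d+)s without running a job\."
)


def summarize_guest_log(lines):
    """Count 'no jobs' errors and collect the seconds reported by Condor exits.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    no_jobs = 0
    idle_seconds = []
    for line in lines:
        if NO_JOBS.search(line):
            no_jobs += 1
        match = CONDOR_EXIT.search(line)
        if match:
            idle_seconds.append(int(match.group(1)))
    return {"no_jobs": no_jobs, "idle_seconds": idle_seconds}


if __name__ == "__main__":
    sample = [
        "Guest Log: [ERROR] No jobs were available to run.",
        "Guest Log: [INFO] Shutting Down.",
        "[ERROR] Condor exited after 2373s without running a job.",
    ]
    print(summarize_guest_log(sample))
```

Feeding it the lines of a task's stderr output would show at a glance how much run time was spent before Condor gave up without a job.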
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 959
Credit: 6,331,938
RAC: 1,589
Message 32822 - Posted: 13 Oct 2017, 10:13:18 UTC

For me, Theory is working again on both Win32 and Win64.
ID: 32822
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 943
Credit: 40,279,511
RAC: 11,515
Message 32837 - Posted: 16 Oct 2017, 2:05:32 UTC

Guest Log: [ERROR] Condor exited after 57012s without running a job.

More wasted time with these Theory tasks lately, and I have to start them all running after 2am just to get the internet speed needed for the dreaded VB.

https://lhcathome.cern.ch/lhcathome/results.php?userid=5472
Volunteer Mad Scientist For Life
ID: 32837
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1448
Credit: 77,142,965
RAC: 95,654
Message 33666 - Posted: 5 Jan 2018, 7:23:43 UTC

Since this morning all new Theory VMs are failing after a few minutes.
Mostly with "207 (0x000000CF) EXIT_NO_SUB_TASKS" but also with "206 (0x000000CE) EXIT_INIT_FAILURE".
ID: 33666
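
The two exit statuses quoted above are shown in decimal, hex, and symbolic form. As a small sketch of how the three relate, the Python snippet below rebuilds that string from the decimal code. The table contains only the two codes named in this thread, and `describe_exit` is a hypothetical helper, not a BOINC function.

```python
# Only the two codes quoted in this thread; other exit codes are
# deliberately left out rather than guessed.
EXIT_CODES = {
    207: "EXIT_NO_SUB_TASKS",
    206: "EXIT_INIT_FAILURE",
}


def describe_exit(code):
    """Format a code as 'decimal (0xHEX) NAME', as seen in the task results."""
    name = EXIT_CODES.get(code, "UNKNOWN")
    return f"{code} (0x{code:08X}) {name}"


if __name__ == "__main__":
    for code in (207, 206):
        print(describe_exit(code))
```

This can help when searching result pages, since some views show only the decimal or only the hex value.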
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,092,623
RAC: 2,454
Message 33679 - Posted: 6 Jan 2018, 9:24:33 UTC - in response to Message 33666.  

Same here, unfortunately :-(
ID: 33679
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,092,623
RAC: 2,454
Message 34027 - Posted: 22 Jan 2018, 11:12:04 UTC

For the past few hours, all Theory tasks have been failing after about half an hour with "207 (0x000000CF) EXIT_NO_SUB_TASKS"

Is the WMAgent down?
ID: 34027
Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 122
Credit: 2,579
RAC: 0
Message 34033 - Posted: 22 Jan 2018, 13:14:06 UTC - in response to Message 34027.  

For the past few hours, all Theory tasks have been failing after about half an hour with "207 (0x000000CF) EXIT_NO_SUB_TASKS"

Is the WMAgent down?


Thanks for the heads-up. I just asked our system manager (Nils) about it and he replied:

"This is probably because the Condor node handling jobs for Theory is being updated for Spectre and Meltdown today.

https://cern.service-now.com/service-portal/view-outage.do?n=OTG0041682

Should be back again soon.

Cheers, Nils"
ID: 34033
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 943
Credit: 40,279,511
RAC: 11,515
Message 34039 - Posted: 22 Jan 2018, 15:43:12 UTC - in response to Message 34033.  
Last modified: 22 Jan 2018, 16:30:48 UTC



Thanks for the heads-up. I just asked our system manager (Nils) about it and he replied:

"This is probably because the Condor node handling jobs for Theory is being updated for Spectre and Meltdown today.

https://cern.service-now.com/service-portal/view-outage.do?n=OTG0041682

Should be back again soon.
"


Hello Ben,

Yes, we had the same problems with these over at LHC-dev, so I had 7 multi-core tasks crash (they did not start up). I just got another 6 of them up and running, so maybe it was taken care of... just in time (as I mentioned over there).

EDIT: Well, it looks like they are still having problems.

They do make it past HTCondor Ping, but then crash after a few minutes with:
[ERROR] Condor exited after 758s without running a job.

So far, after a few tries, I may have only ONE multi-core task running. The other 5 are past HTCondor Ping, but I don't trust them to run. I will watch them, but I have set them not to fetch new ones if these fail again, and I will get back to running all those AVX tasks I still have.
Volunteer Mad Scientist For Life
ID: 34039
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,092,623
RAC: 2,454
Message 34040 - Posted: 22 Jan 2018, 19:05:16 UTC

Same or similar problem here with the Condor server:
...
guest Log: 01/22/18 18:42:57 CCBListener: connection to CCB server vccondor01.cern.ch failed
...
All the tasks from this afternoon failed :-(

I have often had the same experience in the past, including with CMS tasks: no connection to the Condor server.
No idea why the Condor server causes problems so often, or whether the people in charge have ever looked into it.
ID: 34040
marmot
Joined: 5 Nov 15
Posts: 127
Credit: 6,221,829
RAC: 0
Message 34041 - Posted: 23 Jan 2018, 4:08:00 UTC - in response to Message 34040.  

The three Theory WU that are still left from late yesterday are still calculating but the new WU can get MCPlots work.

Error count is near 300 on all 3 servers.

Switching to backup jobs till morning. (10pm local time now).
ID: 34041



©2020 CERN