Message boards : Theory Application : Theory tasks all failing
maeax

Joined: 2 May 07
Posts: 2096
Credit: 159,611,541
RAC: 141,207
Message 35299 - Posted: 17 May 2018, 18:44:20 UTC

ID: 35299 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35365 - Posted: 25 May 2018, 2:27:11 UTC

Theory tasks have been running fine here for weeks. Suddenly today they are all failing on my Linux host. OK, they don't exactly fail. They run for about 3 minutes then complete, upload and verify.

I haven't updated anything for over a week and the Theory tasks have all been running fine since the last update until today. Is it a problem with Condor?

Logs from tasks run on Linux are showing:

2018-05-24 19:02:19 (464): Guest Log: [INFO] New Job Starting in slot1
2018-05-24 19:02:19 (464): Guest Log: [INFO] Condor JobID:  138749.54 in slot1
2018-05-24 19:02:24 (464): Guest Log: [INFO] MCPlots JobID: 44326589 in slot1
2018-05-24 19:02:32 (464): VM Completion File Detected.
2018-05-24 19:02:32 (464): Powering off VM.
2018-05-24 19:02:33 (464): Successfully stopped VM.
2018-05-24 19:02:33 (464): Deregistering VM. (boinc_4ec7bc9bb8991e34, slot#0)
2018-05-24 19:02:33 (464): Removing network bandwidth throttle group from VM.
2018-05-24 19:02:33 (464): Removing storage controller(s) from VM.
2018-05-24 19:02:33 (464): Removing VM from VirtualBox.
2018-05-24 19:02:33 (464): Removing virtual disk drive from VirtualBox.
19:02:38 (464): called boinc_finish(0)


Note that it takes only 13 seconds to go from "New Job Starting in slot1" to "VM Completion File Detected".
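For anyone who wants to measure the same gap in their own stderr output, here is a minimal Python sketch. It assumes every line of interest starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the log above; the helper name `log_time` is made up for illustration.

```python
from datetime import datetime

def log_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a stderr line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

# Two lines copied from the log above.
start = "2018-05-24 19:02:19 (464): Guest Log: [INFO] New Job Starting in slot1"
end = "2018-05-24 19:02:32 (464): VM Completion File Detected."

elapsed = (log_time(end) - log_time(start)).total_seconds()
print(elapsed)  # 13.0
```

Lines without the date prefix, such as the final `boinc_finish` line, would need separate handling.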

My Windows host is acting up too, but ignore it. I tried to upgrade BOINC and VBox but it went all weird. It needs a BOINC and VBox uninstall/reinstall, I think.
ID: 35365 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,563,023
RAC: 119,636
Message 35366 - Posted: 25 May 2018, 3:34:12 UTC - in response to Message 35365.  

@bronco: it's indeed strange that your Theory tasks are that short (a few minutes only). But from what I can see from stderr, all is "normal", the tasks seem to be classified as valid, and you get credit for them.

On my Windows hosts, the tasks still run for about 12 1/2 hours.
Maybe it's only Linux which receives these ultra-short tasks.
ID: 35366 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35371 - Posted: 25 May 2018, 10:37:29 UTC - in response to Message 35366.  

I didn't scroll down far enough in the stderr reports. The problem was that it (BOINC or Condor, not sure which) was not able to start a VM. My attempts to fix that problem seem to have only made things worse. When I first reported the problem, BOINC was at least able to detect a VBox installation and would at least download Theory tasks. Now, after my mucking about, BOINC won't even fetch Theory tasks and complains that VBox is not installed. But VBox is most definitely installed. It looks like a problem with DKMS (Dynamic Kernel Module Support, the Linux framework that rebuilds out-of-tree kernel modules such as VirtualBox's) being configured incorrectly. I believe that configuration got messed up when I tried to update VBox from 5.1 to 5.2, but I'm not sure. I seem to recall posts from others here saying they had trouble with VBox 5.2 as well (but not necessarily the same troubles I am having) and ended up rolling back to 5.1.
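One thing worth ruling out when a DKMS build may have gone wrong is whether the VirtualBox kernel module (`vboxdrv`) is loaded at all. The following is a minimal Python sketch, assuming a Linux host where `/proc/modules` is readable; the helper name `module_loaded` is hypothetical and this is not how BOINC itself detects VirtualBox.

```python
from pathlib import Path

def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed as a module in /proc/modules content.

    Each line of /proc/modules starts with the module name, so only the
    first whitespace-separated field is compared.
    """
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )

if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    # Fall back to an empty listing on systems without /proc/modules.
    text = proc_modules.read_text() if proc_modules.exists() else ""
    print("vboxdrv loaded:", module_loaded("vboxdrv", text))
```

If this prints `False` on a host where VirtualBox is installed, the module was probably not built or not loaded for the running kernel, which would fit a DKMS problem.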

This is an Ubuntu 16.04 installation. I've been thinking it's time to reformat the disk and upgrade to version 18.04. Perhaps that's the most expedient way to fix whatever is fubar.
ID: 35371 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,746,750
RAC: 10,655
Message 35461 - Posted: 9 Jun 2018, 7:38:34 UTC

The server was once again having problems, even with Theory tasks. I just happened to check to see if there were any errors and found over 100 on the three 8-core PCs I set for Theory tasks.

They all seem to be running again but there must have been quite a few of these *yesterday*

But the last 12 hours they have all been running as they should.


2018-06-07 22:46:34 (6868): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125

2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!

2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] 0

2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443

2018-06-07 22:46:40 (6868): Guest Log: [DEBUG] nc: getaddrinfo: Name or service not known

2018-06-07 22:46:40 (6868): Guest Log: [DEBUG] 1

2018-06-07 22:46:40 (6868): Guest Log: [ERROR] Could not connect to vccs1.cern.ch on port 443

2018-06-07 22:46:40 (6868): Guest Log: [INFO] Shutting Down.

2018-06-07 22:46:40 (6868): VM Completion File Detected.
2018-06-07 22:46:40 (6868): VM Completion Message: Could not connect to vccs1.cern.ch on port 443
.
2018-06-07 22:46:40 (6868): Powering off VM.
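
The `nc: getaddrinfo: Name or service not known` line means the name vccs1.cern.ch could not be resolved at that moment, so the connection was never attempted. A rough way to repeat the same kind of reachability test from the host is sketched below in Python. Note the assumptions: the helper name `can_connect` is made up, and the check in the log ran inside the VM guest, so a result from the host is only a hint.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections
        # and timeouts, which are all subclasses of OSError.
        return False
```

Calling `can_connect("lhchomeproxy.cern.ch", 3125)` and `can_connect("vccs1.cern.ch", 443)` would mirror the two tests shown in the log.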
ID: 35461 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,746,750
RAC: 10,655
Message 35600 - Posted: 21 Jun 2018, 8:06:32 UTC

As usual I make a rare trip away from home for a few hours and even the Theory tasks decided to give me several Invalids.

Mostly that typical *no jobs* reason and a few

[ERROR] Enviroment setup script /cvmfs/grid.cern.ch/emi3wn-latest/etc/profile.d/setup-wn-example.sh does not exist.

2018-06-20 21:10:22 (6368): Guest Log: [ERROR] The x509 proxy creation failed.

2018-06-20 21:10:22 (6368): Guest Log: [INFO] Shutting Down.

But since they were all loaded and the server decided to behave they all started running Valids again.

(I guess it tells me I can't be away until midnight ever) but it did that server thing where it makes up new words in the stderr...
ID: 35600 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2096
Credit: 159,611,541
RAC: 141,207
Message 35601 - Posted: 21 Jun 2018, 8:40:14 UTC

24/7 ;-)
See the ATLAS forum. Yesterday there was a lot of work for CERN IT to do.
ID: 35601 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,746,750
RAC: 10,655
Message 36483 - Posted: 18 Aug 2018, 1:06:16 UTC

I just started having a problem with the server on all of my machines.

40 of these so far so I suspended all that have not started yet.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=205113985

(same thing over at -dev)


2018-08-17 17:48:57 (2120): Guest Log: [DEBUG] DC_NOP failed!

2018-08-17 17:48:57 (2120): Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method

2018-08-17 17:48:57 (2120): Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI

2018-08-17 17:48:57 (2120): Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:17)

2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:46 recognized DC_NOP as command name, using command 60011.

2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:55 Condor GSI authentication failure

2018-08-17 17:48:57 (2120): Guest Log: GSS Major Status: Authentication Failed

2018-08-17 17:48:57 (2120): Guest Log: GSS Minor Status Error Chain:

2018-08-17 17:48:57 (2120): Guest Log: globus_gss_assist: Error during context initialization

2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Could not verify credential

2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Could not verify credential

2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired

2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:56 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.

2018-08-17 17:49:19 (2120): Guest Log: [ERROR] Could not ping HTCondor.

2018-08-17 17:49:19 (2120): Guest Log: [INFO] Shutting Down.
ID: 36483 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36486 - Posted: 18 Aug 2018, 2:37:11 UTC - in response to Message 36483.  

Same problem here, same errors.
ID: 36486 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36487 - Posted: 18 Aug 2018, 5:43:00 UTC - in response to Message 36486.  

Working now.
ID: 36487 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,746,750
RAC: 10,655
Message 36488 - Posted: 18 Aug 2018, 6:54:27 UTC - in response to Message 36487.  

Working now.


Thanks for the update, all back to work here too.
ID: 36488 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,746,750
RAC: 10,655
Message 36590 - Posted: 31 Aug 2018, 3:24:17 UTC - in response to Message 36488.  

Well, as usual, soon after these seemed to be running perfectly I find an error page with 20 of these time wasters.

Guest Log: [ERROR] Condor exited after 1064s without running a job.
(9144): Guest Log: [INFO] Shutting Down.
(9144): VM Completion File Detected.
(9144): VM Completion Message: Condor exited after 1064s without running a job.

They usually take about 30 minutes each before they end up with this error.

AND then a few of these that we had last week.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=206286237

[ERROR] Could not get an x509 credential
(7800): Guest Log: [ERROR] The x509 proxy creation failed.
(7800): Guest Log: [INFO] Shutting Down.
(7800): VM Completion File Detected.
(7800): VM Completion Message: The x509 proxy creation failed

And also one of these after almost 9 hours running https://lhcathome.cern.ch/lhcathome/result.php?resultid=206220381

VBoxManage.exe: error: Could not find a registered machine named 'boinc_e9ec224720d72234'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2834 of file VBoxManageInfo.cpp

And several Valids with only about one hour of running time.
ID: 36590 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2096
Credit: 159,611,541
RAC: 141,207
Message 36591 - Posted: 31 Aug 2018, 4:40:56 UTC - in response to Message 36590.  


VBoxManage.exe: error: Could not find a registered machine named 'boinc_e9ec224720d72234'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown

I also saw this error once, at 18:00 UTC on 29 Aug 2018, for an ATLAS task.
ID: 36591 · Report as offensive     Reply Quote
abhi506

Joined: 3 May 14
Posts: 1
Credit: 5,933,057
RAC: 1,279
Message 37727 - Posted: 6 Jan 2019, 18:55:09 UTC

Hi, Theory tasks are failing on the machine linked below, and the error shown below appears consistently. So far I have not been able to run a single successful Theory task on this machine. My other machine completes Theory tasks successfully, although its VirtualBox version is different. Moreover, both machines are successfully crunching ATLAS tasks.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=213392264
https://lhcathome.cern.ch/lhcathome/result.php?resultid=213316972
https://lhcathome.cern.ch/lhcathome/result.php?resultid=213059180

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10571131


2019-01-04 22:57:07 (4156): Guest Log: [DEBUG] HTCondor ping

2019-01-04 22:57:09 (4156): Guest Log: [DEBUG] 1

2019-01-04 22:57:10 (4156): Guest Log: [DEBUG] 1

2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] 1

2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] DC_NOP failed!

2019-01-04 22:57:11 (4156): Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method

2019-01-04 22:57:11 (4156): Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI

2019-01-04 22:57:11 (4156): Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:17)

2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:10 recognized DC_NOP as command name, using command 60011.

2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:11 Condor GSI authentication failure

2019-01-04 22:57:11 (4156): Guest Log: GSS Major Status: Authentication Failed

2019-01-04 22:57:11 (4156): Guest Log: GSS Minor Status Error Chain:

2019-01-04 22:57:11 (4156): Guest Log: globus_gss_assist: Error during context initialization

2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Could not verify credential

2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Could not verify credential

2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired

2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:11 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.

2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor.

2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down.

Any help will be greatly appreciated.
ID: 37727 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,110,114
RAC: 127,692
Message 37728 - Posted: 6 Jan 2019, 21:09:52 UTC - in response to Message 37727.  

It's the HTCondor ping that fails:
2019-01-04 22:57:07 (4156): Guest Log: [DEBUG] HTCondor ping
2019-01-04 22:57:09 (4156): Guest Log: [DEBUG] 1
2019-01-04 22:57:10 (4156): Guest Log: [DEBUG] 1
2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] 1
2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor.
2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down.

Possible reasons/solutions:
1. Your firewall blocks the Condor connection.
You may check your firewall logs for related issues and open the blocked ports.

2. A damaged Theory vdi file.
You may set LHC@home to "no new tasks", wait until all work has been reported to the project server and then do a project reset.


ATLAS works as it doesn't use HTCondor.
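
Since the decisive `[ERROR]` line is easy to miss between the DEBUG output, a small filter can help when reading long stderr reports. This is only a sketch under the assumption that the guest marks its errors with a literal `[ERROR]` tag, as in the logs in this thread; the names `ERROR_RE` and `guest_errors` are made up for illustration.

```python
import re

# Matches the message part of lines such as
# "... Guest Log: [ERROR] Could not ping HTCondor."
ERROR_RE = re.compile(r"\[ERROR\]\s*(.+)")

def guest_errors(stderr_text: str) -> list[str]:
    """Return the text of every [ERROR] line found in a stderr dump."""
    errors = []
    for line in stderr_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            errors.append(match.group(1).strip())
    return errors

sample = """\
2019-01-04 22:57:07 (4156): Guest Log: [DEBUG] HTCondor ping
2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] 1
2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor.
2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down.
"""
print(guest_errors(sample))  # ['Could not ping HTCondor.']
```

The same filter would also pick up the other messages reported in this thread, for example "The x509 proxy creation failed." or "Condor exited after 1064s without running a job."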
ID: 37728 · Report as offensive     Reply Quote
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 37821 - Posted: 25 Jan 2019, 12:02:24 UTC

Hi,

All my Theory tasks are failing due to an incorrect proxy setup, whereas ATLAS tasks are working fine:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=214603305


Theory:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=214535693


Any suggestions?

My system will keep sending failed Theory tasks back to the server...
ID: 37821 · Report as offensive     Reply Quote


©2024 CERN