Theory tasks all failing
Author | Message |
---|---|
Joined: 2 May 07 Posts: 2098 Credit: 159,736,506 RAC: 143,982 |
|
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Theory tasks have been running fine here for weeks. Suddenly today they are all failing on my Linux host. OK, they don't exactly fail: they run for about 3 minutes, then complete, upload and verify. I haven't updated anything for over a week, and the Theory tasks had all been running fine since the last update until today. Is it a problem with Condor? Logs from tasks run on Linux are showing: 2018-05-24 19:02:19 (464): Guest Log: [INFO] New Job Starting in slot1 2018-05-24 19:02:19 (464): Guest Log: [INFO] Condor JobID: 138749.54 in slot1 2018-05-24 19:02:24 (464): Guest Log: [INFO] MCPlots JobID: 44326589 in slot1 2018-05-24 19:02:32 (464): VM Completion File Detected. 2018-05-24 19:02:32 (464): Powering off VM. 2018-05-24 19:02:33 (464): Successfully stopped VM. 2018-05-24 19:02:33 (464): Deregistering VM. (boinc_4ec7bc9bb8991e34, slot#0) 2018-05-24 19:02:33 (464): Removing network bandwidth throttle group from VM. 2018-05-24 19:02:33 (464): Removing storage controller(s) from VM. 2018-05-24 19:02:33 (464): Removing VM from VirtualBox. 2018-05-24 19:02:33 (464): Removing virtual disk drive from VirtualBox. 19:02:38 (464): called boinc_finish(0) Note that it takes only 13 secs to go from "New Job Starting in slot1" to "VM Completion File Detected". My Windows host is acting up too, but ignore it. I tried to upgrade BOINC and VBox but it went all weird. Needs a BOINC/VBox uninstall and reinstall, I think. |
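For anyone who wants to check the same interval in a saved stderr file, here is a minimal Python sketch. It assumes the line layout seen in the excerpt above (date, time, a process id in parentheses, then the message). The function name `seconds_between` is made up for this illustration and is not part of BOINC.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: "YYYY-MM-DD HH:MM:SS (pid): message"
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")

def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker; None if either one is missing."""
    started = None
    for line in lines:
        match = STAMP.match(line.strip())
        if not match:
            continue  # skip lines without a full date stamp
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        if started is None and start_marker in text:
            started = when
        elif started is not None and end_marker in text:
            return (when - started).total_seconds()
    return None

# Lines copied from the log quoted in the post above.
log = [
    "2018-05-24 19:02:19 (464): Guest Log: [INFO] New Job Starting in slot1",
    "2018-05-24 19:02:19 (464): Guest Log: [INFO] Condor JobID: 138749.54 in slot1",
    "2018-05-24 19:02:24 (464): Guest Log: [INFO] MCPlots JobID: 44326589 in slot1",
    "2018-05-24 19:02:32 (464): VM Completion File Detected.",
]
print(seconds_between(log, "New Job Starting", "VM Completion File Detected"))  # 13.0
```

The result agrees with the 13 seconds counted by hand. Lines without a date, such as the final `boinc_finish` line, are simply skipped.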
Joined: 18 Dec 15 Posts: 1688 Credit: 103,660,700 RAC: 121,257 |
@bronco: it's indeed strange that your Theory tasks are that short (only a few minutes). But from what I can see in stderr, all is "normal": the tasks are classified as valid and you get credit for them. On my Windows hosts, the tasks still run for about 12 1/2 hours. Maybe only Linux hosts are receiving these ultra-short tasks. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I didn't scroll down far enough in the stderr reports. The problem was that it (BOINC or Condor, not sure which) was not able to start a VM. My attempts to fix that problem seem to have only made things worse. When I first reported the problem, BOINC was at least able to detect a VBox installation and would at least download Theory tasks. Now, after my mucking about, BOINC won't even fetch Theory tasks and complains that VBox is not installed. But VBox is most definitely installed. It looks like the problem is dkms (Dynamic Kernel Module Support on Linux) being configured incorrectly. I believe that configuration got messed up when I tried to update VBox from 5.1 to 5.2, but I'm not sure. I seem to recall posts from others here saying they had trouble with VBox 5.2 as well (but not necessarily the same troubles I am having) and ended up rolling back to 5.1. This is an Ubuntu 16.04 installation. I've been thinking it's time to reformat the disk and upgrade to version 18.04. Perhaps that's the most expedient way to fix whatever is fubar. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,749,586 RAC: 10,234 |
The server was once again having problems, even with Theory tasks. I just happened to check for errors and found over 100 on the three 8-core PCs I set up for Theory tasks. They all seem to be running again, but there must have been quite a few of these *yesterday*. For the last 12 hours they have all been running as they should. 2018-06-07 22:46:34 (6868): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125 2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded! 2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] 0 2018-06-07 22:46:38 (6868): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443 2018-06-07 22:46:40 (6868): Guest Log: [DEBUG] nc: getaddrinfo: Name or service not known 2018-06-07 22:46:40 (6868): Guest Log: [DEBUG] 1 2018-06-07 22:46:40 (6868): Guest Log: [ERROR] Could not connect to vccs1.cern.ch on port 443 2018-06-07 22:46:40 (6868): Guest Log: [INFO] Shutting Down. 2018-06-07 22:46:40 (6868): VM Completion File Detected. 2018-06-07 22:46:40 (6868): VM Completion Message: Could not connect to vccs1.cern.ch on port 443 . 2018-06-07 22:46:40 (6868): Powering off VM. |
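The log above shows the guest's `nc` test failing at name resolution ("getaddrinfo: Name or service not known") rather than at the TCP connection. A rough way to tell those two stages apart from the host is sketched below in Python. This is only an illustration: `check_endpoint` is a made-up helper name, and a check run on the host does not necessarily see the same DNS or firewall path as the VM guest, so treat the result as a hint.

```python
import socket

def check_endpoint(host, port, timeout=5.0):
    """Return 'dns_failure', 'connect_failure' or 'ok' for host:port."""
    try:
        # Name resolution: the stage that failed in the log above.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    try:
        # Plain TCP connect, similar in spirit to the guest's nc test.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return "connect_failure"
```

Calling `check_endpoint("vccs1.cern.ch", 443)` and `check_endpoint("lhchomeproxy.cern.ch", 3125)` would exercise the two endpoints named in the log. A `dns_failure` result points at name resolution, while `connect_failure` suggests a blocked or closed port.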
Joined: 24 Oct 04 Posts: 1127 Credit: 49,749,586 RAC: 10,234 |
As usual, I make a rare trip away from home for a few hours and even the Theory tasks decide to give me several Invalids. Mostly that typical *no jobs* reason, and a few of these: [ERROR] Enviroment setup script /cvmfs/grid.cern.ch/emi3wn-latest/etc/profile.d/setup-wn-example.sh does not exist. 2018-06-20 21:10:22 (6368): Guest Log: [ERROR] The x509 proxy creation failed. 2018-06-20 21:10:22 (6368): Guest Log: [INFO] Shutting Down. But since they were all loaded and the server decided to behave, they all started running Valids again. (I guess it tells me I can't ever be away until midnight.) It also did that server thing where it makes up new words in the stderr... |
Joined: 2 May 07 Posts: 2098 Credit: 159,736,506 RAC: 143,982 |
24/7 ;-) See the ATLAS forum. Yesterday there was a lot of work for CERN IT to do. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,749,586 RAC: 10,234 |
I just started having a problem with the server on all mine. 40 of these so far so I suspended all that have not started yet. https://lhcathome.cern.ch/lhcathome/result.php?resultid=205113985 (same thing over at -dev) 2018-08-17 17:48:57 (2120): Guest Log: [DEBUG] DC_NOP failed! 2018-08-17 17:48:57 (2120): Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method 2018-08-17 17:48:57 (2120): Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI 2018-08-17 17:48:57 (2120): Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:17) 2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:46 recognized DC_NOP as command name, using command 60011. 2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:55 Condor GSI authentication failure 2018-08-17 17:48:57 (2120): Guest Log: GSS Major Status: Authentication Failed 2018-08-17 17:48:57 (2120): Guest Log: GSS Minor Status Error Chain: 2018-08-17 17:48:57 (2120): Guest Log: globus_gss_assist: Error during context initialization 2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Could not verify credential 2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Could not verify credential 2018-08-17 17:48:57 (2120): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired 2018-08-17 17:48:57 (2120): Guest Log: 08/18/18 02:48:56 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY. 2018-08-17 17:49:19 (2120): Guest Log: [ERROR] Could not ping HTCondor. 2018-08-17 17:49:19 (2120): Guest Log: [INFO] Shutting Down. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Same problem here, same errors. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Working now. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,749,586 RAC: 10,234 |
Working now. Thanks for the update, all back to work here too. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,749,586 RAC: 10,234 |
Well, as usual, soon after these seemed to be running perfectly I find an error page with 20 of these time wasters. Guest Log: [ERROR] Condor exited after 1064s without running a job. (9144): Guest Log: [INFO] Shutting Down. (9144): VM Completion File Detected. (9144): VM Completion Message: Condor exited after 1064s without running a job. They usually take about 30 minutes each before they end up with this error. AND then a few of these that we had last week: https://lhcathome.cern.ch/lhcathome/result.php?resultid=206286237 [ERROR] Could not get an x509 credential (7800): Guest Log: [ERROR] The x509 proxy creation failed. (7800): Guest Log: [INFO] Shutting Down. (7800): VM Completion File Detected. (7800): VM Completion Message: The x509 proxy creation failed And also one of these after almost 9 hours of running: https://lhcathome.cern.ch/lhcathome/result.php?resultid=206220381 VBoxManage.exe: error: Could not find a registered machine named 'boinc_e9ec224720d72234' VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2834 of file VBoxManageInfo.cpp And several Valids with only about one hour of running time. |
Joined: 2 May 07 Posts: 2098 Credit: 159,736,506 RAC: 143,982 |
I also saw this error once for an ATLAS task, on 29 Aug 2018 at 18 UTC. |
Joined: 3 May 14 Posts: 1 Credit: 5,933,057 RAC: 1,279 |
Hi, Theory tasks are failing on the machine linked below, and the error below is shown consistently. So far I have not been able to run a single successful Theory task on this machine. My other machine completes Theory tasks successfully, though its VirtualBox version is different. Moreover, both machines are successfully crunching ATLAS tasks. https://lhcathome.cern.ch/lhcathome/result.php?resultid=213392264 https://lhcathome.cern.ch/lhcathome/result.php?resultid=213316972 https://lhcathome.cern.ch/lhcathome/result.php?resultid=213059180 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10571131 2019-01-04 22:57:07 (4156): Guest Log: [DEBUG] HTCondor ping 2019-01-04 22:57:09 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:10 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] DC_NOP failed! 2019-01-04 22:57:11 (4156): Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method 2019-01-04 22:57:11 (4156): Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI 2019-01-04 22:57:11 (4156): Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:17) 2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:10 recognized DC_NOP as command name, using command 60011. 2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:11 Condor GSI authentication failure 2019-01-04 22:57:11 (4156): Guest Log: GSS Major Status: Authentication Failed 2019-01-04 22:57:11 (4156): Guest Log: GSS Minor Status Error Chain: 2019-01-04 22:57:11 (4156): Guest Log: globus_gss_assist: Error during context initialization 2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Could not verify credential 2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Could not verify credential 2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired 2019-01-04 22:57:11 (4156): Guest Log: 01/05/19 02:27:11 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY. 2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor. 2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down. Any help will be greatly appreciated. |
Joined: 15 Jun 08 Posts: 2411 Credit: 226,225,131 RAC: 130,595 |
It's the HTCondor ping that fails: 2019-01-04 22:57:07 (4156): Guest Log: [DEBUG] HTCondor ping 2019-01-04 22:57:09 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:10 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:11 (4156): Guest Log: [DEBUG] 1 2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor. 2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down. Possible reasons/solutions: 1. Your firewall blocks the Condor connection. You may check your firewall logs for related issues and open the blocked ports. 2. A damaged Theory vdi file. You may set LHC@home to "no new tasks", wait until all work has been reported to the project server and then do a project reset. ATLAS works as it doesn't use HTCondor. |
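Several different failure messages have come up in this thread, and a single stderr can contain more than one of them. As a rough aid for reading a long stderr, the Python sketch below scans a text for the error strings quoted in the posts above and lists the ones it finds. The strings are copied from those posts. The short labels and the name `find_signatures` are this sketch's own wording and are not BOINC or LHC@home terminology. The sketch reports what is present and does not decide the cause.

```python
# Signatures copied from stderr excerpts quoted in this thread.
# The labels are illustrative wording only, not official terminology.
SIGNATURES = [
    ("Could not connect to vccs1.cern.ch", "VCCS endpoint unreachable"),
    ("Invalid CRL: The available CRL has expired",
     "expired CRL during GSI authentication"),
    ("Could not ping HTCondor", "HTCondor ping failed"),
    ("The x509 proxy creation failed", "x509 proxy creation failed"),
    ("without running a job", "Condor exited without running a job"),
    ("VBOX_E_OBJECT_NOT_FOUND", "VirtualBox could not find the VM"),
]

def find_signatures(stderr_text):
    """Return labels of all known signatures present, in list order."""
    return [label for needle, label in SIGNATURES if needle in stderr_text]

# Three lines taken from the 2019-01-04 log quoted earlier in the thread.
sample = "\n".join([
    "2019-01-04 22:57:11 (4156): Guest Log: globus_gsi_callback_module: "
    "Invalid CRL: The available CRL has expired",
    "2019-01-04 22:57:12 (4156): Guest Log: [ERROR] Could not ping HTCondor.",
    "2019-01-04 22:57:12 (4156): Guest Log: [INFO] Shutting Down.",
])
print(find_signatures(sample))
# ['expired CRL during GSI authentication', 'HTCondor ping failed']
```

On the 2019-01-04 excerpt this reports both the ping failure and the expired-CRL line that precedes it, which may be worth keeping in mind alongside the firewall and vdi suggestions.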
Joined: 1 Feb 06 Posts: 66 Credit: 9,723 RAC: 0 |
Hi, all my Theory tasks are failing due to an incorrect proxy setup, whereas ATLAS tasks are working fine: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214603305 Theory: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214535693 Any suggestions? My system will keep sending failed Theory tasks back to the server... |
©2024 CERN