Message boards : CMS Application : Problems connecting to servers?

Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51099 - Posted: 19 Nov 2024, 17:24:05 UTC

After I noticed about an hour ago that new CMS tasks were available, I first downloaded one and found that the connection to Condor now works well.
So I started CMS on several other hosts, and all the downloaded tasks started okay. However, after some time it became obvious that there were no jobs available - no CPU activity, and the tasks break off after about half an hour :-(
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51100 - Posted: 19 Nov 2024, 17:34:03 UTC - in response to Message 51099.  

Same here.
HTCondor is now on vocms0830.cern.ch (old: vocms0840.cern.ch) and works fine.

@Ivan
The job queue seems to be dry.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51101 - Posted: 20 Nov 2024, 2:44:11 UTC
Last modified: 20 Nov 2024, 3:41:10 UTC

Mine is still not working -

All Tasks

Checking the times on that... Ah.
OK - as I write this, there are reports that it's working now...
[Preferences set to send me CMS...]

Frustrating.
I'll probably have to wait because there's a "10 Day" Theory task running at the moment! (See this post)
Yesterday I set the Project Preferences as below to stop sending me CMS tasks, but I still got them:
Run only the selected applications      SixTrack: yes
                                        sixtracktest: yes
                                        CMS Simulation: no
                                        Theory Simulation: yes
                                        ATLAS Simulation: yes
That's probably because of -
If no work for selected applications is available, accept work from other applications?        yes
Fun with a dash of very dry irony.
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51105 - Posted: 20 Nov 2024, 10:12:52 UTC

Still no jobs are available.
And obviously, task distribution is not stopped automatically - since yesterday evening, tasks keep being sent out and they all fail after about half an hour :-(
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51121 - Posted: 23 Nov 2024, 19:03:51 UTC - in response to Message 51105.  

>> Still no jobs are available.
>> And obviously, task distribution is not stopped automatically - since yesterday evening, tasks keep being sent out and they all fail after about half an hour :-(
For 4 days now, all downloaded CMS tasks have finished after half an hour due to the lack of jobs, and hence they are of no value to the science (although they still get a little credit). What surprises me is that no one at the receiving end of these faulty tasks has noticed it yet. Or, in other words: does no one care what we volunteers submit? This makes me wonder how much sense it makes at all to crunch for LHC ...
Glohr

Joined: 13 Jan 24
Posts: 5
Credit: 2,688,738
RAC: 4,057
Message 51123 - Posted: 24 Nov 2024, 0:29:27 UTC - in response to Message 51121.  

CMS and ATLAS have problems, but I'm getting a few Theory jobs that seem to be running.
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51124 - Posted: 24 Nov 2024, 7:38:01 UTC - in response to Message 51123.  

>> ... but I'm getting a few Theory jobs that seem to be running.
Yes, there are Theory tasks once in a while. But from what I've noticed, they are all "longrunners", so it could well happen that on a slow host they won't finish within the 10-day limit and subsequently error out.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51127 - Posted: 24 Nov 2024, 14:06:38 UTC - in response to Message 51124.  

It's vexing.
Long running tasks and fair play - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6240
This gonna be long - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251

If you get nothing but long Theory jobs using all your CPUs, you can limit their number with
app_config.xml
<app_config>

  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>
and thus allow other jobs to run at the same time. (4 is just an example. It depends on how many of your CPUs you are willing to give to Theory tasks.)

Create the above app_config.xml file and drop it in your LHC@home project data folder, here:
Windows:
C:\ProgramData\BOINC\projects\<your project>\app_config.xml
Linux:
/var/lib/boinc/projects/<your project>/app_config.xml

Instructions for writing app_config.xml files are here
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51130 - Posted: 24 Nov 2024, 16:13:08 UTC - in response to Message 51127.  

>> If you get nothing but long Theory jobs using all your CPUs, you can limit their number with app_config.xml
The number of CPUs used for a given task does not depend on the length of the task. A Theory task is designed to use 1 CPU, regardless of whether it runs for 1 hour or 15 days.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51182 - Posted: 26 Nov 2024, 21:32:20 UTC - in response to Message 51130.  
Last modified: 26 Nov 2024, 21:50:48 UTC

Yes, Theory jobs run with one thread using exactly one CPU core.
Above, in the <app>..</app> section, the line
<max_concurrent>4</max_concurrent>
limits the number of Theory jobs that run at the same time - and, as you note, each Theory work unit uses exactly 1 CPU core.
So it is possible to have more than one Theory job running on your BOINC client at a time - well, if you have a multi-core CPU. With all the 'week long' Theory jobs being sent out at the moment, it may be useful to allow other job types, even from other projects, to run alongside the long Theory jobs.
This
app_config.xml -
<app_config>

  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>
limits the number of Theory jobs that run at the same time. And with one job per CPU core, a maximum of 4 CPU cores will ever be used for Theory jobs while this app_config.xml is in effect.
If you have any more cores available, they will run other, non-Theory job types. That's useful for running some different types of LHC@home tasks concurrently (at the same time), if they're available, or for keeping another project's tasks running while your PC crunches through some long LHC@home Theory tasks.

There are as many ways to use app_config.xml as there are different computers.
These examples are running well on my 8-core system.

Before anything else - it's recommended to leave a couple of CPU cores free to run all the background OS processes. A reliable way to do this is to use the BOINC Manager's "Options -> Computing preferences" to limit the number of CPUs that BOINC uses for its number crunching.
Use at most [75] % of the CPUs
works well with my 8-core CPU.
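
If you prefer doing that from a terminal instead of the Manager, here is a minimal sketch that should have the same effect (my assumptions: a Linux host with the standard /var/lib/boinc data directory, and that you don't already have a global_prefs_override.xml - if you do, edit that file instead of overwriting it):

# Write a local override for "Use at most 75% of the CPUs"
sudo tee /var/lib/boinc/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
   <max_ncpus_pct>75.0</max_ncpus_pct>
</global_preferences>
EOF
# Tell the running client to re-read its local preferences
boinccmd --read_global_prefs_override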

Multi-threaded tasks
NOTE that there are also multi-threaded apps out there, and they use completely different app_config.xml elements for limiting, if you want to, the number of CPU cores that a multi-threaded task will use.
For instance, two of the other job types offered by the LHC@home project, ATLAS and CMS, are multi-threaded, meaning that they use many of your CPUs per job. So just one of these multi-threaded jobs will try to use all the CPUs available to your BOINC client - leaving no room for anything else to run. This may suit your needs. In my 8-core multi-project set-up I prefer a workload that's balanced across different app types and different projects. For me, that means limiting the number of any one particular type of app running at a time, and limiting the total number of tasks any one project can run concurrently. But if you're only going to run one project you can leave out that last limit.
So, in the following app_config.xml, CMS is limited to 1 running job at a time with the <max_concurrent>1</max_concurrent> XML element. The <avg_ncpus>4</avg_ncpus> and <cmdline>--nthreads 4</cmdline> elements limit the number of CPU cores & threads it uses to 4. That allows other job types to run on your remaining free CPUs. Groovy.

app_config.xml
<app_config>

  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>


You can do the same with ATLAS, like this -
app_config.xml
<app_config>

  <project_max_concurrent>4</project_max_concurrent>

  <!-- limiting the concurrent apps run by any one project above allows apps from other projects to run as well -->
  <!-- as long as they have work available. Optional -->

  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

<!-- This Theory app section is now superfluous because of <project_max_concurrent> above -->
<!-- But if you only have LHC work, or have deleted that line -->
<!-- then you may find this section useful -->
  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <!-- nothing to do here! -->
  </app_version>

</app_config>



Instructions for writing app_config.xml files are here.

If you want to write an app_config.xml for your project (or one each for more than one project!), then to use it, you put it in its particular project folder, here:

Windows:
C:\ProgramData\BOINC\projects\<your project>\app_config.xml

Linux:
/var/lib/boinc/projects/<your project>/app_config.xml

Then start your BOINC client.
Or, if it's running already, click "Options -> Read config files" in your BOINC Manager.
And it takes effect.
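
If you'd rather do that last step from a terminal, here's a small sketch (assuming xmllint is installed for the optional syntax check, and that boinccmd from your BOINC install is on the PATH; as far as I know the read_cc_config call also re-reads app_config.xml on current clients - restarting the client always works if in doubt):

# Optional: check the file is well-formed XML before deploying it
xmllint --noout /var/lib/boinc/projects/<your project>/app_config.xml
# Command-line equivalent of "Options -> Read config files"
boinccmd --read_cc_config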
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51197 - Posted: 27 Nov 2024, 8:46:21 UTC

I am continuously getting errors running CMS jobs (1152 jobs ran flawlessly over the last 7 days), with 152 so far (more to come) running into a connection error:

2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]

and an nc command gives:

nc: connectx to 128.142.160.140 port 1094 (tcp) failed: Connection refused

It all ran well until yesterday. Server down?

Any ideas?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51198 - Posted: 27 Nov 2024, 8:59:52 UTC - in response to Message 51197.  

To all:
These CMS failures are not caused on the volunteer side.
Instead, they are caused by major upgrades and system replacements on the CERN side.
Be patient and allow the CERN team to finish their work.



The system at CERN running the HTCondor service was replaced a few days ago, which caused the same type of error.
See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51084
Now it's EOSCMS which does not respond.
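
If you want to check from your side whether the affected endpoint responds again, a quick reachability test like this should do (just a sketch; the IP and port are taken from the reports above, and nc option syntax differs between netcat flavours):

# Zero-I/O connect test with a 5 second timeout
nc -vz -w 5 128.142.160.140 1094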
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51199 - Posted: 27 Nov 2024, 10:09:18 UTC - in response to Message 51198.  

Thank you computezrmle,

will you inform us when the issue has been solved?

Thank you in advance

Géry
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51200 - Posted: 27 Nov 2024, 10:16:41 UTC

From my All tasks web page -

Task: 417388207
Work unit: 228404289
Computer: 10860321
Sent: 27 Nov 2024, 9:04:47 UTC
Time reported or deadline: 9:24:36 UTC
Status: Error while computing
Run time: 126.59 s
CPU time: 21.70 s
Application: CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu

This is the error I'm seeing with CMS -

417388207 - stderr output from the above task.

The following error occurs towards the end of the above stderr output:
...
2024-11-27 09:22:20 (35388): Guest Log: [INFO] Testing connection to EOSCMS
2024-11-27 09:22:20 (35388): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-11-27 09:22:26 (35388): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] run 1
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 2
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 3
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 2001:1458:301:17::100:9:1094 (IOD #1) EID 16
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 16 [2001:1458:301:17::100:9:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094
2024-11-27 09:22:40 (35388): Guest Log: [INFO] Testing connection to CMS-Factory
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to Frontier
2024-11-27 09:22:41 (35388): Guest Log: [DEBUG] Check your firewall and your network load
2024-11-27 09:22:41 (35388): Guest Log: [ERROR] Could not connect to all required network services
...
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51201 - Posted: 27 Nov 2024, 11:00:12 UTC - in response to Message 51200.  

>>2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
>>2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]

Yes, this is known: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51198

best regards

Géry
M0CZY

Joined: 27 Apr 24
Posts: 13
Credit: 1,065,859
RAC: 1,440
Message 51202 - Posted: 27 Nov 2024, 13:42:27 UTC
Last modified: 27 Nov 2024, 13:44:36 UTC

I don't mind the current situation at all. I get credits for running CMS in 'non-CPU-intensive' mode, while I am able to run other, non-BOINC projects as well.
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51203 - Posted: 27 Nov 2024, 13:57:41 UTC - in response to Message 51202.  

I don't get credits at all for CMS because all jobs fail
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1440
Credit: 9,657,607
RAC: 1,253
Message 51204 - Posted: 27 Nov 2024, 15:55:55 UTC - in response to Message 51201.  

The connection problem seems to be solved:

16:42:01 +0100 2024-11-27 [INFO] Mounting the shared directory
16:42:02 +0100 2024-11-27 [INFO] Shared directory mounted, enabling vboxmonitor
16:42:02 +0100 2024-11-27 [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to VCCS
16:42:03 +0100 2024-11-27 [INFO] Testing connection to HTCondor
16:42:03 +0100 2024-11-27 [INFO] Testing connection to WMAgent
16:42:03 +0100 2024-11-27 [INFO] Testing connection to EOSCMS
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Factory
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Frontier
16:42:04 +0100 2024-11-27 [INFO] Testing connection to Frontier
16:42:05 +0100 2024-11-27 [INFO] Could not find a local HTTP proxy
16:42:05 +0100 2024-11-27 [INFO] CVMFS and Frontier will have to use DIRECT connections
16:42:05 +0100 2024-11-27 [INFO] This makes the application less efficient
16:42:05 +0100 2024-11-27 [INFO] It also puts higher load on the project servers
16:42:06 +0100 2024-11-27 [INFO] Setting up a local HTTP proxy is highly recommended
16:42:06 +0100 2024-11-27 [INFO] Advice can be found in the project forum
16:42:07 +0100 2024-11-27 [INFO] Reloading and probing the CVMFS configuration
16:42:22 +0100 2024-11-27 [INFO] Excerpt from "cvmfs_config stat": VERSION HOST PROXY
16:42:22 +0100 2024-11-27 [INFO] 2.7.2.0 http://s1bnl-cvmfs.openhtc.io DIRECT
16:42:22 +0100 2024-11-27 [INFO] Environment HTTP proxy: not set
16:42:22 +0100 2024-11-27 [INFO] Reading volunteer information
16:42:32 +0100 2024-11-27 [INFO] CMS application starting. Check log files.


But I think Ivan has to submit a new batch to the system, so we will get sub-jobs for the VMs.
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51205 - Posted: 27 Nov 2024, 21:26:37 UTC

Now I've got this one:

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
2024-11-27 16:49:36 (21450): vboxwrapper version 26207
2024-11-27 16:49:36 (21450): BOINC client version: 8.0.2
2024-11-27 16:49:42 (21450): Error in guest additions for VM: -182
Command:
VBoxManage -q list systemproperties
Output:
VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.

2024-11-27 16:49:42 (21450): Detected: VirtualBox VboxManage Interface (Version: 7.0.12)
2024-11-27 16:49:42 (21450): Detected: Sandbox Configuration Enabled
2024-11-27 16:49:48 (21450): Error in host info for VM: -182
Command:
VBoxManage -q list hostinfo 
Output:
VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.

2024-11-27 16:49:48 (21450): WARNING: Communication with VM Hypervisor failed.
2024-11-27 16:49:48 (21450): ERROR: VBoxManage list hostinfo failed
2024-11-27 16:49:48 (21450): called boinc_finish(1)

</stderr_txt>
]]>


Any ideas and help?

Thank you in advance


Géry
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51206 - Posted: 28 Nov 2024, 6:46:21 UTC - in response to Message 51205.  

This is a local VirtualBox issue.
You may check for orphaned ".vbox-*-ipc", usually in "/tmp/".

Ensure no VM and no VirtualBox GUI component is running.
Wait 10 minutes, then delete the orphans.
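
A sketch of how this could be done on a Linux host (assuming the orphans are in /tmp and belong to the user running the VMs; check what pgrep and ls report before deleting anything):

# Make sure nothing VirtualBox related is still running
pgrep -l VBox
# List the leftover IPC entries
ls -ld /tmp/.vbox-*-ipc
# After waiting ~10 minutes with no VirtualBox processes left, remove the orphans
rm -rf /tmp/.vbox-*-ipc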