Message boards :
CMS Application :
Problems connecting to servers?
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1840 Credit: 126,207,786 RAC: 123,481 |
About an hour ago I noticed that new CMS tasks were available, so I first downloaded one and found that the connection to Condor now works well. I then started CMS on several other hosts, and all the downloaded tasks started okay. After some time, however, it became obvious that no jobs were available - no CPU activity, and the tasks broke off after about half an hour :-( |
Joined: 15 Jun 08 Posts: 2605 Credit: 262,056,290 RAC: 132,011 |
Same here. HTCondor is now on vocms0830.cern.ch (old: vocms0840.cern.ch) and works fine. @Ivan The job queue seems to be dry. |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
Mine is still not working - All Tasks
Checking the times on that... Ah, OK - there are, as I write this, reports that it's working now... [Preferences set to send me CMS...] Frustrating. I'll probably have to wait, because there's a "10 Day" Theory task running at the moment! (See this post)
Yesterday I set the Project Preferences as below to stop CMS tasks from being sent to me, but I still got them:
Run only the selected applications
SixTrack: yes
sixtracktest: yes
CMS Simulation: no
Theory Simulation: yes
ATLAS Simulation: yes
That's probably because of:
If no work for selected applications is available, accept work from other applications? yes
Fun with a dash of very dry irony. |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,207,786 RAC: 123,481 |
Still no jobs are available, and obviously the automatic termination of task distribution does not work - since yesterday evening, tasks have kept being sent out, and they all fail after about half an hour :-( |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,207,786 RAC: 123,481 |
Still no jobs are available.
For 4 days now, all downloaded CMS tasks have finished after half an hour due to a lack of jobs, and hence they are of no value to the science (although they even get low credit points). What surprises me is that no one at the receiving end of these faulty tasks has noticed it yet. Or, in other words: does no one care what we volunteers submit? This makes me wonder how much sense it makes at all to crunch for LHC ... |
Joined: 13 Jan 24 Posts: 5 Credit: 2,688,738 RAC: 4,057 |
CMS and ATLAS have problems, but I'm getting a few Theory jobs that seem to be running. |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,207,786 RAC: 123,481 |
... but I'm getting a few Theory jobs that seem to be running.
Yes, there are Theory tasks once in a while. But from what I have noticed, they are all "longrunners", so it could well happen that on a slow host they won't finish within the 10-day limit and subsequently error out. |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
It's vexing.
Long running tasks and fair play - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6240
This gonna be long - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251
If you get nothing but long Theory jobs using all your CPUs, you could limit their number with an app_config.xml:

<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>

and thus allow other jobs to run at the same time. (4 is just an example; it depends on how many of your CPUs you are willing to give to Theory tasks.)
Create the above app_config.xml file and drop it into your LHC@home project data folder:
Windows: C:\ProgramData\BOINC\projects\<your project>\app_config.xml
Linux: /var/lib/boinc/projects/<your project>/app_config.xml
Instructions for writing app_config.xml files are here. |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,207,786 RAC: 123,481 |
If you get nothing but long Theory jobs using all your CPUs you could limit the number with
The number of CPUs used for a given task does not depend on the length of the task. A Theory task is designed to use 1 CPU, regardless of whether it runs for 1 hour or 15 days. |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
Yes, Theory jobs run with one thread, using exactly one CPU core. Above, in the <app>..</app> section, the line

<max_concurrent>4</max_concurrent>

limits the number of Theory jobs that run at the same time - jobs that, as you note, run in the form: 1 Theory work unit uses 1 CPU core. So it is possible to have more than one Theory job running on your BOINC client at a time - if you have a multi-core CPU.
Now, with all the 'week long' Theory jobs being sent out, it may be useful to allow other job types - even from other projects - to run at the same time as the long Theory jobs. This app_config.xml:

<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>

limits the number of Theory jobs that run at the same time. And with one job per CPU core, a maximum of 4 CPU cores will ever be used for Theory jobs while this app_config.xml is in effect. Any further cores you have available will run other, non-Theory job types. That is useful for running some different types of LHC@home tasks concurrently (if they are available), or for keeping another project's tasks running while your PC crunches through some long LHC@home Theory tasks.
There are as many ways to use app_config.xml as there are different computers. These examples run well on my 8-core system.
Before anything else: it's recommended to leave a couple of CPU cores free for the background OS processes. A reliable way to do this is to use BOINC Manager's "Options -> Computing preferences" to limit the number of CPUs that BOINC uses for its number crunching:

Use at most [75] % of the CPUs

works well with my 8-core CPU.
Multi-threaded tasks
NOTE that there are multi-threaded apps out there, and they use completely different app_config.xml elements for limiting (if you want to) the number of CPU cores that a multi-threaded task will use.
For instance, two of the other job types offered by the LHC@home project are ATLAS and CMS. These job types are multi-threaded, meaning that they use many of your CPUs per job. So just one of these multi-threaded jobs will try to use all the CPUs available to your BOINC client - leaving no room for anything else to run. This may suit your needs. In my 8-core, multi-project set-up, I prefer a workload that's balanced across different app types and different projects. For me, that means limiting the number of any one particular type of app running at a time, and limiting the total number of apps any one project can run concurrently. But if you're only going to run one project, you can leave out those last limits.
So, in the following app_config.xml, the CMS tasks are limited with the <max_concurrent>1</max_concurrent> element to running 1 CMS job at a time. The <avg_ncpus>4</avg_ncpus> and <cmdline>--nthreads 4</cmdline> elements limit the number of CPU cores and threads it uses to 4. That allows other job types to run on your remaining free CPUs. Groovy.
app_config.xml

<app_config>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>

You can do the same with ATLAS, like this -
app_config.xml

<app_config>
  <project_max_concurrent>4</project_max_concurrent>
  <!-- Limiting the concurrent apps run by any one project, as above, allows -->
  <!-- apps from other projects to run as well, as long as they have work -->
  <!-- available. Optional -->
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
  <!-- This Theory app section is now superfluous because of the -->
  <!-- project_max_concurrent element above. But if you only have LHC work, -->
  <!-- or have deleted that element, you may find this section useful. -->
  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <!-- nothing to do here! -->
  </app_version>
</app_config>

Instructions for writing app_config.xml files are here.
If you want to write an app_config.xml for your project (or one each for more than one project!), put it in that project's folder:
Windows: C:\ProgramData\BOINC\projects\<your project>\app_config.xml
Linux: /var/lib/boinc/projects/<your project>/app_config.xml
Then start your BOINC client. Or, if it's already running, click "Options -> Read config files" in your BOINC Manager, and it takes effect. |
Joined: 8 Apr 06 Posts: 7 Credit: 248,210 RAC: 2 |
I continuously get errors running CMS jobs (1152 jobs ran flawlessly over the last 7 days), with 152 so far (more to come) running into a connection error:

2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]

and an nc command gives:

nc: connectx to 128.142.160.140 port 1094 (tcp) failed: Connection refused

It all ran well until yesterday. Server down? Any ideas? |
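A quick way to repeat this check from any host is a small shell function that probes the port directly. This is a sketch using bash's /dev/tcp pseudo-device rather than nc, so it works even where nc/ncat is not installed; the host and port are taken from the log above:

```shell
#!/usr/bin/env bash
# Probe a TCP port and report whether a connection can be established.
# Uses bash's /dev/tcp pseudo-device inside a 5-second timeout.
check_port() {
    local host=$1 port=$2
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        echo "${host}:${port} refused or filtered"
    fi
}

# The EOS endpoint from the failing tasks:
check_port 128.142.160.140 1094
```

While the server-side problem persists this prints "refused or filtered"; once the service is back, it switches to "open".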
Joined: 15 Jun 08 Posts: 2605 Credit: 262,056,290 RAC: 132,011 |
To all: These CMS failures are not caused on the volunteer side. Instead, they are caused by major upgrades and system replacements on the CERN side. Be patient and allow the CERN team to finish their work.
The system at CERN running the HTCondor service was replaced a few days ago, which caused the same type of error. See: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51084
Now it's EOSCMS that does not respond. |
Joined: 8 Apr 06 Posts: 7 Credit: 248,210 RAC: 2 |
Thank you, computezrmle. Will you inform us when the issue has been solved? Thanks in advance, Géry |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
From my All tasks web page:

Task: 417388207
Work unit: 228404289
Computer: 10860321
Sent: 27 Nov 2024, 9:04:47 UTC
Time reported or deadline: 27 Nov 2024, 9:24:36 UTC
Status: Error while computing
Run time: 126.59
CPU time: 21.70
Application: CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu

This is the error I'm seeing with CMS - the stderr output from task 417388207. The following error occurs towards the end of that output:

...
2024-11-27 09:22:20 (35388): Guest Log: [INFO] Testing connection to EOSCMS
2024-11-27 09:22:20 (35388): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-11-27 09:22:26 (35388): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] run 1
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 2
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 3
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 2001:1458:301:17::100:9:1094 (IOD #1) EID 16
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 16 [2001:1458:301:17::100:9:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094
2024-11-27 09:22:40 (35388): Guest Log: [INFO] Testing connection to CMS-Factory
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to Frontier
2024-11-27 09:22:41 (35388): Guest Log: [DEBUG] Check your firewall and your network load
2024-11-27 09:22:41 (35388): Guest Log: [ERROR] Could not connect to all required network services
... |
Joined: 8 Apr 06 Posts: 7 Credit: 248,210 RAC: 2 |
>> 2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
>> 2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]
Yes, this is known: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51198
Best regards, Géry |
Joined: 27 Apr 24 Posts: 13 Credit: 1,065,859 RAC: 1,440 |
I don't mind the current situation at all. I get credits for running CMS in 'non-CPU-intensive' mode, while I am able to run other, non-BOINC projects as well. |
Joined: 8 Apr 06 Posts: 7 Credit: 248,210 RAC: 2 |
I don't get credits at all for CMS because all jobs fail |
Joined: 14 Jan 10 Posts: 1440 Credit: 9,657,607 RAC: 1,253 |
The connection problem seems to be solved:

16:42:01 +0100 2024-11-27 [INFO] Mounting the shared directory
16:42:02 +0100 2024-11-27 [INFO] Shared directory mounted, enabling vboxmonitor
16:42:02 +0100 2024-11-27 [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to VCCS
16:42:03 +0100 2024-11-27 [INFO] Testing connection to HTCondor
16:42:03 +0100 2024-11-27 [INFO] Testing connection to WMAgent
16:42:03 +0100 2024-11-27 [INFO] Testing connection to EOSCMS
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Factory
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Frontier
16:42:04 +0100 2024-11-27 [INFO] Testing connection to Frontier
16:42:05 +0100 2024-11-27 [INFO] Could not find a local HTTP proxy
16:42:05 +0100 2024-11-27 [INFO] CVMFS and Frontier will have to use DIRECT connections
16:42:05 +0100 2024-11-27 [INFO] This makes the application less efficient
16:42:05 +0100 2024-11-27 [INFO] It also puts higher load on the project servers
16:42:06 +0100 2024-11-27 [INFO] Setting up a local HTTP proxy is highly recommended
16:42:06 +0100 2024-11-27 [INFO] Advice can be found in the project forum
16:42:07 +0100 2024-11-27 [INFO] Reloading and probing the CVMFS configuration
16:42:22 +0100 2024-11-27 [INFO] Excerpt from "cvmfs_config stat": VERSION HOST PROXY
16:42:22 +0100 2024-11-27 [INFO] 2.7.2.0 http://s1bnl-cvmfs.openhtc.io DIRECT
16:42:22 +0100 2024-11-27 [INFO] Environment HTTP proxy: not set
16:42:22 +0100 2024-11-27 [INFO] Reading volunteer information
16:42:32 +0100 2024-11-27 [INFO] CMS application starting. Check log files.

But I think Ivan has to submit a new batch to the system before we will get sub-jobs for the VMs. |
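Regarding the "Setting up a local HTTP proxy is highly recommended" lines in that log: a local caching proxy for CVMFS/Frontier traffic can be a stock Squid install. The following minimal squid.conf is only a sketch with hypothetical values for a typical home LAN (adjust the localnet range, memory, and cache_dir sizes to your network), not the project team's official configuration:

```
# Minimal Squid sketch for caching CVMFS/Frontier traffic on a home LAN.
# Hypothetical values - tune to your own network and disk space.
http_port 3128
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
cache_mem 256 MB
maximum_object_size 1 GB
cache_dir ufs /var/spool/squid 20000 16 256
```

Then point your BOINC hosts at the proxy via BOINC Manager's "Options -> Other options" HTTP proxy settings, so the VMs pick it up instead of using DIRECT connections.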
Joined: 8 Apr 06 Posts: 7 Credit: 248,210 RAC: 2 |
Now I've got this one:

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
2024-11-27 16:49:36 (21450): vboxwrapper version 26207
2024-11-27 16:49:36 (21450): BOINC client version: 8.0.2
2024-11-27 16:49:42 (21450): Error in guest additions for VM: -182
Command: VBoxManage -q list systemproperties
Output: VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.
2024-11-27 16:49:42 (21450): Detected: VirtualBox VboxManage Interface (Version: 7.0.12)
2024-11-27 16:49:42 (21450): Detected: Sandbox Configuration Enabled
2024-11-27 16:49:48 (21450): Error in host info for VM: -182
Command: VBoxManage -q list hostinfo
Output: VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.
2024-11-27 16:49:48 (21450): WARNING: Communication with VM Hypervisor failed.
2024-11-27 16:49:48 (21450): ERROR: VBoxManage list hostinfo failed
2024-11-27 16:49:48 (21450): called boinc_finish(1)
</stderr_txt>
]]>

Any ideas and help? Thank you in advance, Géry |
Joined: 15 Jun 08 Posts: 2605 Credit: 262,056,290 RAC: 132,011 |
This is a local VirtualBox issue. You may check for orphaned ".vbox-*-ipc", usually in "/tmp/". Ensure no VM and no VirtualBox GUI component is running. Wait 10 minutes, then delete the orphans. |
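A cautious way to apply that advice could look like the following sketch. The /tmp/.vbox-*-ipc pattern and the VBoxSVC process name are assumptions based on a typical Linux VirtualBox install; double-check what exists on your system before deleting anything:

```shell
#!/usr/bin/env bash
# Remove orphaned VirtualBox IPC sockets, but only when no VirtualBox
# COM server (VBoxSVC) is running.  Pattern and process name are
# assumptions for a typical Linux install - verify before deleting.
cleanup_vbox_ipc() {
    local dir="${1:-/tmp}"   # directory holding the IPC sockets
    if pgrep -x VBoxSVC >/dev/null 2>&1; then
        echo "VirtualBox is still running - not touching IPC files"
        return 1
    fi
    local d
    for d in "$dir"/.vbox-*-ipc; do
        # the glob stays literal when nothing matches, so test first
        [ -e "$d" ] && rm -rf "$d" && echo "removed $d"
    done
    echo "done"
}

cleanup_vbox_ipc /tmp
```

Remember to also shut down every VM and the VirtualBox GUI, and to wait the suggested 10 minutes, before running it.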
©2025 CERN