Message boards : CMS Application : Problems connecting to servers?

Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51099 - Posted: 19 Nov 2024, 17:24:05 UTC

After I noticed about an hour ago that new CMS tasks were available, I first downloaded one and found that the connection to Condor now works well.
So I started CMS on several other hosts, and all the downloaded tasks started okay. However, after some time it became obvious that there were no jobs available - no CPU activity, and the tasks break off after about half an hour :-(
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51100 - Posted: 19 Nov 2024, 17:34:03 UTC - in response to Message 51099.  

Same here.
HTCondor is now on vocms0830.cern.ch (old: vocms0840.cern.ch) and works fine.

@Ivan
The job queue seems to be dry.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51101 - Posted: 20 Nov 2024, 2:44:11 UTC
Last modified: 20 Nov 2024, 3:41:10 UTC

Mine is still not working -

All Tasks

Checking the times on that... Ah.
OK - as I write this, there are reports that it's working now...
[Preferences set to send me CMS...]

Frustrating.
I'll probably have to wait because there's a "10 Day" Theory task running at the moment! (See this post)
Yesterday I set the Project Preferences as below to stop sending me CMS tasks, but I still got them:
Run only the selected applications      SixTrack: yes
                                        sixtracktest: yes
                                        CMS Simulation: no
                                        Theory Simulation: yes
                                        ATLAS Simulation: yes
That's probably because of -
If no work for selected applications is available, accept work from other applications?        yes
Fun with a dash of very dry irony.
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51105 - Posted: 20 Nov 2024, 10:12:52 UTC

Still no jobs are available.
And obviously, task distribution is not stopped automatically - since yesterday evening, tasks keep being sent out and they all fail after about half an hour :-(
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51121 - Posted: 23 Nov 2024, 19:03:51 UTC - in response to Message 51105.  

>> Still no jobs are available.
>> And obviously, task distribution is not stopped automatically - since yesterday evening, tasks keep being sent out and they all fail after about half an hour :-(
For 4 days now, all downloaded CMS tasks have finished after half an hour due to the lack of jobs, and hence they are of no value to the science (although they still get a little credit). What surprises me is that no one at the receiving end of these faulty tasks has noticed it yet. Or, in other words: does no one care what we volunteers submit? This makes me wonder how much sense it makes at all to crunch for LHC ...
Glohr

Joined: 13 Jan 24
Posts: 5
Credit: 2,688,738
RAC: 4,057
Message 51123 - Posted: 24 Nov 2024, 0:29:27 UTC - in response to Message 51121.  

CMS and ATLAS have problems, but I'm getting a few Theory jobs that seem to be running.
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51124 - Posted: 24 Nov 2024, 7:38:01 UTC - in response to Message 51123.  

>> ... but I'm getting a few Theory jobs that seem to be running.
Yes, there are Theory tasks once in a while. But from what I've noticed, they are all "longrunners", so it could well happen that on a slow host they won't finish within the 10-day limit and subsequently error out.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51127 - Posted: 24 Nov 2024, 14:06:38 UTC - in response to Message 51124.  

It's vexing.
Long running tasks and fair play - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6240
This gonna be long - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251

If you get nothing but long Theory jobs using all your CPUs, you can limit their number with
app_config.xml
<app_config>

  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>
and thus allow other jobs to run at the same time. (4 is just an example. It depends on how many of your CPUs you are willing to give to Theory tasks.)

Create the above app_config.xml file and drop it in your LHC@home project data folder, here:
Windows:
C:\ProgramData\BOINC\projects\<your project>\app_config.xml
Linux:
/var/lib/boinc/projects/<your project>/app_config.xml

Instructions for writing app_config.xml files are here
Erich56

Joined: 18 Dec 15
Posts: 1840
Credit: 126,207,786
RAC: 123,481
Message 51130 - Posted: 24 Nov 2024, 16:13:08 UTC - in response to Message 51127.  

>> If you get nothing but long Theory jobs using all your CPUs, you can limit their number with app_config.xml
The number of CPUs used for a given task does not depend on the length of the task. A Theory task is designed to use 1 CPU, regardless of whether it runs for 1 hour or 15 days.
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51182 - Posted: 26 Nov 2024, 21:32:20 UTC - in response to Message 51130.  
Last modified: 26 Nov 2024, 21:50:48 UTC

Yes, Theory jobs run with one thread using exactly one CPU core.
Above, in the <app>..</app> section, the line
<max_concurrent>4</max_concurrent>
limits the number of Theory jobs that run at the same time - and, as you note, each Theory work unit uses exactly 1 CPU core.
So it is possible to have more than one Theory job running on your BOINC client at a time - well, if you have a multi-core CPU. With all the 'week long' Theory jobs being sent out at the moment, it may be useful to allow other job types, even from other projects, to run alongside the long Theory jobs.
This
app_config.xml -
<app_config>

  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>
limits the number of Theory jobs that run at the same time. And with one job per CPU core, a maximum of 4 CPU cores will ever be used for Theory jobs while this app_config.xml is in effect.
If you have any more cores available, they will run other, non-Theory job types. That's useful for running some different types of LHC@home tasks concurrently (at the same time), if they're available, or for keeping another project's tasks running while your PC crunches through some long LHC@home Theory tasks.

There are as many ways to use app_config.xml as there are different computers.
These examples are running well on my 8-core system.

Before anything else - it's recommended to leave a couple of CPU cores free to run all the background OS processes. A reliable way to do this is to use the BOINC Manager's "Options -> Computing preferences" to limit the number of CPUs that BOINC uses for its number crunching.
Use at most [75] % of the CPUs
works well with my 8-core CPU.
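
If you prefer doing that from a terminal instead of the Manager, here is a minimal sketch that should have the same effect (my assumptions: a Linux host with the standard /var/lib/boinc data directory, and that you don't already have a global_prefs_override.xml - if you do, edit that file instead of overwriting it):

# Write a local override for "Use at most 75% of the CPUs"
sudo tee /var/lib/boinc/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
   <max_ncpus_pct>75.0</max_ncpus_pct>
</global_preferences>
EOF
# Tell the running client to re-read its local preferences
boinccmd --read_global_prefs_override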

Multi-threaded tasks
NOTE that there are also multi-threaded apps out there, and they use completely different app_config.xml elements for limiting, if you want to, the number of CPU cores that a multi-threaded task will use.
For instance, two of the other job types offered by the LHC@home project, ATLAS and CMS, are multi-threaded, meaning that they use many of your CPUs per job. So just one of these multi-threaded jobs will try to use all the CPUs available to your BOINC client - leaving no room for anything else to run. This may suit your needs. In my 8-core multi-project set-up I prefer a workload that's balanced across different app types and different projects. For me, that means limiting the number of any one particular type of app running at a time, and limiting the total number of tasks any one project can run concurrently. But if you're only going to run one project you can leave out that last limit.
So, in the following app_config.xml, CMS is limited to 1 running job at a time with the <max_concurrent>1</max_concurrent> XML element. The <avg_ncpus>4</avg_ncpus> and <cmdline>--nthreads 4</cmdline> elements limit the number of CPU cores & threads it uses to 4. That allows other job types to run on your remaining free CPUs. Groovy.

app_config.xml
<app_config>

  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>

</app_config>


You can do the same with ATLAS, like this -
app_config.xml
<app_config>

  <project_max_concurrent>4</project_max_concurrent>

  <!-- limiting the concurrent apps run by any one project above allows apps from other projects to run as well -->
  <!-- as long as they have work available. Optional -->

  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>

<!-- This Theory app section is now superfluous because of <project_max_concurrent> above -->
<!-- But if you only have LHC work, or have deleted that line -->
<!-- then you may find this section useful -->
  <app>  
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <!-- nothing to do here! -->
  </app_version>

</app_config>



Instructions for writing app_config.xml files are here.

If you want to write an app_config.xml for your project (or one each for more than one project!), then to use it, you put it in its particular project folder, here:

Windows:
C:\ProgramData\BOINC\projects\<your project>\app_config.xml

Linux:
/var/lib/boinc/projects/<your project>/app_config.xml

Then start your BOINC client.
Or, if it's running already, click "Options -> Read config files" in your BOINC Manager.
And it takes effect.
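
If you'd rather do that last step from a terminal, here's a small sketch (assuming xmllint is installed for the optional syntax check, and that boinccmd from your BOINC install is on the PATH; as far as I know the read_cc_config call also re-reads app_config.xml on current clients - restarting the client always works if in doubt):

# Optional: check the file is well-formed XML before deploying it
xmllint --noout /var/lib/boinc/projects/<your project>/app_config.xml
# Command-line equivalent of "Options -> Read config files"
boinccmd --read_cc_config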
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51197 - Posted: 27 Nov 2024, 8:46:21 UTC

I am continuously getting errors running CMS jobs (1152 jobs ran flawlessly over the last 7 days), with 152 so far (more to come) running into a connection error:

2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-26 21:36:48 (72613): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]

and an nc command gives:

nc: connectx to 128.142.160.140 port 1094 (tcp) failed: Connection refused

It all ran well until yesterday. Server down?

Any ideas?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51198 - Posted: 27 Nov 2024, 8:59:52 UTC - in response to Message 51197.  

To all:
These CMS failures are not caused on the volunteer side.
Instead, they are caused by major upgrades and system replacements on the CERN side.
Be patient and allow the CERN team to finish their work.



The system at CERN running the HTCondor service was replaced a few days ago, which caused the same type of error.
See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51084
Now it's EOSCMS which does not respond.
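
If you want to check from your side whether the affected endpoint responds again, a quick reachability test like this should do (just a sketch; the IP and port are taken from the reports above, and nc option syntax differs between netcat flavours):

# Zero-I/O connect test with a 5 second timeout
nc -vz -w 5 128.142.160.140 1094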
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51199 - Posted: 27 Nov 2024, 10:09:18 UTC - in response to Message 51198.  

Thank you computezrmle,

will you inform us when the issue has been solved?

Thank you in advance

Géry
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51200 - Posted: 27 Nov 2024, 10:16:41 UTC

From my All tasks web page -

Task: 417388207
Work unit: 228404289
Computer: 10860321
Sent: 27 Nov 2024, 9:04:47 UTC
Time reported or deadline: 9:24:36 UTC
Status: Error while computing
Run time: 126.59 s
CPU time: 21.70 s
Application: CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu

This is the error I'm seeing with CMS -

417388207 - stderr output from the above task.

The following error occurs towards the end of the above stderr output:
...
2024-11-27 09:22:20 (35388): Guest Log: [INFO] Testing connection to EOSCMS
2024-11-27 09:22:20 (35388): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-11-27 09:22:26 (35388): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-11-27 09:22:40 (35388): Guest Log: [DEBUG] run 1
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 2
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: run 3
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-27 09:22:40 (35388): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection to 128.142.160.140 failed: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Trying next address...
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 2001:1458:301:17::100:9:1094 (IOD #1) EID 16
2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 16 [2001:1458:301:17::100:9:1094]
2024-11-27 09:22:40 (35388): Guest Log: Ncat: Connection refused.
2024-11-27 09:22:40 (35388): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094
2024-11-27 09:22:40 (35388): Guest Log: [INFO] Testing connection to CMS-Factory
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-11-27 09:22:41 (35388): Guest Log: [INFO] Testing connection to Frontier
2024-11-27 09:22:41 (35388): Guest Log: [DEBUG] Check your firewall and your network load
2024-11-27 09:22:41 (35388): Guest Log: [ERROR] Could not connect to all required network services
...
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51201 - Posted: 27 Nov 2024, 11:00:12 UTC - in response to Message 51200.  

>>2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 128.142.160.140:1094 (IOD #1) EID 8
>>2024-11-27 09:22:40 (35388): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [128.142.160.140:1094]

Yes, this is known: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6198&postid=51198

best regards

Géry
M0CZY

Joined: 27 Apr 24
Posts: 13
Credit: 1,065,859
RAC: 1,440
Message 51202 - Posted: 27 Nov 2024, 13:42:27 UTC
Last modified: 27 Nov 2024, 13:44:36 UTC

I don't mind the current situation at all. I get credits for running CMS in 'non-CPU-intensive' mode, while I am able to run other, non-BOINC projects as well.
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51203 - Posted: 27 Nov 2024, 13:57:41 UTC - in response to Message 51202.  

I don't get credits at all for CMS because all jobs fail
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1440
Credit: 9,657,607
RAC: 1,253
Message 51204 - Posted: 27 Nov 2024, 15:55:55 UTC - in response to Message 51201.  

The connection problem seems to be solved:

16:42:01 +0100 2024-11-27 [INFO] Mounting the shared directory
16:42:02 +0100 2024-11-27 [INFO] Shared directory mounted, enabling vboxmonitor
16:42:02 +0100 2024-11-27 [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to cern.ch
16:42:02 +0100 2024-11-27 [INFO] Testing connection to VCCS
16:42:03 +0100 2024-11-27 [INFO] Testing connection to HTCondor
16:42:03 +0100 2024-11-27 [INFO] Testing connection to WMAgent
16:42:03 +0100 2024-11-27 [INFO] Testing connection to EOSCMS
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Factory
16:42:04 +0100 2024-11-27 [INFO] Testing connection to CMS-Frontier
16:42:04 +0100 2024-11-27 [INFO] Testing connection to Frontier
16:42:05 +0100 2024-11-27 [INFO] Could not find a local HTTP proxy
16:42:05 +0100 2024-11-27 [INFO] CVMFS and Frontier will have to use DIRECT connections
16:42:05 +0100 2024-11-27 [INFO] This makes the application less efficient
16:42:05 +0100 2024-11-27 [INFO] It also puts higher load on the project servers
16:42:06 +0100 2024-11-27 [INFO] Setting up a local HTTP proxy is highly recommended
16:42:06 +0100 2024-11-27 [INFO] Advice can be found in the project forum
16:42:07 +0100 2024-11-27 [INFO] Reloading and probing the CVMFS configuration
16:42:22 +0100 2024-11-27 [INFO] Excerpt from "cvmfs_config stat": VERSION HOST PROXY
16:42:22 +0100 2024-11-27 [INFO] 2.7.2.0 http://s1bnl-cvmfs.openhtc.io DIRECT
16:42:22 +0100 2024-11-27 [INFO] Environment HTTP proxy: not set
16:42:22 +0100 2024-11-27 [INFO] Reading volunteer information
16:42:32 +0100 2024-11-27 [INFO] CMS application starting. Check log files.


But I think Ivan has to submit a new batch to the system, so we will get sub-jobs for the VMs.
Gery Oei

Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51205 - Posted: 27 Nov 2024, 21:26:37 UTC

Now I've got this one:

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
2024-11-27 16:49:36 (21450): vboxwrapper version 26207
2024-11-27 16:49:36 (21450): BOINC client version: 8.0.2
2024-11-27 16:49:42 (21450): Error in guest additions for VM: -182
Command:
VBoxManage -q list systemproperties
Output:
VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.

2024-11-27 16:49:42 (21450): Detected: VirtualBox VboxManage Interface (Version: 7.0.12)
2024-11-27 16:49:42 (21450): Detected: Sandbox Configuration Enabled
2024-11-27 16:49:48 (21450): Error in host info for VM: -182
Command:
VBoxManage -q list hostinfo 
Output:
VBoxManage: error: Failed to create the VirtualBox object!
VBoxManage: error: Code NS_ERROR_SOCKET_FAIL (0xC1F30200) - IPC daemon socket error (extended info not available)
VBoxManage: error: Most likely, the VirtualBox COM server is not running or failed to start.

2024-11-27 16:49:48 (21450): WARNING: Communication with VM Hypervisor failed.
2024-11-27 16:49:48 (21450): ERROR: VBoxManage list hostinfo failed
2024-11-27 16:49:48 (21450): called boinc_finish(1)

</stderr_txt>
]]>


Any ideas and help?

Thank you in advance


Géry
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,056,290
RAC: 132,011
Message 51206 - Posted: 28 Nov 2024, 6:46:21 UTC - in response to Message 51205.  

This is a local VirtualBox issue.
You may check for orphaned ".vbox-*-ipc", usually in "/tmp/".

Ensure no VM and no VirtualBox GUI component is running.
Wait 10 minutes, then delete the orphans.
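
A sketch of how this could be done on a Linux host (assuming the orphans are in /tmp and belong to the user running the VMs; check what pgrep and ls report before deleting anything):

# Make sure nothing VirtualBox related is still running
pgrep -l VBox
# List the leftover IPC entries
ls -ld /tmp/.vbox-*-ipc
# After waiting ~10 minutes with no VirtualBox processes left, remove the orphans
rm -rf /tmp/.vbox-*-ipc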