Message boards : CMS Application : Problems connecting to servers?

Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51207 - Posted: 28 Nov 2024, 11:02:17 UTC
Last modified: 28 Nov 2024, 11:53:38 UTC

The CMS multithread tasks are using just 1 CPU at the moment.
The sysfolk must be sending out test jobs.

<stderr_txt>
2024-11-28 10:14:06 (15635): vboxwrapper version 26207
...
2024-11-28 10:15:35 (15635): Guest Log: [INFO] CMS application starting. Check log files.
2024-11-28 10:42:29 (15635): Guest Log: [INFO] glidein exited with return value 0.
2024-11-28 10:42:30 (15635): Guest Log: [INFO] Shutting Down.
2024-11-28 10:42:30 (15635): VM Completion File Detected.
2024-11-28 10:42:30 (15635): VM Completion Message: glidein exited with return value 0.
...

</stderr_txt>
The jobs are all completing and being verified successfully.
But note the timestamps - the jobs take a minute and a half to initialise, then finish about 28 minutes after the task started.
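To put numbers on those timestamps, here is a minimal Python sketch that parses three of the stderr lines quoted above and computes the two phases. The helper names (`stamp`, `phase_durations`) are made up for illustration, and the parsing assumes every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in this excerpt.

```python
from datetime import datetime

# Three lines copied from the stderr excerpt in this post.
LOG = """\
2024-11-28 10:14:06 (15635): vboxwrapper version 26207
2024-11-28 10:15:35 (15635): Guest Log: [INFO] CMS application starting. Check log files.
2024-11-28 10:42:29 (15635): Guest Log: [INFO] glidein exited with return value 0.
"""


def stamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def phase_durations(log_text):
    """Return (initialisation time, glidein run time) as timedeltas."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    wrapper_start = stamp(lines[0])
    app_start = next(stamp(line) for line in lines if "CMS application starting" in line)
    glidein_exit = next(stamp(line) for line in lines if "glidein exited" in line)
    return app_start - wrapper_start, glidein_exit - app_start


init, run = phase_durations(LOG)
print(f"initialisation: {init}, glidein run: {run}")
```

For this excerpt it gives 89 seconds from wrapper start to the CMS application starting and just under 27 minutes until glidein exits, so the whole task is over in roughly 28 minutes.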
ID: 51207
Harri Liljeroos
Joined: 28 Sep 04
Posts: 744
Credit: 51,950,509
RAC: 31,935
Message 51208 - Posted: 28 Nov 2024, 11:43:09 UTC - in response to Message 51207.  
Last modified: 28 Nov 2024, 11:45:12 UTC

This is probably not testing. There simply aren't any actual jobs available to run, only BOINC tasks that start a virtual machine which then cannot get any jobs. You can see this on the site menu under Jobs -> CMS jobs -> Running Jobs (you can log in with CERN SSO, for example using your Google account). There haven't been any jobs for over a week now.
ID: 51208
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51209 - Posted: 28 Nov 2024, 11:47:07 UTC - in response to Message 51207.  
Last modified: 28 Nov 2024, 12:18:23 UTC

Because CMS appears to be using only 1 CPU, I tried adjusting my app_config.xml to use 1 CPU for CMS jobs.
They all failed!
CMS multithread jobs need 4 CPUs (minimum).
It "looked" like it was working... but all have since failed, with this logged in stderr:
2024-11-28 11:15:41 (18379): Guest Log: [INFO] CMS application starting. Check log files.
2024-11-28 11:27:55 (18379): Guest Log: [ERROR] VM expects at least 4 CPUs but reports only 1.
Changed it back to 4 CPUs & threads. All OK now!
But this is a waste of compute units (CPUs). Three CPUs are doing exactly nothing.
ID: 51209
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2605
Credit: 262,052,562
RAC: 131,919
Message 51210 - Posted: 28 Nov 2024, 11:51:59 UTC - in response to Message 51207.  

Best would be to set CMS to "No new tasks" (NNT) until the issues are solved.


Never set "<avg_ncpus>" or "--nthreads" lower than 4.
That value is checked by the scientific app inside the VM, and VMs configured to use fewer cores are made to fail intentionally.
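
A quick way to catch this before tasks start failing is to check the file first. The following is only a sketch: the `check_cms_cpus` helper is hypothetical, the app name `CMS` and the example file contents are assumptions, and a real app_config.xml may also carry a `<plan_class>` that is omitted here. The minimum of 4 is the value stated in this thread.

```python
import re
import xml.etree.ElementTree as ET

MIN_CPUS = 4  # minimum stated in this thread for CMS multithread tasks

# Hypothetical app_config.xml contents; the app name "CMS" is an assumption.
EXAMPLE = """
<app_config>
  <app_version>
    <app_name>CMS</app_name>
    <avg_ncpus>1</avg_ncpus>
    <cmdline>--nthreads 1</cmdline>
  </app_version>
</app_config>
"""


def check_cms_cpus(xml_text, app_name="CMS", min_cpus=MIN_CPUS):
    """Return a list of problems found in the matching <app_version> entries."""
    problems = []
    root = ET.fromstring(xml_text)
    for version in root.iter("app_version"):
        if version.findtext("app_name", "").strip() != app_name:
            continue
        ncpus = version.findtext("avg_ncpus")
        if ncpus is not None and float(ncpus) < min_cpus:
            problems.append(f"avg_ncpus is {ncpus.strip()}, expected at least {min_cpus}")
        match = re.search(r"--nthreads\s+(\d+)", version.findtext("cmdline", ""))
        if match and int(match.group(1)) < min_cpus:
            problems.append(f"--nthreads is {match.group(1)}, expected at least {min_cpus}")
    return problems


# The example above sets both values to 1, so both checks complain.
for problem in check_cms_cpus(EXAMPLE):
    print(problem)
```

With both values set to 4 the function returns an empty list.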


As for the glidein return value of 0:
This results in a BOINC success although the scientific output is missing.
The reason is that there are countless possible errors at deeper levels, and most of them are intentionally not forwarded to the BOINC level.
ID: 51210
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51211 - Posted: 28 Nov 2024, 12:05:08 UTC - in response to Message 51210.  

Yes. Thanks.
As noted - that failed utterly! Yikes!
ID: 51211
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51212 - Posted: 28 Nov 2024, 12:07:06 UTC - in response to Message 51208.  
Last modified: 28 Nov 2024, 12:10:53 UTC

OK thanks, Harri.
These "empty" CMS jobs still use 4 of my CPUs...
I'll stop pulling CMS tasks until work is available.
ID: 51212
Erich56
Joined: 18 Dec 15
Posts: 1840
Credit: 126,202,363
RAC: 123,165
Message 51213 - Posted: 28 Nov 2024, 12:24:48 UTC

I am aware that I am repeating myself,
but I keep wondering why CMS tasks are being sent out as long as no jobs are available ... :-(
ID: 51213
Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,521,616
RAC: 3,319
Message 51214 - Posted: 28 Nov 2024, 12:31:57 UTC - in response to Message 51213.  

They must be dry-running the servers.
It would be nice if they kept that local.
ID: 51214
M0CZY
Joined: 27 Apr 24
Posts: 13
Credit: 1,065,859
RAC: 1,440
Message 51215 - Posted: 28 Nov 2024, 12:35:34 UTC

I am getting credit for nearly all of my "failed" work units.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1191237
ID: 51215
Erich56
Joined: 18 Dec 15
Posts: 1840
Credit: 126,202,363
RAC: 123,165
Message 51217 - Posted: 28 Nov 2024, 15:24:49 UTC - in response to Message 51215.  

I am getting credit for nearly all of my "failed" work units.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1191237
Clicking on your link shows "access denied".
As mentioned somewhere above, for some (strange) reason credit is granted for this kind of task, but they are of no value to the science.
Meanwhile a pretty high number of these faulty tasks must have been processed and sent back without the recipient noticing that something is wrong.
ID: 51217
Gery Oei
Joined: 8 Apr 06
Posts: 7
Credit: 248,210
RAC: 2
Message 51219 - Posted: 28 Nov 2024, 18:01:27 UTC - in response to Message 51206.  
Last modified: 28 Nov 2024, 18:01:58 UTC

This is a local VirtualBox issue.
You may check for orphaned ".vbox-*-ipc", usually in "/tmp/".

Ensure no VM and no VirtualBox GUI component is running.
Wait 10 minutes, then delete the orphans.


found:

drwx------   4 boinc_master  wheel   128B 27 Nov 09:07 .vbox-boinc_project-ipc
drwx------   4 geryoei       wheel   128B 27 Nov 22:12 .vbox-geryoei-ipc
drwx------   4 root          wheel   128B 27 Nov 09:50 .vbox-root-ipc


sudo rm -Rf .vbox-*

Does the job, thank you for your help!

Géry
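
For anyone who would rather look before deleting, a cautious Python sketch that only lists candidate directories. It assumes the orphans match `.vbox-*-ipc` under `/tmp` as described above, and it uses "not modified for 10 minutes" as a rough stand-in for the waiting period, which is an assumption of this sketch. The `find_orphan_ipc_dirs` helper is hypothetical, and it does not check whether VirtualBox is still running, so that still has to be verified by hand.

```python
import time
from pathlib import Path


def find_orphan_ipc_dirs(tmp_dir="/tmp", min_age_seconds=600, now=None):
    """List .vbox-*-ipc directories in tmp_dir not modified for min_age_seconds."""
    now = time.time() if now is None else now
    candidates = []
    for path in Path(tmp_dir).glob(".vbox-*-ipc"):
        # Age is judged by modification time only (an assumption of this sketch).
        if path.is_dir() and now - path.stat().st_mtime >= min_age_seconds:
            candidates.append(path)
    return sorted(candidates)


if __name__ == "__main__":
    # Dry run: print the candidates, delete nothing.
    for path in find_orphan_ipc_dirs():
        print(path)
```

The pattern is deliberately narrower than `.vbox-*`, so it only picks up the IPC directories.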
ID: 51219
M0CZY
Joined: 27 Apr 24
Posts: 13
Credit: 1,065,859
RAC: 1,440
Message 51220 - Posted: 28 Nov 2024, 20:00:38 UTC - in response to Message 51217.  

clicking on your link shows "access denied".

My computers are not hidden. You can click on my username to reveal them.
ID: 51220


©2025 CERN