Message boards : CMS Application : Had ~100 failures on CMS 50

far

Joined: 27 May 11
Posts: 5
Credit: 9,747,819
RAC: 0
Message 43262 - Posted: 24 Aug 2020, 4:00:32 UTC

Hi Team,
I noticed that a machine wasn't using all of its CPU power and tracked it back to the CMS tasks.
They have been failing for a while and are also preventing other tasks from using the PC's resources properly:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10516481
I've disabled CMS so other projects can ramp the PC up to 100% CPU again, but it would be great if you could spot anything wrong with it so I can re-enable it.
The machine had lots of resources free, but for some reason this project was preventing them from being used.
E.g. it has 32 threads, but BOINC put other projects in a "Waiting for memory" state when there was plenty of memory free, and I was only seeing ~32% of the CPU being used.

If there are logs or any assistance I can provide please let me know,
Thanks, Far
ID: 43262 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,417,873
RAC: 123,624
Message 43264 - Posted: 24 Aug 2020, 7:00:21 UTC - in response to Message 43262.  

Each CMS VM allocates 1 CPU core and 2 GB RAM.
Your computer has 32 cores and 32 GB RAM.
This would allow you to run up to 16*) CMS VMs concurrently and would leave 16 cores idle.
In addition, each CMS task makes heavy use of disk I/O and network, neither of which needs much CPU.


*) Less in reality - even if the BOINC client is configured to use 100% RAM - since the OS and other processes also require RAM.
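
The arithmetic above can be written down as a rough sketch. The function name and the `reserved_gb` parameter are made up for illustration and are not part of BOINC or LHC@home; only the figures of 1 core and 2 GB RAM per CMS VM come from this post.

```python
def max_cms_vms(cores: int, ram_gb: float, reserved_gb: float = 0.0,
                cores_per_vm: int = 1, ram_per_vm_gb: float = 2.0) -> int:
    """Upper bound on concurrent CMS VMs, limited by whichever runs out first."""
    # RAM left for VMs after an assumed reserve for the OS and other processes.
    usable_ram = max(ram_gb - reserved_gb, 0.0)
    by_ram = int(usable_ram // ram_per_vm_gb)
    by_cpu = cores // cores_per_vm
    return min(by_ram, by_cpu)


vms = max_cms_vms(cores=32, ram_gb=32)
print(vms, "VMs,", 32 - vms, "cores left idle")  # 16 VMs, 16 cores left idle

# Hypothetical reserve of 4 GB for the OS (an assumed value, not a measurement).
print(max_cms_vms(cores=32, ram_gb=32, reserved_gb=4))  # 14
```

On a 32-core machine with 32 GB the limit is set by RAM, not by cores, which is why half the cores stay idle if only CMS is allowed to run.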
ID: 43264 · Report as offensive     Reply Quote
far

Joined: 27 May 11
Posts: 5
Credit: 9,747,819
RAC: 0
Message 43275 - Posted: 24 Aug 2020, 23:18:34 UTC - in response to Message 43264.  

Thanks, that helps me understand the restricted use of the available resources. I wish I could have afforded 64 GB of RAM.

However all the CMS50 tasks were failing anyway :-(

If there are logs or anything else needed to check why, please let me know. In case it's a factor, the installed version of VirtualBox, 6.1.12r139181 (Qt5.6.2), is more recent than the one distributed with BOINC.
ID: 43275 · Report as offensive     Reply Quote
Profile Francesc Josep LLort i Gutiér...

Joined: 23 Nov 15
Posts: 4
Credit: 1,391,488
RAC: 0
Message 43910 - Posted: 15 Dec 2020, 2:08:56 UTC

Hello,
Same problem here. In theory the computer has enough resources, but the CMS task waits for memory and blocks other projects from loading. The task properties show this:
Application: CMS Simulation 50.00 (vbox64)
Name: CMS_1538738_1607610862.401591
State: Waiting for memory
Received: Thursday, December 10, 2020, 5:31:55 PM
Report deadline: Saturday, January 9, 2021, 5:31:54 PM
Estimated computation size: 1,000,000 GFLOPs
CPU time: 10:13:26
CPU time since checkpoint: ---
Elapsed time: 10:24:46
Estimated time remaining: 1d 04:47:11
Fraction done: 58.399%
Virtual memory size: 281.63 MB
Working set size: 2.79 GB
Directory: slots/1
Progress rate: 5.760% per hour
Executable: vboxwrapper_26196_x86_64-pc-linux-gnu
I don't know what else to try. I think I will disable CMS to see what happens. I'll let you know.
ID: 43910 · Report as offensive     Reply Quote
Profile Francesc Josep LLort i Gutiér...

Joined: 23 Nov 15
Posts: 4
Credit: 1,391,488
RAC: 0
Message 43911 - Posted: 15 Dec 2020, 2:34:46 UTC

Hello again. After aborting the CMS task, BOINC started to accept and run new work.
ID: 43911 · Report as offensive     Reply Quote
Profile Francesc Josep LLort i Gutiér...

Joined: 23 Nov 15
Posts: 4
Credit: 1,391,488
RAC: 0
Message 43912 - Posted: 15 Dec 2020, 2:35:29 UTC - in response to Message 43911.  

Hi again. After I aborted the CMS task, BOINC began to accept and run new work.
ID: 43912 · Report as offensive     Reply Quote
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43913 - Posted: 15 Dec 2020, 3:00:53 UTC - in response to Message 43910.  
Last modified: 15 Dec 2020, 3:11:47 UTC

With 5.8 GB of memory on both computers, each an 8-core system, you have the bare minimum to handle it plus a few SixTrack tasks.
You have had CMS tasks that ran and were validated, but you aborted the last one. It probably set the other tasks to waiting for RAM.

Please uncheck the boxes for native tasks and test applications. Your client had many failed tasks because CVMFS is not installed. If you want to run VirtualBox tasks, you only get them by unchecking native, but I would suggest running SixTrack and maybe Theory until you have added more memory.
ID: 43913 · Report as offensive     Reply Quote
Profile Francesc Josep LLort i Gutiér...

Joined: 23 Nov 15
Posts: 4
Credit: 1,391,488
RAC: 0
Message 43926 - Posted: 15 Dec 2020, 18:11:34 UTC - in response to Message 43913.  

Hello Gunde.
I have disabled CMS, ATLAS and native tasks, as you told me, and BOINC has started running Theory. I'm also running the GPUGRID project in BOINC; maybe that's too much? With Kubuntu 18.04 I had no such issues. They appeared after I switched to Kubuntu 20.04, although the change is well worth it. I plan to upgrade the RAM to 24 GB, but it won't be this year. I'll go to the Linux section to ask about CVMFS installation issues. Thanks.
ID: 43926 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44484 - Posted: 14 Mar 2021, 14:36:57 UTC

All CMS tasks fail on my Windows 10 PC, while ATLAS tasks on 2 cores and Theory tasks work perfectly, not to mention SixTrack tasks. Condor fails after about 10000 s.
Tullio
ID: 44484 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,793,446
RAC: 127,534
Message 44490 - Posted: 15 Mar 2021, 12:38:30 UTC - in response to Message 44484.  

Tullio,
you have 12 GB of RAM in your Windows PC.
This can be too small to run ATLAS or Theory AND CMS.
What happens when you start only one CMS task?
ID: 44490 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44492 - Posted: 15 Mar 2021, 13:33:29 UTC - in response to Message 44490.  

Tullio,
you have 12 GB of RAM in your Windows PC.
This can be too small to run ATLAS or Theory AND CMS.
What happens when you start only one CMS task?

Condor reaches 15268 s. It was my last running task, since QuChemPedIA@home is apparently dead. That project also uses VirtualBox, but most of the time it completed a task before its Linux wingman, even though the Linux host runs on a much more powerful CPU. But CMS was also a problem on my other Windows 10 PC with 24 GB RAM.
Tullio
ID: 44492 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,417,873
RAC: 123,624
Message 44493 - Posted: 15 Mar 2021, 14:50:36 UTC - in response to Message 44492.  

What you describe looks like a problem specific to your computer.
The time range suggests that it happens at the end of a subtask calculation.

Hence, to find out what happens you would need to watch the console output around that phase.

Console at ALT-F2:
Shows how many records are processed.
You will see something like "Begin processing the 6144th record ..."
A subtask is complete after 10000 records.


Console at ALT-F3:
Shows the output of the top utility.
In the list look at the runtime of the command cmsRun.
It's in minutes and together with the output from ALT-F2 it allows you to estimate when your subtasks are complete.
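
As a rough sketch of that estimate: the function below combines the record count from ALT-F2 with the cmsRun runtime from ALT-F3. It assumes a constant processing rate and that the cmsRun runtime started with the current subtask; the function name and the 90-minute input are illustrative values, not taken from a real task.

```python
RECORDS_PER_SUBTASK = 10_000  # from the post: a subtask is complete after 10000 records


def minutes_remaining(records_done: int, cmsrun_minutes: float) -> float:
    """Estimate minutes until the current subtask finishes (constant rate assumed)."""
    if records_done <= 0 or cmsrun_minutes <= 0:
        raise ValueError("need a positive record count and runtime to estimate a rate")
    rate = records_done / cmsrun_minutes  # records per minute so far
    remaining = max(RECORDS_PER_SUBTASK - records_done, 0)
    return remaining / rate


# Example: ALT-F2 shows the 6144th record, ALT-F3 shows cmsRun at 90 minutes (assumed).
print(round(minutes_remaining(6144, 90.0), 1))  # 56.5
```

This only tells you roughly when to start watching ALT-F4/5; the real rate can vary between records.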


ALT-F4 and ALT-F5 show diagnostic messages as well as errors.
If CMS can identify a problem, it prints corresponding messages on those consoles.
Unfortunately they disappear when other messages are printed.
Hence, you would have to switch quickly between ALT-F4 and ALT-F5 during a subtask change.



If the task is alive long enough you may have a chance to copy the logfiles from inside the VM.
Select the task in your BOINC Manager and click on "Show graphics".
A browser window opens where you can follow a link to the logs.
This can be prepared before the calculation is done.
At the critical moment just refresh the browser window.
Check the last messages from the logs for errors or warnings, typically for network connection errors and/or retries.
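
If you save a log to a text file, a small filter can help with that last check. This is only a sketch: the keyword list is a guess at what might be relevant, and the sample text consists of made-up placeholder lines, not real CMS, HTCondor or WMAgent output.

```python
import re

# Assumed keywords; adjust them to whatever the real logs actually contain.
PATTERN = re.compile(r"error|warn|retry|retries|timeout|timed out|connection",
                     re.IGNORECASE)


def suspicious_lines(log_text: str, tail: int = 200) -> list[str]:
    """Return lines among the last `tail` lines that match one of the keywords."""
    lines = log_text.splitlines()[-tail:]
    return [line for line in lines if PATTERN.search(line)]


# Placeholder sample, invented for illustration only.
sample = "step finished\nWARNING: placeholder warning line\nall good\nplaceholder: retry 2 of 5"
for line in suspicious_lines(sample):
    print(line)
```

Pasting the matching lines into a forum post is usually more useful than a summary of them.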



It makes no sense to compare CMS with other VBox apps, not even with ATLAS or Theory from LHC@home since all of them use different communication channels.
In case of CMS it's HTCondor and WMAgent.
ID: 44493 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44496 - Posted: 15 Mar 2021, 20:11:25 UTC
Last modified: 15 Mar 2021, 20:16:00 UTC

My computer is as standard as it can be. No other project has any problem, either on CPU or GPU. There was a time when Einstein@home gravitational wave tasks on GPU would not run because the GTX 1060 had only 3 GB of video RAM, so I installed a new board with 4 GB of video RAM. Now I am running World Community Grid and Rosetta@home tasks on another PC with 3 GB of video RAM and they all run perfectly. On QuChemPedIA@home, now unfortunately stopped, I was 37th in the RAC ranking list using VirtualBox, against CPUs running Linux on 124 or more processors, such as Ryzen 9. And this is only an Intel i5 9400F with 6 processors.
Tullio
ID: 44496 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44524 - Posted: 21 Mar 2021, 17:18:15 UTC

I can see the show graphics logs and they signal some connection errors. But the task goes on all the same.
Tullio
ID: 44524 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,793,446
RAC: 127,534
Message 44525 - Posted: 21 Mar 2021, 18:24:11 UTC - in response to Message 44524.  

Do you use a LAN cable or WiFi for this PC?
Is the network connection to the router shown correctly in Windows?
I don't know why only CMS with HTCondor has this problem and not ATLAS or Theory.
ID: 44525 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,417,873
RAC: 123,624
Message 44527 - Posted: 21 Mar 2021, 20:15:34 UTC - in response to Message 44524.  

I can see the show graphics logs and they signal some connection errors.

Those error messages might be helpful.
Can you post them here?
ID: 44527 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44529 - Posted: 22 Mar 2021, 6:42:14 UTC - in response to Message 44525.  

Do you use a LAN cable or WiFi for this PC?
Is the network connection to the router shown correctly in Windows?
I don't know why only CMS with HTCondor has this problem and not ATLAS or Theory.

I have a WiFi connection for this PC, another PC and an HP printer. They all work.
Tullio
ID: 44529 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44530 - Posted: 22 Mar 2021, 6:43:28 UTC - in response to Message 44527.  

I can see the show graphics logs and they signal some connection errors.

Those error messages might be helpful.
Can you post them here?

I shall do it next time I get a CMS task. Thanks.
Tullio
ID: 44530 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,793,446
RAC: 127,534
Message 44531 - Posted: 22 Mar 2021, 7:33:57 UTC - in response to Message 44529.  

Do you use a LAN cable or WiFi for this PC?
Is the network connection to the router shown correctly in Windows?
I don't know why only CMS with HTCondor has this problem and not ATLAS or Theory.

I have a WiFi connection for this PC, another PC and an HP printer. They all work.
Tullio

If you have the possibility to test one running CMS task over a LAN cable, it may show the reason for the error.
ID: 44531 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44541 - Posted: 24 Mar 2021, 15:26:20 UTC - in response to Message 44531.  

I have launched LHC@home on an HP laptop which is connected via a LAN cable. It got two ATLAS tasks, one of which completed and validated. When I get a CMS task on it, I shall look at the error messages and post them here. Thanks.
Tullio
ID: 44541 · Report as offensive     Reply Quote


©2024 CERN