Message boards : Theory Application : Herwig7 7.2.1 nlo-dipole tasks run very slowly.
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,009
RAC: 20,590
Message 50868 - Posted: 20 Oct 2024, 12:18:28 UTC - in response to Message 50867.  

The slot folder contains a file called "init_data.xml" in which, among lots of other data, the rsc_disk_bound limit shows up. Has anyone ever tried to increase this value?
The answer is: YES. I did it a few minutes ago by replacing the first digit "8" with "12". So the question is: will this have the desired effect?
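
For anyone who only wants to look at the value rather than change it, here is a minimal read-only sketch. It assumes the file parses as ordinary XML and that the limit is stored in bytes in an element named rsc_disk_bound; the sample string is a hypothetical, heavily shortened stand-in, not a real init_data.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a slot's init_data.xml.
# A real file contains many more elements.
SAMPLE_INIT_DATA = """<app_init_data>
<wu_name>example_workunit</wu_name>
<rsc_disk_bound>8000000000.000000</rsc_disk_bound>
</app_init_data>"""


def read_disk_bound_bytes(xml_text: str) -> float:
    """Return the first rsc_disk_bound value found in the XML, in bytes."""
    root = ET.fromstring(xml_text)
    # iter() searches the whole tree, so the nesting depth does not matter.
    element = next(root.iter("rsc_disk_bound"), None)
    if element is None or element.text is None:
        raise ValueError("no rsc_disk_bound element found")
    return float(element.text)


bound = read_disk_bound_bytes(SAMPLE_INIT_DATA)
print(f"rsc_disk_bound = {bound:.0f} bytes = {bound / 1e9:.1f} GB")
```

With a leading "8" replaced by "12", the same function would report 12.0 GB for this sample.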
ID: 50868
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50869 - Posted: 20 Oct 2024, 14:15:15 UTC - in response to Message 50868.  
Last modified: 20 Oct 2024, 16:05:40 UTC

Erich56 wrote:
The slot folder contains a file called "init_data.xml" in which, among lots of other data, the rsc_disk_bound limit shows up. Has anyone ever tried to increase this value?
The answer is: YES. I did it a few minutes ago by replacing the first digit "8" with "12". So the question is: will this have the desired effect?

The answer is NO.
The init_data.xml is an extraction from client_state.xml and contains data for that specific task.
But indeed the rsc_disk_bound is the limit setting for that task and controlled by BOINC and not by the task. So BOINC kills the task when the limit is exceeded like:

09-Oct-2024 00:17:08 [LHC@home] Aborting task Theory_2794-3267759-142_1: exceeded disk limit: 9321.23MB > 7629.39MB

The setting may be changed in client_state.xml when the client is not running and must be done for every (new) task.
I don't want to encourage everyone to fiddle around with client_state.xml, so no support, but you can try it by trial and error on your own.
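
To put numbers on such a log line, here is a small sketch that pulls out the task name, the usage and the limit. It assumes the "MB" printed by the client means 2**20 bytes (that would make 7629.39MB equal to 8,000,000,000 bytes, i.e. the "8" mentioned above); the regular expression is my own guess at the message layout based on the line quoted here, not something taken from the BOINC sources.

```python
import re

LOG_LINE = (
    "09-Oct-2024 00:17:08 [LHC@home] Aborting task "
    "Theory_2794-3267759-142_1: exceeded disk limit: "
    "9321.23MB > 7629.39MB"
)

# Pattern inferred from the quoted message; other client versions may differ.
PATTERN = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)

MIB = 1024 * 1024  # assumption: the client's "MB" is 2**20 bytes


def parse_disk_abort(line):
    """Return (task, used_mb, limit_mb), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match["task"], float(match["used"]), float(match["limit"])


task, used_mb, limit_mb = parse_disk_abort(LOG_LINE)
print(f"{task}: limit is about {limit_mb * MIB / 1e9:.2f}e9 bytes, "
      f"exceeded by {used_mb - limit_mb:.2f} MB")
```

Because the pattern is applied with search(), it does not depend on how the timestamp and project name are laid out at the start of the line.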

Two results with excessive use of the slot-folder:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415016497 Peak disk usage 17.27 GB
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415016558 Peak disk usage 17.04 GB
ID: 50869
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50870 - Posted: 20 Oct 2024, 14:34:51 UTC

The disk limit is used as a watchdog to prevent a task from writing huge amounts of data to the disk until it is completely full.
8 GB usually has plenty of headroom, so there's no need to extend that value.
Instead, if tasks hit the limit (without writing snapshots for certain reasons), there's something wrong with the task setup, e.g. lots of errors written to the internal logfiles.

In the end, it's nothing that should be repaired by the BOINC volunteers.

Suggestion:
Let the tasks run to get the wrong ones sorted out automatically.
Failed tasks are not nice but once they are in the queue nobody will manually remove them.
ID: 50870
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,009
RAC: 20,590
Message 50872 - Posted: 21 Oct 2024, 7:42:32 UTC - in response to Message 50836.  

computezrmle wrote:
...but those tasks could suffer from
- the huge disk read activity, CPU not getting the data quick enough to proceed ...

I wrote:
well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ...

computezrmle wrote:
Even then it is not efficient, since the data can't be used directly if it is on the ramdisk.
It has to be copied to the RAM controlled by the process first.

Okay, I see - so this might explain why console_3 shows a CPU usage of only about 70% for Herwig.

Well, it now seems to me that the low CPU usage (70%) is a result of too much swapping because of the 630 MB default RAM. After I increased this value to 1536 MB (I have plenty of RAM available on some of my hosts), the CPU usage shown for Herwig in console_3 is around 98-99%.
ID: 50872
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50873 - Posted: 21 Oct 2024, 7:56:14 UTC - in response to Message 50872.  

Erich56 wrote:
After I increased this value to 1536 MB (I have plenty of RAM available on some of my hosts), the CPU usage shown for Herwig in console_3 is around 98-99%.

+1
ID: 50873
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 50874 - Posted: 21 Oct 2024, 12:13:35 UTC

I have one herwig7 7.2.0 nlo 37000 110 task.
Is there a difference compared with herwig7 7.2.1?
The file is 2.03 GByte and is NOT growing.
ID: 50874
Profile pascali

Joined: 17 Sep 11
Posts: 1
Credit: 2,389,835
RAC: 866
Message 50875 - Posted: 21 Oct 2024, 14:14:10 UTC

Every single one of these that I have let run has finished with "Error while computing", so it seems pointless to run any more of them unless there is a way to make them work. I have now started aborting them.
ID: 50875
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 50876 - Posted: 21 Oct 2024, 14:35:29 UTC - in response to Message 50875.  
Last modified: 21 Oct 2024, 15:02:08 UTC

OK, but I will keep an eye on it for the next few days. The limit is 10 days ;-)
4 GByte now for this task.
In the end: 10 GByte and C R A S H E D.
This sort of task also needs to be stopped by CERN IT, like SHERPA in the past!
ID: 50876
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50967 - Posted: 30 Oct 2024, 8:10:59 UTC

This task https://lhcathome.cern.ch/lhcathome/result.php?resultid=415343514 exceeded the disk limit - [boinc pp z1j 13000 110 - herwig7 7.2.0 nlo 100000 234]

LHC@home 30 Oct 07:56:50 Aborting task Theory_2794-3245286-234_1: exceeded disk limit: 17327.41MB > 7629.39MB
ID: 50967
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1176
Credit: 54,887,670
RAC: 5,761
Message 50969 - Posted: 30 Oct 2024, 20:26:11 UTC

So far, on the -dev version 6.01 (vbox64_theory) for Microsoft Windows running on an AMD x86_64 or Intel EM64T CPU,
[boinc pp z1j 13000 35 - herwig7 7.2.1 nlo-pw-dipole 100000 253] tasks have been finishing as Valid in just over 5 days each.
ID: 50969
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1176
Credit: 54,887,670
RAC: 5,761
Message 50980 - Posted: 1 Nov 2024, 11:01:19 UTC - in response to Message 50969.  

So far, on the -dev version 6.01 (vbox64_theory) for Microsoft Windows running on an AMD x86_64 or Intel EM64T CPU,
[boinc pp z1j 13000 35 - herwig7 7.2.1 nlo-pw-dipole 100000 253] tasks have been finishing as Valid in just over 5 days each.

I knew it would be a jinx if I said they worked, because 3 of my current batch of 4 of those just crashed after .......
Run time 6 days 6 hours 20 min 57 sec
CPU time 6 days 4 hours 52 min 3 sec
The 4th one is still in the running log but I will be surprised if that actually is Valid since the previous ones finished in 5 days
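
As a side note, the two times above also give the CPU efficiency of those tasks, which can be compared with the console_3 percentages discussed earlier in this thread. A quick sketch follows; the function name is just something made up for illustration.

```python
from datetime import timedelta

# Values copied from the task report quoted above.
run_time = timedelta(days=6, hours=6, minutes=20, seconds=57)
cpu_time = timedelta(days=6, hours=4, minutes=52, seconds=3)


def cpu_efficiency(cpu: timedelta, wall: timedelta) -> float:
    """Fraction of the wall-clock run time that was spent on the CPU."""
    # Dividing one timedelta by another yields a plain float.
    return cpu / wall


efficiency = cpu_efficiency(cpu_time, run_time)
print(f"CPU efficiency: {efficiency:.1%}")
```

This works out to roughly 99%, so these tasks were not starved for CPU even though they crashed.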
ID: 50980
Glohr

Joined: 13 Jan 24
Posts: 3
Credit: 2,248,813
RAC: 3,690
Message 51079 - Posted: 17 Nov 2024, 2:53:24 UTC - in response to Message 50872.  

Thanks for the hint about the memory size. I set <memory_size_mb>2048</memory_size_mb> in Theory_2024_04_30_prod.xml and finally a couple of herwig tasks completed successfully after 9+ days.

Those tasks used up to about 1.3 GB in the VM, so 1536 MB would have been sufficient. 630 MB is clearly inadequate given the number of tasks that had been running out of time or otherwise getting errors.
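
The arithmetic behind that can be sketched in a few lines. It assumes "1.3 GB" means 1.3 x 1024 MB and ignores whatever else the VM needs besides the task itself, so treat the numbers as rough; the function name is only for illustration.

```python
PEAK_GB = 1.3  # peak usage observed inside the VM, as reported above


def headroom_mb(configured_mb: int, peak_gb: float = PEAK_GB) -> float:
    """Configured VM memory minus observed peak usage, in MB."""
    peak_mb = peak_gb * 1024  # assumption: 1 GB = 1024 MB
    return configured_mb - peak_mb


# 630 is the default; 1536 and 2048 are the values tried in this thread.
for size_mb in (630, 1536, 2048):
    spare = headroom_mb(size_mb)
    verdict = "enough" if spare > 0 else "too small"
    print(f"memory_size_mb={size_mb}: headroom {spare:+.0f} MB ({verdict})")
```

A negative headroom for 630 MB is consistent with the swapping and low CPU usage reported earlier in the thread.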
ID: 51079


©2024 CERN