Message boards : Theory Application : Herwig7 7.2.1 nlo-dipole tasks run very slowly.

Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50823 - Posted: 16 Oct 2024, 13:38:53 UTC - in response to Message 50822.  
Last modified: 16 Oct 2024, 13:57:31 UTC

The "x of 760" counter is the first part of the workunit.
You are already in the second part of the task with "27.200 events processed", so my guess is that it will take just 1 to 3 more days to finish.
If the 1.194:58 are for the second part of the workunit, that should be around 20 hours per 28,000 events, so a bit more than 40 hours, I guess.
That should fit easily within the 10-day limit.
Okay, so I'll keep my fingers crossed :-)
P.S. The question remains, though, whether the 8GB limit in the slots folder will be reached too soon :-(
ID: 50823
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50836 - Posted: 17 Oct 2024, 13:10:59 UTC - in response to Message 50810.  

computezrmle wrote:
...but those tasks could suffer from
- the huge disk read activity, CPU not getting the data quick enough to proceed ...

I wrote:
well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ...

computezrmle wrote:
Even then it is not efficient since the data can't be used directly if it is on the ramdisk.
It has to be copied to the RAM controlled by the process first.
Okay, I see. So this might explain why console_3 shows a CPU usage of only about 70% for Herwig.
ID: 50836
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50840 - Posted: 18 Oct 2024, 21:26:49 UTC

Finally finished this one: [boinc pp z1j 13000 75 - herwig7 7.2.1 nlo 4000 196] https://lhcathome.cern.ch/lhcathome/result.php?resultid=414856897

Run time 8 days 12 hours 30 min 23 sec
CPU time 10 days 15 hours 32 min 55 sec

After 196 hours of processing time for the 760 integrations, processing the 4000 events took 'only' 40 minutes.
ID: 50840
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50841 - Posted: 19 Oct 2024, 3:30:31 UTC - in response to Message 50840.  
Last modified: 19 Oct 2024, 3:32:18 UTC

Finally finished this one: [boinc pp z1j 13000 75 - herwig7 7.2.1 nlo 4000 196] https://lhcathome.cern.ch/lhcathome/result.php?resultid=414856897

Run time 8 days 12 hours 30 min 23 sec
CPU time 10 days 15 hours 32 min 55 sec

After 196 hours of processing time for the 760 integrations, processing the 4000 events took 'only' 40 minutes.
So changing from 1 to 2 CPUs had some effect after all. Without it, the task might not have finished within the 10-day limit (unless you had removed the limit beforehand, as suggested in one of your recent postings).
Another question is whether giving the VM more than 1 GB of RAM would have helped additionally.

BTW: did you check how close the task came to the (<)8GB disk limit?
ID: 50841
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50848 - Posted: 19 Oct 2024, 8:57:04 UTC - in response to Message 50841.  
Last modified: 19 Oct 2024, 9:05:54 UTC

So changing from 1 to 2 CPUs had some effect after all.
The effect was much smaller than it appears here, because monitoring the VM also used CPU time.

Without it, the task might not have finished within the 10-day limit (unless you had removed the limit beforehand, as suggested in one of your recent postings).
Yes, I eliminate the 10-day job duration by default. This limit is more or less meant for users who do not monitor their tasks at all.
Removing it also suppresses the unneeded 'High Priority' running of Theory, which sometimes causes trouble by putting ATLAS, CMS, or another BOINC project into a wait state.
The task would have finished on time anyway, even though I also suspended it once for 8 hours.
Btw: I now have a task that has done 500 of 49000 events. At that rate, the total event processing would take 19 days. That can't be right.
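
The 19-day figure is a linear extrapolation from the events processed so far. A minimal Python sketch of that arithmetic follows. The elapsed time of 4.65 hours for the first 500 events is an assumed value, chosen only because it reproduces the estimate; it is not stated in the thread. `estimate_total_hours` is a hypothetical helper name.

```python
def estimate_total_hours(events_done: int, events_total: int, hours_elapsed: float) -> float:
    """Linearly extrapolate total event-processing time from progress so far."""
    if events_done <= 0:
        raise ValueError("need at least one processed event to extrapolate")
    return hours_elapsed * events_total / events_done


# Assumed input: 4.65 h for the first 500 events (not a measured value).
total_hours = estimate_total_hours(500, 49000, 4.65)
print(f"{total_hours:.1f} h = {total_hours / 24:.1f} days")
```

The event rate does not have to stay constant over a run, so an estimate like this may be far off, which fits the doubt expressed above.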


Another question is whether giving the VM more than 1 GB of RAM would have helped additionally.
I'm now watching 5 tasks with RAM set to 1536MB. Their swap usage is only 0, 268, 780, 2316 and 2572KiB.
So giving the VM 1024MB should be enough.

BTW: did you check how close the task came to the (<)8GB disk limit?
The whole slot contents did not exceed 7GB.
ID: 50848
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50851 - Posted: 19 Oct 2024, 11:55:54 UTC - in response to Message 50848.  


BTW: did you check how close the task came to the (<)8GB disk limit?
The whole slot contents did not exceed 7GB.
In one case here, the slot folder now shows 7.17GB, and about 8.5 days have passed. So by tomorrow I'll think about shutting down the task(s) gracefully.
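
To watch how close a slot folder is to the limit, the file sizes below it can simply be added up. The following is a minimal Python sketch, assuming the 8.000.000.000-byte bound mentioned in this thread; the helper names are hypothetical, and BOINC's own accounting may differ from a plain sum of file sizes.

```python
import os

DISK_BOUND_BYTES = 8_000_000_000  # bound quoted in this thread (about 7.45 GiB)


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def headroom_gib(used_bytes: int, bound_bytes: int = DISK_BOUND_BYTES) -> float:
    """Space left before the bound is reached, in GiB (negative if exceeded)."""
    return (bound_bytes - used_bytes) / 2**30
```

Calling `headroom_gib(dir_size_bytes(slot_path))`, where `slot_path` is the slot directory of the task on your own host, gives the remaining margin.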
ID: 50851
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50853 - Posted: 19 Oct 2024, 13:37:10 UTC

On a third host, I now wanted to try Herwig7. Before downloading a task, I removed the 10-day runtime limit by following the 2 steps CP suggested a short time ago.

I.e. I removed the line <job_duration>864000</job_duration> from the Theory_2024_04_30_prod.xml, and I added <dont_check_file_sizes>1</dont_check_file_sizes> to the cc_config.xml. Then I closed the BOINC manager and opened it again.

However, when downloading a Theory task, it errors out immediately, with stderr saying:
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size</message>
]]>

What's going wrong?
ID: 50853
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50854 - Posted: 19 Oct 2024, 13:48:20 UTC - in response to Message 50853.  
Last modified: 19 Oct 2024, 13:49:48 UTC

Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml?

And/or did you close not only the manager, but also the client?
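
The first question can be checked mechanically. Below is a minimal Python sketch that tests whether the element sits inside `<options>` of a cc_config.xml text; the two sample documents and the function name are made up for illustration. It only checks placement and says nothing about whether a given client version honors the option.

```python
import xml.etree.ElementTree as ET


def dont_check_file_sizes_enabled(xml_text: str) -> bool:
    """Return True if <options><dont_check_file_sizes> is present and set to 1."""
    root = ET.fromstring(xml_text)
    node = root.find("options/dont_check_file_sizes")
    return node is not None and (node.text or "").strip() == "1"


# Illustrative sample: the option is inside <options>.
GOOD = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""

# Illustrative sample: the option ended up in <log_flags> by mistake.
MISPLACED = """<cc_config>
  <log_flags>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </log_flags>
</cc_config>"""

print(dont_check_file_sizes_enabled(GOOD))
print(dont_check_file_sizes_enabled(MISPLACED))
```

To run it against a real file, read the file's text and pass it to the function.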
ID: 50854
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50855 - Posted: 19 Oct 2024, 13:56:03 UTC - in response to Message 50854.  

Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml?

And/or did you close not only the manager, but also the client?
Item 1: yes.
Item 2: hm, that's what I'm not sure about. So I need to do this again, but I have to wait until a task from another project has finished. Thanks for the hint, anyway.
ID: 50855
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50856 - Posted: 19 Oct 2024, 14:31:23 UTC
Last modified: 19 Oct 2024, 20:38:14 UTC

Wow! ====> z1j 13000 10 - herwig7 7.2.1 nlo-pw-dipole 49000 204

36.10GB written to disk and the {157ee720-f509-41aa-b15b-56c835603b57}.vdi differencing file has a size of 18.108.416 KB (17.3GB)

Happily running: 15600 of 49000 events done.

The VM, however, is created with a virtual disk that may grow to a maximum of 20GB.

EDIT: Two other tasks now exceed the 8.000.000.000 bytes (7.45GB): 17.0 GB and 17.7 GB
One of the three: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415016558
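
The sizes in this post mix bytes, KB, and GB. The short Python sketch below redoes the conversions, assuming that "KB" here means 1024 bytes and "GB" means 2^30 bytes; under that assumption the quoted figures agree. The helper names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def kib_to_gib(n_kib: int) -> float:
    """Convert a KiB count (1024-byte units) to GiB."""
    return n_kib * 1024 / GIB


print(f"disk bound: {bytes_to_gib(8_000_000_000):.2f}")
print(f"vdi file:   {kib_to_gib(18_108_416):.1f}")
```

With these units, the 8.000.000.000-byte bound corresponds to the 7.45GB figure and the 18.108.416 KB differencing file to the 17.3GB figure.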
ID: 50856
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50857 - Posted: 19 Oct 2024, 14:41:21 UTC - in response to Message 50855.  

Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml?

And/or did you close not only the manager, but also the client?
Item 1: yes.
Item 2: hm, that's what I'm not sure about. So I need to do this again, but I have to wait until a task from another project has finished. Thanks for the hint, anyway.
Next question: I notice that the second line of Theory_2024_04_30_prod.xml says <memory_size_mb>630</memory_size_mb>
So, if I change the memory size to 1024MB in the app_config.xml, do I also need to make the same change in the Theory_2024_04_30_prod.xml in order to achieve the desired effect?
ID: 50857
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50858 - Posted: 19 Oct 2024, 14:51:35 UTC - in response to Message 50857.  
Last modified: 19 Oct 2024, 14:52:17 UTC

So, if I change the memory size to 1024MB in the app_config.xml, do I also need to make the same change in the Theory_2024_04_30_prod.xml in order to achieve the desired effect?

No, that's not necessary. app_config.xml overrides the settings in Theory_2024_04_30_prod.xml.
With a running client, most settings can also be re-read via "Read config files" in BOINC's Options menu. This applies to future tasks, of course, but it can also affect running tasks: for example, avg_ncpus changes the number of free BOINC cores.
ID: 50858
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50859 - Posted: 20 Oct 2024, 3:36:13 UTC - in response to Message 50855.  
Last modified: 20 Oct 2024, 4:35:54 UTC

Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml?

And/or did you close not only the manager, but also the client?
Item 1: yes.
Item 2: hm, that's what I'm not sure about. So I need to do this again, but I have to wait until a task from another project has finished. Thanks for the hint, anyway.
To make sure, I went through the exercise once more: closing the client, then closing the manager, then opening the manager.
I downloaded 2 Herwig7 tasks, and they started okay.
For some reason, I looked at the Theory app_config.xml and was surprised to see the job duration line (864000) back there :-( No idea why, because I had definitely deleted it before.
So I assumed that these two tasks did not fail right at the beginning because the job duration limit was in place. To see whether that's true, I deleted this line in the app_config.xml again and then downloaded a third task. And, as I had feared, it failed right away with "couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size".

So two things are weird:
- why is the <dont_check_file_sizes>1</dont_check_file_sizes> line in the options section of the cc_config.xml not recognized/accepted?
- why was the job duration line back in the app_config.xml after I had definitely deleted it before (in fact, I had already noticed earlier that it keeps coming back)?

P.S.: How can I check whether the two currently running tasks are subject to the 10-day limit or not?
ID: 50859
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50860 - Posted: 20 Oct 2024, 8:10:11 UTC

On one of my hosts, which has been running 2 Herwig7 tasks, one of them came close to the 7.45GB disk limit and close to the 10 days, so I decided to give it a "graceful shutdown".
The shutdown worked, but, in contrast to what happened before when I shut down Theory tasks gracefully, the task description shows "computation error" and "invalid", and the task does not yield credit points. So almost 10 days of CPU work were for nothing; obviously the "graceful shutdown" does NOT work with Herwig7.
This is really annoying, and I think I will stay away from Theory as long as these highly experimental Herwig7 tasks are issued.
In my opinion, there are 3 things that should be done by the project:

1) Make a separate sub-sub-project for Herwig7, so that volunteers can choose to run either the "usual" Theory tasks or the Herwig7 tasks. I think that only people with high-end systems are the right target group for Herwig7 (although in this case, I would definitely count my Intel Core i9-10900KF, which currently runs at 4.6GHz, as "high end", and the runtime would still have been beyond 10 days, with disk usage higher than 7.45GB).

2) Remove the 10-day task runtime limit.

3) Increase the 7.45GB disk limit.
ID: 50860
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50861 - Posted: 20 Oct 2024, 8:14:17 UTC - in response to Message 50859.  

So two things are weird:
- why is the <dont_check_file_sizes>1</dont_check_file_sizes> line in the options section of the cc_config.xml not recognized/accepted?
- why was the job duration line back in the app_config.xml after I had definitely deleted it before (in fact, I had already noticed earlier that it keeps coming back)?

P.S.: How can I check whether the two currently running tasks are subject to the 10-day limit or not?
When the BOINC client starts, some cc_config settings, such as flags and options, are displayed in BOINC's event log.
You could also re-read the config files (Options), and "Config: don't check file sizes" should then be visible.
When it's not there and you change Theory_2024_04_30_prod.xml (not the Theory app_config.xml, as you wrote), the client will check the size and download the server version again, so your change is lost.

When job duration is used, the "time left" (Remaining) very often jumps to the job duration time (about 10 days).
When job duration is not used, BOINC starts with an estimated time left and counts it down; the longer the task lasts, the slower the countdown gets. As you know, BOINC does not know the actual time remaining, so it is just guessing.
ID: 50861
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50862 - Posted: 20 Oct 2024, 8:44:36 UTC - in response to Message 50861.  

When the BOINC client starts, some cc_config settings, such as flags and options, are displayed in BOINC's event log.
You could also re-read the config files (Options), and "Config: don't check file sizes" should then be visible.
When it's not there and you change Theory_2024_04_30_prod.xml (not the Theory app_config.xml, as you wrote), the client will check the size and download the server version again, so your change is lost.
I checked the BOINC event log. It definitely says "Config: don't check file sizes". A few lines later comes the entry about the wrong size of Theory_2024_04_30_prod.xml (706 bytes instead of 743), and right after that the Theory_2024_04_30_prod.xml is downloaded from the server (743 bytes).
So I am wondering what happens ...
ID: 50862
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 28,391
Message 50863 - Posted: 20 Oct 2024, 9:01:29 UTC - in response to Message 50862.  

You may need to upgrade BOINC to at least v7.26.0 according to this PR:
https://github.com/BOINC/boinc/pull/5523
ID: 50863
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50864 - Posted: 20 Oct 2024, 9:20:40 UTC - in response to Message 50863.  
Last modified: 20 Oct 2024, 9:23:00 UTC

I did not think that the BOINC version could be the reason. For me, that cc_config option has been there for a long time,
but true: in the past I used another approach for text files. Just edit the file and keep exactly the same number of bytes.
So for Erich: update BOINC, or delete the job duration line and add enough spaces at the beginning of several lines to reach 743 bytes again.
But be careful, the developer used spaces and tabs interchangeably.
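
The byte-count workaround can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample text is a made-up stand-in for the job file (its real contents are not shown in this thread), the function name is hypothetical, and for simplicity all padding goes in front of a single line rather than several. Working on raw bytes leaves the existing tabs and spaces untouched.

```python
def drop_line_keep_size(data: bytes, needle: bytes) -> bytes:
    """Remove the first line containing needle and pad the next line with spaces.

    The returned bytes have exactly the same length as the input.
    """
    lines = data.splitlines(keepends=True)
    idx = next((i for i, line in enumerate(lines) if needle in line), None)
    if idx is None:
        raise ValueError("needle not found")
    if idx + 1 >= len(lines):
        raise ValueError("no following line to carry the padding")
    removed = lines.pop(idx)
    # Leading spaces on the following line replace the removed bytes one for one.
    lines[idx] = b" " * len(removed) + lines[idx]
    return b"".join(lines)


# Made-up stand-in for the job file; not the real file contents.
SAMPLE = (
    b"<vbox_job>\n"
    b"\t<memory_size_mb>630</memory_size_mb>\n"
    b"\t<job_duration>864000</job_duration>\n"
    b"</vbox_job>\n"
)

patched = drop_line_keep_size(SAMPLE, b"<job_duration>")
print(len(SAMPLE), len(patched))
```

For a real file, read it in binary mode, keep a backup, and compare the lengths before writing the result back.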
ID: 50864
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50865 - Posted: 20 Oct 2024, 9:52:43 UTC - in response to Message 50860.  
Last modified: 20 Oct 2024, 9:55:25 UTC

.... there are 3 things that should be done by the project:

1) ..
2) ..
3) ..
I think that's the main problem. Almost no one is left on the project team who feels responsible for BOINC overall. Only the server is updated every now and then.

In the past, scientists were expected to give BOINC only jobs that could be done on a minimalist VM.

Btw: a 4th task with an extended virtual disk: 18.524.143.616 bytes
ID: 50865
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,947,087
RAC: 18,188
Message 50867 - Posted: 20 Oct 2024, 11:47:57 UTC - in response to Message 50863.  

You may need to upgrade BOINC to at least v7.26.0 according to this PR:
https://github.com/BOINC/boinc/pull/5523
Thanks for the hint. I upgraded to the latest version, 8.0.2, and now the "don't check file sizes" option seems to work: no computation error right after the start of the task.

Still, the tasks may run into the 7.45GB rsc_disk_bound issue. Is there nothing that can be done from my side? The slot folder contains a file called "init_data.xml" in which, among lots of data, the rsc_disk_bound limit shows up. Has anyone ever tried to increase this value?
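
For reference, the value can at least be read out without touching the file. Below is a minimal Python sketch, assuming the bound appears as an `<rsc_disk_bound>` element with a numeric value; the sample text is made up, and the function name is hypothetical. It is read-only, and whether editing the value would have any effect is exactly the open question above.

```python
import re
from typing import Optional


def read_rsc_disk_bound(init_data_text: str) -> Optional[float]:
    """Return the rsc_disk_bound value in bytes, or None if it is not present."""
    match = re.search(
        r"<rsc_disk_bound>\s*([0-9.eE+-]+)\s*</rsc_disk_bound>", init_data_text
    )
    return float(match.group(1)) if match else None


# Made-up excerpt for illustration; a real init_data.xml holds many more fields.
SAMPLE = """<app_init_data>
<rsc_disk_bound>8000000000.000000</rsc_disk_bound>
</app_init_data>"""

bound = read_rsc_disk_bound(SAMPLE)
print(f"{bound:.0f} bytes = {bound / 2**30:.2f} GiB")
```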
ID: 50867

©2024 CERN