Herwig7 7.2.1 nlo-dipole tasks run very slowly.
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
"The x of 760 is the 1st part of the workunit." Okay, so I'll keep my fingers crossed :-) P.S. The question remains, though, whether the 8GB limit in the slots folder will be reached too soon :-( |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
computezrmle wrote: "...but those tasks could suffer from" I wrote: "Well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ..." computezrmle wrote: "Even then it is not efficient since the data can't be used directly if it is on the ramdisk." Okay, I see - so this might explain why console_3 shows a CPU usage of only about 70% for Herwig. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
Finally finished this one: [boinc pp z1j 13000 75 - herwig7 7.2.1 nlo 4000 196] https://lhcathome.cern.ch/lhcathome/result.php?resultid=414856897 Run time: 8 days 12 hours 30 min 23 sec. CPU time: 10 days 15 hours 32 min 55 sec. After 196 hours of processing time for the 760 integrations, the processing of the 4000 events took 'only' 40 minutes. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
"Finally finished this one: [boinc pp z1j 13000 75 - herwig7 7.2.1 nlo 4000 196] https://lhcathome.cern.ch/lhcathome/result.php?resultid=414856897" So changing from 1 to 2 CPUs had some effect after all. Maybe without this, the task wouldn't have made it within the 10-day limit (unless you eliminated it beforehand, as suggested in one of your recent postings). Another question is whether a RAM size of more than 1 GB would have helped additionally. BTW: did you check how close the task came to the (<)8GB disk limit? |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
"So changing from 1 to 2 CPUs brought some effect after all." The effect was much smaller than it appears here, because I also used CPU time by monitoring the VM. "Maybe without this, the task wouldn't have made it within the 10 days' limit (unless you did eliminate it beforehand as suggested in one of your recent postings)." Yes, I remove the 10-day job duration by default. This limit is more or less meant for users who do not monitor their tasks at all. Removing it also suppresses the unneeded 'High Priority' running of Theory, which sometimes causes trouble by putting ATLAS, CMS or another BOINC project into a wait state. The task would have finished on time anyway, and I also suspended it once for 8 hours. Btw: I now have a task that has done 500 events of 49000. The total event processing would take 19 days. That can't be right. "Another question is whether a RAM size of even more than 1 GB would have helped additionally ???" I'm now watching 5 tasks with RAM set to 1536MB. The swap used is only 0, 268, 780, 2316 and 2572KiB. So giving the VM 1024MB should be enough. "BTW: did you check how close the task came to the (<)8GB disk limit?" The whole slot contents did not exceed 7GB. |
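As a rough cross-check of the event rates quoted above, the figures can be turned into seconds per event. This is only a back-of-the-envelope sketch: it assumes a constant rate per event within each task, and the helper name `seconds_per_event` is made up for illustration.

```python
DAY = 86_400  # seconds per day


def seconds_per_event(total_seconds: float, events: int) -> float:
    """Average wall time per event, assuming a constant rate (an assumption)."""
    if events <= 0:
        raise ValueError("events must be positive")
    return total_seconds / events


# Finished task: 4000 events in about 40 minutes.
fast = seconds_per_event(40 * 60, 4000)
# Running task: projected 19 days for 49000 events.
slow = seconds_per_event(19 * DAY, 49_000)

print(f"finished task: {fast:.1f} s/event")
print(f"running task:  {slow:.1f} s/event")
print(f"ratio:         {slow / fast:.0f}x")
```

The two tasks are different jobs, so the ratio only illustrates how far apart the per-event costs are, not that either number is wrong.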
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
In one case here, the slot folder now shows 7.17GB, and about 8.5 days have passed. So, by tomorrow I'll think about shutting down the task(s) gracefully. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
On a third host, I now wanted to try Herwig7, and before downloading a task I removed the 10-day runtime limit by following the 2 steps suggested by CP a short time ago. I.e. I removed the line <job_duration>864000</job_duration> from Theory_2024_04_30_prod.xml, and I added <dont_check_file_sizes>1</dont_check_file_sizes> to cc_config.xml. Then I closed the BOINC manager and opened it again. However, when downloading a Theory task, it errors out immediately, with stderr saying: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size</message> ]]> What's going wrong? |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml, and did you close not only the manager but also the client? |
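The placement question above can be checked mechanically. The following is a minimal sketch, not an official BOINC tool: it assumes the usual cc_config.xml layout with a `cc_config` root and an `options` child, and the function name `file_size_check_disabled` and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def file_size_check_disabled(cc_config_text: str) -> bool:
    """Return True if <dont_check_file_sizes>1</...> sits inside <options>."""
    root = ET.fromstring(cc_config_text)
    if root.tag != "cc_config":
        return False
    # Only a tag nested directly under <options> counts as correctly placed.
    value = root.findtext("options/dont_check_file_sizes")
    return value is not None and value.strip() == "1"


# Illustrative sample: option placed inside <options>.
good = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""

# Illustrative sample: option placed directly under the root.
misplaced = """<cc_config>
  <dont_check_file_sizes>1</dont_check_file_sizes>
</cc_config>"""

print(file_size_check_disabled(good))
print(file_size_check_disabled(misplaced))
```

This only verifies the file itself. Whether the running client has actually read the setting still has to be confirmed in the event log after a client restart.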
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
"Did you put <dont_check_file_sizes>1</dont_check_file_sizes> in the options part of cc_config.xml" Item 1: yes. Item 2: hm, that's what I am not sure about. So I need to do this again, but I have to wait until a task from another project has finished. Thanks for the hint, anyway. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
Wow! ====> z1j 13000 10 - herwig7 7.2.1 nlo-pw-dipole 49000 204: 36.10GB written to disk, and the {157ee720-f509-41aa-b15b-56c835603b57}.vdi differencing file has a size of 18,108,416 KB (17.3GB). Happily running: 15600 events done of 49000. The VM, however, is created with a virtual disk that may grow to a maximum of 20GB. EDIT: Two other tasks now exceed the 8,000,000,000 bytes (7.45GB): 17.0 GB and 17.7 GB. One of the three: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415016558 |
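The sizes quoted above mix decimal byte counts with binary "GB" figures, which is why 8,000,000,000 bytes shows up as 7.45GB. A small sketch of the conversion follows; it assumes that the "KB" value reported for the .vdi file means KiB (1024 bytes), which is an assumption about the file manager and not something stated in the post.

```python
KIB = 2**10  # bytes per KiB
GIB = 2**30  # bytes per GiB


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (GiB)."""
    return n_bytes / GIB


# The disk limit quoted in the post, in bytes.
limit = bytes_to_gib(8_000_000_000)
# The .vdi size quoted in the post, assuming "KB" means KiB.
vdi = bytes_to_gib(18_108_416 * KIB)

print(f"disk limit: {limit:.2f} GiB")
print(f"vdi file:   {vdi:.2f} GiB")
```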
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
Next question: I notice that the second line of Theory_2024_04_30_prod.xml says <memory_size_mb>630</memory_size_mb>. So, if I change the memory size to 1024MB in the app_config.xml, do I also need to make the same change in Theory_2024_04_30_prod.xml in order to achieve the desired effect? |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
"So, if I change the memory size to 1024MB in the app_config.xml, do I also need to make the same change in the Theory_2024_04_30_prod.xml in order to achieve the desired effect?" No, that's not necessary. app_config.xml overrules the settings of Theory_2024_04_30_prod.xml. With a running client, most settings can also be read in via "Read config files" in BOINC's Options menu. They apply to future tasks, of course, but some can also affect running tasks, such as avg_ncpus changing the number of free BOINC cores. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
To make sure, I did the exercise once more: closing the client, then closing the manager, then opening the manager. Downloaded 2 Herwig7 tasks; they started okay. For some reason, I looked at the Theory app_config.xml and was surprised to see the job duration line (864000) back there :-( No idea why, because I had definitely deleted it before. So I assumed that the reason why these two tasks did not fail right at the beginning was that the job duration limitation is in place. In order to see whether that's true or not, I deleted this line in the app_config.xml again and then downloaded a third task. And, as I had feared, it failed right away with "couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size". So two things are weird: - Why is the <dont_check_file_sizes>1</dont_check_file_sizes> line in the options section of cc_config.xml not recognized/accepted? - Why was the job duration line back in the app_config.xml after I had definitely deleted it before? (In fact, I had already noticed earlier that it keeps coming back.) P.S.: How can I check whether the two currently running tasks are subject to the 10-day limitation or not? |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
On one of my hosts, which has been running 2 Herwig7 tasks, one of them came close to 7.45GB of disk usage and close to the 10 days, so I decided to give it a "graceful shutdown". The shutdown worked, but, in contrast to what happened before when I shut down Theory tasks gracefully, the task description shows "computation error", "invalid", and does not yield credit points. So almost 10 days of CPU work for nothing; obviously the "graceful shutdown" does NOT work with Herwig7. This is really annoying, and I think I will keep my hands off Theory as long as these highly experimental Herwig7 tasks are issued. In my opinion, there are 3 things that should be done by the project: 1) Make a separate sub-sub-project for Herwig7, so that volunteers can choose to run either the "usual" Theory tasks or the Herwig7 tasks. I think that only people with high-end systems are the right target group for Herwig7 (although in this case, I would definitely count my Intel Core i9-10900KF, which currently runs at 4.6GHz, as "high end" - and still the runtime would have been beyond 10 days, plus disk usage higher than 7.45GB). 2) Remove the 10-day task runtime limitation. 3) Increase the 7.45GB disk limitation. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
"So two things are weird:" At the start of the BOINC client, some cc_config settings such as flags and options are displayed in BOINC's event log. You could also reread the config files (Options), and "Config: don't check file sizes" should be visible. When it's not there and you change Theory_2024_04_30_prod.xml (not the Theory app_config.xml, as you wrote), the client will check the size and download the server version again, so your change is gone. When job duration is used, the "time left" (Remaining) very often jumps to the job duration time (about 10 days). When job duration is not used, BOINC starts with a time left and decreases it - the longer the task lasts, the slower this decrease becomes, but as you know, BOINC is not aware of the time to go, so it is just guessing. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
"At the start of BOINC client some cc_config settings like flags and options are displayed in BOINC's event log." I checked the BOINC event log. It definitely says "Config: don't check file sizes". A few lines later comes the entry about the wrong size of Theory_2024_04_30_prod.xml (706 bytes instead of 743), and right after that Theory_2024_04_30_prod.xml is downloaded from the server (743 bytes). So I am wondering what is happening ... |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391 |
You may need to upgrade BOINC to at least v7.26.0 according to this PR: https://github.com/BOINC/boinc/pull/5523 |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
I did not think that the BOINC version could be the reason. For me that cc_config option has been there for a long time, but true: in the past I used another option for text files: just edit the file and keep exactly the same number of bytes. So for Erich: update BOINC, or delete the job duration line and add enough spaces at the beginning of several lines to reach 743 bytes again. But be careful, the developer used spaces and tabs interchangeably. |
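The "same number of bytes" workaround described above can be sketched as follows. This is a hedged illustration, not a tested procedure for the real Theory_2024_04_30_prod.xml: the function name `strip_line_keep_size`, the `<vbox_job>` root and the sample content are placeholders, and the sketch works on bytes in memory only. It pads with plain spaces and leaves the first line untouched.

```python
def strip_line_keep_size(data: bytes, marker: bytes) -> bytes:
    """Drop every line containing `marker`, then pad to the original size.

    Padding is added as leading spaces, one at a time, round-robin over
    all remaining lines except the first one.
    """
    lines = data.splitlines(keepends=True)
    kept = [ln for ln in lines if marker not in ln]
    if len(kept) < 2:
        raise ValueError("need at least two remaining lines to pad")
    missing = len(data) - sum(len(ln) for ln in kept)
    targets = len(kept) - 1  # every kept line except the first
    for i in range(missing):
        idx = 1 + (i % targets)
        kept[idx] = b" " + kept[idx]
    return b"".join(kept)


# Placeholder content, NOT the real job file.
sample = (
    b"<vbox_job>\n"
    b"  <memory_size_mb>630</memory_size_mb>\n"
    b"  <job_duration>864000</job_duration>\n"
    b"</vbox_job>\n"
)
out = strip_line_keep_size(sample, b"<job_duration>")
print(len(sample), len(out))
print(out.decode())
```

When trying something like this on a real file, keeping a backup copy and comparing the byte counts before restarting the client seems prudent.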
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038 |
".... there are 3 things that should be done by the project:" I think that's the main problem. There is almost no one left on the project team who feels responsible for BOINC overall. Only the server is updated every now and then. In the past, scientists were expected to give BOINC only jobs that could be done on a minimalist VM. Btw: a 4th task with an extended virtual disk: 18,524,143,616 bytes. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,947,087 RAC: 18,188 |
"You may need to upgrade BOINC to at least v7.26.0 according to this PR:" Thanks for the hint - I upgraded to the latest version, 8.0.2, and now the "don't check file sizes" thing seems to work - no computation error right after the start of the task. Still, the tasks may run into the 7.45GB rsc_disk_bound issue - is there nothing that can be done from my side? The slot folder contains a file called "init_data.xml" in which, among lots of other data, the rsc_disk_bound limitation shows up. Has anyone ever tried to increase this value? |
©2024 CERN