Herwig7 7.2.1 nlo-dipole tasks run very slowly.
Author | Message |
---|---|
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 1 |
They run the integration stage very slowly. They display "integrate 1 of 760". This number increases over time, but too slowly. |
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 1 |
It reaches "integrate 7 of 760" at approximately 2 hours. |
Send message Joined: 4 Mar 17 Posts: 25 Credit: 10,262,043 RAC: 1,268 |
These are long-running tasks. Once a task reaches 760 of 760, it goes into the next stage, which takes 1-2 more days on the tasks I am running right now. 3 have finished; 8 more have been running for 4 days so far. https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779140 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414777594 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775932 The herwig7 7.2.1 nlo 100000 tasks (the batch from Saturday/Sunday) take 3 to 5 days to finish, at least on my system. |
Send message Joined: 11 Jul 06 Posts: 6 Credit: 2,915,386 RAC: 1,785 |
Yes, they are really long tasks (as far as I can see, all of them herwig7): https://lhcathome.cern.ch/lhcathome/result.php?resultid=414773291 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414774205 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775461 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775741 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775947 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775944 My longest was: Run time 4 days 20 hours 14 min 14 sec, CPU time 4 days 8 hours 48 min 25 sec. This page shows the average job run time, 3817.1 minutes (about 2.65 days per task), over 513 successful runs: http://mcplots-dev.cern.ch/production.php?view=revision&rev=2794 |
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 1 |
It reached 290 out of 760 |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw. But the 100,000 events processing for this task will take way longer than normal. After 168 minutes only 1000 events have been processed. Extrapolating: another 11.5 days to go. |
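The extrapolation in the post above can be reproduced with a small sketch. It assumes the event rate stays constant for the rest of the run, which may not hold for a real Herwig7 task; the function name `remaining_days` is made up for this illustration.

```python
def remaining_days(elapsed_minutes, events_done, events_total):
    """Linear extrapolation of the remaining wall time, in days.

    Assumes the events processed so far are representative of the
    rest of the run (a simplification, not a guarantee).
    """
    minutes_per_event = elapsed_minutes / events_done
    remaining_minutes = (events_total - events_done) * minutes_per_event
    return remaining_minutes / (60 * 24)


# Numbers from the post: 1000 of 100,000 events after 168 minutes.
print(round(remaining_days(168, 1000, 100_000), 2))  # about 11.5 days
```

The same function works for any later checkpoint: plug in the elapsed minutes of the processing phase and the event count shown in the console.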
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 1 |
I think the number of events planned should be present at the beginning of runRivet.log. I think it is 16000 for me: ===> [runRivet] Wed Oct 9 21:54:16 UTC 2024 [boinc pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole 16000 190] Can you look in your runRivet.log? I think you should be able to press Show Graphics to open a browser window with the logs. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"I think number of events planned should be present in the beginning of runRivet.log" That's correct. 16000 is the number of events. From that run (pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole) 328,000 events have already been done in 4 successful tasks, so probably 3 tasks with 100,000 events each and one with 28,000 events. At the beginning of the log you see the input parameters, like: mode=boinc beam=pp process=z1j energy=13000 params=75 specific=- generator=herwig7 version=7.2.1 tune=nlo nevts=4000 seed=196 |
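For anyone who wants to pull the planned event count out of the log automatically, here is a minimal sketch. It assumes the bracketed job description in the first line of runRivet.log lists its values in the same order as the input parameters quoted above (mode, beam, process, energy, params, specific, generator, version, tune, nevts, seed). That mapping is inferred from the two posts, not taken from any documentation, and `parse_header` is a hypothetical helper name.

```python
import re

# Assumed positional order, inferred from the parameter list in the post.
FIELDS = ["mode", "beam", "process", "energy", "params", "specific",
          "generator", "version", "tune", "nevts", "seed"]


def parse_header(line):
    """Split the last [...] group of a runRivet.log header into named fields."""
    groups = re.findall(r"\[([^\]]*)\]", line)
    if not groups:
        raise ValueError("no bracketed job description found")
    tokens = groups[-1].split()
    if len(tokens) != len(FIELDS):
        raise ValueError("unexpected number of fields: %d" % len(tokens))
    return dict(zip(FIELDS, tokens))


header = ("===> [runRivet] Wed Oct 9 21:54:16 UTC 2024 "
          "[boinc pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole 16000 190]")
job = parse_header(header)
print(job["nevts"])  # planned number of events, as a string
```

All values are kept as strings; convert `job["nevts"]` with `int()` before doing arithmetic on it.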
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw" I decided to shut down this task gracefully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414836160 For the last 12 hours there was almost no CPU usage (about 1%) and the VM was constantly reading from the disk at a disk rate of 21MB. A BOINC shutdown and a reboot of host and guest did not solve this. This run (pp z1j 13000 280 - herwig7 7.2.0 nlo-pw) already had 1 success with 100000 events. |
Send message Joined: 28 Sep 04 Posts: 732 Credit: 49,363,408 RAC: 17,955 |
"This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw I decided to shut down this task gracefully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414836160" I saw similar behavior on my tasks as well. The tasks were running on an SSD with disk reads of about 250 MB/s. After they had read about 7 TB from the disk I aborted all Herwig Theory tasks. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,347 RAC: 21,472 |
On a host with an Intel i9-10900KF running at 4.6GHz, 2 Herwig7 tasks have been running for 4 days 8 hrs now. Console_2 shows 22,400 events processed in one case and 14,600 in the other. So for sure the tasks will not reach 100,000 events before they are stopped after 10 days of runtime. In other words: I should abandon these 2 tasks immediately, right? |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"... In other words: I should abandon these 2 tasks immediately, right ?" No. Herwig7 runs 2 long phases: Integration, then Processing. If you want to micromanage the task then, once it is in the processing phase, check the CPU time of Herwig in console 3 (top). Use this time together with the number of events already processed to estimate how long it will take to finish. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"On a host with an Intel i9-10900KF running at 4.6GHz, 2 Herwig7 tasks have been running for 4 days 8 hrs now." Not necessarily because of the deadline, but those tasks could suffer from - the huge disk read activity, with the CPU not getting the data quickly enough to proceed - getting the exceeded disk limit error |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,347 RAC: 21,472 |
"... once it is in the processing phase - check the CPU time of Herwig in console 3 (top)." Right at the beginning it says: top - 18:28:52 - up 4 days 6:24 ... So if this indicates the CPU time, then it's clear that the task won't finish within the 10-day limit. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,347 RAC: 21,472 |
"...but those tasks could suffer from" Well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ... |
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 1 |
I was able to enable networking and start cernvmfs and boinc from the recovery console, and to run lhcathome tasks this way. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"... once it is in the processing phase - check the CPU time of Herwig in console 3 (top)." This shows the runtime of the VM since its last restart. You need to look at the CPU time of the Herwig process as shown here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6230&postid=50795 "...but those tasks could suffer from" Even then it is not efficient, since the data can't be used directly if it is on the ramdisk. It has to be copied to the RAM controlled by the process first. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
562 out of 760 integrations done after 139 hours :-( |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,347 RAC: 21,472 |
computezrmle wrote: "... In other words: I should abandon these 2 tasks immediately, right ?" I just took a look: CPU time Herwig: 1.194:58 (must be minutes); 27,200 events processed; total runtime so far as seen in the BOINC Manager: 5 days 4 hrs. So, if I understand everything correctly, the task will NOT finish within the 10-day limit, right? Besides, the slot folder shows 7.11 GB now. At around 8 GB, the task will error out, right? |
Send message Joined: 4 Mar 17 Posts: 25 Credit: 10,262,043 RAC: 1,268 |
The "x of 760" counter is the first part of the work unit. You are already in the second part of the task with "27,200 events processed", so my guess is that it will take just 1 to 3 more days from now. If the 1.194:58 is for the second part of the work unit, that is around 20 hours per 28,000 events, so a bit more than 40 hours, I guess. That should be easily inside the 10-day limit. |
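The estimate suggested earlier in the thread (CPU time of the Herwig process together with the events already processed) can be sketched as below. It reads "1.194:58" as 1194 minutes 58 seconds, and it assumes that this CPU time belongs to the processing phase only and that the task targets 100,000 events; neither is confirmed in the thread. The helper names are invented for this example.

```python
def cpu_minutes(time_plus):
    """Convert a 'minutes:seconds' CPU time, as read from top, to minutes."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) + float(seconds) / 60


def hours_remaining(cpu_min, events_done, events_total):
    """Linear estimate of the CPU hours still needed for the remaining events."""
    minutes_per_event = cpu_min / events_done
    return (events_total - events_done) * minutes_per_event / 60


used = cpu_minutes("1194:58")                 # just under 20 hours
left = hours_remaining(used, 27_200, 100_000)  # assumed 100,000 event target
total_days = (5 * 24 + 4 + left) / 24          # plus the 5 days 4 hrs already run
print(round(left, 1), round(total_days, 1))
```

Under these assumptions the linear estimate lands somewhat above the rough guess in the post, at roughly 53 hours, and the projected total still stays well under the 10-day limit.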
©2024 CERN