Message boards : Theory Application : Herwig7 7.2.1 nlo-dipole tasks run very slowly.

kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 1
Message 50768 - Posted: 9 Oct 2024, 22:48:45 UTC

They run the integration stage very slowly.

They display "integrate 1 of 760".
This number increases over time, but too slowly.
ID: 50768
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 1
Message 50769 - Posted: 9 Oct 2024, 23:50:21 UTC

It reaches "integrate 7 of 760" after approximately 2 hours.
ID: 50769
Toggleton
Joined: 4 Mar 17
Posts: 25
Credit: 10,262,043
RAC: 1,268
Message 50770 - Posted: 10 Oct 2024, 5:17:56 UTC
Last modified: 10 Oct 2024, 5:27:50 UTC

They are long-running tasks. Once a task has reached 760 of 760, it goes into the next stage, which takes 1-2 days more on the tasks I am running right now. I have 3 finished; 8 more have been running for 4 days so far.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779140
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414777594
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775932
The herwig7 7.2.1 nlo 100000 tasks (the batch from Saturday/Sunday) take 3 to 5 days to finish, at least on my system.
ID: 50770
ktamail666
Joined: 11 Jul 06
Posts: 6
Credit: 2,915,386
RAC: 1,785
Message 50773 - Posted: 10 Oct 2024, 23:46:59 UTC

ID: 50773
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 1
Message 50775 - Posted: 11 Oct 2024, 12:28:10 UTC

It reached 290 out of 760

ID: 50775
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50777 - Posted: 13 Oct 2024, 9:20:08 UTC

This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw
But processing the 100,000 events for this task will take far longer than normal. After 168 minutes only 1,000 events have been processed.
Extrapolating: another 11.5 days to go.
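
The extrapolation here is a simple linear projection. A minimal sketch in Python, assuming the event rate stays constant for the rest of the run (the function name `remaining_days` is made up for illustration and is not part of BOINC or Herwig):

```python
def remaining_days(events_done, minutes_elapsed, events_total):
    """Linearly extrapolate the time still needed, in days."""
    if events_done <= 0:
        raise ValueError("need at least one processed event to extrapolate")
    minutes_per_event = minutes_elapsed / events_done
    remaining_minutes = (events_total - events_done) * minutes_per_event
    return remaining_minutes / (60 * 24)


# Figures from the post: 1,000 of 100,000 events after 168 minutes.
days_left = remaining_days(events_done=1000, minutes_elapsed=168, events_total=100_000)
print(f"about {days_left:.1f} days left")
```

The event rate may well change during a run, so this is only a rough guide.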
ID: 50777
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 1
Message 50778 - Posted: 13 Oct 2024, 9:29:22 UTC
Last modified: 13 Oct 2024, 9:33:11 UTC

I think the planned number of events should appear at the beginning of runRivet.log.
I think it is 16000 for me.
===> [runRivet] Wed Oct 9 21:54:16 UTC 2024 [boinc pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole 16000 190]
Can you look in your runRivet.log?
I think you should be able to press Show Graphics to open a browser window with the logs.
ID: 50778
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50779 - Posted: 13 Oct 2024, 9:46:57 UTC - in response to Message 50778.  
Last modified: 13 Oct 2024, 9:56:53 UTC

kotenok2000 wrote:
I think the planned number of events should appear at the beginning of runRivet.log.
I think it is 16000 for me.
===> [runRivet] Wed Oct 9 21:54:16 UTC 2024 [boinc pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole 16000 190]
Can you look in your runRivet.log?
I think you should be able to press Show Graphics to open a browser window with the logs.

That's correct. 16000 is the number of events.
For that run (pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole), 328,000 events have already been completed in 4 successful tasks, so probably 3 tasks with 100,000 events each and one with 28,000 events.
At the beginning of the log you see the input parameters, like:

mode=boinc
beam=pp
process=z1j
energy=13000
params=75
specific=-
generator=herwig7
version=7.2.1
tune=nlo
nevts=4000
seed=196
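
As a rough illustration, the bracketed job description in the first runRivet.log line can be split into these parameters. This sketch assumes the values appear in the same order as the parameter list above, which matches the example line quoted in this thread; the helper name `parse_job_header` is hypothetical and not part of any LHC@home tool.

```python
import re

# Assumed field order, taken from the parameter list shown in the log.
FIELDS = ("mode", "beam", "process", "energy", "params", "specific",
          "generator", "version", "tune", "nevts", "seed")


def parse_job_header(line):
    """Split the last [...] group of a runRivet.log header line into named fields."""
    groups = re.findall(r"\[([^\[\]]*)\]", line)
    if not groups:
        raise ValueError("no bracketed job description found")
    values = groups[-1].split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    job = dict(zip(FIELDS, values))
    job["nevts"] = int(job["nevts"])  # planned number of events
    return job


header = ("===> [runRivet] Wed Oct 9 21:54:16 UTC 2024 "
          "[boinc pp z1j 7000 150 - herwig7 7.2.1 nlo-dipole 16000 190]")
job = parse_job_header(header)
print(job["generator"], job["version"], job["tune"], job["nevts"])
```

For the header line quoted in this thread, `job["nevts"]` comes out as 16000.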
ID: 50779
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50783 - Posted: 14 Oct 2024, 5:47:49 UTC - in response to Message 50777.  
Last modified: 14 Oct 2024, 5:56:21 UTC

Crystal Pellet wrote:
This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw
But processing the 100,000 events for this task will take far longer than normal. After 168 minutes only 1,000 events have been processed.
Extrapolating: another 11.5 days to go.

I decided to shut down this task gracefully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414836160
For the last 12 hours there was almost no CPU usage (about 1%) and the VM was constantly reading from the disk with a diskratio of 21MB.
A BOINC shutdown and a reboot of host and guest did not solve this.

This run (pp z1j 13000 280 - herwig7 7.2.0 nlo-pw) already had 1 success with 100,000 events.
ID: 50783
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,363,408
RAC: 17,955
Message 50784 - Posted: 14 Oct 2024, 6:25:36 UTC - in response to Message 50783.  

Crystal Pellet wrote:
This one reached the 760 integration steps after 2 days and 4 hours - Job: pp z1j 13000 280 - herwig7 7.2.0 nlo-pw
But processing the 100,000 events for this task will take far longer than normal. After 168 minutes only 1,000 events have been processed.
Extrapolating: another 11.5 days to go.
I decided to shut down this task gracefully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414836160
For the last 12 hours there was almost no CPU usage (about 1%) and the VM was constantly reading from the disk with a diskratio of 21MB.
A BOINC shutdown and a reboot of host and guest did not solve this.

This run (pp z1j 13000 280 - herwig7 7.2.0 nlo-pw) already had 1 success with 100,000 events.

I saw similar behavior on my tasks as well. The tasks were running on an SSD with disk reads of about 250 MB/s. After they had read about 7 TB from the disk, I aborted all Herwig Theory tasks.
ID: 50784
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,941,347
RAC: 21,472
Message 50802 - Posted: 15 Oct 2024, 15:43:00 UTC - in response to Message 50784.  

On a host with an Intel i9-10900KF running at 4.6GHz, 2 Herwig7 tasks have been running for 4 days 8 hrs now.
Console_2 in one case shows 22,400 events processed, in the other case 14,600. So for sure the tasks will not reach 100,000 events before they are stopped after 10 days of runtime.
In other words: I should abandon these 2 tasks immediately, right?
ID: 50802
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50803 - Posted: 15 Oct 2024, 16:04:25 UTC - in response to Message 50802.  

Erich56 wrote:
... In other words: I should abandon these 2 tasks immediately, right?

No.
Herwig7 runs 2 long phases, Integration then Processing.
If you want to micromanage the task then - once it is in the processing phase - check the CPU time of Herwig in console 3 (top).
Use this time together with the already processed events to estimate how long it will take to finish.
ID: 50803
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50804 - Posted: 15 Oct 2024, 16:05:46 UTC - in response to Message 50802.  

Erich56 wrote:
On a host with an Intel i9-10900KF running at 4.6GHz, 2 Herwig7 tasks have been running for 4 days 8 hrs now.
Console_2 in one case shows 22,400 events processed, in the other case 14,600. So for sure the tasks will not reach 100,000 events before they are stopped after 10 days of runtime.
In other words: I should abandon these 2 tasks immediately, right?

Not necessary because of the deadline, but those tasks could suffer from
- the huge disk read activity, CPU not getting the data quickly enough to proceed
- getting the exceeded disk limit error
ID: 50804
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,941,347
RAC: 21,472
Message 50806 - Posted: 15 Oct 2024, 16:31:14 UTC - in response to Message 50803.  

computezrmle wrote:
... once it is in the processing phase - check the CPU time of Herwig in console 3 (top).

Right at the beginning it says: top - 18:28:52 - up 4 days 6:24 ... so if this indicates the CPU time, then it's clear that the task won't finish within the 10-day limit.
ID: 50806
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,941,347
RAC: 21,472
Message 50807 - Posted: 15 Oct 2024, 16:51:53 UTC - in response to Message 50804.  

Crystal Pellet wrote:
...but those tasks could suffer from
- the huge disk read activity, CPU not getting the data quickly enough to proceed ...

Well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ...
ID: 50807
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 1
Message 50808 - Posted: 15 Oct 2024, 19:06:33 UTC
Last modified: 15 Oct 2024, 19:08:26 UTC

I was able to enable the network, start cernvmfs and boinc from the recovery console, and run lhcathome tasks like this.
ID: 50808
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50810 - Posted: 15 Oct 2024, 19:11:08 UTC - in response to Message 50806.  

computezrmle wrote:
... once it is in the processing phase - check the CPU time of Herwig in console 3 (top).

Erich56 wrote:
Right at the beginning it says: top - 18:28:52 - up 4 days 6:24 ... so if this indicates the CPU time, then it's clear that the task won't finish within the 10-day limit.

This shows the runtime of the VM since its last restart.
You need to look at the CPU time of the Herwig process as shown here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6230&postid=50795

Crystal Pellet wrote:
...but those tasks could suffer from
- the huge disk read activity, CPU not getting the data quickly enough to proceed ...

Erich56 wrote:
Well, this should (hopefully) not happen here, since BOINC runs on a ramdisk. But who knows ...

Even then it is not efficient, since the data can't be used directly if it is on the ramdisk.
It has to be copied to the RAM controlled by the process first.
ID: 50810
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50820 - Posted: 16 Oct 2024, 11:50:00 UTC

562 out of 760 integrations done after 139 hours :-(
ID: 50820
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,941,347
RAC: 21,472
Message 50821 - Posted: 16 Oct 2024, 12:19:49 UTC - in response to Message 50803.  
Last modified: 16 Oct 2024, 12:30:40 UTC

Erich56 wrote:
... In other words: I should abandon these 2 tasks immediately, right?
computezrmle wrote:
No.
Herwig7 runs 2 long phases, Integration then Processing.
If you want to micromanage the task then - once it is in the processing phase - check the CPU time of Herwig in console 3 (top).
Use this time together with the already processed events to estimate how long it will take to finish.

I just took a look:
CPU time Herwig: 1,194:58 (must be minutes); 27,200 events processed; total runtime so far as seen in the BOINC Manager: 5 days 4 hrs.
So, if I understand everything correctly, the task will NOT finish within the 10-day limit, right?

Besides, the slot folder shows 7.11 GB now. At around 8 GB, the task will error out, right?
ID: 50821
Toggleton
Joined: 4 Mar 17
Posts: 25
Credit: 10,262,043
RAC: 1,268
Message 50822 - Posted: 16 Oct 2024, 12:54:40 UTC - in response to Message 50821.  

The x of 760 is the first part of the workunit.
You are already in the second part of the task with "27,200 events processed", so my guess is that it will take just 1 to 3 more days from now.
If the 1,194:58 is for the second part of the workunit, that should be around 20 hours per 28,000 events, so a bit more than 40 hours, I guess.
And that should be easily inside the 10-day limit.
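
The same estimate can be worked through explicitly. This is only a sketch under two assumptions: that the CPU time shown by top is in minutes:seconds and belongs entirely to the event-processing phase, and that the event rate stays constant. The helper names are made up for illustration.

```python
def cpu_time_to_minutes(time_field):
    """Convert a 'minutes:seconds' CPU time string to minutes."""
    minutes, seconds = time_field.split(":")
    # Drop thousands separators such as '1.194' or '1,194' in the minutes part.
    minutes = minutes.replace(".", "").replace(",", "")
    return int(minutes) + float(seconds) / 60


def hours_left(cpu_minutes, events_done, events_total):
    """Linearly extrapolate the CPU hours still needed for the remaining events."""
    minutes_per_event = cpu_minutes / events_done
    return (events_total - events_done) * minutes_per_event / 60


cpu_minutes = cpu_time_to_minutes("1.194:58")
eta_hours = hours_left(cpu_minutes, events_done=27_200, events_total=100_000)
print(f"CPU time so far: {cpu_minutes / 60:.1f} h, estimated remaining: {eta_hours:.0f} h")
```

Under these assumptions the remaining time comes out at roughly 53 hours, somewhat above the rough figure of 40 hours but still well inside the 10-day limit.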
ID: 50822


©2024 CERN