Message boards : Theory Application : tasks "...powheg-box..." not showing processed events in console F2

Erich56

Joined: 18 Dec 15
Posts: 1757
Credit: 115,865,700
RAC: 84,714
Message 50382 - Posted: 10 Jun 2024, 18:33:24 UTC
Last modified: 10 Jun 2024, 18:33:49 UTC

Since this afternoon, the downloaded tasks with "...powheg-box..." in the name (shown in the last line of console F1) do NOT show any processed events in console F2, even after several hours.
During processing they sometimes eat up a lot of memory; I had two of them running on a machine with 32 GB RAM and had to kill one of them, because RAM usage was at 31.32 GB and the system was close to crashing.
The task is: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759993
Can one of the experts here tell from the stderr what is different from the tasks we have had so far?

Some of my hosts are now processing this kind of task, none of them shows processed events in console F2.
ID: 50382
maeax
Joined: 2 May 07
Posts: 2198
Credit: 173,403,419
RAC: 44,056
Message 50383 - Posted: 10 Jun 2024, 18:53:10 UTC - in response to Message 50382.  

run                                          events  attempts  success  failure  unknown
pp jets 13000 100 - powheg-box r3744 ptdef2  200000        28        2        0       26
ID: 50383
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50388 - Posted: 11 Jun 2024, 11:37:31 UTC - in response to Message 50382.  
Last modified: 11 Jun 2024, 12:06:12 UTC

since this afternoon, the downloaded tasks with "...powheg-box..." in the name (shown in last line of console F1) do NOT show any processed events in console F2, even after several hours.
I got this one: ===> [runRivet] Tue Jun 11 11:24:33 UTC 2024 [boinc pp jets 8000 250 - powheg-box r3744 pthard2 100000 13]
Input parameters:
mode=boinc
beam=pp
process=jets
energy=8000
params=250
specific=-
generator=powheg-box
version=r3744
tune=pthard2
nevts=100000
seed=13
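
The bracketed string at the end of the runRivet line maps one-to-one onto the input parameters listed above. A minimal Python sketch of that mapping follows; the helper name parse_runrivet is hypothetical, and the field order is assumed from the list in this post rather than taken from the Theory application's own scripts.

```python
import re

# Field names copied from the "Input parameters" list, in the same order.
FIELDS = ["mode", "beam", "process", "energy", "params", "specific",
          "generator", "version", "tune", "nevts", "seed"]


def parse_runrivet(line):
    """Return a dict of job parameters from a '[runRivet]' log line, or None."""
    # The parameters sit in the last [...] group on the line.
    match = re.search(r"\[runRivet\].*\[([^\]]+)\]\s*$", line)
    if match is None:
        return None
    values = match.group(1).split()
    if len(values) != len(FIELDS):
        # Unexpected layout: refuse to guess which value belongs to which field.
        return None
    return dict(zip(FIELDS, values))


line = ("===> [runRivet] Tue Jun 11 11:24:33 UTC 2024 "
        "[boinc pp jets 8000 250 - powheg-box r3744 pthard2 100000 13]")
job = parse_runrivet(line)
print(job["generator"], job["tune"], job["nevts"])
```

All values are kept as strings, since entries such as params can be non-numeric in other tasks.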

At the moment the run-dijet process is using 99% CPU. I suppose this is the init phase. Let's see how it proceeds.
Differences to your setup:
             Yours     Mine
BOINC       7.22.2    8.0.2
VBox        7.0.10    7.0.18
VM RAM      630 MB    768 MB
Max runtime 10 days   unlimited
Snapshots     No       Yes
ID: 50388
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50389 - Posted: 11 Jun 2024, 12:14:33 UTC - in response to Message 50382.  
Last modified: 11 Jun 2024, 12:17:14 UTC

@Erich56: This is a powheg-box valid task of yours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759432

Why is your VM state constantly changing from paused to running?
ID: 50389
Erich56
Joined: 18 Dec 15
Posts: 1757
Credit: 115,865,700
RAC: 84,714
Message 50390 - Posted: 11 Jun 2024, 15:47:43 UTC - in response to Message 50389.  
Last modified: 11 Jun 2024, 15:55:10 UTC

@Erich56: This is a powheg-box valid task of yours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759432

Why is your VM state constantly changing from paused to running?
I suppose this constant change between paused and running may have had to do with the fact that 99% of the RAM was used up, for whatever reason. When I noticed this unusual RAM usage, I killed one of the two running tasks, hoping that this would free up RAM, which was not the case though. So I let the other task finish (obviously the one you linked) and rebooted the host. Then I started 2 tasks again (this morning); one of them finished after about 10-1/2 hours, the other one is still running with about 76000 events processed so far.
The RAM problem has not shown up again so far; hence I guess that there were other, unknown reasons for it.

Edit:
I just noticed that the task which started this morning and finished after 10-1/2 hours also shows this on/off behaviour in the stderr:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411772443
However, there was definitely no RAM problem involved. So I have not the slightest idea what's happening :-(
I'll look up finished tasks from some of my other hosts which are crunching Theory to see whether the same thing happens there, too.
ID: 50390
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50392 - Posted: 12 Jun 2024, 5:21:31 UTC - in response to Message 50388.  

Let's see how it proceeds
It does not proceed well.

After the VM was properly saved to disk, followed by a nightly shutdown, it did not restore OK this morning. Normally this is no problem at all for Theory tasks.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158
ID: 50392
maeax
Joined: 2 May 07
Posts: 2198
Credit: 173,403,419
RAC: 44,056
Message 50393 - Posted: 12 Jun 2024, 7:15:01 UTC - in response to Message 50392.  

Crystal,
Today is Microsoft Patch Day.
All of my Win11 Pro machines need a reboot.
Is this the reason for your error on Windows?
ID: 50393
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50394 - Posted: 12 Jun 2024, 7:55:09 UTC - in response to Message 50393.  
Last modified: 12 Jun 2024, 8:01:48 UTC

Crystal,
Today is Microsoft Patch Day.
All of my Win11 Pro machines need a reboot.
Is this the reason for your error on Windows?

No, I had some Windows updates and reboots the day before, and my shutdown and startup were fully under my own control.

The error was:
VBoxManage.exe: error: ahci#0: The target VM is missing a device on port 0. Please make sure the source and target VMs have compatible storage configurations [ver=9 pass=final] (VERR_SSM_LOAD_CONFIG_MISMATCH)

and I had a vdi hard disk (differencing image) with a yellow triangle: {abac2f31-557a-4964-83f7-231aa5268f30}, probably causing the error.

I'm now running several tasks with the powheg-box generator on another faster machine.
ID: 50394
Erich56
Joined: 18 Dec 15
Posts: 1757
Credit: 115,865,700
RAC: 84,714
Message 50395 - Posted: 12 Jun 2024, 8:45:19 UTC - in response to Message 50392.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158
definitely annoying if this happens after 9 hours :-(
ID: 50395
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50396 - Posted: 12 Jun 2024, 9:20:56 UTC - in response to Message 50395.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158
definitely annoying if this happens after 9 hours :-(
Annoying yes, but also interesting for a "Volunteer tester" :-)
After those 9 hours the job was still running the run-dijet process and had not started processing the 100000 events.
I now have 10 tasks running with the powheg-box generator.
As a test I suspended those 10 (one after the other) without leaving them in memory, so that all VMs were saved to disk. I only had 1 pythia8 left running.
After resuming the suspended Theory tasks, they restored properly and started running again smoothly.
ID: 50396
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 825
Message 50397 - Posted: 12 Jun 2024, 10:48:21 UTC

with my native task it was creating events in /var/lib/boinc/slots/1/cernvm/shared/tmp/tmp.83dKBZ3Z72/run-main/pwgevents.lhe
ID: 50397
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50398 - Posted: 12 Jun 2024, 10:58:14 UTC

After a 4-hour init phase using run-dijet.exe, the 'normal' processes pythia8 and rivetvm started processing the 100000 events, visible in the ALT-F2 console.
ID: 50398
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50399 - Posted: 12 Jun 2024, 11:26:55 UTC

Error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784804

This task was running in BOINC Manager, but in VirtualBox Manager it had a "Stopped" state.
ID: 50399
maeax
Joined: 2 May 07
Posts: 2198
Credit: 173,403,419
RAC: 44,056
Message 50400 - Posted: 12 Jun 2024, 12:11:52 UTC - in response to Message 50399.  

upload failure: <file_xfer_error>
<file_name>Theory_2773-2918175-17_2_r1116385319_result</file_name>
<error_code>-240 (stat() failed)</error_code>
The task finished correctly, but the upload had a problem.
I have no idea why.
ID: 50400
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50401 - Posted: 12 Jun 2024, 13:07:58 UTC - in response to Message 50400.  

upload failure: <file_xfer_error>
<file_name>Theory_2773-2918175-17_2_r1116385319_result</file_name>
<error_code>-240 (stat() failed)</error_code>
The task finished correctly, but the upload had a problem.
I have no idea why.
There is no output file created like in a valid result:

[INFO] Container 'runc' finished with status code 0.
[INFO] Preparing output.
ID: 50401
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50402 - Posted: 12 Jun 2024, 13:42:14 UTC - in response to Message 50399.  
Last modified: 13 Jun 2024, 5:45:16 UTC

Error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784804

This task was running in BOINC Manager, but in VirtualBox Manager it had a "Stopped" state.
The same happened to this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784596

Running in BOINC, but VirtualBox Manager had the Stopped state.
Found this in stderr (effect of the stopped state):

2024-06-12 13:19:38 (9808): Guest Log: 13:19:37 CEST +02:00 2024-06-12: cranky: [INFO] ===> [runRivet] Wed Jun 12 11:19:36 UTC 2024 [boinc pp jets 7000 40,-,610 - powheg-box r3744 pthard2 100000 34]
2024-06-12 13:43:32 (9808): Preference change detected
2024-06-12 13:43:32 (9808): Setting CPU throttle for VM. (100%)
2024-06-12 13:43:38 (9808): Error in CPU throttle for VM: -182
Command:
VBoxManage -q controlvm "boinc_283a394103a62d2c" cpuexecutioncap 100
Output:
VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running.

2024-06-12 13:43:38 (9808): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds))
2024-06-12 13:46:20 (9808): Preference change detected
2024-06-12 13:46:20 (9808): Setting CPU throttle for VM. (100%)
2024-06-12 13:46:27 (9808): Error in CPU throttle for VM: -182
Command:
VBoxManage -q controlvm "boinc_283a394103a62d2c" cpuexecutioncap 100
Output:
VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running.

2024-06-12 13:46:27 (9808): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds))
2024-06-12 13:46:30 (9808): Stopping VM.
2024-06-12 13:46:30 (9808): Error in stop VM for VM: -108
Command:
VBoxManage -q controlvm "boinc_283a394103a62d2c" savestate
Output:
VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running.
2024-06-12 13:46:30 (9808): VM did not stop when requested.
2024-06-12 13:46:30 (9808): VM was NOT successfully terminated.


No error this time, because after suspending the task in BOINC (see lines above), the VM could not be saved while it was in the Stopped state.
I started the stopped VM in VBox Manager and after 6 minutes I saved it to disk. With the state saved, I resumed the task in BOINC Manager.
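
One way to spot the mismatch described here (task shown as running in BOINC while the VM is not running in VirtualBox) is to ask VirtualBox for the VM state directly. The sketch below assumes that `VBoxManage showvminfo <name> --machinereadable` prints a `VMState="..."` line; the helper names, the VM name boinc_example and the sample output are made up for illustration, and nothing here is part of vboxwrapper.

```python
import subprocess


def parse_vm_state(showvminfo_output):
    """Extract the VMState value from '--machinereadable' output, or None."""
    for line in showvminfo_output.splitlines():
        key, sep, value = line.partition("=")
        # Match the key exactly so that e.g. VMStateChangeTime is skipped.
        if sep and key.strip() == "VMState":
            return value.strip().strip('"')
    return None


def vm_state(vm_name):
    """Query VirtualBox for one VM's state (needs VBoxManage on the PATH)."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


# Offline demonstration with hypothetical sample output, so no VM is needed.
sample = 'name="boinc_example"\nVMStateChangeTime="..."\nVMState="poweroff"\n'
state = parse_vm_state(sample)
print(state)
```

On a real host one would call vm_state() with the VM name from the stderr log and compare the result with what BOINC Manager reports for the task.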

2024-06-12 13:53:13 (1128): vboxwrapper version 26207
2024-06-12 13:53:13 (1128): BOINC client version: 8.0.2
ID: 50402
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1384
Credit: 9,170,298
RAC: 4,266
Message 50404 - Posted: 14 Jun 2024, 5:43:32 UTC - in response to Message 50392.  
Last modified: 15 Jun 2024, 5:59:42 UTC

Let's see how it proceeds
It does not proceed well.

After the VM was properly saved to disk, followed by a nightly shutdown, it did not restore OK this morning. Normally this is no problem at all for Theory tasks.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158
I tried it again, and again it did not proceed from the restored VM:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411814325

Conclusion: These powheg-box generator tasks don't like long (network?) interruptions :-(

I'll try to confirm this by running three Theory tasks: 1 powheg-box and 2 Pythia8. All three are now suspended (saved to disk) and I'll wait a few hours before resuming them.

Edit: Resumed and all three happily processing. The 'powheg' still in the run-dijet process phase:
LHC@home 14 Jun 08:31:28 task Theory_2773-2924193-40_2 suspended by user
LHC@home 14 Jun 11:50:29 task Theory_2773-2924193-40_2 resumed by user
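
For reference, the length of that suspension can be worked out from the two event-log lines. This is only a small sketch: the function name is hypothetical, the line layout (project, day, month, time, message) is assumed from the two lines shown, the year is not in the log and is supplied by hand, and the month abbreviation is parsed with the default English locale.

```python
from datetime import datetime


def parse_event_time(line, year=2024):
    """Parse the 'DD Mon HH:MM:SS' stamp from an event-log style line."""
    # Assumed layout: '<project> <day> <month> <time> <message...>'
    parts = line.split()
    stamp = f"{parts[1]} {parts[2]} {year} {parts[3]}"
    return datetime.strptime(stamp, "%d %b %Y %H:%M:%S")


suspended = "LHC@home 14 Jun 08:31:28 task Theory_2773-2924193-40_2 suspended by user"
resumed = "LHC@home 14 Jun 11:50:29 task Theory_2773-2924193-40_2 resumed by user"

gap = parse_event_time(resumed) - parse_event_time(suspended)
print(gap)  # 3:19:01
```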

Edit2: After the powheg task survived the 3-hour suspension, I did not dare to suspend it overnight, so I let it run.
This morning the task finished successfully. https://lhcathome.cern.ch/lhcathome/result.php?resultid=411824583
ID: 50404
Harri Liljeroos
Joined: 28 Sep 04
Posts: 711
Credit: 47,666,664
RAC: 35,260
Message 50472 - Posted: 9 Jul 2024, 9:18:11 UTC

I've got a powheg-box task that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far is over 25 hours.
ID: 50472
Harri Liljeroos
Joined: 28 Sep 04
Posts: 711
Credit: 47,666,664
RAC: 35,260
Message 50474 - Posted: 10 Jul 2024, 9:01:47 UTC - in response to Message 50472.  

I've got a powheg-box task that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far is over 25 hours.

This task is still running; after 49 hours it has processed over 570000 events. It did survive the Windows Patch Tuesday reboot. If it isn't finished at 1000000 events, I will abort it.
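
As a rough back-of-the-envelope check, the two readings above give an average event rate and an estimate of how long the task would need to reach the 1000000-event abort threshold. This is only a sketch: both readings were reported as "over" the stated values, so the inputs are approximate, and a constant rate is an assumption.

```python
def events_per_hour(events_a, hours_a, events_b, hours_b):
    """Average event rate between two (events, runtime in hours) readings."""
    return (events_b - events_a) / (hours_b - hours_a)


# Readings quoted in the posts: 273000 events at 25 h, 570000 events at 49 h.
rate = events_per_hour(273_000, 25, 570_000, 49)

# Hours still needed to reach the 1000000-event threshold at that rate.
hours_left = (1_000_000 - 570_000) / rate

print(f"{rate:.0f} events/h, about {hours_left:.1f} h more to reach 1000000")
```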
ID: 50474
Harri Liljeroos
Joined: 28 Sep 04
Posts: 711
Credit: 47,666,664
RAC: 35,260
Message 50476 - Posted: 11 Jul 2024, 11:54:24 UTC - in response to Message 50474.  
Last modified: 11 Jul 2024, 11:54:41 UTC

I've got a powheg-box task that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far is over 25 hours.

This task is still running; after 49 hours it has processed over 570000 events. It did survive the Windows Patch Tuesday reboot. If it isn't finished at 1000000 events, I will abort it.

Task finally finished after 75 hours. I don't know how many events it actually did, it was over 700000 when I last looked. https://lhcathome.cern.ch/lhcathome/result.php?resultid=412530871
ID: 50476


©2024 CERN