tasks "...powheg-box..." not showing processed events in console F2
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1832 Credit: 119,670,138 RAC: 49,197 |
Since this afternoon, the downloaded tasks with "...powheg-box..." in the name (shown in the last line of console F1) do NOT show any processed events in console F2, even after several hours. During processing they sometimes eat up a lot of memory; I had two of them running on a machine with 32 GB RAM and had to kill one of them, because RAM usage was at 31.32 GB and the system was close to crashing. The task is: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759993 Can one of the experts here tell from the stderr how it differs from the tasks we have had so far? Several of my hosts are now processing this kind of task, and none of them shows processed events in console F2. |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
pp jets 13000 100 - powheg-box r3744 ptdef2: events 200000, attempts 28, success 2, failure 0, unknown 26 |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: since this afternoon, the downloaded tasks with "...powheg-box..." in the name (shown in last line of console F1) do NOT show any processed events in console F2, even after several hours.] I got this one: ===> [runRivet] Tue Jun 11 11:24:33 UTC 2024 [boinc pp jets 8000 250 - powheg-box r3744 pthard2 100000 13] Input parameters: mode=boinc beam=pp process=jets energy=8000 params=250 specific=- generator=powheg-box version=r3744 tune=pthard2 nevts=100000 seed=13 At the moment the run-dijet process is using 99% CPU. I suppose this is the init phase. Let's see how it proceeds. Differences to your setup (yours vs. mine): BOINC 7.22.2 vs. 8.0.2; VBox 7.0.10 vs. 7.0.18; VM RAM 630 MB vs. 768 MB; max runtime 10 days vs. unlimited; snapshots no vs. yes. |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
@Erich56: This is a valid powheg-box task of yours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759432 Why is your VM state constantly changing from paused to running? |
Send message Joined: 18 Dec 15 Posts: 1832 Credit: 119,670,138 RAC: 49,197 |
[Quote: @Erich56: This is a powheg-box valid task of yours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411759432] I suppose this constant switching between paused and running may have had to do with the fact that 99% of the RAM was in use, for whatever reason. When I noticed this unusual RAM usage, I killed one of the two running tasks, hoping that this would free some RAM, which was not the case. So I let the other task finish (obviously the one you linked) and rebooted the host. Then I started 2 tasks again (this morning); one of them finished after about 10-1/2 hours, and the other one is still running with about 76000 events processed so far. The RAM problem has not shown up again so far; hence I guess that there were other, unknown reasons for it. Edit: I just noticed that the task which I started this morning and which finished after 10-1/2 hours also shows this on/off behaviour in the stderr: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411772443 However, there was definitely no RAM problem involved. So I have not the slightest idea what's happening :-( I'll look up finished tasks from some of my other hosts which are crunching Theory to see whether the same thing happens there, too. |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: Let's see how it proceeds] It does not proceed well. After the VM was properly saved to disk and the host was shut down for the night, it did not restore OK this morning. For Theory tasks this is normally no problem at all. https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158 |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
Crystal, today is Microsoft Patch Day. All of my Win11 Pro machines need a reboot. Could this be the reason for your error on Windows? |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
Crystal, No, I had some Windows updates and reboots the day before, and my shutdown and startup were fully under my own control. The error was: VBoxManage.exe: error: ahci#0: The target VM is missing a device on port 0. Please make sure the source and target VMs have compatible storage configurations [ver=9 pass=final] (VERR_SSM_LOAD_CONFIG_MISMATCH) and I had a vdi hard disk (difference image) with a yellow triangle: {abac2f31-557a-4964-83f7-231aa5268f30}, which probably caused the error. I'm now running several tasks with the powheg-box generator on another, faster machine. |
Send message Joined: 18 Dec 15 Posts: 1832 Credit: 119,670,138 RAC: 49,197 |
[Quote: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158] Definitely annoying if this happens after 9 hours :-( |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411769158 definitely annoying if this happens after 9 hours :-(] Annoying yes, but also interesting for a "Volunteer tester" :-) After those 9 hours the job was still running the run-dijet process and had not started processing the 100000 events. I now have 10 tasks running with the powheg-box generator. As a test I suspended those 10 (one after the other) without leaving them in memory, so that all VMs were saved to disk. I only had 1 pythia8 left running. After I resumed the suspended Theory tasks, they restored properly and started running again smoothly. |
Send message Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 0 |
With my native task, it was creating events in /var/lib/boinc/slots/1/cernvm/shared/tmp/tmp.83dKBZ3Z72/run-main/pwgevents.lhe |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
After a 4-hour init phase in run-dijet.exe, the 'normal' processes pythia8 and rivetvm started processing the 100000 events, as can be seen on the Alt-F2 console. |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
Error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784804 This task was running in BOINC Manager, but in VirtualBox Manager it had a "Stopped" state. |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
Upload failure: <file_xfer_error> <file_name>Theory_2773-2918175-17_2_r1116385319_result</file_name> <error_code>-240 (stat() failed)</error_code> The task finished correctly, but the upload had a problem. I have no idea why. |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: upload failure: <file_xfer_error>] There is no output file created, as there would be in a valid result: [INFO] Container 'runc' finished with status code 0. [INFO] Preparing output. |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: Error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784804] The same happened to this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411784596 It was running in BOINC, but VirtualBox Manager showed the Stopped state. I found this in stderr (an effect of the stopped state): 2024-06-12 13:19:38 (9808): Guest Log: 13:19:37 CEST +02:00 2024-06-12: cranky: [INFO] ===> [runRivet] Wed Jun 12 11:19:36 UTC 2024 [boinc pp jets 7000 40,-,610 - powheg-box r3744 pthard2 100000 34] 2024-06-12 13:43:32 (9808): Preference change detected 2024-06-12 13:43:32 (9808): Setting CPU throttle for VM. (100%) 2024-06-12 13:43:38 (9808): Error in CPU throttle for VM: -182 Command: VBoxManage -q controlvm "boinc_283a394103a62d2c" cpuexecutioncap 100 Output: VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running. 2024-06-12 13:43:38 (9808): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds)) 2024-06-12 13:46:20 (9808): Preference change detected 2024-06-12 13:46:20 (9808): Setting CPU throttle for VM. (100%) 2024-06-12 13:46:27 (9808): Error in CPU throttle for VM: -182 Command: VBoxManage -q controlvm "boinc_283a394103a62d2c" cpuexecutioncap 100 Output: VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running. 2024-06-12 13:46:27 (9808): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 600 seconds)) 2024-06-12 13:46:30 (9808): Stopping VM. 2024-06-12 13:46:30 (9808): Error in stop VM for VM: -108 Command: VBoxManage -q controlvm "boinc_283a394103a62d2c" savestate Output: VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' is not currently running. 2024-06-12 13:46:30 (9808): VM did not stop when requested. 2024-06-12 13:46:30 (9808): VM was NOT successfully terminated. No error this time, because after suspending the task in BOINC (see lines above), the VM could not be saved while it was in the Stopped state. I started the stopped VM in VBox Manager and after 6 minutes I saved it to disk. Once the state was saved, I resumed the task in BOINC Manager. 2024-06-12 13:53:13 (1128): vboxwrapper version 26207 2024-06-12 13:53:13 (1128): BOINC client version: 8.0.2 |
Send message Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
[Quote: [Quote: Let's see how it proceeds] It proceeds not well.] I tried it again, and again it did not proceed from the restored VM: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411814325 Conclusion: These powheg-box generator tasks don't like long (network?) interruptions :-( I'll try to confirm this by running three Theory tasks: 1 powheg-box and 2 Pythia8. All three are now suspended (saved to disk) and I'll wait a few hours before resuming them. Edit: Resumed, and all three are happily processing. The powheg task is still in the run-dijet phase: LHC@home 14 Jun 08:31:28 task Theory_2773-2924193-40_2 suspended by user LHC@home 14 Jun 11:50:29 task Theory_2773-2924193-40_2 resumed by user Edit2: After the powheg task survived the 3-hour suspension, I did not dare to suspend it overnight, so I let it run. This morning the task finished successfully. https://lhcathome.cern.ch/lhcathome/result.php?resultid=411824583 |
Send message Joined: 28 Sep 04 Posts: 736 Credit: 49,894,044 RAC: 35,302 |
I've got a powheg-box task that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far is over 25 hours. |
Send message Joined: 28 Sep 04 Posts: 736 Credit: 49,894,044 RAC: 35,302 |
[Quote: I've got a powheg-box tasks that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far over 25 hours.] This task is still running; now, after 49 hours of runtime, it has processed over 570000 events. It did survive the Windows Patch Tuesday reboot. If it isn't finished at 1000000 events, I will abort it. |
Send message Joined: 28 Sep 04 Posts: 736 Credit: 49,894,044 RAC: 35,302 |
[Quote: I've got a powheg-box tasks that says on terminal 1 (Alt-F1) that it has 100000 events, but Alt-F2 says it has already processed 273000 events. Runtime so far over 25 hours.] The task finally finished after 75 hours. I don't know how many events it actually processed; it was over 700000 when I last looked. https://lhcathome.cern.ch/lhcathome/result.php?resultid=412530871 |
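
For anyone who wants to compare these tasks across hosts: the `[runRivet]` header quoted in the thread carries eleven positional values, and the "Input parameters" line from the same stderr names them (mode, beam, process, energy, params, specific, generator, version, tune, nevts, seed). Below is a minimal Python sketch that maps one onto the other. The function name `parse_runrivet` and the regular expression are assumptions made for illustration; they are not part of any LHC@home or BOINC tool.

```python
import re

# Field order as given by the "Input parameters" line quoted in the thread.
FIELDS = ["mode", "beam", "process", "energy", "params", "specific",
          "generator", "version", "tune", "nevts", "seed"]

# Hypothetical pattern: "[runRivet]", a date, then the bracketed job description.
HEADER_RE = re.compile(r"\[runRivet\][^\[]*\[([^\]]+)\]")


def parse_runrivet(line):
    """Return the job parameters as a dict, or None if the line does not match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    values = match.group(1).split()
    if len(values) != len(FIELDS):
        return None
    job = dict(zip(FIELDS, values))
    # Event count and seed are plain integers in the quoted headers.
    job["nevts"] = int(job["nevts"])
    job["seed"] = int(job["seed"])
    return job


if __name__ == "__main__":
    line = ("===> [runRivet] Wed Jun 12 11:19:36 UTC 2024 "
            "[boinc pp jets 7000 40,-,610 - powheg-box r3744 pthard2 100000 34]")
    job = parse_runrivet(line)
    print(job["generator"], job["energy"], job["nevts"], job["seed"])
```

With the parsed `generator` field one could, for example, filter a folder of saved stderr files for powheg-box jobs before looking at their runtimes.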
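
The "running in BOINC, but Stopped in VirtualBox Manager" situation described in the thread leaves a recognisable trace in stderr: every VBoxManage call fails with "is not currently running". A rough Python sketch for spotting that in a saved stderr text follows. It assumes each vboxwrapper entry starts with `YYYY-MM-DD HH:MM:SS (pid):`, as in the excerpt posted above; the helper names `split_entries` and `stopped_vm_reports` are hypothetical.

```python
import re

# Assumed entry prefix, based on the stderr excerpt quoted in the thread.
ENTRY_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): ")
NOT_RUNNING = "is not currently running"


def split_entries(text):
    """Split a (possibly flattened) stderr text into (timestamp, message) pairs."""
    parts = ENTRY_RE.split(text)
    # re.split with one capture group yields [prefix, ts1, msg1, ts2, msg2, ...].
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]


def stopped_vm_reports(text):
    """Return the entries in which VBoxManage reports that the VM is not running."""
    return [(ts, msg) for ts, msg in split_entries(text) if NOT_RUNNING in msg]


if __name__ == "__main__":
    sample = (
        "2024-06-12 13:43:32 (9808): Setting CPU throttle for VM. (100%) "
        "2024-06-12 13:43:38 (9808): Error in CPU throttle for VM: -182 "
        'Command: VBoxManage -q controlvm "boinc_283a394103a62d2c" cpuexecutioncap 100 '
        "Output: VBoxManage.exe: error: Machine 'boinc_283a394103a62d2c' "
        "is not currently running. "
        "2024-06-12 13:43:38 (9808): Setting checkpoint interval to 600 seconds."
    )
    for timestamp, _message in stopped_vm_reports(sample):
        print("VM reported as not running at", timestamp)
```

This only reads text that has already been written; it does not query VirtualBox itself, so it is a way to confirm the symptom after the fact rather than to prevent it.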
©2025 CERN