Questions and Answers : Windows : Windows Theory Simulation v300.30 deadline miss
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 50596 - Posted: 31 Aug 2024, 10:06:34 UTC
Last modified: 31 Aug 2024, 10:47:44 UTC

Windows 10, i7-4790K (2014), 32 GB RAM, M.2 2TB SSD,
Gigabyte nVidia RTX 2060 mini OC 6GB, HP Elitedesk 800 G1 TWR.
Using Squid http proxy.

I'm having problems with
Theory Simulation v300.30 (vbox64_theory) windows_x86_64
tasks - they often overrun.

When the task is downloaded, the BOINC client reports a "Remaining" time of about 5 hours. When the task starts running, the "Remaining" time is adjusted to show 9+ days.
Sometimes the task completes before this time. Often the tasks carry on running past the deadline.

It's happening right now, for example:

The local BOINC client shows -
Project         Status      Elapsed         Remaining (estimated)       Deadline        Application                                     Name

LHC@home        Running     8d 00:07:18     2d 00:01:30                 29/08/2024      Theory Simulation v300.30 (vbox64_theory)       Theory_2743-2802274-370_0

It's still running.

The LHC@home current tasks webpage -
Task        Work unit   Computer    Sent            Time reported       Status                      Run time    CPU time    Credit      Application
                                                    or deadline                                     (sec)       (sec)

413632263   224936523   10730901    19 Aug 2024     30 Aug 2024         Timed out - no response     0.00        0.00         ---       Theory Simulation v300.30 (vbox64_theory) windows_x86_64

Clicking on the "Task" -
               Name   Theory_2743-2802274-370_0
           Workunit   224936523
            Created   19 Aug 2024, 9:22:34 UTC
               Sent   19 Aug 2024, 16:08:06 UTC
    Report deadline   30 Aug 2024, 16:08:06 UTC
           Received   ---
       Server state   Over
            Outcome   No reply
       Client state   New
        Exit status   0 (0x00000000)
        Computer ID   10730901
           Run time   0 sec
           CPU time   0 sec
     Validate state   Initial
             Credit   0.00
  Device peak FLOPS   5.05 GFLOPS
Application version   Theory Simulation v300.30 (vbox64_theory) windows_x86_64
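
A quick way to check the dates above is to compute the reporting window and how far past the deadline the task was when this message was posted. This is only a sketch of the arithmetic: the timestamps are copied from the task page and the post header, and the variable names are made up for illustration. Note that `%b` expects English month abbreviations, so it assumes an English or C locale.

```python
from datetime import datetime, timezone

# Timestamp layout used on the task page, e.g. "19 Aug 2024, 16:08:06"
FMT = "%d %b %Y, %H:%M:%S"


def parse_utc(text):
    """Parse a task-page timestamp and mark it as UTC."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


sent = parse_utc("19 Aug 2024, 16:08:06")      # "Sent" on the task page
deadline = parse_utc("30 Aug 2024, 16:08:06")  # "Report deadline"
posted = parse_utc("31 Aug 2024, 10:06:34")    # time this message was posted

window = deadline - sent     # time the server allowed for a reply
overdue = posted - deadline  # how late the task already was when posting

print("Reporting window:", window)
print("Overdue at time of posting:", overdue)
```

With these values the window is 11 days and the task was about 18 hours overdue at the time of posting.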

Stderr output

ID: 50596
Harri Liljeroos
Joined: 28 Sep 04
Posts: 739
Credit: 50,636,662
RAC: 32,671
Message 50597 - Posted: 31 Aug 2024, 13:08:36 UTC

This is normal behavior for VirtualBox tasks. VirtualBox does not report the actual progress of the task back to BOINC Manager, so BOINC shows a simulated progress for the task instead. The initial value (in your case 5 hours) is some kind of average from previously finished Theory tasks. Once the actual runtime exceeds this value, BOINC bases its estimate on the cutoff time the server has given to the task (10 days for Theory tasks = 864000 s). The only place where you can monitor and estimate the actual task progress is inside a VirtualBox terminal. For more, see Yeti's checklist: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&sort_style=6&start=0 It was made primarily for ATLAS tasks but is mostly valid for Theory too.
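
As a rough sanity check of the cutoff explanation, the "Elapsed" and "Remaining" values reported earlier in this thread (8d 00:07:18 and 2d 00:01:30) can be added up and compared with 864000 s. The sketch below only does that arithmetic; it is not BOINC's actual estimation code, and `parse_duration` is a hypothetical helper written for this example.

```python
import re

CUTOFF_SECONDS = 10 * 86400  # the 10-day cutoff mentioned above (864000 s)


def parse_duration(text):
    """Convert a BOINC Manager style duration such as '8d 00:07:18' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


elapsed = parse_duration("8d 00:07:18")    # "Elapsed" column
remaining = parse_duration("2d 00:01:30")  # "Remaining (estimated)" column
total = elapsed + remaining

print("elapsed + remaining =", total, "s")
print("difference from cutoff =", total - CUTOFF_SECONDS, "s")
```

For this task the sum is within a few minutes of the cutoff, which fits the idea that the "Remaining" figure is roughly the cutoff minus the elapsed time rather than a measure of real progress.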
ID: 50597
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 50598 - Posted: 31 Aug 2024, 17:23:10 UTC - in response to Message 50597.  
Last modified: 31 Aug 2024, 17:45:58 UTC

Thank you, Harri - I'm happy to browse through that checklist.

Nonetheless, even though Windows can't tell what's going on inside a VM, it seems that BOINC can't tell when this task has exceeded its deadline...

The task -
Project         Status      Elapsed         Remaining (estimated)       Deadline        Application                                     Name
LHC@home        Running     8d 08:15:38     1d 15:54:31                 29/08/2024      Theory Simulation v300.30 (vbox64_theory)       Theory_2743-2802274-370_0
is still running - way past its deadline.

Should I "Abort" jobs that go past their deadline?


Maybe this sheds some light...

Pressing - ALT+F1 in the VM Console of the above task,
showed some cvmfs and cranky stuff and -
job: htmld=/shared/html/ job
job: unpack exitcode=0
INFO: activated the work-around for ld:
lrwxrwxrwx 1 0 0 15 Aug 19 21:02 /tmp/tmp.SAvOWJwc6Y/ld -> /usr/bin/ld.bfd
22:02:57 BST +01:00 2024-08-19: cranky: [INFO] ===> [runRivet] Mon Aug 19 21:02:56 UTC 2024 [boinc p p jets 8000 350 - pythia8 8.244 CP1-CR1 100000 370]

And, pressing ALT+F2
The last bit -
 Pythia::next(): 82000 events have been generated 
82000 events processed
dumping histograms...
So there's lots of work being done. (Rarely, a Theory task runs without registering any CPU activity in Windows Task Manager.)

Using my web browser to navigate to
http://localhost:52218/

Test4Theory simulations
Waiting for some nice figures to show you.
Please, reload again in a few minutes
Meanwhile you can check the logs
(http://localhost:52218/logs/)


The logs (after the [runRivet] and PYTHIA initialisation sections) -

--------  End PYTHIA Event Listing  -----------------------------------------------------------------------------------------------
Rivet.AnalysisHandler: INFO  Only using nominal weight. Variation weights will be ignored.
0 events processed
 PYTHIA Warning in StringFragmentation::fragmentToJunction: bad convergence junction rest frame  
 PYTHIA Error in StringFragmentation::fragment: stuck in joining  
 PYTHIA Error in Pythia::next: hadronLevel failed; try again  
 PYTHIA Warning in JunctionSplitting::SplitJunPairs: parallel junction state not allowed.  
 PYTHIA Warning in JunctionSplitting::CheckColours: Not possible to split junctions; making new colours  
 PYTHIA Warning in JunctionSplitting::CheckColours: Made a gluon colour singlet; redoing colours  
 PYTHIA Warning in SimpleSpaceShower::pT2nextQCD: weight above unity  
100 events processed
dumping histograms...
 PYTHIA Error in MiniStringFragmentation::fragment: no 1- or 2-body state found above mass threshold  
 PYTHIA Error in StringFragmentation::fragmentToJunction: caught in junction flavour loop  
200 events processed
dumping histograms...
300 events processed
dumping histograms...
 PYTHIA Warning in MiniStringFragmentation::ministring2two: random axis needed to break tie  
400 events processed
dumping histograms...

...
900 events processed
dumping histograms...
 PYTHIA Warning in SimpleSpaceShower::pT2nextQCD: small daughter PDF

...
1600 events processed
 PYTHIA Warning in StringFragmentation::finalRegion: random axis needed to break tie

...
2100 events processed
 PYTHIA Error in SimpleSpaceShower::pT2nearThreshold: stuck in loop

...
9800 events processed
 PYTHIA Warning in MultipartonInteractions::pTnext: weight above unity

...
12500 events processed
 PYTHIA Warning in Pythia::check: energy-momentum not quite conserved

...
13200 events processed
 PYTHIA Warning in TauDecays::decay: unknown correlated tau production, assuming from unpolarized photon

...
17800 events processed
 PYTHIA Error in BeamRemnants::setKinematics: kinematics construction failed

...
...
64700 events processed
 PYTHIA Warning in Pythia::check: not quite matched particle energy/momentum/mass

...
 Pythia::next(): 82000 events have been generated 
82000 events processed
dumping histograms...
82100 events processed
82200 events processed
82300 events processed
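
Since the log is the only place where real progress shows up, a rough estimate can be made from the "events processed" lines. The sketch below is only illustrative: it assumes that the `100000` in the `[runRivet]` line is the requested number of events (not confirmed in this thread), that events are generated at a steady rate, and that the BOINC "Elapsed" time of 8d 08:15:38 matches the moment the log was read. The names `LOG_TAIL`, `REQUESTED_EVENTS` and `latest_event_count` are invented for this example.

```python
import re

# Last lines of the log shown above
LOG_TAIL = """\
 Pythia::next(): 82000 events have been generated
82000 events processed
dumping histograms...
82100 events processed
82200 events processed
82300 events processed
"""

REQUESTED_EVENTS = 100_000  # ASSUMPTION: taken from the [runRivet] argument list
ELAPSED_SECONDS = ((8 * 24 + 8) * 60 + 15) * 60 + 38  # 8d 08:15:38 from BOINC


def latest_event_count(log_text):
    """Return the highest 'N events processed' value found in the log text."""
    counts = [int(n) for n in re.findall(r"^(\d+) events processed",
                                         log_text, flags=re.MULTILINE)]
    if not counts:
        raise ValueError("no 'events processed' lines found")
    return max(counts)


done = latest_event_count(LOG_TAIL)
fraction = done / REQUESTED_EVENTS
rate = done / ELAPSED_SECONDS                      # events per second so far
remaining_seconds = (REQUESTED_EVENTS - done) / rate

print(f"progress: {fraction:.1%}")
print(f"estimated remaining: {remaining_seconds / 86400:.2f} days")
```

Under those assumptions the task would be a little over 80% done with somewhat under two days still to go, but the real rate may vary from event to event.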

ID: 50598
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 50599 - Posted: 1 Sep 2024, 2:26:42 UTC - in response to Message 50598.  
Last modified: 1 Sep 2024, 2:26:55 UTC

I'm speculating that the above warnings and errors are simulation telemetry and not program errors.
But the task is just going to carry on 'til the end, oblivious that it's gone past its deadline - a waste of time...
ID: 50599
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 50993 - Posted: 2 Nov 2024, 10:44:27 UTC

System: OpenSuSE Tumbleweed, Intel i7-4790K, 32GB, 2TB SSD

(The current date is 02/11/2024)

The BOINC Manager shows these running tasks:
Project     Status      Elapsed         Remaining (estimated)       Deadline        Application                                 Name

LHC@home    Running     6d 09:32:46     3d 11:55:01                 01/11/2024      Theory Simulation 300.30 (vbox64_theory)    Theory_2794-3266819-199_1
LHC@home    Running     5d 23:40:10     3d 21:55:48                 01/11/2024      Theory Simulation 300.30 (vbox64_theory)    Theory_2794-3257411-175_1

The LHC@home All tasks webpage (https://lhcathome.cern.ch/lhcathome/results.php?userid=95350)
shows this -
Task        Work unit   Computer    Sent            Time reported       Status                      Run time    CPU time    Credit      Application
                                                    or deadline                                     (sec)       (sec)

415137523   225914767   10860321    22 Oct 2024     2 Nov 2024          Timed out - no response     0.00        0.00        ---         Theory Simulation v300.30 (vbox64_theory)
                                                                                                                                        x86_64-pc-linux-gnu

415142791   226066966   10860321    22 Oct 2024    27 Oct 2024          Completed and validated     311,018.69  235,746.40  719.95      Theory Simulation v300.30 (vbox64_theory)
                                                                                                                                        x86_64-pc-linux-gnu

415137790   225932422   10860321    22 Oct 2024     2 Nov 2024          Timed out - no response     0.00        0.00        ---         Theory Simulation v300.30 (vbox64_theory)
                                                                                                                                        x86_64-pc-linux-gnu

No doubt the timed-out tasks would eventually have produced a valid result if they had been given more time.
Why not raise the server's default response deadline from 10 to 20 days?
ID: 50993
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 50995 - Posted: 2 Nov 2024, 11:31:44 UTC

This problem is not limited to Windows.
So a more pertinent post on the subject is here -

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6240
ID: 50995
Guy
Joined: 9 Feb 08
Posts: 54
Credit: 1,339,875
RAC: 3,250
Message 51129 - Posted: 24 Nov 2024, 15:18:59 UTC - in response to Message 50995.  

Also on the subject of very long running Theory tasks -
This gonna be long - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251
And (currently) -
CMS and Atlas have problems, but I'm getting a few Theory jobs that seem to be running.

Have a little patience. They'll sort it out eventually.
ID: 51129



©2025 CERN