Message boards : Theory Application : Theory's endless looping

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 38541 - Posted: 6 Apr 2019, 6:06:55 UTC - in response to Message 38527.  
Last modified: 6 Apr 2019, 6:07:37 UTC

Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default
ID: 38541
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38542 - Posted: 6 Apr 2019, 9:31:58 UTC - in response to Message 38541.  

Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default

Cool! If you can extend the 18-hour limit manually, then I can do it in my watchdog script too. I wish I had thought of this solution for Theory VBox.
ID: 38542
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,555,122
RAC: 131,295
Message 38543 - Posted: 6 Apr 2019, 11:44:22 UTC

Nearly 10 years left and increasing.
This should run on next generation hardware.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220536113
===> [runRivet] Wed Apr  3 22:53:07 UTC 2019 [boinc ee zhad 200 - - sherpa 1.4.3 default 2000 38]
.
.
.
1.23182e+16 pb +- ( 6.16585e+15 pb = 50.0547 % ) 963710000 ( 963744472 -> 99.9 % )
integration time:  ( 1d 11h 46m 51s elapsed / 3611d 15h 55m 24s left )   
1.23181e+16 pb +- ( 6.16578e+15 pb = 50.0547 % ) 963720000 ( 963754472 -> 99.9 % )
integration time:  ( 1d 11h 46m 52s elapsed / 3611d 16h 58m 24s left )
ID: 38543
Erich56

Joined: 18 Dec 15
Posts: 1689
Credit: 103,969,718
RAC: 122,350
Message 38544 - Posted: 6 Apr 2019, 12:35:07 UTC - in response to Message 38543.  

This should run on next generation hardware.
:-) :-) :-)
ID: 38544
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 38581 - Posted: 18 Apr 2019, 19:54:03 UTC

Endless running:
===> [runRivet] Thu Apr 18 20:20:24 CEST 2019 [boinc ee zhad 22 - - sherpa 1.2.3 default 1000 44]
ID: 38581
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 38605 - Posted: 22 Apr 2019, 11:29:10 UTC - in response to Message 38581.  

Native theory task 221036445 has so far clocked up over 176 hours
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
27715 boinc     39  19  485236  45968    720 R  97.3  0.6  10566:03 Sherpa
which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work.

I note the process seems to have been started via
   PID TTY      STAT   TIME COMMAND
27436 ?        SN     6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
27622 ?        SN     0:00 /bin/bash ./rungen.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc
27623 ?        SN     0:00 /shared/rivetvm/rivetvm.exe -a ATLAS_2010_S8919674 -i /shared/tmp/tmp.lpORph0vRx/generator.hepmc -o /shared/tmp/tmp.lpORph0vRx/flat -H /shared/tmp/tmp.lpORph0vRx/generator.yoda -d /shared/tmp/tmp.lpORph0vRx/dump
27624 ?        SN     3:12 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
27715 ?        RN   10567:14 /cvmfs/sft.cern.ch/lcg/releases/LCG_94/MCGenerators/sherpa/2.2.5/x86_64-slc6-gcc62-opt/bin/Sherpa -f /shared/tmp/tmp.lpORph0vRx/generator.params

I've previously run Sherpa pp tasks successfully[*]. Here I'm puzzled by those references to /shared - that mount point doesn't exist on the system
~ > ls /shared
ls: cannot access /shared: No such file or directory
even though PID 27623 was run from it... do the tasks do some internal mounting of filesystems which is timing out or breaking down?

* e.g. 221184137, 221183430, 221178163

Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in
221182407 to indicate whether a Sherpa caused the problem.
ID: 38605
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 38606 - Posted: 22 Apr 2019, 12:14:52 UTC - in response to Message 38605.  
Last modified: 22 Apr 2019, 12:35:16 UTC

27436 ?        SN     6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49 success 0 failures 6 lost 43
ID: 38606
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,555,122
RAC: 131,295
Message 38607 - Posted: 22 Apr 2019, 13:20:30 UTC - in response to Message 38605.  

Native theory task 221036445 has so far clocked up over 176 hours

This task did not finish before its due date.
Even if it succeeds, it is already marked as lost by the project server.
Hence it makes no sense to let it run any longer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=110458183



  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
27715 boinc     39  19  485236  45968    720 R  97.3  0.6  10566:03 Sherpa
which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work.

This shows that Sherpa is still running at full speed.
To check whether it's doing useful work, have a look at the logfile:
<BOINC's basic directory>/slots/X/cernvm/shared/runRivet.log
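For example, a quick check could look like this (just a sketch; it assumes the default data directory /var/lib/boinc-client, the right slot number, and log lines of the form quoted elsewhere in this thread):
LOG=/var/lib/boinc-client/slots/X/cernvm/shared/runRivet.log
# show the latest lines; if the event counter or the ETA estimate is still
# changing between checks, Sherpa is still making progress
tail -n 20 "$LOG"
grep -c 'Event ' "$LOG"                        # events processed so far (0 = still integrating)
grep 'integration time' "$LOG" | tail -n 2     # most recent ETA estimates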




I note the process seems to have been started via
   PID TTY      STAT   TIME COMMAND
27436 ?        SN     6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
27622 ?        SN     0:00 /bin/bash ./rungen.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc
27623 ?        SN     0:00 /shared/rivetvm/rivetvm.exe -a ATLAS_2010_S8919674 -i /shared/tmp/tmp.lpORph0vRx/generator.hepmc -o /shared/tmp/tmp.lpORph0vRx/flat -H /shared/tmp/tmp.lpORph0vRx/generator.yoda -d /shared/tmp/tmp.lpORph0vRx/dump
27624 ?        SN     3:12 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
27715 ?        RN   10567:14 /cvmfs/sft.cern.ch/lcg/releases/LCG_94/MCGenerators/sherpa/2.2.5/x86_64-slc6-gcc62-opt/bin/Sherpa -f /shared/tmp/tmp.lpORph0vRx/generator.params

I've previously run Sherpa pp tasks successfully[*]. Here I'm puzzled by those references to /shared - that mount point doesn't exist on the system
~ > ls /shared
ls: cannot access /shared: No such file or directory
even though PID 27623 was run from it... do the tasks do some internal mounting of filesystems which is timing out or breaking down?

* e.g. 221184137, 221183430, 221178163

The native app runs inside a runc container.
Runc maps the paths seen by the apps running inside this container.
Hence "/shared" resolves to "<BOINC's basic directory>/slots/X/cernvm/shared" when viewed from outside the container.
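As an illustration, the container's /shared can also be reached from the host through the Sherpa process's mount namespace (a sketch; 27715 is the Sherpa PID from the ps listing above, reading another user's /proc/<pid>/root normally needs root, and the slot path assumes the default data directory):
sudo ls /proc/27715/root/shared                  # /shared exactly as the contained process sees it
ls /var/lib/boinc-client/slots/X/cernvm/shared   # the host-side directory it maps to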




Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in
221182407 to indicate whether a Sherpa caused the problem.

Right.
This version of cranky writes to stderr.txt too late.
I have already sent CERN a suggestion for how this could be solved.
Let's see if it makes it into a future cranky version.
ID: 38607
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 38612 - Posted: 23 Apr 2019, 11:15:45 UTC - in response to Message 38607.  
Last modified: 23 Apr 2019, 11:38:09 UTC

Hi,

The native app runs inside a runc container.
Runc maps the paths seen by the apps running inside this container.
Hence "/shared" resolves to "<BOINC's basic directory>/slots/X/cernvm/shared" when viewed from outside the container.

I thought it did something like that, but naively I was expecting that ps would unwrap that when reporting the command; time to man ps again.

Looking in runRivet.log

==> [runRivet] Sun Apr 14 23:12:47 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43]
...
===> [rungen] Sun Apr 14 23:12:50 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc]
...
Starting the calculation at 23:13:36. Lean back and enjoy ... .


but there is no sign in any of the timestamps of anything happening after 23:12, and runRivet.log reports 0 events.
ID: 38612
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 38613 - Posted: 23 Apr 2019, 11:28:07 UTC - in response to Message 38607.  

Native theory task 221036445 has so far clocked up over 176 hours
This task did not finish before its due date.
Even if it succeeds, it is already marked as lost by the project server.
Hence it makes no sense to let it run any longer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=110458183

As of writing, the task status under WU 110458183 is merely "Timed out - no response" as the client hasn't bothered to update the server on the task's (lack of) progress. AFAIK the client can still upload results and report task completion as long as the WU database entry exists (hasn't been cleaned out) - I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

Not that I'm expecting this task to complete - expect the status to change when I restart BOINC after lunch!
ID: 38613
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,555,122
RAC: 131,295
Message 38614 - Posted: 23 Apr 2019, 12:03:37 UTC - in response to Message 38613.  

I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

A grace period can be set at the project server to accept overdue results, but I don't know if it is active for Theory native.


... when I restart BOINC after lunch ...

Did you restart the BOINC client while a Theory native task was running?
Hopefully not, as Theory native (and also ATLAS native) doesn't use checkpointing.
Hence the task would have started from scratch at every client restart.
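A rough way to see this from the client side is to watch the checkpoint time reported by boinccmd (a sketch; the field labels are the ones printed by recent clients and may differ on yours):
# if "checkpoint CPU time" never advances while "current CPU time" grows,
# the task is not writing checkpoints and a client restart throws its work away
boinccmd --get_tasks | grep -E 'name: |checkpoint CPU time|current CPU time'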
ID: 38614
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 38615 - Posted: 23 Apr 2019, 13:00:51 UTC - in response to Message 38614.  

... when I restart BOINC after lunch ...

Did you restart the BOINC client while a Theory native task was running?
Hopefully not, as Theory native (and also ATLAS native) doesn't use checkpointing.
Hence the task would have started from scratch at every client restart.

There was only this one Theory native left... what I forgot to do was to leave the re-started invocation running long enough to see if it was at least doing something useful this time around, before aborting it. :(
ID: 38615
Henry Nebrensky

Send message
Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 38617 - Posted: 23 Apr 2019, 17:03:11 UTC - in response to Message 38606.  

27436 ?        SN     6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49 success 0 failures 6 lost 43
That's 49 submissions, via BOINC?
ID: 38617
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 38618 - Posted: 23 Apr 2019, 19:08:59 UTC - in response to Message 38617.  

27436 ?        SN     6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49 success 0 failures 6 lost 43
That's 49 submissions, via BOINC?
I'm not sure, but I think it's not only via BOINC.
Over the last 6 weeks BOINC has done about 13,730 jobs a day, so I think the attempts also come from other computing sources.
ID: 38618
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38641 - Posted: 26 Apr 2019, 6:19:16 UTC

===> [runRivet] Thu Apr 25 22:33:11 MDT 2019 [boinc ee zhad 22 - - sherpa 1.3.0 default 1000 48]

An hour ago my watchdog script flagged this one for graceful shutdown because the integration time left exceeded 1 day. The script hasn't actually shut it down because it's running in "flag but don't shutdown" mode. Now, an hour later, the integration time left has increased to 90 days. The log filesize is 360 KB and it has processed 0 events.

The MC Production site has this report: ee zhad 22 - - sherpa 1.3.0 default events 0, attempts 26, success 0, failure 1, lost 25

Does anybody see any hope for this one? I don't, but I'm going to give it another 20 hours (the task duration is set to 10 days) to see how large the filesize and integration time become. I'm wondering whether the integration time will peak and then decrease to 0, at which point it might start processing events. ATM it looks like it's heading for a 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED shutdown.
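For reference, the core of that kind of check is just pulling the day count out of the latest ETA line (a stripped-down sketch rather than the real watchdog; it assumes the default data directory, the right slot number, and ETA lines of the form "integration time: ( ... elapsed / NNNNd ... left )" as quoted earlier in this thread):
LOG=/var/lib/boinc-client/slots/X/cernvm/shared/runRivet.log
# grab the day count from the most recent "... left" estimate (empty if less than a day is left)
DAYS_LEFT=$(grep 'integration time' "$LOG" | tail -n 1 | sed -n 's/.*\/ *\([0-9][0-9]*\)d .*left.*/\1/p')
if [ -n "$DAYS_LEFT" ] && [ "$DAYS_LEFT" -gt 1 ]; then
    echo "flagged: integration ETA is ${DAYS_LEFT} days - candidate for graceful shutdown"
fi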
ID: 38641
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38642 - Posted: 26 Apr 2019, 13:35:57 UTC - in response to Message 38641.  

Integration time left is 750 days and the log filesize is 2.2 MB, so I let the script gracefully terminate it.
ID: 38642
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,555,122
RAC: 131,295
Message 38643 - Posted: 26 Apr 2019, 14:12:15 UTC

The ETA may not tell the truth.
Hence there's no real reason to shut down those tasks.

If a babysitter service is running alongside BOINC, a task should only be shut down shortly before it hits one of the hard limits, which are:
1. EXIT_DISK_LIMIT_EXCEEDED
2. BOINC due date reached

I'm not yet sure whether there is a #3, "Condor runtime limit".
If so, it would replace (2).
ID: 38643
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38649 - Posted: 27 Apr 2019, 8:41:37 UTC - in response to Message 38643.  

The ETA may not tell the truth.
Hence there's no real reason to shut down those tasks.

If a babysitter service is running alongside BOINC, a task should only be shut down shortly before it hits one of the hard limits, which are:
1. EXIT_DISK_LIMIT_EXCEEDED
2. BOINC due date reached

I'm not yet sure whether there is a #3, "Condor runtime limit".
If so, it would replace (2).


The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.
The disk limit is what I'm not sure about. In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound> but VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/4/vm_image.vdi" says "Capacity: 20480 MBytes"... 8 GB versus 20 GB. Which figure should I use?
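For reference, pulling both values out of init_data.xml only needs a couple of sed lines (a sketch of the idea rather than the actual script; the tag names are the ones quoted above, <computation_deadline> is assumed to hold Unix epoch seconds, and the date conversion assumes GNU date):
SLOT=/var/lib/boinc-client/slots/4
DEADLINE=$(sed -n 's/.*<computation_deadline>\([0-9.]*\)<.*/\1/p' "$SLOT/init_data.xml")
DISK_BOUND=$(sed -n 's/.*<rsc_disk_bound>\([0-9.]*\)<.*/\1/p' "$SLOT/init_data.xml")
echo "deadline:   $(date -d @${DEADLINE%.*})"   # strip the decimals, convert epoch -> local time
echo "disk bound: ${DISK_BOUND%.*} bytes"       # limit before EXIT_DISK_LIMIT_EXCEEDED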
ID: 38649
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,555,122
RAC: 131,295
Message 38651 - Posted: 27 Apr 2019, 10:23:20 UTC - in response to Message 38649.  

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


VBoxManage returns the maximum size the disk can grow to, from VBox's perspective.
The real size is allocated dynamically and is therefore much smaller.


The slots folder usually grows up to 3-4.5 GB.
Hence a babysitter script may set a limit between 5-7 GB.
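In script terms the check could be as simple as comparing the slot directory against such a threshold (a sketch; 6 GB is just a value picked from the 5-7 GB window, and the path assumes the default data directory):
SLOT=/var/lib/boinc-client/slots/4
LIMIT_KB=$((6 * 1024 * 1024))           # ~6 GB, inside the suggested window
USED_KB=$(du -sk "$SLOT" | cut -f1)     # current size of this slot directory in KB
if [ "$USED_KB" -gt "$LIMIT_KB" ]; then
    echo "slot $SLOT is at ${USED_KB} KB - close to the disk bound, time for a graceful shutdown"
fi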
ID: 38651
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38656 - Posted: 27 Apr 2019, 15:18:02 UTC - in response to Message 38651.  

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders in the folder named "slots", not the size of the folder named "slots".
ID: 38656