Message boards : Theory Application : Theory's endless looping
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55

Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
> Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default

Cool! If you can extend the 18 hour limit manually, then I can do it in my watchdog script too. I wish I had thought of this solution for Theory VBox.
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
Nearly 10 years left and increasing. This should run on next generation hardware.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220536113

    ===> [runRivet] Wed Apr 3 22:53:07 UTC 2019 [boinc ee zhad 200 - - sherpa 1.4.3 default 2000 38]
    . . .
    1.23182e+16 pb +- ( 6.16585e+15 pb = 50.0547 % ) 963710000 ( 963744472 -> 99.9 % ) integration time: ( 1d 11h 46m 51s elapsed / 3611d 15h 55m 24s left )
    1.23181e+16 pb +- ( 6.16578e+15 pb = 50.0547 % ) 963720000 ( 963754472 -> 99.9 % ) integration time: ( 1d 11h 46m 52s elapsed / 3611d 16h 58m 24s left )
Joined: 18 Dec 15 · Posts: 1693 · Credit: 104,772,821 · RAC: 83,201
> This should run on next generation hardware.

:-) :-) :-)
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
Endless running:

    ===> [runRivet] Thu Apr 18 20:20:24 CEST 2019 [boinc ee zhad 22 - - sherpa 1.2.3 default 1000 44]
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
Native theory task 221036445 has so far clocked up over 176 hours:

      PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
    27715 boinc    39  19  485236  45968   720 R  97.3  0.6  10566:03 Sherpa

which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work. I note the process seems to have been started via

      PID TTY      STAT     TIME COMMAND
    27436 ?        SN       6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
    27622 ?        SN       0:00 /bin/bash ./rungen.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc
    27623 ?        SN       0:00 /shared/rivetvm/rivetvm.exe -a ATLAS_2010_S8919674 -i /shared/tmp/tmp.lpORph0vRx/generator.hepmc -o /shared/tmp/tmp.lpORph0vRx/flat -H /shared/tmp/tmp.lpORph0vRx/generator.yoda -d /shared/tmp/tmp.lpORph0vRx/dump
    27624 ?        SN       3:12 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
    27715 ?        RN   10567:14 /cvmfs/sft.cern.ch/lcg/releases/LCG_94/MCGenerators/sherpa/2.2.5/x86_64-slc6-gcc62-opt/bin/Sherpa -f /shared/tmp/tmp.lpORph0vRx/generator.params

I've previously run Sherpa pp tasks successfully[*]. Here I'm puzzled by those references to /shared - that mount point doesn't exist on the system:

    ~ > ls /shared
    ls: cannot access /shared: No such file or directory

even though PID 27623 was run from it... do the tasks do some internal mounting of filesystems which is timing out or breaking down?

[*] e.g. 221184137, 221183430, 221178163

Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in 221182407 to indicate whether a Sherpa caused the problem.
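For readers who want to spot a looping Sherpa the same way, here is a minimal shell sketch, assuming a Linux host with procps-ng ps and the BOINC processes owned by user "boinc"; the 600-minute threshold is an arbitrary example, not a project limit.

    # Flag Sherpa processes that have accumulated suspicious amounts of CPU time.
    # "cputimes" prints cumulative CPU time in seconds (procps-ng format specifier).
    ps -u boinc -o pid,cputimes,comm --no-headers |
      awk '$3 == "Sherpa" && $2 > 600*60 {
          print "PID " $1 ": " $2 " s of CPU time, possibly looping"
      }'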
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
> 27436 ?        SN       6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43

pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> Native theory task 221036445 has so far clocked up over 176 hours

This task did not finish before its due date. Even if it succeeds, it is already marked as lost by the project server, hence it makes no sense to let it run any longer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=110458183

>   PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
> 27715 boinc    39  19  485236  45968   720 R  97.3  0.6  10566:03 Sherpa
> which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work.

This shows that Sherpa is still running at full speed. To check whether it's doing useful work, look at the logfile:

    <BOINC's basic directory>/slots/X/cernvm/shared/runRivet.log

> I note the process seems to have been started via

The native app runs inside a runc container. Runc maps the paths of the apps running inside the container, hence "/shared" resolves to "<BOINC's basic directory>/slots/X/cernvm/shared" from a perspective outside the container.

> Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in 221182407 to indicate whether a Sherpa caused the problem.

Right. This version of cranky logs too late to stderr.txt. I already sent a suggestion to CERN on how this could be solved; let's see if it makes it into a future cranky version.
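A quick way to apply the advice above from outside the container is sketched below, assuming the Debian/Ubuntu default data directory /var/lib/boinc-client (substitute your own BOINC basic directory).

    # Show the tail of every active Theory slot's runRivet.log.
    for log in /var/lib/boinc-client/slots/*/cernvm/shared/runRivet.log; do
        [ -f "$log" ] || continue
        echo "== $log =="
        tail -n 5 "$log"
    done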
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> The native app runs inside a runc container.

Hi, I thought it did something like that, but naively I was expecting that ps would unwrap that when reporting the command; time to man ps again.

Looking in runRivet.log:

    ===> [runRivet] Sun Apr 14 23:12:47 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43]
    ...
    ===> [rungen] Sun Apr 14 23:12:50 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc]
    ...
    Starting the calculation at 23:13:36. Lean back and enjoy ... .

but there is no sign in any of the time stamps of anything happening after 23:12, and runRivet.log reports 0 events.
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> > Native theory task 221036445 has so far clocked up over 176 hours
> This task did not finish before its due date.

As of writing, the task status under WU 110458183 is merely "Timed out - no response", as the client hasn't bothered to update the server on the task's (lack of) progress. AFAIK the client can still upload results and report task completion as long as the WU database entry exists (hasn't been cleaned out) - I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

Not that I'm expecting this task to complete - expect the status to change when I restart BOINC after lunch!
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

A grace period can be set at the project server to accept overdue results, but I don't know if it is active for Theory native.

> ... when I restart BOINC after lunch ...

Did you restart the BOINC client while a Theory native task was running? Hopefully not, as Theory native (also: ATLAS native) doesn't use checkpointing, so the task would have started from scratch at every client restart.
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> ... when I restart BOINC after lunch ...

There was only this one Theory native left... what I forgot to do was to leave the re-started invocation running long enough to see if it was at least doing something useful this time around, before aborting it. :(
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43

That's 49 submissions, via BOINC?
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
> > pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43
> That's 49 submissions, via BOINC?

I'm not sure, but I think it's not only via BOINC. Over the last 6 weeks BOINC has done about 13,730 jobs a day, so I think the attempts also come from other computing sources.
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
    ===> [runRivet] Thu Apr 25 22:33:11 MDT 2019 [boinc ee zhad 22 - - sherpa 1.3.0 default 1000 48]

An hour ago my watchdog script flagged this one for graceful shutdown due to integration time > 1 day. The script hasn't actually deleted it because it's running in "flag but don't shutdown" mode. Now, an hour later, integration time has increased to 90 days. Log filesize is 360 KB and it's processed 0 events. The MC Production site has this report:

    ee zhad 22 - - sherpa 1.3.0 default: events 0, attempts 26, success 0, failure 1, lost 25

Does anybody see any hope for this one? I don't, but I'm going to give it another 20 hours (task duration is set for 10 days) to see how large the filesize and integration time become. I'm wondering if the integration time will peak and then decrease to 0, at which point it might start processing events. ATM it looks like it's heading for a 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED shutdown.
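For anyone building a similar watchdog, a minimal sketch of the integration-time check follows, assuming the log path used earlier in the thread and the "... elapsed / ... left" line format shown in the quoted excerpts; slot 4 and the one-day threshold are arbitrary examples.

    # Read the most recent integration ETA from runRivet.log and flag the
    # task when the projected time left is a day or more.
    LOG=/var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
    left=$(grep 'integration time' "$LOG" | tail -n 1 |
           sed 's:.*/ *::; s: left.*::')
    [ -n "$left" ] || exit 0          # no integration line logged yet
    case "$left" in
        *d*) echo "ETA $left: candidate for graceful shutdown" ;;
        *)   echo "ETA $left: still within limits" ;;
    esac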
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
Integration time left is 750 days and the log filesize is 2.2 MB, so I let the script gracefully terminate it.
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
The ETA may not tell the truth, hence there's no real reason to shut down those tasks. If a babysitter service is active beside BOINC, a task should only be shut down just before it hits one of the hard limits, which are:

1. EXIT_DISK_LIMIT_EXCEEDED
2. BOINC due date reached

I'm not yet sure if there is a #3, "Condor runtime limit". If so, it would replace (2).
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> The ETA may not tell the truth.

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

The disk limit is what I'm not sure about. In .../slots/4/init_data.xml I see

    <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

but VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/4/vm_image.vdi" says "Capacity: 20480 MBytes"... 8 GB versus 20 GB. Which figure should I use?
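A sketch of how a script might read both values, assuming the BOINC client's usual one-element-per-line layout of init_data.xml; slot 4 is just an example. <computation_deadline> is a Unix timestamp, <rsc_disk_bound> is in bytes.

    # Extract the two limits a babysitter script needs from a slot's init_data.xml.
    SLOT=/var/lib/boinc-client/slots/4
    deadline=$(sed -n 's:.*<computation_deadline>\([0-9.]*\)</computation_deadline>.*:\1:p' "$SLOT/init_data.xml")
    disk_bound=$(sed -n 's:.*<rsc_disk_bound>\([0-9.]*\)</rsc_disk_bound>.*:\1:p' "$SLOT/init_data.xml")
    echo "deadline (Unix epoch): $deadline"
    echo "disk bound (bytes):    $disk_bound"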
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

> In .../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed. VBoxManage returns the maximum size the disk can grow to from VBox's perspective; the real size is allocated dynamically, hence much smaller. The slots folder usually grows up to 3-4.5 GB, hence a babysitter script may set a limit between 5 and 7 GB.
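Putting the two previous posts together, a babysitter check might look like this sketch; the 6 GB limit is an example value inside the 5-7 GB window suggested above, and du's byte count is used as an approximation of what the client measures.

    # Warn when a numbered slot folder approaches the self-imposed limit.
    LIMIT=$((6 * 1024 * 1024 * 1024))   # 6 GB, example value
    for dir in /var/lib/boinc-client/slots/[0-9]*; do
        used=$(du -sb "$dir" | cut -f1)
        [ "$used" -gt "$LIMIT" ] && echo "$dir uses $used bytes, nearing the disk limit"
    done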
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.
> Right.

OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the folder named "slots" itself.