Message boards : Theory Application : Theory's endless looping
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55

Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
> Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default

Cool! If you can extend the 18 hour limit manually, then I can do it in my watchdog script too. I wish I had thought of this solution for Theory VBox.
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
Nearly 10 years left and increasing. This should run on next generation hardware.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220536113

    ===> [runRivet] Wed Apr 3 22:53:07 UTC 2019 [boinc ee zhad 200 - - sherpa 1.4.3 default 2000 38]
    . . .
    1.23182e+16 pb +- ( 6.16585e+15 pb = 50.0547 % ) 963710000 ( 963744472 -> 99.9 % ) integration time: ( 1d 11h 46m 51s elapsed / 3611d 15h 55m 24s left )
    1.23181e+16 pb +- ( 6.16578e+15 pb = 50.0547 % ) 963720000 ( 963754472 -> 99.9 % ) integration time: ( 1d 11h 46m 52s elapsed / 3611d 16h 58m 24s left )
Joined: 18 Dec 15 · Posts: 1693 · Credit: 104,772,821 · RAC: 83,201
> This should run on next generation hardware.

:-) :-) :-)
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
Endless running:

    ===> [runRivet] Thu Apr 18 20:20:24 CEST 2019 [boinc ee zhad 22 - - sherpa 1.2.3 default 1000 44]
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
Native theory task 221036445 has so far clocked up over 176 hours:

      PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
    27715 boinc    39  19  485236  45968   720 R  97.3  0.6  10566:03 Sherpa

which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work. I note the process seems to have been started via

      PID TTY      STAT     TIME COMMAND
    27436 ?        SN       6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
    27622 ?        SN       0:00 /bin/bash ./rungen.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc
    27623 ?        SN       0:00 /shared/rivetvm/rivetvm.exe -a ATLAS_2010_S8919674 -i /shared/tmp/tmp.lpORph0vRx/generator.hepmc -o /shared/tmp/tmp.lpORph0vRx/flat -H /shared/tmp/tmp.lpORph0vRx/generator.yoda -d /shared/tmp/tmp.lpORph0vRx/dump
    27624 ?        SN       3:12 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43
    27715 ?        RN   10567:14 /cvmfs/sft.cern.ch/lcg/releases/LCG_94/MCGenerators/sherpa/2.2.5/x86_64-slc6-gcc62-opt/bin/Sherpa -f /shared/tmp/tmp.lpORph0vRx/generator.params

I've previously run Sherpa pp tasks successfully[*]. Here I'm puzzled by those references to /shared - that mount point doesn't exist on the system:

    ~ > ls /shared
    ls: cannot access /shared: No such file or directory

even though PID 27623 was run from it... do the tasks do some internal mounting of filesystems which is timing out or breaking down?

[*] e.g. 221184137, 221183430, 221178163

Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in 221182407 to indicate whether a Sherpa caused the problem.
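For readers who want to spot a looping Sherpa the same way, here is a minimal shell sketch, assuming a Linux host with procps-ng ps and the BOINC processes owned by user "boinc"; the 600-minute threshold is an arbitrary example, not a project limit.

    # Flag Sherpa processes that have accumulated suspicious amounts of CPU time.
    # "cputimes" prints cumulative CPU time in seconds (procps-ng format specifier).
    ps -u boinc -o pid,cputimes,comm --no-headers |
      awk '$3 == "Sherpa" && $2 > 600*60 {
          print "PID " $1 ": " $2 " s of CPU time, possibly looping"
      }'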
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
> 27436 ?        SN       6:38 /bin/bash ./runRivet.sh boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43

pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> Native theory task 221036445 has so far clocked up over 176 hours

This task did not finish before its due date. Even if it succeeds, it is already marked as lost by the project server, hence it makes no sense to let it run any longer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=110458183

>   PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
> 27715 boinc    39  19  485236  45968   720 R  97.3  0.6  10566:03 Sherpa
> which I think means it's stuck in some loop; I'll probably kill it off tomorrow when I shift that host over to Atlas work.

This shows that Sherpa is still running at full speed. To check whether it's doing useful work, look at the logfile:

    <BOINC's basic directory>/slots/X/cernvm/shared/runRivet.log

> I note the process seems to have been started via

The native app runs inside a runc container. Runc maps the paths of the apps running inside the container, hence "/shared" resolves to "<BOINC's basic directory>/slots/X/cernvm/shared" from a perspective outside the container.

> Also, the Theory Native logging has a flaw: it doesn't log what it's trying to run until after it's tried and possibly failed, so there's no way client-side to understand which app caused which failure; e.g. there's nothing in 221182407 to indicate whether a Sherpa caused the problem.

Right. This version of cranky logs too late to stderr.txt. I already sent a suggestion to CERN on how this could be solved; let's see if it makes it into a future cranky version.
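A quick way to apply the advice above from outside the container is sketched below, assuming the Debian/Ubuntu default data directory /var/lib/boinc-client (substitute your own BOINC basic directory).

    # Show the tail of every active Theory slot's runRivet.log.
    for log in /var/lib/boinc-client/slots/*/cernvm/shared/runRivet.log; do
        [ -f "$log" ] || continue
        echo "== $log =="
        tail -n 5 "$log"
    done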
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> The native app runs inside a runc container.

Hi, I thought it did something like that, but naively I was expecting that ps would unwrap that when reporting the command; time to man ps again.

Looking in runRivet.log:

    ===> [runRivet] Sun Apr 14 23:12:47 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43]
    ...
    ===> [rungen] Sun Apr 14 23:12:50 UTC 2019 [boinc pp winclusive 7000 10 - sherpa 2.2.5 default 1000 43 /shared/tmp/tmp.lpORph0vRx/generator.hepmc]
    ...
    Starting the calculation at 23:13:36. Lean back and enjoy ... .

but there is no sign in any of the time stamps of anything happening after 23:12, and runRivet.log reports 0 events.
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> > Native theory task 221036445 has so far clocked up over 176 hours
> This task did not finish before its due date.

As of writing, the task status under WU 110458183 is merely "Timed out - no response", as the client hasn't bothered to update the server on the task's (lack of) progress. AFAIK the client can still upload results and report task completion as long as the WU database entry exists (hasn't been cleaned out) - I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

Not that I'm expecting this task to complete - expect the status to change when I restart BOINC after lunch!
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> I've certainly had results validate and credit granted for overrunning Sixtrack tasks previously reported as timed out.

A grace period can be set at the project server to accept overdue results, but I don't know if it is active for Theory native.

> ... when I restart BOINC after lunch ...

Did you restart the BOINC client while a Theory native task was running? Hopefully not, as Theory native (also: ATLAS native) doesn't use checkpointing, so the task would have started from scratch at every client restart.
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> ... when I restart BOINC after lunch ...

There was only this one Theory native left... what I forgot to do was to leave the re-started invocation running long enough to see if it was at least doing something useful this time around, before aborting it. :(
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 172
> pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43

That's 49 submissions, via BOINC?
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,523,354 · RAC: 2,363
> > pp winclusive 7000 10 - sherpa 2.2.5 default ===> attempts 49, success 0, failures 6, lost 43
> That's 49 submissions, via BOINC?

I'm not sure, but I think it's not only via BOINC. Over the last 6 weeks BOINC has done about 13,730 jobs a day, so I think the attempts also come from other computing sources.
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
    ===> [runRivet] Thu Apr 25 22:33:11 MDT 2019 [boinc ee zhad 22 - - sherpa 1.3.0 default 1000 48]

An hour ago my watchdog script flagged this one for graceful shutdown due to integration time > 1 day. The script hasn't actually deleted it because it's running in "flag but don't shutdown" mode. Now, an hour later, integration time has increased to 90 days. Log filesize is 360 KB and it's processed 0 events. The MC Production site has this report:

    ee zhad 22 - - sherpa 1.3.0 default: events 0, attempts 26, success 0, failure 1, lost 25

Does anybody see any hope for this one? I don't, but I'm going to give it another 20 hours (task duration is set for 10 days) to see how large the filesize and integration time become. I'm wondering if the integration time will peak and then decrease to 0, at which point it might start processing events. ATM it looks like it's heading for a 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED shutdown.
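For anyone building a similar watchdog, a minimal sketch of the integration-time check follows, assuming the log path used earlier in the thread and the "... elapsed / ... left" line format shown in the quoted excerpts; slot 4 and the one-day threshold are arbitrary examples.

    # Read the most recent integration ETA from runRivet.log and flag the
    # task when the projected time left is a day or more.
    LOG=/var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
    left=$(grep 'integration time' "$LOG" | tail -n 1 |
           sed 's:.*/ *::; s: left.*::')
    [ -n "$left" ] || exit 0          # no integration line logged yet
    case "$left" in
        *d*) echo "ETA $left: candidate for graceful shutdown" ;;
        *)   echo "ETA $left: still within limits" ;;
    esac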
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
Integration time left is 750 days and the log filesize is 2.2 MB, so I let the script gracefully terminate it.
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
The ETA may not tell the truth, hence there's no real reason to shut down those tasks. If a babysitter service is active beside BOINC, a task should only be shut down just before it hits one of the hard limits, which are:

1. EXIT_DISK_LIMIT_EXCEEDED
2. BOINC due date reached

I'm not yet sure if there is a #3, "Condor runtime limit". If so, it would replace (2).
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> The ETA may not tell the truth.

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

The disk limit is what I'm not sure about. In .../slots/4/init_data.xml I see

    <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

but VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/4/vm_image.vdi" says "Capacity: 20480 MBytes"... 8 GB versus 20 GB. Which figure should I use?
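A sketch of how a script might read both values, assuming the BOINC client's usual one-element-per-line layout of init_data.xml; slot 4 is just an example. <computation_deadline> is a Unix timestamp, <rsc_disk_bound> is in bytes.

    # Extract the two limits a babysitter script needs from a slot's init_data.xml.
    SLOT=/var/lib/boinc-client/slots/4
    deadline=$(sed -n 's:.*<computation_deadline>\([0-9.]*\)</computation_deadline>.*:\1:p' "$SLOT/init_data.xml")
    disk_bound=$(sed -n 's:.*<rsc_disk_bound>\([0-9.]*\)</rsc_disk_bound>.*:\1:p' "$SLOT/init_data.xml")
    echo "deadline (Unix epoch): $deadline"
    echo "disk bound (bytes):    $disk_bound"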
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,919,233 · RAC: 125,763
> The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

> In .../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed. VBoxManage returns the maximum size the disk can grow to from VBox's perspective; the real size is allocated dynamically, hence much smaller. The slots folder usually grows up to 3-4.5 GB, hence a babysitter script may set a limit between 5 and 7 GB.
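Putting the two previous posts together, a babysitter check might look like this sketch; the 6 GB limit is an example value inside the 5-7 GB window suggested above, and du's byte count is used as an approximation of what the client measures.

    # Warn when a numbered slot folder approaches the self-imposed limit.
    LIMIT=$((6 * 1024 * 1024 * 1024))   # 6 GB, example value
    for dir in /var/lib/boinc-client/slots/[0-9]*; do
        used=$(du -sb "$dir" | cut -f1)
        [ "$used" -gt "$LIMIT" ] && echo "$dir uses $used bytes, nearing the disk limit"
    done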
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.
> Right.

OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the folder named "slots" itself.