Message boards :
Theory Application :
(Native) Theory - Sherpa looooooong runners
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

> Shall I give this one a try?

I woke up this morning and saw this task's VM marked "Inaccessible" in VBox Manager (with the one-way-traffic sign), and in BOINC Manager the task was gone after 31.5 hours of run time.

===> [runRivet] Thu Dec 19 15:36:23 UTC 2019 [boinc pp jets 7000 150,-,2360 - sherpa 2.2.5 default 1000 195]

Of 137 attempts, 'only' 7 succeeded in processing 1000 events each, 2 of them among the last 8 attempts. I did not get the error the previous user had with this task ([ERROR] Container 'runc' terminated with status code 1.); for me it was EXIT_DISK_LIMIT_EXCEEDED.

In BOINC Manager:
[LHC@home] Aborting task Theory_2279-772397-195_2: exceeded disk limit: 2039.03MB > 1907.35MB

So this was an unnecessary error: it would have been avoided if the project had set <rsc_disk_bound>2000000000.000000</rsc_disk_bound> higher.
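The two figures in the abort message are consistent with BOINC counting a megabyte as 2^20 bytes: 2,000,000,000 / 1,048,576 ≈ 1907.35. A minimal Python sketch of that arithmetic; the 2^20 conversion factor is an assumption inferred from these numbers, and the helper names are made up for illustration:

```python
# Assumption: the "MB" in BOINC's abort message means 2**20 bytes (MiB).
MIB = 2 ** 20

def bytes_to_mib(n_bytes: float) -> float:
    """Convert a byte count to binary megabytes."""
    return n_bytes / MIB

def mib_to_bytes(mib: float) -> float:
    """Convert binary megabytes back to bytes."""
    return mib * MIB

rsc_disk_bound = 2_000_000_000.0  # <rsc_disk_bound> of this workunit, in bytes
used_mib = 2039.03                # disk usage reported when the task was aborted

limit_mib = bytes_to_mib(rsc_disk_bound)
print(f"limit: {limit_mib:.2f} MB")                           # limit: 1907.35 MB
print(f"over the limit: {used_mib > limit_mib}")              # over the limit: True
print(f"bytes actually used: {mib_to_bytes(used_mib):,.0f}")  # bytes actually used: 2,138,077,921
```

By the same conversion, this workunit would have needed an <rsc_disk_bound> of at least roughly 2.14e9 bytes to finish.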
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

Successful long sherpa:
pp jets 7000 400 - sherpa 1.4.2 default 33000 196; run time: 2 days 4 hours 32 minutes 16 seconds
https://lhcathome.cern.ch/lhcathome/result.php?resultid=255845676
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

Valid long sherpa: https://lhcathome.cern.ch/lhcathome/result.php?resultid=256264416 (1 day 23 hours 3 min 17 sec)

===> [runRivet] Sat Dec 21 10:00:27 UTC 2019 [boinc pp jets 8000 600 - sherpa 2.1.0 default 24000 196]
. .
Starting the calculation at 10:02:17. Lean back and enjoy ... .
. .
Starting the calculation at 10:14:43. Lean back and enjoy ... .
. .
Starting the calculation at 10:23:44. Lean back and enjoy ... .
. .
Starting the calculation at 11:48:57. Lean back and enjoy ... .
.....
integration time: ( 4h 41m 45s (6h 59m 40s) elapsed / 0s (0s) left ) [16:30:42]
. .
Event 1 ( 5s elapsed / 1d 13h 3m 54s left ) -> ETA: Mon Dec 23 05:34
. .
Event 20000 ( 1d 6h 34m 3s elapsed / 6h 6m 48s left ) -> ETA: Tue Dec 24 00:46
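The "left" figure at event 20000 matches a plain linear extrapolation: 110043 s elapsed (1d 6h 34m 3s) × 4000 remaining / 20000 done = 22008.6 s, i.e. the 6h 6m 48s shown. This is inferred from the numbers in the log, not from Sherpa's source; the function names below are hypothetical:

```python
def parse_dhms(text: str) -> int:
    """Convert a duration such as '1d 6h 34m 3s' into seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(token[:-1]) * units[token[-1]] for token in text.split())

def format_dhms(seconds: float) -> str:
    """Format seconds as 'Nd Nh Nm Ns', truncating fractions and dropping leading zero units."""
    s = int(seconds)
    d, s = divmod(s, 86400)
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = [(d, "d"), (h, "h"), (m, "m"), (s, "s")]
    while len(parts) > 1 and parts[0][0] == 0:
        parts.pop(0)
    return " ".join(f"{value}{unit}" for value, unit in parts)

def time_left(elapsed_s: float, done: int, total: int) -> float:
    """Linear extrapolation: remaining events take as long per event as the finished ones."""
    return elapsed_s * (total - done) / done

elapsed = parse_dhms("1d 6h 34m 3s")
print(format_dhms(time_left(elapsed, 20000, 24000)))  # 6h 6m 48s
```

The Event 1 line only roughly fits the same rule, presumably because its displayed "5s" is rounded.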
Joined: 18 Dec 15; Posts: 1811; Credit: 118,375,661; RAC: 25,716

Here is another one; it finished this morning after 1 day 14 hours 4 min 14 sec:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256423635

Merry Christmas!
Joined: 15 Nov 14; Posts: 602; Credit: 24,371,321; RAC: 0

> Merry Christmas!

Merry Christmas to you! They did Vienna up here last year; this year it was Salzburg, but we could not get up there due to the weather. https://www.oldchristkindl.com/
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

Some new native sherpa long runners:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255578610
pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194; runtime: 4d 2h 29m 15s; cputime: 4d 2h 29m 15s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255516491
pp jets 13000 250,-,4160 - sherpa 2.2.4 default 2000 194; runtime: 3d 13h 1m 26s; cputime: 2d 16h 45m 19s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255584644
pp jets 7000 150,-,2360 - sherpa 2.2.4 default 2000 194; runtime: 2d 20h 51m 12s; cputime: 2d 15h 59m 55s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255537421
pp jets 13000 250,-,4160 - sherpa 2.2.2 default 3000 194; runtime: 2d 5h 42m 12s; cputime: 1d 21h 59m 8s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255553665
pp jets 7000 300 - sherpa 1.4.0 default 43000 194; runtime: 1d 19h 3m 39s; cputime: 1d 14h 16m 31s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256296514
pp jets 7000 250 - sherpa 1.4.0 default 54000 196; runtime: 1d 18h 26m 46s; cputime: 1d 18h 26m 46s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256017665
pp jets 8000 600 - sherpa 1.2.2p default 24000 196; runtime: 1d 13h 8m 14s; cputime: 1d 13h 8m 14s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255588683
pp jets 7000 500 - sherpa 1.4.0 default 28000 192; runtime: 1d 9h 27m 19s; cputime: 1d 9h 27m 19s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255654487
pp jets 7000 600 - sherpa 2.1.1 default 16000 194; runtime: 1d 8h 26m 22s; cputime: 1d 7h 18m 15s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255608465
pp jets 7000 500 - sherpa 1.4.1 default 29000 194; runtime: 1d 6h 40m 11s; cputime: 0d 23h 53m 32s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256400181
pp jets 7000 500 - sherpa 2.2.5 default 32000 197; runtime: 1d 5h 57m 11s; cputime: 0d 15h 0m 53s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255634568
pp jets 7000 350 - sherpa 2.2.5 default 32000 193; runtime: 1d 5h 23m 39s; cputime: 1d 3h 5m 14s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255638982
pp bbbar 7000 - - sherpa 2.2.5 default 100000 195; runtime: 1d 4h 40m 6s; cputime: 1d 4h 40m 6s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255615782
pp jets 8000 350 - sherpa 2.2.4 default 53000 194; runtime: 1d 4h 0m 50s; cputime: 1d 0h 56m 19s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255589170
pp jets 7000 400 - sherpa 2.2.2 default 35000 194; runtime: 1d 3h 30m 2s; cputime: 1d 3h 30m 2s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256329358
pp jets 8000 600 - sherpa 1.4.3 default 32000 194; runtime: 1d 2h 7m 4s; cputime: 1d 2h 7m 4s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256108556
pp jets 7000 800 - sherpa 1.3.0 default 24000 196; runtime: 1d 1h 51m 6s; cputime: 0d 19h 34m 48s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255587448
pp jets 7000 300 - sherpa 1.4.0 default 47000 192; runtime: 1d 1h 27m 47s; cputime: 1d 1h 27m 47s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255614466
pp jets 7000 500 - sherpa 2.2.4 default 26000 194; runtime: 1d 0h 20m 45s; cputime: 0d 20h 37m 20s

For more details see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5247&postid=41070#41070
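Comparing run time with CPU time across a list like this is easier with a small parser. A minimal sketch, assuming the "Nd Nh Nm Ns" format used in the list; the helper names are hypothetical. A low CPU/run-time ratio (about 50% for the sherpa 2.2.5 job from batch 197) suggests the task was not getting a full CPU for much of its wall-clock time, without saying why:

```python
import re

UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def to_seconds(duration: str) -> int:
    """'1d 3h 5m 14s' -> seconds."""
    return sum(int(n) * UNITS[u] for n, u in re.findall(r"(\d+)([dhms])", duration))

def cpu_efficiency(runtime: str, cputime: str) -> float:
    """Fraction of the wall-clock run time that was spent on the CPU."""
    return to_seconds(cputime) / to_seconds(runtime)

# A few (job, runtime, cputime) entries copied from the list above.
runners = [
    ("pp jets 7000 500 - sherpa 2.2.5 default 32000 197", "1d 5h 57m 11s", "0d 15h 0m 53s"),
    ("pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194", "4d 2h 29m 15s", "4d 2h 29m 15s"),
    ("pp jets 7000 500 - sherpa 1.4.1 default 29000 194", "1d 6h 40m 11s", "0d 23h 53m 32s"),
]

# Least CPU-efficient first.
for job, run, cpu in sorted(runners, key=lambda r: cpu_efficiency(r[1], r[2])):
    print(f"{cpu_efficiency(run, cpu):6.1%}  {run:>15}  {job}")
```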
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

> Some new native sherpa long runners:

I did not study all the long runners you mentioned, but the first one (pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194) looks like it ran the task (partially) twice.
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

Another one of mine:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256741296
pp jets 7000 500 - sherpa 1.4.5 default 30000 198; run time: 1d 18h 54m 41s

> I did not study all the long runners you mentioned, but the above one looks like it ran the task (partially) twice.

Some of them (apparently) get paused for hours, e.g.:
14:36:54 CET +01:00 2019-12-21: cranky-0.0.29: [INFO] Pausing container Theory_2279-791030-196_0.
14:36:54 CET +01:00 2019-12-21: cranky-0.0.29: [WARNING] Cannot pause container as /sys/fs/cgroup/freezer/boinc/freezer.state not exists.
16:31:58 CET +01:00 2019-12-21: cranky-0.0.29: [INFO] Resuming container Theory_2279-791030-196_0.

Some other tasks have already been deleted and can no longer be investigated. Does the server delete tasks that are older than 10 or 11 days?
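Pauses like this can be measured from the cranky lines in a task's stderr. A minimal sketch, assuming the line format of the excerpt above; the helper names are hypothetical. Given the WARNING, the container may not actually have been frozen in this interval, so the sketch only measures the time between the "Pausing" and "Resuming" messages:

```python
from datetime import datetime

LOG = """\
14:36:54 CET +01:00 2019-12-21: cranky-0.0.29: [INFO] Pausing container Theory_2279-791030-196_0.
14:36:54 CET +01:00 2019-12-21: cranky-0.0.29: [WARNING] Cannot pause container as /sys/fs/cgroup/freezer/boinc/freezer.state not exists.
16:31:58 CET +01:00 2019-12-21: cranky-0.0.29: [INFO] Resuming container Theory_2279-791030-196_0.
"""

def timestamp(line: str) -> datetime:
    """Parse the 'HH:MM:SS TZ +hh:mm YYYY-MM-DD:' prefix of a cranky log line."""
    clock, _zone, offset, day = line.split()[:4]
    return datetime.strptime(f"{day.rstrip(':')} {clock} {offset.replace(':', '')}",
                             "%Y-%m-%d %H:%M:%S %z")

def pause_intervals(log: str):
    """Yield (paused_at, resumed_at) pairs from 'Pausing'/'Resuming container' lines."""
    paused_at = None
    for line in log.splitlines():
        if "[INFO] Pausing container" in line:
            paused_at = timestamp(line)
        elif "[INFO] Resuming container" in line and paused_at is not None:
            yield paused_at, timestamp(line)
            paused_at = None

for start, end in pause_intervals(LOG):
    print(start.isoformat(), "->", end.isoformat(), "paused for", end - start)
# 2019-12-21T14:36:54+01:00 -> 2019-12-21T16:31:58+01:00 paused for 1:55:04
```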
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

I did not mean the pausing, although that 'freezer.state not exists' warning doesn't appear when I suspend a task (with or without LAIM).

> I did not study all the long runners you mentioned, but the above one looks like it ran the task (partially) twice.
> Some of them get(?) paused for hours.

I meant these 2 starts of your job, more than 3 days apart:
00:45:52 CET +01:00 2019-12-16: cranky-0.0.29: [INFO] Running Container 'runc'.
00:45:55 CET +01:00 2019-12-16: cranky-0.0.29: [INFO] ===> [runRivet] Sun Dec 15 23:45:54 UTC 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194]
. .
03:53:29 CET +01:00 2019-12-20: cranky-0.0.29: [INFO] Running Container 'runc'.
03:53:31 CET +01:00 2019-12-20: cranky-0.0.29: [INFO] ===> [runRivet] Fri Dec 20 02:53:29 UTC 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194]

> Does the server delete tasks that are older than 10 or 11 days?

Results are not stored very long; I don't know exactly how long. Sometimes, however, (very) old results accidentally stay in the DB and have to be purged manually.
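A second "Running Container" line in the same task's stderr is what gave this restart away. A minimal sketch that collects every container start and prints the gap between consecutive starts, assuming the cranky line format shown above; the helper names are hypothetical, and a repeated start only indicates that the job began again, not why:

```python
from datetime import datetime

LOG = """\
00:45:52 CET +01:00 2019-12-16: cranky-0.0.29: [INFO] Running Container 'runc'.
00:45:55 CET +01:00 2019-12-16: cranky-0.0.29: [INFO] ===> [runRivet] Sun Dec 15 23:45:54 UTC 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194]
03:53:29 CET +01:00 2019-12-20: cranky-0.0.29: [INFO] Running Container 'runc'.
03:53:31 CET +01:00 2019-12-20: cranky-0.0.29: [INFO] ===> [runRivet] Fri Dec 20 02:53:29 UTC 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194]
"""

def timestamp(line: str) -> datetime:
    """Parse the 'HH:MM:SS TZ +hh:mm YYYY-MM-DD:' prefix of a cranky log line."""
    clock, _zone, offset, day = line.split()[:4]
    return datetime.strptime(f"{day.rstrip(':')} {clock} {offset.replace(':', '')}",
                             "%Y-%m-%d %H:%M:%S %z")

def container_starts(log: str) -> list:
    """Timestamps of every 'Running Container' line; more than one means the job restarted."""
    return [timestamp(line) for line in log.splitlines()
            if "[INFO] Running Container 'runc'" in line]

starts = container_starts(LOG)
for first, second in zip(starts, starts[1:]):
    print(f"container started again after {second - first}")
# container started again after 4 days, 3:07:37
```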
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

> I did not mean the pausing, although that 'freezer.state not exists' doesn't happen when I suspend a task (with and without LAIM).

Yeah, I understood that. Sorry, I was just reporting another possible problem in reply to "I did not study all the long runners you mentioned".

> Results are not stored very long. I don't know how long exactly.

Well, I saved the "reported/received time" in "sherpaLongRunners.txt", and today the server is deleting tasks reported on December 16th. If someone would like to study those tasks, we need to save the stderr output too.
Joined: 13 Jul 05; Posts: 169; Credit: 15,000,737; RAC: 13

Task 256751764 looks to have been stuck in a loop for nearly a week:
===> [runRivet] Tue Dec 24 13:11:39 UTC 2019 [boinc pp jets 7000 80,-,1360 - sherpa 1.4.3 default 100000 198]
and then spent over 6 days looping round (9 histograms, 60000 events).

I'd have killed it outright, but I had to restart the BOINC client anyway, so it's re-running from scratch instead. I'll see how it gets on in a few hours' time.

(N.B. after ~1 week the runRivet.log is still less than 1 MB.)
Joined: 13 Jul 05; Posts: 169; Credit: 15,000,737; RAC: 13

> Task 256751764 looks to have been stuck in a loop for nearly a week: ... I'd have killed it outright but I had to restart the boinc client anyway, so it's instead re-running from scratch. I'll see how it gets on in a few hours' time.

It looks to be stuck in the same loop as last time, and I'm not wasting another week on it:
===> [runRivet] Tue Dec 31 12:41:51 UTC 2019 [boinc pp jets 7000 80,-,1360 - sherpa 1.4.3 default 100000 198]
Joined: 13 Jul 05; Posts: 169; Credit: 15,000,737; RAC: 13

> Task 256751764 looks to have been stuck in a loop for nearly a week: ... I'd have killed it outright but I had to restart the boinc client anyway, so it's instead re-running from scratch. I'll see how it gets on in a few hours' time.
> It looks to be stuck in the same loop as last time, and I'm not wasting another week on it:

So I told the BOINC client to abort it, but the reported "Run time 2 hours 17 min 52 sec" doesn't tell the real story, as it only counts the second attempt; the first run had clocked up well over 150 hours for the Sherpa process alone. Is that a known BOINC issue?
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

Some new native sherpa long runners:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=255648161
pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 196; runtime: 10d 13h 58m 30s; cputime: 7d 23h 19m 56s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256746509
pp jets 13000 250,-,4160 - sherpa 2.2.4 default 2000 198; runtime: 7d 18h 28m 38s; cputime: 6d 7h 38m 24s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256839400
pp ttbar 7000 - - sherpa 2.1.1 default 3000 198; runtime: 5d 16h 47m 50s; cputime: 4d 22h 16m 11s

pp jets 7000 150,-,2160 - sherpa 2.2.4 default 3000 194; runtime: 4d 2h 29m 15s; cputime: 4d 2h 29m 15s

pp jets 7000 500 - sherpa 1.4.5 default 30000 198; runtime: 1d 18h 54m 42s; cputime: 1d 17h 18m 43s

pp jets 7000 250 - sherpa 1.4.0 default 54000 196; runtime: 1d 18h 26m 46s; cputime: 1d 18h 26m 46s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256791131
pp jets 7000 300 - sherpa 1.4.1 default 34000 198; runtime: 1d 13h 22m 36s; cputime: 1d 4h 38m 29s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256797351
pp jets 7000 80,-,1760 - sherpa 2.2.1 default 41000 198; runtime: 1d 11h 16m 14s; cputime: 1d 1h 50m 57s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=257165869
pp jets 7000 20,-,310 - sherpa 2.2.4 default 100000 200; runtime: 1d 4h 0m 6s; cputime: 0d 23h 50m 35s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256812772
pp jets 8000 350 - sherpa 2.2.0 default 61000 198; runtime: 1d 2h 36m 57s; cputime: 1d 2h 11m 20s

https://lhcathome.cern.ch/lhcathome/result.php?resultid=257243301
pp jets 7000 800 - sherpa 2.2.2 default 31000 200; runtime: 1d 0h 33m 30s; cputime: 1d 0h 41m 45s

For more details see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5247&postid=41134#41134
Joined: 7 Feb 14; Posts: 99; Credit: 5,180,005; RAC: 0

Some VBox tasks look like true long runners (I'm only checking the longer ones manually):

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256857613
pp jets 8000 600 - sherpa 1.3.0 default 37000 198; runtime: 4 days 3 hours 31 min 22 sec; cputime: 4 days 3 hours 31 min 22 sec

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256763673
pp jets 7000 400 - sherpa 1.4.2 default 41000 198; runtime: 3 days 17 hours 35 min 52 sec; cputime: 3 days 17 hours 23 min 16 sec

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256816995
pp jets 8000 600 - sherpa 2.1.0 default 26000 194; runtime: 3 days 12 hours 35 min 19 sec; cputime: 3 days 12 hours 12 min 9 sec

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256442290
pp jets 7000 500 - sherpa 2.2.0 default 55000 198; runtime: 3 days 12 hours 1 min 50 sec; cputime: 3 days 11 hours 58 min 27 sec

Maybe I can implement something to detect them... A good start looks like this:
2019-12-26 05:57:10 (63072): Status Report: Job Duration: '360000.000000'
2019-12-26 05:57:10 (63072): Status Report: Elapsed Time: '6000.000000'
2019-12-26 05:57:10 (63072): Status Report: CPU Time: '5978.625000'

Note the Elapsed Time: '6000.000000'. Then I expect a succession of reports like
2019-12-24 04:12:12 (11892): Status Report: Job Duration: '360000.000000'
2019-12-24 04:12:12 (11892): Status Report: Elapsed Time: '60000.000000'
2019-12-24 04:12:12 (11892): Status Report: CPU Time: '60056.015625'

where each Elapsed Time must be ceil(previous one + 6000). A graceful end looks like
2019-12-26 15:00:34 (5528): Status Report: Job Duration: '360000.000000'
2019-12-26 15:00:34 (5528): Status Report: Elapsed Time: '338780.708530'
2019-12-26 15:00:34 (5528): Status Report: CPU Time: '332449.090000'
2019-12-26 16:27:24 (5528): Guest Log: job: CPU usage:

and the VBox elapsed/CPU times are similar to the server-side ones.

Other VBox tasks are a bit faulty. This one was restarted 14 times:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=255551532
ppbar jets 1960 140 - sherpa 2.1.0 default 100000 192; runtime: 3 days 23 hours 19 min 36 sec; cputime: 3 days 21 hours 46 min 58 sec
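The detection idea described in this post can be sketched as a small classifier over a vboxwrapper stderr: the first Elapsed Time should be 6000, each following one ceil(previous + 6000), and a graceful end is marked by the "Guest Log: job: CPU usage:" line. This is a hypothetical helper implementing only that heuristic; the assumption that a restart shows up as a break in the 6000-second sequence has not been checked against real restarted tasks, and the sample log uses made-up values:

```python
import math
import re

ELAPSED_RE = re.compile(r"Status Report: Elapsed Time: '([0-9.]+)'")
PERIOD = 6000  # seconds between status reports, as observed in the excerpts above

def check_vbox_log(log: str) -> str:
    """Classify a vboxwrapper stderr with the heuristic described in the post."""
    elapsed = [float(v) for v in ELAPSED_RE.findall(log)]
    if not elapsed or elapsed[0] != PERIOD:
        return "unexpected start"
    graceful = "Guest Log: job: CPU usage:" in log
    # The last report of a graceful end is written at an arbitrary time, so skip it.
    periodic = elapsed[:-1] if graceful else elapsed
    for prev, cur in zip(periodic, periodic[1:]):
        if cur != math.ceil(prev + PERIOD):
            return "irregular (restart?)"
    return "graceful end" if graceful else "running (no end yet)"

# Synthetic sample with made-up values, not taken from a real task.
sample = "\n".join([
    "2019-12-26 05:57:10 (63072): Status Report: Elapsed Time: '6000.000000'",
    "2019-12-26 07:37:10 (63072): Status Report: Elapsed Time: '12000.000000'",
    "2019-12-26 08:10:00 (63072): Status Report: Elapsed Time: '13980.500000'",
    "2019-12-26 08:12:00 (63072): Guest Log: job: CPU usage:",
])
print(check_vbox_log(sample))  # graceful end
```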
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

4 long-running sherpas:
Theory_2279-783976-200_2; Elapsed Time 44:39:48; Time Left 55:22:10; 10 Jan 11:49:07; Running High P.
Theory_2279-752616-197_2; Elapsed Time 47:14:29; Time Left 953:14:57; 10 Jan 11:49:07; Running High P.
Theory_2279-751919-198_2; Elapsed Time 44:43:18; Time Left 955:44:53; 10 Jan 11:49:07; Running High P.
Theory_2279-750935-198_2; Elapsed Time 44:42:13; Time Left 955:50:15; 10 Jan 11:49:07; Running
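Adding Elapsed Time and Time Left gives the total run time the BOINC client currently expects for each task: about 100 hours for the first one and about 1000 hours for the other three. A minimal sketch of that arithmetic (helper names hypothetical); it says nothing about how the client computes its estimate:

```python
def hms_to_seconds(hms: str) -> int:
    """'955:44:53' (hours may exceed 24) -> seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def seconds_to_hms(total: int) -> str:
    """Seconds -> 'H:MM:SS' with unbounded hours, like BOINC Manager's columns."""
    h, rest = divmod(total, 3600)
    m, s = divmod(rest, 60)
    return f"{h}:{m:02d}:{s:02d}"

# (Elapsed Time, Time Left) as listed above.
tasks = {
    "Theory_2279-783976-200_2": ("44:39:48", "55:22:10"),
    "Theory_2279-752616-197_2": ("47:14:29", "953:14:57"),
    "Theory_2279-751919-198_2": ("44:43:18", "955:44:53"),
    "Theory_2279-750935-198_2": ("44:42:13", "955:50:15"),
}
for name, (elapsed, left) in tasks.items():
    print(name, seconds_to_hms(hms_to_seconds(elapsed) + hms_to_seconds(left)))
```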
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

The shortest one (of the four mentioned before) finished: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2853813
===> [runRivet] Tue Dec 31 13:39:17 UTC 2019 [boinc pp jets 7000 800 - sherpa 1.3.0 default 26000 200]
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

> 4 long running sherpa's:

This one is still running (66 hours) and is now allocating 3.24 GB of disk space in its slot.
Joined: 13 Jul 05; Posts: 169; Credit: 15,000,737; RAC: 13

And now Task 257414218 looks to be stuck in a loop, and its completion time is increasing:
===> [runRivet] Sat Jan 4 03:05:54 UTC 2020 [boinc ee zhad 197 - - sherpa 2.2.5 default 1000 201]

I won't have tinkering time tomorrow, so I'll have to abort it today rather than waste ~48 hours of CPU on it.
Joined: 14 Jan 10; Posts: 1417; Credit: 9,441,837; RAC: 794

I killed this sherpa after 102 hours elapsed time: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2853953
===> [runRivet] Tue Dec 31 13:33:47 UTC 2019 [boinc ee zhad 29 - - sherpa 2.2.4 default 2000 198]
. . .
3.14142e+16 pb +- ( 1.07261e+16 pb = 34.144 % ) 1030200000 ( 1030200173 -> 99.9 % )
integration time: ( 4d 1h 10m 20s elapsed / 4716d 21h 16m 16s left ) [14:43:08]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
3.14136e+16 pb +- ( 1.07259e+16 pb = 34.144 % ) 1030220000 ( 1030220173 -> 99.9 % )
integration time: ( 4d 1h 10m 28s elapsed / 4716d 23h 51m 23s left ) [14:43:16]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
3.1413e+16 pb +- ( 1.07256e+16 pb = 34.144 % ) 1030240000 ( 1030240173 -> 99.9 % )
integration time: ( 4d 1h 10m 36s elapsed / 4717d 2h 26m 19s left ) [14:43:24]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
3.14124e+16 pb +- ( 1.07254e+16 pb = 34.144 % ) 1030260000 ( 1030260173 -> 99.9 % )
integration time: ( 4d 1h 10m 44s elapsed / 4717d 5h 1m 3s left ) [14:43:32]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
Poincare::Poincare(): Inaccurate rotation { a = (0.441795,-0.740307,0.506719) b = (0,0,1) a' = (0.99728,-0.0608254,0.0416332) -> rel. dev. (inf,-inf,-0.958367) m_ct = 0.506719 m_st = -0.862112 m_n = (-0,4.81631e-07,7.03654e-07) }
Poincare::Poincare(): Inaccurate rotation { a = (0.441795,-0.740307,0.506719) b = (0,0,1) a' = (0.99728,-0.0608254,0.0416332) -> rel. dev. (inf,-inf,-0.958367) m_ct = 0.506719 m_st = -0.862112 m_n = (-0,4.81631e-07,7.03654e-07) }
3.14118e+16 pb +- ( 1.07252e+16 pb = 34.144 % ) 1030280000 ( 1030280173 -> 99.9 % )
integration time: ( 4d 1h 10m 53s elapsed / 4717d 7h 51m 7s left ) [14:43:41]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
Poincare::Poincare(): Inaccurate rotation { a = (0.532472,0.199727,-0.822547) b = (0,0,1) a' = (0.0433899,-0.235736,0.970848) -> rel. dev. (inf,-inf,-0.0291521) m_ct = -0.822547 m_st = -0.568697 m_n = (0,-6.98892e-07,-1.69701e-07) }
Poincare::Poincare(): Inaccurate rotation { a = (0.532472,0.199727,-0.822547) b = (0,0,1) a' = (0.0433899,-0.235736,0.970848) -> rel. dev. (inf,-inf,-0.0291521) m_ct = -0.822547 m_st = -0.568697 m_n = (0,-6.98892e-07,-1.69701e-07) }
Updating display...
Display update finished (0 histograms, 0 events).
3.14112e+16 pb +- ( 1.0725e+16 pb = 34.144 % ) 1030300000 ( 1030300173 -> 99.9 % )
integration time: ( 4d 1h 11m 1s elapsed / 4717d 10h 17m 30s left ) [14:43:49]
My_File<FileType>::Close(): '0x31005d8' returns 'out of memory'.
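This task shows the same symptom as Task 257414218 above: Sherpa's "left" estimate grows instead of shrinking (here beyond 4700 days, with the error stuck at 34.144 %). A minimal watchdog sketch, assuming the integration-time line format shown in these logs; the function names and the `min_samples` threshold are hypothetical, and a growing estimate over a few lines is a hint, not proof, that the integration will never converge:

```python
import re

UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
# Captures the elapsed and left parts of Sherpa's integration-time line.
INTEGRATION_RE = re.compile(r"integration time: \( (.+?) elapsed / (.+?) left \)")

def to_seconds(duration: str) -> int:
    """'4716d 21h 16m 16s' -> seconds."""
    return sum(int(n) * UNITS[u] for n, u in re.findall(r"(\d+)([dhms])", duration))

def left_estimates(log: str) -> list:
    """'left' estimates in seconds, ignoring the bracketed second value some lines carry."""
    return [to_seconds(left.split(" (")[0]) for _elapsed, left in INTEGRATION_RE.findall(log)]

def left_is_growing(log: str, min_samples: int = 3) -> bool:
    """True if the last min_samples 'left' estimates strictly increase."""
    recent = left_estimates(log)[-min_samples:]
    return len(recent) == min_samples and all(b > a for a, b in zip(recent, recent[1:]))

SAMPLE = """\
integration time: ( 4d 1h 10m 20s elapsed / 4716d 21h 16m 16s left ) [14:43:08]
integration time: ( 4d 1h 10m 28s elapsed / 4716d 23h 51m 23s left ) [14:43:16]
integration time: ( 4d 1h 10m 36s elapsed / 4717d 2h 26m 19s left ) [14:43:24]
"""
print(left_is_growing(SAMPLE))  # True
```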
©2024 CERN