Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30336 - Posted: 14 May 2017, 19:34:41 UTC
Last modified: 14 May 2017, 19:47:15 UTC

Many units are failing with this message, and they actually correspond to units which exceed the indicated processing time. They stop and abort themselves at that moment.

Programming error? There are other units of the same type finishing OK with processing times beyond 30940 seconds.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=139382571
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)
</message>
<stderr_txt>

</stderr_txt>
]]>
ID: 30336
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30341 - Posted: 14 May 2017, 22:45:33 UTC - in response to Message 30336.  

Thank you for your report; very strange indeed, and there is nothing "special" about these
tasks that I can see. Will try and investigate further. Eric.
ID: 30341
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30343 - Posted: 15 May 2017, 4:27:34 UTC

It continues happening on that host, so I am suspending similar tasks. Let's see what happens with the wzero_ ones.
ID: 30343
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30344 - Posted: 15 May 2017, 4:34:08 UTC

Same issue; suspending LHC on that host.
ID: 30344
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 30345 - Posted: 15 May 2017, 6:01:56 UTC

The same task was finished by another PC after more than ONE DAY!

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=67361476
ID: 30345
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30346 - Posted: 15 May 2017, 6:08:59 UTC - in response to Message 30336.  


exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)

This floating point speed was reported much too high.
Therefore the client has calculated a much shorter time limit to finish.

Meanwhile that machine is reporting a measured floating point speed of 1957.67 million ops/second.
If that value was used when requesting tasks, you would have 91946.04 seconds to finish a task.
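As a rough sketch of the arithmetic behind the error message: the two figures in parentheses are the workunit's FLOPs bound and the host's (wrongly inflated) reported speed, and their quotient is the elapsed time limit the client enforces. The values below are copied from the message above; the formula is a simplification for illustration.

```python
# Sketch: how the elapsed time limit in the error message is derived.
# 180000000.00G is the workunit's rsc_fpops_bound (in GFLOPs) and
# 5817.56G is the host's (wrongly inflated) speed in GFLOPS.
rsc_fpops_bound = 180_000_000.00e9   # total floating-point operations allowed
flops = 5_817.56e9                   # reported host speed, ops per second

time_limit = rsc_fpops_bound / flops
print(time_limit)                    # roughly 30940.8 seconds, as in the message
```

With the realistic speed the host is now reporting, the same bound would allow a much longer run, which is why only the inflated figure causes aborts.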
ID: 30346
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30348 - Posted: 15 May 2017, 9:25:18 UTC - in response to Message 30346.  


exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)

This floating point speed was reported much too high.
Therefore the client has calculated a much shorter time limit to finish.

Meanwhile that machine is reporting a measured floating point speed of 1957.67 million ops/second.
If that value was used when requesting tasks, you would have 91946.04 seconds to finish a task.


How could that happen?
ID: 30348
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30349 - Posted: 15 May 2017, 9:59:38 UTC - in response to Message 30348.  

And is there a way to identify which tasks have this problem? I see in my log successful and errored tasks that were downloaded at the same time.
ID: 30349
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30350 - Posted: 15 May 2017, 12:34:45 UTC

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?
ID: 30350
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30351 - Posted: 15 May 2017, 13:37:33 UTC - in response to Message 30350.  

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?

Probably all tasks downloaded before you changed to the lower fpops will still have the higher fpops in the workunit settings.

If you are already hacking the client_state.xml, you could increase the <rsc_fpops_bound> for those workunits by a factor of 10.
ID: 30351
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30354 - Posted: 15 May 2017, 16:45:12 UTC - in response to Message 30351.  

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?

Probably all tasks downloaded before you changed to the lower fpops will still have the higher fpops in the workunit settings.

If you are already hacking the client_state.xml, you could increase the <rsc_fpops_bound> for those workunits by a factor of 10.


Thanks for the suggestion, but it is also reverting after I change it.
ID: 30354
Juha

Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Message 30357 - Posted: 15 May 2017, 18:31:25 UTC - in response to Message 30336.  

The average processing rate for the x86_64 sse2 version on your host is a hundred or so times larger than it should be.

You have had a couple of hundred short-running tasks. BOINC expects that the runtime of a task is proportional to its FLOPS estimate. Short-running tasks like the ones you have had could have made BOINC think your computer is really super fast.

Projects that have tasks like these are supposed to code their validators so that unusual tasks are marked as runtime outliers. The SixTrack validator seems to have that code (some app versions for my host show Consecutive valid tasks higher than Number of tasks completed), but I think there could be a bug in the code, and some short-running tasks are not marked as runtime outliers and so are allowed to influence runtime estimates.
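The mechanism described above can be shown with a toy calculation. This is a deliberate simplification (BOINC actually uses an exponentially-weighted average over recent tasks), but it illustrates why a 5-second task with a full-length FLOPs estimate makes the host look absurdly fast:

```python
# Toy illustration of how short-running tasks inflate the average
# processing rate (APR). Simplified; not BOINC's actual averaging code.
def apparent_rate(fpops_estimate, runtime_seconds):
    # The client infers speed as "estimated work done / time taken", so a
    # task that finishes far faster than its estimate looks like it ran
    # on an extremely fast machine.
    return fpops_estimate / runtime_seconds

est = 18_000_000.0e9                   # illustrative per-task FLOPs estimate
normal = apparent_rate(est, 30_000.0)  # a full-length SixTrack run
short = apparent_rate(est, 5.0)        # a 5-second "short run"
print(short / normal)                  # the short task looks 6000x faster
```

Once enough of these short tasks feed into the average, the inflated speed makes the time limit for normal tasks far too small, and they abort with the error at the top of the thread.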

You can help yourself out of this situation by increasing the <rsc_fpops_bound> of SixTrack tasks to 1000 times its value, or possibly even more. Before you edit client_state.xml you must shut down the BOINC client and make sure BOINC Manager or your OS doesn't automatically restart it until you are done with the edits.
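A minimal sketch of such an edit, assuming a standard client_state.xml layout. The helper name and usage path are hypothetical; back up the file first, and only run this while the client is fully stopped, as the post above stresses:

```python
# Sketch: multiply every <rsc_fpops_bound> in client_state.xml by 1000.
# Run only while the BOINC client is shut down; keep a backup copy.
import re

FACTOR = 1000.0

def bump_fpops_bound(xml_text, factor=FACTOR):
    """Return xml_text with each <rsc_fpops_bound> value multiplied."""
    def repl(match):
        new_value = float(match.group(1)) * factor
        return "<rsc_fpops_bound>%e</rsc_fpops_bound>" % new_value
    return re.sub(r"<rsc_fpops_bound>\s*([^<\s]+)\s*</rsc_fpops_bound>",
                  repl, xml_text)

# Usage (hypothetical path):
# text = open("client_state.xml").read()
# open("client_state.xml", "w").write(bump_fpops_bound(text))
```

A blanket multiply affects all projects' workunits in the file, so on a multi-project host you may prefer to edit only the SixTrack entries by hand.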
ID: 30357
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30378 - Posted: 17 May 2017, 18:18:46 UTC

The trick was to shut down BOINC Manager before making changes to client_state.xml. Having done so, either changing the application flops or the WU rsc_fpops_bound works like a charm.

Thanks!
ID: 30378
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30379 - Posted: 17 May 2017, 18:30:45 UTC

Thanks to all for your help with this; we shall be making sure that these short
runs are treated as outliers in future. Sorry about that. However, I think we are
over the worst, at least I hope so. Eric.
ID: 30379
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30973 - Posted: 23 Jun 2017, 14:16:58 UTC - in response to Message 30379.  

Thanks to all for your help with this; we shall be making sure that these short
runs are treated as outliers in future. Sorry about that. However, I think we are
over the worst, at least I hope so. Eric.


"We" are definitely not over the worst.

You are still sending out huge series of 5-second tasks, and these still corrupt the clients through severe miscalculation of app_version.flops.

I still see all of my hosts which run SixTrack tending towards dangerously over-estimated app_version.flops. They are in constant danger of being pushed over the edge by another series of short WUs, after which all proper WUs will error out.

Are you aware that this failure mode is a serious waste of your contributors' computing time and network bandwidth?

Currently I am spending a lot of time to get host 10486566 back into working order. Current contents of its "Application details" page:
SixTrack 451.07 i686-pc-linux-gnu (pni)
  Number of tasks completed    82
          Max tasks per day    571
      Number of tasks today    573
    Consecutive valid tasks    80
    Average processing rate    7,219.69 GFLOPS
    Average turnaround time    0.01 days

It was even in the 15,000 GFLOPS range previously. Another host with the same hardware is currently listed with 9 GFLOPS for SixTrack 451.07 i686-pc-linux-gnu (pni).

I am still not sure whether host 10486566 can ever be recovered. And even if I succeed, it will just be a temporary win until the next destructive series of 5-second WUs.

Thanks to Crystal Pellet for the pointers.
ID: 30973
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30982 - Posted: 23 Jun 2017, 18:45:39 UTC - in response to Message 30973.  

I agree this is a TERRIBLE problem. Since my hands are tied, I am going to try and
find and delete all these high-amplitude short runs. (The problem is complicated because
not all of them may be genuine, but in your case I think they are.) I shall try and check.
I am still WAITING for the fix.
I shall also try and run a few, but it is difficult for me to find the data files, as they
are deleted from the download directory. I am not sure about the impact of the
deletion, but I reckon I have to do it. I can only apologise AGAIN (I am fed up with
apologising), but I do not have any authority; I cannot change much.

I think the recent fixes for "Tasks not distributed", or the fact that you have over 50 processors,
imply that you might well get a whole bunch of short runs. I am on the verge of
cracking up over all this, as the problem is KNOWN and UNDERSTOOD but NOT fixed.
Eric.
ID: 30982
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30983 - Posted: 23 Jun 2017, 18:47:48 UTC - in response to Message 30973.  

Meanwhile, two more of my machines have contracted the same disease as host 10486566: they downloaded a few hundred tasks in one go, and the estimated runtime of these tasks was set to about 10 seconds. Luckily I noticed this before the machines began computing these tasks. (I shut down the clients, edited client_state.xml to make workunit.rsc_fpops_bound 10,000 times larger, and restarted the clients.)

Conclusion: You absolutely cannot run Sixtrack exclusively for more than a day on dual-socket machines, unless you are ready for repeated client_state.xml manipulations. (Smaller hosts may take longer to contract this issue, I guess.)

I am looking forward to your fix of Sixtrack's validator.

In addition: would it be feasible to process all of the generated WUs before they are published, in order to detect all short-running WUs and never send them to your contributors in the first place? A naive implementation of this would require circa 10 to 15 CPU seconds for each generated WU.
ID: 30983
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30984 - Posted: 23 Jun 2017, 18:53:10 UTC - in response to Message 30982.  

Eric, thanks for your reply (I only saw it after I sent my last post), and thanks for chasing all those interrelated problems. As a layman, I hardly have an idea of the hurdles you are encountering on your way to getting this solved.
ID: 30984
Juha

Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Message 30985 - Posted: 23 Jun 2017, 20:30:40 UTC

@Eric

If you need help, I could take a look at the validator. I'm of no use with the science, but I'm good at reading code and finding bugs.
ID: 30985
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30987 - Posted: 23 Jun 2017, 22:40:25 UTC - in response to Message 30984.  

OK, thanks for your support. I have TRIED to delete all w-c6 Tasks.
You may have some cached locally. Apparently there are more.
We are giving up for tonight. I shall look again first thing in the morning.
So much for a break! :-) Eric.
ID: 30987


©2024 CERN