Maximum elapsed time exceeded
Send message Joined: 20 Dec 07 Posts: 69 Credit: 599,151 RAC: 0

I have to say, this is all very disheartening. From what I can gather, the BOINC software includes measures of a host's reliability: if some number of WUs fail, the host is deemed unreliable and will not be sent any more WUs. If the WUs are faulty to begin with, then we are doomed as participants in this noblest of all projects. Those of us, roughly 7,500 out of approximately 90,000 participants, who have stuck out all of the starts and stops deserve the opportunity to work and make a contribution.
Send message Joined: 10 Oct 08 Posts: 19 Credit: 7,191 RAC: 0

Hi. Yes -- my last 3 WUs had the problem -- somebody has messed up big time. Either the threshold needs to be set to something reasonable, or else it should be removed entirely, leaving it to the user to abort a task that runs too long. Here they are, if anyone cares:

Task 18940532 (WU 3963113): sent 17 Aug 2010 21:12:40 UTC, reported 18 Aug 2010 1:05:34 UTC, Over, Client error, Compute error, run time 1,467.44 s, CPU time 4.37 s, no credit
Task 18935277 (WU 3962740): sent 17 Aug 2010 17:04:58 UTC, reported 17 Aug 2010 20:27:04 UTC, Over, Client error, Compute error, run time 1,523.06 s, CPU time 4.53 s, no credit
Task 18920377 (WU 3963598): sent 17 Aug 2010 15:41:35 UTC, reported 18 Aug 2010 0:47:41 UTC, Over, Client error, Compute error, run time 1,630.96 s, CPU time 5.00 s, no credit
Send message Joined: 3 Oct 06 Posts: 101 Credit: 8,972,814 RAC: 2

De facto, the batch named wbnlaug10_DA-scaling-law1 is an example of genuine CPU time waste. ;-) The question is: how and why is this strange test or experiment useful for LHC@home? To mark the end of the SixTrack stage and the start of the SixTrackBNL era??? Some explanation would be very welcome.
Send message Joined: 3 Oct 06 Posts: 101 Credit: 8,972,814 RAC: 2

Maybe I am bad, but I have decided to abort all tasks from the wbnlaug10_DA-scaling-law1 batch. Of course, only the ones I find Ready to start or Running. :-)
Send message Joined: 6 Jul 06 Posts: 107 Credit: 591,975 RAC: 17

This thread is 2 years old, but I have the same problem, so I used the old thread. All the following work units received the same message; they were all running at the same time but were at different completion percentages, yet all appear to have failed at the same moment. Maybe a memory issue?

3645509
3645510
3645512
3645514

Each one's stderr ended with:

<message>
Maximum elapsed time exceeded
</message>

Conan
Send message Joined: 13 Mar 12 Posts: 4 Credit: 205,048 RAC: 0

"This thread is 2 years old, but I have the same problem"

Yes, me too.

25.06.2012 18:37:09 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2 using sixtrack version 44307 in slot 2
25.06.2012 18:43:29 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2 using sixtrack version 44307 in slot 5
25.06.2012 19:31:02 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3 using sixtrack version 44307 in slot 6
25.06.2012 19:32:47 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:07 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:08 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2 using sixtrack version 44307 in slot 5
25.06.2012 20:26:40 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 20:34:45 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)

All WUs break with about 5-10 minutes of estimated time remaining; the progress percentage never stops advancing. Maybe it is not an error in my system (Phenom II X4 @ 3.3 GHz), but it is wasted time either way. I think the other tasks from this batch could be disposed of immediately as well, so that no more CPU time is wasted.
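An aside on the numbers in parentheses in those "Aborting task" lines, for anyone reading along: the client derives the elapsed-time limit by dividing the workunit's floating-point-operation bound by its estimate of the host's speed, and both values appear in the log. A minimal sketch of that arithmetic, using the values above (illustrative only, not the BOINC client's actual code):

```c
#include <stdio.h>

int main(void) {
    /* Values as printed in the log above (both are rounded for display):
       120000.00G = total floating-point operations allowed per workunit
       35.97G     = estimated host/app speed in FLOP/s                  */
    double rsc_fpops_bound = 120000.00e9;
    double flops_estimate  = 35.97e9;

    /* The client aborts a task once elapsed time passes bound/speed. */
    double limit_s = rsc_fpops_bound / flops_estimate;
    printf("elapsed time limit: %.2f s\n", limit_s);
    /* Prints ~3336.11 s; the logged 3335.88 differs slightly because
       the 35.97G shown in the log is itself a rounded figure. */
    return 0;
}
```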
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

I'll look at this soonest, but it will probably have to be tomorrow. Is your machine an AMD Athlon by any chance??? The CPU time is guesstimated based on the maximum number of turns. However, we have/had a problem with performance on AMD due to Intel ifort. I'll also check whether the fix has been installed. See the thread Number Crunching/Windows Linux apps. Eric (aka Bigmac.)
Send message Joined: 13 Mar 12 Posts: 4 Credit: 205,048 RAC: 0

No Athlon -- a Phenom II X4 955, and it has passed very thorough stress tests. But the task protocol says: <core_client_version>7.0.28</core_client_version> It does not look like a performance problem.

PS to Message 23998: the crashed tasks are 3653125, 3653119, 3653107, 3653085.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Agh............ NOT an Athlon, but still a (powerful) AMD system. Sadly, the Intel ifort compiler does NOT use the optimised code on non-Intel hardware....... I have a fix, but I would guess it is not installed yet. I'll confirm tomorrow. Sorry for the wasted cycles. Please stay with us; tomorrow I'll also try to send the performance measurements I have made. Thanks for the feedback. Eric.
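For anyone curious how a compiler's output can slow down on AMD specifically: runtime dispatchers of the kind the Intel compiler generates choose a code path after reading the x86 CPUID vendor string, and anything other than "GenuineIntel" can be routed to a generic, unoptimised fallback. Below is a minimal sketch of reading that vendor string with GCC/Clang on x86; it illustrates the dispatch mechanism only and is not SixTrack or ifort code:

```c
#include <stdio.h>
#include <string.h>
#include <cpuid.h>   /* GCC/Clang helper for the x86 CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    /* A vendor-keyed dispatcher sends anything that is not "GenuineIntel"
       (e.g. AMD's "AuthenticAMD") down the generic, unoptimised path. */
    printf("CPU vendor: %s -> %s code path\n", vendor,
           strcmp(vendor, "GenuineIntel") == 0 ? "optimised" : "fallback");
    return 0;
}
```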
Send message Joined: 2 Oct 11 Posts: 4 Credit: 30,680 RAC: 0

I had 2 tasks earlier today give computation errors, which wasted over 12 hours of crunching time, especially as they came in at high priority, suspending 2 SETI tasks in the process. As SETI is running fine, I suspected that the LHC WUs were at fault, and I have aborted the 7 further WUs in my queue. Funny, I have not had any problems with work from LHC in the past; and yes, this particular computer has an Athlon 64 X2 installed. I note what has been posted earlier and will monitor the message board, and this thread, to see if matters are resolved :)
Send message Joined: 6 Sep 08 Posts: 112 Credit: 9,627,006 RAC: 14,834

I just had two tasks fail like this, wasting over 30 hours between them. Although they were running on a slow old machine, they would still have met the deadline... couldn't the resource limit be set such that this doesn't happen? It's galling to see a task at ~80% done and then aborted before the deadline. John.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Dear John; I understand your frustration and disappointment. Could you please tell me the make and type of machine, and its clock speed in MHz? We have a known problem with Athlon (or any non-Intel) machines which I hope will be fixed next week. I can extend the deadline, but it is already an overestimate, and I get complaints about that anyway, because the actual time required is unpredictable; only the maximum is known. That is a problem with studying chaos! If it is any consolation, I have spent almost ten years ensuring that we get identical results on any IEEE 754 hardware, but now we have this nasty problem with the Intel ifort compiler, which refuses to use optimised code on Athlon. I shall be publishing timing info later today on the thread Linux v Windows apps. Sorry for this, and I'll let you know when we have a fix. Eric.
Send message Joined: 6 Sep 08 Posts: 112 Credit: 9,627,006 RAC: 14,834

Dear Eric, It's all Intel stuff, I'm afraid, and probably just too slow; but until recently I had seen no problems at all running SixTrack, hence the concern at these failures. There is this task on this host (Pentium 4, 3.0 GHz, Windows XP, HT on), which normally runs 4 or 5 projects. Also: this task and this one on this host (Pentium 3, 1.1 GHz, Windows XP, no HT), which normally runs one project. I did say old and slow, but I thought they should work; I've never seen a task fail to validate, so presumably they can make a useful contribution, and the project still sends them work to do. Perhaps I just enjoy living in the past. John.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Thanks John; it must be the beam-beam effect used by these studies. I'll increase the limit by, I guess, 50%. Eric
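To put that 50% figure in context: the client-side abort threshold scales linearly with the per-workunit FLOP bound, so raising the bound by half raises the time limit by half as well. A back-of-the-envelope sketch, reusing the logged values from earlier in this thread (purely illustrative):

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the log earlier in the thread; the 1.5 factor is
       Eric's proposed 50% increase, applied here only for illustration. */
    double old_bound  = 120000.00e9;  /* per-workunit FLOP bound */
    double host_flops = 35.97e9;      /* host speed estimate     */

    double old_limit = old_bound / host_flops;          /* ~3336 s */
    double new_limit = (old_bound * 1.5) / host_flops;  /* ~5004 s */
    printf("limit before: %.0f s, after +50%%: %.0f s\n",
           old_limit, new_limit);
    return 0;
}
```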