Maximum elapsed time exceeded
Send message Joined: 20 Dec 07 Posts: 69 Credit: 599,151 RAC: 0

I have to say, this is all very disheartening. From what I can gather, the BOINC software includes measures of a host's reliability: if some number of WUs fail, the host is deemed unreliable and will not be sent any more WUs. If the WUs are faulty to begin with, then we are doomed as participants in this noblest of all projects. Those of us, roughly 7,500 out of approximately 90,000 participants, who have stuck out all of the starts and stops deserve the opportunity to work and make a contribution.
Send message Joined: 10 Oct 08 Posts: 19 Credit: 7,191 RAC: 0

Hi. Yes -- my last 3 WUs had the problem -- somebody has messed up big time. Either the threshold needs to be set to something reasonable, or else it should be removed entirely, leaving it to the user to abort a task that runs too long. Here they are, if anyone cares:

Task 18940532 (WU 3963113): sent 17 Aug 2010 21:12:40 UTC, reported 18 Aug 2010 1:05:34 UTC, Over, Client error, Compute error, run time 1,467.44 s, CPU time 4.37 s, no credit
Task 18935277 (WU 3962740): sent 17 Aug 2010 17:04:58 UTC, reported 17 Aug 2010 20:27:04 UTC, Over, Client error, Compute error, run time 1,523.06 s, CPU time 4.53 s, no credit
Task 18920377 (WU 3963598): sent 17 Aug 2010 15:41:35 UTC, reported 18 Aug 2010 0:47:41 UTC, Over, Client error, Compute error, run time 1,630.96 s, CPU time 5.00 s, no credit
Send message Joined: 3 Oct 06 Posts: 101 Credit: 8,972,814 RAC: 2

De facto, the batch named wbnlaug10_DA-scaling-law1 is an example of genuine CPU time waste. ;-) The question is: how and why is this strange test or experiment useful for LHC@home? To mark the end of the SixTrack stage and the start of the SixTrackBNL era??? Some explanation would be very welcome.
Send message Joined: 3 Oct 06 Posts: 101 Credit: 8,972,814 RAC: 2

Maybe I am bad, but I have decided to abort all tasks from the wbnlaug10_DA-scaling-law1 batch. Of course, only the ones I find Ready to start or Running. :-)
Send message Joined: 6 Jul 06 Posts: 107 Credit: 591,975 RAC: 17

This thread is 2 years old, but I have the same problem, so I used the old thread. All the following work units received the same message; they were all running at the same time but were at different completion percentages, yet all appear to have failed at the same moment. Maybe a memory issue?

3645509
3645510
3645512
3645514

Each one's stderr ended with:

<message>
Maximum elapsed time exceeded
</message>

Conan
Send message Joined: 13 Mar 12 Posts: 4 Credit: 205,048 RAC: 0

"This thread is 2 years old, but I have the same problem"

Yes, me too.

25.06.2012 18:37:09 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2 using sixtrack version 44307 in slot 2
25.06.2012 18:43:29 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2 using sixtrack version 44307 in slot 5
25.06.2012 19:31:02 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3 using sixtrack version 44307 in slot 6
25.06.2012 19:32:47 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:07 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:08 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2 using sixtrack version 44307 in slot 5
25.06.2012 20:26:40 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 20:34:45 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)

All WUs break with about 5-10 minutes of estimated time remaining; the progress percentage never stops advancing. Maybe it is not an error in my system (Phenom II X4 @ 3.3 GHz), but it is wasted time either way. I think the other tasks from this batch could be disposed of immediately as well, so that no more CPU time is wasted.
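An aside on the numbers in parentheses in those "Aborting task" lines, for anyone reading along: the client derives the elapsed-time limit by dividing the workunit's floating-point-operation bound by its estimate of the host's speed, and both values appear in the log. A minimal sketch of that arithmetic, using the values above (illustrative only, not the BOINC client's actual code):

```c
#include <stdio.h>

int main(void) {
    /* Values as printed in the log above (both are rounded for display):
       120000.00G = total floating-point operations allowed per workunit
       35.97G     = estimated host/app speed in FLOP/s                  */
    double rsc_fpops_bound = 120000.00e9;
    double flops_estimate  = 35.97e9;

    /* The client aborts a task once elapsed time passes bound/speed. */
    double limit_s = rsc_fpops_bound / flops_estimate;
    printf("elapsed time limit: %.2f s\n", limit_s);
    /* Prints ~3336.11 s; the logged 3335.88 differs slightly because
       the 35.97G shown in the log is itself a rounded figure. */
    return 0;
}
```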
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

I'll look at this soonest, but it will probably have to be tomorrow. Is your machine an AMD Athlon by any chance??? The CPU time is guesstimated based on the maximum number of turns. However, we have/had a problem with performance on AMD due to Intel ifort. I'll also check whether the fix has been installed. See the thread Number Crunching/Windows Linux apps. Eric (aka Bigmac.)
Send message Joined: 13 Mar 12 Posts: 4 Credit: 205,048 RAC: 0

No Athlon -- a Phenom II X4 955, and it has passed very thorough stress tests. But the task protocol says: <core_client_version>7.0.28</core_client_version> It does not look like a performance problem.

PS to Message 23998: the crashed tasks are 3653125, 3653119, 3653107, 3653085.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Agh............ NOT an Athlon, but still a (powerful) AMD system. Sadly, the Intel ifort compiler does NOT use the optimised code on non-Intel hardware....... I have a fix, but I would guess it is not installed yet. I'll confirm tomorrow. Sorry for the wasted cycles. Please stay with us; tomorrow I'll also try to send the performance measurements I have made. Thanks for the feedback. Eric.
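For anyone curious how a compiler's output can slow down on AMD specifically: runtime dispatchers of the kind the Intel compiler generates choose a code path after reading the x86 CPUID vendor string, and anything other than "GenuineIntel" can be routed to a generic, unoptimised fallback. Below is a minimal sketch of reading that vendor string with GCC/Clang on x86; it illustrates the dispatch mechanism only and is not SixTrack or ifort code:

```c
#include <stdio.h>
#include <string.h>
#include <cpuid.h>   /* GCC/Clang helper for the x86 CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    /* A vendor-keyed dispatcher sends anything that is not "GenuineIntel"
       (e.g. AMD's "AuthenticAMD") down the generic, unoptimised path. */
    printf("CPU vendor: %s -> %s code path\n", vendor,
           strcmp(vendor, "GenuineIntel") == 0 ? "optimised" : "fallback");
    return 0;
}
```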
Send message Joined: 2 Oct 11 Posts: 4 Credit: 30,680 RAC: 0

I had 2 tasks earlier today give computation errors, which wasted over 12 hours of crunching time, especially as they came in at high priority, suspending 2 SETI tasks in the process. As SETI is running fine, I suspected that the LHC WUs were at fault, and I have aborted the 7 further WUs in my queue. Funny, I have not had any problems with work from LHC in the past; and yes, this particular computer has an Athlon 64 X2 installed. I note what has been posted earlier and will monitor the message board, and this thread, to see if matters are resolved :)
Send message Joined: 6 Sep 08 Posts: 112 Credit: 9,627,006 RAC: 14,834

I just had two tasks fail like this, wasting over 30 hours between them. Although they were running on a slow old machine, they would still have met the deadline... couldn't the resource limit be set such that this doesn't happen? It's galling to see a task at ~80% done and then aborted before the deadline. John.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Dear John; I understand your frustration and disappointment. Could you please tell me the make and type of machine, and its clock speed in MHz? We have a known problem with Athlon (or any non-Intel) machines which I hope will be fixed next week. I can extend the deadline, but it is already an overestimate, and I get complaints about that anyway, because the actual time required is unpredictable; only the maximum is known. That is a problem with studying chaos! If it is any consolation, I have spent almost ten years ensuring that we get identical results on any IEEE 754 hardware, but now we have this nasty problem with the Intel ifort compiler, which refuses to use optimised code on Athlon. I shall be publishing timing info later today on the thread Linux v Windows apps. Sorry for this, and I'll let you know when we have a fix. Eric.
Send message Joined: 6 Sep 08 Posts: 112 Credit: 9,627,006 RAC: 14,834

Dear Eric, It's all Intel stuff, I'm afraid, and probably just too slow; but until recently I had seen no problems at all running SixTrack, hence the concern at these failures. There is this task on this host (Pentium 4, 3.0 GHz, Windows XP, HT on), which normally runs 4 or 5 projects. Also: this task and this one on this host (Pentium 3, 1.1 GHz, Windows XP, no HT), which normally runs one project. I did say old and slow, but I thought they should work; I've never seen a task fail to validate, so presumably they can make a useful contribution, and the project still sends them work to do. Perhaps I just enjoy living in the past. John.
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0

Thanks John; it must be the beam-beam effect used by these studies. I'll increase the limit by, I guess, 50%. Eric
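To put that 50% figure in context: the client-side abort threshold scales linearly with the per-workunit FLOP bound, so raising the bound by half raises the time limit by half as well. A back-of-the-envelope sketch, reusing the logged values from earlier in this thread (purely illustrative):

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the log earlier in the thread; the 1.5 factor is
       Eric's proposed 50% increase, applied here only for illustration. */
    double old_bound  = 120000.00e9;  /* per-workunit FLOP bound */
    double host_flops = 35.97e9;      /* host speed estimate     */

    double old_limit = old_bound / host_flops;          /* ~3336 s */
    double new_limit = (old_bound * 1.5) / host_flops;  /* ~5004 s */
    printf("limit before: %.0f s, after +50%%: %.0f s\n",
           old_limit, new_limit);
    return 0;
}
```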