Message boards : Number crunching : Maximum elapsed time exceeded

mitrichr
Avatar

Send message
Joined: 20 Dec 07
Posts: 69
Credit: 599,151
RAC: 0
Message 22481 - Posted: 17 Aug 2010, 19:12:06 UTC

I have to say, this is all very disheartening.

From what I can gather, the BOINC software includes measures of reliability of a host. If some number of WU's fail, the host is deemed unreliable and will not be sent any more WU's.

If the WU's are faulty to begin with, then we are doomed as participants in this noblest of all projects.

Those 7500 or so of us out of the approximately 90000 participants who have stuck out all of the starts and stops, we deserve the opportunity to work and make a contribution.

ID: 22481 · Report as offensive     Reply Quote
mfbabb2

Send message
Joined: 10 Oct 08
Posts: 19
Credit: 7,191
RAC: 0
Message 22485 - Posted: 18 Aug 2010, 1:23:11 UTC - in response to Message 21992.  
Last modified: 18 Aug 2010, 1:28:56 UTC

Hi,
I got two tasks which both ended with an error after about 100000 seconds.

wuid=3706346

wuid=3706345

Anybody else got similar? Computer is a Q9400 so it should not be too slow.



Yes -- last 3 WU had the problem -- somebody has messed up big time. Either the threshold needs to be fixed to something reasonable, or else remove it entirely and leave it to the user to abort if it runs too long.

Here they are, if anyone cares:
18940532	3963113	17 Aug 2010 21:12:40 UTC	18 Aug 2010 1:05:34 UTC	Over	Client error	Compute error	1,467.44	4.37	---
18935277	3962740	17 Aug 2010 17:04:58 UTC	17 Aug 2010 20:27:04 UTC	Over	Client error	Compute error	1,523.06	4.53	---
18920377	3963598	17 Aug 2010 15:41:35 UTC	18 Aug 2010 0:47:41 UTC	Over	Client error	Compute error	1,630.96	5.00	---

ID: 22485 · Report as offensive     Reply Quote
metalius
Avatar

Send message
Joined: 3 Oct 06
Posts: 99
Credit: 8,152,904
RAC: 9
Message 22486 - Posted: 18 Aug 2010, 6:14:29 UTC
Last modified: 18 Aug 2010, 6:26:17 UTC

De facto the batch, named wbnlaug10_DA-scaling-law1, is an example of genuine CPU time waste. ;-)
The question is: how and why is this strange test or experiment useful for LHC@home?
To mark the end of the SixTrack stage and the start of the SixTrackBNL era???
Some explanation would be very welcome.
ID: 22486 · Report as offensive     Reply Quote
metalius
Avatar

Send message
Joined: 3 Oct 06
Posts: 99
Credit: 8,152,904
RAC: 9
Message 22489 - Posted: 18 Aug 2010, 9:25:25 UTC

Maybe I am bad, but I decided to abort all tasks from the wbnlaug10_DA-scaling-law1 batch. Of course, only if I find them Ready to start or Running. :-)
ID: 22489 · Report as offensive     Reply Quote
Profile Conan
Avatar

Send message
Joined: 6 Jul 06
Posts: 107
Credit: 511,942
RAC: 0
Message 23997 - Posted: 25 Jun 2012, 12:43:12 UTC

Thread is 2 years old but I have the same problem so used the old thread.

All the following work units received the same message, all were running at the same time but were at different percentage completion times, yet all appear to have failed at the same time?
Maybe a memory issue?

3645509
3645510
3645512
3645514


<![CDATA[
<message>
Maximum elapsed time exceeded
</message>

Conan
ID: 23997 · Report as offensive     Reply Quote
boroda3

Send message
Joined: 13 Mar 12
Posts: 4
Credit: 205,048
RAC: 0
Message 23998 - Posted: 25 Jun 2012, 14:01:59 UTC - in response to Message 23997.  
Last modified: 25 Jun 2012, 14:20:15 UTC

Thread is 2 years old but I have the same problem


Yes, me too.

25.06.2012 18:37:09 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2 using sixtrack version 44307 in slot 2
25.06.2012 18:43:29 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2 using sixtrack version 44307 in slot 5
25.06.2012 19:31:02 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3 using sixtrack version 44307 in slot 6
25.06.2012 19:32:47 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__23__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1686_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:07 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__17__s__62.31_60.32__8_10__5__85.5_1_sixvf_boinc1254_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 19:39:08 | LHC@home 1.0 | Starting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2 using sixtrack version 44307 in slot 5
25.06.2012 20:26:40 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__10__s__62.31_60.32__6_8__5__81_1_sixvf_boinc702_3: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)
25.06.2012 20:34:45 | LHC@home 1.0 | Aborting task w29feb_job_tracking_bignblz__21__s__62.31_60.32__6_8__5__63_1_sixvf_boinc1534_2: exceeded elapsed time limit 3335.88 (120000.00G/35.97G)

All WUs break off with about 5-10 minutes of estimated time remaining, while the percentage is still advancing. It may not be an error of my system (Phenom II X4 @ 3.3GHz), but the time is wasted.

I think the other tasks of this batch can also be aborted immediately, so as not to waste CPU.
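
Editor's note: the log above carries enough numbers to check the abort arithmetic. The sketch below is only a reading of the log, not the BOINC client's code. It assumes the two figures in parentheses are a work bound (in units of 10^9 floating-point operations) and an estimated speed (in 10^9 operations per second), so that their quotient is the elapsed-time limit in seconds. The function names are made up for illustration.

```python
from datetime import datetime

# Start/abort timestamp pairs copied from the log above (DD.MM.YYYY HH:MM:SS),
# in the order: boinc1686_2, boinc1254_2, boinc702_3, boinc1534_2.
RUNS = [
    ("25.06.2012 18:37:09", "25.06.2012 19:32:47"),
    ("25.06.2012 18:43:29", "25.06.2012 19:39:07"),
    ("25.06.2012 19:31:02", "25.06.2012 20:26:40"),
    ("25.06.2012 19:39:08", "25.06.2012 20:34:45"),
]
FMT = "%d.%m.%Y %H:%M:%S"


def elapsed_seconds(start, end):
    """Wall-clock seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def elapsed_limit(bound_g, speed_g):
    """Assumed relation: limit (s) = work bound / estimated speed."""
    return bound_g / speed_g


# Figures from "(120000.00G/35.97G)"; both are rounded in the log,
# so the quotient only approximates the logged limit of 3335.88 s.
limit = elapsed_limit(120000.00, 35.97)
wall = [elapsed_seconds(s, e) for s, e in RUNS]

print(f"computed limit: {limit:.2f} s (log says 3335.88 s)")
for seconds in wall:
    print(f"task ran {seconds:.0f} s before being aborted")
```

Under that reading, all four tasks were stopped within a few seconds of the same limit, about 55.6 minutes, regardless of how far along they were. The limit shrinks as the speed estimate grows, so an overestimated speed would cut tasks off early.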
ID: 23998 · Report as offensive     Reply Quote
Profile Conan
Avatar

Send message
Joined: 6 Jul 06
Posts: 107
Credit: 511,942
RAC: 0
Message 23999 - Posted: 25 Jun 2012, 14:51:32 UTC
Last modified: 25 Jun 2012, 14:52:29 UTC

Yep had another 4 fail the same way (so far all on the same computer) but have also now had 2 successes on a couple of very short work units.

Maximum elapsed time exceeded

3645464
3645516
3645518
3645520

What is the Maximum Time supposed to be?

Conan
ID: 23999 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 24000 - Posted: 25 Jun 2012, 15:34:50 UTC

I'll look at this soonest, but it will probably have to be tomorrow.
Is your machine an AMD Athlon by any chance???
The CPU time is guesstimated based on the maximum number
of turns. However, we have/had a problem with performance
on AMD due to Intel ifort. I'll also check if the fix
has been installed. See thread Number Crunching/Windows Linux apps.

Eric (aka Bigmac.)
ID: 24000 · Report as offensive     Reply Quote
boroda3

Send message
Joined: 13 Mar 12
Posts: 4
Credit: 205,048
RAC: 0
Message 24001 - Posted: 25 Jun 2012, 16:08:04 UTC - in response to Message 24000.  
Last modified: 25 Jun 2012, 16:10:46 UTC

Not an Athlon. Phenom II X4 955. It passed a very strong test.

But the task log says:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x752B3E2E


It doesn't look like a performance problem.

---------------
PS to Message 23998: Crashed tasks are 3653125, 3653119, 3653107, 3653085
ID: 24001 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 24002 - Posted: 25 Jun 2012, 18:38:44 UTC - in response to Message 24001.  

Agh............NOT an ATHLON but still a (powerful)
AMD system. Sadly the Intel ifort compiler does NOT
use the optimised code on non-Intel hardware.......
I have a fix but I would guess it is not installed yet.
I'll confirm tomorrow. Sorry for the wasted cycles.
Please stay with us and tomorrow I'll also try and
send the performance measurements I have made.

Thanks for the feedback. Eric.
ID: 24002 · Report as offensive     Reply Quote
Swordfish

Send message
Joined: 2 Oct 11
Posts: 4
Credit: 30,680
RAC: 0
Message 24003 - Posted: 25 Jun 2012, 21:18:26 UTC

I had 2 tasks earlier today giving computation error, which caused over 12 hrs of crunching time to be wasted, especially as they came in high priority, suspending 2 Seti tasks in the process.

As Seti is running fine I suspected that the LHC WU's were at fault, and have aborted the 7 further WU's in my queue.

Funny, I have not had any problems with work from LHC in the past, and yes, this particular computer has an Athlon 64 X2 installed.

I note what has been posted earlier, and will monitor the message board, and this thread, to see if matters are resolved.

:)
ID: 24003 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 110
Credit: 6,767,631
RAC: 569
Message 24017 - Posted: 30 Jun 2012, 9:19:02 UTC

Just had two tasks fail like this wasting over 30 hours between them. Although on a slow old machine they would have met the deadline... couldn't the resource limit be set such that this doesn't happen? It's galling to see a task at ~80% done and then be aborted before the deadline.

John.
ID: 24017 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 24019 - Posted: 1 Jul 2012, 4:59:54 UTC

Dear John; I understand your frustration and disappointment. Could
you please tell me the make, type of machine and MHz. We have a
known problem with Athlon (or any non-Intel) machines which I
hope will be fixed next week. I can extend the deadline but it is
already an overestimate and I get complaints anyway because the
actual time required is unpredictable; only the maximum is known. That is
a problem with studying chaos! If it is any consolation I have spent
almost ten years ensuring that we get identical results on any
IEEE754 hardware but now have this nasty problem with the Intel ifort
compiler which refuses to use optimised code on Athlon. I shall be
publishing timing info later today on the thread Linux v Windows
apps. Sorry for this and I'll let you know when we have a fix.
Eric.
ID: 24019 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 110
Credit: 6,767,631
RAC: 569
Message 24022 - Posted: 1 Jul 2012, 16:38:16 UTC - in response to Message 24019.  

Dear Eric,

All Intel stuff, I'm afraid, probably just too slow, but until recently, I've seen no problems at all running SixTrack, hence the concern at these failures.

There is this task on this host (Pentium 4, 3.0 GHz, WXP, HT on). Normally runs 4 or 5 projects.
Also:-
This task and this one on this host (Pentium 3, 1.1 GHz, WXP, no HT.)... Normally runs one project. I did say old and slow, but thought it should work. I've never seen a task fail to validate, so presumably it can make a useful contribution, and the project still sends it work to do. Perhaps I just enjoy living in the past.

John.
ID: 24022 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 24027 - Posted: 2 Jul 2012, 11:59:30 UTC - in response to Message 24022.  

Thanks John; it must be the beam-beam effect used by
these studies. I'll increase the limit by, I guess, 50%.

Eric
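
Editor's note: a rough check of whether a 50% increase would cover the case John described earlier in the thread, a task aborted at about 80% done. This is a sketch under a loud assumption: that progress grows linearly with elapsed time, which may not hold for these studies. The function name is hypothetical.

```python
def required_limit_factor(fraction_done_at_abort):
    """Factor by which the limit must grow for the task to finish,
    ASSUMING progress is proportional to elapsed time."""
    if not 0 < fraction_done_at_abort <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_done_at_abort


needed = required_limit_factor(0.80)   # task was ~80% done when aborted
proposed = 1.5                         # "increase the limit by ... 50%"
covers = proposed >= needed

print(f"needed factor: {needed:.2f}, proposed factor: {proposed:.2f}")
print("50% increase is enough" if covers else "50% increase is not enough")
```

On that assumption a task stopped at 80% needs a limit 1.25 times larger, so a factor of 1.5 leaves some margin; a task stopped below about 67% would still not finish.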
ID: 24027 · Report as offensive     Reply Quote


©2020 CERN