Message boards : Number crunching : Invalid tasks

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26339 - Posted: 9 Apr 2014, 10:19:45 UTC - in response to Message 26338.  

Thanks Matthias; if I don't get this fixed by Monday I'll
try to get the temporary fix in place using the homogeneous redundancy feature.
Eric.
ID: 26339
Conan

Joined: 6 Jul 06
Posts: 108
Credit: 661,871
RAC: 196
Message 26344 - Posted: 10 Apr 2014, 21:45:44 UTC - in response to Message 26339.  

Thanks Matthias; if I don't get this fixed by Monday I'll
try to get the temporary fix in place using the homogeneous redundancy feature.
Eric.


I hope that works as I have now had 5 fail due to the Windows/Linux validation issue, and 4 of them were the 30,000+ second jobs, so, many hours and over 1,000 points lost.

Conan
ID: 26344
adrianxw

Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 26345 - Posted: 12 Apr 2014, 10:07:53 UTC

The lost time has resulted in my setting "No new tasks". If they know the issue is there, why are they still sending jobs to Windows machines? My machines could have been doing useful work for someone else.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 26345
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 26348 - Posted: 15 Apr 2014, 9:06:25 UTC - in response to Message 26339.  

I don't have statistics to back this up, but my impression is that it is longer-running tasks on Linux that fail to validate against Windows. I have the impression that short runs may be OK, but any tasks that take more than a few hours on my i7-3770 seem to end up as invalidated if they go up against wingmen running Windows. They validate OK if my wingman is also running Linux.

I am particularly bummed out because I have an Intel Atom-powered netbook that was crunching away for more than 2 days on a task, and the wingman ran Windows, so it was invalidated. It would be good if this could be rectified.
ID: 26348
m

Joined: 6 Sep 08
Posts: 116
Credit: 10,927,002
RAC: 2,464
Message 26349 - Posted: 15 Apr 2014, 13:26:43 UTC - in response to Message 26348.  

I haven't noticed this problem; this task took ca. 4 days on a Linux box, and it validated OK against a much faster Windows host.

John.
ID: 26349
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 26350 - Posted: 15 Apr 2014, 21:40:21 UTC - in response to Message 26349.  
Last modified: 15 Apr 2014, 21:41:46 UTC

I suspect it's more of a trend or tendency than a fixed pattern or rule. I didn't analyze all my results, and we can't go far enough back in time anyway. Like you, I suspect I may have had long jobs with Windows wingmen that were validated correctly. However, I do notice that all my recent invalidated results seem to be long jobs with Windows wingmen. That's why, for now, it's only an impression or hunch that I wanted to share.

P.S. Hats off for keeping your Pentium III crunching.
ID: 26350
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26351 - Posted: 16 Apr 2014, 6:53:36 UTC

The problem has now been identified: it lies with the Windows
executables, and affects many, but not all, tasks of a specific study
involving tests of power supply ripple. A fix, using a different
ifort compiler, is available; if it can't be installed pronto, I am
hoping to use homogeneous redundancy as a temporary
workaround to avoid invalid results (although this will not fix the
physics).
I am really, really sorry about the long delay, but the nasty problem
is creating Windows executables which don't give "cannot create task"
errors or syntax problems with the PC description.
More news soonest. Eric.
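
For readers unfamiliar with homogeneous redundancy, here is a minimal Python sketch of the idea as I understand it: once the first copy of a work unit has gone to a host of one platform class, further copies go only to hosts of the same class, so a Linux result is never compared with a Windows one. The `Host`, `WorkUnit`, `hr_class` and `send` names are hypothetical and this is not the BOINC server's actual code; grouping by OS name alone is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    host_id: int
    os_name: str  # e.g. "Windows" or "Linux"


@dataclass
class WorkUnit:
    wu_id: int
    # Class of the first host that received a copy; None until then.
    committed_class: Optional[str] = None


def hr_class(host: Host) -> str:
    """Hypothetical coarse class: group hosts by OS family only."""
    return host.os_name.lower()


def send(wu: WorkUnit, host: Host) -> bool:
    """Send a copy only if the host matches the work unit's committed class."""
    cls = hr_class(host)
    if wu.committed_class is not None and wu.committed_class != cls:
        return False  # would mix platforms, so refuse
    wu.committed_class = cls
    return True


wu = WorkUnit(wu_id=1)
print(send(wu, Host(1, "Linux")))    # first copy commits the work unit to "linux"
print(send(wu, Host(2, "Windows")))  # refused: different class
print(send(wu, Host(3, "Linux")))    # accepted: same class
```

As the post says, this only hides the disagreement between platforms; it does nothing to make the Windows results correct.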
ID: 26351
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26352 - Posted: 17 Apr 2014, 7:21:46 UTC

Here we have quite a strange invalid result : wuid=16556468

One of the first two copies sent out was returned after the deadline, so the server-side scheduler decided to send out some more.

The second one (on time, Linux) didn't validate against the third and fourth (both on time, Win x64), but when the delayed first result (Win x64) came back, it validated against the Linux result.
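
To make the sequence above easier to follow, here is a rough Python sketch of quorum validation under simple assumptions: each result is reduced to a single number, two results "agree" if they match within a tolerance, and a work unit validates once any two results agree. The function names, the tolerance and the numbers are all made up for illustration; the real SixTrack validator compares result files and may work differently.

```python
from itertools import combinations
from typing import Dict, Optional, Set


def agree(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Two results agree if they differ by at most rel_tol, relatively."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


def find_quorum(results: Dict[str, float], min_quorum: int = 2) -> Optional[Set[str]]:
    """Return the first group of min_quorum mutually agreeing results, or None."""
    for group in combinations(results, min_quorum):
        if all(agree(results[x], results[y]) for x, y in combinations(group, 2)):
            return set(group)
    return None


# Hypothetical values: the three on-time results all disagree with each other.
on_time = {"linux_2": 1.0, "win_3": 1.1, "win_4": 1.2}
print(find_quorum(on_time))

# The late first result arrives and happens to match the Linux one.
with_late = dict(on_time, win_1=1.0)
print(find_quorum(with_late))
```

Under these assumptions the work unit stays inconclusive until the late result arrives, and the two newer Windows results end up invalid, which matches what was observed here.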
ID: 26352
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26466 - Posted: 17 May 2014, 12:33:24 UTC

2 inconclusive ones :

with SixTrack v451.07 wtest_newnuebb0105__5__s__64.31_59.32__6_8__5__30_1_sixvf_boinc610 waiting for a third result

with SixTrack v451.07 w14_eric_job_tracking_bb_np_nt_fset_240214__13__s__62.31_60.32__10_12__6__82.5_1_sixvf_boinc4540 invalid

In both cases the runtime on my box was extremely low.

Unusual: all boxes ran Windows, and for some reason mine always picked SSE3 where the others picked PNI ... but OTOH, I patched my clients to report SSE3 (5.10.28 didn't know that extension yet).

ID: 26466
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26468 - Posted: 17 May 2014, 12:55:22 UTC - in response to Message 26466.  

Unusual: all boxes ran Windows, and for some reason mine always picked SSE3 where the others picked PNI ... but OTOH, I patched my clients to report SSE3 (5.10.28 didn't know that extension yet).

Usual. 'SSE3' and 'Prescott New Instructions' are synonyms, and the applications are identical.
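
A small Python sketch of that point, in case it helps: if you compare CPU feature lists from different hosts, treat the two names as one. Linux reports SSE3 as `pni` in /proc/cpuinfo; the `ALIASES` table and `normalize_features` function are just illustrative names, not anything from BOINC.

```python
from typing import Set

# 'pni' (Prescott New Instructions) is another name for SSE3.
ALIASES = {"pni": "sse3"}


def normalize_features(flags: str) -> Set[str]:
    """Lower-case a space-separated feature list and fold aliases together."""
    return {ALIASES.get(flag, flag) for flag in flags.lower().split()}


# Two hosts reporting the same capability under different names.
print(normalize_features("fpu sse sse2 pni") == normalize_features("FPU SSE SSE2 SSE3"))
```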
ID: 26468
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26469 - Posted: 17 May 2014, 13:06:59 UTC - in response to Message 26468.  

Unusual: all boxes ran Windows, and for some reason mine always picked SSE3 where the others picked PNI ... but OTOH, I patched my clients to report SSE3 (5.10.28 didn't know that extension yet).

Usual. 'SSE3' and 'Prescott New Instructions' are synonyms, and the applications are identical.

Yes, I already learned that here - it was just surprising that mine always picked the sse3, whereas others picked pni ... until I remembered my core client patch
ID: 26469
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26471 - Posted: 17 May 2014, 14:12:56 UTC - in response to Message 26466.  

[quote]2 inconclusive ones :

with SixTrack v451.07 wtest_newnuebb0105__5__s__64.31_59.32__6_8__5__30_1_sixvf_boinc610 waiting for a third result[/quote]

Looks OK now, I think.

[quote]with SixTrack v451.07 w14_eric_job_tracking_bb_np_nt_fset_240214__13__s__62.31_60.32__10_12__6__82.5_1_sixvf_boinc4540 invalid[/quote]

This was a glitch on our side, giving "No such file or directory".

Thanks. Eric.
ID: 26471
tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26473 - Posted: 17 May 2014, 19:42:45 UTC

Hi, I have a wingman with 888 WUs:

http://lhcathomeclassic.cern.ch/sixtrack/show_host_detail.php?hostid=10137504

All of them finished within a second, while my machine has been working for 15 hours on two of them. As of now, ~490 are pending validation and ~400 are validation inconclusive. None has validated.
I think this machine does not like SixTrack v451.07 (sse2) at all. Or something else is wrong.
ID: 26473
Qax

Joined: 22 Nov 10
Posts: 5
Credit: 778,394
RAC: 0
Message 26477 - Posted: 18 May 2014, 7:27:50 UTC - in response to Message 26473.  

I'm getting some errors too:

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17279643
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17279642
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17279637
ID: 26477
Qax

Joined: 22 Nov 10
Posts: 5
Credit: 778,394
RAC: 0
Message 26480 - Posted: 18 May 2014, 8:27:19 UTC - in response to Message 26477.  

I don't like this. . .

Task      Work unit  Sent                      Time reported             Status                 Run time (sec)  CPU time (sec)  Credit  Application
36856057  17279644   18 May 2014, 1:33:48 UTC  18 May 2014, 8:15:38 UTC  Error while computing  10,104.09       7,861.61        ---     SixTrack v451.07 (pni)
36856055  17279643   18 May 2014, 1:33:48 UTC  18 May 2014, 4:44:45 UTC  Error while computing  10,104.57       7,076.31        ---     SixTrack v451.07 (pni)
36856054  17279642   18 May 2014, 1:33:48 UTC  18 May 2014, 4:29:28 UTC  Error while computing  10,105.04       8,444.38        ---     SixTrack v451.07 (pni)
36856049  17279640   18 May 2014, 1:33:48 UTC  18 May 2014, 8:15:38 UTC  Error while computing  10,104.29       5,325.66        ---     SixTrack v451.07 (pni)
36856043  17279637   18 May 2014, 1:33:48 UTC  18 May 2014, 5:22:37 UTC  Error while computing  10,104.04       8,278.92        ---     SixTrack v451.07 (pni)
36856035  17279633   18 May 2014, 1:33:48 UTC  18 May 2014, 4:29:28 UTC  Error while computing  10,104.18       8,430.53        ---     SixTrack v451.07 (pni)
36856011  17279621   18 May 2014, 1:33:48 UTC  18 May 2014, 4:32:30 UTC  Error while computing  10,103.88       7,055.89        ---     SixTrack v451.07 (pni)
ID: 26480
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26481 - Posted: 18 May 2014, 9:05:18 UTC

This is the 'EXIT_TIME_LIMIT_EXCEEDED' error we were discussing yesterday in the thread of the same name.

Host 10308609 has an APR of 178.17200653414 for the 64-bit PNI app, version 451.07 (production).
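
A rough sketch of the arithmetic, assuming the client kills a task once its elapsed time exceeds rsc_fpops_bound divided by the speed the server has estimated for the host, and assuming the APR is expressed in GFLOPS. The rsc_fpops_bound value used here is a made-up illustration, chosen only because it gives a limit close to the run times being reported; it is not the project's real setting.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, apr_gflops: float) -> float:
    """Elapsed-time limit implied by a work bound and an estimated speed."""
    return rsc_fpops_bound / (apr_gflops * 1e9)


APR_GFLOPS = 178.17200653414  # the APR quoted for this host
HYPOTHETICAL_BOUND = 1.8e15   # assumed value, for illustration only

limit = max_elapsed_seconds(HYPOTHETICAL_BOUND, APR_GFLOPS)
print(f"limit with the quoted APR: {limit / 3600:.1f} hours")

# If the host's true speed were a tenth of the APR, the same bound
# would allow ten times as long.
realistic = max_elapsed_seconds(HYPOTHETICAL_BOUND, APR_GFLOPS / 10)
print(f"limit with a 10x lower APR: {realistic / 3600:.1f} hours")

# Multiplying rsc_fpops_bound has the same effect on the limit.
raised = max_elapsed_seconds(HYPOTHETICAL_BOUND * 10, APR_GFLOPS)
print(f"limit with a 10x higher bound: {raised / 3600:.1f} hours")
```

The point is that an inflated APR shortens the limit, so long but perfectly healthy tasks get killed.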

Qax, this is a problem on the server, not on your computer. Eric is aware of it. Don't adjust your settings, but you might prefer to concentrate on another project for a few hours.

Eric, we probably need that high rsc_fpops_bound multiplier on the production tasks for a while at least.

Do we have outlier detection in the current validator? If not, we need it, or this will keep happening. Once the server has been 'vaccinated' against outliers, you could try running 'reset credit statistics for this application' from Estimating job resource requirements.
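
To illustrate why outliers matter, here is a minimal Python sketch that averages a processing rate with and without very short runs. The rule used (ignore anything that ran for less than a minute) and all the numbers are assumptions for illustration; this is not the server's actual algorithm.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Each sample is (estimated floating-point operations, elapsed seconds).
Sample = Tuple[float, float]


def average_rate(samples: List[Sample], min_runtime_s: Optional[float] = None) -> float:
    """Mean of fpops/elapsed, optionally skipping runs shorter than min_runtime_s."""
    kept = [(fpops, secs) for fpops, secs in samples
            if min_runtime_s is None or secs >= min_runtime_s]
    return mean(fpops / secs for fpops, secs in kept)


# Hypothetical history: two normal runs and one that exited after 5 seconds.
history = [(3e13, 10000.0), (3e13, 12000.0), (3e13, 5.0)]

print(f"all runs:          {average_rate(history) / 1e9:.1f} GFLOPS")
print(f"outliers excluded: {average_rate(history, min_runtime_s=60.0) / 1e9:.2f} GFLOPS")
```

One very short run is enough to inflate the average by orders of magnitude, which is how a host can end up with an implausible APR.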
ID: 26481
Qax

Joined: 22 Nov 10
Posts: 5
Credit: 778,394
RAC: 0
Message 26482 - Posted: 18 May 2014, 9:08:42 UTC - in response to Message 26481.  

Should I cancel all the WUs I downloaded, or just suspend the project?
ID: 26482
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26483 - Posted: 18 May 2014, 9:27:18 UTC - in response to Message 26482.  

Should I cancel all the WUs I downloaded, or just suspend the project?

Up to you. As things stand, any task which runs longer than three hours on your machine will be killed: you can cure that (gradually) by running shorter tasks, but catch-22 says that you can't know which tasks are going to be short (that's what the project is here to discover) - except perhaps by waiting until your 'wingmate' has completed and reported their copy of the task. But that's hard work.

I see you've run many BOINC projects, for many years. How much have you learned about how it works under the hood? Does the phrase "edit client_state.xml" fill you with dread?

There are ways of solving this problem locally, but they require knowledge and care. If you know enough about editing client_state to be worried by it, then you're probably in the right place to learn some more: but if you've never come across it before, then I think I'd advise against.
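
For the curious, a rough Python sketch of what such a local edit could look like, assuming the time limit comes from an `<rsc_fpops_bound>` element inside each `<workunit>` entry of client_state.xml. It works on a small sample string rather than a real file; the sample values and the `raise_fpops_bound` helper are hypothetical. Anyone trying this on a real file should stop the BOINC client first and keep a backup copy.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for client_state.xml; the values are invented.
SAMPLE = """<client_state>
  <workunit>
    <name>example_wu_1</name>
    <rsc_fpops_est>1.8e14</rsc_fpops_est>
    <rsc_fpops_bound>1.8e15</rsc_fpops_bound>
  </workunit>
</client_state>"""


def raise_fpops_bound(xml_text: str, factor: float) -> str:
    """Multiply every rsc_fpops_bound value by factor; leave everything else alone."""
    root = ET.fromstring(xml_text)
    for bound in root.iter("rsc_fpops_bound"):
        bound.text = f"{float(bound.text) * factor:.6e}"
    return ET.tostring(root, encoding="unicode")


print(raise_fpops_bound(SAMPLE, 10))
```

This only changes how long the client lets a task run; it does not touch the computation itself or the results it produces.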
ID: 26483
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26484 - Posted: 18 May 2014, 10:38:34 UTC - in response to Message 26481.  

Thanks again; I am treating this as top priority now and
trying to get to grips with it all. The error rate is reaching
3%, which is too high. Eric.
ID: 26484
Qax

Joined: 22 Nov 10
Posts: 5
Credit: 778,394
RAC: 0
Message 26485 - Posted: 18 May 2014, 11:18:34 UTC - in response to Message 26483.  

I started running SETI in 2000. For some reason, I can't connect my old classical account to my new one. I'm thinking I must have changed the e-mail at some point towards the end, and now I can't remember it.

I try not to get too involved in "hacks" for these things. For the longest time I wouldn't run them on an overclocked machine, because I was always afraid that might compromise the data. And to me, the integrity of the results is the most important thing.

Right now I am running WUs on my server. Last I checked, they were working fine. But I'm having a very low success rate on my home PC, so I suspended the project for now. However, I have at least a dozen in the chamber ready to go, and 4 or 5 of them are already over 3 hours long. So I guess I should just kill those, then?

BTW - It's not that I'm not curious about how things work. I run the program on a Linux server via the command line, more than 2000 miles away from me. But tweaking things to run in a way the programmers didn't intend always makes me worry about changing the results in some way.
ID: 26485


©2024 CERN