Message boards : Number crunching : Tasks v530.09 crashing

thierry.l

Joined: 21 Aug 05
Posts: 14
Credit: 119,137
RAC: 0
Message 23411 - Posted: 8 Oct 2011, 11:30:20 UTC - in response to Message 23389.  

Hi,

I have two computers that can no longer run LHC because of ERR 168 with the new version: -1-, -2-.
Of course, you may choose to run the application only on new quantum computers, or to use a larger share of the computers in the world.
The applications page says "Microsoft Windows (98 or later) running on an Intel x86-compatible CPU", so you chose to support at least W98, which means you want old computers too.
So I think you just have to recompile without this optimization to solve the problem, or check each computer's capabilities when distributing tasks.
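
The capability check suggested above could look roughly like the sketch below. It assumes the server can see a whitespace-separated list of CPU feature flags for each host (the `features` string and both function names are hypothetical), and it takes SSE2 as the required feature, following the developers' explanation later in this thread. It is an illustration only, not the project's actual scheduler logic.

```python
# Sketch only: the feature-string format and the function names are assumptions,
# not the project's scheduler code.

OPTIMIZED_VERSION = "530.09"   # build reported in this thread as crashing on older CPUs
GENERIC_VERSION = "530.08"     # previous build


def has_features(features, required):
    """Return True if every required flag appears in the host's feature list."""
    return set(required) <= set(features.lower().split())


def pick_version(features):
    """Choose the optimized build only for hosts that report SSE2."""
    if has_features(features, {"sse2"}):
        return OPTIMIZED_VERSION
    return GENERIC_VERSION


if __name__ == "__main__":
    print(pick_version("fpu tsc mmx sse"))            # older CPU without SSE2
    print(pick_version("fpu tsc mmx sse sse2 pni"))   # newer CPU
```

A check like this would keep old hosts on the generic build without withdrawing the faster one from everybody else.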

Regards

ID: 23411
Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 23415 - Posted: 9 Oct 2011, 8:27:39 UTC - in response to Message 23411.  

We don't have many architectural choices when specifying which app version to run.

I have now retracted (deleted) 530.9 for all generic x86 Windows and Linux,
leaving 530.9 only for platforms that report AMD_x86_64 or Intel EM64T processors back to the server.

Please check whether that works for you.



skype id: igor-zacharov
ID: 23415
metalius

Joined: 3 Oct 06
Posts: 99
Credit: 8,152,957
RAC: 1
Message 23420 - Posted: 9 Oct 2011, 11:13:19 UTC

Yes, 530.9 has disappeared, and I now see v0.00 in its place on the task pages. :-)
That is all.
ID: 23420
thierry.l

Joined: 21 Aug 05
Posts: 14
Credit: 119,137
RAC: 0
Message 23422 - Posted: 9 Oct 2011, 12:03:51 UTC - in response to Message 23415.  

Well, I will check as soon as I have some work (the project has no jobs available right now).
I just hope that you didn't stop distribution to all x86 computers, because my third computer didn't have the same problem as the two older ones.
ID: 23422
tullio

Joined: 19 Feb 08
Posts: 609
Credit: 3,819,080
RAC: 430
Message 23425 - Posted: 9 Oct 2011, 14:47:35 UTC
Last modified: 9 Oct 2011, 14:48:25 UTC

My Opteron 1210 is SSE3-capable and has completed a v0.00 task, which is now waiting for validation. CPU time 42k s, run time 91k s. But I am running five other BOINC projects, including Test4Theory@home; more exactly, BOINC_VM running one CERN job after another 24/7.
Tullio
ID: 23425
Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester

Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23426 - Posted: 9 Oct 2011, 15:24:53 UTC - in response to Message 23415.  
Last modified: 9 Oct 2011, 15:27:45 UTC

We don't have many architectural choices when specifying which app version to run.

I have now retracted (deleted) 530.9 for all generic x86 Windows and Linux,
leaving 530.9 only for platforms that report AMD_x86_64 or Intel EM64T processors back to the server.

Please check whether that works for you.



I guess it is working.

My x64 hosts have no work, but the last tasks they completed were 530.09.

My x32 hosts are showing v0.00.

I also now see "Database Error" on a lot of the website task pages when viewing results. Some pages show it twice.
ID: 23426
m

Joined: 6 Sep 08
Posts: 110
Credit: 6,802,550
RAC: 936
Message 23429 - Posted: 9 Oct 2011, 16:25:08 UTC
Last modified: 9 Oct 2011, 16:33:45 UTC

It's not working for me, I'm afraid. The original host has just been sent a 530.09 task, which crashed...

John.
ID: 23429
Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester

Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23430 - Posted: 9 Oct 2011, 18:53:48 UTC

Yeah, something is not working.

I checked one of the x32 hosts; BOINC Manager shows 530.09 as running.

The task shows v0.00 on the website.

I also looked in the LHC project folder and the slot for the task; there is only a 530.9 application.

On the website, tasks for the x32 hosts show as v530.08 and then v0.00; no v530.09 tasks appear in the task lists. For the x64 hosts, only v530.09 shows as run. Very odd.
ID: 23430
Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 23431 - Posted: 9 Oct 2011, 20:21:07 UTC - in response to Message 23430.  

After consultation with Eric McIntosh, we decided to retract the
530.9 version completely. I have reinstalled the 530.8 version, now called 530.10.

We will come back to it after a better investigation. I have to admit a mistake.
skype id: igor-zacharov
ID: 23431
m

Joined: 6 Sep 08
Posts: 110
Credit: 6,802,550
RAC: 936
Message 23432 - Posted: 9 Oct 2011, 20:46:39 UTC

All running well again. Thanks.

John.
ID: 23432
thierry.l

Joined: 21 Aug 05
Posts: 14
Credit: 119,137
RAC: 0
Message 23438 - Posted: 10 Oct 2011, 14:18:29 UTC - in response to Message 23431.  

Version 530.10 seems to be fine: no crash at startup.
Thanks for the update.
ID: 23438
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 23439 - Posted: 10 Oct 2011, 14:41:17 UTC

MEA CULPA. Having concentrated so much on the floating-point model options for the ifort compiler I rather forgot the basic PC architecture. I was anxious to get a two to four times faster version into production so as to maximise the use of your systems. I just removed the arch IA32 flags to allow use of SSE2 (which is floating-point compatible for me) and thus generated an executable for the very modern Linux and Windows PCs in my office. NOT a good idea as we see from your messages. So Igor has put us back to my Version 4308 or BOINC 530.8/530.10.
We shall get this sorted out as soon as possible and have different executables for different platforms so as to optimise resource utilisation. (We have also increased, probably by too much, the fpops, disk space, and elapsed time estimates.) This will hopefully give us some breathing space as it is now absolutely vital that I check the physics results of all these recent studies.

The problem is basically that I had to switch to the Intel IFORT compiler with the new BOINC version. With the appropriate fp-model flags this worked really well, until I found a small number of 1 ULP differences on the formatted input of the accelerator description.
Apparently this difference appears between the Linux and Windows executables even on the same hardware. The problem of formatted input is well understood and was largely solved by David M. Gay some twenty years ago, and is, I believe, handled correctly by C99. As a (temporary) solution I now read the data as Single Precision. The recent studies, including those very short runs, not even one turn, will allow me to evaluate the physics impact of this change. If the effect is too large I shall have to replace the Fortran formatted input IFORT runtime routine with a correct C strtod... sigh. This would be useful in the longer term, as it would hopefully allow the use of other compilers while still producing identical results. Again, all this should be a non-issue when compilers with the new Fortran 2003 Formatted I/O ROUND options become available.
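
To illustrate what a 1 ULP difference on formatted input means, here is a small Python sketch. It is not SixTrack code, and the sample value is made up. CPython's `float()` is correctly rounded, so it stands in for a "good" reader, and the neighbouring double stands in for a reader that is one unit in the last place too high. Rounding both to single precision shows how the difference can vanish for this particular value; that is not guaranteed for every input.

```python
import struct


def double_bits(x):
    """Return the IEEE 754 bit pattern of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def next_up(x):
    """Next representable double above a positive finite x."""
    return struct.unpack("<d", struct.pack("<q", double_bits(x) + 1))[0]


def ulp_distance(a, b):
    """Number of ULP steps between two positive finite doubles."""
    return abs(double_bits(a) - double_bits(b))


def to_single(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


text = "0.1"                 # made-up input field, not real accelerator data
good = float(text)           # correctly rounded conversion
off_by_one = next_up(good)   # what a reader that is 1 ULP high would return

print(ulp_distance(good, off_by_one))            # 1
print(good == off_by_one)                        # False
print(to_single(good) == to_single(off_by_one))  # True for this value
```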

Thanks for your understanding. Eric.

ID: 23439
metalius

Joined: 3 Oct 06
Posts: 99
Credit: 8,152,957
RAC: 1
Message 23441 - Posted: 10 Oct 2011, 15:35:49 UTC - in response to Message 23438.  
Last modified: 10 Oct 2011, 15:35:58 UTC

Version 530.10 seems to be fine...

I do not agree: new, never-before-seen problems have started. ;-)
The new problem is Error 148:
<message>
couldn't start CreateProcess() failed - : -148
</message>

Recommendations are very welcome...
ID: 23441
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 23442 - Posted: 10 Oct 2011, 15:50:48 UTC - in response to Message 23441.  

Thanks I'll get right onto it.
Probably the resource limits are TOO big now.
Eric.
ID: 23442
Ageless

Joined: 18 Sep 04
Posts: 143
Credit: 27,645
RAC: 0
Message 23444 - Posted: 10 Oct 2011, 21:31:32 UTC - in response to Message 23442.  

530.09 <rsc_fpops_est> is <rsc_fpops_est>30000000000000.000000</rsc_fpops_est>
530.10 <rsc_fpops_est> is <rsc_fpops_est>120000000000000.000000</rsc_fpops_est>

Is there a reason why you increased the estimated run time to 4 times the original value? These tasks run for approximately 8 hours on my i3-530, but they're estimated to run a whole lot longer, which makes them run in panic mode (high priority).

Just because you went back to a lower form of instruction set does not mean everything will run that much slower. ;-)
(With all those tricks, the LHC Classic DCF on my system is now completely haywire. Where the other projects that run have their DCF around 1.0, LHC has it at 2.9; I may consider resetting the project, so my DCF will reset to 1.0).
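
For reference, the arithmetic behind that question, as a rough sketch. It assumes the client estimates run time as `rsc_fpops_est` divided by the host's benchmarked floating-point speed, multiplied by the project's DCF. The 2.5 GFLOPS benchmark figure is a made-up example, not a measured value.

```python
def estimated_seconds(rsc_fpops_est, host_flops, dcf=1.0):
    """Rough run-time estimate: work estimate / host speed, scaled by the DCF."""
    return rsc_fpops_est / host_flops * dcf


FPOPS_530_09 = 30_000_000_000_000.0    # from the 530.09 workunit
FPOPS_530_10 = 120_000_000_000_000.0   # from the 530.10 workunit
HOST_FLOPS = 2.5e9                     # hypothetical benchmark result (2.5 GFLOPS)

print(FPOPS_530_10 / FPOPS_530_09)     # factor between the two estimates
for label, fpops in (("530.09", FPOPS_530_09), ("530.10", FPOPS_530_10)):
    for dcf in (1.0, 2.9):
        hours = estimated_seconds(fpops, HOST_FLOPS, dcf) / 3600
        print(f"{label} dcf={dcf}: {hours:.1f} h")
```

Under these assumptions, the fourfold fpops estimate and a DCF of 2.9 multiply together, which is how an 8-hour task can end up with an estimate several times longer.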
Jord

BOINC FAQ Service
ID: 23444
trigggl

Joined: 17 Feb 09
Posts: 22
Credit: 311,184
RAC: 0
Message 23446 - Posted: 10 Oct 2011, 23:47:12 UTC - in response to Message 23431.  
Last modified: 11 Oct 2011, 0:07:28 UTC

After consultation with Eric McIntosh, we decided to retract the
530.9 version completely. I have reinstalled the 530.8 version, now called 530.10.

We will come back to it after a better investigation. I have to admit a mistake.

So, we're back to horrible run times and credits for Linux? I'll wait for the return of an improved .09.

How about using .09 (.11) for x86_64 and .10 for i686?
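
A sketch of that split, keyed on the platform name a host reports. The platform strings are the usual BOINC names for 32-bit and 64-bit Windows and Linux; the table and the function are hypothetical, and this is not how the server is actually configured. Every x86_64 CPU supports SSE2, so a 64-bit platform is a safe proxy for it, although 32-bit hosts that do have SSE2 would still get the slower build.

```python
# Hypothetical mapping; version numbers are the ones discussed in this thread.
SSE2_BUILD = "530.09"
GENERIC_BUILD = "530.10"

VERSION_BY_PLATFORM = {
    "windows_x86_64": SSE2_BUILD,
    "x86_64-pc-linux-gnu": SSE2_BUILD,
    "windows_intelx86": GENERIC_BUILD,
    "i686-pc-linux-gnu": GENERIC_BUILD,
}


def version_for(platform):
    """Return the app version for a platform, or None if it is not supported."""
    return VERSION_BY_PLATFORM.get(platform)


if __name__ == "__main__":
    for name in ("x86_64-pc-linux-gnu", "i686-pc-linux-gnu"):
        print(name, "->", version_for(name))
```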
ID: 23446
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 852
Credit: 1,619,050
RAC: 0
Message 23450 - Posted: 11 Oct 2011, 4:43:50 UTC - in response to Message 23444.  

I need to discuss with Igor. We have been seeing complaints
about resource limits exceeded but we are not sure which.
Seems to me I should put back the fpops to the previous
value. However the current ifort version is actually
rather slow until I get round to optimising it.
Sorry about that. In the past we could test changes "in house"
but not at the moment. Thanks for the feedback. Eric.
ID: 23450
Antjest

Joined: 30 Sep 04
Posts: 21
Credit: 1,442,034
RAC: 0
Message 23453 - Posted: 11 Oct 2011, 9:36:42 UTC

530.10 is much less efficient than 530.09 on my Core 2 Quad.
It's 15-33% slower, and lower core temperatures also indicate that the processor is using less power.

I had no problems with 530.09 on Windows XP on my two overclocked computers (so far only one invalid result, and that was my fault after some manual intervention in BOINC Manager).

Perhaps others overclock too much or have some other hardware problem with their computers.
ID: 23453
Ageless

Joined: 18 Sep 04
Posts: 143
Credit: 27,645
RAC: 0
Message 23454 - Posted: 11 Oct 2011, 9:40:00 UTC - in response to Message 23450.  
Last modified: 11 Oct 2011, 9:44:02 UTC

However the current ifort version is actually rather slow until I get round to optimising it.

I'd like to disagree. :-)

Two of my last 530.09s:
CPU time 23271.08 seconds.
CPU time 27747.53 seconds.

This versus two 530.10s that have finished:
CPU time 11122.58 seconds.
CPU time 10511.39 seconds.

If you manage to decrease my run time by half to a third without optimizations, then please don't optimize the applications any further. ;-)

(I fully understand that not all tasks are the same length in run time around here.)
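
Worked out from the four CPU times above, as a quick sketch. It is a very small sample, and the tasks are not all the same length, so the result is only indicative.

```python
from statistics import mean

cpu_530_09 = [23271.08, 27747.53]  # seconds, from the two 530.09 tasks above
cpu_530_10 = [11122.58, 10511.39]  # seconds, from the two 530.10 tasks above

ratio = mean(cpu_530_10) / mean(cpu_530_09)
print(f"530.09 mean: {mean(cpu_530_09):.0f} s")
print(f"530.10 mean: {mean(cpu_530_10):.0f} s")
print(f"530.10 takes about {ratio:.0%} of the 530.09 CPU time")
```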

In the past we could test changes "in house" but not at the moment.

Well, you have us. You can send that work either to hosts you trust, or have a small group of us do some alpha work with feedback.
Jord

BOINC FAQ Service
ID: 23454
trigggl

Joined: 17 Feb 09
Posts: 22
Credit: 311,184
RAC: 0
Message 23455 - Posted: 11 Oct 2011, 11:04:53 UTC - in response to Message 23454.  

Two of my last 530.09s:
CPU time 23271.08 seconds.
CPU time 27747.53 seconds.

This versus two 530.10s that have finished:
CPU time 11122.58 seconds.
CPU time 10511.39 seconds.


So, perhaps 530.09 slowed Windows down rather than speeding Linux up? Well, either way, if I'm validating against a Windows computer with 530.10, I'm at a great disadvantage.
ID: 23455


©2020 CERN