Message boards : Number crunching : workunits made to fail?

Ano

Joined: 29 Nov 09
Posts: 42
Credit: 229,229
RAC: 0
Message 27425 - Posted: 7 May 2015, 15:34:03 UTC

Hi,

Once in a while I get a workunit that is temporarily labeled "inconclusive", but I had not seen an actual error for quite a while, which is why I thought of starting a topic.
In one workunit 3 others got errors and 1 was cancelled by the server, and in another 8 others got errors and 1 was cancelled by the server. Can I assume those workunits were designed to error out?
ID: 27425
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27426 - Posted: 7 May 2015, 16:21:42 UTC - in response to Message 27425.  

Not exactly 'designed to fail', but do note that they all have the application name "sixtracktest" - they are test workunits.

As with all testing, nobody knows for certain whether they will work or not - if we knew that, the test would be over! So, it's not certain in advance whether they will fail or not, and it helps the scientists if you run them anyway, to find out.
ID: 27426
Harri Liljeroos

Joined: 28 Sep 04
Posts: 675
Credit: 43,537,005
RAC: 15,574
Message 27427 - Posted: 7 May 2015, 17:46:11 UTC

I have here one that was cancelled by server after 18 minutes of crunching. This is a normal WU, not a sixtracktest WU. The other copy is still being crunched by another host.
ID: 27427
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27428 - Posted: 8 May 2015, 8:19:21 UTC - in response to Message 27425.  

Right, as Richard says, we are testing. I submitted a batch of
work which all failed with the "time limit exceeded" error, so I
cancelled all the others so as not to waste your resources. The next
batch of 59 cases looks better, but we are seeing a problem
in that the wrong (not the new) executable is being used.
These tests should be using SixTrack Version 4522.
Looking at that right now. Eric.
ID: 27428
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27429 - Posted: 8 May 2015, 8:22:50 UTC - in response to Message 27427.  

Looks like we must have wrongly cancelled it. Apologies. Eric.
(It is not exactly trivial to cancel as the ops page takes
a range of WU IDs and boinc and boinctest are mixed together,
really sorry.)
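
To illustrate why a plain ID range is risky when two applications share the same ID space, here is a minimal sketch in Python. It is not the ops page's actual logic: the `Workunit` record, the `select_for_cancel` helper, and the sample IDs are all hypothetical, and only the application names `sixtrack` and `sixtracktest` come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workunit:
    """Hypothetical, simplified workunit record (not the real database row)."""
    wu_id: int
    app_name: str


def select_for_cancel(workunits, first_id, last_id, app_name):
    """Return the IDs inside [first_id, last_id] that belong to app_name only."""
    return [
        wu.wu_id
        for wu in workunits
        if first_id <= wu.wu_id <= last_id and wu.app_name == app_name
    ]


# Made-up queue in which test and production workunits interleave.
queue = [
    Workunit(101, "sixtracktest"),
    Workunit(102, "sixtrack"),
    Workunit(103, "sixtracktest"),
    Workunit(104, "sixtrack"),
]

# Cancelling the bare range 101..104 would hit all four workunits;
# filtering on the application name keeps the production ones alive.
to_cancel = select_for_cancel(queue, 101, 104, "sixtracktest")
print(to_cancel)
```

With a bare range, workunits 102 and 104 in this made-up queue would have been cancelled as well, which matches the kind of accident described above.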
ID: 27429
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27430 - Posted: 8 May 2015, 8:36:05 UTC - in response to Message 27428.  

My laptop host 9924593 got a batch yesterday evening and errored them all. This isn't the runtime exceeded error: it looks like Linux version 452.02 processed them OK, but Windows version 451.07 failed to create an expected output file.
ID: 27430
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27431 - Posted: 8 May 2015, 17:05:00 UTC - in response to Message 27430.  

That's right; trying to figure out why you got
the "wrong" executable. eric.
ID: 27431
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27432 - Posted: 8 May 2015, 19:16:38 UTC - in response to Message 27431.  

That's right; trying to figure out why you got
the "wrong" executable. eric.

I've just aborted another two, both with multiple failures for other Windows wingmates.

v451.07 is listed as the current Windows test app on the applications page, too.
ID: 27432
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 27433 - Posted: 9 May 2015, 3:10:49 UTC

Just to give you some help here: I asked for a day's worth of LHC tasks... and as usual I got 32 tasks.

I already had vLHC X2 and Atlas X2 and a CMS-dev running, along with an Einstein GPU.
Volunteer Mad Scientist For Life
ID: 27433
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27434 - Posted: 9 May 2015, 9:34:14 UTC - in response to Message 27432.  

Thanks a lot; this seems very wrong. I am trying to get the old
test executables removed and hope that will help. Eric.
ID: 27434
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27435 - Posted: 9 May 2015, 9:36:14 UTC - in response to Message 27433.  

Hi Magic; do you want more or less??? 32 seems a lot to me given
that we rarely have 100,000 active WUs..... Eric.
ID: 27435
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27436 - Posted: 9 May 2015, 11:28:04 UTC - in response to Message 27434.  

Thanks a lot; this seems very wrong. I am trying to get the old
test executables removed and hope that will help. Eric.

I believe the better procedure is to 'deprecate' the app_version, and deploy the new executables as a completely new app_version. That's a database operation, rather than a file exchange.
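
As a rough illustration of what "a database operation, rather than a file exchange" could look like, here is a minimal sketch in Python. It uses an in-memory SQLite table as a simplified stand-in, not the project's real schema or tooling: the column names, the platform labels, the `deprecate_older_than` helper, and the integer encoding of the version numbers (451.07 as 45107, 452.02 as 45202) are all assumptions made for the example.

```python
import sqlite3

# Simplified, hypothetical stand-in for an app_version table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_version ("
    "id INTEGER PRIMARY KEY, app_name TEXT, platform TEXT, "
    "version_num INTEGER, deprecated INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO app_version (app_name, platform, version_num) VALUES (?, ?, ?)",
    [("sixtracktest", "windows", 45107), ("sixtracktest", "linux", 45202)],
)


def deprecate_older_than(conn, app_name, min_version):
    """Flag versions of app_name below min_version as deprecated.

    No executable is touched; only the flag in the table changes.
    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        "UPDATE app_version SET deprecated = 1 "
        "WHERE app_name = ? AND version_num < ? AND deprecated = 0",
        (app_name, min_version),
    )
    conn.commit()
    return cur.rowcount


changed = deprecate_older_than(conn, "sixtracktest", 45202)
active = conn.execute(
    "SELECT platform, version_num FROM app_version WHERE deprecated = 0"
).fetchall()
print(changed, active)
```

The point of the sketch is only that the old row stays in the table and is flagged, while the new executables arrive as a separate row with their own version number.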
ID: 27436
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 27437 - Posted: 9 May 2015, 18:03:30 UTC - in response to Message 27435.  

Hi Magic; do you want more or less??? 32 seems a lot to me given
that we rarely have 100,000 active WUs..... Eric.


Good morning Eric,

Well it turned out to actually be 44 tasks.

I set the 8-core to get 1 day's worth just so I could see how things are going here. Usually I would get 8 tasks, and at times 16, as a 24-hour block.

But lately it has been giving me about 5 or 10 days' worth.

I will still try to get most of them completed, but the due dates are always too soon (the 15th on this batch).

I only have 4 cores to crunch the LHC tasks, since I have the vLHC X2 and one Atlas and one CMS-dev (as usual, all the CERN projects).

My other PCs are quads and a couple of 3-cores, so I don't have free cores to do the SixTracks.

And all of them run GPUs too.

- Samson
Volunteer Mad Scientist For Life
ID: 27437
Harri Liljeroos

Joined: 28 Sep 04
Posts: 675
Credit: 43,537,005
RAC: 15,574
Message 27438 - Posted: 9 May 2015, 21:34:52 UTC

Still getting only 451.07 as a sixtracktest application on all three hosts and failing all WUs crunched with it. Some of those are crunched by wingmen with the newer test application and they are successful.
ID: 27438
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 27440 - Posted: 10 May 2015, 2:15:08 UTC

So far the 451.07 are running fine on this one.

6 down 38 to go

http://lhcathomeclassic.cern.ch/sixtrack/results.php?userid=5472
Volunteer Mad Scientist For Life
ID: 27440
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27441 - Posted: 10 May 2015, 7:08:35 UTC - in response to Message 27436.  

Thanks Richard, as you well know I am not the manager expert! .-)
I don't have and don't want permissions on the server
but this I can try right now. Eric.

I have deprecated a few tens of obsolete apps and I await the server
restart. 09:08 CEST
ID: 27441
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27442 - Posted: 10 May 2015, 7:10:38 UTC - in response to Message 27438.  

Sorry about that, but they are not too long... once the deprecated apps
are sorted I'll put in longer tests. Eric.
ID: 27442
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27443 - Posted: 10 May 2015, 10:46:45 UTC - in response to Message 27440.  

Hi Samson; must be sixtrack and not sixtracktest then...
I am currently trying to kill the bad WUs. Eric.
ID: 27443
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27444 - Posted: 10 May 2015, 13:56:47 UTC

Well I have decided not to purge the "bad" WUs in sixtracktest
using old executables. There are not many and I would rather let
the good WUs continue. All old versions are now
deprecated and the server restarted.

I also found lots of ancient WUs/Results which I shall
try to delete. This harks back to earlier messages
concerning WUs which are hanging around and will hang
around forever. I hope we shall then have a clean(er)
database. Eric.
ID: 27444
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 27445 - Posted: 10 May 2015, 15:38:34 UTC

Thanks for the info Eric,

The main thing is these tasks are having no problems here and being completed.

I was wondering why my stats said I had more in progress than I actually have on this PC, so I took a look: it was those 12 tasks from back in March that are listed as in progress but in fact no longer exist here.

So when the cleaner takes care of that the numbers will match again.

No big deal and I am used to this since the early days.

- Samson
Volunteer Mad Scientist For Life
ID: 27445
