workunits made to fail?
Author | Message |
---|---|
Send message Joined: 29 Nov 09 Posts: 42 Credit: 229,229 RAC: 0 |
Hi, Once in a while I get a workunit that is temporarily labeled "inconclusive", but an actual error had not happened for quite a while, which is why I thought of starting a topic. Since 3 others get an error and 1 is "Cancelled by server", and 8 others get an error and 1 is "Cancelled by server", can I assume those workunits were designed to error out? |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Not exactly 'designed to fail', but do note that they all have the application name "sixtracktest" - they are test workunits. As with all testing, nobody knows for certain whether they will work or not - if we knew that, the test would be over! So, it's not certain in advance whether they will fail or not, and it helps the scientists if you run them anyway, to find out. |
Send message Joined: 28 Sep 04 Posts: 675 Credit: 43,537,005 RAC: 15,574 |
I have here one that was cancelled by server after 18 minutes of crunching. This is a normal WU, not a sixtracktest WU. The other copy is still being crunched by another host. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Right, as Richard says, we are testing. I submitted a batch of work which all failed because of TIME LIMIT EXCEEDED, so I cancelled all the others to not waste your resources. The next batch of 59 cases looks better, but we are seeing a problem in that the wrong (not the new) executable is being used... These tests should be using SixTrack Version 4522. Looking at that right now. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Looks like we must have wrongly cancelled it. Apologies. Eric. (It is not exactly trivial to cancel as the ops page takes a range of WU IDs and boinc and boinctest are mixed together, really sorry.) |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
My laptop host 9924593 got a batch yesterday evening and errored them all. This isn't the runtime exceeded error: it looks like Linux version 452.02 processed them OK, but Windows version 451.07 failed to create an expected output file. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
That's right; trying to figure out why you got the "wrong" executable. Eric. |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
I've just aborted another two, both with multiple failures for other Windows wingmates. v451.07 is listed as the current Windows test app on the applications page, too. |
Send message Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
Just to give you some help here, I ask for a day's worth of LHC tasks... and as usual I get 32 tasks. I already had vLHC X2 and Atlas X2 and a CMS-dev running, along with an Einstein GPU. Volunteer Mad Scientist For Life |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks a lot; this seems very wrong. I am trying to get the old test executables removed and hope that will help. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hi Magic; do you want more or less??? 32 seems a lot to me given that we rarely have 100,000 active WUs..... Eric. |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
I believe the better procedure is to 'deprecate' the app_version, and deploy the new executables as a completely new app_version. That's a database operation, rather than a file exchange. |
Send message Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
Good morning Eric, Well, it turned out to actually be 44 tasks. I set the 8-core to get 1 day's worth just so I could see how things are going here; usually I would get 8 tasks, and at times 16, as a 24-hour block. But lately it has been giving me about 5 or 10 days' worth. I will still try to get most of them completed, but the due dates are always too soon (the 15th on this batch). I only have 4 cores to crunch the LHC tasks since I have the vLHC X2 and one Atlas and one CMS-dev (as usual, all the CERN projects). My other PCs are quads and a couple of 3-cores, so I don't have the free ones to do the SixTracks. And all of them run GPUs too. - Samson |
Send message Joined: 28 Sep 04 Posts: 675 Credit: 43,537,005 RAC: 15,574 |
Still getting only 451.07 as a sixtracktest application on all three hosts and failing all WUs crunched with it. Some of those are crunched by wingmen with the newer test application and they are successful. |
Send message Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
So far the 451.07 tasks are running fine on this one. 6 down, 38 to go: http://lhcathomeclassic.cern.ch/sixtrack/results.php?userid=5472 |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks Richard, as you well know I am not the manager expert! .-) I don't have and don't want permissions on the server but this I can try right now. Eric. I have deprecated a few tens of obsolete apps and I await the server restart. 09:08 CST |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Sorry about that, but they are not too long... once the deprecated apps are sorted I'll put in longer tests. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hi Samson; must be sixtrack and not sixtracktest then... I am currently trying to kill the bad WUs. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well I have decided not to purge the "bad" WUs in sixtracktest using old executables. There are not many and I would rather let the good WUs continue. All old versions are now deprecated and the server restarted. I also found lots of ancient WUs/Results which I shall try and delete. This harks back to an earlier message(s) concerning WUs which are hanging around and will hang around forever. I hope we shall then have a clean(er) database. Eric. |
Send message Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
Thanks for the info, Eric. The main thing is these tasks are having no problems here and are being completed. I was wondering why my stats said I had more in progress than I actually have on this PC, so I decided to take a look, and it was those 12 tasks from back in March that say they are in progress but in fact no longer exist here. So when the cleaner takes care of that, the numbers will match again. No big deal, and I am used to this since the early days. - Samson |
©2024 CERN