Message boards : Number crunching : WU not being resent to another user
Joined: 6 Dec 10 | Posts: 9 | Credit: 1,000,912 | RAC: 0
I have a dozen WUs for which the wingman's task errored. These WUs are not being resent; some have been waiting since 26 August. Could someone have a look at it? Thanks.
Joined: 9 Jan 08 | Posts: 66 | Credit: 727,923 | RAC: 0
That is just how the project works. The project manager recently submitted a huge batch of work (around 500k WUs). When a WU fails and gets scheduled for resending, it goes to the back of the queue. It has taken many days to work through the huge amount of work that was posted, but you should see them being resent within the next 2-3 days (we are down to around 80k WUs as I write this). So don't worry, it will resolve itself soon.
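To illustrate the queueing behaviour described above, here is a minimal sketch (my own illustration, not the actual server code) of a single FIFO work queue in which resends are appended to the back, so they only go out once everything queued ahead of them has been dispatched:

```python
from collections import deque

# Hypothetical illustration of the behaviour described above: fresh batches
# and resends share one FIFO queue, and a resend created after a failure
# joins the back, behind any batch that was submitted earlier.
queue = deque(f"batch_wu_{i}" for i in range(500_000))  # the big batch

def task_failed(wu_name):
    """A wingman's task errored: schedule the WU again, at the back."""
    queue.append(f"resend_{wu_name}")

def dispatch(n):
    """Send the next n tasks to hosts asking for work."""
    return [queue.popleft() for _ in range(min(n, len(queue)))]

task_failed("wlxu2_example")  # hypothetical WU name
# The resend is only reached after the ~500k tasks queued in front of it.
```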
Joined: 6 Dec 10 | Posts: 9 | Credit: 1,000,912 | RAC: 0
Many thanks Uffe F. ccandido
Joined: 27 Oct 07 | Posts: 186 | Credit: 3,297,640 | RAC: 0
I think we're through the big batch now - I'm starting to see resends arriving on all my machines.
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
Right; thanks for all that. I am going to submit another big batch, but I presume they will all go to the end of the queue. I will also need to rerun a few incomplete tasks to finish some studies, but I can wait for everything to be finished as I need to compare all the studies. I hope I have got this right. If I wait till the queue empties there would be no new work for a few days, so I think I would rather take advantage of the excellent progress and keep everybody busy for another week or two. Eric.
Joined: 9 Jan 08 | Posts: 66 | Credit: 727,923 | RAC: 0
Sounds great, Eric. And yes, I have started receiving resends now as well, so they should be on the way.
Joined: 27 Oct 07 | Posts: 186 | Credit: 3,297,640 | RAC: 0
My guess is that the ~65K tasks ready to send on the server now are all resends for one reason or another. That seems high, but possibly with so many WUs available recently, some hosts or users bit off more than they could chew and let the surplus time out.

If you start a new batch now, my understanding is that it will go to the end of the queue, and not start appearing on our machines until the resends have all been sent out. The potential problem is that inevitably some of the resends being sent out now will fail and need to be resent yet again - and if you submit the batch first, they'll be backed up behind it.

From the user point of view, it would be good to keep submitting smaller batches, so that the level of interest and engagement stays high - people start drifting away if there's no work available. But I don't know how difficult or time-consuming it would be to keep drip-feeding us.
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
OK; I've put in a relatively small batch and I shall watch over the weekend. Eric. (I have plenty more work for the intensity scan. I am holding off the "very" long jobs until I implement a scheme to run them as a sequential series of shorter WUs.)
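A minimal sketch of the kind of splitting scheme Eric describes (purely illustrative; the segment size, file names and state format are my assumptions, not the project's actual implementation): a 10**7-turn job is cut into ten 10**6-turn segments, each new WU resuming from the final state written by the previous one.

```python
# Illustrative only: split one very long tracking job into a chain of
# shorter WUs, each resuming from the previous segment's final state.
TOTAL_TURNS = 10**7
SEGMENT_TURNS = 10**6

def make_segments(job_name, total=TOTAL_TURNS, step=SEGMENT_TURNS):
    segments = []
    for i in range(total // step):
        segments.append({
            "wu_name": f"{job_name}__seg{i}",
            "turns": step,
            # each segment reads the state file produced by its predecessor
            "input_state": None if i == 0 else f"{job_name}__seg{i - 1}.state",
            "output_state": f"{job_name}__seg{i}.state",
        })
    return segments

chain = make_segments("wlxu7_example")  # hypothetical job name
# seg0 is submitted first; seg1 can only be generated once seg0's output
# state has been returned and validated, and so on down the chain.
```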
Joined: 9 Jan 08 | Posts: 66 | Credit: 727,923 | RAC: 0
I just got a few wlxu2 on my computer, so it seems that there are still some fresh long ones in the queue.
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
Well, wlxu2 are "only" 10**6 turns maximum, which I now consider "long". The very long ones are in wlxu7 and, as I said, I am holding those off. There are very few resends, as I have plenty of work and I don't want to waste resources by re-running until it is essential.
Joined: 25 Aug 05 | Posts: 69 | Credit: 306,627 | RAC: 0
I have one of the wlxu7. Nearly 100 hours estimated. Not a re-send. Christoph
Joined: 9 Jan 08 | Posts: 66 | Credit: 727,923 | RAC: 0
I got 6 wlxu2: 2 of them are 7 hours, 2 of them are 61 hours and 2 of them are 76 hours. It's not a problem, because I can easily do them in time; this is just so you know that there are some out there. They are not resends.

3121879 - short
3123215 - long
3123328 - long
3123629 - long
3125320 - short
3125798 - long

I don't know where to look for the turns, but I can see the estimated computation sizes: 180,000 GFLOPs and 1,800,000 GFLOPs. The reason there are three different estimated times is that the sse2, sse3 and pni jobs don't have the same estimates.
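As a rough, hedged check of where those hour figures come from: BOINC derives the time estimate from the workunit's estimated computation size divided by the speed it attributes to the host for that app version, which is why the sse2, sse3 and pni builds show different times. The host speed below is an assumed value for illustration, not something reported in the thread.

```python
# Rough illustration of how the estimates scale; the host speed is an
# assumption, not a figure from the thread.
host_speed_gflops = 7.0                   # assumed effective speed of one core

for size_gflops in (180_000, 1_800_000):  # sizes quoted above
    seconds = size_gflops / host_speed_gflops
    print(f"{size_gflops:>9} GFLOPs -> ~{seconds / 3600:.0f} hours")

# ~7 hours for the 180,000 GFLOPs WUs and ~71 hours for the 1,800,000 GFLOPs
# ones, broadly in line with the 7 / 61 / 76 hour estimates quoted; different
# app versions (sse2, sse3, pni) are credited with different speeds, hence
# the spread.
```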
Joined: 14 Jul 12 | Posts: 2 | Credit: 173,287 | RAC: 0
You can extract that from the name of the workunit. I've taken the workunits you posted:

3121879 - wlxu2_nuebb1__41__s__64.31_59.32__4.5_5__6__49.5_1_sixvf_boinc52013
3125320 - wlxu2_nuebb1__48__s__64.31_59.32__4_4.5__6__1.5_1_sixvf_boinc53574
3125798 - wlxu2_nuebb0__46__s__64.31_59.32__5.5_6__7__37.5_1_sixvf_boinc53303
3123629 - wlxu2_nuebb0__43__s__64.31_59.32__4.5_5__7__9_1_sixvf_boinc52458
3123328 - wlxu2_nuebb0__41__s__64.31_59.32__5.5_6__7__88.5_1_sixvf_boinc52157
3123215 - wlxu2_nuebb0__41__s__64.31_59.32__5_5.5__7__7.5_1_sixvf_boinc52044

The normal workunits have __6__ in the name; the longer ones have __7__.
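A small sketch of how one might pull that field out programmatically. This is my own illustration; treating the field as the exponent of the maximum turn count is an assumption based on Eric's "10**6 turns" remark above, not something stated by the project.

```python
# Split the workunit name on "__" and read the single-digit field that the
# previous post points to (6 for normal WUs, 7 for the longer ones).
# Interpreting it as the exponent of the maximum turn count is an assumption.
def max_turns(wu_name: str) -> int:
    fields = wu_name.split("__")
    exponent = int(fields[5])  # e.g. '6' or '7' in the names above
    return 10 ** exponent

print(max_turns("wlxu2_nuebb1__41__s__64.31_59.32__4.5_5__6__49.5_1_sixvf_boinc52013"))
# -> 1000000  (a "normal" WU)
print(max_turns("wlxu2_nuebb0__46__s__64.31_59.32__5.5_6__7__37.5_1_sixvf_boinc53303"))
# -> 10000000 (one of the longer WUs)
```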
Joined: 27 Oct 07 | Posts: 186 | Credit: 3,297,640 | RAC: 0
"OK; I've put in a relatively small batch and I shall watch ..."

Looks like it might be time for the next small batch, please. Zero tasks ready to send on the server status page.
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
On their way. Eric.
Joined: 9 Oct 10 | Posts: 77 | Credit: 3,671,357 | RAC: 0
I've seen a lot of very short WUs in this batch. They all validated fine. Was it expected?
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
I can't quite say "expected"; it's not surprising either, but this is what we are investigating: the onset of chaotic motion.
Joined: 4 Nov 06 | Posts: 12 | Credit: 376,810 | RAC: 0
This is getting silly. PVs are stacking up, and WUs over three weeks old are still awaiting new wingmen to complete. Time to switch to another project.
Joined: 12 Jul 11 | Posts: 857 | Credit: 1,619,050 | RAC: 0
Well, I am sorry about that; each task is sent only twice, the minimum necessary for comparison. BOINC is supposed to use deadlines to send an additional copy, but we have other problems with deadlines. It can easily happen that getting a second result takes some time, as some machines are slow, or switched off, or whatever. This means we too have a problem with a "tail" of incomplete cases for a particular study. Re-submitting too often is a waste of your valuable resources. I'll discuss it with colleagues, but I think this is how BOINC works. Eric.
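For readers unfamiliar with how this plays out server-side, here is a hedged sketch of the kind of decision BOINC's transitioner makes for each workunit. It is an illustration of the behaviour Eric describes, not the project's actual code, and the parameter values are assumptions.

```python
# Illustrative per-workunit replication/resend logic, roughly in the spirit
# of BOINC's transitioner. Parameter values are assumptions for this sketch.
TARGET_NRESULTS = 2      # initial replication: each WU goes to two hosts
MIN_QUORUM = 2           # two matching results needed for validation
MAX_TOTAL_RESULTS = 10   # give up on the WU after this many attempts

def results_needed(successes, errors, timeouts, in_progress):
    """How many new copies to create for this workunit right now."""
    total_sent = successes + errors + timeouts + in_progress
    if total_sent >= MAX_TOTAL_RESULTS:
        return 0  # WU declared in error, no more resends
    # Keep enough viable copies (in progress or already successful) to reach
    # the quorum; a new copy is only created once a result errors out or
    # misses its deadline, which is why a slow or switched-off wingman can
    # hold a WU up until its deadline passes.
    still_viable = successes + in_progress
    return max(0, max(MIN_QUORUM, TARGET_NRESULTS) - still_viable)

# A wingman's copy errored while the other copy succeeded: one resend is
# created (and joins the back of the queue, as discussed earlier).
print(results_needed(successes=1, errors=1, timeouts=0, in_progress=0))  # -> 1

# Both copies are still out, one with a slow or switched-off host: nothing
# new is created until that copy returns or misses its deadline.
print(results_needed(successes=0, errors=0, timeouts=0, in_progress=2))  # -> 0
```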
Joined: 4 Nov 06 | Posts: 12 | Credit: 376,810 | RAC: 0
It doesn't happen with other BOINC projects. Normally, once a WU is out of time, a new task is automatically generated at once.