Message boards :
ATLAS application :
New WU with 2 output files
Author | Message |
---|---|
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Hi all, As you may have noticed we have a lot of ATLAS WU running currently and we struggle to keep enough in the queue. Part of the problem is that we have a backlog in processing the results (the "WU waiting for assimilation" on the server status page). At the end of each WU the big HITS data file is zipped up with a few other small files and sent back to the BOINC server as a single file. Unzipping this file on the server side is rather slow so this is why there is a backlog. This morning we made a change so that the HITS file is sent back as a separate file which will greatly reduce the load from the unzipping step. So at some point you will see that your WU have two separate uploads, one large HITS file and one smaller file. The overall data volume to transfer should be roughly the same as before. Please let us know of any problems you see. |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
Hello! I presume the new WUs with two output files are V2.55 (native). Since V2.55 I constantly get validation errors. The WUs seem to run normally until the end, but get zero credits. See: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214722014 and https://lhcathome.cern.ch/lhcathome/result.php?resultid=214718741 Cheers, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 28 Sep 04 Posts: 643 Credit: 40,204,999 RAC: 14,735 |
Hello! I have got a couple of those as well on a Windows machine, here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214736653 and here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214717820 There are successful ones as well, so not all of them are validate errors. |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
There are successful ones as well, so not all of them are validate errors. I can confirm this. After the first two WUs had validation errors I had several successful ones. But I have two more with validation errors: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214737187 and https://lhcathome.cern.ch/lhcathome/result.php?resultid=214739842 Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 28 Sep 04 Posts: 643 Credit: 40,204,999 RAC: 14,735 |
Likewise here, currently more validate errors than good results. |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
With my machine it's about a 50/50 ratio. Just got one more bad one. Already about 23 hours of total runtime (or 69 hours of CPU time) wasted :-( I have no problem if technical difficulties occur, but I really hate wasting resources... Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
I get only validation errors. Hits files produced. Tullio |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
50% invalid here too. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
So far 2 out of 7 native v2.55 tasks have validate errors. In the logs it says e.g. "Moving ./HITS.16756652._013772.pool.root.1 to shared/HITS.pool.root.1" for every task (valid and invalid ones according to BOINC server standards), which indicates that the tasks ran successfully, so it is probably "just" a BOINC server validation problem. I would not mind if not all tasks give credits for the moment, but as djoser mentioned it is a waste of resources: because of the validate errors, the tasks get sent to another host although they have already produced good results (HITS file). |
Joined: 26 Sep 17 Posts: 6 Credit: 1,190,866 RAC: 0 |
|
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Please let us know of any problems you see. Well, you asked to be informed of any problems, so... The biggest problem here is that you people still have not learned to wait until Monday before implementing changes like this. Higgs discovered? Really???? I don't think you people have the talent required to make such a discovery. |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Sorry about all this mess. We finally figured out the problem - there was an extra BOINC server running which was not properly configured to handle the change to two output files. All the WU handled by this server were invalidated which explains the random success/failure. This server was also not supposed to be running so it took until now to figure out what the problem was. Since we fixed the problem 20 mins ago (by shutting down the extra server) I've seen no validation failures for "good" WU. |
Joined: 28 Sep 04 Posts: 643 Credit: 40,204,999 RAC: 14,735 |
Sorry about all this mess. We finally figured out the problem - there was an extra BOINC server running which was not properly configured to handle the change to two output files. All the WU handled by this server were invalidated which explains the random success/failure. This server was also not supposed to be running so it took until now to figure out what the problem was. Since we fixed the problem 20 mins ago (by shutting down the extra server) I've seen no validation failures for "good" WU. Thank you for the information. Is there a way you could manually rerun the validation process with the good server? It would be sad to lose the valid results because of this. And it would give the credit where it is due. |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
Two computing errors, EXIT_DISK_LIMIT_EXCEEDED, one on a Windows 10 PC, the other on a Linux box. Tullio |
Joined: 3 Dec 15 Posts: 4 Credit: 22,885,388 RAC: 0 |
Wow, after several resets and reinstalling VB, I found out the problem is not at my end. The joys of distributed computing......;) |
Joined: 26 Sep 17 Posts: 6 Credit: 1,190,866 RAC: 0 |
No invalids so far, cheers! |
Joined: 15 Jun 08 Posts: 2252 Credit: 199,526,762 RAC: 130,355 |
I'm getting lots of ATLAS validation errors and it seems that other users get them too. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
My invalids on ATLAS native have dropped from about 60% to about 2%, which is still higher than it was before the switch to 2.55. Fortunately they seem to run for only about 200 seconds so it's not a big loss. I've noticed a slight rise in the number of "failed download" results since David announced the other problem was fixed. |
Joined: 15 Jun 08 Posts: 2252 Credit: 199,526,762 RAC: 130,355 |
... "failed download" ... Right. That's also an issue ATM. |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
I am getting computation errors after a HITS file has been produced. There is an error message at the end of stderr.txt related to a file transfer. |
©2023 CERN