Message boards : ATLAS application : New WU with 2 output files

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 37820 - Posted: 25 Jan 2019, 9:02:04 UTC

Hi all,

As you may have noticed, we currently have a lot of ATLAS WU running and we struggle to keep enough in the queue. Part of the problem is a backlog in processing the results (the "WU waiting for assimilation" on the server status page). At the end of each WU the big HITS data file is zipped up with a few other small files and sent back to the BOINC server as a single file. Unzipping this file on the server side is rather slow, which is why there is a backlog.

This morning we made a change so that the HITS file is sent back as a separate file, which will greatly reduce the load from the unzipping step. So at some point you will see that your WU have two separate uploads: one large HITS file and one smaller file. The overall data volume to transfer should be roughly the same as before. Please let us know of any problems you see.
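
For illustration, the change described above can be sketched roughly as follows. This is only a sketch, not the actual ATLAS wrapper code: the function name `package_outputs`, the use of the zip format and all file names are assumptions made up for the example.

```python
import zipfile
from pathlib import Path
from typing import List


def package_outputs(hits_file: Path, small_files: List[Path], archive: Path) -> List[Path]:
    """Bundle only the small files; leave the large HITS file as its own upload.

    Hypothetical helper for illustration, not part of any BOINC or ATLAS API.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in small_files:
            # Store each file under its bare name so the archive stays flat.
            zf.write(path, arcname=path.name)
    # Two uploads instead of one: the untouched HITS file and the small archive.
    return [hits_file, archive]
```

With this layout the server only has to unpack the small archive, while the HITS file can be used as it arrives.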
ID: 37820
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37822 - Posted: 25 Jan 2019, 16:59:07 UTC
Last modified: 25 Jan 2019, 17:01:36 UTC

Hello!

I presume the new WUs with two output files are V2.55 (native).
Since V2.55 I constantly get confirmation errors.
The WUs seem to run normally until the end, but get zero credit.

See:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214722014
and
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214718741

Cheers, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 37822
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 37823 - Posted: 25 Jan 2019, 22:35:09 UTC - in response to Message 37822.  

Hello!

I presume the new WUs with two output files are V2.55 (native).
Since V2.55 I constantly get confirmation errors.
The WUs seem to run normally until the end, but get zero credit.

See:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214722014
and
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214718741

Cheers, djoser.

I have got a couple of those as well on a Windows machine, here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214736653 and here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=214717820

There are successful ones as well, so not all of them are validate errors.
ID: 37823
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37825 - Posted: 26 Jan 2019, 9:39:42 UTC - in response to Message 37823.  

There are successful ones as well, so not all of them are validate errors.


I can confirm this. After the first two WUs had validation errors I had several successful ones.

But I have two more with validation errors:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214737187
and
https://lhcathome.cern.ch/lhcathome/result.php?resultid=214739842
ID: 37825
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 37827 - Posted: 26 Jan 2019, 12:27:04 UTC - in response to Message 37825.  

Likewise here, currently more validate errors than good results.
ID: 37827
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37828 - Posted: 26 Jan 2019, 17:07:52 UTC - in response to Message 37827.  
Last modified: 26 Jan 2019, 17:09:20 UTC

On my machine it's about a 50/50 ratio. I just got one more bad one.
Already about 23 hours of total runtime (or 69 hours of CPU time) wasted :-(
I have no problem if technical difficulties occur, but I really hate wasting resources...
ID: 37828
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 37834 - Posted: 26 Jan 2019, 23:28:16 UTC

I get only validation errors, even though the HITS files are produced.
Tullio
ID: 37834
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37835 - Posted: 27 Jan 2019, 0:01:10 UTC

50% invalid here too.
ID: 37835
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 37836 - Posted: 27 Jan 2019, 10:56:07 UTC

So far 2 out of 7 native v2.55 tasks have validate errors.

In the logs it says e.g. "Moving ./HITS.16756652._013772.pool.root.1 to shared/HITS.pool.root.1" for every task (valid and invalid ones according to BOINC server standards), which indicates that the tasks ran successfully, so it is probably "just" a BOINC server validation problem.
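
The log check described above could be automated with something like the sketch below. The function name `hits_file_produced` and the regular expression are assumptions based only on the single log line quoted here; a match suggests that the HITS file was produced, not that the server will validate the task.

```python
import re

# Assumed pattern, derived from the one log line quoted above.
HITS_MOVE_PATTERN = re.compile(r"Moving \./HITS\.\S+ to shared/HITS\.pool\.root\.1")


def hits_file_produced(log_text: str) -> bool:
    """Return True if any log line reports the HITS file being moved to shared/."""
    return any(HITS_MOVE_PATTERN.search(line) for line in log_text.splitlines())
```

For example, passing the contents of a task's stderr output to `hits_file_produced` gives a quick yes/no on whether that task got as far as writing its result.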

I would not mind if not all tasks give credit for the moment, but as djoser mentioned it is a waste of resources: due to the validate errors, the tasks get sent to another host although they have already produced good results (HITS file).
ID: 37836
Azmodes

Joined: 26 Sep 17
Posts: 6
Credit: 1,190,866
RAC: 0
Message 37837 - Posted: 27 Jan 2019, 11:12:17 UTC

ID: 37837
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37838 - Posted: 27 Jan 2019, 14:55:53 UTC - in response to Message 37820.  

Please let us know of any problems you see.


Well, you asked to be informed of any problems, so...

The biggest problem here is that you people still have not learned to wait until Monday before implementing changes like this.

Higgs discovered? Really???? I don't think you people have the talent required to make such a discovery.
ID: 37838
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 37843 - Posted: 28 Jan 2019, 13:25:36 UTC

Sorry about all this mess. We finally figured out the problem - there was an extra BOINC server running which was not properly configured to handle the change to two output files. All the WU handled by this server were invalidated which explains the random success/failure. This server was also not supposed to be running so it took until now to figure out what the problem was. Since we fixed the problem 20 mins ago (by shutting down the extra server) I've seen no validation failures for "good" WU.
ID: 37843
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 37844 - Posted: 28 Jan 2019, 13:58:08 UTC - in response to Message 37843.  

Sorry about all this mess. We finally figured out the problem - there was an extra BOINC server running which was not properly configured to handle the change to two output files. All the WU handled by this server were invalidated which explains the random success/failure. This server was also not supposed to be running so it took until now to figure out what the problem was. Since we fixed the problem 20 mins ago (by shutting down the extra server) I've seen no validation failures for "good" WU.

Thank you for the information. Is there a way you could manually rerun the validation process with the good server? It would be sad to lose the valid results because of this. And it would give credit where it is due.
ID: 37844
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 37846 - Posted: 29 Jan 2019, 0:09:39 UTC

Two computing errors, EXIT_DISK_LIMIT_EXCEEDED, one on a Windows 10 PC, the other on a Linux box.
Tullio
ID: 37846
Fuzzy Duck

Joined: 3 Dec 15
Posts: 4
Credit: 22,885,388
RAC: 0
Message 37847 - Posted: 29 Jan 2019, 0:43:38 UTC - in response to Message 37844.  

Wow,

After several resets and reinstalling VB, I found out the problem is not at my end.

The joys of distributed computing......;)
ID: 37847
Azmodes

Joined: 26 Sep 17
Posts: 6
Credit: 1,190,866
RAC: 0
Message 37850 - Posted: 29 Jan 2019, 14:30:52 UTC

No invalids so far, cheers!
ID: 37850
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,925,441
RAC: 137,691
Message 37868 - Posted: 31 Jan 2019, 6:44:23 UTC

I'm getting lots of ATLAS validation errors and it seems that other users get them too.
ID: 37868
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37876 - Posted: 31 Jan 2019, 10:10:03 UTC - in response to Message 37868.  

My invalids on ATLAS native have dropped from about 60% to about 2%, which is still higher than it was before the switch to 2.55. Fortunately they seem to run for only about 200 seconds, so it's not a big loss.

I've noticed a slight rise in the number of "failed download" results since David announced the other problem was fixed.
ID: 37876
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,925,441
RAC: 137,691
Message 37877 - Posted: 31 Jan 2019, 10:19:20 UTC - in response to Message 37876.  

... "failed download" ...

Right.
That's also an issue ATM.
ID: 37877
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 37878 - Posted: 31 Jan 2019, 10:24:55 UTC
Last modified: 31 Jan 2019, 10:26:15 UTC

I am getting computation errors after a HITS file has been produced. There is an error message at the end of stderr.txt related to a file transfer.
ID: 37878


©2024 CERN