Message boards : Number crunching : Ton of Atlas Validate Errors
keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,096,685
RAC: 2,148
Message 31395 - Posted: 15 Jul 2017, 1:13:26 UTC

Over the last couple of weeks, about 80% of my Atlas tasks have wound up with a validate error, and several have run for more than a day and a half while stuck at 99% (the ones that do complete and verify average 2.5 - 3 hours).

I've made no changes on my end at all

Anyone else seeing strange things with Atlas?

ID: 31395
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,004,183
RAC: 136,049
Message 31400 - Posted: 15 Jul 2017, 7:55:25 UTC - in response to Message 31395.  

Over the last couple of weeks, about 80% of my Atlas tasks have wound up with a validate error, and several have run for more than a day and a half while stuck at 99% (the ones that do complete and verify average 2.5 - 3 hours).

I've made no changes on my end at all

Anyone else seeing strange things with Atlas?

It looks like something got scrambled on your host.
The reason may be that some resources are overloaded, e.g. RAM or disk IO.
You may try the following steps one after the other. If you get a stable setup, skip the rest.

1. Reboot your host.
2. Reduce the number of concurrently running VMs.
3. Reset the project to get a fresh vdi file.
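
For step 2, one way to limit concurrency is a local app_config.xml placed in the project's directory. This is only a minimal sketch: it assumes the application is named ATLAS on this project, and the limit of 1 is an illustrative value, not a recommendation from the project.

```xml
<app_config>
  <app>
    <name>ATLAS</name>
    <!-- assumption: allow at most one ATLAS VM at a time; adjust to your RAM and disk -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

The client has to re-read its config files (or be restarted) before the limit takes effect.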
ID: 31400
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,310
RAC: 3,828
Message 31401 - Posted: 15 Jul 2017, 10:25:22 UTC

keputnam, looking at your stderr for those tasks, I can almost guarantee that you had an ISP problem.

These tasks are known to do what yours did when a throttled-down internet connection is not able to start the task properly. (I have run thousands of them, and that is what always caused this to happen.)

Maybe your ISP throttled you down to a slower speed, or you just had a bad/slow day with your ISP.

You will see this in your stderr when it is an internet problem on your end:

TThe hlea slta s1t0 1l0i nleisn eosf otfh et hpei lpoitl olto gl.og

ERROR: Missing metadata.xml

CCooppyyiinngg iinnppuutt ffiilleess iinnttoo RRuunnAAttllaass..
Guest Log: CCooppyyiinngg iinnppuutt ffiilleess iinnttoo RRuunnAAttllaass..
Guest Log: Copied input files into RunAtlas.
Guest Log: Copied input files into RunAtlas.

The task gets all messed up and at times will actually run to completion and THEN be marked Invalid.

You also got "ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS", which was not caused on your end. It was supposed to be fixed with a new version the other day, and I think that new version of Atlas will be here soon. (OK, we did find a new problem with it a few hours ago, but I hope that gets fixed too.)

BUT as I mentioned, you are having an ISP problem on your end with the Atlas tasks, so for now maybe you should just try some during off-hours and see if that works, or run different tasks (Theory).

I am on PDT, and when I first figured this out a couple of months ago I started the tasks after midnight. They would start up, and once they have run for 30 minutes the internet speed no longer matters (I even tested them with mine unplugged).
Volunteer Mad Scientist For Life
ID: 31401
keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,096,685
RAC: 2,148
Message 31445 - Posted: 17 Jul 2017, 1:02:28 UTC
Last modified: 17 Jul 2017, 1:04:40 UTC

Thanks, guys.

My ISP swears that they haven't throttled me, but I am about to upgrade/expand my system disk. Several times I've noticed the access light on solidly for quite a while. I also have Atlas set for only 1 job at a time.


Any thoughts on the extremely long running ones?
ID: 31445
keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,096,685
RAC: 2,148
Message 31910 - Posted: 10 Aug 2017, 19:48:35 UTC

Finally convinced ATT to come out and look at my router.

When I upgraded service a few months ago, they swapped out the old one, and the tech didn't configure the new one properly.

Connect speeds to the router itself are much better, and actual throughput is almost an order of magnitude faster.

Almost a whole day with no verify errors!

ID: 31910
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 31911 - Posted: 10 Aug 2017, 19:58:52 UTC - in response to Message 31910.  

Almost a whole day with no verify errors!

Good news!
ID: 31911
kraljb

Joined: 15 Jun 16
Posts: 1
Credit: 34,845
RAC: 0
Message 32122 - Posted: 27 Aug 2017, 15:35:30 UTC

Hi there. As I look through my tasks I have found many with 'Validate error' status. Is there something I should do? :)
ID: 32122
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,004,183
RAC: 136,049
Message 32124 - Posted: 27 Aug 2017, 16:43:00 UTC - in response to Message 32122.  

kraljb wrote:
Hi there. As I look through my tasks I have found many with 'Validate error' status. Is there something I should do? :)

Your VMs need more RAM.

The "official" RAM setting for a 1-core VM is 3400 MB.
This is also the configuration you get from the project server.

Nonetheless, the current ATLAS batch obviously needs more RAM during its initial phase.

You may use a local app_config.xml to raise the RAM setting.
5000 MB should work in any case; less (4600-4800 MB) may also be enough, but there is no guarantee.

A sample app_config.xml looks like this:
<app_config>
  <app>
    <name>ATLAS</name>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--nthreads 1 --memory_size_mb 5000</cmdline>
  </app_version>
</app_config>

Reload the local settings and start a new WU, as the new setting becomes active only for freshly started VMs.
ID: 32124
