Message boards : ATLAS application : Error -161
Dave Peachey

Send message
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29948 - Posted: 17 Apr 2017, 12:06:49 UTC

Whilst erroring out another dozen or so of these WUs this morning (allowing them to error out rather than aborting them gets them out of the system that bit quicker due to the "max # errors" setting), I noticed that all of these WUs have a common factor: they all appear to contain the text string ..qnDDn7oo6G73TpABFKDmABFKDmPaIKDm.. within the WU name (at least, all of the ones I've encountered have done so).

I don't know whether there are any other WUs with different name strings which are exhibiting the same problem, nor do I know whether anyone has been able to crunch any of these successfully (and, if so, under what circumstances), but that does seem to suggest a common fault with a specific WU batch.
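
For anyone who wants to pick these out of a task list automatically, the name check can be sketched in a few lines of Python. This is only an illustration: the task names below are made up (hypothetical), and only the marker string is taken from the WU names described above.

```python
# Marker string observed in the names of the failing WUs.
BAD_MARKER = "qnDDn7oo6G73TpABFKDmABFKDmPaIKDm"


def find_suspect_tasks(task_names, marker=BAD_MARKER):
    """Return the task names that contain the marker string."""
    return [name for name in task_names if marker in name]


# Hypothetical task names, invented purely for illustration.
example_names = [
    "example_aaa_qnDDn7oo6G73TpABFKDmABFKDmPaIKDm_bbb_0",
    "example_some_other_batch_1",
    "example_ccc_qnDDn7oo6G73TpABFKDmABFKDmPaIKDm_ddd_2",
]

suspects = find_suspect_tasks(example_names)
print(suspects)
```

A match on the name only says that a task belongs to the suspect batch; it does not prove that the task will fail.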

Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it.

Hopefully one of the project team will be able to get on the case some time this week.
ID: 29948 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 29951 - Posted: 17 Apr 2017, 17:35:53 UTC - in response to Message 29948.  

Hopefully one of the project team will be able to get on the case some time this week.

+1
ID: 29951 · Report as offensive     Reply Quote
Profile HerveUAE
Avatar

Send message
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29952 - Posted: 17 Apr 2017, 17:51:21 UTC

So what I am doing now is: once such a "Long runner" is downloaded, I abort it immediately.

+1
We are the product of random evolution.
ID: 29952 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 29961 - Posted: 18 Apr 2017, 15:55:13 UTC - in response to Message 29948.  

Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it.

Hopefully one of the project team will be able to get on the case some time this week.

That's what I hope too. The way it works right now is somewhat tedious.
ID: 29961 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 29970 - Posted: 19 Apr 2017, 19:01:01 UTC - in response to Message 29961.  

These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste).

What surprises me is that nobody at LHC/ATLAS has yet noticed this problem, which has existed for almost a week.
ID: 29970 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 807
Credit: 652,541,596
RAC: 276,109
Message 29971 - Posted: 19 Apr 2017, 20:16:22 UTC

I assume they are on holiday. I don't have time to nanny the tasks, so I have a 53% error rate and my RAC has taken a hit :(
ID: 29971 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,569,815
RAC: 9,173
Message 29972 - Posted: 19 Apr 2017, 20:20:09 UTC - in response to Message 29970.  

These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste).

What surprises me is that nobody at LHC/ATLAS has yet noticed this problem, which has existed for almost a week.

When this happens I switch my clients to other projects (I have already switched them to the primegrid-race)


Supporting BOINC, a great concept !
ID: 29972 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29974 - Posted: 20 Apr 2017, 11:12:29 UTC

I run a mix of projects on LHC, namely SixTrack, Theory and ATLAS. I am getting about 10 of the bad ATLAS tasks a day, and they fail after 5 minutes. So that is 50 minutes a day on one out of seven cores of an i7-4790 that I have on LHC. That is about 0.5% of my CPU time, and is not worth doing anything.

Or to put it another way, if I thought that another project were more than 0.5% more valuable than that mix, or if I valued my time less than that amount, then I would do something.
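
The 0.5% figure can be checked with a quick calculation. The sketch below assumes all seven cores are available 24 hours a day, which is a simplification; the function name is just illustrative.

```python
def wasted_cpu_fraction(failures_per_day, minutes_per_failure, cores):
    """Fraction of the daily core-minutes lost to failing tasks."""
    wasted_minutes = failures_per_day * minutes_per_failure
    available_minutes = cores * 24 * 60  # core-minutes per day
    return wasted_minutes / available_minutes


# 10 bad tasks a day, 5 minutes each, on a host running 7 cores.
fraction = wasted_cpu_fraction(10, 5, 7)
print(f"{fraction:.2%}")
```

That is 50 wasted minutes out of 10,080 core-minutes a day, which rounds to about 0.5%.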
ID: 29974 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 29975 - Posted: 20 Apr 2017, 11:56:34 UTC - in response to Message 29974.  

... and is not worth doing anything...

On the one hand, you are right, seen from the point of view of ONE single user.
On the other hand, this unfortunate situation has now been going on for about a week, and in total, across all crunchers, that is a lot of wasted CPU time and bandwidth.
ID: 29975 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29976 - Posted: 20 Apr 2017, 12:00:38 UTC - in response to Message 29975.  

On the other hand, this unfortunate situation has now been going on for about a week, and in total, across all crunchers, that is a lot of wasted CPU time and bandwidth.

Yes, they should be looking into it, at least for their own sake if not ours.
ID: 29976 · Report as offensive     Reply Quote
Profile Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Send message
Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 29977 - Posted: 20 Apr 2017, 12:43:51 UTC - in response to Message 29976.  

Thanks for your reports and help!

From the BOINC point of view, we have not been able to detect any error. As far as I can see, the VM starts, but then the Panda job from ATLAS fails.

I got one of those failed ones myself the other day, e.g. https://lhcathome.cern.ch/lhcathome/result.php?resultid=136096220

Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this.

Meanwhile, please just ignore these tasks; there should be plenty of working ones in between.
ID: 29977 · Report as offensive     Reply Quote
Profile HerveUAE
Avatar

Send message
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29981 - Posted: 20 Apr 2017, 15:59:30 UTC

Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this.

Thanks Nils for the clarification.

That is about 0.5% of my CPU time, and is not worth doing anything.

Yes, but I had to increase the frequency of synch with the servers in order to limit the loss to 0.5%. With the default settings I sometimes have only 30% of the cores working until the next synch.
We are the product of random evolution.
ID: 29981 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 29996 - Posted: 22 Apr 2017, 4:38:16 UTC - in response to Message 29977.  

Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this.

Nils - have you not yet been able to find someone to fix this problem?
ID: 29996 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 675
Credit: 43,658,342
RAC: 15,889
Message 30002 - Posted: 22 Apr 2017, 12:45:51 UTC

I have one of these queued up again on my task list: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=65885688 It has already failed on three other hosts. I am running on one CPU core and have set the memory in app_config.xml to 3400 MB, which is what I see as base memory in VirtualBox Manager, but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory.

Have you seen similar memory requirements?
ID: 30002 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 30003 - Posted: 22 Apr 2017, 14:07:45 UTC - in response to Message 30002.  

... but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory.

Have you seen similar memory requirements?

How did you find out that Boinc needs 9000 MB? For how many cores?
ID: 30003 · Report as offensive     Reply Quote
Profile HerveUAE
Avatar

Send message
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 30004 - Posted: 22 Apr 2017, 14:26:37 UTC - in response to Message 30002.  
Last modified: 22 Apr 2017, 14:28:29 UTC

but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory.

Hi Harri. What setting do you have in "LHC@home preferences", under "Max # CPUs"? If you have "No limit" or "8", then BOINC will consider that the task needs around 9000 MB of memory to run (the value set by the server as required for 8 cores), even if you have set 3400 MB and 1 core in app_config.xml.
If this is your situation, change the setting "Max # CPUs" to "1".
We are the product of random evolution.
ID: 30004 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 675
Credit: 43,658,342
RAC: 15,889
Message 30005 - Posted: 22 Apr 2017, 15:20:10 UTC - in response to Message 30004.  

Sorry, I have screwed up. I modified the settings in the LHC@home preferences: I meant to set Max # CPUs to 1 and Max # Jobs to 10, and got them the wrong way round. I have only 8 CPUs, so I probably have to abort all the tasks I downloaded since the change.

Sorry about that... Move along. Nothing to see here...
ID: 30005 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 675
Credit: 43,658,342
RAC: 15,889
Message 30006 - Posted: 22 Apr 2017, 15:47:10 UTC - in response to Message 30003.  

... but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory.

Have you seen similar memory requirements?

How did you find out that Boinc needs 9000 MB? For how many cores?


You can see the memory requirement by right-clicking a task that has been started, selecting Properties and looking at the line "Working set size". I use BoincTasks as a Boinc Manager replacement; there you can also set it to show memory and virtual memory usage directly on the Tasks tab.

The number of CPUs is where I went wrong (see my previous post).
ID: 30006 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,920,092
RAC: 121,920
Message 30009 - Posted: 22 Apr 2017, 18:03:42 UTC - in response to Message 30006.  

You can see the memory requirement by right-clicking a task that has been started, selecting Properties and looking at the line "Working set size". I use BoincTasks as a Boinc Manager replacement; there you can also set it to show memory and virtual memory usage directly on the Tasks tab.

Thanks for the information.
I have 4 tasks with 2 cores each running simultaneously. I have now looked up the "Working set size" for each; it is shown as 4.10 GB. However, and now comes the interesting thing, in reality each task takes close to 5 GB of memory, which seems to have to do with the fact that I set 5000 MB in the app_config.xml.
Since I still have plenty of memory (total RAM is 32 GB), just out of curiosity I'll set this value to 6500 MB and see what happens.
ID: 30009 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 807
Credit: 652,541,596
RAC: 276,109
Message 30010 - Posted: 22 Apr 2017, 22:53:59 UTC

The Working set size is calculated from your web preferences using the formula 2.6 + (0.6 x Max # CPUs).

The actual RAM usage is set in app_config.xml

I have Max # CPUs = 3 and app_config.xml set to 4900 for dual core. With these settings things match up fairly well.
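
As a rough sketch, the formula above can be tabulated for a few "Max # CPUs" values. The unit is assumed here to be GB (the post does not state it), and the function name is only illustrative.

```python
def working_set_gb(max_cpus):
    """Working set size from the web preference "Max # CPUs".

    Uses the formula quoted above; the result is assumed to be in GB.
    """
    return 2.6 + 0.6 * max_cpus


# Print the estimate for a few "Max # CPUs" settings.
for n in (1, 2, 3, 8, 10):
    print(f"Max # CPUs = {n:2d}: about {working_set_gb(n):.1f} GB")
```

With Max # CPUs = 3 this gives about 4.4, somewhat below the 4900 set in app_config.xml. With Max # CPUs = 10 it gives about 8.6, which is at least in the same region as the 9000 MB reported earlier in the thread, assuming the GB reading is right.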
ID: 30010 · Report as offensive     Reply Quote