Message boards : ATLAS application : Error -161
Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0
Whilst erroring out another dozen or so of these WUs this morning (allowing them to error out rather than aborting them gets them out of the system that bit quicker, due to the "max # errors" setting), I noticed that all of these WUs have a common factor: they all appear to contain the text string ..qnDDn7oo6G73TpABFKDmABFKDmPaIKDm.. within the WU name (at least, all of the ones I've encountered have done so). I don't know whether there are any other WUs with different name strings exhibiting the same problem, nor do I know whether anyone has been able to crunch any of these successfully (and, if so, under what circumstances), but that does seem to suggest a common fault with a specific WU batch. Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it. Hopefully one of the project team will be able to get on the case some time this week.
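In case anyone wants to check their own queue for this batch: workunit names are recorded in client_state.xml in the BOINC data directory, so something like the following should work (the path shown is the Linux default, so adjust it for your platform):

    grep -c "qnDDn7oo6G73TpABFKDmABFKDmPaIKDm" /var/lib/boinc-client/client_state.xml

A non-zero count means at least one task from the suspect batch is on the host.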
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
Hopefully one of the project team will be able to get on the case some time this week. +1
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
So what I am doing now is: once such a "Long runner" is downloaded, I abort it immediately. +1 We are the product of random evolution.
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it. That's what I hope too. The way it works right now is somewhat tedious.
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste). What surprises me is that nobody at LHC/ATLAS has yet noticed this problem, which has now existed for almost a week.
Joined: 27 Sep 08 Posts: 807 Credit: 652,541,596 RAC: 276,109
I assume they are on holiday. I don't have time to nanny the tasks, so I have a 53% error rate and my RAC has taken a hit :(
Joined: 2 Sep 04 Posts: 453 Credit: 193,569,815 RAC: 9,173
These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste). When this happens I switch my clients to other projects (I have already switched them to the PrimeGrid race). Supporting BOINC, a great concept!
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
I run a mix of projects on LHC: SixTrack, Theory and ATLAS. I am getting about 10 of the bad ATLAS tasks a day, and they fail after 5 minutes. So that is 50 minutes a day on one of the seven cores of an i7-4790 that I have on LHC. That is about 0.5% of my CPU time, and it is not worth doing anything about. Or to put it another way: if I thought another project were more than 0.5% more valuable than that mix, or if I valued my time at less than that amount, then I would do something.
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
... and it is not worth doing anything about ... On the one hand, you are right, seen from the point of view of ONE single user. On the other hand, this unlucky situation has now been going on for about a week, and across all crunchers that adds up to a lot of wasted CPU time and bandwidth.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
On the other hand, this unlucky situation has now been going on for about a week, and across all crunchers that adds up to a lot of wasted CPU time and bandwidth. Yes, they should be looking into it, at least for their own sake if not ours.
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
Thanks for your reports and help! From the BOINC point of view, we have not been able to detect any error. As far as I can see, the VM starts, but then the Panda job from ATLAS fails. I got one of those failed ones myself the other day, e.g. https://lhcathome.cern.ch/lhcathome/result.php?resultid=136096220 Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this. Meanwhile, please just ignore these tasks; there should be plenty of working ones in between.
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this. Thanks Nils for the clarification. That is about 0.5% of my CPU time, and it is not worth doing anything about. Yes, but I had to increase the frequency of syncing with the servers in order to keep the waste down to 0.5%. With the default settings I sometimes have only 30% of the cores working until the next sync. We are the product of random evolution.
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this. Nils - have you not yet been able to find someone to fix this problem?
Joined: 28 Sep 04 Posts: 675 Credit: 43,658,342 RAC: 15,889
I have one of these queued up on my task list again. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=65885688 It has already failed on three other hosts. I am running on one CPU core and have set the memory in app_config.xml to 3400 MB, and that is what I see as base memory in the VirtualBox manager, but BOINC thinks it needs 9000 MB and the task has now been suspended, waiting for memory. Have you seen similar memory requirements?
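(For reference, a minimal sketch of the kind of override I mean, assuming the usual vboxwrapper-based ATLAS app; the app name and plan class below are what I believe the multi-core vbox application uses, so verify them against client_state.xml on your own host:

    <app_config>
      <app_version>
        <app_name>ATLAS</app_name>
        <!-- assumed plan class of the multi-core vbox ATLAS app; copy the exact value from client_state.xml -->
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <!-- run the VM on a single core -->
        <avg_ncpus>1</avg_ncpus>
        <!-- vboxwrapper option that sets the VM memory, here 3400 MB -->
        <cmdline>--memory_size_mb 3400</cmdline>
      </app_version>
    </app_config>

BOINC reads this file at startup or after "Options > Read config files".)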
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
... but BOINC thinks it needs 9000 MB and the task has now been suspended, waiting for memory. How did you find out that BOINC thinks it needs 9000 MB? For how many cores?
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
but BOINC thinks it needs 9000 MB and the task has now been suspended, waiting for memory. Hi Harri. What setting do you have in "LHC@home preferences", under "Max # CPUs"? If you have "No limit" or "8", then BOINC will consider that the task needs around 9000 MB of memory to run (the value set by the server as required for 8 cores), even if you have set 3400 MB and 1 core in app_config.xml. If this is your situation, change the setting "Max # CPUs" to "1". We are the product of random evolution.
Joined: 28 Sep 04 Posts: 675 Credit: 43,658,342 RAC: 15,889
Sorry, I have screwed up. I modified the settings in LHC@home preferences: I meant to set Max # CPUs to 1 and Max # Jobs to 10, but got them the other way round. I have only 8 CPUs, so I probably have to abort all the tasks I downloaded since the change. Sorry about that... Move along. Nothing to see here...
Joined: 28 Sep 04 Posts: 675 Credit: 43,658,342 RAC: 15,889
... but BOINC thinks it needs 9000 MB and the task has now been suspended, waiting for memory. You can see the memory requirement by right-clicking a task that has started, selecting Properties and viewing the line "Working set size". I use BoincTasks as a BOINC Manager replacement; there you can also choose to view memory and virtual memory usage directly on the Tasks tab. The number of CPUs is where I went wrong (see my previous post).
Joined: 18 Dec 15 Posts: 1689 Credit: 103,920,092 RAC: 121,920
You can see the memory requirement by right-clicking a task that has started, selecting Properties and viewing the line "Working set size". I use BoincTasks as a BOINC Manager replacement; there you can also choose to view memory and virtual memory usage directly on the Tasks tab. Thanks for the information. I have 4 two-core tasks running simultaneously. I have now looked up the "Working set size" for each; it's shown as 4.10 GB. However, and now comes the interesting thing: in reality each task takes close to 5 GB of memory, which seems to have to do with the fact that I set 5000 MB in app_config.xml. Since I still have plenty of memory (total RAM is 32 GB), just out of curiosity I'll set this value to 6500 MB and see what happens.
Joined: 27 Sep 08 Posts: 807 Credit: 652,541,596 RAC: 276,109
The working set size is calculated from your web preferences as 2.6 GB + (0.6 GB × Max # CPUs). The actual RAM usage is set in app_config.xml. I have Max # CPUs = 3 and app_config.xml set to 4900 MB for dual-core tasks. With these settings things match up fairly well.
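Worked through for my settings: 2.6 + (0.6 × 3) = 4.4 GB estimated working set, against the 4900 MB actually given to the VM via app_config.xml, which is why the numbers match up fairly well; it also lines up reasonably with the 4.10 GB working set size reported above for two-core tasks.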