Thread 'Error -161'

Author	Message
Dave Peachey Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0	Message 29948 - Posted: 17 Apr 2017, 12:06:49 UTC Whilst erroring out another dozen or so of these WUs this morning (allowing them to error out rather than aborting them gets them out of the system that bit quicker due to the "max # errors" setting), I noticed that all of these WUs have a common factor ... namely they all appear to contain the text string ..qnDDn7oo6G73TpABFKDmABFKDmPaIKDm.. within the WU name (at least, all of the ones I've encountered have done so). I don't know whether there are any other WUs with different name strings which are exhibiting the same problem, nor do I know whether anyone has been able to crunch any of these successfully (and, if so, under what circumstances), but that does seem to suggest a common fault with a specific WU batch. Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it. Hopefully one of the project team will be able to get on the case some time this week. ID: 29948
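
The name check described in the post above can be sketched in a few lines of Python. This is only an illustration: the marker string is the one quoted in the post (without the surrounding ".." elisions), while the function names and the sample WU names are made-up placeholders, not real work units.

```python
# Marker string quoted in the post; everything else here is hypothetical.
BATCH_MARKER = "qnDDn7oo6G73TpABFKDmABFKDmPaIKDm"


def is_suspect(wu_name: str) -> bool:
    """Return True if the WU name contains the marker of the failing batch."""
    return BATCH_MARKER in wu_name


def split_by_batch(wu_names):
    """Split WU names into (suspect, other) lists, preserving order."""
    suspect = [name for name in wu_names if is_suspect(name)]
    other = [name for name in wu_names if not is_suspect(name)]
    return suspect, other


if __name__ == "__main__":
    # Placeholder names for illustration only.
    names = [
        "placeholder_" + BATCH_MARKER + "_0",
        "placeholder_some_other_batch_0",
    ]
    suspect, other = split_by_batch(names)
    print("suspect:", suspect)
    print("other:", other)
```

A plain substring test is enough here because the post only claims that the string appears somewhere within the name.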

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 29951 - Posted: 17 Apr 2017, 17:35:53 UTC - in response to Message 29948. "Hopefully one of the project team will be able to get on the case some time this week." +1 ID: 29951

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 29952 - Posted: 17 Apr 2017, 17:51:21 UTC So what I am doing now is: once such a "Long runner" is downloaded, I abort it immediately. +1 We are the product of random evolution. ID: 29952

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 29961 - Posted: 18 Apr 2017, 15:55:13 UTC - in response to Message 29948. "Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it. Hopefully one of the project team will be able to get on the case some time this week." That's what I hope too. The way it works right now is somewhat tedious. ID: 29961

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 29970 - Posted: 19 Apr 2017, 19:01:01 UTC - in response to Message 29961. These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste). What I am surprised about is that nobody at LHC/ATLAS has noticed yet this problem which exists for almost one week. ID: 29970

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 910 Credit: 777,253,259 RAC: 181,617	Message 29971 - Posted: 19 Apr 2017, 20:16:22 UTC I assume they are on holiday. I don't have time to nanny the tasks, so I have a 53% error rate and my RAC has taken a hit :( ID: 29971

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,508,842 RAC: 124,398	Message 29972 - Posted: 19 Apr 2017, 20:20:09 UTC - in response to Message 29970. "These faulty tasks still keep coming, and I keep aborting them at the moment they are being downloaded (anything else would be pure waste). What I am surprised about is that nobody at LHC/ATLAS has noticed yet this problem which exists for almost one week." When this happens I switch my clients to other projects (I have already switched them to the primegrid-race) Supporting BOINC, a great concept ! ID: 29972

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 29974 - Posted: 20 Apr 2017, 11:12:29 UTC I run a mix of projects on LHC, being SixTrack, Theory and Atlas. I am getting about 10 of the bad Atlas a day, and they fail after 5 minutes. So that is 50 minutes a day on one out of seven cores of an i7-4790 that I have on LHC. That is about 0.5% of my CPU time, and is not worth doing anything. Or to put it another way, if I thought that another project is greater than 0.5% more valuable than that mix, or if I valued my time less than that amount, then I would do something. ID: 29974
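
The 0.5% figure in the post above follows from simple arithmetic, sketched here in Python. The inputs (10 failures a day, 5 minutes each, 7 cores) are taken from the post; the function name is just an illustrative choice, and the calculation assumes all seven cores are busy around the clock.

```python
def wasted_cpu_fraction(failures_per_day: int,
                        minutes_per_failure: float,
                        cores: int) -> float:
    """Fraction of daily core-time lost to tasks that fail."""
    wasted_minutes = failures_per_day * minutes_per_failure
    available_minutes = cores * 24 * 60  # core-minutes per day
    return wasted_minutes / available_minutes


if __name__ == "__main__":
    fraction = wasted_cpu_fraction(10, 5, 7)
    # 50 wasted minutes out of 10080 core-minutes
    print(f"{fraction:.2%}")  # prints 0.50%
```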

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 29975 - Posted: 20 Apr 2017, 11:56:34 UTC - in response to Message 29974. "... and is not worth doing anything..." On one hand, you are right, seen from the point of view of ONE single user. On the other hand, this unlucky situation has now been going on for about 1 week - and in total for all crunchers a lot of waste of CPU time plus bandwidth. ID: 29975

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 29976 - Posted: 20 Apr 2017, 12:00:38 UTC - in response to Message 29975. "On the other hand, this unlucky situation has now been going on for about 1 week - and in total for all crunchers a lot of waste of CPU time plus bandwidth." Yes, they should be looking into it, at least for their own sake if not ours. ID: 29976

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 251 Credit: 6,001,083 RAC: 0	Message 29977 - Posted: 20 Apr 2017, 12:43:51 UTC - in response to Message 29976. Thanks for your reports and help! From the BOINC point of view, we have not been able to detect any error. As far as I can see, the VM starts, but then the Panda job from ATLAS fails. I got one of those failed ones myself the other day, e.g. https://lhcathome.cern.ch/lhcathome/result.php?resultid=136096220 Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this. Meanwhile, please just ignore these tasks, there should be plenty of working ones in between. ID: 29977

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 29981 - Posted: 20 Apr 2017, 15:59:30 UTC "Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this." Thanks Nils for the clarification. "That is about 0.5% of my CPU time, and is not worth doing anything." Yes, but I had to increase the frequency of sync with the servers in order to limit it to 0.5%. With the default settings I sometimes have only 30% of the cores working until the next sync. We are the product of random evolution. ID: 29981

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 29996 - Posted: 22 Apr 2017, 4:38:16 UTC - in response to Message 29977. "Our ATLAS expert is on holidays this week, but we'll see if we can find someone to look into this." Nils - have you not yet been able to find someone to fix this problem? ID: 29996

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,502,668 RAC: 31,631	Message 30002 - Posted: 22 Apr 2017, 12:45:51 UTC I have one of these again queued up on my task list. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=65885688 It has already failed on three other hosts. I am running on one CPU core and I have set the memory in app_config.xml to 3400 MB and that is what I see as base memory in VirtualBox manager but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory. Have you seen similar memory requirements? ID: 30002

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 30003 - Posted: 22 Apr 2017, 14:07:45 UTC - in response to Message 30002. "... but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory. Have you seen similar memory requirements?" how did you find out that Boinc needs 9000 MB ? For how many cores? ID: 30003

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 30004 - Posted: 22 Apr 2017, 14:26:37 UTC - in response to Message 30002. Last modified: 22 Apr 2017, 14:28:29 UTC "but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory." Hi Harri. What setting do you have in "LHC@home preferences", under "Max # CPUs"? If you have "No limit" or "8", then BOINC will consider that the task needs around 9000 MB of memory to run (the value set by the server as required for 8 cores), even if you have set 3400 MB and 1 core in app_config.xml. If this is your situation, change the setting "Max # CPUs" to "1". We are the product of random evolution. ID: 30004
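
For readers unsure what "3400 MB and 1 core in app_config.xml" might look like, here is a rough Python sketch that writes out such a file. The element names (app_config, app_version, app_name, plan_class, avg_ncpus, cmdline) follow the usual BOINC app_config.xml layout, and --memory_size_mb is assumed to be the wrapper option used for VM memory. The app name "ATLAS" and the plan class "vbox64_mt_mcore_atlas" are assumptions that should be checked against your own client_state.xml, and the helper name is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str,
                     ncpus: int, memory_mb: int) -> str:
    """Build the text of an app_config.xml with one app_version entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumed wrapper option for the VM memory size, in MB.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # App name and plan class are assumptions, not confirmed in this thread.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 1, 3400))
```

As the post explains, a file like this does not change what BOINC itself reserves for the task; that still follows the "Max # CPUs" web preference.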

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,502,668 RAC: 31,631	Message 30005 - Posted: 22 Apr 2017, 15:20:10 UTC - in response to Message 30004. Sorry, I have screwed up. I modified the settings on LHC@home preferences. I meant to set Max # CPUs to 1 and Max # Jobs to 10 and got them the wrong way round. I have only 8 CPUs so I probably have to abort all tasks I downloaded since the change. Sorry about that... Move along. Nothing to see here... ID: 30005

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,502,668 RAC: 31,631	Message 30006 - Posted: 22 Apr 2017, 15:47:10 UTC - in response to Message 30003. "... but Boinc is seeing that it needs 9000 MB and the task has been now suspended waiting for memory. Have you seen similar memory requirements?" "how did you find out that Boinc needs 9000 MB ? For how many cores?" You can see the memory requirement when right-clicking a task that has been started, select properties and view line Working set size. I use BoincTasks as a Boinc Manager replacement, there you can set to view memory and virtual memory usage also directly on the Tasks tab. The number of CPUs is where I went wrong (see my previous post). ID: 30006

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,769,729 RAC: 55,171	Message 30009 - Posted: 22 Apr 2017, 18:03:42 UTC - in response to Message 30006. "You can see the memory requirement when right-clicking a task that has been started, select properties and view line Working set size. I use BoincTasks as a Boinc Manager replacement, there you can set to view memory and virtual memory usage also directly on the Tasks tab." Thanks for the information. I have 4 tasks with 2 cores each running simultaneously. I now looked up the "Working set size" for each; it's shown as 4.10 GB. However, and now comes the interesting thing, in reality each task takes close to 5 GB of memory, which seems to have to do with the fact that I set 5000 MB in app_config.xml. Since I still have plenty of memory (total RAM is 32 GB), just out of curiosity I'll set this value to 6500 MB and see what happens. ID: 30009

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 910 Credit: 777,253,259 RAC: 181,617	Message 30010 - Posted: 22 Apr 2017, 22:53:59 UTC The Working set size is calculated from your web preferences using the formula 2.6 + (0.6 x Max # CPUs). The actual RAM usage is set in app_config.xml. I have Max # CPUs = 3 and app_config.xml set to 4900 for dual core. With these settings things match up fairly well. ID: 30010
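
The formula quoted in the post above is easy to tabulate. The sketch below takes it exactly as stated, and assumes the result is in GB (the post gives no unit, but the comparison with an app_config.xml value of 4900 suggests it). The function name is an illustrative choice, and the numbers should be treated as an approximation rather than as the server's actual calculation.

```python
def working_set_gb(max_cpus: int) -> float:
    """Working set size per the formula in the post: 2.6 + 0.6 * Max # CPUs.

    The unit (GB) is an assumption; the post does not state it.
    """
    return 2.6 + 0.6 * max_cpus


if __name__ == "__main__":
    for cpus in range(1, 9):
        print(f"Max # CPUs = {cpus}: about {working_set_gb(cpus):.1f} GB")
```

With Max # CPUs = 3 this gives about 4.4 GB, which is in the same range as the 4900 MB the poster sets in app_config.xml.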