Message boards : ATLAS application : 100% errors
Joined: 19 Oct 13 · Posts: 6 · Credit: 1,559,929 · RAC: 0
Why do all tasks finish with errors? I checked everything; I even increased the VM memory for 2 CPUs up to 4800 MB, but I still cannot get any successful task. https://lhcathome.cern.ch/lhcathome/results.php?userid=269482

EVERY task ends in either a computation error or a validation error, not only for me but for my wingmen too. Is this subproject really working? But wait, a miracle happened: I got 1 (one) successful task. Only one task out of 20. Is this rate OK for this project?

This is my app_config.xml:

    <app_config>
      <app>
        <name>ATLAS</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <avg_ncpus>2.000000</avg_ncpus>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <cmdline>--memory_size_mb 4800</cmdline>
      </app_version>
    </app_config>
Joined: 9 May 09 · Posts: 17 · Credit: 772,975 · RAC: 0
All of your tasks showing "Error while computing" seem to be victims of the problem being discussed, at length, in the thread on the "Error -161" message (per https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4224), which is a recent and recurrent problem. Whilst this seems to affect people to a greater or lesser extent, there are a number of people who are experiencing it with a significant proportion of the WUs they download (as would seem to be the case for you). In the absence of (apparent) investigations or any reports by the project team (I presume they are short-staffed due to the holidays, hence the lack of news), the jury is out on where the problem lies.

I can't speak for the ones showing "Validate error" but, certainly with your attempts to process them, the result includes a failure to generate the 50 MB+ HITS output file ... as would seem to be the case for the wingmen who are also failing with those same WUs. So there's something different that's wrong there although, as those WUs mostly seem to validate eventually, that would seem to imply a different fault.

However, this seems to be the first time (that I can recall) since the recent consolidation of ATLAS@home into the wider LHC@home that problems of this order of magnitude have been encountered, so I would hope this is a glitch and not a foreshadowing of future user experiences for ATLAS@home.

And, no, a 5% success rate on WU processing is very much not OK for this or any other project (a 5% *failure* rate is generally deemed to be just about acceptable) and is indicative of problems at either the project end, at your end ... or both!
BTW, your app_config file looks OK; mine is similar:

    <app>
      <name>ATLAS</name>
      <fraction_done_exact/>
      <max_concurrent>1</max_concurrent>
    </app>
    <app_version>
      <app_name>ATLAS</app_name>
      <avg_ncpus>4.000000</avg_ncpus>
      <plan_class>vbox64_mt_mcore_atlas</plan_class>
      <cmdline>--memory_size_mb 5700</cmdline>
    </app_version>

but this is for a sixteen-core machine using only four cores for ATLAS and with 32 GB RAM. Given your machine only seems to have two cores and 6 GB of RAM (if I've read the specs page for that computer correctly), maybe it's underpowered for running ATLAS on two cores, and you should consider cutting back to using only one core with a commensurate drop in the RAM setting.
Joined: 19 Oct 13 · Posts: 6 · Credit: 1,559,929 · RAC: 0
"Given your machine only seems to have two cores and 6GB of RAM (if I've read the specs page for that computer correctly), maybe it's underpowered for running ATLAS on two cores and you should consider cutting back to using only one core with a commensurate drop in the RAM setting."

This is the main problem, I think. A 2-core VirtualBox machine has a 4200 MB memory limit by default, and my computer has 6 GB. What's wrong with that? And what is the real memory requirement for such tasks? If 4.2 GB is not enough for 2 cores, why are the virtual machines still misconfigured? If 6 GB of total memory is not enough for a 2-core VirtualBox machine, why are these tasks still being sent? It's an old issue, dating back to when ATLAS@home was a separate project. But nobody cares. I changed app_config.xml to an avg_ncpus of 1, so let's see how it works.
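A 1-core version of that app_config.xml might look like the sketch below. The 4200 MB figure is taken from the single-core observations reported in this thread; treat it as an assumption, not an official project recommendation.

    <app_config>
      <app>
        <name>ATLAS</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <avg_ncpus>1.000000</avg_ncpus>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <cmdline>--memory_size_mb 4200</cmdline>
      </app_version>
    </app_config>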
Joined: 18 Dec 15 · Posts: 1687 · Credit: 103,066,508 · RAC: 127,019
"Whilst this seems to affect people to a greater or lesser extent, there are a number of people who are experiencing this problem with a significant proportion of WUs they download (as would seem to be the case for you)."

Same is true for me. Meanwhile, I have noticed that all "Long runners" (which, when being downloaded, show a remaining time of a few days) error out shortly after starting. The other tasks (showing a remaining time of a few hours) are all going well. So what I am doing now is: once such a "Long runner" is downloaded, I abort it immediately.
Joined: 18 Dec 16 · Posts: 123 · Credit: 37,495,365 · RAC: 0
"So what I am doing now is: once such a 'Long runner' is downloaded, I abort it immediately."

Same here, and all short runners are OK. We are the product of random evolution.
Joined: 19 Oct 13 · Posts: 6 · Credit: 1,559,929 · RAC: 0
OK, with 1 CPU it works fine. "Bad" tasks fail by themselves after a few minutes, and "good" tasks finish successfully and get validated. 4200-4800 MB of memory is not enough for 2 cores (or 6 GB of host memory is not enough for VirtualBox to run such virtual machines, I'm not sure). 4200 MB is enough for single-core VMs :)
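The host-memory angle can be sanity-checked with a quick back-of-the-envelope calculation. The VirtualBox/hypervisor overhead figure below is an illustrative assumption, not a measured or project-supplied number:

```python
# Rough headroom check for a VirtualBox guest on a small host.
# The overhead figure is an illustrative assumption, not an ATLAS project number.

def host_headroom_mb(host_total_mb: int, vm_mem_mb: int,
                     vbox_overhead_mb: int = 300) -> int:
    """Memory left for the host OS and BOINC after the VM and
    hypervisor take their share."""
    return host_total_mb - vm_mem_mb - vbox_overhead_mb

# A 6 GB host with the VM sizes discussed in this thread:
print(host_headroom_mb(6144, 4800))  # 2-core VM at 4800 MB -> 1044 MB left
print(host_headroom_mb(6144, 4200))  # default 2-core size  -> 1644 MB left
```

Whether ~1 GB of headroom is enough for the host OS plus BOINC is system-dependent, which may explain why the same settings work on some machines and not others.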
Joined: 18 Dec 15 · Posts: 1687 · Credit: 103,066,508 · RAC: 127,019
"OK, with 1 CPU it works fine."

You think the problem of the failing long runners has to do with the memory size? Well, I have 5000 MB in the app_config for 2 cores, and still they fail.
Joined: 18 Dec 16 · Posts: 123 · Credit: 37,495,365 · RAC: 0
"You think the problem of the failing long-runners has to do with the memory size?"

I have tested 2 cores with 4600 MB and 1 core with 5000 MB in app_config. The long runners fail in both cases, and the short runners occasionally fail with 4600 MB, but very few. We are the product of random evolution.
Joined: 19 Oct 13 · Posts: 6 · Credit: 1,559,929 · RAC: 0
Nope, I didn't say that. |
©2024 CERN