Task will not end despite being 100%
Author | Message |
---|---|
Send message Joined: 28 Dec 08 Posts: 318 Credit: 4,235,262 RAC: 3,879 |
I have a task that has been running 2 days and 9 hrs now. It is 100% complete but will not end. I hate to abort it, but it looks like it's time. Any suggestions? |
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,471,735 RAC: 131,946 |
"I have a task that has been running 2 days and 9 hrs now. It is 100% complete but will not end. I hate to abort it, but it looks like it's time." It's possible to shut down a VM gracefully. This will at least save the stderr.txt for further analysis. 1. Locate the task's "...\slots\x\shared" folder. 2. Create a completion trigger file there. This file doesn't need any content. Just create it. For ATLAS its name must be "atlas_done"; for Theory its name must be "shutdown". |
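The two steps above can be sketched in Python. This is a minimal sketch, not an official tool: the function name `create_completion_trigger` and the `TRIGGER_NAMES` mapping are hypothetical, and the caller has to supply the real path of the task's `shared` folder, because the slot number differs per task. Only the file names `atlas_done` and `shutdown` come from the post.

```python
from pathlib import Path

# File names taken from the post above; the mapping itself is illustrative.
TRIGGER_NAMES = {"atlas": "atlas_done", "theory": "shutdown"}


def create_completion_trigger(shared_dir: Path, app: str) -> Path:
    """Create the empty trigger file for `app` inside `shared_dir`."""
    name = TRIGGER_NAMES[app.lower()]  # raises KeyError for unknown applications
    if not shared_dir.is_dir():
        raise FileNotFoundError(f"shared folder not found: {shared_dir}")
    trigger = shared_dir / name
    trigger.touch(exist_ok=True)  # the file needs no content, it only has to exist
    return trigger
```

Call it with the folder you located in step 1, for example `create_completion_trigger(Path(shared_folder), "atlas")`, where `shared_folder` is a placeholder for that path.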
Send message Joined: 6 Oct 17 Posts: 2 Credit: 387,695 RAC: 0 |
I created this file after the ATLAS task was stuck on 100% and 0 seconds for 30+ hours. It almost immediately ended with a computation error. Of course, I got no credit for it. 3 days, 17 hours of wasted 8 cores and electricity. This is in addition to all the other ATLAS that take forever and Theory that randomly fail. I'll just use my resources more efficiently at WCG. |
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,471,735 RAC: 131,946 |
At least 1 of your computers has lots of CPUs but not enough RAM, especially to run ATLAS in the way you configured it. You may visit this page https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project and change the values "max #cpus" and "max #tasks" to "1" for the venue your computers are attached to. Then request fresh work and check if the tasks finish successfully within 12-18h (Theory only, ATLAS will be faster). Then slightly (!) raise the numbers above starting with "max #tasks" until your computers become unstable again. Use the last stable values. |
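The tuning procedure described above (start at 1 and 1, raise "max #tasks" first, stop at the last stable values) can be written down as a small search. This is only a sketch of the logic: `find_last_stable` and the `is_stable` callback are hypothetical names, and in practice "stable" means you changed the web preferences by hand, requested fresh work and watched whether the tasks finished.

```python
from typing import Callable, Tuple


def find_last_stable(
    is_stable: Callable[[int, int], bool],
    tasks_limit: int,
    cpus_limit: int,
) -> Tuple[int, int]:
    """Return the last stable (max_tasks, max_cpus) pair, raising one step at a time."""
    tasks, cpus = 1, 1
    if not is_stable(tasks, cpus):
        raise RuntimeError("host is unstable even with 1 task and 1 CPU")
    # Raise "max #tasks" first, as the post suggests.
    while tasks < tasks_limit and is_stable(tasks + 1, cpus):
        tasks += 1
    # Then raise "max #cpus", keeping the task count that was found.
    while cpus < cpus_limit and is_stable(tasks, cpus + 1):
        cpus += 1
    return tasks, cpus
```

Each call to `is_stable` stands for roughly a day of observation, so the small step size is deliberate: it keeps the number of failed tasks low.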
Send message Joined: 1 Feb 06 Posts: 66 Credit: 9,723 RAC: 0 |
Definitely, it is not a very good solution... I have plenty of tasks failing for similar "reasons". I have 8 GB of RAM and can't complete a single task if I use the laptop for internet surfing (VM gets stuck, "JOB postponed"...). |
Send message Joined: 6 Oct 17 Posts: 2 Credit: 387,695 RAC: 0 |
I have 16 GB and 24 GB of RAM so memory shouldn't be an issue. Theory consistently works on one computer and fails on the other. ATLAS is hit or miss. I will try your suggestions. Thanks! |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,261,384 RAC: 4,382 |
On your tasks, set the number of CPUs to the number of true cores you have, or less. On the Intel computer, try six or less. I am not sure the Virtual Box machines can start up with more than the number of true cores you have, not counting hyper-threading or SMT (AMD). You may be able to run a 12-CPU workunit, since it looks like you have dual processors on the Intel machine. Try setting the AMD machine to 3 CPUs. You can make these adjustments using an 'app_config.xml'. You can also check out the checklist in the number crunching forum: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161 |
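As a rough illustration of the 'app_config.xml' approach, the Python sketch below generates such a file. It assumes the usual BOINC layout with an `<app>` element (`<name>`, `<max_concurrent>`) and an `<app_version>` element (`<app_name>`, `<plan_class>`, `<avg_ncpus>`). The function name `build_app_config` is hypothetical, and the application name and plan class you pass in are assumptions until you have checked them against your own client's files.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int, max_concurrent: int) -> str:
    """Build app_config.xml text limiting CPUs per task and concurrent tasks."""
    if ncpus < 1 or max_concurrent < 1:
        raise ValueError("ncpus and max_concurrent must be at least 1")
    root = ET.Element("app_config")
    # Limit how many tasks of this application run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # Limit how many CPUs each task of this application version uses.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

The returned text would be saved as `app_config.xml` in the project's folder inside the BOINC data directory, after which the client has to re-read its config files. Following the advice above, `ncpus` would be 6 or less on the Intel machine and 3 on the AMD machine.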
Send message Joined: 28 Dec 08 Posts: 318 Credit: 4,235,262 RAC: 3,879 |
If you're talking to the other posters, then you need to mention them by name. My system uses 4 physical cores and 3 virtual threads to run BOINC. Part of thread 8 or whatever is used to manage the GPU and all system tasks. Memory is 16 gigs. Even with ATLAS I use at most 90% of my memory. But it is, as someone else pointed out, a problem in the workunits themselves. Also, the ATLAS crew sent out a command to cancel all work units I still had on my system. Now my credit is next to 0 or 0. One of these days they've got to get things straightened out. This is nuts! |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,261,384 RAC: 4,382 |
greg_be, as long as you reply to the correct post in a chain, it shows in the header of the message. Mine just showed "Message 38006 - Posted: 14 Feb 2019, 5:29:27 UTC - in response to Message 38005. " |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"But it is as someone else pointed out a problem in the workunits themselves." Well, you might have received 2 bad workunits but without more evidence and history in the form of logs from your system it's difficult to know for sure what caused those 2 to fail to stop. "Also ATLAS crew sent out a command to cancel all work units I had still on my system." It's perfectly normal and sensible for a project to cancel expired work units. If you don't know that by now then you have a lot of catching up to do. "One of these days they got to get things straightened out. This is nuts!" Wrong. You got to stop ignoring all the advice and face the fact that you're not doing what you need to do to run ATLAS. |
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
Tested 4 tasks on a Windows host; all 4 got suspended for a few hours and resumed later on. All tasks ended after 3.5 hours and were validated (vbox 5.2.26). Ignore the estimated time, but if a task does not end after 2 days it would be time to abort it. |
Send message Joined: 28 Dec 08 Posts: 318 Credit: 4,235,262 RAC: 3,879 |
"greg_be, as long as you reply to the correct post in a chain, it shows in the header of the message. Mine just showed" Yeah, I realized too late what happened. That was just a reply, not a quote. Anyway... I'm not watching this post any more. I've got another one going that's generating more detailed and specific answers. |
Send message Joined: 30 Jun 15 Posts: 1 Credit: 7,673,067 RAC: 0 |
I'm having massive issues trying to get an ATLAS task to complete without giving a COMPUTATION ERROR. The tasks are all 6-CPU tasks (I'd previously tried 8, but those never finished or errored out). The 6-CPU tasks typically run for almost 3 days on my machine. The computer is a dual E5-2683 v3 giving 28 actual cores, and RAM is 128 GB. Windows 10 Pro, Hyper-V is disabled. BOINC Manager 7.12.2 (x64), Virtual Box 6.0.4 r128413 with extension pack. It crunches through Theory Simulation (6 CPU) tasks with no issue. Please, if someone could help me out, I really do want to contribute. Thank you, Ewin |
Send message Joined: 2 Sep 04 Posts: 453 Credit: 193,569,815 RAC: 9,173 |
"I'm having massive issues trying to get an ATLAS task to complete without giving a COMPUTATION ERROR." I invite you to take a walk through my checklist. Supporting BOINC, a great concept! |
©2024 CERN