What caused this to run 24hrs and not even complete before aborting?
Author | Message |
---|---|
Joined: 28 Dec 08 Posts: 341 Credit: 5,093,469 RAC: 2,359 |
Here's the stderr file before aborting:
2021-05-02 14:27:47 (8720): Detected: vboxwrapper 26197
2021-05-02 14:27:47 (8720): Detected: BOINC client v7.7
2021-05-02 14:27:48 (8720): Detected: VirtualBox VboxManage Interface (Version: 6.1.22)
2021-05-02 14:27:49 (8720): Successfully copied 'init_data.xml' to the shared directory.
2021-05-02 14:27:51 (8720): Create VM. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 14:27:51 (8720): Setting Memory Size for VM. (14000MB)
2021-05-02 14:27:52 (8720): Setting CPU Count for VM. (8)
2021-05-02 14:27:54 (8720): Setting Chipset Options for VM.
2021-05-02 14:27:55 (8720): Setting Boot Options for VM.
2021-05-02 14:27:55 (8720): Setting Network Configuration for NAT.
2021-05-02 14:27:56 (8720): Enabling VM Network Access.
2021-05-02 14:27:56 (8720): Disabling USB Support for VM.
2021-05-02 14:27:58 (8720): Disabling COM Port Support for VM.
2021-05-02 14:27:58 (8720): Disabling LPT Port Support for VM.
2021-05-02 14:27:59 (8720): Disabling Audio Support for VM.
2021-05-02 14:27:59 (8720): Disabling Clipboard Support for VM.
2021-05-02 14:27:59 (8720): Disabling Drag and Drop Support for VM.
2021-05-02 14:28:00 (8720): Adding storage controller(s) to VM.
2021-05-02 14:28:00 (8720): Adding virtual disk drive to VM. (vm_image.vdi)
2021-05-02 14:28:01 (8720): Adding VirtualBox Guest Additions to VM.
2021-05-02 14:28:01 (8720): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2021-05-02 14:28:01 (8720): forwarding host port 53518 to guest port 80
2021-05-02 14:28:02 (8720): Enabling remote desktop for VM.
2021-05-02 14:28:02 (8720): Enabling shared directory for VM.
2021-05-02 14:28:02 (8720): Starting VM using VBoxManage interface. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 14:28:10 (8720): Successfully started VM. (PID = '30108')
2021-05-02 14:28:10 (8720): Reporting VM Process ID to BOINC.
2021-05-02 14:28:10 (8720): Guest Log: BIOS: VirtualBox 6.1.22
2021-05-02 14:28:10 (8720): Guest Log: CPUID EDX: 0x178bfbff
2021-05-02 14:28:10 (8720): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2021-05-02 14:28:10 (8720): VM state change detected. (old = 'PoweredOff', new = 'Running')
2021-05-02 14:28:10 (8720): Detected: Web Application Enabled (http://localhost:53518)
2021-05-02 14:28:10 (8720): Detected: Remote Desktop Enabled (localhost:53519)
2021-05-02 14:28:11 (8720): Preference change detected
2021-05-02 14:28:11 (8720): Setting CPU throttle for VM. (100%)
2021-05-02 14:28:11 (8720): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 900 seconds))
2021-05-02 14:28:13 (8720): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2021-05-02 14:28:13 (8720): Guest Log: BIOS: Booting from Hard Disk...
2021-05-02 14:28:16 (8720): Guest Log: BIOS: KBD: unsupported int 16h function 03
2021-05-02 14:28:16 (8720): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2021-05-02 14:28:31 (8720): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds
2021-05-02 14:28:31 (8720): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2021-05-02 14:28:41 (8720): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000327 main Log opened 2021-05-02T14:28:38.873579000Z
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000421 main OS Product: Linux
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000454 main OS Release: 3.10.0-957.27.2.el7.x86_64
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000483 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000511 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000512 main Process ID: 1218
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000512 main Package type: LINUX_64BITS_GENERIC
2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.002895 main 5.2.32 r132073 started. Verbose level = 0
2021-05-02 14:28:51 (8720): Guest Log: 00:00:10.009297 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 188 250 511 000ns (GuestNow=1 619 958 530 628 485 000 ns GuestLast=1 619 965 718 878 996 000 ns fSetTimeLastLoop=true )
2021-05-02 14:29:03 (8720): Guest Log: Checking CVMFS...
2021-05-02 14:29:05 (8720): Guest Log: CVMFS is ok
2021-05-02 16:21:59 (8720): Status Report: Elapsed Time: '6000.480204'
2021-05-02 16:21:59 (8720): Status Report: CPU Time: '334.906250'
2021-05-02 18:17:07 (8720): Status Report: Elapsed Time: '12000.833475'
2021-05-02 18:17:07 (8720): Status Report: CPU Time: '547.734375'
2021-05-02 20:06:58 (8720): Status Report: Elapsed Time: '18000.931150'
2021-05-02 20:06:58 (8720): Status Report: CPU Time: '723.515625'
2021-05-02 21:03:05 (8720): Stopping VM.
2021-05-02 21:12:59 (18076): Detected: vboxwrapper 26197
2021-05-02 21:12:59 (18076): Detected: BOINC client v7.7
2021-05-02 21:13:00 (18076): Detected: VirtualBox VboxManage Interface (Version: 6.1.22)
2021-05-02 21:13:01 (18076): Starting VM using VBoxManage interface. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 21:13:17 (18076): Successfully started VM. (PID = '31384')
2021-05-02 21:13:17 (18076): Reporting VM Process ID to BOINC.
2021-05-02 21:13:17 (18076): VM state change detected. (old = 'PoweredOff', new = 'Running')
2021-05-02 21:13:17 (18076): Detected: Web Application Enabled (http://localhost:53518)
2021-05-02 21:13:17 (18076): Detected: Remote Desktop Enabled (localhost:53519)
2021-05-02 21:13:17 (18076): Status Report: Elapsed Time: '20932.559669'
2021-05-02 21:13:17 (18076): Status Report: CPU Time: '856.109375'
2021-05-02 21:13:17 (18076): Preference change detected
2021-05-02 21:13:17 (18076): Setting CPU throttle for VM. (100%)
2021-05-02 21:13:18 (18076): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 900 seconds))
2021-05-02 22:42:16 (18076): VM state change detected. (old = 'Running', new = 'Paused')
2021-05-03 08:50:48 (18076): VM state change detected. (old = 'Paused', new = 'Running')
2021-05-03 08:50:57 (18076): Guest Log: 08:03:31.316461 timesync vgsvcTimeSyncWorker: Radical host time change: 36 521 751 000 000ns (HostNow=1 620 024 656 725 000 000 ns HostLast=1 619 988 134 974 000 000 ns)
2021-05-03 08:51:07 (18076): Guest Log: 08:03:41.320009 timesync vgsvcTimeSyncWorker: Radical guest time change: 36 521 751 916 000ns (GuestNow=1 620 024 666 728 738 000 ns GuestLast=1 619 988 144 976 822 000 ns fSetTimeLastLoop=true )
2021-05-03 09:21:21 (18076): Status Report: Elapsed Time: '26932.800935'
2021-05-03 09:21:21 (18076): Status Report: CPU Time: '900.750000'
2021-05-03 11:08:47 (18076): Status Report: Elapsed Time: '32933.526563'
2021-05-03 11:08:47 (18076): Status Report: CPU Time: '941.171875'
2021-05-03 12:54:57 (18076): Status Report: Elapsed Time: '38934.069574'
2021-05-03 12:54:57 (18076): Status Report: CPU Time: '979.281250'
2021-05-03 14:42:17 (18076): Status Report: Elapsed Time: '44934.354932'
2021-05-03 14:42:17 (18076): Status Report: CPU Time: '1040.359375'
2021-05-03 16:26:58 (18076): Status Report: Elapsed Time: '50935.264158'
2021-05-03 16:26:58 (18076): Status Report: CPU Time: '1110.390625'
2021-05-03 18:10:14 (18076): Status Report: Elapsed Time: '56935.745889'
2021-05-03 18:10:14 (18076): Status Report: CPU Time: '1180.625000'
2021-05-03 19:52:36 (18076): Status Report: Elapsed Time: '62936.101398'
2021-05-03 19:52:36 (18076): Status Report: CPU Time: '1236.484375'
It still showed a so-called 7 minutes and some odd seconds to completion, but progress was advancing by only 0.001 every 2 seconds, which means the task was effectively dead. |
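One way to read the Status Report lines in the log above is to compare CPU time with elapsed time. The following is a minimal sketch, not part of vboxwrapper or BOINC: the function name `cpu_utilisation` is made up for illustration, the core count of 8 is taken from the "Setting CPU Count for VM. (8)" line, and it assumes the reported CPU time is summed over all VM cores.

```python
import re

# First and last Status Report pairs, copied from the stderr log above.
LOG = """\
2021-05-02 16:21:59 (8720): Status Report: Elapsed Time: '6000.480204'
2021-05-02 16:21:59 (8720): Status Report: CPU Time: '334.906250'
2021-05-03 19:52:36 (18076): Status Report: Elapsed Time: '62936.101398'
2021-05-03 19:52:36 (18076): Status Report: CPU Time: '1236.484375'
"""

REPORT = re.compile(r"Status Report: (Elapsed|CPU) Time: '([\d.]+)'")


def cpu_utilisation(log_text, n_cores):
    """Return (elapsed_s, cpu_s, utilisation) for each Elapsed/CPU report pair.

    Hypothetical helper: utilisation is CPU time divided by the CPU time
    that n_cores fully busy cores would have accumulated.
    """
    results = []
    elapsed = None
    for kind, value in REPORT.findall(log_text):
        if kind == "Elapsed":
            elapsed = float(value)
        elif elapsed is not None:
            cpu = float(value)
            results.append((elapsed, cpu, cpu / (elapsed * n_cores)))
            elapsed = None
    return results


reports = cpu_utilisation(LOG, n_cores=8)
for elapsed, cpu, util in reports:
    print(f"elapsed {elapsed:9.0f} s  cpu {cpu:7.0f} s  utilisation {util:.2%}")
```

For both report pairs the utilisation comes out well under 1%, which suggests the VM was mostly idle or waiting rather than computing.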
Joined: 18 Sep 05 Posts: 2 Credit: 764,198 RAC: 0 |
I have had similar ATLAS 8-CPU work units slow to a crawl, advancing less than 0.001% every several seconds or minutes, so I aborted them. The last one: 316368943 - 164232871 - 10647415 11 May 2021, 21:24:30 UTC - 19 May 2021, 21:24:30 UTC - Timed out - no response -.00, 0.00 --- ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) - windows_x86_64 I let the work unit run until it hit 99.999% completion over a 30-hour run time, until it timed out. It was still running on the VBox machine until the very end. What is with these long-running work units? It is very expensive to let my 5.00 GHz machine run this long with no results, even though I use solar electric power from my rooftop system for the house power. I reduced the work units from 8 CPUs to 6 CPUs to see if this will make a difference. This will also allow me to run some other projects concurrently with the ATLAS tasks. Hope they work! Fred |
Joined: 28 Dec 08 Posts: 341 Credit: 5,093,469 RAC: 2,359 |
From what I was told in another thread: 4 cores and 6600 MB of memory per task. But now I have other problems. |
Joined: 15 Jun 08 Posts: 2628 Credit: 267,183,298 RAC: 128,762 |
"... to let my 5.00 GHz machine ..." The first measure should be not to run the CPU far above the manufacturer's specs. According to your computer details page it's an i7-8700K: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10647415 Intel specifies a Processor Base Frequency of 3.70 GHz and allows a Max Turbo Frequency of 4.70 GHz (single-core frequency!): https://ark.intel.com/content/www/us/en/ark/products/126684/intel-core-i7-8700k-processor-12m-cache-up-to-4-70-ghz.html |
Joined: 18 Sep 05 Posts: 2 Credit: 764,198 RAC: 0 |
It has been running at 5 GHz for most of its 4-year lifetime. I build a new computer about every 4-5 years to keep in practice. It means little if one fails now and then. And I have never had an Intel CPU fail or slow down. Why do so many computer owners worry about heat death of silicon devices? Software is the major killer of computer data and machines. Now my wife's 2-year-old computer runs at 5.3 GHz just fine but lacks a discrete GPU due to the sorry miners and the lack of chip foundries in the USA. Now to the answer for my question? Where is it? f |
Joined: 23 Sep 21 Posts: 3 Credit: 10,915 RAC: 0 |
A similar issue here. A 4 core ATLAS WU had run for 21 hours the other day, had 42 seconds left to crunch, and the deadline was four hours away, but it wasn't running. "Waiting for memory." So I suspended all other projects, boosted memory and CPU usage to 100%, ended all non-critical processes, and crossed my fingers. It ran for another 24 hours with CPU in single digits, got to zero seconds left to compute and 99.999% completion, and then ran another three hours with no change in stats. Of course by that time it was way past the deadline. I killed it and went back to crunching elsewhere. Seems a shame to waste all that computing time with no results for the project. At this point my concern is how to proceed with other four core WUs. There currently are five more of them in the queue, projected at about 5.5 hours each. What's the chance that they will screw everything up just like the last one? Should I abort them and opt out of ATLAS? Your insight, please, Ardis |
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 326 |
ATLAS normally finishes in around half a day (on one CPU). Most faults lie in the setup needed to run it. You can check with Yeti's checklist in the ATLAS folder first. |
Joined: 14 Jan 10 Posts: 1446 Credit: 9,708,961 RAC: 766 |
@Ardis: Hard to say, because your computers are hidden. In general: ATLAS and CMS need an almost continuous connection to the LHC@home infrastructure. When a task is waiting or suspended for longer than ~15 minutes, it is doomed to fail. |
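The ~15 minute rule of thumb above can be checked against a vboxwrapper stderr log by looking at the "VM state change detected" lines. This is only a sketch: the helper `long_pauses` and its `limit` parameter are hypothetical names, the 15 minute threshold is the approximate figure from the post, and the sample lines are copied from the log in the first post of this thread.

```python
import re
from datetime import datetime, timedelta

# VM state changes copied from the stderr log in the first post.
LOG = """\
2021-05-02 14:28:10 (8720): VM state change detected. (old = 'PoweredOff', new = 'Running')
2021-05-02 22:42:16 (18076): VM state change detected. (old = 'Running', new = 'Paused')
2021-05-03 08:50:48 (18076): VM state change detected. (old = 'Paused', new = 'Running')
"""

STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): VM state change detected\. "
    r"\(old = '(\w+)', new = '(\w+)'\)",
    re.MULTILINE,
)


def long_pauses(log_text, limit=timedelta(minutes=15)):
    """Return (paused_at, resumed_at, duration) for pauses longer than limit."""
    pauses = []
    paused_at = None
    for stamp, old, new in STATE_CHANGE.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if new == "Paused":
            paused_at = when
        elif old == "Paused" and paused_at is not None:
            gap = when - paused_at
            if gap > limit:
                pauses.append((paused_at, when, gap))
            paused_at = None
    return pauses


pauses = long_pauses(LOG)
for paused_at, resumed_at, gap in pauses:
    print(f"paused {paused_at} -> resumed {resumed_at} ({gap})")
```

On the sample lines this flags one pause of a little over ten hours, far beyond the threshold mentioned in the post.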
Joined: 15 Jun 08 Posts: 2628 Credit: 267,183,298 RAC: 128,762 |
Your computer(s) is/are hidden. Since it may be helpful to check the logs your computer(s) sent back to the project you may check "Should LHC@home show your computers on its web site?" at the prefs page: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project <edit> Looks like I was too slow between reading and posting. ;-) </edit> |
Joined: 23 Sep 21 Posts: 3 Credit: 10,915 RAC: 0 |
Thanks, y'all, for taking a look at this and pointing me in some good directions. Changes since my previous post:
- Read Yeti's List 18 Sep 2018
- Marked the PC running ATLAS as visible
- LeoMoon CPU-V shows two green check marks
- Verified p_vm_extensions_disabled = 0
|
Joined: 15 Jun 08 Posts: 2628 Credit: 267,183,298 RAC: 128,762 |
Some comments:

"Updated Virtual Box and Extension Pack to 6.1.28"
The computer details page currently states Virtualbox (6.1.12). This indicates
- the computer did not yet contact the server after the upgrade, or
- something went wrong (did you reboot the computer?)

"... the running time was 5:29:00. Elapsed time just ... or 19 seconds per minute."
Looking at those values (including those not quoted) is useless, especially for new computers. Why? This has been explained many times throughout the forum.

"32 GB 'Virtual Boost'"
What does "Virtual Boost" mean in this context? If you mean swap/page memory then it should not be taken into account. It would be far too slow and might sooner or later (rather sooner!) cause the VM processes to run into timeouts. The only thing that helps is real RAM.

As long as ATLAS runs using a 4-core setup it will allocate 6600 MB RAM per task for the VM. This RAM can't be used for any other process and will leave only 1.4 GB for all other things. What happens can be seen in this ATLAS logfile: https://lhcathome.cern.ch/lhcathome/result.php?resultid=330371441
There are very many restarts and finally this:
2021-10-22 16:06:54 (11328): BOINC will be notified that it needs to clean up the environment. This is a temporary problem and so this job will be rescheduled for another time.

What to do?
1. Set BOINC to "no new work"
2. Stop all tasks that are not yet started
3. Finish and report all work currently in progress.
4. Shutdown BOINC (then wait at least 1 minute)
5. Delete all folders below \slots\ (that's what the message from the log means)
6. Restart BOINC and resume the tasks; allow getting new work again

ATLAS should not run on 4 cores on that computer. It might run fine on 2 cores. |
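The memory argument above can be written out as simple arithmetic. This is a sketch under a stated assumption: the total of 8000 MB is not given in the thread and is only inferred from the post's figures (6600 MB for the VM plus about 1.4 GB left over); `ram_left_mb` is a made-up helper name.

```python
# Figure from the post: a 4-core ATLAS VM allocates 6600 MB.
ATLAS_4CORE_VM_MB = 6600

# Assumption: total RAM inferred from "6600 MB ... will leave only 1.4 GB".
TOTAL_RAM_MB = 8000


def ram_left_mb(total_mb, vm_mb, n_tasks=1):
    """RAM left for the host OS and other processes (negative = overcommitted)."""
    return total_mb - vm_mb * n_tasks


left = ram_left_mb(TOTAL_RAM_MB, ATLAS_4CORE_VM_MB)
print(f"one 4-core task : {left} MB left for everything else")

# A second concurrent 4-core task would not fit in real RAM at all.
overcommit = ram_left_mb(TOTAL_RAM_MB, ATLAS_4CORE_VM_MB, n_tasks=2)
print(f"two 4-core tasks: {overcommit} MB (negative means swapping)")
```

A negative result means the host would have to fall back on swap/page memory, which the post describes as far too slow for the VM.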
Joined: 24 Jun 10 Posts: 43 Credit: 6,187,934 RAC: 1,164 |
Greetings. It might be time to invest in another stick of RAM for that machine. Regards |
Joined: 23 Sep 21 Posts: 3 Credit: 10,915 RAC: 0 |
The maximum number of cores to use has been set to one. We'll try that for a while, then may consider going to two. |
©2025 CERN