Message boards : ATLAS application : What caused this to run 24hrs and not even complete before aborting?
Message board moderation

To post messages, you must log in.

AuthorMessage
greg_be

Send message
Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 44868 - Posted: 3 May 2021, 19:31:31 UTC

Heres the stderr file before aborting

2021-05-02 14:27:47 (8720): Detected: vboxwrapper 26197
2021-05-02 14:27:47 (8720): Detected: BOINC client v7.7
2021-05-02 14:27:48 (8720): Detected: VirtualBox VboxManage Interface (Version: 6.1.22)
2021-05-02 14:27:49 (8720): Successfully copied 'init_data.xml' to the shared directory.
2021-05-02 14:27:51 (8720): Create VM. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 14:27:51 (8720): Setting Memory Size for VM. (14000MB)
2021-05-02 14:27:52 (8720): Setting CPU Count for VM. (8)
2021-05-02 14:27:54 (8720): Setting Chipset Options for VM.
2021-05-02 14:27:55 (8720): Setting Boot Options for VM.
2021-05-02 14:27:55 (8720): Setting Network Configuration for NAT.
2021-05-02 14:27:56 (8720): Enabling VM Network Access.
2021-05-02 14:27:56 (8720): Disabling USB Support for VM.
2021-05-02 14:27:58 (8720): Disabling COM Port Support for VM.
2021-05-02 14:27:58 (8720): Disabling LPT Port Support for VM.
2021-05-02 14:27:59 (8720): Disabling Audio Support for VM.
2021-05-02 14:27:59 (8720): Disabling Clipboard Support for VM.
2021-05-02 14:27:59 (8720): Disabling Drag and Drop Support for VM.
2021-05-02 14:28:00 (8720): Adding storage controller(s) to VM.
2021-05-02 14:28:00 (8720): Adding virtual disk drive to VM. (vm_image.vdi)
2021-05-02 14:28:01 (8720): Adding VirtualBox Guest Additions to VM.
2021-05-02 14:28:01 (8720): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2021-05-02 14:28:01 (8720): forwarding host port 53518 to guest port 80
2021-05-02 14:28:02 (8720): Enabling remote desktop for VM.
2021-05-02 14:28:02 (8720): Enabling shared directory for VM.
2021-05-02 14:28:02 (8720): Starting VM using VBoxManage interface. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 14:28:10 (8720): Successfully started VM. (PID = '30108')
2021-05-02 14:28:10 (8720): Reporting VM Process ID to BOINC.
2021-05-02 14:28:10 (8720): Guest Log: BIOS: VirtualBox 6.1.22

2021-05-02 14:28:10 (8720): Guest Log: CPUID EDX: 0x178bfbff

2021-05-02 14:28:10 (8720): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

2021-05-02 14:28:10 (8720): VM state change detected. (old = 'PoweredOff', new = 'Running')
2021-05-02 14:28:10 (8720): Detected: Web Application Enabled (http://localhost:53518)
2021-05-02 14:28:10 (8720): Detected: Remote Desktop Enabled (localhost:53519)
2021-05-02 14:28:11 (8720): Preference change detected
2021-05-02 14:28:11 (8720): Setting CPU throttle for VM. (100%)
2021-05-02 14:28:11 (8720): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 900 seconds))
2021-05-02 14:28:13 (8720): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

2021-05-02 14:28:13 (8720): Guest Log: BIOS: Booting from Hard Disk...

2021-05-02 14:28:16 (8720): Guest Log: BIOS: KBD: unsupported int 16h function 03

2021-05-02 14:28:16 (8720): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f

2021-05-02 14:28:16 (8720): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f

2021-05-02 14:28:31 (8720): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds

2021-05-02 14:28:31 (8720): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)

2021-05-02 14:28:41 (8720): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000327 main Log opened 2021-05-02T14:28:38.873579000Z

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000421 main OS Product: Linux

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000454 main OS Release: 3.10.0-957.27.2.el7.x86_64

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000483 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000511 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000512 main Process ID: 1218

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.000512 main Package type: LINUX_64BITS_GENERIC

2021-05-02 14:28:41 (8720): Guest Log: 00:00:00.002895 main 5.2.32 r132073 started. Verbose level = 0

2021-05-02 14:28:51 (8720): Guest Log: 00:00:10.009297 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 188 250 511 000ns (GuestNow=1 619 958 530 628 485 000 ns GuestLast=1 619 965 718 878 996 000 ns fSetTimeLastLoop=true )

2021-05-02 14:29:03 (8720): Guest Log: Checking CVMFS...

2021-05-02 14:29:05 (8720): Guest Log: CVMFS is ok

2021-05-02 16:21:59 (8720): Status Report: Elapsed Time: '6000.480204'
2021-05-02 16:21:59 (8720): Status Report: CPU Time: '334.906250'
2021-05-02 18:17:07 (8720): Status Report: Elapsed Time: '12000.833475'
2021-05-02 18:17:07 (8720): Status Report: CPU Time: '547.734375'
2021-05-02 20:06:58 (8720): Status Report: Elapsed Time: '18000.931150'
2021-05-02 20:06:58 (8720): Status Report: CPU Time: '723.515625'
2021-05-02 21:03:05 (8720): Stopping VM.
2021-05-02 21:12:59 (18076): Detected: vboxwrapper 26197
2021-05-02 21:12:59 (18076): Detected: BOINC client v7.7
2021-05-02 21:13:00 (18076): Detected: VirtualBox VboxManage Interface (Version: 6.1.22)
2021-05-02 21:13:01 (18076): Starting VM using VBoxManage interface. (boinc_9ebda94bf5b1e593, slot#4)
2021-05-02 21:13:17 (18076): Successfully started VM. (PID = '31384')
2021-05-02 21:13:17 (18076): Reporting VM Process ID to BOINC.
2021-05-02 21:13:17 (18076): VM state change detected. (old = 'PoweredOff', new = 'Running')
2021-05-02 21:13:17 (18076): Detected: Web Application Enabled (http://localhost:53518)
2021-05-02 21:13:17 (18076): Detected: Remote Desktop Enabled (localhost:53519)
2021-05-02 21:13:17 (18076): Status Report: Elapsed Time: '20932.559669'
2021-05-02 21:13:17 (18076): Status Report: CPU Time: '856.109375'
2021-05-02 21:13:17 (18076): Preference change detected
2021-05-02 21:13:17 (18076): Setting CPU throttle for VM. (100%)
2021-05-02 21:13:18 (18076): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 900 seconds))
2021-05-02 22:42:16 (18076): VM state change detected. (old = 'Running', new = 'Paused')
2021-05-03 08:50:48 (18076): VM state change detected. (old = 'Paused', new = 'Running')
2021-05-03 08:50:57 (18076): Guest Log: 08:03:31.316461 timesync vgsvcTimeSyncWorker: Radical host time change: 36 521 751 000 000ns (HostNow=1 620 024 656 725 000 000 ns HostLast=1 619 988 134 974 000 000 ns)

2021-05-03 08:51:07 (18076): Guest Log: 08:03:41.320009 timesync vgsvcTimeSyncWorker: Radical guest time change: 36 521 751 916 000ns (GuestNow=1 620 024 666 728 738 000 ns GuestLast=1 619 988 144 976 822 000 ns fSetTimeLastLoop=true )

2021-05-03 09:21:21 (18076): Status Report: Elapsed Time: '26932.800935'
2021-05-03 09:21:21 (18076): Status Report: CPU Time: '900.750000'
2021-05-03 11:08:47 (18076): Status Report: Elapsed Time: '32933.526563'
2021-05-03 11:08:47 (18076): Status Report: CPU Time: '941.171875'
2021-05-03 12:54:57 (18076): Status Report: Elapsed Time: '38934.069574'
2021-05-03 12:54:57 (18076): Status Report: CPU Time: '979.281250'
2021-05-03 14:42:17 (18076): Status Report: Elapsed Time: '44934.354932'
2021-05-03 14:42:17 (18076): Status Report: CPU Time: '1040.359375'
2021-05-03 16:26:58 (18076): Status Report: Elapsed Time: '50935.264158'
2021-05-03 16:26:58 (18076): Status Report: CPU Time: '1110.390625'
2021-05-03 18:10:14 (18076): Status Report: Elapsed Time: '56935.745889'
2021-05-03 18:10:14 (18076): Status Report: CPU Time: '1180.625000'
2021-05-03 19:52:36 (18076): Status Report: Elapsed Time: '62936.101398'
2021-05-03 19:52:36 (18076): Status Report: CPU Time: '1236.484375'

It still had a so called 7 mins and some odd seconds to completion, but the completion rate was .001 per 2 seconds which is a dead task.
ID: 44868 · Report as offensive     Reply Quote
FredK

Send message
Joined: 18 Sep 05
Posts: 2
Credit: 764,198
RAC: 0
Message 44986 - Posted: 21 May 2021, 3:20:53 UTC

I have had similar ATLAS 8 CPU work units slow to a crawl to less than 0.001% per several seconds or minutes, so I aborted them. The last one:
316368943 - 164232871 - 10647415 11 May 2021, 21:24:30 UTC - 19 May 2021, 21:24:30 UTC - Timed out - no response -.00, 0.00 --- ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) - windows_x86_64
I let the work unit run until it hit 99.999% completion over a 30 hour run time until it timed out. It was still running on the VBox machine until the very end.
.
What is with these long running work units. It is very expensive to let my 5.00 GHz machine run this long with no results, even though I use Solar Electric power from my roof top system for the house power.

I reduced the 8 CPU to 6 CPU work units to see if this will make a difference. This will allow me to also run some other projects concurrently with the ATLAS tasks. Hope they work!

Fred
ID: 44986 · Report as offensive     Reply Quote
greg_be

Send message
Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 44988 - Posted: 21 May 2021, 5:58:31 UTC - in response to Message 44986.  

From what I was told in another thread. 4 cores and 6600 MB of memory per task.
But now I have other problems.
ID: 44988 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,906,433
RAC: 138,030
Message 44991 - Posted: 21 May 2021, 6:55:50 UTC - in response to Message 44986.  

... to let my 5.00 GHz machine ...

The first measure should be not to run the CPU far above the manufacturer specs.
According to your computer details page it's an i7-8700K:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10647415

Intel specifies a Processor Base Frequency of 3.70 GHz and allows a Max Turbo Frequency of 4.70 GHz (single core frequency!):
https://ark.intel.com/content/www/us/en/ark/products/126684/intel-core-i7-8700k-processor-12m-cache-up-to-4-70-ghz.html
ID: 44991 · Report as offensive     Reply Quote
FredK

Send message
Joined: 18 Sep 05
Posts: 2
Credit: 764,198
RAC: 0
Message 44996 - Posted: 22 May 2021, 3:56:39 UTC - in response to Message 44991.  

It has been running at 5GHz for most of its 4 year lifetime. I build a new computer about every 4-5 years to keep in practice. It means little if one fails now and then. And I have never had an Intel CPU fail or slow down.
Why do so many computer owners worry about heat death of silicon devices. Software is the major killer of computer data and machines.
Now my wife's 2 year old computer runs at 5.3 GHz just fine but lacks a descrete GPU due to the sorry miners and lack of Chip founderies in the USA.
Now to the answer for my question? Where is it?
f
ID: 44996 · Report as offensive     Reply Quote
Ardis

Send message
Joined: 23 Sep 21
Posts: 3
Credit: 10,915
RAC: 0
Message 45524 - Posted: 23 Oct 2021, 22:22:52 UTC

A similar issue here. A 4 core ATLAS WU had run for 21 hours the other day, had 42 seconds left to crunch, and the deadline was four hours away, but it wasn't running. "Waiting for memory." So I suspended all other projects, boosted memory and CPU usage to 100%, ended all non-critical processes, and crossed my fingers. It ran for another 24 hours with CPU in single digits, got to zero seconds left to compute and 99.999% completion, and then ran another three hours with no change in stats. Of course by that time it was way past the deadline. I killed it and went back to crunching elsewhere. Seems a shame to waste all that computing time with no results for the project.

At this point my concern is how to proceed with other four core WUs. There currently are five more of them in the queue, projected at about 5.5 hours each. What's the chance that they will screw everything up just like the last one? Should I abort them and opt out of ATLAS?

Your insight, please,

Ardis
ID: 45524 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,087,008
RAC: 104,209
Message 45525 - Posted: 23 Oct 2021, 22:44:16 UTC - in response to Message 45524.  

Atlas finishing in around half a day, normally. (one CPU)
The most faults are in definitions to run it.
You can check with Yeti's checklist in the Atlas folder first.
ID: 45525 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45526 - Posted: 24 Oct 2021, 6:10:07 UTC

@Ardis: Hard to say, cause your computers are hidden.
In general: ATLAS and CMS need an almost continuous connection to LHC@home infrastructure.
When a task is waiting or suspended for longer than ~15 minutes it's deemed to fail.
ID: 45526 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,906,433
RAC: 138,030
Message 45527 - Posted: 24 Oct 2021, 6:25:46 UTC - in response to Message 45524.  
Last modified: 24 Oct 2021, 6:28:18 UTC

Your computer(s) is/are hidden.
Since it may be helpful to check the logs your computer(s) sent back to the project you may check "Should LHC@home show your computers on its web site?" at the prefs page:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

<edit>
Looks like I was too slow between reading and posting.
;-)
</edit>
ID: 45527 · Report as offensive     Reply Quote
Ardis

Send message
Joined: 23 Sep 21
Posts: 3
Credit: 10,915
RAC: 0
Message 45529 - Posted: 25 Oct 2021, 0:46:00 UTC - in response to Message 45527.  
Last modified: 25 Oct 2021, 0:51:13 UTC

Thanks, y'all, for taking a look at this and pointing me in some good directions. Changes since my previous post:


    Updated Virtual Box and Extension Pack to 6.1.28
    Read Yeti's List 18 Sep 2018
    Marked the PC running Atlas as visible
    Leo Moon CPU-V shows two green check marks
    verified p_vm_extensions_disabled = 0



Currently one Atlas 4 CPU WU is running, all other tasks have been suspended. Before it started, the running time was 5:29:00. Elapsed time just went over six hours, with remaining time 3:00 and a few seconds. The remaining time decrements one second for about every 3-4 seconds of elapsed time, or 19 seconds per minute.

A concern seems to be memory. This "mature" box has four cores and 8 GB physical memory + 32 GB "Virtual Boost". Is that going to be enough to float a 4 CPU WU?

Ardis

ID: 45529 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,906,433
RAC: 138,030
Message 45530 - Posted: 25 Oct 2021, 5:58:12 UTC - in response to Message 45529.  

Some comments

Updated Virtual Box and Extension Pack to 6.1.28

The computer details page currently states Virtualbox (6.1.12).
This indicates
- the computer did not yet contact the server after the upgrade or
- something went wrong (did you reboot the computer?)

... the running time was 5:29:00. Elapsed time just ... or 19 seconds per minute.

Looking at those values (including those not quoted) is useless, especially for new computers.
Why? This has been explained many times throughout the forum.


32 GB "Virtual Boost"

What does "Virtual Boost" mean in this context?
If you mean swap/page memory then it should not be taken into account.
It would be far too slow and might sooner or later (more sooner!) cause the VM processes to run into timeouts.
The only thing that helps is real RAM.

As long as ATLAS runs using a 4-core setup it will allocate 6600 MB RAM per task for the VM.
This RAM can't be used for any other process and will leave only 1.4 GB for all other things.

What happens can be seen in this ATLAS logfile:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=330371441

There are very many restarts and finally this:
2021-10-22 16:06:54 (11328): 
   BOINC will be notified that it needs to clean up the environment.
 		    This is a temporary problem and so this job will be rescheduled for another time.

What to do?
1. Set BOINC to "no new work"
2. Stop all tasks that are not yet started
3. Finish and report all work currently in progress.
4. Shutdown BOINC (then wait at least 1 minute)
5. Delete all folders below \slots\ (that's what the message from the log means)
6. Restart BOINC and resume the tasks; allow getting new work again

ATLAS should not run on 4 cores on that computer.
It might run fine on 2 cores.
ID: 45530 · Report as offensive     Reply Quote
Profile tazzduke

Send message
Joined: 24 Jun 10
Posts: 39
Credit: 4,971,347
RAC: 4,992
Message 45533 - Posted: 25 Oct 2021, 8:20:25 UTC - in response to Message 45530.  

Greetings

Might be an time to invest in another stick of ram for that machine,

Regards
ID: 45533 · Report as offensive     Reply Quote
Ardis

Send message
Joined: 23 Sep 21
Posts: 3
Credit: 10,915
RAC: 0
Message 45558 - Posted: 26 Oct 2021, 14:53:57 UTC

The maximum number of cores to use has been set to one. We'll try that for a while, then may consider going to two.
ID: 45558 · Report as offensive     Reply Quote

Message boards : ATLAS application : What caused this to run 24hrs and not even complete before aborting?


©2024 CERN