Message boards : ATLAS application : Another crappy task

greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41133 - Posted: 1 Jan 2020, 12:35:14 UTC
Last modified: 1 Jan 2020, 13:01:06 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=256978465

This is crazy!
It burns through a ton of CPU time at the start of the task, and at the very end it craps out and stalls.
In 13 hrs it has used only 222 units of CPU time.
In the last hour of checkpoints it used 49 units of CPU time!
What is going on? There is no more FAH on the CPU to interfere; all other CPU projects are BOINC based.
A script restricts ATLAS to 4 cores, which someone said would make it more stable, but here I am again with a bogged-down unit at 99.999% done, with a theoretical 1 second left. I went to bed when it had 4 seconds to go, got up 8 hrs later, and it's stuck on 1 second. In the last 3 hours it has not moved at all.
CPU time was only 2% before it crashed, while I was trying to force it to complete by allocating all resources to it.

What is going on with these ATLAS tasks? I complete a whole bunch of them ok and then I get one or two that crap out. This is not normal.


2019-12-30 16:36:46 (12132): Detected: vboxwrapper 26197
2019-12-30 16:36:46 (12132): Detected: BOINC client v7.7
2019-12-30 16:36:47 (12132): Detected: VirtualBox VboxManage Interface (Version: 6.1.0)
2019-12-30 16:36:47 (12132): Successfully copied 'init_data.xml' to the shared directory.
2019-12-30 16:36:49 (12132): Create VM. (boinc_956c7a0da2235ae1, slot#15)
2019-12-30 16:36:50 (12132): Setting Memory Size for VM. (2241MB)
2019-12-30 16:36:50 (12132): Setting CPU Count for VM. (4)
2019-12-30 16:36:50 (12132): Setting Chipset Options for VM.
2019-12-30 16:36:51 (12132): Setting Boot Options for VM.
2019-12-30 16:36:51 (12132): Setting Network Configuration for NAT.
2019-12-30 16:36:51 (12132): Enabling VM Network Access.
2019-12-30 16:36:52 (12132): Disabling USB Support for VM.
2019-12-30 16:36:52 (12132): Disabling COM Port Support for VM.
2019-12-30 16:36:52 (12132): Disabling LPT Port Support for VM.
2019-12-30 16:36:53 (12132): Disabling Audio Support for VM.
2019-12-30 16:36:53 (12132): Disabling Clipboard Support for VM.
2019-12-30 16:36:54 (12132): Disabling Drag and Drop Support for VM.
2019-12-30 16:36:54 (12132): Adding storage controller(s) to VM.
2019-12-30 16:36:54 (12132): Adding virtual disk drive to VM. (vm_image.vdi)
2019-12-30 16:36:55 (12132): Adding VirtualBox Guest Additions to VM.
2019-12-30 16:36:55 (12132): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2019-12-30 16:36:55 (12132): forwarding host port 49906 to guest port 80
2019-12-30 16:36:56 (12132): Enabling remote desktop for VM.
2019-12-30 16:36:56 (12132): Enabling shared directory for VM.
2019-12-30 16:36:57 (12132): Starting VM using VBoxManage interface. (boinc_956c7a0da2235ae1, slot#15)
2019-12-30 16:37:02 (12132): Successfully started VM. (PID = '11112')
2019-12-30 16:37:02 (12132): Reporting VM Process ID to BOINC.
2019-12-30 16:37:02 (12132): Guest Log: BIOS: VirtualBox 6.1.0

2019-12-30 16:37:02 (12132): Guest Log: CPUID EDX: 0x178bfbff

2019-12-30 16:37:02 (12132): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

2019-12-30 16:37:02 (12132): VM state change detected. (old = 'PoweredOff', new = 'Running')
2019-12-30 16:37:02 (12132): Detected: Web Application Enabled (http://localhost:49906)
2019-12-30 16:37:02 (12132): Detected: Remote Desktop Enabled (localhost:49907)
2019-12-30 16:37:02 (12132): Preference change detected
2019-12-30 16:37:02 (12132): Setting CPU throttle for VM. (100%)
2019-12-30 16:37:02 (12132): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 900 seconds))
2019-12-30 16:37:04 (12132): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

2019-12-30 16:37:04 (12132): Guest Log: BIOS: Booting from Hard Disk...

2019-12-30 16:37:07 (12132): Guest Log: BIOS: KBD: unsupported int 16h function 03

2019-12-30 16:37:07 (12132): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f

2019-12-30 16:37:07 (12132): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f

2019-12-30 16:37:15 (12132): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds

2019-12-30 16:37:15 (12132): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)

2019-12-30 16:37:19 (12132): Guest Log: Checking CVMFS...

2019-12-30 16:37:27 (12132): Guest Log: CVMFS is ok

2019-12-30 16:37:27 (12132): Guest Log: Mounting shared directory

2019-12-30 16:37:27 (12132): Guest Log: Copying input files

2019-12-30 16:37:30 (12132): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001115 main Log opened 2019-12-30T16:37:28.657561000Z

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001292 main OS Product: Linux

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001332 main OS Release: 3.10.0-957.27.2.el7.x86_64

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001368 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001407 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001409 main Process ID: 1730

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.001409 main Package type: LINUX_64BITS_GENERIC

2019-12-30 16:37:30 (12132): Guest Log: 00:00:00.002121 main 5.2.32 r132073 started. Verbose level = 0

2019-12-30 16:37:32 (12132): Guest Log: Copied input files into RunAtlas.

2019-12-30 16:37:33 (12132): Guest Log: copied the webapp to /var/www

2019-12-30 16:37:34 (12132): Guest Log: This vm does not need to setup an http proxy

2019-12-30 16:37:34 (12132): Guest Log: ATHENA_PROC_NUMBER=4

2019-12-30 16:37:34 (12132): Guest Log: *** Starting ATLAS job. (PandaID=4592658421 taskID=20172820) ***

2019-12-30 16:37:40 (12132): Guest Log: 00:00:10.004789 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 680 980 000ns (GuestNow=1 577 720 259 979 102 000 ns GuestLast=1 577 723 848 660 082 000 ns fSetTimeLastLoop=true )

2019-12-30 18:17:51 (12132): Status Report: Elapsed Time: '6000.000000'
2019-12-30 18:17:51 (12132): Status Report: CPU Time: '11427.375000'
2019-12-30 19:58:48 (12132): Status Report: Elapsed Time: '12000.000000'
2019-12-30 19:58:48 (12132): Status Report: CPU Time: '23327.625000'
2019-12-30 21:39:03 (12132): Status Report: Elapsed Time: '18000.000000'
2019-12-30 21:39:03 (12132): Status Report: CPU Time: '26047.875000'
2019-12-30 23:19:11 (12132): Status Report: Elapsed Time: '24000.000000'
2019-12-30 23:19:11 (12132): Status Report: CPU Time: '26094.828125'
2019-12-31 00:59:16 (12132): Status Report: Elapsed Time: '30000.000000'
2019-12-31 00:59:16 (12132): Status Report: CPU Time: '26141.718750'
2019-12-31 02:39:23 (12132): Status Report: Elapsed Time: '36000.000000'
2019-12-31 02:39:23 (12132): Status Report: CPU Time: '26186.859375'
2019-12-31 04:19:32 (12132): Status Report: Elapsed Time: '42000.000000'
2019-12-31 04:19:32 (12132): Status Report: CPU Time: '26202.296875'
2019-12-31 05:59:42 (12132): Status Report: Elapsed Time: '48000.000000'
2019-12-31 05:59:42 (12132): Status Report: CPU Time: '26217.953125'
2019-12-31 07:39:51 (12132): Status Report: Elapsed Time: '54000.000000'
2019-12-31 07:39:51 (12132): Status Report: CPU Time: '26232.343750'
2019-12-31 09:20:00 (12132): Status Report: Elapsed Time: '60000.000000'
2019-12-31 09:20:00 (12132): Status Report: CPU Time: '26246.859375'
2019-12-31 11:00:10 (12132): Status Report: Elapsed Time: '66000.000000'
2019-12-31 11:00:10 (12132): Status Report: CPU Time: '26261.671875'
2019-12-31 12:40:18 (12132): Status Report: Elapsed Time: '72000.000000'
2019-12-31 12:40:18 (12132): Status Report: CPU Time: '26295.250000'
2019-12-31 14:20:24 (12132): Status Report: Elapsed Time: '78000.000000'
2019-12-31 14:20:24 (12132): Status Report: CPU Time: '26340.484375'
2019-12-31 16:00:30 (12132): Status Report: Elapsed Time: '84000.508299'
2019-12-31 16:00:30 (12132): Status Report: CPU Time: '26386.843750'
2019-12-31 17:40:40 (12132): Status Report: Elapsed Time: '90000.508299'
2019-12-31 17:40:40 (12132): Status Report: CPU Time: '26435.671875'
2019-12-31 19:20:47 (12132): Status Report: Elapsed Time: '96000.508299'
2019-12-31 19:20:47 (12132): Status Report: CPU Time: '26483.359375'
2019-12-31 21:00:54 (12132): Status Report: Elapsed Time: '102000.508299'
2019-12-31 21:00:54 (12132): Status Report: CPU Time: '26522.484375'
2019-12-31 22:41:06 (12132): Status Report: Elapsed Time: '108000.508299'
2019-12-31 22:41:06 (12132): Status Report: CPU Time: '26538.937500'
2020-01-01 00:21:17 (12132): Status Report: Elapsed Time: '114000.508299'
2020-01-01 00:21:17 (12132): Status Report: CPU Time: '26554.515625'
2020-01-01 02:01:28 (12132): Status Report: Elapsed Time: '120000.508299'
2020-01-01 02:01:28 (12132): Status Report: CPU Time: '26571.078125'
2020-01-01 03:41:39 (12132): Status Report: Elapsed Time: '126000.508299'
2020-01-01 03:41:39 (12132): Status Report: CPU Time: '26587.171875'
2020-01-01 05:21:48 (12132): Status Report: Elapsed Time: '132000.508299'
2020-01-01 05:21:48 (12132): Status Report: CPU Time: '26601.343750'
2020-01-01 07:01:54 (12132): Status Report: Elapsed Time: '138000.508299'
2020-01-01 07:01:54 (12132): Status Report: CPU Time: '26638.953125'
2020-01-01 08:41:58 (12132): Status Report: Elapsed Time: '144000.508299'
2020-01-01 08:41:58 (12132): Status Report: CPU Time: '26681.843750'
2020-01-01 10:22:03 (12132): Status Report: Elapsed Time: '150000.508299'
2020-01-01 10:22:03 (12132): Status Report: CPU Time: '26727.750000'
2020-01-01 12:02:11 (12132): Status Report: Elapsed Time: '156000.508299'
2020-01-01 12:02:11 (12132): Status Report: CPU Time: '26776.796875'
ID: 41133
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 41149 - Posted: 3 Jan 2020, 23:19:45 UTC - in response to Message 41133.  
Last modified: 3 Jan 2020, 23:48:16 UTC

Your task has most likely stalled. Avoid running other processes like FAH alongside it; the load you see is not a correct measurement when several other processes are running. The CPU can easily suffer if the task doesn't get the cores/threads reserved for it.
This includes I/O on disk and RAM. Starting up a VM needs more than what is set in BOINC, because the work done on start/stop/save is not counted against the task and shows up as system load.

On this specific task BOINC only gets 2241 MB, which is far too low: the old ATLAS requires 2600 MB for 1 core and the new application recommends 3000 MB for one core. In this case the task struggles at startup to boot and get the script running. You would not see any ATLAS.py process running, as it can't get enough memory to even start. Those few seconds are probably the attempt to start ATLAS before it stalled.

2019-12-30 16:36:50 (12132): Setting Memory Size for VM. (2241MB)
2019-12-30 16:36:50 (12132): Setting CPU Count for VM. (4)

Each added core needs somewhere around 800-1000 MB.

If you would like to use app_config, I suggest not including a RAM setting and letting the application pick what it needs. Changes to RAM only take effect for newly downloaded tasks; if you update BOINC Manager, only the core count changes.

VirtualBox has its own issues and errors, and mixed with LHC it can be hard to pin down what the problem is, but LHC provides good logs, and the extension can pull out a lot of useful info. The suggestion of a 4-core app_config is good; I had a better experience running on 4 cores than on the default 12 on VirtualBox, and I got fewer (error 195) failures using app_config.xml. But running on VirtualBox I had to use 8000 MB as a minimum for a 4-core task on the new application to have it run somewhat stably. So running the default of 12 cores with 11000 MB, or whatever the default requires, is probably better for most users, as RAM usage would be lower, with less load on disk, CPU and RAM.

The task progress shown in BOINC Manager comes from the wrapper, which just fetches info from the VM. Any estimated time is based only on device FLOPS, i.e. what your CPU could or should be able to do in that time. If the FLOPS calculation is off target, the estimate could be days off.
Never trust the estimate on the first batch of tasks your BOINC Manager downloads, or after you make changes to app_config.xml.

If you want an estimated time, use the one in the console of each VM task. It gives a much better, though not perfect, picture of time, errors and load.
ID: 41149
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41151 - Posted: 4 Jan 2020, 1:25:24 UTC - in response to Message 41149.  
Last modified: 4 Jan 2020, 1:48:19 UTC

I think I found the problem. The amount of memory allocated in VBOX is less than what the task needs. Even VBOX points this out, which is weird as everything is automatic. But I'll boost it higher and see if that helps.

I don't know how much the OC plays into these problems. I can run BOINC at 40.75 without crashing anything. My temps stay within their max limits. I have lowered it to 40.50 now to see if that helps the next time ATLAS comes to my system and starts running.

GRRR - Something weird happened. The system went down with an IRQL Not Less or Equal error, and I lost one task that was slowing down again in the 95% range and another that was waiting. I'll have to check the drivers on the system. It is always something with ATLAS. I can rarely complete a task on this project despite having a fully capable system. It is becoming annoying as hell!
ID: 41151
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 41152 - Posted: 4 Jan 2020, 1:50:41 UTC - in response to Message 41151.  

If that task had the old low-RAM setting, it was doomed to fail from the start. Get the settings and app_config right, and only then allow new tasks. Only new tasks will be able to validate.

If you want a stable environment, I suggest running on Linux, where you can run natively instead.
ID: 41152
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41158 - Posted: 4 Jan 2020, 8:55:54 UTC - in response to Message 41152.  

Remember though, I said the VBOX memory allocation was low.
Windows memory is more than enough; it could be tweaked a little higher, but VBOX assigned the Windows memory allocation.

When I poked around in VBOX and found the specific settings for the task, there was a message on the screen saying the memory was too low. Again, VBOX, not Windows.

So how do you push VBOX to assign more memory automatically?
ID: 41158
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41159 - Posted: 4 Jan 2020, 11:07:03 UTC

For now, I will try this

<app_config>
<project_max_concurrent>2</project_max_concurrent>
<app_version>
<app_name>ATLAS</app_name>
<version_num>100</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<max_ncpus>4.000000</max_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<api_version>7.7.0</api_version>
<cmdline>--memory_size_mb 7500</cmdline>
<dont_throttle/>
<is_wrapper/>
<needs_network/>
</app_version>
</app_config>

According to the person posting this, it has been tried and tested and is ok.
ID: 41159
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,969,164
RAC: 136,720
Message 41162 - Posted: 4 Jan 2020, 17:11:17 UTC - in response to Message 41159.  

I don't see any post from an experienced user within the last year that suggested an app_config.xml like this:
<app_config>
<project_max_concurrent>2</project_max_concurrent>
<app_version>
<app_name>ATLAS</app_name>
<version_num>100</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<max_ncpus>4.000000</max_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<api_version>7.7.0</api_version>
<cmdline>--memory_size_mb 7500</cmdline>
<dont_throttle/>
<is_wrapper/>
<needs_network/>
</app_version>
</app_config>

This file includes lots of tags that will simply be ignored by your BOINC client, e.g. max_ncpus, is_wrapper,... .
I'm curious where you got this from.
Could you post a link to the source?
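
A rough way to check a file like the one above before using it is sketched below in Python. The set of recognised tags is an assumption based on the BOINC client configuration documentation, not on this thread (the thread only confirms that max_ncpus and is_wrapper are ignored), and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the tags the BOINC client reads inside <app_version> in
# app_config.xml. Check the BOINC client configuration docs before relying on it.
KNOWN_APP_VERSION_TAGS = {"app_name", "plan_class", "avg_ncpus", "ngpus", "cmdline"}


def ignored_app_version_tags(xml_text):
    """Return the tags inside each <app_version> that are not in the known set."""
    root = ET.fromstring(xml_text)
    ignored = []
    for app_version in root.iter("app_version"):
        for child in app_version:
            if child.tag not in KNOWN_APP_VERSION_TAGS:
                ignored.append(child.tag)
    return ignored


# The app_config.xml quoted in this post.
posted = """<app_config>
<project_max_concurrent>2</project_max_concurrent>
<app_version>
<app_name>ATLAS</app_name>
<version_num>100</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<max_ncpus>4.000000</max_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<api_version>7.7.0</api_version>
<cmdline>--memory_size_mb 7500</cmdline>
<dont_throttle/>
<is_wrapper/>
<needs_network/>
</app_version>
</app_config>"""

print(ignored_app_version_tags(posted))
```

Anything the sketch reports is only a candidate for removal; the client's own startup messages remain the authority on what it actually accepted.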



Besides that, lots of your other posts mention suggestions from "this guy" or "that guy" that are all "tested", but none of them can be checked as links to the sources are always missing.
This makes it nearly impossible to work out where you got lost.


And, yes, you obviously got lost, as some of the recent logs show:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=257084193
2020-01-01 16:11:48 (16604): Setting Memory Size for VM. (2241MB)
2020-01-01 16:11:48 (16604): Setting CPU Count for VM. (4)

ATLAS vbox requires at least 3900 MB RAM for a 1-core setup.
An n-core setup requires RAM according to this formula:
3000 + 900 * n_cores => a 4-core setup requires 6600 MB.
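
As a quick worked example of the formula above, here is a minimal Python sketch. The function name is hypothetical; the constants are the ones quoted in this post, and the 2241 MB figure is the value from the log.

```python
def atlas_vbox_ram_mb(n_cores):
    """RAM in MB for an ATLAS vbox task, using the formula 3000 + 900 * n_cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 3000 + 900 * n_cores


configured_mb = 2241  # "Setting Memory Size for VM. (2241MB)" from the log
required_mb = atlas_vbox_ram_mb(4)
print(required_mb, required_mb - configured_mb)  # required RAM and the shortfall
```

For the 4-core VM in the log this gives a shortfall of several GB, which fits the stalls described in this thread.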

VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)

AMD-V (would be VT-x on intel) must be enabled in your BIOS as mentioned a couple of times.


https://lhcathome.cern.ch/lhcathome/result.php?resultid=257045774
Another VirtualBox management application has locked the session for
this VM. BOINC cannot properly monitor this VM
and so this job will be aborted.

Looks like HyperV is running concurrently with VirtualBox.
HyperV must be disabled if you plan to use VirtualBox.
NOTE: VM session lock error encountered.
BOINC will be notified that it needs to clean up the environment.
This might be a temporary problem and so this job will be rescheduled for another time.

Looks like some crashes left garbage on the disk.
Restart your computer without BOINC and use your VirtualBox GUI to remove inaccessible VMs before you restart BOINC.



greg_be wrote:
I don't knwo how much OC plays in these problems. I can run BOINC at 40.75 without crashing anything. My temps stay within in their max limits. I lowered it down to 40.50 now to see if that helps anything, the next time ATLAS comes to my system and starts running.

Try to get a stable system at stock settings.
ID: 41162
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41163 - Posted: 4 Jan 2020, 19:26:44 UTC - in response to Message 41162.  
Last modified: 4 Jan 2020, 19:28:43 UTC

VBOX takes 10,200 MB for 4 cores at stock settings, plus another 5,5xx in virtual memory (sorry, I don't recall the precise figure).
AMD-V IS enabled. Not sure how it got disabled. Lots of weird stuff happening lately.
I checked the Hyper-V settings: not enabled. Again, I don't know how that happened. It's not something I mess with.

The app config came from a 2017 post here on LHC: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4137#29016

I found this via a google search.
ID: 41163
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41164 - Posted: 4 Jan 2020, 19:42:08 UTC
Last modified: 4 Jan 2020, 19:43:10 UTC

I had this given to me by user Saigon (Has been with LHC since 2012) in one of my other threads about slow tasks.
So what would I use from the script below to boost memory?



<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>3</max_concurrent> <!-- I usually set this to 1, or no more than 2, so I don't have them hanging around in the queue. I just want to run and finish 1 or 2 at a time. -->
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>4.0</avg_ncpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
</app_config>
ID: 41164
Combat Marmot

Joined: 8 Apr 12
Posts: 1
Credit: 312,983
RAC: 480
Message 41166 - Posted: 5 Jan 2020, 11:04:20 UTC

I had the same problem with task:
ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
J0wMDmFPy7vn9Rq4apoT9bVoABFKDmABFKDmvNPXDmABFKDml6WuHm

It was an 8-CPU job that ran for 14 hours and seemed to be asymptotically approaching 100%. I aborted it in the end. It also appeared to be under-utilising the processors that it had "reserved", which is unfortunate for all the other tasks.

For the time being I've changed my LHC preferences to Max # CPUs = 4. In future I'll keep a closer eye on jobs and abort them if they run 25-50% over their initial estimation.
ID: 41166
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41167 - Posted: 5 Jan 2020, 12:57:31 UTC - in response to Message 41166.  

For me they run fine (but show only 4-6 seconds of actual CPU time) until they reach the 90% range, then they slow down. At around 97% they slow to .00010% every 2-4 seconds, sometimes a little more. I give it a day to a day and a half to get past that, but usually it's useless. So at around 98 or 99.xxx% I have to abort the task because it will not finish.

I hope to get an answer on where to put that memory boost in the script; otherwise I will try it myself and hope that solves the issue. It seems like VBOX is under-allocating memory for the tasks. I don't touch any of the settings, and you can see what I have reported and how that works out.
ID: 41167
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,158,128
RAC: 105,494
Message 41168 - Posted: 5 Jan 2020, 13:16:53 UTC - in response to Message 41167.  

If you see no CPU time growing for the task after about 10 min. (initialisation phase),
you can delete the task.
Over the last few days I had two or three tasks with no CPU time after 10 min. of run time.
But it may be that you have another problem too.
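
The same kind of check can be applied to the "Status Report: CPU Time" lines in the vboxwrapper log from the first post. The following is only a sketch: the function names and the 600-second threshold are hypothetical, and the sample lines are copied from that log.

```python
import re

# Matches lines such as: Status Report: CPU Time: '26047.875000'
CPU_TIME_RE = re.compile(r"Status Report: CPU Time: '([0-9.]+)'")


def cpu_time_deltas(log_text):
    """Return the CPU seconds gained between consecutive Status Report lines."""
    values = [float(m.group(1)) for m in CPU_TIME_RE.finditer(log_text)]
    return [later - earlier for earlier, later in zip(values, values[1:])]


def first_stalled_interval(log_text, min_cpu_seconds=600.0):
    """Return the index of the first interval that gained less than
    min_cpu_seconds of CPU time, or None if every interval did enough work."""
    for index, delta in enumerate(cpu_time_deltas(log_text)):
        if delta < min_cpu_seconds:
            return index
    return None


sample = """\
2019-12-30 18:17:51 (12132): Status Report: CPU Time: '11427.375000'
2019-12-30 19:58:48 (12132): Status Report: CPU Time: '23327.625000'
2019-12-30 21:39:03 (12132): Status Report: CPU Time: '26047.875000'
2019-12-30 23:19:11 (12132): Status Report: CPU Time: '26094.828125'
"""

print(cpu_time_deltas(sample))
print(first_stalled_interval(sample))
```

In that log the reports are 6000 seconds of elapsed time apart, so an interval that gains only a few dozen CPU seconds on a 4-core VM is a strong hint that the task is no longer doing real work.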
ID: 41168
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,969,164
RAC: 136,720
Message 41169 - Posted: 5 Jan 2020, 13:56:53 UTC - in response to Message 41151.  

greg_be wrote:
I don't knwo how much OC plays in these problems. I can run BOINC at 40.75 without crashing anything. My temps stay within in their max limits. I lowered it down to 40.50 now to see if that helps anything, the next time ATLAS comes to my system and starts running.

AMD lists the Ryzen 2700 with a base clock of 3.2 GHz and a max boost clock of 4.1 GHz (single core, bursty single-threaded workload!).
https://www.amd.com/en/products/cpu/amd-ryzen-7-2700

A clock setting above 4 GHz seems to be far too high.
You may try to get stable results at AMD's base clock (3.2 GHz) before you try OC again.
That's what I meant by stock settings.
ID: 41169
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41171 - Posted: 5 Jan 2020, 17:37:21 UTC - in response to Message 41169.  

I've done a little digging, and it can handle long-term, higher-end OC. Everyone talks about gaming of course, so it's hard to compare that to here. But I will trim the frequency down to 4.0. It's interesting that some pages talk about a max of 4.1 (but not for me, I freeze up at 4.1) and some others have run as high as 4.2.
I know BOINC tasks can be touchy about what frequency you use. So 4.0 for now, and I'll see how that does.
ID: 41171
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41172 - Posted: 5 Jan 2020, 17:43:13 UTC - in response to Message 41168.  

With me, in the first 5 mins it shows 4-6 seconds of CPU time and that's it for the entire task.
I think one time I got about 11 seconds for the whole 16+ hrs it ran. But the completion rate keeps ticking up nicely until 70% and then slows down a bit. At 90% it really bogs down. At 95% and up it's almost dead. At 99.xxx% it's dead or crawling at .00010 percent every 4 seconds. So where should I abort it? After the CPU seconds stop? What is it doing when it's not reporting CPU cycles but still shows an increase in percent done?
ID: 41172
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41173 - Posted: 5 Jan 2020, 17:49:04 UTC
Last modified: 5 Jan 2020, 17:51:24 UTC

Here is my idea of how to boost memory and constrain the task to 4 CPUs.
This combines the memory boost section of a 2017 post with a message to me from a user who has been on here since 2012.

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent></max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>4.0</avg_ncpus>
<cmdline>--nthreads 4</cmdline>
<cmdline>--memory_size_mb 7500</cmdline> (though I have seen via BOINC Tasks that it wants 10,200MB of memory)
</app_version>
</app_config>

What corrections are needed or what are your thoughts on this app_config script?

Remember, though, from what I saw in VBOX, VBOX automatically takes only 2240 MB when the machine is engaged. So do I have to go to VBOX and boost the memory manually on each ATLAS machine, or what? How do the VBOX memory allocation and the BOINC allocation interact?
ID: 41173
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,158,128
RAC: 105,494
Message 41174 - Posted: 5 Jan 2020, 17:49:13 UTC - in response to Message 41171.  

My board is an ASUS X-370 with a Ryzen 2700. More than 3.4 GHz is not useful.
ASUS AI Suite 3 sets the clock itself.
Ryzen Master is also a tool to see what is possible.
But my Ryzen does not run well above 3.4 GHz. I have more than 1 year of experience with it.
ID: 41174
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 41175 - Posted: 5 Jan 2020, 18:24:26 UTC - in response to Message 41173.  
Last modified: 5 Jan 2020, 18:39:36 UTC

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent></max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>4.0</avg_ncpus>
<cmdline>--nthreads 4</cmdline>
<cmdline>--memory_size_mb 7500</cmdline> (though I have seen via BOINC Tasks that it wants 10,200MB of memory)
</app_version>
</app_config>

What corrections are needed or what are your thoughts on this app_config script?
Your max_concurrent is empty.
<app_config>
 <project_max_concurrent>8</project_max_concurrent>
 <app>
  <name>ATLAS</name>
  <max_concurrent>1</max_concurrent>
 </app>
 <app_version>
  <app_name>ATLAS</app_name>
  <plan_class>vbox64_mt_mcore_atlas</plan_class>
  <avg_ncpus>4.000000</avg_ncpus>
  <cmdline>--memory_size_mb 6600</cmdline>
 </app_version>
</app_config>
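
Slips like an empty max_concurrent can be caught before the client reads the file. The following is a minimal sketch using only Python's standard library; the function name and the three checks are illustrative choices, not part of BOINC. Treating more than one cmdline per app_version as suspicious is an assumption, based on the corrected file above using a single one.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_app_config(xml_text: str) -> List[str]:
    """Return a list of human-readable problems found in an app_config.xml text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "app_config":
        problems.append("root element is not <app_config>")

    # An empty or non-numeric <max_concurrent> was the slip pointed out above.
    for elem in root.iter("max_concurrent"):
        if not (elem.text or "").strip().isdigit():
            problems.append("<max_concurrent> is empty or not a number")

    # Assumption: only one <cmdline> per <app_version> is intended.
    for version in root.iter("app_version"):
        if len(version.findall("cmdline")) > 1:
            problems.append("more than one <cmdline> in an <app_version>")

    # Free text after a closing tag (e.g. a note typed into the file) is suspicious.
    for elem in root.iter():
        if elem.tail and elem.tail.strip():
            problems.append(f"stray text after </{elem.tag}>")

    return problems


if __name__ == "__main__":
    import sys
    # Usage (hypothetical path): python check_app_config.py app_config.xml
    with open(sys.argv[1], encoding="utf-8") as fh:
        for problem in check_app_config(fh.read()):
            print(problem)
```

An empty result only means these three checks passed; it does not prove the client will accept the file.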

After changing app_config.xml you have to re-read the config files (in BoincTasks: menu Extra, Read config files). The change only affects newly loaded tasks.
Remember, though, from what I saw in VBOX, VBOX automatically takes only 2240 MB when the machine is engaged. So do I have to go to VBOX and boost the memory manually on each ATLAS machine, or what? How do the VBOX memory allocation and the BOINC allocation interact?
BOINC's memory allocation comes from your preferences. When you saw 10200, that's because you have set Max # CPUs to 8 in your preferences. Set that to 4 too.
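
The two figures in this thread, 10200 MB with 8 CPUs and the 6600 MB used in the 4-CPU config above, happen to lie on a straight line. The sketch below fits that line. It is an inference from just these two data points, assuming 6600 MB was picked as the matching value for 4 CPUs; it is not a rule stated anywhere in the thread, and the function names are made up for illustration.

```python
from typing import Tuple


def fit_line(p1: Tuple[float, float], p2: Tuple[float, float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the straight line through two (ncpus, MB) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Data points taken from this thread: (number of CPUs, memory in MB).
slope, intercept = fit_line((4, 6600), (8, 10200))


def vm_memory_mb(ncpus: int) -> float:
    """Memory suggested by the fitted line (an extrapolation, not an official rule)."""
    return intercept + slope * ncpus


print(f"about {intercept:.0f} MB base plus {slope:.0f} MB per CPU")
for n in (1, 2, 4, 8):
    print(n, vm_memory_mb(n))
```

If the fit reflects how the requirement is really computed, lowering Max # CPUs lowers the memory request as well, which matches the advice above to set it to 4.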
ID: 41175
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41176 - Posted: 5 Jan 2020, 21:34:24 UTC - in response to Message 41175.  

Thanks for the corrections. I did miss setting max_concurrent back to 1.
I was trying to force just ATLAS onto 4 CPUs but leave things wide open for Theory and all the rest, so I figured I'd set the restriction for ATLAS in app_config and leave the memory alone.
But I still can't figure out why VBOX is coming in so low on memory: 2240 MB total, if I remember correctly.
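
One way to see what a VM was actually created with, rather than going by memory, is to ask VirtualBox directly. The sketch below wraps `VBoxManage showvminfo <name> --machinereadable` and parses its key=value output. The helper names are made up for illustration, VBoxManage is assumed to be on the PATH, and the `memory` and `cpus` keys are what this output normally contains, so check them against your VirtualBox version.

```python
import subprocess
from typing import Dict


def parse_machinereadable(text: str) -> Dict[str, str]:
    """Turn key=value lines (values optionally quoted) into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            info[key.strip().strip('"')] = value.strip().strip('"')
    return info


def vm_settings(vm_name: str) -> Dict[str, str]:
    """Query VirtualBox for one VM; raises if VBoxManage fails or is missing."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        check=True, capture_output=True, text=True,
    )
    return parse_machinereadable(result.stdout)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python vm_settings.py <VM name>
    # The VM name is the boinc_... identifier from the "Create VM" log line.
    settings = vm_settings(sys.argv[1])
    print("memory (MB):", settings.get("memory"))
    print("cpus:", settings.get("cpus"))
```

Comparing the reported memory with the value in app_config.xml shows whether a task picked up the new setting, since the change applies only to newly loaded tasks.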
ID: 41176
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41190 - Posted: 7 Jan 2020, 9:08:08 UTC

Crystal Pellet and the rest,

Thank you for your help. It looks like with your assistance I finally can run ATLAS with no problems.
Overnight three tasks ran and all completed ok.
ID: 41190