Author | Message |
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
Hello, I would like to know the size of one job. When I run many VMs (Theory and CMS), they often sit idle for a long time. I guess that many concurrent downloads get stuck, or maybe the job size is too large for my ADSL (~600kb/s).
|
|
Laurence Project administrator Project developer
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0
|
Hello, I would like to know the size of one job. When I run many VMs (Theory and CMS), they often sit idle for a long time. I guess that many concurrent downloads get stuck, or maybe the job size is too large for my ADSL (~600kb/s).
The specification for the apps can be found in the FAQ. The input/output for the Theory app is less than 1MB per job. CMS, LHC and ATLAS vary between 20MB and 100MB.
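To put those sizes next to the ADSL figure quoted above, here is a rough back-of-the-envelope sketch. It assumes 1 MB = 1000 KB and treats "~600kb/s" in both possible ways (kilobits or kilobytes per second), since the post does not say which is meant. The function and dictionary names are made up for illustration.

```python
def transfer_seconds(size_mb: float, rate_kb_per_s: float) -> float:
    """Seconds needed to move size_mb megabytes at rate_kb_per_s kilobytes/second.

    Assumes 1 MB = 1000 KB and an otherwise idle link (no other downloads).
    """
    return size_mb * 1000.0 / rate_kb_per_s


# Two readings of "~600kb/s"; which one applies is an assumption.
LINK_RATES_KB_S = {
    "600 kilobit/s": 600 / 8,   # 75 KB/s
    "600 kilobyte/s": 600.0,
}

# Job sizes taken from the reply above (Theory is an upper bound).
JOB_SIZES_MB = {"Theory": 1, "CMS/LHC/ATLAS (low)": 20, "CMS/LHC/ATLAS (high)": 100}

if __name__ == "__main__":
    for link, rate in LINK_RATES_KB_S.items():
        for job, size in JOB_SIZES_MB.items():
            print(f"{link:>15} | {job:<22} | {transfer_seconds(size, rate):8.1f} s")
```

These figures cover only the per-job input/output. Anything the VM fetches through CVMFS while booting comes on top and is not counted here.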
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
1MB per job doesn't seem too much.
So I don't understand why I have 1 VM running and 7 VMs idling today, 0 running yesterday and 8 running two days ago.
stderr.txt (idling VM, today):
2017-01-13 12:44:05 (3239): vboxwrapper (7.7.26196): starting
2017-01-13 12:44:05 (3239): Feature: Checkpoint interval offset (474 seconds)
2017-01-13 12:44:05 (3239): Detected: VirtualBox VboxManage Interface (Version: 5.0.26)
2017-01-13 12:44:05 (3239): Detected: Minimum checkpoint interval (600.000000 seconds)
2017-01-13 12:44:05 (3239): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2017-01-13 12:44:05 (3239): Starting VM. (boinc_33a2224c153eb7ca, slot#6)
2017-01-13 12:44:15 (3239): Successfully started VM. (PID = '3970')
2017-01-13 12:44:15 (3239): Reporting VM Process ID to BOINC.
2017-01-13 12:44:15 (3239): VM state change detected. (old = 'poweroff', new = 'running')
2017-01-13 12:44:15 (3239): Detected: Web Application Enabled (http://localhost:56077)
2017-01-13 12:44:15 (3239): Detected: Remote Desktop Enabled (localhost:37732)
2017-01-13 12:44:15 (3239): Status Report: Job Duration: '64800.000000'
2017-01-13 12:44:15 (3239): Status Report: Elapsed Time: '26987.109348'
2017-01-13 12:44:15 (3239): Status Report: CPU Time: '13027.730000'
2017-01-13 12:44:15 (3239): Preference change detected
2017-01-13 12:44:15 (3239): Setting CPU throttle for VM. (100%)
2017-01-13 12:44:15 (3239): Setting network throttle for VM. (80KB)
2017-01-13 12:44:15 (3239): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))
stderr.txt (running VM, today):
2017-01-13 12:44:05 (3240): vboxwrapper (7.7.26196): starting
2017-01-13 12:44:05 (3240): Feature: Checkpoint interval offset (327 seconds)
2017-01-13 12:44:05 (3240): Detected: VirtualBox VboxManage Interface (Version: 5.0.26)
2017-01-13 12:44:05 (3240): Detected: Minimum checkpoint interval (600.000000 seconds)
2017-01-13 12:44:05 (3240): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2017-01-13 12:44:05 (3240): Starting VM. (boinc_248c1324b9ac7c9c, slot#5)
2017-01-13 12:44:07 (3240): Successfully started VM. (PID = '3995')
2017-01-13 12:44:07 (3240): Reporting VM Process ID to BOINC.
2017-01-13 12:44:07 (3240): Guest Log: BIOS: VirtualBox 5.0.26
2017-01-13 12:44:07 (3240): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2017-01-13 12:44:07 (3240): VM state change detected. (old = 'poweroff', new = 'running')
2017-01-13 12:44:07 (3240): Detected: Web Application Enabled (http://localhost:33403)
2017-01-13 12:44:07 (3240): Detected: Remote Desktop Enabled (localhost:58296)
2017-01-13 12:44:07 (3240): Status Report: Job Duration: '64800.000000'
2017-01-13 12:44:07 (3240): Status Report: Elapsed Time: '26715.438963'
2017-01-13 12:44:07 (3240): Status Report: CPU Time: '14028.040000'
2017-01-13 12:44:07 (3240): Preference change detected
2017-01-13 12:44:07 (3240): Setting CPU throttle for VM. (100%)
2017-01-13 12:44:07 (3240): Setting network throttle for VM. (80KB)
2017-01-13 12:44:07 (3240): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))
2017-01-13 12:44:09 (3240): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2017-01-13 12:44:09 (3240): Guest Log: BIOS: Booting from Hard Disk...
2017-01-13 12:44:11 (3240): Guest Log: BIOS: KBD: unsupported int 16h function 03
2017-01-13 12:44:11 (3240): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2017-01-13 12:44:22 (3240): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2017-01-13 12:44:39 (3240): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000041 main Log opened 2017-01-13T11:44:36.426830000Z
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000239 main OS Product: Linux
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000272 main OS Release: 4.1.34-22.cernvm.x86_64
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000292 main OS Version: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000311 main OS Service Pack: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000329 main Executable: /usr/sbin/VBoxService
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000330 main Process ID: 2646
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000330 main Package type: LINUX_64BITS_GENERIC
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000852 main 4.3.28 r100309 started. Verbose level = 0
2017-01-13 12:47:58 (3240): Guest Log: [INFO] Mounting the shared directory
2017-01-13 12:47:58 (3240): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded!
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Probing CVMFS ...
2017-01-13 12:48:05 (3240): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-01-13 12:49:26 (3240): Guest Log: Probing /cvmfs/sft.cern.ch... OK
2017-01-13 12:49:26 (3240): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-01-13 12:49:26 (3240): Guest Log: 2.2.0.0 3335 4 21800 3799 13 1 589102 10240001 2 65024 0 20 95 15203 0 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.168.203:3125 1
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Reading volunteer information
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Volunteer: Luigi R. (282378) Host: 10408772
2017-01-13 12:49:39 (3240): Guest Log: [INFO] VMID: 3fc883b4-41f5-4a16-aa19-a080e214e4c6
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2017-01-13 12:49:40 (3240): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-01-13 12:49:41 (3240): Guest Log: [INFO] Theory application starting. Check log files.
2017-01-13 12:49:41 (3240): Guest Log: [DEBUG] HTCondor ping
2017-01-13 12:49:45 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 12:50:30 (3240): Guest Log: [INFO] Condor JobID: 1042966.0 in slot1
2017-01-13 12:50:35 (3240): Guest Log: [INFO] MCPlots JobID: 34797771 in slot1
2017-01-13 12:55:54 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 12:56:00 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 12:56:00 (3240): Guest Log: [INFO] Condor JobID: 1043020.0 in slot1
2017-01-13 12:56:06 (3240): Guest Log: [INFO] MCPlots JobID: 34799529 in slot1
2017-01-13 13:08:52 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 13:08:56 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 13:08:56 (3240): Guest Log: [INFO] Condor JobID: 1043156.0 in slot1
2017-01-13 13:09:02 (3240): Guest Log: [INFO] MCPlots JobID: 34799673 in slot1
2017-01-13 13:44:05 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
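Comparing two logs like these by eye is tedious, so here is a minimal sketch that pulls the useful numbers out of a stderr.txt: how long the wrapper waited before the first job started, and how long each job ran. It assumes the line layout shown above (`timestamp (pid): message`) and the exact message texts "New Job Starting" and "Job finished". The function names are hypothetical and not part of vboxwrapper.

```python
import re
from datetime import datetime

# "2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")


def parse_events(lines):
    """Return (timestamp, message) pairs for every line that matches the layout."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def job_summary(lines):
    """Return (seconds from wrapper start to first job, list of job durations)."""
    wrapper_start = None
    job_start = None
    first_job_delay = None
    durations = []
    for stamp, message in parse_events(lines):
        if wrapper_start is None and "vboxwrapper" in message and message.endswith("starting"):
            wrapper_start = stamp
        elif "New Job Starting" in message:
            job_start = stamp
            if first_job_delay is None and wrapper_start is not None:
                first_job_delay = (stamp - wrapper_start).total_seconds()
        elif "Job finished" in message and job_start is not None:
            durations.append((stamp - job_start).total_seconds())
            job_start = None
    return first_job_delay, durations


SAMPLE = [
    "2017-01-13 12:44:05 (3240): vboxwrapper (7.7.26196): starting",
    "2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1",
    "2017-01-13 12:55:54 (3240): Guest Log: [INFO] Job finished in slot1 with 0.",
    "2017-01-13 12:56:00 (3240): Guest Log: [INFO] New Job Starting in slot1",
    "2017-01-13 13:08:52 (3240): Guest Log: [INFO] Job finished in slot1 with 0.",
]

if __name__ == "__main__":
    print(job_summary(SAMPLE))
```

A log with a wrapper start but no "New Job Starting" line, like the idling one above, comes back as `(None, [])`, which makes idle VMs easy to spot across slots.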
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
[Screenshot: running VM]
[Screenshot: idling VM]
|
|
Laurence Project administrator Project developer
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0
|
Thanks for the screenshots, these are very helpful. The first thing I notice is that an old version of the application is running (262.50), whereas the latest version is 262.60. The input/output error is typical when CVMFS cannot access files over the network. I would first ensure that you are running the latest version by doing a project reset. The new version is more resilient to network issues and provides improved error messages.
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
Done! It's better. I have 6 VMs running and 2 VMs (1 CMS and 1 Theory) idling.
Processes list
CMS VM idling (process 5770) (elapsed time: 45 minutes)
Theory VM idling (process 23721) (elapsed time: 49 minutes)
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
Now I have 5 VMs running and 3 idling (2 CMS and 1 Theory).
Edit: After 20 minutes 2-3 VMs running.
Maybe I should try limiting the number of VMs to see whether I can keep 1, 2, 3, etc. VMs running all the time?
Edit2: After another 5 minutes 4 VMs running.
|
|
Laurence Project administrator Project developer
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0
|
Theory VM idling (process 23721) (elapsed time: 49 minutes)
It looks like the VM has no network connectivity.
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
|
|
Laurence Project administrator Project developer
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0
|
As far as I can tell your machine is 8 cores with 8GB of RAM. From the memory perspective 2 Theory tasks are equivalent to 1 CMS task. When starting to run multiple VM tasks on a machine, start small and experiment by slowly increasing what you are running. Always start with a Theory task. If that works then it suggests there are no fundamental issues. Then try 1 CMS before trying 1 Theory and 1 CMS together. It has been mentioned by others that VM starts should be staged.
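The advice above can be sketched as a small planning helper. The only fact taken from this post is the ratio (2 Theory tasks weigh about as much as 1 CMS task in memory). The gigabytes per memory unit, the RAM reserved for the host, and the delay between VM starts are placeholder assumptions that you would have to measure on your own machine.

```python
# Relative memory weights from the post: 2 Theory tasks ~ 1 CMS task.
MEMORY_UNITS = {"Theory": 1, "CMS": 2}


def memory_units(tasks):
    """Total relative memory weight of a list of task names."""
    return sum(MEMORY_UNITS[name] for name in tasks)


def fits(tasks, ram_gb, gb_per_unit, reserve_gb=2.0):
    """True if the mix fits in RAM. gb_per_unit and reserve_gb are assumptions."""
    return memory_units(tasks) * gb_per_unit <= ram_gb - reserve_gb


def experiment_order():
    """The order of experiments suggested in the post."""
    return [["Theory"], ["CMS"], ["Theory", "CMS"]]


def staged_start_offsets(vm_count, gap_seconds):
    """Start offsets in seconds so that VMs do not all boot at once."""
    return [index * gap_seconds for index in range(vm_count)]


if __name__ == "__main__":
    for step in experiment_order():
        print(step, "units:", memory_units(step))
    # Hypothetical 5-minute gap between VM starts.
    print(staged_start_offsets(4, 300))
```

Staging the starts also spreads out the initial downloads, which matters on a slow link regardless of how much RAM is available.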
|
|
Laurence Project administrator Project developer
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0
|
You may be interested in the multicore VMs. ATLAS will be the first on here but once everyone is comfortable, we can also add this for CMS and Theory.
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
I have 24GB of RAM though.
|
|
Luigi R.
Send message Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
|
As far as I can tell your machine is 8 cores with 8GB of RAM. From the memory perspective 2 Theory tasks are equivalent to 1 CMS task. When starting to run multiple VM tasks on a machine, start small and experiment by slowly increasing what you are running. Always start with a Theory task. If that works then it suggests there are no fundamental issues. Then try 1 CMS before trying 1 Theory and 1 CMS together. It has been mentioned by others that VM starts should be staged.
I think there are no issues on this machine. There are moments when all 8 VMs are running correctly. My 24GB of RAM is enough for 8 CMS tasks as well.
Today I'm experiencing many errors: 206 (0x000000CE) EXIT_INIT_FAILURE.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=112071733
[ERROR] Condor exited after 686s without running a job.
Sorry if I sound repetitive, but I see a bandwidth problem.
My host downloaded >2GB in 1.5 hours.
I will try to disable CMS tasks and run only 4 Theory tasks to see if things improve.
Multicore VMs would be good.
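One way to cap the number of concurrent tasks is a BOINC `app_config.xml` file with a `max_concurrent` entry per application. The sketch below only generates the XML text. The application name "Theory" is an assumption, so check the real name in your client_state.xml. The file is normally placed in the project's folder inside the BOINC data directory and picked up after re-reading the config files or restarting the client.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Build app_config.xml text from a {app_name: max_concurrent} mapping."""
    root = ET.Element("app_config")
    for name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "Theory" is an assumed application name; 4 matches the plan above.
    print(build_app_config({"Theory": 4}))
```

Switching CMS off entirely is usually done in the project preferences on the website rather than in this file.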
|
|