Message boards : Theory Application : Job size - download

Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28465 - Posted: 13 Jan 2017, 12:12:41 UTC

Hello, I would like to know the size of one job. When I run many VMs (Theory and CMS), they often sit idle for a long time. I guess that many concurrent downloads get stuck, or maybe the job size is too large for my ADSL line (~600kb/s).
ID: 28465
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28466 - Posted: 13 Jan 2017, 12:39:04 UTC - in response to Message 28465.  

Hello, I would like to know the size of one job. When I run many VMs (Theory and CMS), they often sit idle for a long time. I guess that many concurrent downloads get stuck, or maybe the job size is too large for my ADSL line (~600kb/s).


The specifications for the apps can be found in the FAQ. The input/output for the Theory app is less than 1MB per job. For CMS, LHC and ATLAS it varies between 20MB and 100MB.
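
As a rough check on these figures, the sketch below estimates how long one job's transfer would take on the ADSL line mentioned in the question. The thread does not say whether "~600kb/s" means kilobits or kilobytes per second, so both readings are computed. The function name and the assumption of an otherwise idle link are illustrative only.

```python
def transfer_seconds(size_mb: float, rate_kb_per_s: float, rate_in_bits: bool) -> float:
    """Seconds needed to move size_mb megabytes over a link of rate_kb_per_s."""
    size_kilobytes = size_mb * 1000  # decimal units: 1 MB = 1000 kB
    # Convert a kilobit rate to kilobytes per second when requested.
    kilobytes_per_s = rate_kb_per_s / 8 if rate_in_bits else rate_kb_per_s
    return size_kilobytes / kilobytes_per_s


# Job sizes quoted in the reply: under 1 MB for Theory, 20 MB to 100 MB for the others.
for label, size_mb in [("Theory (upper bound)", 1), ("20 MB job", 20), ("100 MB job", 100)]:
    as_bits = transfer_seconds(size_mb, 600, rate_in_bits=True)
    as_bytes = transfer_seconds(size_mb, 600, rate_in_bits=False)
    print(f"{label}: {as_bits:.0f} s at 600 kbit/s, {as_bytes:.0f} s at 600 kB/s")
```

This covers only the per-job input/output quoted above; any other traffic a VM generates is not included in the estimate.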
ID: 28466
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28467 - Posted: 13 Jan 2017, 12:51:53 UTC

1MB per job doesn't seem too much.

So I don't understand why I have 1 VM running and 7 VMs idling today, had 0 running yesterday, and had 8 running two days ago.

stderr.txt of a VM idling today

2017-01-13 12:44:05 (3239): vboxwrapper (7.7.26196): starting
2017-01-13 12:44:05 (3239): Feature: Checkpoint interval offset (474 seconds)
2017-01-13 12:44:05 (3239): Detected: VirtualBox VboxManage Interface (Version: 5.0.26)
2017-01-13 12:44:05 (3239): Detected: Minimum checkpoint interval (600.000000 seconds)
2017-01-13 12:44:05 (3239): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2017-01-13 12:44:05 (3239): Starting VM. (boinc_33a2224c153eb7ca, slot#6)
2017-01-13 12:44:15 (3239): Successfully started VM. (PID = '3970')
2017-01-13 12:44:15 (3239): Reporting VM Process ID to BOINC.
2017-01-13 12:44:15 (3239): VM state change detected. (old = 'poweroff', new = 'running')
2017-01-13 12:44:15 (3239): Detected: Web Application Enabled (http://localhost:56077)
2017-01-13 12:44:15 (3239): Detected: Remote Desktop Enabled (localhost:37732)
2017-01-13 12:44:15 (3239): Status Report: Job Duration: '64800.000000'
2017-01-13 12:44:15 (3239): Status Report: Elapsed Time: '26987.109348'
2017-01-13 12:44:15 (3239): Status Report: CPU Time: '13027.730000'
2017-01-13 12:44:15 (3239): Preference change detected
2017-01-13 12:44:15 (3239): Setting CPU throttle for VM. (100%)
2017-01-13 12:44:15 (3239): Setting network throttle for VM. (80KB)
2017-01-13 12:44:15 (3239): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))


stderr.txt running today

2017-01-13 12:44:05 (3240): vboxwrapper (7.7.26196): starting
2017-01-13 12:44:05 (3240): Feature: Checkpoint interval offset (327 seconds)
2017-01-13 12:44:05 (3240): Detected: VirtualBox VboxManage Interface (Version: 5.0.26)
2017-01-13 12:44:05 (3240): Detected: Minimum checkpoint interval (600.000000 seconds)
2017-01-13 12:44:05 (3240): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2017-01-13 12:44:05 (3240): Starting VM. (boinc_248c1324b9ac7c9c, slot#5)
2017-01-13 12:44:07 (3240): Successfully started VM. (PID = '3995')
2017-01-13 12:44:07 (3240): Reporting VM Process ID to BOINC.
2017-01-13 12:44:07 (3240): Guest Log: BIOS: VirtualBox 5.0.26
2017-01-13 12:44:07 (3240): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2017-01-13 12:44:07 (3240): VM state change detected. (old = 'poweroff', new = 'running')
2017-01-13 12:44:07 (3240): Detected: Web Application Enabled (http://localhost:33403)
2017-01-13 12:44:07 (3240): Detected: Remote Desktop Enabled (localhost:58296)
2017-01-13 12:44:07 (3240): Status Report: Job Duration: '64800.000000'
2017-01-13 12:44:07 (3240): Status Report: Elapsed Time: '26715.438963'
2017-01-13 12:44:07 (3240): Status Report: CPU Time: '14028.040000'
2017-01-13 12:44:07 (3240): Preference change detected
2017-01-13 12:44:07 (3240): Setting CPU throttle for VM. (100%)
2017-01-13 12:44:07 (3240): Setting network throttle for VM. (80KB)
2017-01-13 12:44:07 (3240): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))
2017-01-13 12:44:09 (3240): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2017-01-13 12:44:09 (3240): Guest Log: BIOS: Booting from Hard Disk...
2017-01-13 12:44:11 (3240): Guest Log: BIOS: KBD: unsupported int 16h function 03
2017-01-13 12:44:11 (3240): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2017-01-13 12:44:22 (3240): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2017-01-13 12:44:39 (3240): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000041 main Log opened 2017-01-13T11:44:36.426830000Z
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000239 main OS Product: Linux
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000272 main OS Release: 4.1.34-22.cernvm.x86_64
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000292 main OS Version: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000311 main OS Service Pack: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000329 main Executable: /usr/sbin/VBoxService
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000330 main Process ID: 2646
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000330 main Package type: LINUX_64BITS_GENERIC
2017-01-13 12:44:39 (3240): Guest Log: 00:00:00.000852 main 4.3.28 r100309 started. Verbose level = 0
2017-01-13 12:47:58 (3240): Guest Log: [INFO] Mounting the shared directory
2017-01-13 12:47:58 (3240): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:47:58 (3240): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:03 (3240): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded!
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:48:04 (3240): Guest Log: [DEBUG] Probing CVMFS ...
2017-01-13 12:48:05 (3240): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-01-13 12:49:26 (3240): Guest Log: Probing /cvmfs/sft.cern.ch... OK
2017-01-13 12:49:26 (3240): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-01-13 12:49:26 (3240): Guest Log: 2.2.0.0 3335 4 21800 3799 13 1 589102 10240001 2 65024 0 20 95 15203 0 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.168.203:3125 1
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Reading volunteer information
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Volunteer: Luigi R. (282378) Host: 10408772
2017-01-13 12:49:39 (3240): Guest Log: [INFO] VMID: 3fc883b4-41f5-4a16-aa19-a080e214e4c6
2017-01-13 12:49:39 (3240): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2017-01-13 12:49:40 (3240): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-01-13 12:49:41 (3240): Guest Log: [INFO] Theory application starting. Check log files.
2017-01-13 12:49:41 (3240): Guest Log: [DEBUG] HTCondor ping
2017-01-13 12:49:45 (3240): Guest Log: [DEBUG] 0
2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 12:50:30 (3240): Guest Log: [INFO] Condor JobID: 1042966.0 in slot1
2017-01-13 12:50:35 (3240): Guest Log: [INFO] MCPlots JobID: 34797771 in slot1
2017-01-13 12:55:54 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 12:56:00 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 12:56:00 (3240): Guest Log: [INFO] Condor JobID: 1043020.0 in slot1
2017-01-13 12:56:06 (3240): Guest Log: [INFO] MCPlots JobID: 34799529 in slot1
2017-01-13 13:08:52 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 13:08:56 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 13:08:56 (3240): Guest Log: [INFO] Condor JobID: 1043156.0 in slot1
2017-01-13 13:09:02 (3240): Guest Log: [INFO] MCPlots JobID: 34799673 in slot1
2017-01-13 13:44:05 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
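
To see how long each job inside the running VM took, the start and finish lines of the log above can be paired up. The following is a minimal sketch that assumes the line layout shown in this stderr.txt (timestamp, wrapper PID in parentheses, then "Guest Log: [INFO] ..."); the function name `job_durations` is hypothetical and not part of vboxwrapper or BOINC.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: \[INFO\] (.*)$"
)


def job_durations(log_text: str) -> list[float]:
    """Seconds between each 'New Job Starting' line and the next 'Job finished' line."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip wrapper, BIOS and DEBUG lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("New Job Starting"):
            started = stamp
        elif message.startswith("Job finished") and started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


sample = """\
2017-01-13 12:50:30 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 12:55:54 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 12:56:00 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 13:08:52 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
2017-01-13 13:08:56 (3240): Guest Log: [INFO] New Job Starting in slot1
2017-01-13 13:44:05 (3240): Guest Log: [INFO] Job finished in slot1 with 0.
"""
print(job_durations(sample))  # [324.0, 772.0, 2109.0]
```

Run against the idling VM's stderr.txt, the same function would return an empty list, because that log contains no Guest Log job lines at all.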
ID: 28467
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28468 - Posted: 13 Jan 2017, 12:57:01 UTC

[Screenshot: running VM]

[Screenshot: idling VM]
ID: 28468
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28469 - Posted: 13 Jan 2017, 13:18:06 UTC - in response to Message 28468.  

Thanks for the screenshots, these are very helpful. The first thing I notice is that an old version of the application is running (262.50), whereas the latest version is 262.60. The input/output error is typical when CVMFS cannot access files over the network. I would first make sure that you are running the latest version by doing a project reset. The new version is more resilient to network issues and provides improved error messages.
ID: 28469
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28471 - Posted: 13 Jan 2017, 15:41:19 UTC

Done! It's better. I have 6 VMs running and 2 VMs (1 CMS and 1 Theory) idling.

Processes list


CMS VM idling (process 5770) (elapsed time: 45 minutes)


Theory VM idling (process 23721) (elapsed time: 49 minutes)
ID: 28471
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28472 - Posted: 13 Jan 2017, 15:46:30 UTC
Last modified: 13 Jan 2017, 16:12:28 UTC

Now I have 5 VMs running and 3 idling (2 CMS and 1 Theory).


Edit: After 20 minutes 2-3 VMs running.
Maybe I should try limiting the number of VMs to see whether I can keep 1, 2, 3, etc. VMs running all the time?

Edit2: After another 5 minutes 4 VMs running.
ID: 28472
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28481 - Posted: 13 Jan 2017, 21:38:30 UTC - in response to Message 28471.  


Theory VM idling (process 23721) (elapsed time: 49 minutes)


It looks like the VM has no network connectivity.
ID: 28481
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28482 - Posted: 13 Jan 2017, 21:49:40 UTC - in response to Message 28481.  
Last modified: 13 Jan 2017, 21:51:58 UTC

I tried to suspend (without leaving in memory) and resume it, but the same error occurred after the VM completed startup. Then I aborted it. The other tasks are 'gracefully' running. Now I have 8/8 VMs running.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=111877625
ID: 28482
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28483 - Posted: 13 Jan 2017, 21:50:49 UTC - in response to Message 28472.  

As far as I can tell your machine is 8 cores with 8GB of RAM. From the memory perspective 2 Theory tasks are equivalent to 1 CMS task. When starting to run multiple VM tasks on a machine, start small and experiment by slowly increasing what you are running. Always start with a Theory task. If that works then it suggests there are no fundamental issues. Then try 1 CMS before trying 1 Theory and 1 CMS together. It has been mentioned by others that VM starts should be staged.
ID: 28483
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28484 - Posted: 13 Jan 2017, 21:54:05 UTC - in response to Message 28482.  

You may be interested in the multicore VMs. ATLAS will be the first on here but once everyone is comfortable, we can also add this for CMS and Theory.
ID: 28484
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28485 - Posted: 13 Jan 2017, 21:54:49 UTC

I have 24GB of RAM though.
ID: 28485
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28492 - Posted: 14 Jan 2017, 16:50:47 UTC - in response to Message 28483.  
Last modified: 14 Jan 2017, 16:52:19 UTC

As far as I can tell your machine is 8 cores with 8GB of RAM. From the memory perspective 2 Theory tasks are equivalent to 1 CMS task. When starting to run multiple VM tasks on a machine, start small and experiment by slowly increasing what you are running. Always start with a Theory task. If that works then it suggests there are no fundamental issues. Then try 1 CMS before trying 1 Theory and 1 CMS together. It has been mentioned by others that VM starts should be staged.

I think there are no issues on this machine. There are times when all 8 VMs run correctly. My 24GB of RAM is enough for 8 CMS tasks as well.


Today I'm experiencing many errors: 206 (0x000000CE) EXIT_INIT_FAILURE.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=112071733
[ERROR] Condor exited after 686s without running a job.


Sorry if I sound repetitive, but I see a bandwidth problem.
My host downloaded >2GB in 1.5 hours.
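
For scale, the average rate implied by that download can be worked out as follows. This is only a sketch using the lower bound of 2GB and decimal units; the function name is hypothetical.

```python
def average_rate(total_gb: float, hours: float) -> tuple[float, float]:
    """Return (kilobytes per second, kilobits per second) for a transfer."""
    kilobytes = total_gb * 1_000_000  # decimal units: 1 GB = 1,000,000 kB
    seconds = hours * 3600
    kb_per_s = kilobytes / seconds
    return kb_per_s, kb_per_s * 8


kbytes, kbits = average_rate(2, 1.5)
print(f"{kbytes:.0f} kB/s, {kbits:.0f} kbit/s")
```

At roughly 370 kB/s sustained, the download alone would use well over half of the line if "~600kb/s" means kilobytes per second, and it would not fit at all if the figure meant kilobits per second, so the kilobyte reading is presumably the intended one.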

I will try to disable CMS tasks and run only 4 Theory tasks to see if things improve.
Multicore VMs would be good.
ID: 28492
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 28494 - Posted: 14 Jan 2017, 18:51:05 UTC

Hello Laurence, I am reporting another two tasks stuck at 'bootlogd: no process killed'.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=112071880
https://lhcathome.cern.ch/lhcathome/result.php?resultid=112070676

I ended them gracefully by setting the elapsed time to 64790 in vbox_checkpoint.xml, just under the 64800-second job duration reported in stderr.txt. They had idled for almost 3 hours.
ID: 28494

©2024 CERN