New Version 263.70
Author | Message |
---|---|
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
This version has an improved CVMFS configuration and provides new vboxwrappers for Mac and Linux to address the issues reported by computezrmle. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4746&postid=35682#35682 |
Joined: 15 Jun 08 Posts: 2473 Credit: 244,979,460 RAC: 170,842 |
Without iptables the VM's CVMFS still ignores the local squid (keyword: DIRECT) Guest Log: [DEBUG] Detected squid proxy http://<hostname_censored_by_volunteer/>:3128 Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE Guest Log: 2.4.4.0 3538 1 27796 6583 3 1 183730 10240000 2 65024 0 15 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1 |
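The flattened status line above is easier to read once each column name is paired with its value. A minimal Python sketch, assuming the header row and the value row are whitespace-separated exactly as quoted in the guest log; the variable names are illustrative only:

```python
# Header and value rows copied from the guest log quoted above.
header = ("VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
          "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
          "RX(K) SPEED(K/S) HOST PROXY ONLINE")
values = ("2.4.4.0 3538 1 27796 6583 3 1 183730 10240000 2 65024 0 15 100 0 0 "
          "http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1")

# Pair every column name with its value.
stat = dict(zip(header.split(), values.split()))

# "DIRECT" in the PROXY column means no proxy (here: no local squid) is in use.
uses_proxy = stat["PROXY"] != "DIRECT"
print(f"HOST={stat['HOST']} PROXY={stat['PROXY']} uses_proxy={uses_proxy}")
```

Read this way, the PROXY column holds DIRECT even though a squid was detected a few lines earlier in the same log.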
Joined: 6 Sep 08 Posts: 117 Credit: 12,271,320 RAC: 18,234 |
I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. Can't be more precise since the history has been deleted. Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc., and all the stuff (AV updates and whatnot) that goes on at startup? |
Joined: 15 Jun 08 Posts: 2473 Credit: 244,979,460 RAC: 170,842 |
The new app is a multicore app and it reserves more RAM per VM than the old one. This may result in higher swapping activity and ultimately in timing problems. You may want to review the total load on your hosts and, if necessary, adapt the settings. |
Joined: 24 Oct 04 Posts: 1153 Credit: 50,841,180 RAC: 19,766 |
So I guess this is still using the same vdi since the number is the same.......I hope. Mine are all Windows of course but I have 97 Valids and 0 Errors.........and now don't even have to *remove* the old vdi's in the VB manager any more. |
Joined: 18 Dec 15 Posts: 1735 Credit: 112,366,399 RAC: 51,051 |
I'm seeing an increase in heartbeat failures... I, too, plead for a timeout increase - today, for the first time, I had a heartbeat failure on a host where I had never had one before. On 2 of my other PCs, there are heartbeat failures once in a while, no idea why. My opinion is that the heartbeat policy applied by LHC is much too rigid. It's simply annoying that when a task is nearly finished (after having run for some 46.000 seconds), all of a sudden it dies because of the missing heartbeat :-((( |
Joined: 6 Sep 08 Posts: 117 Credit: 12,271,320 RAC: 18,234 |
I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. Can't be more precise since the history has been deleted. Most of the failures are on startup. Since tasks stop and restart at least once here, a lot of time can be wasted when tasks fail on subsequent starts. Once upon a time... a config change was made to CMS which, as I remember, largely fixed the problem. Then it was taken away... and never returned; no explanation. |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
The thread was labeled incorrectly; it does have a new version number. |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
The heartbeat interval is 20mins and it should beat every minute. So the VM is killed if it takes longer than 20mins to boot or has frozen for 20 minutes. |
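The rule described above can be pictured as a simple watchdog on the heartbeat file. This is only a rough Python sketch under stated assumptions, not vboxwrapper's actual code: it assumes freshness is judged from the file's modification time, and the function name `heartbeat_is_stale` and the constant are hypothetical.

```python
import os
import time

# 20 minutes, the interval stated in the post above.
HEARTBEAT_INTERVAL_S = 1200


def heartbeat_is_stale(path, now=None, interval_s=HEARTBEAT_INTERVAL_S):
    """Return True if the heartbeat file is missing or older than interval_s.

    Assumption: the guest refreshes the file about once a minute, so a file
    untouched for longer than interval_s means the VM did not boot in time
    or has frozen.
    """
    now = time.time() if now is None else now
    try:
        last_beat = os.stat(path).st_mtime
    except FileNotFoundError:
        # No file at all is treated like a missed heartbeat.
        return True
    return (now - last_beat) > interval_s


# Example call; "heartbeat" is the file name that appears in the task logs.
print(heartbeat_is_stale("heartbeat"))
```

With a beat every minute and a 20-minute limit, roughly 20 consecutive beats would have to be missed before the check trips.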
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
This change is in this version so it should boot faster. |
Joined: 6 Sep 08 Posts: 117 Credit: 12,271,320 RAC: 18,234 |
The heartbeat interval is 20mins and it should beat every minute. So the VM is killed if it takes longer than 20mins to boot or has frozen for 20 minutes. Are the times in the tasks below right? Looks like the timeout is still 10mins and the heartbeat interval is 20 mins, surely I'm misreading this? The actual failure is probably OK - it tried to use 2 CPU when it shouldn't. Theory Simulation v263.70 (vbox64_mt_mcore) x86_64-pc-linux-gnu 2018-07-07 01:00:20 (7559): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) ..... 2018-07-07 01:00:28 (7559): Successfully started VM. (PID = '8126') ..... 2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing. 2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing file system status. (errno = '2') Another host... 2018-07-07 06:26:32 (2567): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) ..... 2018-07-07 06:26:38 (2567): Successfully started VM. (PID = '3049') ..... 2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing. 2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing file system status. (errno = '2') |
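The gap between "Successfully started VM" and "VM Heartbeat file specified, but missing" can be checked directly from the timestamps in the two logs above. A small Python sketch; the helper name `elapsed_seconds` is illustrative, and the timestamps are copied from the logs:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start, end):
    """Seconds between two log timestamps in the format used above."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# (VM started, heartbeat reported missing) for the two hosts quoted above.
runs = [
    ("2018-07-07 01:00:28", "2018-07-07 01:10:23"),
    ("2018-07-07 06:26:38", "2018-07-07 06:36:33"),
]

for started, failed in runs:
    secs = elapsed_seconds(started, failed)
    print(f"{started} -> {failed}: {secs:.0f} s ({secs / 60:.1f} min)")
```

Both runs come out at 595 seconds, just under 10 minutes, which is about half of the 1200 seconds announced in the "Heartbeat check" line.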
Joined: 3 Nov 12 Posts: 54 Credit: 133,242,422 RAC: 181,706 |
For me too. Since the mc apps 263.60 and 263.70 I have had the "VM Heartbeat file specified, but missing heartbeat." trouble on three older machines. They are all dual core, Intel and AMD. Most of the tasks failed, but not all of them. The computers are ID: 10392891 ID: 10318807 ID: 10395493 Here is one of the results: https://lhcathome.cern.ch/lhcathome/result.php?resultid=199736060 Before the mc apps these three worked like a charm. Please help. |
Joined: 24 Oct 04 Posts: 1153 Credit: 50,841,180 RAC: 19,766 |
Mine are still working perfect (197 Valids and 0 Errors) but I still have to finish all the 263.60's before I run 70's on all of them. |
Joined: 15 Jun 08 Posts: 2473 Credit: 244,979,460 RAC: 170,842 |
The old app was a singlecore app. The new app is a multicore app that uses a 2-core setup on your hosts. This is not recommended for the hosts you mentioned as they have only 2 cores. You may navigate to the project's preferences page and set "max # of CPUs" to 1. If this doesn't help, further adjustments may be necessary. |
Joined: 18 Dec 15 Posts: 1735 Credit: 112,366,399 RAC: 51,051 |
... I have the "VM Heartbeat file specified, but missing heartbeat." trouble. Reading this, I remember that I, too, had these "heartbeat" problems on two of my PCs - both AMD (I never had the heartbeat problem with Intel). After I had posted these problems here, a very valuable piece of advice I got from Crystal Pellet was that the problem may have to do with the low process priority of vboxwrapper.exe. So I installed the tool "Prio", with which I can permanently set the priority to "normal". Since that time, I have no longer experienced the "heartbeat" trouble. No idea whether this tool might be of help in your case, too. But perhaps it's worth a try. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The old app was a singlecore app. I have 2 hosts that have 2 core CPUs. One has 4GB RAM and runs Linux, the other has 8GB and runs Windows 8.1. For the 4GB host, if I set "max # of CPUs " to 1 and "max tasks" to unlimited, I get 2 single core tasks running simultaneously and they run to completion error free. If I set "max # of CPUs" to 2 then I get 1 task running and it inevitably crashes. My other 2 core host (the one with 8GB RAM) is happy with "max # of CPUs " set to 2. So it seems it's not just the number of cores, the RAM size seems to affect it too? |
Joined: 15 Jun 08 Posts: 2473 Credit: 244,979,460 RAC: 170,842 |
... the RAM size seems to affect it too? If you reduce "max # of CPUs" the RAM requirements will automatically be reduced. The corresponding value is included in the scheduler_reply if you download a fresh task. Tuning this value would be the least effort. |
Joined: 24 Oct 04 Posts: 1153 Credit: 50,841,180 RAC: 19,766 |
263.60 and 263.70 are both multi-core and I am still finishing up the .60 version tasks, since they have all worked with no problems, all Valids (I did get one Valid that was only 25mins, but the rest are normal). As for the VB version: on the three hosts, which are almost the same 8-cores with the same OS and RAM, I still have VB versions 5.2.2 and 5.2.6, with the newest Boinc on one and the previous version on the other two. They had no problems running the 263.60 version, and the 263.70 version tasks are still running, so no finished ones yet. The 8-core I am on right now shows that the 4 X 2-core tasks are running using only 5GB RAM and 72 - 85% of CPU (3.67GHz), and the other two 8-core hosts show the same (all are Windows 10 OS). I do also have a Win 7 still testing the -dev version of these multi-cores, and they are all Valids; this is on my older quad-core hosts with no more than 12GB RAM (over 4000 Valids), and the LHCb's are working there right now too (also multi-core). I have only updated the VB version to the newer one on my -dev hosts so far but will do the same on these when I get the time. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
... the RAM size seems to affect it too? It seems odd that it will run two 1-core tasks simultaneously but not one 2-core task. I was thinking that either scenario would require approximately the same amount of RAM in total and that if there were to be a difference then one 2-core task would have some inherent efficiency and would require less total RAM. Instead it seems somehow less efficient. Not complaining, just observing. Hmmm. Maybe boot to runlevel 3 instead of 5 which would not start the X subsystem and free up some RAM. There must be a few other unnecessary luxuries that could be eliminated as well. |
Joined: 18 Dec 15 Posts: 1735 Credit: 112,366,399 RAC: 51,051 |
... and the LHCb's are working there right now too. (also multi-core) LHCb can be run multicore? |
©2024 CERN