Message boards : Theory Application : New Version 263.70
Profile Laurence
Project administrator
Project developer

Send message
Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 35773 - Posted: 5 Jul 2018, 9:21:37 UTC
Last modified: 6 Jul 2018, 10:59:48 UTC

This version has an improved CVMFS configuration and provides new vboxwrappers for Mac and Linux to address the issues reported by computezrmle.

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4746&postid=35682#35682
ID: 35773 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1459
Credit: 77,495,507
RAC: 93,755
Message 35776 - Posted: 5 Jul 2018, 11:19:57 UTC - in response to Message 35773.  

Without iptables the VM's CVMFS still ignores the local squid (keyword: DIRECT)

Guest Log: [DEBUG] Detected squid proxy http://<hostname_censored_by_volunteer/>:3128
Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
Guest Log: 2.4.4.0 3538 1 27796 6583 3 1 183730 10240000 2 65024 0 15 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1
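
To read the status table above without counting columns by hand, a minimal Python sketch can pair each header name with its value. The header and value strings are copied from the guest log; the function name `parse_stat` is only an illustration and not part of CVMFS.

```python
# Header and value rows copied from the guest log above (without the "Guest Log: " prefix).
HEADER = ("VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
          "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
          "RX(K) SPEED(K/S) HOST PROXY ONLINE")
VALUES = ("2.4.4.0 3538 1 27796 6583 3 1 183730 10240000 2 65024 0 15 100 0 0 "
          "http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1")


def parse_stat(header, values):
    """Pair whitespace-separated column names with their values (hypothetical helper)."""
    names = header.split()
    fields = values.split()
    if len(names) != len(fields):
        raise ValueError("header and value rows have different column counts")
    return dict(zip(names, fields))


stat = parse_stat(HEADER, VALUES)
print(stat["PROXY"])  # prints "DIRECT": the detected squid is not being used
```

The PROXY column is the one to watch: it shows DIRECT here even though a squid was detected one line earlier.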
ID: 35776 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 110
Credit: 6,718,445
RAC: 771
Message 35781 - Posted: 5 Jul 2018, 12:45:35 UTC

I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. Can't be more precise since the history has been deleted.

Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc and all the stuff (AV updates and whatnot) that go on at startup?
ID: 35781 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1459
Credit: 77,495,507
RAC: 93,755
Message 35782 - Posted: 5 Jul 2018, 12:59:03 UTC - in response to Message 35781.  

The new app is a multicore app and it reserves more RAM per VM than the old one.
This may result in higher swapping activity and ultimately in timing problems.

You may want to review the total load on your hosts and, if necessary, adapt the settings.
ID: 35782 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 947
Credit: 40,288,123
RAC: 8,159
Message 35783 - Posted: 5 Jul 2018, 18:41:52 UTC - in response to Message 35773.  

So I guess this is still using the same vdi since the number is the same... I hope.

Mine are all Windows of course, but I have 97 Valids and 0 Errors... and now I don't even have to *remove* the old vdi's in the VB manager any more.
ID: 35783 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1284
Credit: 23,104,328
RAC: 2,664
Message 35784 - Posted: 5 Jul 2018, 19:32:33 UTC - in response to Message 35781.  

I'm seeing an increase in heartbeat failures...
Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc and all the stuff (AV updates and whatnot) that go on at startup?
I, too, plead for a timeout increase - today, for the first time, I had a heartbeat failure on a host where so far I had never had one. On 2 of my other PCs, there are heartbeat failures once in a while, no idea why.
My opinion is that the heartbeat policy applied by LHC is much too rigid.
It's simply annoying that when a task is nearly finished (after having run for some 46,000 seconds), all of a sudden it dies because of the missing heartbeat :-(((
ID: 35784 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 110
Credit: 6,718,445
RAC: 771
Message 35785 - Posted: 5 Jul 2018, 22:01:46 UTC - in response to Message 35781.  

I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. Can't be more precise since the history has been deleted.

Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc and all the stuff (AV updates and whatnot) that go on at startup?

Most of the failures are on startup. Since tasks stop and restart at least once here, a lot of time can be wasted when tasks fail on subsequent starts.

Once upon a time... a config change was made to CMS which, as I remember, largely fixed the problem.
Then it was taken away... and never returned; no explanation.
ID: 35785 · Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer

Send message
Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 35788 - Posted: 6 Jul 2018, 13:27:39 UTC - in response to Message 35783.  

The thread was labeled incorrectly; it does have a new version number.
ID: 35788 · Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer

Send message
Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 35790 - Posted: 6 Jul 2018, 13:33:13 UTC - in response to Message 35784.  

The heartbeat interval is 20 minutes and it should beat every minute. So the VM is killed if it takes longer than 20 minutes to boot or has frozen for 20 minutes.
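
As an illustration of the rule described above, here is a minimal Python sketch of a stale-heartbeat check. It is not the vboxwrapper code; the function name and the way the file age is measured (modification time) are assumptions made for the example.

```python
import os
import time

HEARTBEAT_TIMEOUT_S = 20 * 60  # 20 minutes, as stated above


def heartbeat_is_stale(path, timeout_s=HEARTBEAT_TIMEOUT_S, now=None):
    """Return True if the heartbeat file is missing or was last touched
    more than timeout_s seconds ago (hypothetical helper, not vboxwrapper)."""
    now = time.time() if now is None else now
    try:
        last_beat = os.path.getmtime(path)
    except OSError:
        # No file at all counts as a missing heartbeat.
        return True
    return (now - last_beat) > timeout_s
```

With a beat every minute, a healthy VM would keep the file age well under the limit; only a VM that never writes the file, or stops updating it for longer than the timeout, would be flagged.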
ID: 35790 · Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer

Send message
Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 35791 - Posted: 6 Jul 2018, 13:34:04 UTC - in response to Message 35785.  

This change is in this version so it should boot faster.
ID: 35791 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 110
Credit: 6,718,445
RAC: 771
Message 35807 - Posted: 7 Jul 2018, 9:58:40 UTC - in response to Message 35790.  

The heartbeat interval is 20 minutes and it should beat every minute. So the VM is killed if it takes longer than 20 minutes to boot or has frozen for 20 minutes.


Are the times in the tasks below right? It looks like the timeout is still 10 mins while the heartbeat interval is 20 mins. Surely I'm misreading this?
The actual failure is probably OK - it tried to use 2 CPUs when it shouldn't.

Theory Simulation v263.70 (vbox64_mt_mcore)
x86_64-pc-linux-gnu

2018-07-07 01:00:20 (7559): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
.....
2018-07-07 01:00:28 (7559): Successfully started VM. (PID = '8126')
.....
2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing.
2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing file system status. (errno = '2')

Another host...


2018-07-07 06:26:32 (2567): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
.....
2018-07-07 06:26:38 (2567): Successfully started VM. (PID = '3049')
.....
2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing.
2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing file system status. (errno = '2')
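
As a rough check of the timing in the logs above, a minimal Python sketch (an illustration, not project code) can subtract the "Successfully started VM" timestamp from the first "missing" timestamp. The log lines are copied from the first host; the helper names are made up for the example.

```python
from datetime import datetime

# Lines copied from the first host's log above (elided "....." lines left out).
LOG = """\
2018-07-07 01:00:20 (7559): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2018-07-07 01:00:28 (7559): Successfully started VM. (PID = '8126')
2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing.
"""


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_until_failure(log_text):
    """Seconds between VM start and the first 'heartbeat missing' message."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    start = next(line for line in lines if "Successfully started VM" in line)
    fail = next(line for line in lines if "but missing." in line)
    return (timestamp(fail) - timestamp(start)).total_seconds()


print(seconds_until_failure(LOG))  # prints 595.0
```

That is 9 min 55 s after the VM started, and the second host's timestamps (06:26:38 to 06:36:33) give the same 595 seconds, so both tasks were stopped at roughly 10 minutes rather than 20.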
ID: 35807 · Report as offensive     Reply Quote
Saturn911

Send message
Joined: 3 Nov 12
Posts: 3
Credit: 21,968,691
RAC: 25,989
Message 35812 - Posted: 7 Jul 2018, 18:19:06 UTC

For me too.
Since the mc apps 263.60 and 263.70 I have the "VM Heartbeat file specified, but missing heartbeat." trouble.
On three older machines, they all are dual core. Intel and AMD.
Most of the tasks failed but not all of them.
The computers are
ID: 10392891
ID: 10318807
ID: 10395493
Here is one of the results:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=199736060

Before the mc apps these three worked like a charm.

Please help.
ID: 35812 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 947
Credit: 40,288,123
RAC: 8,159
Message 35813 - Posted: 7 Jul 2018, 19:08:49 UTC

Mine are still working perfectly (197 Valids and 0 Errors), but I still have to finish all the 263.60s before I run .70s on all of them.
ID: 35813 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1459
Credit: 77,495,507
RAC: 93,755
Message 35814 - Posted: 7 Jul 2018, 19:09:57 UTC - in response to Message 35812.  

The old app was a singlecore app.
The new app is a multicore app that uses a 2-core setup on your hosts.
This is not recommended for the hosts you mentioned as they have only 2 cores.

You may navigate to the project's preferences page and set "max # of CPUs" to 1.
If this doesn't help, further adjustments may be necessary.
ID: 35814 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1284
Credit: 23,104,328
RAC: 2,664
Message 35816 - Posted: 7 Jul 2018, 19:37:31 UTC - in response to Message 35812.  

... I have the "VM Heartbeat file specified, but missing heartbeat." trouble.
On three older machines, they all are dual core. Intel and AMD.
Reading this, I remember that I, too, had these "heartbeat" problems on two of my PCs - both AMD (I never had the heartbeat problem with Intel).

After I had posted these problems here, I got a very valuable piece of advice from Crystal Pellet: the problem may have to do with the low process priority of vboxwrapper.exe. So I installed the tool "Prio", with which I can permanently set the priority to "normal".
Since then, I have no longer experienced the "heartbeat" trouble.

No idea whether this tool might be of help in your case, too. But perhaps it's worth a try.
ID: 35816 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35817 - Posted: 7 Jul 2018, 19:43:56 UTC - in response to Message 35814.  

The old app was a singlecore app.
The new app is a multicore app that uses a 2-core setup on your hosts.
This is not recommended for the hosts you mentioned as they have only 2 cores.

You may navigate to the project's preferences page and set "max # of CPUs" to 1.
If this doesn't help, further adjustments may be necessary.


I have 2 hosts that have 2 core CPUs. One has 4GB RAM and runs Linux, the other has 8GB and runs Windows 8.1.

For the 4GB host, if I set "max # of CPUs " to 1 and "max tasks" to unlimited, I get 2 single core tasks running simultaneously and they run to completion error free. If I set "max # of CPUs" to 2 then I get 1 task running and it inevitably crashes.

My other 2 core host (the one with 8GB RAM) is happy with "max # of CPUs " set to 2. So it seems it's not just the number of cores, the RAM size seems to affect it too?
ID: 35817 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1459
Credit: 77,495,507
RAC: 93,755
Message 35818 - Posted: 7 Jul 2018, 20:07:44 UTC - in response to Message 35817.  

... the RAM size seems to affect it too?

If you reduce "max # of CPUs" the RAM requirements will automatically be reduced.
The corresponding value is included in the scheduler_reply if you download a fresh task.

Tuning this value would be the least effort.
ID: 35818 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 947
Credit: 40,288,123
RAC: 8,159
Message 35820 - Posted: 7 Jul 2018, 21:48:30 UTC

263.60 and 263.70 are both multi-core, and I am still finishing up the .60 version tasks. They have all worked with no problems, all Valids (I did get one Valid that ran for only 25 mins, but the rest are normal).

As for the VB version: on the three hosts that are almost identical (8 cores, same OS and RAM) I still have VB versions 5.2.2 and 5.2.6, with the newest BOINC on one and the previous version on the other two. They had no problems running the 263.60 tasks; the 263.70 tasks are still running, so none have finished yet.

The 8-core I am on right now shows the 4 x 2-core tasks running with only 5 GB RAM and 72-85% of the CPU (3.67 GHz), and the other two 8-core hosts show the same (all are Windows 10). I also have a Win 7 host still testing the -dev version of these multi-cores, and they are all Valids. That is on my older quad-core hosts with no more than 12 GB RAM (over 4000 Valids), and the LHCb's are working there right now too. (also multi-core)

I have only updated the VB version to the newer one on my -dev hosts so far but will do the same on these when I get the time.
ID: 35820 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35821 - Posted: 7 Jul 2018, 23:11:55 UTC - in response to Message 35818.  

... the RAM size seems to affect it too?

If you reduce "max # of CPUs" the RAM requirements will automatically be reduced.
The corresponding value is included in the scheduler_reply if you download a fresh task.

Tuning this value would be the least effort.


It seems odd that it will run two 1-core tasks simultaneously but not one 2-core task. I was thinking that either scenario would require approximately the same amount of RAM in total and that if there were to be a difference then one 2-core task would have some inherent efficiency and would require less total RAM. Instead it seems somehow less efficient. Not complaining, just observing.

Hmmm. Maybe boot to runlevel 3 instead of 5 which would not start the X subsystem and free up some RAM. There must be a few other unnecessary luxuries that could be eliminated as well.
ID: 35821 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1284
Credit: 23,104,328
RAC: 2,664
Message 35823 - Posted: 8 Jul 2018, 4:23:02 UTC - in response to Message 35820.  

... and the LHCb's are working there right now too. (also multi-core)
LHCb can be run multicore?
ID: 35823 · Report as offensive     Reply Quote

©2020 CERN