Message boards : Theory Application : New Version 263.70

Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 114,201,417
RAC: 92,090
Message 35824 - Posted: 8 Jul 2018, 5:13:42 UTC - in response to Message 35814.  

The new app is a multicore app that uses a 2-core setup on your hosts.
This is not recommended for the hosts you mentioned as they have only 2 cores.

You may navigate to the project's preferences page and set "max # of CPUs" to 1.

First I will give this a try.

About memory
Two of the failing computers are equipped with 6 GB of RAM.
I think this should be enough for a 2-core Theory task.
Another has only 4 GB, but the result is the same.

Is it possible that some other OS task (Manjaro Linux here)
blocks one of the CPUs and prevents VirtualBox from working correctly?
ID: 35824
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,310
RAC: 3,828
Message 35833 - Posted: 8 Jul 2018, 17:37:57 UTC - in response to Message 35823.  

... and the LHCb's are working there right now too. (also multi-core)
LHCb can be run multicore?


Not here... over at the LHC-dev test site.
ID: 35833
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35834 - Posted: 8 Jul 2018, 19:13:48 UTC - in response to Message 35824.  

Is it possible that some other OS task (Manjaro Linux here)
blocks one of the CPUs and prevents VirtualBox from working correctly?

Possibly your desktop environment. Full-featured desktops like Gnome or KDE use considerable RAM and CPU time. Lightweight desktops like LXDE and XFCE use considerably less of both. Which desktop do you use?
ID: 35834
m
Joined: 6 Sep 08
Posts: 116
Credit: 10,927,002
RAC: 2,464
Message 35875 - Posted: 12 Jul 2018, 13:11:02 UTC
Last modified: 12 Jul 2018, 13:16:59 UTC

VMs are using the local squid again

(3909): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-07-11 02:53:10 (3909): Guest Log: 2.4.4.0 3540 1 25728 6631 3 1 183741 10240000 2 65024 0 15 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch http://192.168.100.137:3128 1
2018-07-11 02:53:11 (3909): Guest Log: [INFO] Reading volunteer information

Working OK so far.
Thanks Laurence.
ID: 35875
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 35883 - Posted: 13 Jul 2018, 9:41:56 UTC - in response to Message 35875.  
Last modified: 13 Jul 2018, 9:49:48 UTC

Same here, openhtc.io and local squid are used now.
Thanks Laurence!

Some questions and a suggestion:
- Is it correct that if the VM is configured to use 4 cores, it will run 4 separate Condor Jobs concurrently?
- What is displayed on the ALT+F2 screen? Is it a randomly chosen output from one of the currently running Condor Jobs?
- Would it be possible to associate one ALT+Fx screen with one particular core (or slot directory) to display the output of the Condor Job that is currently running on that core (e.g. screen ALT+F2 displays the output of the Condor Job running on Core 1, ALT+F3 displays the output of the Condor Job running on Core 2, and so on)?
ID: 35883
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,488
RAC: 136,073
Message 35885 - Posted: 13 Jul 2018, 17:18:28 UTC - in response to Message 35883.  

... Is it correct that if the VM is configured to use 4 cores, it will run 4 separate Condor Jobs concurrently?

Yes.
As many concurrently running subtasks (or Condor Jobs) as cores are configured.
ID: 35885
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 35886 - Posted: 13 Jul 2018, 19:35:31 UTC - in response to Message 35883.  
Last modified: 13 Jul 2018, 19:38:21 UTC

- What is displayed on the ALT+F2 screen? Is it a randomly chosen output from one of the currently running Condor Jobs?
It will show the job running on the first of the cores. You can see the output from all the allocated cores under Show Graphics then Logs then down to running-slotX.log, although that will only update if you refresh.

- Would it be possible to associate one ALT+Fx screen with one particular core (or slot directory) to display the output of the Condor Job that is currently running on that core (e.g. screen ALT+F2 displays the output of the Condor Job running on Core 1, ALT+F3 displays the output of the Condor Job running on Core 2, and so on)?
That would be easier for us users than the method above, but I don't know how easy it is to implement, or even whether it is possible, as ALT+F(1 - 6) are already in use for various outputs.
ID: 35886
Saturn911
Joined: 3 Nov 12
Posts: 36
Credit: 114,201,417
RAC: 92,090
Message 35890 - Posted: 14 Jul 2018, 6:45:02 UTC - in response to Message 35834.  

Possibly your desktop environment. Full-featured desktops like Gnome or KDE use considerable RAM and CPU time. Lightweight desktops like LXDE and XFCE use considerably less of both. Which desktop do you use?

Boinc is installed as a system service here.
So I don't need to log in to desktop for running LHC@home.
Btw, it's XFCE on the 6G machines while the 4G host runs without desktop environment.

You may navigate to the project's preferences page and set "max # of CPUs" to 1.

Since I did this I had no more "VM Heartbeat file specified, but missing heartbeat." errors.
To me it looks like we need a processor with more than two threads to run the 2-processor mt tasks.

Now I run two single-core mt tasks at once and it's OK.

Thank you a lot for your help!
ID: 35890
m
Joined: 6 Sep 08
Posts: 116
Credit: 10,927,002
RAC: 2,464
Message 35935 - Posted: 16 Jul 2018, 9:39:03 UTC - in response to Message 35875.  

VMs are using the local squid again
Working OK so far.
Thanks Laurence.

Maybe I wrote too soon: the VM still sometimes fails to use the local squid.
2018-07-14 03:46:43 (2620): Guest Log: [DEBUG] Detected squid proxy http://192.168.100.137:3128

2018-07-14 03:47:59 (2620): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE

2018-07-14 03:47:59 (2620): Guest Log: 2.4.4.0 3533 1 25768 6661 3 1 183731 10240000 2 65024 0 15 93.3333 13 21 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 0
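
For anyone who wants to check this in their own task logs, here is a minimal Python sketch (not part of any project tooling). It maps the values of such a statistics line onto the column names printed in the header line and reports whether the PROXY column holds a proxy URL or DIRECT. The function names are made up for illustration, and it assumes the line contains the "Guest Log:" marker followed by exactly the 19 columns shown above.

```python
# Column names as printed in the "Guest Log: VERSION PID ..." header line.
HEADER = ("VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
          "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN "
          "HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE").split()

MARKER = "Guest Log:"


def parse_stat_line(line):
    """Map the values after 'Guest Log:' onto the header columns."""
    if MARKER not in line:
        raise ValueError("not a guest log line")
    values = line.split(MARKER, 1)[1].split()
    if len(values) != len(HEADER):
        raise ValueError("unexpected number of columns")
    return dict(zip(HEADER, values))


def uses_local_proxy(line):
    """True if the PROXY column holds a proxy URL rather than DIRECT."""
    return parse_stat_line(line)["PROXY"] != "DIRECT"
```

Fed with the statistics line above, `uses_local_proxy` returns False because the PROXY column reads DIRECT; with the line from 2018-07-11 it returns True.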
ID: 35935
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 35964 - Posted: 20 Jul 2018, 11:49:42 UTC - in response to Message 35935.  

I had the same problem with the previous version (before multi-core). I don't know exact numbers, but maybe 1 out of 6 or so did not set up the proxy correctly.

Do you guys know why my Theory VMs do NOT shut down correctly? This happens with every single task, e.g. this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200422861
The log says:
2018-07-20 12:51:44 (4364): Guest Log: [INFO] Job finished in slot1 with 0.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Condor exited with return value N/A.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Shutting Down.
2018-07-20 13:02:07 (4364): VM Completion File Detected.
2018-07-20 13:02:07 (4364): VM Completion Message: Condor exited with return value N/A..
2018-07-20 13:02:07 (4364): Powering off VM.
2018-07-20 13:07:11 (4364): VM did not power off when requested.
2018-07-20 13:07:11 (4364): VM was successfully terminated.
2018-07-20 13:07:11 (4364): Deregistering VM. (boinc_95d70c61e78dce9e, slot#0)
2018-07-20 13:07:11 (4364): Removing network bandwidth throttle group from VM.
2018-07-20 13:07:11 (4364): Removing VM from VirtualBox.
13:07:17 (4364): called boinc_finish(0)
ID: 35964
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 35965 - Posted: 20 Jul 2018, 12:10:52 UTC - in response to Message 35964.  

I had the same problem with the previous version (before multi-core). I don't know exact numbers, but maybe 1 out of 6 or so did not set up the proxy correctly.

Do you guys know why my Theory VMs do NOT shut down correctly? This happens with every single task, e.g. this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200422861
The log says:
2018-07-20 12:51:44 (4364): Guest Log: [INFO] Job finished in slot1 with 0.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Condor exited with return value N/A.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Shutting Down.
2018-07-20 13:02:07 (4364): VM Completion File Detected.
2018-07-20 13:02:07 (4364): VM Completion Message: Condor exited with return value N/A..
2018-07-20 13:02:07 (4364): Powering off VM.
2018-07-20 13:07:11 (4364): VM did not power off when requested.
2018-07-20 13:07:11 (4364): VM was successfully terminated.
2018-07-20 13:07:11 (4364): Deregistering VM. (boinc_95d70c61e78dce9e, slot#0)
2018-07-20 13:07:11 (4364): Removing network bandwidth throttle group from VM.
2018-07-20 13:07:11 (4364): Removing VM from VirtualBox.
13:07:17 (4364): called boinc_finish(0)


It doesn't seem very serious as the VM was terminated even though the power-off failed.
Perhaps Laurence can comment but for now don't worry...

Ben
ID: 35965
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,488
RAC: 136,073
Message 35966 - Posted: 20 Jul 2018, 12:41:04 UTC - in response to Message 35964.  

"VM did not power off when requested." indicates that the shutdown needs more time than expected and hits a watchdog timeout.
Don't worry if it happens only occasionally.

If it occurs very often and together with other indicators like
- lots of blank lines in the logs
- lines with lots of garbage
- the "postponed..." error mentioned in other threads

the computer or one of its resources is most likely permanently too busy.
That situation should be investigated.
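
To see how long the wrapper actually waited before giving up, the timestamps in the log can be subtracted. A minimal Python sketch follows; the function names are illustrative, and it assumes both lines start with a "YYYY-MM-DD HH:MM:SS" timestamp as in the log quoted earlier in this thread.

```python
from datetime import datetime


def log_time(line):
    """Parse the 'YYYY-MM-DD HH:MM:SS' prefix of a wrapper log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def shutdown_wait_seconds(lines):
    """Seconds between 'Powering off VM.' and the power-off failure message.

    Returns None if either message is missing from the given lines.
    """
    start = end = None
    for line in lines:
        if "Powering off VM." in line:
            start = log_time(line)
        elif "VM did not power off when requested." in line:
            end = log_time(line)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()
```

For the log posted above (13:02:07 to 13:07:11) this gives 304 seconds, i.e. the VM had about five minutes to shut down before it was terminated.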
ID: 35966
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,488
RAC: 136,073
Message 35967 - Posted: 20 Jul 2018, 12:49:29 UTC - in response to Message 35935.  

Maybe I wrote too soon: the VM still sometimes fails to use the local squid.

Sorry for the late response.

I checked a couple of your logs and all of them had a correct proxy configuration.
Do you still have some VMs with errors?
Do you notice a relevant number of requests in your proxy log?
ID: 35967
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,449,365
RAC: 103,210
Message 35972 - Posted: 20 Jul 2018, 16:51:48 UTC - in response to Message 35966.  

"VM did not power off when requested."
I see this with ALL tasks on the machine which uses Theory Simulation v263.70 (vbox64_mt_mcore) windows_x86_64
but NOT on the two other PCs which use application Theory Simulation v263.50 (vbox32) windows_intelx86

whatever this observation is worth ...
ID: 35972
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35974 - Posted: 20 Jul 2018, 17:38:00 UTC - in response to Message 35972.  

Checked the last 19 Theory tasks I returned and none show "Failed to shutdown the VM".
Currently running Theory on only 1 host and it's using Theory Simulation v263.70 (vbox64_mt_mcore)x86_64-pc-linux-gnu.
ID: 35974
m
Joined: 6 Sep 08
Posts: 116
Credit: 10,927,002
RAC: 2,464
Message 35978 - Posted: 20 Jul 2018, 21:53:15 UTC - in response to Message 35967.  

the VM still sometimes fails to use the local squid.

Do you still have some VMs with errors?

Yes, in total I have details of three tasks:-
In all cases the proxy was reported as detected, but the VM was not reported as set up to use it.
Entries in the access log are taken to show that the proxy was (or wasn't) actually used.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200124140 ( Proxy used).
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200145344 ( Proxy not used)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200221912 ( Proxy used)

The last two are the same host.

The inconsistency is puzzling. I'll have to wait for some more. If anyone else sees these failures it would be interesting to see their results. Hopefully I haven't misread something somewhere...
ID: 35978
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,488
RAC: 136,073
Message 35980 - Posted: 21 Jul 2018, 8:15:09 UTC - in response to Message 35978.  

The CVMFS inside the VM configures a proxy via this directive:
CVMFS_HTTP_PROXY='http://<proxy_name_or_IP>:3128;DIRECT'
If the proxy does not respond, DIRECT is used as fallback.

Besides that, the configuration directive CVMFS_PROXY_RESET_AFTER=300 makes CVMFS try to switch back to the first proxy after 300 s.

Thus the LHC VMs behave as expected and it should be investigated why your local proxy doesn't respond occasionally.
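
The fallback order can be illustrated with a small Python sketch. This is only a model of the behaviour described above, not CVMFS code: the function names are invented, the IP address is the one from the logs posted earlier in this thread, and the timed reset from CVMFS_PROXY_RESET_AFTER is left out. In a CVMFS_HTTP_PROXY value, ';' separates groups that are tried in order and '|' separates proxies that are load-balanced within one group.

```python
def parse_proxy_chain(value):
    """Split a CVMFS_HTTP_PROXY value into ordered failover groups.

    ';' separates groups tried one after another; '|' separates
    proxies that share the load within a single group.
    """
    groups = []
    for group in value.split(";"):
        members = [p.strip() for p in group.split("|") if p.strip()]
        if members:
            groups.append(members)
    return groups


def next_choice(groups, failed):
    """Return the usable proxies of the first group that is not fully failed."""
    for group in groups:
        alive = [p for p in group if p not in failed]
        if alive:
            return alive
    return []


chain = parse_proxy_chain("http://192.168.100.137:3128;DIRECT")
print(next_choice(chain, failed=set()))
print(next_choice(chain, failed={"http://192.168.100.137:3128"}))
```

The first call selects the local squid; once that proxy is marked as failed, the second call falls through to DIRECT, which matches the DIRECT entry seen in the PROXY column of the affected tasks.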
ID: 35980
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,449,365
RAC: 103,210
Message 36093 - Posted: 27 Jul 2018, 19:03:33 UTC

I have now looked up the application details for one of my PCs on which, among others, I have been crunching Theory tasks for a long time.
Interesting results:

Theory Simulation 263.20 windows_x86_64 (vbox64): Average processing rate 24.38 GFLOPS

Theory Simulation 263.60 windows_x86_64 (vbox64_mt_mcore): Average processing rate 30.63 GFLOPS

Theory Simulation 263.70 windows_x86_64 (vbox64_mt_mcore): Average processing rate 21.53 GFLOPS

So, version 263.70 is even slower, at least on my PC, than 263.20 was. And also slower than 263.60 - how come?
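
To put numbers on the difference, here is a short Python sketch that computes the relative change between the rates listed above. The helper name is made up, and the figures are simply the three values from this post.

```python
def relative_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


# Average processing rates in GFLOPS, as listed above.
rates = {"263.20": 24.38, "263.60": 30.63, "263.70": 21.53}

for version in ("263.20", "263.60"):
    change = relative_change(rates["263.70"], rates[version])
    print(f"263.70 vs {version}: {change:+.1f}%")
```

That works out to roughly 12% below 263.20 and roughly 30% below 263.60.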
ID: 36093
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,488
RAC: 136,073
Message 36094 - Posted: 27 Jul 2018, 19:16:09 UTC - in response to Message 36093.  

Your VMs run a mix of different scientific apps like pythia, sherpa, agile-runmc ...
Some of them, e.g. sherpa, occasionally cause longer idle periods that influence the average efficiency index.
ID: 36094
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,449,365
RAC: 103,210
Message 36657 - Posted: 7 Sep 2018, 10:29:24 UTC

As I did some time ago when v263.70 came out, I have now made another comparison between 3 of my PCs.

The result: with 2 old machines running 32-bit Windows on roughly 10-year-old processors (AMD Turion(tm) Neo X2 Dual Core Processor L625 and AMD Turion Dual-Core ZM-80), using Theory Simulation v263.50 (vbox32) windows_intelx86, I get about 3 times as many credit points as with a newer PC running 64-bit Windows on an Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz, using Theory Simulation v263.70 (vbox64_mt_mcore) windows_x86_64.

Can anyone explain this?
ID: 36657