Thread 'New Version 263.70'

Author	Message
Saturn911 Joined: 3 Nov 12 Posts: 87 Credit: 184,743,181 RAC: 84,503	Message 35824 - Posted: 8 Jul 2018, 5:13:42 UTC - in response to Message 35814. [quote]The new app is a multicore app that uses a 2-core setup on your hosts. This is not recommended for the hosts you mentioned as they have only 2 cores. You may navigate to the project's preferences page and set "max # of CPUs" to 1.[/quote] I will give this a try first. About memory: two of the failing computers are equipped with 6 GB of RAM, which I think should be enough for a 2-core Theory task. One of them has only 4 GB, but the result is the same. Is it possible that some other task of the OS (Manjaro Linux here) blocks one of the CPUs and prevents VB from working correctly? ID: 35824

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1261 Credit: 92,330,502 RAC: 108,383	Message 35833 - Posted: 8 Jul 2018, 17:37:57 UTC - in response to Message 35823. [quote][quote]... and the LHCb's are working there right now too. (also multi-core)[/quote] LHCb can be run multicore?[/quote] Not here... over at the LHC-dev test site. ID: 35833

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35834 - Posted: 8 Jul 2018, 19:13:48 UTC - in response to Message 35824. [quote]Is it possible that some other task of the OS (Manjaro Linux here) blocks one of the CPUs and prevents VB from working correctly?[/quote] Possibly your desktop environment. Full-featured desktops like Gnome or KDE use considerable RAM and CPU time. Lightweight desktops like LXDE and XFCE use considerably less RAM and CPU. Which desktop do you use? ID: 35834

m Joined: 6 Sep 08 Posts: 119 Credit: 14,761,224 RAC: 1,992	Message 35875 - Posted: 12 Jul 2018, 13:11:02 UTC Last modified: 12 Jul 2018, 13:16:59 UTC VMs are using the local squid again.
[pre]
(3909): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-07-11 02:53:10 (3909): Guest Log: 2.4.4.0 3540 1 25728 6631 3 1 183741 10240000 2 65024 0 15 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch http://192.168.100.137:3128 1
2018-07-11 02:53:11 (3909): Guest Log: [INFO] Reading volunteer information
[/pre]
Working OK so far. Thanks Laurence. ID: 35875

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,659,192 RAC: 202	Message 35883 - Posted: 13 Jul 2018, 9:41:56 UTC - in response to Message 35875. Last modified: 13 Jul 2018, 9:49:48 UTC Same here, openhtc.io and the local squid are used now. Thanks Laurence! Some questions and a suggestion:
- Is it correct that if the VM is configured to use 4 cores, it will run concurrently 4 separate Condor Jobs?
- What is displayed on the ALT+F2 screen? Is it a randomly chosen output from one of the currently running Condor Jobs?
- Would it be possible to associate one ALT+Fx screen to one particular core (or slot directory) to display the output of the Condor Job that is currently running on that core (e.g. screen ALT+F2 displays the output of the Condor Job running on Core 1, ALT+F3 displays the output of the Condor Job running on Core 2, and so on)?
ID: 35883

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 35885 - Posted: 13 Jul 2018, 17:18:28 UTC - in response to Message 35883. [quote]... Is it correct that if the VM is configured to use 4 cores, it will run concurrently 4 separate Condor Jobs?[/quote] Yes. As many subtasks (or Condor Jobs) run concurrently as cores are configured. ID: 35885

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 35886 - Posted: 13 Jul 2018, 19:35:31 UTC - in response to Message 35883. Last modified: 13 Jul 2018, 19:38:21 UTC [quote]- What is displayed on the ALT+F2 screen? Is it a randomly chosen output from one of the currently running Condor Jobs?[/quote] It will show the job running on the first of the cores. You can see the output from all the allocated cores under Show Graphics, then Logs, then down to running-slotX.log, although that will only update if you refresh. [quote]- Would it be possible to associate one ALT+Fx screen to one particular core (or slot directory) to display the output of the Condor Job that is currently running on that core (e.g. screen ALT+F2 displays the output of the Condor Job running on Core 1, ALT+F3 displays the output of the Condor Job running on Core 2, and so on)?[/quote] That would be easier for us users than the method above, but I don't know how easy it is to implement, or even if it is possible, as ALT+F(1 - 6) are already in use for various outputs. ID: 35886

Saturn911 Joined: 3 Nov 12 Posts: 87 Credit: 184,743,181 RAC: 84,503	Message 35890 - Posted: 14 Jul 2018, 6:45:02 UTC - in response to Message 35834. [quote]Possibly your desktop environment. Full-featured desktops like Gnome or KDE use considerable RAM and CPU time. Lightweight desktops like LXDE and XFCE use considerably less RAM and CPU. Which desktop do you use?[/quote] BOINC is installed as a system service here, so I don't need to log in to a desktop to run LHC@home. By the way, it's XFCE on the 6 GB machines, while the 4 GB host runs without a desktop environment. [quote]You may navigate to the project's preferences page and set "max # of CPUs" to 1.[/quote] Since I did this I have had no more "VM Heartbeat file specified, but missing heartbeat." errors. To me it looks like we need a processor with more than two threads to run the 2-processor mt-tasks. Now I run two single-core mt-tasks at once and it's OK. Thanks a lot for your help! ID: 35890

m Joined: 6 Sep 08 Posts: 119 Credit: 14,761,224 RAC: 1,992	Message 35935 - Posted: 16 Jul 2018, 9:39:03 UTC - in response to Message 35875. [quote]VMs are using the local squid again. Working OK so far. Thanks Laurence.[/quote] Maybe I wrote too soon, the VM still sometimes fails to use the local squid.
[pre]
2018-07-14 03:46:43 (2620): Guest Log: [DEBUG] Detected squid proxy http://192.168.100.137:3128
2018-07-14 03:47:59 (2620): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-07-14 03:47:59 (2620): Guest Log: 2.4.4.0 3533 1 25768 6661 3 1 183731 10240000 2 65024 0 15 93.3333 13 21 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 0
[/pre]
ID: 35935
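
A minimal sketch, not part of the original posts: the header row and the value row in the Guest Log excerpts can be paired up so that the PROXY column is easy to check for each task. The names `parse_stat` and `uses_proxy` are hypothetical, and the sketch assumes both rows are whitespace-separated with the same number of fields, as they are in the two excerpts quoted in this thread.

```python
# Header row exactly as it appears in the Guest Log excerpts in this thread.
HEADER = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
    "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
    "RX(K) SPEED(K/S) HOST PROXY ONLINE"
)

# Value rows copied from the two log excerpts (prefix up to "Guest Log:" removed).
ROW_11_JUL = (
    "2.4.4.0 3540 1 25728 6631 3 1 183741 10240000 2 65024 0 15 100 0 0 "
    "http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch "
    "http://192.168.100.137:3128 1"
)
ROW_14_JUL = (
    "2.4.4.0 3533 1 25768 6661 3 1 183731 10240000 2 65024 0 15 93.3333 13 21 "
    "http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 0"
)


def parse_stat(values, header=HEADER):
    """Map each column name in the header to the matching field of a value row."""
    names = header.split()
    fields = values.split()
    if len(names) != len(fields):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, fields))


def uses_proxy(row):
    """True if the PROXY column names a proxy rather than DIRECT."""
    return row["PROXY"] != "DIRECT"


if __name__ == "__main__":
    for label, values in (("11 Jul", ROW_11_JUL), ("14 Jul", ROW_14_JUL)):
        row = parse_stat(values)
        print(label, row["PROXY"], "proxy in use" if uses_proxy(row) else "direct")
```

The check only reads what the log reports at that moment; it says nothing about why the VM fell back to DIRECT.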

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,659,192 RAC: 202	Message 35964 - Posted: 20 Jul 2018, 11:49:42 UTC - in response to Message 35935. I had the same problem with the previous version (before multi-core). I don't know exact numbers, but maybe 1 out of 6 or so did not set up the proxy correctly. Do you guys know why my Theory VMs do NOT shut down correctly? This happens to every single task, e.g. this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200422861 The log says:
[pre]
2018-07-20 12:51:44 (4364): Guest Log: [INFO] Job finished in slot1 with 0.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Condor exited with return value N/A.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Shutting Down.
2018-07-20 13:02:07 (4364): VM Completion File Detected.
2018-07-20 13:02:07 (4364): VM Completion Message: Condor exited with return value N/A..
2018-07-20 13:02:07 (4364): Powering off VM.
2018-07-20 13:07:11 (4364): VM did not power off when requested.
2018-07-20 13:07:11 (4364): VM was successfully terminated.
2018-07-20 13:07:11 (4364): Deregistering VM. (boinc_95d70c61e78dce9e, slot#0)
2018-07-20 13:07:11 (4364): Removing network bandwidth throttle group from VM.
2018-07-20 13:07:11 (4364): Removing VM from VirtualBox.
13:07:17 (4364): called boinc_finish(0)
[/pre]
ID: 35964

Ben Segal Volunteer moderator Project administrator Joined: 1 Sep 04 Posts: 143 Credit: 2,579 RAC: 0	Message 35965 - Posted: 20 Jul 2018, 12:10:52 UTC - in response to Message 35964. [quote]I had the same problem with the previous version (before multi-core). I don't know exact numbers, but maybe 1 out of 6 or so did not set up the proxy correctly. Do you guys know why my Theory VMs do NOT shut down correctly? This happens to every single task, e.g. this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200422861 The log says:
[pre]
2018-07-20 12:51:44 (4364): Guest Log: [INFO] Job finished in slot1 with 0.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Condor exited with return value N/A.
2018-07-20 13:02:07 (4364): Guest Log: [INFO] Shutting Down.
2018-07-20 13:02:07 (4364): VM Completion File Detected.
2018-07-20 13:02:07 (4364): VM Completion Message: Condor exited with return value N/A..
2018-07-20 13:02:07 (4364): Powering off VM.
2018-07-20 13:07:11 (4364): VM did not power off when requested.
2018-07-20 13:07:11 (4364): VM was successfully terminated.
2018-07-20 13:07:11 (4364): Deregistering VM. (boinc_95d70c61e78dce9e, slot#0)
2018-07-20 13:07:11 (4364): Removing network bandwidth throttle group from VM.
2018-07-20 13:07:11 (4364): Removing VM from VirtualBox.
13:07:17 (4364): called boinc_finish(0)
[/pre]
[/quote]
It doesn't seem very serious, as the VM was terminated even though the power-off failed. Perhaps Laurence can comment, but for now don't worry... Ben ID: 35965

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 35966 - Posted: 20 Jul 2018, 12:41:04 UTC - in response to Message 35964. "VM did not power off when requested." indicates that the shutdown needs more time than expected and hits a watchdog timeout. Don't worry if it happens only occasionally. If it occurs very often and together with other indicators like
- lots of blank lines in the logs
- lines with lots of garbage
- the "postponed..." error mentioned in other threads
the computer or one of its resources is most likely permanently too busy. That situation should be investigated. ID: 35966
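
A small sketch, assuming the timestamp format of the log quoted in Message 35964: it measures how long the wrapper waited between "Powering off VM." and "VM did not power off when requested.". The function names are hypothetical, and the measured gap is only what this one log shows, not a documented timeout value.

```python
from datetime import datetime

# Three lines copied from the log quoted in Message 35964.
LOG = """\
2018-07-20 13:02:07 (4364): Powering off VM.
2018-07-20 13:07:11 (4364): VM did not power off when requested.
2018-07-20 13:07:11 (4364): VM was successfully terminated.
"""


def timestamp_of(log, message):
    """Return the timestamp of the first line ending with `message`, or None."""
    for line in log.splitlines():
        if line.endswith(message):
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    return None


def poweroff_wait_seconds(log):
    """Seconds between the power-off request and the give-up message, or None."""
    requested = timestamp_of(log, "Powering off VM.")
    gave_up = timestamp_of(log, "VM did not power off when requested.")
    if requested is None or gave_up is None:
        return None
    return (gave_up - requested).total_seconds()


if __name__ == "__main__":
    print(poweroff_wait_seconds(LOG))
```

For this log the gap is a little over five minutes, which fits the description of a shutdown that ran into a watchdog timeout.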

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 35967 - Posted: 20 Jul 2018, 12:49:29 UTC - in response to Message 35935. [quote]Maybe I wrote too soon, the VM still sometimes fails to use the local squid.[/quote] Sorry for the late response. I checked a couple of your logs and all of them had a correct proxy configuration. Do you still have some VMs with errors? Do you notice a relevant number of requests in your proxy log? ID: 35967

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,237,934 RAC: 106,139	Message 35972 - Posted: 20 Jul 2018, 16:51:48 UTC - in response to Message 35966. [quote]"VM did not power off when requested."[/quote] I see this with ALL tasks on the machine that uses Theory Simulation v263.70 (vbox64_mt_mcore) windows_x86_64, but NOT on the two other PCs, which use Theory Simulation v263.50 (vbox32) windows_intelx86. For whatever this observation is worth ... ID: 35972

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35974 - Posted: 20 Jul 2018, 17:38:00 UTC - in response to Message 35972. I checked the last 19 Theory tasks I returned and none show "Failed to shutdown the VM". I am currently running Theory on only 1 host, and it's using Theory Simulation v263.70 (vbox64_mt_mcore) x86_64-pc-linux-gnu. ID: 35974

m Joined: 6 Sep 08 Posts: 119 Credit: 14,761,224 RAC: 1,992	Message 35978 - Posted: 20 Jul 2018, 21:53:15 UTC - in response to Message 35967. [quote]the VM still sometimes fails to use the local squid.[/quote] [quote]Do you still have some VMs with errors?[/quote] Yes, in total I have details of three tasks. In all cases the proxy was reported as detected, but the VM was not reported as set up to use it. Entries in the access log are taken to show that the proxy was (or wasn't) actually used.
- https://lhcathome.cern.ch/lhcathome/result.php?resultid=200124140 (proxy used)
- https://lhcathome.cern.ch/lhcathome/result.php?resultid=200145344 (proxy not used)
- https://lhcathome.cern.ch/lhcathome/result.php?resultid=200221912 (proxy used)
The last two are the same host. The inconsistency is puzzling. I'll have to wait for some more. If anyone else sees these failures it would be interesting to see their results. Hopefully I haven't misread something somewhere... ID: 35978

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 35980 - Posted: 21 Jul 2018, 8:15:09 UTC - in response to Message 35978. The CVMFS inside the VM configures a proxy via this directive: CVMFS_HTTP_PROXY='http://<proxy_name_or IP>:3128;DIRECT' If the proxy does not respond, DIRECT is used as a fallback. Besides that, the configuration directive CVMFS_PROXY_RESET_AFTER=300 tries to switch back to the first proxy after 300 s. Thus the LHC VMs behave as expected, and it should be investigated why your local proxy occasionally doesn't respond. ID: 35980
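
To illustrate the fallback order described above, here is a simplified model in Python. It is not CVMFS code: `parse_proxy_chain` and `pick_group` are hypothetical names, the `responds` callback stands in for a real connectivity check, and DIRECT is assumed to be always reachable. In a CVMFS_HTTP_PROXY value, ';' separates groups that are tried in order and '|' separates load-balanced proxies within one group. The proxy address is the one from the logs earlier in this thread.

```python
def parse_proxy_chain(spec):
    """Split a CVMFS_HTTP_PROXY value into ordered failover groups."""
    groups = []
    for group in spec.split(";"):
        members = [p.strip() for p in group.split("|") if p.strip()]
        if members:
            groups.append(members)
    return groups


def pick_group(groups, responds):
    """Return the usable members of the first group with a responding entry.

    `responds` is a caller-supplied function (an assumption of this sketch)
    that reports whether a proxy URL answers. DIRECT is treated as always usable.
    """
    for group in groups:
        alive = [p for p in group if p == "DIRECT" or responds(p)]
        if alive:
            return alive
    return []


if __name__ == "__main__":
    chain = parse_proxy_chain("http://192.168.100.137:3128;DIRECT")
    # Local squid answers: the first group is used.
    print(pick_group(chain, lambda url: True))
    # Local squid does not answer: the VM falls back to DIRECT.
    print(pick_group(chain, lambda url: False))
```

The model ignores the timed switch back to the first proxy (CVMFS_PROXY_RESET_AFTER); it only shows why a log can report DIRECT even though a proxy was detected.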

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,237,934 RAC: 106,139	Message 36093 - Posted: 27 Jul 2018, 19:03:33 UTC I have now looked up the application details for one of my PCs on which, among others, I have been crunching Theory tasks for a long time. Interesting results:
- Theory Simulation 263.20 windows_x86_64 (vbox64): average processing rate 24.38 GFLOPS
- Theory Simulation 263.60 windows_x86_64 (vbox64_mt_mcore): average processing rate 30.63 GFLOPS
- Theory Simulation 263.70 windows_x86_64 (vbox64_mt_mcore): average processing rate 21.53 GFLOPS
So version 263.70 is even slower, at least on my PC, than 263.20 was, and also slower than 263.60. How come? ID: 36093
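
For a sense of scale, the three averages can be compared directly. This is plain arithmetic on the figures posted above; `relative_change` is a hypothetical helper name, and the averages themselves depend on whatever mix of jobs each version happened to run.

```python
# Average processing rates in GFLOPS, as posted above.
RATES = {"263.20": 24.38, "263.60": 30.63, "263.70": 21.53}


def relative_change(new, old):
    """Percentage change from `old` to `new` (negative means slower)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for version in ("263.20", "263.60"):
        change = relative_change(RATES["263.70"], RATES[version])
        print(f"263.70 vs {version}: {change:+.1f}%")
```

On these numbers 263.70 comes out roughly 12% below 263.20 and roughly 30% below 263.60.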

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 36094 - Posted: 27 Jul 2018, 19:16:09 UTC - in response to Message 36093. Your VMs run a mix of different scientific apps like pythia, sherpa, agile-runmc ... Some of them, e.g. sherpa, occasionally cause longer idle periods that influence the average efficiency index. ID: 36094

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,237,934 RAC: 106,139	Message 36657 - Posted: 7 Sep 2018, 10:29:24 UTC As I did some time ago when v263.70 came out, I have now made another comparison between 3 of my PCs. The result: on 2 old machines with 32-bit Windows and processors about 10 years old (AMD Turion(tm) Neo X2 Dual Core Processor L625 and AMD Turion Dual-Core ZM-80), using Theory Simulation v263.50 (vbox32) windows_intelx86, I get about 3 times as many credit points as on a newer PC with 64-bit Windows and an Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz, using Theory Simulation v263.70 (vbox64_mt_mcore) windows_x86_64. Can anyone explain this? ID: 36657