Thread 'New Version v60.00'

Author	Message
Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,202 RAC: 72	Message 45292 - Posted: 7 Sep 2021, 8:44:02 UTC A new version of the CMS application v60.00 is now available. This version provides an updated VM image. In addition the Windows version also has an updated version of the vboxwrapper provided from the BOINC website rather than our own custom built version. ID: 45292 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2290 Credit: 178,865,859 RAC: 2,558	Message 45293 - Posted: 7 Sep 2021, 9:15:17 UTC Last modified: 7 Sep 2021, 9:16:16 UTC Windows10pro-Project reset, Version 50.0 is downloading. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=170128411 Applications show Version.60 https://lhcathome.cern.ch/lhcathome/apps.php ID: 45293 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1543 Credit: 10,050,357 RAC: 1,267	Message 45294 - Posted: 7 Sep 2021, 9:39:02 UTC - in response to Message 45292. Last modified: 7 Sep 2021, 9:45:34 UTC
1915 LHC@home 07 Sep 11:36:39 Started download of vboxwrapper_26203_windows_x86_64.exe
1916 LHC@home 07 Sep 11:36:39 Started download of CMS_2021_07_07.vdi
1917 LHC@home 07 Sep 11:36:41 Finished download of vboxwrapper_26203_windows_x86_64.exe
1918 LHC@home 07 Sep 11:36:41 Giving up on download of CMS_2021_07_07.vdi: permanent HTTP error
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>CMS_2021_07_07.vdi</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
I'm able to download that 1.6 GB gz-file with my browser !? ID: 45294 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,202 RAC: 72	Message 45295 - Posted: 7 Sep 2021, 9:45:49 UTC - in response to Message 45294. The permissions error has been fixed. ID: 45295 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2290 Credit: 178,865,859 RAC: 2,558	Message 45296 - Posted: 7 Sep 2021, 11:10:31 UTC - in response to Message 45295. Last modified: 7 Sep 2021, 11:28:01 UTC CMS with proxy: the last two lines at the moment are: CMS application starting. Check log files. X509_USER_PROXY = /tmp/x509up_u1000 There has been no more activity in the BOINC VM for 8 minutes. Edit: two lines before that it is reading Volunteer information, but no info about the volunteer is sent back. https://lhcathome.cern.ch/lhcathome/result.php?resultid=324877146 I have aborted this task! ID: 45296 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 363	Message 45297 - Posted: 7 Sep 2021, 11:11:55 UTC I was able to download the .vdi file and wrapper to my Linux box and now have a task running a job. All indications seem nominal; my task reports "Application version CMS Simulation v60.00 (vbox64) x86_64-pc-linux-gnu". The production failure rate has dropped significantly since early this morning, I wonder if that's due to the proxy changes yesterday? From an experimental monitoring dashboard, most of the failures were coming from one machine, and seem to be network errors. ID: 45297 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 363	Message 45298 - Posted: 7 Sep 2021, 11:31:18 UTC - in response to Message 45297. The production failure rate has dropped significantly since early this morning, I wonder if that's due to the proxy changes yesterday? From an experimental monitoring dashboard, most of the failures were coming from one machine, and seem to be network errors. Ah, he has been having connectivity problems, failure to resolve cern.ch, so he hasn't actually received any condor jobs today. ID: 45298 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45299 - Posted: 7 Sep 2021, 11:40:46 UTC The new 60.00 are running on all 24 cores of a Ryzen 3900X (Ubuntu 20.04.3 and VBox 6.1.26), with 64 GB of memory. They look good to me, but they are only 25 minutes into the run. ID: 45299 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45303 - Posted: 8 Sep 2021, 1:43:14 UTC - in response to Message 45299. They look good to me, but they are only 25 minutes into the run. 22 of them completed successfully, running about 10 1/2 to 12 1/2 hours, a little faster than before. Then they started to error; the next 12 show: Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT https://lhcathome.cern.ch/lhcathome/results.php?hostid=10694610&offset=0&show_names=0&state=6&appid= They run for about 2 to 3 hours before they error. ID: 45303 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,316,654 RAC: 47,434	Message 45304 - Posted: 8 Sep 2021, 8:02:05 UTC - in response to Message 45303. Looks like the VMs were all running concurrently and they were killed all of a sudden within the very same second. In most cases this points to a VirtualBox problem:
[pre]2021-09-07 20:49:45 (836238): VM is no longer is a running state. It is in 'poweroff'.[/pre]
Nonetheless your logs show some entries that are not very common:
[pre]2021-09-07 18:07:23 (719461): Command: VBoxManage -q list extpacks
Exit Code: 0
Output:
Extension Packs: 1
Pack no. 0: VNC
Version: 6.1.26
Revision: 145957
Edition:
Description: VNC plugin module
VRDE Module: VBoxVNC
Usable: true
Why unusable:
. . .
2021-09-07 18:07:25 (719461): Command: VBoxManage -q bandwidthctl "boinc_3fc71e1847ee7287" set "boinc_3fc71e1847ee7287_net" --limit 1953K[/pre]
Did you use VNC to check a VM when they crashed?
Did you limit the network bandwidth yourself?
Did you notice if result uploads were in progress (the big ones from inside the VMs)?
Did you check the load average values?
Did your BOINC client hit the RAM/disk limit? ID: 45304 · Reply Quote

[VENETO] boboviz Send message Joined: 7 May 08 Posts: 273 Credit: 2,131,245 RAC: 139	Message 45305 - Posted: 8 Sep 2021, 10:22:24 UTC 199 WUs in the queue and I continue to get "Got 0 new tasks" ID: 45305 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45306 - Posted: 8 Sep 2021, 11:28:43 UTC - in response to Message 45304. Last modified: 8 Sep 2021, 12:08:15 UTC Did you use VNC to check a VM when they crashed? Did you limit the network bandwidth yourself? Did you notice if result uploads were in progress (the big ones from inside the VMs)? Did you check the load average values? Did your BOINC client hit the RAM/disk limit?
No. But when the successful ones were running for about 4 to 5 hours, I checked the memory usage and got:
$ free
              total        used        free      shared  buff/cache   available
Mem:       65792776    49589536      473652       10276    15729588    15467528
Swap:       1951740       51968     1899772
So that seemed to be enough. However, I also run a large (12 GB) write cache using the Linux commands:
sudo sysctl vm.dirty_background_bytes=12000000000
sudo sysctl vm.dirty_bytes=12500000000
sudo sysctl vm.dirty_writeback_centisecs=500
sudo sysctl vm.dirty_expire_centisecs=1440000 (page flush 4 hours)
That might have gotten in the way. I will reduce that to 8 GB. If that doesn't work, I will limit them to 20 running CMS using an app_config. PS - Thanks for looking into it. That level of analysis is beyond me. I am not intending to run this machine 100% on CMS anyway. This was just a stress test. It looks like it found a limit. ID: 45306 · Reply Quote
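
The memory figures in the post above can be set against the write-cache setting with a little arithmetic. What follows is only a rough sketch: it assumes that `free` reports KiB (its default unit) and that comparing `vm.dirty_bytes` with the "available" column is a useful first approximation, which nothing in this thread confirms. The function name `dirty_cache_share` is made up for illustration.

```python
KIB = 1024  # `free` prints kibibytes unless told otherwise


def dirty_cache_share(available_kib: int, dirty_bytes: int) -> float:
    """Return dirty_bytes as a fraction of the memory `free` lists as available."""
    return dirty_bytes / (available_kib * KIB)


# Values copied from the post above.
AVAILABLE_KIB = 15467528        # "available" column of the Mem: row
DIRTY_BYTES = 12_500_000_000    # vm.dirty_bytes

share = dirty_cache_share(AVAILABLE_KIB, DIRTY_BYTES)
print(f"dirty limit is {share:.1%} of available memory")
```

With the posted numbers the dirty-page limit comes to roughly 79% of what `free` listed as available, which fits the suspicion that the large write cache might have gotten in the way.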

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,316,654 RAC: 47,434	Message 45307 - Posted: 8 Sep 2021, 12:26:57 UTC - in response to Message 45306. Just in case you plan to repeat the test: You may also check the load average values. In some cases vboxwrapper causes a huge "idle" load and you get a "traffic jam" in the CPU queues. This may cause other processes to run into timeouts. Critical values I noticed on my systems: #CPUs * 3.5 ID: 45307 · Reply Quote
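
The load average rule of thumb above (#CPUs * 3.5) is easy to check from a script. This is a minimal sketch, not a project tool: the function names are hypothetical, the factor 3.5 is simply the value reported in the post, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os

# Factor taken from the post above; treat it as one user's observation.
CRITICAL_FACTOR = 3.5


def load_is_critical(load_1min: float, cpu_count: int,
                     factor: float = CRITICAL_FACTOR) -> bool:
    """Return True when the 1-minute load average reaches cpu_count * factor."""
    return load_1min >= cpu_count * factor


def check_host() -> str:
    """Compare this host's 1-minute load average with the threshold."""
    try:
        load_1min, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return "load average not available on this platform"
    cpus = os.cpu_count() or 1
    state = "critical" if load_is_critical(load_1min, cpus) else "ok"
    return f"load {load_1min:.2f} on {cpus} CPUs (limit {cpus * CRITICAL_FACTOR:.1f}): {state}"


if __name__ == "__main__":
    print(check_host())
```

On the 24-thread host discussed in this thread, the threshold would be 24 * 3.5 = 84.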

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45308 - Posted: 8 Sep 2021, 14:28:30 UTC - in response to Message 45307. Last modified: 8 Sep 2021, 14:30:40 UTC By the way, BOINC has gone berserk again and downloaded six days of CMS, when I have my buffer set to the default of 0.1 + 0.5 days, so I have to set NNW. It has done that randomly on various projects ever since they did an update to the BOINC scheduler two or three years ago (maybe four; time flies during a pandemic). I noticed it first on WCG/MCM. I expect that it is a bug in how the BOINC client and server communicate with each other. No one has rushed in to take responsibility for it, and a lot of people just deny that it exists. It is sort of like Global Warming. ID: 45308 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1543 Credit: 10,050,357 RAC: 1,267	Message 45318 - Posted: 9 Sep 2021, 6:17:52 UTC Every 12th minute a problem is reported in StartdLog:
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
ID: 45318 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 799 Credit: 65,043,267 RAC: 27,885	Message 45319 - Posted: 9 Sep 2021, 6:53:50 UTC - in response to Message 45308. By the way, BOINC has gone berserk again and downloaded six days of CMS, when I have my buffer set to the default of 0.1 + 0.5 days, so I have to set NNW. It has done that randomly on various projects ever since they did an update to the BOINC scheduler two or three years ago (maybe four; time flies during a pandemic). I noticed it first on WCG/MCM. I expect that it is a bug in how the BOINC client and server communicate with each other. No one has rushed in to take responsibility for it, and a lot of people just deny that it exists. It is sort of like Global Warming. This can happen if you have a <max_concurrent> line in your app_config.xml. The client just keeps asking for more work until a server-side limit comes into play (this is a bug in the BOINC client). For CMS the limit is much higher than for Atlas and Theory. I think the Atlas limit is 16 and the Theory limit is 8, but for CMS it is a few hundred. I just enable CMS tasks in the web preferences when I need more tasks and disable them when I have enough to keep crunching for a few days. ID: 45319 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1543 Credit: 10,050,357 RAC: 1,267	Message 45320 - Posted: 9 Sep 2021, 7:26:47 UTC - in response to Message 45319. .... This can happen if you have a <max_concurrent> line in your app_config.xml. .... max_concurrent on app-level not on project-level. ID: 45320 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45321 - Posted: 9 Sep 2021, 11:39:22 UTC - in response to Message 45319. This can happen if you have a <max_concurrent> line in your app_config.xml. The client just keeps asking for more work until a server-side limit comes into play (this is a bug in the BOINC client). Thanks, I do have such a line and will take it out. I was beginning to suspect that the app_config had something to do with it, since it started after I put it in, and I know that BOINC does not handle it well. That leaves me with the question of how to limit CMS though. I may have to run two BOINC instances. That is no big deal though. ID: 45321 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45322 - Posted: 9 Sep 2021, 11:58:53 UTC - in response to Message 45320. max_concurrent on app-level not on project-level. Yes, I actually had both. So maybe <project_max_concurrent>X</project_max_concurrent> would still work? That is all I really need here. ID: 45322 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,316,654 RAC: 47,434	Message 45323 - Posted: 9 Sep 2021, 12:08:13 UTC - in response to Message 45322. According to the BOINC issue tracker "project_max_concurrent" is also affected. "... It happens if max_concurrent is used on app OR project level." https://github.com/BOINC/boinc/issues/4322 This requires a BOINC client update once they publish a version without that bug. ID: 45323 · Reply Quote
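
To see whether a given app_config.xml contains either of the settings named in that issue, the file can be scanned for the two element names. This is a small sketch, not an official BOINC tool: the function name and the sample file content are made up for illustration (including the app name "CMS" and the numbers 20 and 24), and it only reports the entries, it does not change anything.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Element names mentioned in this thread as triggering the over-fetch bug.
AFFECTED_TAGS = ("max_concurrent", "project_max_concurrent")


def find_concurrency_limits(xml_text: str) -> List[Tuple[str, str]]:
    """Return (tag, value) pairs for every affected element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, (el.text or "").strip())
            for el in root.iter() if el.tag in AFFECTED_TAGS]


# Hypothetical sample content; a real file lives in the project directory.
SAMPLE = """<app_config>
  <app>
    <name>CMS</name>
    <max_concurrent>20</max_concurrent>
  </app>
  <project_max_concurrent>24</project_max_concurrent>
</app_config>"""

if __name__ == "__main__":
    for tag, value in find_concurrency_limits(SAMPLE):
        print(f"found <{tag}> = {value}")
```

An empty result means neither limit is set, so according to the issue quoted above the client should not be hitting this particular bug.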