New Version v60.00
Author | Message |
---|---|
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
A new version of the CMS application, v60.00, is now available. This version provides an updated VM image. In addition, the Windows version now ships an updated vboxwrapper taken from the BOINC website rather than our own custom-built version. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 12,751 |
Windows 10 Pro, after a project reset: version 50.0 is downloading. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=170128411 The applications page shows version 60: https://lhcathome.cern.ch/lhcathome/apps.php |
Joined: 14 Jan 10 Posts: 1440 Credit: 9,657,806 RAC: 1,029 |
1915 LHC@home 07 Sep 11:36:39 Started download of vboxwrapper_26203_windows_x86_64.exe 1916 LHC@home 07 Sep 11:36:39 Started download of CMS_2021_07_07.vdi 1917 LHC@home 07 Sep 11:36:41 Finished download of vboxwrapper_26203_windows_x86_64.exe 1918 LHC@home 07 Sep 11:36:41 Giving up on download of CMS_2021_07_07.vdi: permanent HTTP error <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>CMS_2021_07_07.vdi</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> </message> I'm able to download that 1.6 GB gz-file with my browser !? |
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
The permissions error has been fixed. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 12,751 |
CMS with proxy: the last two lines at the moment are "CMS application starting. Check log files." and "X509_USER_PROXY = /tmp/x509up_u1000". There has been no further activity in the BOINC VM for 8 minutes. Edit: two lines earlier it reports reading volunteer information, but no information about the volunteer is sent back. https://lhcathome.cern.ch/lhcathome/result.php?resultid=324877146 I have aborted this task! |
Joined: 29 Aug 05 Posts: 1072 Credit: 8,414,368 RAC: 6,198 |
I was able to download the .vdi file and wrapper to my Linux box and now have a task running a job. All indications seem nominal; my task reports "Application version CMS Simulation v60.00 (vbox64) x86_64-pc-linux-gnu". The production failure rate has dropped significantly since early this morning. I wonder if that's due to the proxy changes yesterday? According to an experimental monitoring dashboard, most of the failures were coming from one machine and seem to be network errors. |
Joined: 29 Aug 05 Posts: 1072 Credit: 8,414,368 RAC: 6,198 |
Ah, he has been having connectivity problems (failure to resolve cern.ch), so he hasn't actually received any condor jobs today. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
The new v60.00 tasks are running on all 24 cores of a Ryzen 3900X (Ubuntu 20.04.3 and VBox 6.1.26) with 64 GB of memory. They look good to me, but they are only 25 minutes into the run. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"They look good to me, but they are only 25 minutes into the run." 22 of them completed successfully, running about 10 1/2 to 12 1/2 hours, a little faster than before. Then they started to fail; the next 12 show: Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT https://lhcathome.cern.ch/lhcathome/results.php?hostid=10694610&offset=0&show_names=0&state=6&appid= They run for about 2 to 3 hours before they fail. |
Joined: 15 Jun 08 Posts: 2606 Credit: 262,261,312 RAC: 134,499 |
It looks like the VMs were all running concurrently and were all killed within the very same second. In most cases this points to a VirtualBox problem: 2021-09-07 20:49:45 (836238): VM is no longer is a running state. It is in 'poweroff'. Nonetheless, your logs show some entries that are not very common: 2021-09-07 18:07:23 (719461): Command: VBoxManage -q list extpacks Exit Code: 0 Output: Extension Packs: 1 Pack no. 0: VNC Version: 6.1.26 Revision: 145957 Edition: Description: VNC plugin module VRDE Module: VBoxVNC Usable: true Why unusable: . . . 2021-09-07 18:07:25 (719461): Command: VBoxManage -q bandwidthctl "boinc_3fc71e1847ee7287" set "boinc_3fc71e1847ee7287_net" --limit 1953K Did you use VNC to check a VM when they crashed? Did you limit the network bandwidth yourself? Did you notice if result uploads were in progress (the big ones from inside the VMs)? Did you check the load average values? Did your BOINC client hit the RAM/disk limit? |
Joined: 7 May 08 Posts: 233 Credit: 1,575,053 RAC: 0 |
There are 199 WUs in the queue, but I keep getting "Got 0 new tasks". |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"Did you use VNC to check a VM when they crashed?" No. But when the successful ones had been running for about 4 to 5 hours, I checked the memory usage and got: $ free So that seemed to be enough. However, I also run a large (12 GB) write cache using the Linux commands: sudo sysctl vm.dirty_background_bytes=12000000000 sudo sysctl vm.dirty_bytes=12500000000 sudo sysctl vm.dirty_writeback_centisecs=500 sudo sysctl vm.dirty_expire_centisecs=1440000 (page flush 4 hours) That might have gotten in the way. I will reduce that to 8 GB. If that doesn't work, I will limit them to 20 running CMS tasks using an app_config. PS - Thanks for looking into it. That level of analysis is beyond me. I am not intending to run this machine 100% on CMS anyway. This was just a stress test. It looks like it found a limit. |
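For anyone who wants to sanity-check the numbers in the sysctl settings above, here is a minimal Python sketch. It converts the centisecond value to hours and prints the commands for a smaller cache. The helper names are made up for illustration, and the 0.5 GB gap between vm.dirty_background_bytes and vm.dirty_bytes is only an assumption copied from the 12 GB / 12.5 GB pair in the post, not a tuning recommendation.

```python
GB = 1_000_000_000  # decimal gigabyte, matching the byte values in the post


def centisecs_to_hours(centisecs: int) -> float:
    """Convert a vm.dirty_*_centisecs value to hours."""
    return centisecs / 100 / 3600


def write_cache_commands(background_gb: int, gap_bytes: int = 500_000_000) -> list:
    """Build sysctl commands for a write cache of about background_gb GB.

    gap_bytes is an assumption: it keeps vm.dirty_bytes the same distance
    above vm.dirty_background_bytes as in the original 12 GB setup.
    """
    background = background_gb * GB
    return [
        f"sudo sysctl vm.dirty_background_bytes={background}",
        f"sudo sysctl vm.dirty_bytes={background + gap_bytes}",
    ]


if __name__ == "__main__":
    # vm.dirty_expire_centisecs=1440000 from the post, expressed in hours
    print(centisecs_to_hours(1_440_000))
    # Commands for the planned reduction to 8 GB
    for cmd in write_cache_commands(8):
        print(cmd)
```

Calling write_cache_commands(12) reproduces the two byte settings quoted in the post, which is a quick way to confirm the helper matches the original setup before trying a smaller value.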
Joined: 15 Jun 08 Posts: 2606 Credit: 262,261,312 RAC: 134,499 |
Just in case you plan to repeat the test: You may also check the load average values. In some cases vboxwrapper causes a huge "idle" load and you get a "traffic jam" in the CPU queues. This may cause other processes to run into timeouts. Critical values I noticed on my systems: #CPUs * 3.5 |
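The rule of thumb above can be turned into a quick check. This is a minimal Python sketch, assuming a Unix-like host where os.getloadavg() is available; the function names are made up for illustration, and the 3.5 factor is simply the value quoted in the post, not a general limit.

```python
import os


def critical_load(n_cpus: int, factor: float = 3.5) -> float:
    """Load-average threshold from the rule of thumb '#CPUs * 3.5'."""
    return n_cpus * factor


def is_overloaded(load_avg: float, n_cpus: int, factor: float = 3.5) -> bool:
    """True if the given load average reaches the critical value."""
    return load_avg >= critical_load(n_cpus, factor)


def check_now() -> str:
    """Compare the current 1-minute load average with the threshold (Unix only)."""
    n_cpus = os.cpu_count() or 1
    load_1min, _, _ = os.getloadavg()
    state = "critical" if is_overloaded(load_1min, n_cpus) else "ok"
    return f"load {load_1min:.1f} on {n_cpus} CPUs, threshold {critical_load(n_cpus):.1f}: {state}"


if __name__ == "__main__":
    print(check_now())
```

On the 24-thread machine discussed in this thread, the threshold would work out to 84, so a load average in that range during a run would match the "traffic jam" described above.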
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
By the way, BOINC has gone berserk again and downloaded six days of CMS, when I have my buffer set to the default of 0.1 + 0.5 days, so I have to set NNW. It has done that randomly on various projects ever since they did an update to the BOINC scheduler two or three years ago (maybe four; time flies during a pandemic). I noticed it first on WCG/MCM. I expect that it is a bug in how the BOINC client and server communicate with each other. No one has rushed in to take responsibility for it, and a lot of people just deny that it exists. It is sort of like Global Warming. |
Joined: 14 Jan 10 Posts: 1440 Credit: 9,657,806 RAC: 1,029 |
Every 12th minute a problem is reported in StartdLog: 09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad. 09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad. 09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad. 09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad. 09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad. 09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad. |
Joined: 28 Sep 04 Posts: 746 Credit: 51,977,635 RAC: 30,379 |
"By the way, BOINC has gone berserk again and downloaded six days of CMS, when I have my buffer set to the default of 0.1 + 0.5 days, so I have to set NNW." This can happen if you have a <max_concurrent> line in your app_config.xml. The client just keeps asking for more work until a server-side limit comes into play (this is a bug in the BOINC client). For CMS the limit is much higher than for Atlas and Theory: I think the Atlas limit is 16 and the Theory limit is 8, but for CMS it is a few hundred. I just enable CMS tasks in the web preferences when I need more tasks and disable them when I have enough to keep crunching for a few days. |
Joined: 14 Jan 10 Posts: 1440 Credit: 9,657,806 RAC: 1,029 |
".... This can happen if you have <max_concurrent> line in your app_config.xml. ...." That applies to max_concurrent at app level, not at project level. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"This can happen if you have <max_concurrent> line in your app_config.xml. The client just keeps asking more work until a server side limit comes to play (this is a bug in Boinc client)." Thanks, I do have such a line and will take it out. I was beginning to suspect that the app_config had something to do with it, since it started after I put it in, and I know that BOINC does not handle it well. That leaves me with the question of how to limit CMS, though. I may have to run two BOINC instances. That is no big deal though. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"max_concurrent on app-level not on project-level." Yes, I actually had both. So maybe <project_max_concurrent>X</project_max_concurrent> would still work? That is all I really need here. |
Joined: 15 Jun 08 Posts: 2606 Credit: 262,261,312 RAC: 134,499 |
According to the BOINC issue tracker, "project_max_concurrent" is also affected: "... It happens if max_concurrent is used on app OR project level." https://github.com/BOINC/boinc/issues/4322 Fixing this requires a BOINC client update, once they publish a version without that bug. |
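To see whether an app_config.xml contains either of the two settings discussed here, something like the following Python sketch could be used. It only looks for the <max_concurrent> and <project_max_concurrent> elements; the example file content, the app name "CMS" and the limits 20 and 24 are hypothetical, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Element names reported as affected in the thread, at app or project level
AFFECTED_TAGS = ("max_concurrent", "project_max_concurrent")


def find_concurrency_limits(app_config_xml: str) -> List[Tuple[str, str]]:
    """Return (tag, value) pairs for every concurrency limit in the config."""
    root = ET.fromstring(app_config_xml)
    found = []
    for elem in root.iter():  # walks all elements in document order
        if elem.tag in AFFECTED_TAGS:
            found.append((elem.tag, (elem.text or "").strip()))
    return found


# Hypothetical app_config.xml; the app name and values are assumptions
EXAMPLE = """<app_config>
  <app>
    <name>CMS</name>
    <max_concurrent>20</max_concurrent>
  </app>
  <project_max_concurrent>24</project_max_concurrent>
</app_config>"""

if __name__ == "__main__":
    for tag, value in find_concurrency_limits(EXAMPLE):
        print(f"{tag} = {value}")
```

An empty result would suggest the file does not use either setting, in which case the over-fetching described above would need another explanation.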