Message boards : CMS Application : New Version v60.00

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 348
Credit: 237,918
RAC: 0
Message 45292 - Posted: 7 Sep 2021, 8:44:02 UTC

A new version of the CMS application, v60.00, is now available. This version provides an updated VM image. In addition, the Windows version now uses an updated vboxwrapper taken from the BOINC website rather than our own custom-built version.
ID: 45292
maeax
Joined: 2 May 07
Posts: 1311
Credit: 39,796,455
RAC: 18,095
Message 45293 - Posted: 7 Sep 2021, 9:15:17 UTC
Last modified: 7 Sep 2021, 9:16:16 UTC

Windows 10 Pro, after a project reset: version 50.0 is downloading.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=170128411
The applications page shows version 60.
https://lhcathome.cern.ch/lhcathome/apps.php
ID: 45293
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1046
Credit: 6,603,873
RAC: 275
Message 45294 - Posted: 7 Sep 2021, 9:39:02 UTC - in response to Message 45292.  
Last modified: 7 Sep 2021, 9:45:34 UTC

1915 LHC@home 07 Sep 11:36:39 Started download of vboxwrapper_26203_windows_x86_64.exe
1916 LHC@home 07 Sep 11:36:39 Started download of CMS_2021_07_07.vdi
1917 LHC@home 07 Sep 11:36:41 Finished download of vboxwrapper_26203_windows_x86_64.exe
1918 LHC@home 07 Sep 11:36:41 Giving up on download of CMS_2021_07_07.vdi: permanent HTTP error

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>CMS_2021_07_07.vdi</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>


Yet I'm able to download that 1.6 GB gz file with my browser!?
ID: 45294
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 348
Credit: 237,918
RAC: 0
Message 45295 - Posted: 7 Sep 2021, 9:45:49 UTC - in response to Message 45294.  

The permissions error has been fixed.
ID: 45295
maeax
Joined: 2 May 07
Posts: 1311
Credit: 39,796,455
RAC: 18,095
Message 45296 - Posted: 7 Sep 2021, 11:10:31 UTC - in response to Message 45295.  
Last modified: 7 Sep 2021, 11:28:01 UTC

CMS with a proxy:
The last two lines at the moment are:
CMS application starting. Check log files.
X509_USER_PROXY = /tmp/x509up_u1000
There has been no further activity in Boinc_VM for 8 minutes.

Edit: two lines earlier it reads the volunteer information, but
no information about the volunteer is sent back.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=324877146
I have aborted this task!
ID: 45296
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 802
Credit: 5,696,085
RAC: 16
Message 45297 - Posted: 7 Sep 2021, 11:11:55 UTC

I was able to download the .vdi file and wrapper to my Linux box and now have a task running a job. All indications seem nominal; my task reports "Application version CMS Simulation v60.00 (vbox64) x86_64-pc-linux-gnu".
The production failure rate has dropped significantly since early this morning; I wonder if that's due to the proxy changes yesterday? According to an experimental monitoring dashboard, most of the failures were coming from one machine and seem to be network errors.
ID: 45297
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 802
Credit: 5,696,085
RAC: 16
Message 45298 - Posted: 7 Sep 2021, 11:31:18 UTC - in response to Message 45297.  

> The production failure rate has dropped significantly since early this morning; I wonder if that's due to the proxy changes yesterday? According to an experimental monitoring dashboard, most of the failures were coming from one machine and seem to be network errors.

Ah, he has been having connectivity problems (failure to resolve cern.ch), so he hasn't actually received any condor jobs today.
ID: 45298
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45299 - Posted: 7 Sep 2021, 11:40:46 UTC

The new 60.00 tasks are running on all 24 threads of a Ryzen 3900X (Ubuntu 20.04.3 and VBox 6.1.26), with 64 GB of memory.
They look good to me, but they are only 25 minutes into the run.
ID: 45299
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45303 - Posted: 8 Sep 2021, 1:43:14 UTC - in response to Message 45299.  

> They look good to me, but they are only 25 minutes into the run.

22 of them completed successfully, running about 10 1/2 to 12 1/2 hours, a little faster than before.

Then they started to error; the next 12 show: Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10694610&offset=0&show_names=0&state=6&appid=
They run for about 2 to 3 hours before they error.
ID: 45303
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1824
Credit: 123,758,469
RAC: 86,034
Message 45304 - Posted: 8 Sep 2021, 8:02:05 UTC - in response to Message 45303.  

It looks like the VMs were all running concurrently and were all killed suddenly within the very same second.
In most cases this points to a VirtualBox problem:
2021-09-07 20:49:45 (836238): VM is no longer is a running state. It is in 'poweroff'.



Nonetheless your logs show some entries that are not very common:
2021-09-07 18:07:23 (719461): 
Command: VBoxManage -q list extpacks
Exit Code: 0
Output:
Extension Packs: 1
Pack no. 0:   VNC
Version:      6.1.26
Revision:     145957
Edition:      
Description:  VNC plugin module
VRDE Module:  VBoxVNC
Usable:       true 
Why unusable: 

.
.
.

2021-09-07 18:07:25 (719461): 
Command: VBoxManage -q bandwidthctl "boinc_3fc71e1847ee7287" set "boinc_3fc71e1847ee7287_net" --limit 1953K

Did you use VNC to check a VM when they crashed?
Did you limit the network bandwidth yourself?
Did you notice if result uploads were in progress (the big ones from inside the VMs)?
Did you check the load average values?
Did your BOINC client hit the RAM/disk limit?
ID: 45304
[VENETO] boboviz
Joined: 7 May 08
Posts: 100
Credit: 1,311,832
RAC: 358
Message 45305 - Posted: 8 Sep 2021, 10:22:24 UTC

199 WUs are in the queue, and I still get "Got 0 new tasks".
ID: 45305
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45306 - Posted: 8 Sep 2021, 11:28:43 UTC - in response to Message 45304.  
Last modified: 8 Sep 2021, 12:08:15 UTC

> Did you use VNC to check a VM when they crashed?
> Did you limit the network bandwidth yourself?
> Did you notice if result uploads were in progress (the big ones from inside the VMs)?
> Did you check the load average values?
> Did your BOINC client hit the RAM/disk limit?
No. But when the successful ones were running for about 4 to 5 hours, I checked the memory usage and got:
$ free
              total        used        free      shared  buff/cache   available
Mem:       65792776    49589536      473652       10276    15729588    15467528
Swap:       1951740       51968     1899772


So that seemed to be enough. However, I also run a large (12 GB) write cache using the Linux commands:
sudo sysctl vm.dirty_background_bytes=12000000000
sudo sysctl vm.dirty_bytes=12500000000
sudo sysctl vm.dirty_writeback_centisecs=500
sudo sysctl vm.dirty_expire_centisecs=1440000  # page flush after 4 hours


That might have gotten in the way. I will reduce that to 8 GB.
If that doesn't work, I will limit it to 20 concurrently running CMS tasks using an app_config.

PS - Thanks for looking into it. That level of analysis is beyond me. I am not intending to run this machine 100% on CMS anyway. This was just a stress test.
It looks like it found a limit.
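
As a rough cross-check of the numbers in this post, the following Python sketch converts the sysctl values to more readable units and compares the dirty-page limit with the "available" figure from the free output. It assumes that free reports KiB (its default) and that the centisecond settings are hundredths of a second; the variable and function names are illustrative only.

```python
# Values copied from the post above; names are illustrative, not part of any tool.
KIB = 1024  # free reports kibibytes by default


def centisecs_to_seconds(centisecs: int) -> float:
    """Convert hundredths of a second to seconds."""
    return centisecs / 100


def bytes_to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 1e9


dirty_background_bytes = 12_000_000_000
dirty_bytes = 12_500_000_000
dirty_writeback_centisecs = 500
dirty_expire_centisecs = 1_440_000
mem_available_kib = 15_467_528  # "available" column of the free output

expire_hours = centisecs_to_seconds(dirty_expire_centisecs) / 3600
writeback_seconds = centisecs_to_seconds(dirty_writeback_centisecs)
available_bytes = mem_available_kib * KIB
dirty_share = dirty_bytes / available_bytes

print(f"background writeback starts at {bytes_to_gb(dirty_background_bytes):.1f} GB")
print(f"writing processes are throttled at {bytes_to_gb(dirty_bytes):.1f} GB")
print(f"flusher threads wake every {writeback_seconds:.0f} s")
print(f"dirty pages expire after {expire_hours:.1f} h")
print(f"dirty limit is {dirty_share:.0%} of available memory")
```

Under these assumptions the dirty-page limit is roughly four fifths of the memory that free reported as available at that moment, so reducing it is a plausible first step.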
ID: 45306
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1824
Credit: 123,758,469
RAC: 86,034
Message 45307 - Posted: 8 Sep 2021, 12:26:57 UTC - in response to Message 45306.  

Just in case you plan to repeat the test:
You may also check the load average values.
In some cases vboxwrapper causes a huge "idle" load and you get a "traffic jam" in the CPU queues.
This may cause other processes to run into timeouts.

Critical values I noticed on my systems:
#CPUs * 3.5
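
The rule of thumb above can be turned into a quick check. The following Python sketch compares the 1-minute load average with #CPUs * 3.5; the factor comes from the post, the function names are hypothetical, and os.getloadavg() is only available on Unix-like systems.

```python
import os

CRITICAL_FACTOR = 3.5  # multiplier quoted in the post above


def critical_load(cpu_count: int, factor: float = CRITICAL_FACTOR) -> float:
    """Return the load average regarded as critical for this many CPUs."""
    return cpu_count * factor


def is_overloaded(load_1min: float, cpu_count: int,
                  factor: float = CRITICAL_FACTOR) -> bool:
    """True if the 1-minute load average reaches the critical value."""
    return load_1min >= critical_load(cpu_count, factor)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    try:
        load_1min, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        # No load average on this platform (for example Windows).
        print("load average not available on this system")
    else:
        state = "critical" if is_overloaded(load_1min, cpus) else "ok"
        print(f"load {load_1min:.1f}, threshold {critical_load(cpus):.1f}: {state}")
```

For the 24-thread machine discussed in this thread, the threshold would be 24 * 3.5 = 84.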
ID: 45307
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45308 - Posted: 8 Sep 2021, 14:28:30 UTC - in response to Message 45307.  
Last modified: 8 Sep 2021, 14:30:40 UTC

By the way, BOINC has gone berserk again and downloaded six days of CMS work, even though my buffer is set to the default of 0.1 + 0.5 days, so I have to set No New Work (NNW).
It has done that randomly on various projects ever since they did an update to the BOINC scheduler two or three years ago (maybe four; time flies during a pandemic).

I noticed it first on WCG/MCM. I expect that it is a bug in how the BOINC client and server communicate with each other.
No one has rushed in to take responsibility for it, and a lot of people just deny that it exists. It is sort of like Global Warming.
ID: 45308
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1046
Credit: 6,603,873
RAC: 275
Message 45318 - Posted: 9 Sep 2021, 6:17:52 UTC

Every 12 minutes a problem is reported in StartdLog:
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = ,x509userproxysubject,x509UserProxyFQAN,x509UserProxyVOName,x509UserProxyEmail,x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
09/09/21 07:45:52 (pid:16121) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
ID: 45318
Harri Liljeroos
Joined: 28 Sep 04
Posts: 550
Credit: 30,012,059
RAC: 13,029
Message 45319 - Posted: 9 Sep 2021, 6:53:50 UTC - in response to Message 45308.  

> By the way, BOINC has gone berserk again and downloaded six days of CMS work, even though my buffer is set to the default of 0.1 + 0.5 days, so I have to set No New Work (NNW).
> It has done that randomly on various projects ever since they did an update to the BOINC scheduler two or three years ago (maybe four; time flies during a pandemic).
>
> I noticed it first on WCG/MCM. I expect that it is a bug in how the BOINC client and server communicate with each other.
> No one has rushed in to take responsibility for it, and a lot of people just deny that it exists. It is sort of like Global Warming.

This can happen if you have a <max_concurrent> line in your app_config.xml. The client just keeps asking for more work until a server-side limit comes into play (this is a bug in the BOINC client). For CMS the limit is much higher than for ATLAS and Theory: I think the ATLAS limit is 16 and the Theory limit is 8, but for CMS it is a few hundred. I just enable CMS tasks in the web preferences when I need more tasks and disable them when I have enough to keep crunching for a few days.
ID: 45319
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1046
Credit: 6,603,873
RAC: 275
Message 45320 - Posted: 9 Sep 2021, 7:26:47 UTC - in response to Message 45319.  

> .... This can happen if you have a <max_concurrent> line in your app_config.xml. ....
That is, max_concurrent at the app level, not at the project level.
ID: 45320
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45321 - Posted: 9 Sep 2021, 11:39:22 UTC - in response to Message 45319.  

> This can happen if you have a <max_concurrent> line in your app_config.xml. The client just keeps asking for more work until a server-side limit comes into play (this is a bug in the BOINC client).

Thanks, I do have such a line and will take it out.
I was beginning to suspect that the app_config had something to do with it, since it started after I put it in, and I know that BOINC does not handle it well.

That leaves me with the question of how to limit CMS though. I may have to run two BOINC instances. That is no big deal though.
ID: 45321
Jim1348
Joined: 15 Nov 14
Posts: 574
Credit: 18,017,331
RAC: 22,812
Message 45322 - Posted: 9 Sep 2021, 11:58:53 UTC - in response to Message 45320.  

> That is, max_concurrent at the app level, not at the project level.

Yes, I actually had both.
So maybe <project_max_concurrent>X</project_max_concurrent> would still work?
That is all I really need here.
ID: 45322
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1824
Credit: 123,758,469
RAC: 86,034
Message 45323 - Posted: 9 Sep 2021, 12:08:13 UTC - in response to Message 45322.  

According to the BOINC issue tracker, "project_max_concurrent" is also affected:
"... It happens if max_concurrent is used on app OR project level."
https://github.com/BOINC/boinc/issues/4322

This requires a BOINC client update once they publish a version without that bug.
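
For reference, the two settings discussed in this thread go into an app_config.xml file in the project directory. The Python sketch below generates such a file. The app name "CMS" and the value 20 are assumptions for illustration (20 is the figure mentioned earlier in the thread), the function name is hypothetical, and according to the issue linked above both settings can trigger the over-fetching bug in affected client versions.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_app_config(app_name: str,
                     max_concurrent: Optional[int] = None,
                     project_max_concurrent: Optional[int] = None) -> str:
    """Return app_config.xml content with the requested limits."""
    root = ET.Element("app_config")
    if max_concurrent is not None:
        # App-level limit: applies to one application only.
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if project_max_concurrent is not None:
        # Project-level limit: applies to all tasks of the project together.
        ET.SubElement(root, "project_max_concurrent").text = str(project_max_concurrent)
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "CMS" is an assumed app name; check the project's applications page.
    print(build_app_config("CMS", max_concurrent=20, project_max_concurrent=20))
```

Leaving either argument as None omits that element, which makes it easy to compare the app-level and project-level variants side by side.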
ID: 45323

©2021 CERN