Windows vbox64 CMS Simulation tasks failing - VM unable to validate X509 credential from LHC@home
Author | Message |
---|---|
Send message Joined: 8 Apr 21 Posts: 23 Credit: 45,869,850 RAC: 4,682 |
I have a Win10 machine with BOINC client 7.16.11 and VirtualBox 6.1.22 installed. All the CMS Simulation tasks on my host are failing when the VM attempts to validate the x509 certificate with LHC@home. I installed the CERN Root and Grid CA certificates (https://cafiles.cern.ch/cafiles/) on my local host to see whether that corrected the validation issue. It did not.

Examples of failed jobs:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316190883
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316187982
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316180651

I've verified that the local Windows firewall and my pfSense firewall, including Snort, are passing traffic as they should. I ran a packet capture while the VM was attempting to reach out for the validation and can see that the VM is communicating with the LHC servers (vccs.cern.ch @ 137.138.120.99). The VM does not recognize the CERN server-side CA. The stream exits with a TLSv1.2 fatal alert: Unknown CA. The relevant packet is #10.

No. Time Source Destination Protocol Length Info

1 0.000000 192.168.150.30 137.138.120.99 TCP 66 55514 → 443 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
Frame 1: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 0, Len: 0

2 0.108285 137.138.120.99 192.168.150.30 TCP 66 443 → 55514 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 2: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 0, Ack: 1, Len: 0

3 0.108513 192.168.150.30 137.138.120.99 TCP 60 55514 → 443 [ACK] Seq=1 Ack=1 Win=262656 Len=0
Frame 3: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 1, Ack: 1, Len: 0

4 0.186955 192.168.150.30 137.138.120.99 TLSv1.2 224 Client Hello
Frame 4: 224 bytes on wire (1792 bits), 224 bytes captured (1792 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 1, Ack: 1, Len: 170
Secure Sockets Layer

5 0.297779 137.138.120.99 192.168.150.30 TCP 54 443 → 55514 [ACK] Seq=1 Ack=171 Win=30336 Len=0
Frame 5: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 1, Ack: 171, Len: 0

6 0.306888 137.138.120.99 192.168.150.30 TLSv1.2 1514 Server Hello
Frame 6: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 1, Ack: 171, Len: 1460
Secure Sockets Layer

7 0.306897 137.138.120.99 192.168.150.30 TLSv1.2 1514 Certificate [TCP segment of a reassembled PDU]
Frame 7: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 1461, Ack: 171, Len: 1460
[2 Reassembled TCP Segments (2315 bytes): #6(1366), #7(949)]
Secure Sockets Layer

8 0.306905 137.138.120.99 192.168.150.30 TLSv1.2 146 Server Key Exchange, Server Hello Done
Frame 8: 146 bytes on wire (1168 bits), 146 bytes captured (1168 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 2921, Ack: 171, Len: 92
[2 Reassembled TCP Segments (594 bytes): #7(511), #8(83)]
Secure Sockets Layer
Secure Sockets Layer

9 0.307078 192.168.150.30 137.138.120.99 TCP 60 55514 → 443 [ACK] Seq=171 Ack=3013 Win=262656 Len=0
Frame 9: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 171, Ack: 3013, Len: 0

10 0.308588 192.168.150.30 137.138.120.99 TLSv1.2 61 Alert (Level: Fatal, Description: Unknown CA)
Frame 10: 61 bytes on wire (488 bits), 61 bytes captured (488 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 171, Ack: 3013, Len: 7
Secure Sockets Layer

11 0.308688 192.168.150.30 137.138.120.99 TCP 60 55514 → 443 [FIN, ACK] Seq=178 Ack=3013 Win=262656 Len=0
Frame 11: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 178, Ack: 3013, Len: 0

12 0.418915 137.138.120.99 192.168.150.30 TCP 54 443 → 55514 [FIN, ACK] Seq=3013 Ack=179 Win=30336 Len=0
Frame 12: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10), Dst: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09)
Internet Protocol Version 4, Src: 137.138.120.99, Dst: 192.168.150.30
Transmission Control Protocol, Src Port: 443, Dst Port: 55514, Seq: 3013, Ack: 179, Len: 0

13 0.419178 192.168.150.30 137.138.120.99 TCP 60 55514 → 443 [ACK] Seq=179 Ack=3014 Win=262656 Len=0
Frame 13: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: AsustekC_ee:47:09 (3c:7c:3f:ee:47:09), Dst: IntelCor_6b:d4:10 (00:1b:21:6b:d4:10)
Internet Protocol Version 4, Src: 192.168.150.30, Dst: 137.138.120.99
Transmission Control Protocol, Src Port: 55514, Dst Port: 443, Seq: 179, Ack: 3014, Len: 0

I believe this is an issue with the VM itself not having the correct host certificate. Can an admin check into this? R/S Scott |
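For readers following the capture: packet #10 carries 7 bytes of TCP payload, which is consistent with an unencrypted TLS alert record (a 5-byte record header plus a 2-byte alert body). The sketch below, in Python, decodes such a record using the level and description codes from the TLS 1.2 specification. The function name and the example bytes are assumptions made for illustration; the capture above only shows the dissected summary, not the raw bytes.

```python
# Alert codes as defined in the TLS 1.2 specification (RFC 5246).
ALERT_LEVELS = {1: "warning", 2: "fatal"}
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    40: "handshake_failure",
    42: "bad_certificate",
    45: "certificate_expired",
    46: "certificate_unknown",
    48: "unknown_ca",
}


def decode_tls_alert(record):
    """Decode a plaintext TLS alert record into (level, description)."""
    if len(record) != 7:
        raise ValueError("expected 5 header bytes plus 2 alert bytes")
    content_type = record[0]                      # 21 means "alert"
    length = int.from_bytes(record[3:5], "big")   # payload length
    if content_type != 21 or length != 2:
        raise ValueError("not a plaintext TLS alert record")
    level, description = record[5], record[6]
    return (
        ALERT_LEVELS.get(level, "unknown(%d)" % level),
        ALERT_DESCRIPTIONS.get(description, "unknown(%d)" % description),
    )


# Hypothetical bytes for a TLS 1.2 record carrying a fatal unknown_ca alert.
example = bytes([21, 0x03, 0x03, 0x00, 0x02, 2, 48])
print(decode_tls_alert(example))
```

Because the alert travels from 192.168.150.30 to the server, it is the client side that rejects the certificate chain presented by the server, which matches the reading given in the post.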
Send message Joined: 15 Jun 08 Posts: 2549 Credit: 255,476,539 RAC: 68,041 |
I installed the CERN Root ... on my local host, seeing if that corrected the issue of validation. It did not.

Sure. This will not work, as the certs need to be installed inside the VM. You may test Theory vbox on that computer to see whether it behaves differently.

<edit>
This is a snippet from one of your CMS logs:

2021-05-07 19:49:02 (5344): Guest Log: [DEBUG] Probing CVMFS ...
2021-05-07 19:49:02 (5344): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2021-05-07 19:49:07 (5344): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2021-05-07 19:49:07 (5344): Guest Log: 2.4.4.0 3755 5 25524 11625 2 1 1234377 4096000 2 65024 0 3 100 0 0 http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/grid.cern.ch http://131.225.188.246:3126 0
2021-05-07 19:53:47 (5344): Guest Log: [INFO] Reading volunteer information
2021-05-07 19:53:47 (5344): Guest Log: [INFO] Volunteer: scotth (787857)

It looks as if either you use a local proxy that is not correctly configured (=> CVMFS configures a fallback proxy), or the CVMFS inside the VM can only partly access CERN's CVMFS. Since some CA certs are taken from there, your cert issues are follow-up issues. The latter mostly points to an incomplete local firewall setup. I guess it's on the affected computer, since native CVMFS on your other hosts is working fine.

Looks like you are familiar with network diagnostic tools. If so, you may check for filtered TCP packets to ports 80, 8000, 8080, 443, 4080, 9618.
</edit> |
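The CVMFS status line in the log snippet is easier to read once each value is paired with its column name. A minimal Python sketch, assuming the header and value rows are whitespace-separated exactly as they appear in the log (the function name is made up for this example):

```python
# Header and value rows copied from the Guest Log lines quoted above.
HEADER = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
    "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
    "RX(K) SPEED(K/S) HOST PROXY ONLINE"
)
VALUES = (
    "2.4.4.0 3755 5 25524 11625 2 1 1234377 4096000 2 65024 0 3 100 0 0 "
    "http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/grid.cern.ch "
    "http://131.225.188.246:3126 0"
)


def parse_cvmfs_stat(header, values):
    """Pair each column name with its value; both rows are space-separated."""
    names = header.split()
    fields = values.split()
    if len(names) != len(fields):
        raise ValueError("header and value rows have different lengths")
    return dict(zip(names, fields))


stat = parse_cvmfs_stat(HEADER, VALUES)
for key in ("HOST", "PROXY", "ONLINE"):
    print(key, "=", stat[key])
```

Read this way, the PROXY column shows a proxy on port 3126 and the ONLINE column reads 0, which fits the diagnosis of a fallback proxy combined with only partial CVMFS access.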
Send message Joined: 8 Apr 21 Posts: 23 Credit: 45,869,850 RAC: 4,682 |
I didn't think it really would... but gave it a shot anyway just to be sure for myself.
I don't have a local Squid proxy configured on, or for, my hosts. All my other hosts (except one, which will be getting an OS rebuild soon) running native work units are reaching out for their images. My ISP connection handles the traffic easily. I'm just working on getting all my hosts running correctly, then will be configuring Squid proxy on my firewall and then making config changes on each host. Then working out any issues on that...
Below is one of the many links I found when I was initially setting up LHC@Home and getting native work units to run correctly. I've configured a port alias in pfSense to handle it all, with the exception of my existing rules for ports 80 and 443. The FW rule allowing all the traffic is configured for TCP only, rather than TCP/UDP.
https://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use

Here is my port list:
3125 Common - CVMFS
8000 ATLAS - HTTP
8080 ATLAS - HTTP
23128 ATLAS - HTTP
3127:3128 ATLAS - HTTP Proxy
5222 ATLAS - XMPP
9094 ATLAS - TCP
9618 Theory, CMS, LHCb - Condor
4080 CMS - WMAgent
8080 CMS - Frontier
8443 LHCb - DIRAC
9133:9149 LHCb - DIRAC
9166 LHCb - DIRAC
9196:9199 LHCb - DIRAC

I've also been chewing through my Snort logs the past several weeks, identifying and suppressing signature alerts for LHC@Home traffic. I've got a nice list of IP addresses LHC@home communicates with. A few of what I believe are the more critical CVMFS IP addresses I've added to an "External Server" alias list and configured that on the Snort Pass list to prevent any alerting on those.

Here are the CVMFS entries I have in the alias:
104.21.88.130 LHC@Home - s1f'nal/bnl/unl/cern/ral'-cvmfs.openhtc.io
172.67.179.99 LHC@Home - s1f'nal/bnl/unl/cern/ral'-cvmfs.openhtc.io
158.39.48.38 LHC@Home - atlas-db-squid1.grid.uiocloud.no

I'm still stuck on the response I saw in the packet capture from the LHC@Home CMS Simulation VM. It actively rejected the server-side Certificate Authority as invalid. I still believe this is an LHC server-side issue unless someone can validate that I'm the only one with this issue. R/S Scott |
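A port alias with ranges is easy to misread, so it can help to expand it and test individual ports against it. The Python sketch below assumes the `start:end` range notation used in the list above; the function name is hypothetical, and ports 80 and 443 are left out because the post says they are covered by separate rules.

```python
def expand_ports(entries):
    """Expand entries such as "8080" or "3127:3128" into a set of ints."""
    ports = set()
    for entry in entries:
        if ":" in entry:
            start, end = (int(part) for part in entry.split(":"))
            ports.update(range(start, end + 1))  # ranges are inclusive
        else:
            ports.add(int(entry))
    return ports


# Port entries copied from the alias list in the post above.
ALIAS = [
    "3125", "8000", "8080", "23128", "3127:3128", "5222", "9094",
    "9618", "4080", "8080", "8443", "9133:9149", "9166", "9196:9199",
]

allowed = expand_ports(ALIAS)
for port in (3126, 3128, 9618):
    print(port, "allowed" if port in allowed else "not in alias")
```

Port 3126, the proxy port that appears in the CMS log quoted earlier in the thread, is not covered by this alias, while 3125 and 3127:3128 on either side of it are.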
Send message Joined: 15 Jun 08 Posts: 2549 Credit: 255,476,539 RAC: 68,041 |
The project's firewall list might need to be updated. It shows ports/projects that are not in use any more, and others are missing.

Not in use:
Port 3127
Port 3125 (replaced by port 3126 and used by fallback proxies)
Port 5222 (XMPP)
Port 9094
Port 1094
LHCb (all DIRAC ports)

When a fresh VM starts, its internal CVMFS is not yet completely configured. Instead it checks some hard-wired servers to get updated basic setup scripts from. As a result, the packets to the required destination ports should not be restricted to a few IPs; they should be allowed for all destination IPs.

Same for openhtc.io. Those servers are run by Cloudflare and usually don't change their IPs very often, but sometimes they do it a couple of times within just a day. In addition, the list does not include cms-frontier.openhtc.io, which is required for CMS tasks (this wouldn't be necessary if all destinations were allowed).

... My ISP connection handles the traffic easily ...

An argument that is used very often by many volunteers. The point is "My ISP...". The CVMFS manual clearly asks for a local proxy to keep the load on the project servers as low as possible.

... then will be configuring Squid proxy ...

Good idea. The sooner the better. |
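One way to check from the affected host whether outbound TCP to the ports named in this thread gets through is a plain connect test. The following Python sketch is an illustration only: the function name and the injectable `connect` parameter are assumptions, and a failed connect cannot by itself distinguish a local firewall drop from a server that simply does not listen on that port.

```python
import socket

# Ports named in this thread: the ones suggested for checking, plus 3126.
PORTS = [80, 443, 3126, 4080, 8000, 8080, 9618]


def probe_ports(host, ports, timeout=3.0, connect=socket.create_connection):
    """Try a TCP connect to each port and return {port: reachable}."""
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers refusals, timeouts and unreachable networks alike.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results
```

Calling `probe_ports(some_host, PORTS)` with a server known to listen on the port in question (the host is for the reader to choose) gives a quick yes/no per port; a packet capture on the firewall remains the more conclusive test.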
Send message Joined: 8 Apr 21 Posts: 23 Credit: 45,869,850 RAC: 4,682 |
Great news on identifying that the firewall port page needs updating. I now have three completed CMS Simulation tasks for my Win10 host!
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316425963
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316423089
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316428800
The only modification I've made since your previous post was adding port 3126 to my outbound allow rule. I saw that port in the error log of one of my failed work units when you quoted it back in my post. I have not made any additional changes on that. Was the CMS Simulation VM updated? Additionally, from your info on the port usage
I'll remove those from my allowed outbound ports for the LHC@Home traffic. Speaking to your comment here:
I've always had my FW rule configured to allow the identified ports out to any IP. I specifically added the CVMFS IPs I found to my Snort PASS list to ensure none of them got blocked by a signature hit. Now that the host is working, yes, I will be looking up the Squid configuration and setting it up in pfSense to keep my clients from reaching all the way out. Thank you! R/S Scott |
Send message Joined: 15 Jun 08 Posts: 2549 Credit: 255,476,539 RAC: 68,041 |
I usually do not check on other volunteers' hosts. In this case I stumbled over an error that needs to be corrected manually. If it is not, it will hit you every now and then.

Here are 2 failed CMS tasks that show the same error message:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316484150
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316539335

VBoxManage.exe: error: Medium 'C:\ProgramData\BOINC\slots\18\vm_image.vdi' is not accessible. UUID {9f5af9d2-a067-43af-9905-e40303214595} of the medium 'C:\ProgramData\BOINC\slots\18\vm_image.vdi' does not match the value {82750195-a5d4-4cc4-8519-84a53d0783a4} stored in the media registry ('C:\Users\Scott\.VirtualBox\VirtualBox.xml')
. . .
2021-05-14 02:20:05 (2292): NOTE: VM session lock error encountered. BOINC will be notified that it needs to clean up the environment. This might be a temporary problem and so this job will be rescheduled for another time.

Both point to an error in slots\18. This slot needs to be cleaned. You may
- shut down the BOINC client and wait until everything has calmed down
- remove everything below slots\18
- restart the BOINC client |
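The middle step above (removing everything below the slot directory) can be scripted. The Python sketch below is a hedged illustration: the function name is made up, and it must only be run after the BOINC client has been shut down, as the first step says. It deletes the contents of the directory but keeps the directory itself.

```python
import shutil
from pathlib import Path


def clean_slot(slot_dir):
    """Delete everything inside slot_dir; return the removed entry names."""
    slot = Path(slot_dir)
    if not slot.is_dir():
        raise NotADirectoryError(str(slot))
    removed = []
    for child in slot.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)   # remove subdirectories recursively
        else:
            child.unlink()         # remove plain files and symlinks
        removed.append(child.name)
    return sorted(removed)


# Example call, using the slot path from the error message above.
# Run it only while the BOINC client is stopped:
# clean_slot(r"C:\ProgramData\BOINC\slots\18")
```

Returning the list of removed names makes it easy to log what was deleted before the client is restarted.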
©2025 CERN