Message boards : CMS Application : Problems with Theory and CMS on MacOS

[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51550 - Posted: 16 Feb 2025, 18:57:19 UTC

Hello,

I am still trying to track down the issue with the Theory and CMS applications, and since I have not received any feedback from the LHC team I am writing another post.

Apparently LHC does not want to fix the issues. Can you disable the Theory and CMS applications for MacOS? Otherwise my computer keeps running failing CMS and Theory tasks while it waits for working ATLAS tasks.

Please let me know something!!
ID: 51550 · Report as offensive     Reply Quote
Glohr

Joined: 13 Jan 24
Posts: 11
Credit: 3,554,464
RAC: 17,000
Message 51554 - Posted: 17 Feb 2025, 5:07:02 UTC - in response to Message 51550.  

Until your problem is resolved you could go to Project> Preferences> then select 'Edit preferences' at the bottom and change the 'Run only the selected applications' setting.

You might first try resetting the project on that computer.
ID: 51554 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51555 - Posted: 17 Feb 2025, 9:03:25 UTC - in response to Message 51550.  

The missing heartbeat is not the root cause of your problems.
Instead, it's a result of earlier errors that need to be identified.

It could be helpful to understand how the heartbeat method works.
It is identical on Linux/MacOS/Windows and has not been changed for years.

1. VirtualBox must provide a shared folder on the host
2. The VM must call a command to mount the shared folder
3. The VM must download certain scripts from CVMFS (bootstrap)
4. The bootstrap script adds a cron job inside the VM which once a minute touches the heartbeat file from within the VM
5. Vboxwrapper periodically checks the status (st_mtime) of the heartbeat file on the host side and reacts if st_mtime doesn't update
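
Step 5 can be illustrated with a minimal Python sketch. This is not vboxwrapper's actual code; the function name `heartbeat_is_stale` and the way the timeout is passed in are assumptions made for illustration. The 1200 second default is the interval vboxwrapper reports in the log lines quoted later in this thread.

```python
import os
import time


def heartbeat_is_stale(path, timeout_s=1200.0, now=None):
    """Return True if the heartbeat file is missing or was not touched recently.

    Hypothetical helper: it compares the file's st_mtime on the host side
    with the current time, which is the idea behind step 5.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # The guest never created the file, e.g. because bootstrap did not run.
        return True
    return (now - mtime) > timeout_s
```

If steps 2 to 4 fail inside the VM, nothing ever touches the file, so a check of this kind eventually reports it as stale no matter how healthy the host side looks.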


Check the logs to see whether these steps succeeded.

As for (1.)
This looks good
2025-02-11 20:32:01 (72295): Enabling shared directory for VM.
.
.
.
2025-02-11 20:32:01 (72295): 
Command: VBoxManage -q sharedfolder add "boinc_42fdb658fcffee8c" --name "shared" --hostpath "/Library/Application Support/BOINC Data/slots/0/shared"
Exit Code: 0



As for (2.) and (3.)
Taken from another user's log.
These lines are missing from your logs, which indicates the VM starts but then hangs.
2025-02-15 17:18:43 (6244): Guest Log: [INFO] Mounting the shared directory
2025-02-15 17:18:43 (6244): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2025-02-15 17:18:43 (6244): Guest Log: [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch

The overall picture suggests your VM doesn't contact CVMFS to download bootstrap.
Since the cron job is not active vboxwrapper finally shuts down the VM as intended.

As it works fine on Linux/Windows it appears to be a local issue
- may be caused by the sandbox feature used on MacOS
- may be caused by a firewall that doesn't allow connections to CVMFS
- may be caused by other reasons
ID: 51555 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51556 - Posted: 17 Feb 2025, 14:15:07 UTC

This PR on github describes issues on MacOS:
https://github.com/BOINC/boinc/pull/6088

It is primarily related to GPU usage but network connections are also tied to a user.

Since CVMFS and even the local shared folder mounts use network functionality, it might be worth investigating whether there is a relationship.
ID: 51556 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51561 - Posted: 17 Feb 2025, 21:28:34 UTC

I found the issue!
I copied the VM to another folder so it won't get deleted by BOINC. I changed the network configuration from NAT to bridged. Time synchronisation via NTP works, sourcing the functions works, the heartbeat file is created and the job is running!

After switching back to the wrapper + BOINC, the network connections are made and the task is running, but the heartbeat file is no longer updated.
ID: 51561 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51562 - Posted: 17 Feb 2025, 22:15:11 UTC - in response to Message 51561.  
Last modified: 17 Feb 2025, 22:32:56 UTC

I added the following line to CMS_2025_01_16_prod.xml:
<network_bridged_mode/>


Now the heartbeat file is created and updated, and the job is running. (Remote Desktop is working, but "Show graphics" is not :( )

Here are all the files of my first test which succeeded to run after the NIC change from NAT to Bridged.
https://drive.google.com/drive/folders/1Ogw7XQcV-cgEePuEDK0GJsxmgQ53qMSn?usp=share_link

Link to result:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419673918
ID: 51562 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51563 - Posted: 18 Feb 2025, 8:46:01 UTC - in response to Message 51562.  

This is taken from your log:
Command: VBoxManage -q showvminfo "boinc_7e2d147d962cddcd" --machinereadable
bridgeadapter1="en0: Wi-Fi"
nic1="bridged"

Looks like your host is connected via wi-fi.
This is known to be problematic, especially on MacOS if the guest is set to bridged mode (see the VirtualBox forum).
Better to connect the host via cable.
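
For anyone who wants to check this setting on their own host: the `--machinereadable` output consists of key="value" lines, so it is easy to parse. A minimal Python sketch follows; `parse_machinereadable` is a hypothetical helper name and the sample text is just the two lines quoted above.

```python
def parse_machinereadable(text):
    """Parse key="value" lines as printed by `showvminfo --machinereadable`."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not key=value pairs
        info[key.strip().strip('"')] = value.strip().strip('"')
    return info


sample = '''bridgeadapter1="en0: Wi-Fi"
nic1="bridged"'''

info = parse_machinereadable(sample)
print(info["nic1"], "/", info.get("bridgeadapter1"))
```

With the default vboxwrapper configuration one would expect `nic1` to read `nat` rather than `bridged`.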

Then leave the VM network at NAT and find out why that is not working.
This worked for years and according to your earlier posts also on your host (ATLAS).

Some settings you should check:
Is IPv4 enabled on your host/LAN?
If yes, which address range does it use?
Ensure it does not conflict with 10.0.2.0/24 which is used as default by VirtualBox.
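
The overlap test can be done with Python's standard `ipaddress` module. This is a minimal sketch; the helper name and the three example ranges are made up for illustration and should be replaced by the range your LAN actually uses.

```python
import ipaddress

# Default range used by VirtualBox NAT, as mentioned above.
VBOX_NAT_DEFAULT = ipaddress.ip_network("10.0.2.0/24")


def conflicts_with_vbox_nat(lan_cidr):
    """Return True if the given LAN range overlaps VirtualBox's default NAT range."""
    # strict=False also accepts a host address with a prefix, e.g. 10.0.2.15/24.
    lan = ipaddress.ip_network(lan_cidr, strict=False)
    return lan.overlaps(VBOX_NAT_DEFAULT)


# Hypothetical example ranges.
for cidr in ("192.168.1.0/24", "10.0.0.0/8", "10.0.2.15/24"):
    print(cidr, conflicts_with_vbox_nat(cidr))
```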

The log from the example you mentioned shows it finally failed.

Please verify:
Since you copied the VM a couple of times to switch the network settings, it is not clear under which user account it finally ran.
It could have been "nentech", since this name is mentioned in the Hypervisor System Log, but it should have been "boinc_project".
Check/ensure the user account running vboxheadless has write permission to ".../slots/n/shared/".
ID: 51563 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51566 - Posted: 18 Feb 2025, 15:16:33 UTC - in response to Message 51563.  

I will check that later. Wired network is hard for me to test on my Mac.

Here is the latest result, obtained without copying the VM but with the task XML changed to bridged:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=230425509
ID: 51566 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51570 - Posted: 18 Feb 2025, 17:42:40 UTC - in response to Message 51566.  

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=230425509
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419674064

I put all files from a running task in my Google Drive folder:
https://drive.google.com/drive/folders/1Ogw7XQcV-cgEePuEDK0GJsxmgQ53qMSn?usp=sharing
The files in Working Task are the files of the task which is running now.
ID: 51570 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51572 - Posted: 18 Feb 2025, 18:31:11 UTC - in response to Message 51570.  

As already mentioned it would be better to solve the issues with NAT instead of using bridged mode.
Bridged mode will not become the default for good reasons and using it during tests is not helpful to find the root cause.

I suggest opening an issue on the BOINC GitHub to make the MacOS experts there aware, and maybe also raising it on the VirtualBox forum or in their issue tracker.
The latter already reports NAT issues for Windows which appear every now and then.
ID: 51572 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51578 - Posted: 20 Feb 2025, 0:08:04 UTC - in response to Message 51572.  

I did some other testing via the debug mode. I found out the following:
The networking on my Mac is made via IPv6. The DNS is not working inside the VM. Ping via IP address works.
When I change IPv6 on my Mac to manual mode, the DNS is working and the machine is starting.
I just started my job via the BOINCmanager with IPv6 on manual and it is running now. Remote Desktop and the web application are working now :)
ID: 51578 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51581 - Posted: 20 Feb 2025, 8:21:42 UTC - in response to Message 51578.  

You have been asked not to switch to bridged mode.
If you continue doing so further tests are pretty much useless.


As for DNS:
VirtualBox forwards the host's DNS servers to the VM.
This is typically shown in your logs:
00:00:00.081170 dns-monitor HostDnsMonitor: new information
00:00:00.081183 dns-monitor server 1: 2001:b88:1002::10
00:00:00.081197 dns-monitor server 2: 2001:b88:1202::10
00:00:00.081211 dns-monitor server 3: 2001:730:3e42:1000::53
00:00:00.081223 dns-monitor server 4: 89.101.251.228
00:00:00.081234 dns-monitor server 5: 89.101.251.229
These seem to be public DNS servers, which can't resolve hostnames inside your LAN.
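
To see at a glance which address family each forwarded server belongs to, the dns-monitor lines can be parsed with a few lines of Python. This is only a sketch: the regular expression is an assumption based on the five log lines above, and the result says nothing about whether a server is reachable or what it can resolve.

```python
import ipaddress
import re

LOG = """\
00:00:00.081183 dns-monitor server 1: 2001:b88:1002::10
00:00:00.081197 dns-monitor server 2: 2001:b88:1202::10
00:00:00.081211 dns-monitor server 3: 2001:730:3e42:1000::53
00:00:00.081223 dns-monitor server 4: 89.101.251.228
00:00:00.081234 dns-monitor server 5: 89.101.251.229
"""

# Assumed line layout: "... dns-monitor server <n>: <address>"
SERVER_RE = re.compile(r"dns-monitor server (\d+): (\S+)")


def dns_servers(log_text):
    """Return (position, address) pairs in the order the log lists them."""
    return [(int(n), ipaddress.ip_address(addr))
            for n, addr in SERVER_RE.findall(log_text)]


for pos, addr in dns_servers(LOG):
    print(pos, f"IPv{addr.version}", addr)
```

In this log the first three entries are IPv6 addresses and the last two are IPv4.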


[quote]The networking on my Mac is made via IPv6.[/quote]

Did you disable IPv4?
If so, you should enable it since IPv6 doesn't provide NAT.
ID: 51581 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51582 - Posted: 20 Feb 2025, 19:52:41 UTC - in response to Message 51581.  

I had my tasks still running. I tried last night with the NAT.
These 2 tasks ran with NAT enabled and IPv6 set to manual mode without any address configured:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419720786
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419719592

IPv4 was always enabled.

I removed the existing DNS servers and added 4 new ones, 2 IPv6 and 2 IPv4 (listed below). Now the CMS task is working normally.
2606:4700:4700::64
2606:4700:4700::6400
1.1.1.1
8.8.8.8
ID: 51582 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2634
Credit: 272,025,947
RAC: 79,910
Message 51583 - Posted: 21 Feb 2025, 8:06:33 UTC - in response to Message 51582.  

@Mastha-Hacker

Looks good now.
Especially these logs:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419737228
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419742735
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419745434


Just to verify the original issues are solved...

You ran the task in a sandbox environment under username=boinc_project?
This is a must on MacOS.
From the logfile:
2025-02-21 04:34:30 (17875): Detected: Sandbox Configuration Enabled

Please confirm: [Yes|No]


You used vboxwrapper's default network configuration, i.e. NAT?
From the logfile:
2025-02-21 04:34:31 (17875): Setting Network Configuration for NAT.

Please confirm: [Yes|No]


You left the heartbeat check activated as intended by the project and the task regularly updates the heartbeat file in '.../slots/n/shared/'?
From the logfile:
2025-02-21 04:34:30 (17875): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)

Please confirm: [Yes|No]
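
The three log lines above can also be checked mechanically. A minimal Python sketch, assuming the task's stderr log is available as plain text; the marker strings are copied from the log lines quoted in this post, while `check_log` and the dictionary keys are hypothetical names.

```python
# Marker strings taken from the vboxwrapper log lines quoted above.
EXPECTED = {
    "sandbox": "Detected: Sandbox Configuration Enabled",
    "nat": "Setting Network Configuration for NAT.",
    "heartbeat": "Detected: Heartbeat check",
}


def check_log(log_text):
    """Map each check name to True if its marker string occurs in the log."""
    return {name: marker in log_text for name, marker in EXPECTED.items()}


sample_log = """\
2025-02-21 04:34:30 (17875): Detected: Sandbox Configuration Enabled
2025-02-21 04:34:30 (17875): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2025-02-21 04:34:31 (17875): Setting Network Configuration for NAT.
"""

print(check_log(sample_log))
```

This only confirms that the settings were detected at start-up. Whether the heartbeat file is actually being updated still has to be checked in the shared folder itself.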
ID: 51583 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51591 - Posted: 21 Feb 2025, 20:36:20 UTC - in response to Message 51583.  

[quote]You ran the task in a sandbox environment under username=boinc_project?
This is a must on MacOS.
From the logfile:
2025-02-21 04:34:30 (17875): Detected: Sandbox Configuration Enabled

Please confirm: [Yes|No][/quote]

Yes

[quote]You used vboxwrapper's default network configuration, i.e. NAT?
From the logfile:
2025-02-21 04:34:31 (17875): Setting Network Configuration for NAT.

Please confirm: [Yes|No][/quote]

Yes. I used the latest vboxwrapper (26209).

[quote]You left the heartbeat check activated as intended by the project and the task regularly updates the heartbeat file in '.../slots/n/shared/'?
From the logfile:
2025-02-21 04:34:30 (17875): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)

Please confirm: [Yes|No][/quote]

Yes

So the cause of the last error was a (for VirtualBox) faulty DNS configuration on MacOS.
ID: 51591 · Report as offensive     Reply Quote
[DPC] Mastha-Hacker

Joined: 11 Apr 11
Posts: 23
Credit: 194,876
RAC: 57
Message 51597 - Posted: 23 Feb 2025, 18:27:03 UTC

The ATLAS and Theory tasks are also working! :)
ID: 51597 · Report as offensive     Reply Quote



©2025 CERN