Questions and Answers : Unix/Linux : computing errors

alphaaurigae

Joined: 26 Nov 11
Posts: 3
Credit: 104,527
RAC: 0
Message 48139 - Posted: 29 May 2023, 3:32:12 UTC

I intended to start crunching LHC@home, but I get computation errors on every task. The last one I tested was CMS.
I'm running:
- Ubuntu 22.04, kernel 5.15.0-72-lowlatency
- BOINC 7.23.0, built from the git master branch (works fine everywhere else)
- VirtualBox 7.0.8 r156879 (Qt5.15.3), guest additions updated (I may have missed a reboot after the guest additions update, but should that be an issue?)
Tested with the firewall off, behind a default NAT.
The machine is an AMD Ryzen 1920X on an Asus ROG Strix X399.

lscpu prints Virtualization: AMD-V
RAM doesn't seem to be an issue (I added 100 GB of swap), and neither does disk space (400 GB+ free).
In the last test, CMS tasks failed at ~1% with a computation error.
What could be the issue here?
maeax

Joined: 2 May 07
Posts: 2114
Credit: 159,914,613
RAC: 83,929
Message 48140 - Posted: 29 May 2023, 5:09:32 UTC - in response to Message 48139.  

2023-05-28 02:14:49 (3326477): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
BOINC has a conflict with the CA certificate. See the BOINC web page.
You can upgrade BOINC to 7.20.2.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2421
Credit: 227,211,346
RAC: 131,365
Message 48141 - Posted: 29 May 2023, 5:25:25 UTC - in response to Message 48139.  

Looks like most of your network packets get lost.
Do some basic tests.

Example:
nc -zvvw10 cern.ch 80

Result must be:
Connection to cern.ch 80 port [tcp/http] succeeded!


Other tests that must succeed:
nc -zvvw10 vccs.cern.ch 443
nc -zvvw10 vocms0840.cern.ch 9618
nc -zvvw10 vocms0267.cern.ch 4080
nc -zvvw10 eoscms-ns-ip563.cern.ch 1094
nc -zvvw10 vocms0205.cern.ch 80
nc -zvvw10 cmsfrontier.cern.ch 8000
nc -zvvw10 cms-frontier.openhtc.io 8080


If even one of them fails, you need to investigate whether the cause is the firewall on the computer or the one on the router.
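The checks above can be run in one pass. The following is a minimal sketch, assuming a POSIX shell and an nc that supports the -z and -w options. The host:port pairs are the ones listed above; the names ENDPOINTS and check_endpoints are made up for this example. It relies on the exit status of nc rather than on its message text, since the wording differs between netcat variants.

```shell
#!/bin/sh
# Endpoints from the list above, written as host:port pairs.
ENDPOINTS="cern.ch:80 vccs.cern.ch:443 vocms0840.cern.ch:9618 vocms0267.cern.ch:4080 eoscms-ns-ip563.cern.ch:1094 vocms0205.cern.ch:80 cmsfrontier.cern.ch:8000 cms-frontier.openhtc.io:8080"

# check_endpoints HOST:PORT...
# Probes each endpoint with nc (10 second timeout) and prints one line per result.
# Returns 1 if at least one endpoint could not be reached, 0 otherwise.
check_endpoints() {
    failed=0
    for ep in "$@"; do
        host=${ep%:*}    # text before the last colon
        port=${ep##*:}   # text after the last colon
        if nc -z -w 10 "$host" "$port" >/dev/null 2>&1; then
            echo "OK   $host $port"
        else
            echo "FAIL $host $port"
            failed=1
        fi
    done
    return "$failed"
}
```

It would be called as: check_endpoints $ENDPOINTS (unquoted on purpose, so the list splits into separate arguments). Any FAIL line points at a port worth checking in the local or router firewall.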
alphaaurigae

Joined: 26 Nov 11
Posts: 3
Credit: 104,527
RAC: 0
Message 48145 - Posted: 29 May 2023, 19:09:46 UTC
Last modified: 29 May 2023, 19:11:46 UTC

Thanks for the replies!
I had missed opening port 8000.
The CMS test task is at 7% now, way further than the previous ones, which hit the computation error at ~1%.
Curious that it didn't work when I tested with ufw disabled. Maybe it wasn't reloaded properly?
Here are the updated ufw rules I've now added besides the defaults for 80, 443 and 53:
	sudo ufw allow out 9618/tcp # boinc lhc Theory and CMS and LHCb:
	sudo ufw allow out 9094/tcp # boinc lhc ATLAS
	sudo ufw allow out 5222/tcp # boinc lhc xmpp ATLAS
	sudo ufw allow out 3125/tcp # boinc lhc CVMFS
	sudo ufw allow out 4080/tcp # boinc lhc WMAgent
	sudo ufw allow out 8000/tcp # boinc lhc HTTP
	sudo ufw allow out 8080/tcp # boinc lhc HTTP
	sudo ufw allow out 8443/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9133/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9135/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9148/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9149/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9166/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9196/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9199/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 1094/tcp # boinc lhc CMS EOS
	sudo ufw reload
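A rule list like the one above can also be generated from a plain list of ports, which makes it easier to review before anything is changed. This is only a sketch: print_ufw_rules is a hypothetical helper name, and it prints the commands instead of running them. The port numbers are the ones from the rules above.

```shell
#!/bin/sh
# Outbound TCP ports taken from the rules above.
LHC_PORTS="9618 9094 5222 3125 4080 8000 8080 8443 9133 9135 9148 9149 9166 9196 9199 1094"

# print_ufw_rules PORT...
# Prints one "ufw allow out" command per port without executing it,
# so the list can be checked first and then pasted into a terminal.
print_ufw_rules() {
    for port in "$@"; do
        echo "sudo ufw allow out ${port}/tcp"
    done
}
```

Calling print_ufw_rules $LHC_PORTS prints the same commands as above, minus the comments. After applying them and running sudo ufw reload, sudo ufw status verbose lists the rules that are actually active, which is one way to check whether a change took effect.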


$ nc -zvvw10 cern.ch 80
nc -zvvw10 vccs.cern.ch 443
nc -zvvw10 vocms0840.cern.ch 9618
nc -zvvw10 vocms0267.cern.ch 4080
nc -zvvw10 eoscms-ns-ip563.cern.ch 1094
nc -zvvw10 vocms0205.cern.ch 80
nc -zvvw10 cmsfrontier.cern.ch 8000
nc -zvvw10 cms-frontier.openhtc.io 8080
Connection to cern.ch (188.184.37.219) 80 port [tcp/http] succeeded!
Connection to vccs.cern.ch (137.138.120.99) 443 port [tcp/https] succeeded!
Connection to vocms0840.cern.ch (137.138.156.85) 9618 port [tcp/*] succeeded!
Connection to vocms0267.cern.ch (137.138.52.94) 4080 port [tcp/*] succeeded!
Connection to eoscms-ns-ip563.cern.ch (128.142.160.140) 1094 port [tcp/rootd] succeeded!
Connection to vocms0205.cern.ch (137.138.55.253) 80 port [tcp/http] succeeded!
Connection to cmsfrontier.cern.ch (188.184.100.32) 8000 port [tcp/*] succeeded!
Connection to cms-frontier.openhtc.io (188.114.97.3) 8080 port [tcp/http-alt] succeeded!


Fingers crossed. I'll update later if the CMS task completes successfully.

I upgraded BOINC too but haven't restarted yet. I wasn't aware of the certificate bug; PrimeGrid had no obvious issues.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2421
Credit: 227,211,346
RAC: 131,365
Message 48147 - Posted: 29 May 2023, 20:21:15 UTC - in response to Message 48145.  

You never had a BOINC certificate bug.
The issue was inside your VM, which could not contact CERN to update its CA certs.

BOINC's progress bar/percentage doesn't tell you what you need to know since it can't look into the VM.
To get an early impression whether the task processes fine you may check the messages in .../slots/n/stderr.txt with n being the slot number.

You may also use the "show" function of the VirtualBox GUI to get to the VM's consoles.
Leave that view using "Machine -> disconnect from GUI". Other methods interrupt the VM and cause unnecessary load.
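For the stderr.txt check mentioned above, a small helper can print the tail of every slot's log at once. This is a sketch under assumptions: show_slot_logs is a made-up name, and the BOINC data directory has to be passed in as an argument because its location depends on the installation and is not given in this thread.

```shell
#!/bin/sh
# show_slot_logs BOINC_DIR [LINES]
# Prints the last LINES lines (default 20) of each slots/N/stderr.txt
# found under BOINC_DIR, with a header line naming the file.
show_slot_logs() {
    boinc_dir=$1
    lines=${2:-20}
    for f in "$boinc_dir"/slots/*/stderr.txt; do
        [ -f "$f" ] || continue   # skip if the pattern matched nothing
        echo "== $f"
        tail -n "$lines" "$f"
    done
}
```

Reading these files may require elevated permissions, depending on how BOINC was installed.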
alphaaurigae

Joined: 26 Nov 11
Posts: 3
Credit: 104,527
RAC: 0
Message 48158 - Posted: 30 May 2023, 21:49:10 UTC - in response to Message 48147.  
Last modified: 30 May 2023, 21:52:34 UTC

You never had a BOINC certificate bug.

- Yeah, I guessed that about the BOINC "upgrade", which would have been a downgrade since I'm on 7.23.0 and the version suggested in this thread was 7.20.2. I compile the master branch from GitHub and have never run into issues doing so. I'm up to date with master as of now.

... I've had 11 valid CMS tasks in the meantime.
When I tried Theory simulation I had some computation errors. I have to check that later; for now I'm running CMS.

My current ufw config, running PrimeGrid and LHC:
	sudo ufw allow out to any port 80
	sudo ufw allow out to any port 443
	sudo ufw allow out to any port 53
	sudo ufw allow out 31416/tcp # boinc 
	sudo ufw allow out 38406/tcp # boinc 
	sudo ufw allow out 46082/tcp # boinc 
	sudo ufw allow out 9618/tcp # boinc lhc Theory and CMS and LHCb:
	sudo ufw allow out 9094/tcp # boinc lhc ATLAS
	sudo ufw allow out 5222/tcp # boinc lhc xmpp ATLAS
	sudo ufw allow out 3125/tcp # boinc lhc CVMFS
	sudo ufw allow out 4080/tcp # boinc lhc WMAgent
	sudo ufw allow out 8000/tcp # boinc lhc HTTP
	sudo ufw allow out 8080/tcp # boinc lhc HTTP
	sudo ufw allow out 8443/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9133/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9135/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9148/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9149/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9166/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9196/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 9199/tcp # boinc lhc LHCb DIRAC
	sudo ufw allow out 1094/tcp # boinc lhc CMS EOS


To get an early impression whether the task processes fine you may check the messages in .../slots/n/stderr.txt with n being the slot number.

- maybe useful later on, thx

You may also use the "show" function of the VirtualBox GUI to get to the VM's consoles.

Going to try this out, thx again.
Charles R Ward

Joined: 14 Jul 05
Posts: 1
Credit: 476,702
RAC: 914
Message 48190 - Posted: 4 Jun 2023, 15:06:01 UTC - in response to Message 48141.  

I am new to using Linux.
LHC fails after 10 minutes. The latest was:
Task 394537242, work unit 211814107, computer 10830515
Sent: 4 Jun 2023, 12:29:35 UTC
Reported: 4 Jun 2023, 13:58:59 UTC
Status: Error while computing
Run time: 602.86 s, CPU time: 0.00 s, Credit: ---
Application: ATLAS Simulation v3.01 (native_mt) x86_64-pc-linux-gnu

Using Debian 11, AMD Ryzen 5600G, 16 GB RAM.

The tests you suggest fail.
charles@Guardian:~$ nc -zvvw10 cern.ch 80
DNS fwd/rev mismatch: cern.ch != drupal8lb01.cern.ch
cern.ch [188.184.37.219] 80 (http) open
sent 0, rcvd 0
charles@Guardian:~$ nc -zvvw10 vccs.cern.ch 443
vccs.cern.ch [137.138.120.99] 443 (https) open
sent 0, rcvd 0
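A note on reading this output: the "DNS fwd/rev mismatch" warning and the "(http) open" wording appear to come from the traditional netcat variant, which reports a reachable port as "open" instead of "succeeded!". Under that assumption, the two results shown here are successful connections. The sketch below accepts both wordings; nc_line_ok is a hypothetical helper name.

```shell
#!/bin/sh
# nc_line_ok LINE
# Succeeds if LINE, a line of "nc -zv" output, reports a successful connection.
# Accepts the "... succeeded!" wording and the "... open" wording.
nc_line_ok() {
    case $1 in
        *"succeeded!"|*" open") return 0 ;;
        *) return 1 ;;
    esac
}
```

Example: nc_line_ok 'cern.ch [188.184.37.219] 80 (http) open' && echo reachable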



©2024 CERN