Message boards : ATLAS application : Bad WUs?

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1252
Credit: 8,305,230
RAC: 20,228
Message 47177 - Posted: 26 Aug 2022, 12:46:23 UTC - in response to Message 47176.  

for the last few hours, all my ATLAS tasks on all of my computers are failing. They keep running, but no CPU usage, and VM console_2 shows N/A for each core.
I then aborted the task manually.

Examples:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10688539

What's going wrong?

Guest Log: Checking CVMFS... and no response
maeax had the same problem some time ago and his problem was proxy settings, if I remember correctly.
ID: 47177
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47178 - Posted: 26 Aug 2022, 12:53:19 UTC - in response to Message 47177.  

Guest Log: Checking CVMFS... and no response
maeax had the same problem some time ago and his problem was proxy settings, if I remember correctly.
hm, I am not using a proxy.
Furthermore, I have just realized that on my other computers, which run Theory tasks, the tasks downloaded within the past few hours don't run either. The log there shows:
"Guest Log: Probing /cvmfs/sft.cern.ch... Failed!"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=364413464

I have not made any changes on any of my computers, so I guess the problem must be on CERN's side?
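
For anyone who wants to check several stderr logs for this symptom, here is a minimal Python sketch. It assumes the failure line always has the wording quoted above. The function name `failed_cvmfs_repos` and the sample timestamp are made up for illustration and are not part of BOINC or CVMFS.

```python
import re
from typing import List

# Wording taken from the stderr excerpt quoted in this thread; this is an
# assumption about the log format, not an official specification.
PROBE_FAILED = re.compile(r"Guest Log: Probing (/cvmfs/\S+?)\.\.\. Failed!")


def failed_cvmfs_repos(stderr_text: str) -> List[str]:
    """Return the CVMFS mount points whose probe failed, in order of appearance."""
    return PROBE_FAILED.findall(stderr_text)


if __name__ == "__main__":
    # Hypothetical sample line; only the part after "Guest Log:" comes from the thread.
    sample = "2022-08-26 12:40:01 (1234): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!\n"
    print(failed_cvmfs_repos(sample))
```

Pasting the contents of a task's stderr into `failed_cvmfs_repos` lists every repository that could not be reached, which makes it easier to compare hosts.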
ID: 47178
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47179 - Posted: 26 Aug 2022, 12:58:10 UTC - in response to Message 47178.  

here is the stderr from another machine:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=364441059

Basically the same error, the same problem.

And it looks exactly the same on all other computers :-(

As said before, I have not made any changes at all.
ID: 47179
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2364
Credit: 218,393,532
RAC: 120,829
Message 47180 - Posted: 26 Aug 2022, 13:14:07 UTC - in response to Message 47179.  

The task logs don't show a specific error condition, except this:
06:36:48.307434          ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={85632c68-b5bb-4316-a900-5eb28d3413df} aComponent={MachineWrap} aText={Runtime error opening 'Z:\BOINC\slots\2\boinc_5f3d7e10453d75c7\boinc_5f3d7e10453d75c7.vbox' for reading: -103 (Path not found.).

This tells you that vbox expects to read the VM definition file but either the file or a path component doesn't exist.
That shouldn't be responsible for the errors on your other computers.
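
A rough way to see which part of that path is missing is sketched below in Python. It assumes the error text has the same wording as the log line above. `vbox_path_from_error` and `first_missing` are hypothetical helper names, not part of VirtualBox or BOINC.

```python
import os
import re
from pathlib import PureWindowsPath
from typing import Callable, Optional

# Matches the quoted path in a "Runtime error opening '...' for reading" message.
OPENING_PATH = re.compile(r"Runtime error opening '([^']+)' for reading")


def vbox_path_from_error(log_line: str) -> Optional[PureWindowsPath]:
    """Extract the .vbox path from the error line, or None if it is not there."""
    match = OPENING_PATH.search(log_line)
    return PureWindowsPath(match.group(1)) if match else None


def first_missing(
    path: PureWindowsPath,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[PureWindowsPath]:
    """Walk from the drive root down to the file and return the first missing part."""
    for candidate in [*reversed(path.parents), path]:
        if not exists(str(candidate)):
            return candidate
    return None


if __name__ == "__main__":
    line = (
        r"aText={Runtime error opening "
        r"'Z:\BOINC\slots\2\boinc_5f3d7e10453d75c7\boinc_5f3d7e10453d75c7.vbox' "
        r"for reading: -103 (Path not found.)."
    )
    path = vbox_path_from_error(line)
    print("path in log:", path)
    # Only meaningful when run on the affected Windows host.
    print("first missing component:", first_missing(path))
```

If the first missing component turns out to be the drive itself, the data directory was probably not available when the task started; if it is the slot directory, BOINC may already have cleaned it up.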


Just bypassed my local proxy and tested grid.cern.ch and sft.cern.ch via s1cern-cvmfs.openhtc.io.
Both immediately return "HTTP/1.1 200 OK".

If you see network connection problems to CVMFS you may check your router/LAN.
Did your ISP recently reset the connection?
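
If you want to repeat that test yourself, the following Python sketch shows one way to do it. It bypasses any configured proxy and asks the server for each repository's manifest. The `http://` scheme and the `/cvmfs/<repo>/.cvmfspublished` path are my assumptions based on the usual CVMFS server layout. `manifest_url`, `probe` and `main` are illustrative names only.

```python
import urllib.error
import urllib.request

SERVER = "http://s1cern-cvmfs.openhtc.io"  # host named above; scheme is an assumption
REPOS = ["grid.cern.ch", "sft.cern.ch"]


def manifest_url(server: str, repo: str) -> str:
    # Assumption: the repository manifest lives at /cvmfs/<repo>/.cvmfspublished
    return f"{server.rstrip('/')}/cvmfs/{repo}/.cvmfspublished"


def probe(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request without any proxy and describe the outcome."""
    # An empty ProxyHandler mapping disables proxies from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"failed: {exc}"


def main() -> None:
    for repo in REPOS:
        url = manifest_url(SERVER, repo)
        print(url, "->", probe(url))

# Call main() to run the probes from the affected host.
```

A status of 200 for both repositories suggests the server is reachable from the host, so the problem would more likely sit between the VM and the LAN than at the CVMFS server.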


BTW:
Your host reports vbox 6.1.18.
You may update it to the most recent version, then reboot and check/clean the vbox environment (including the disk entries) before you start BOINC.
ID: 47180
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47181 - Posted: 26 Aug 2022, 13:58:57 UTC - in response to Message 47180.  

If you see network connection problems to CVMFS you may check your router/LAN.
Did your ISP recently reset the connection?
No idea whether my ISP recently reset the connection. I did not notice anything like that.
The fact is that until this morning all LHC VM tasks (ATLAS and Theory) were running okay.
I have now rebooted the ISP modem/router, which unfortunately did NOT solve the problem.

I looked at the VM console during the start of a Theory task, and the process came to a halt at:
(1 of 4) A start job is running for /cvmfs/cernvm-prod.cern.ch
This ran in a loop for a few minutes, then it jumped to where it says "Probing /cvmfs/sft.cern.ch... Failed!"

Since all of my computers connect to everything else without any problem, and, for example, the QuChemPad VM tasks run properly, I wonder what is so special about CVMFS. What is the modem/router NOT doing well with a CVMFS connection, when everything else works?

I will update VirtualBox to the latest version on one of my machines and then try again, but my instinct tells me that this will not solve the problem :-(
If this is so, what else could I do in order to crunch LHC?
ID: 47181
maeax

Joined: 2 May 07
Posts: 2050
Credit: 151,468,134
RAC: 39,162
Message 47182 - Posted: 26 Aug 2022, 16:37:39 UTC - in response to Message 47177.  

for the last few hours, all my ATLAS tasks on all of my computers are failing. They keep running, but no CPU usage, and VM console_2 shows N/A for each core.
I then aborted the task manually.

Examples:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10688539

What's going wrong?

Guest Log: Checking CVMFS... and no response
maeax had the same problem some time ago and his problem was proxy settings, if I remember correctly.

Yes, squid in a CentOS 8 VM does not run well with my Threadripper 3995.
Today I ran a new test with squid: two ATLAS tasks ran for 5 hours and one ran for 11 hours, all for nothing.
So I stopped squid. I have no idea why this happens.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=6&appid=14
ID: 47182
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2364
Credit: 218,393,532
RAC: 120,829
Message 47183 - Posted: 26 Aug 2022, 17:13:44 UTC

The problems from a while ago had nothing to do with squid.
They were caused by an undefined service order during ATLAS startup.
ID: 47183
maeax

Joined: 2 May 07
Posts: 2050
Credit: 151,468,134
RAC: 39,162
Message 47184 - Posted: 27 Aug 2022, 1:30:24 UTC
Last modified: 27 Aug 2022, 1:43:37 UTC

Hypervisor System Log:

468:39:42.807656 WARNING [COM]: aRC=E_FAIL (0x80004005) aIID={16ced992-5fdc-4aba-aff5-6a39bbd7c38b} aComponent={HostWrap} aText={Could not load the Host USB Proxy Service (VERR_FILE_NOT_FOUND). The service might not be installed on the host computer}, preserve=true aResultDetail=0
468:39:43.017821 Saving settings file "C:\ProgramData\BOINC\slots\3\boinc_10feba5151116813\boinc_10feba5151116813.vbox" with version "1.16-windows"

??
ID: 47184
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47185 - Posted: 27 Aug 2022, 6:33:14 UTC - in response to Message 47181.  
Last modified: 27 Aug 2022, 6:41:23 UTC

If you see network connection problems to CVMFS you may check your router/LAN.
Did your ISP recently reset the connection?
No idea whether my ISP recently reset the connection. I did not notice anything like that.
The fact is that until this morning all LHC VM tasks (ATLAS and Theory) were running okay.
I have now rebooted the ISP modem/router, which unfortunately did NOT solve the problem.

I looked at the VM console during the start of a Theory task, and the process came to a halt at:
(1 of 4) A start job is running for /cvmfs/cernvm-prod.cern.ch
This ran in a loop for a few minutes, then it jumped to where it says "Probing /cvmfs/sft.cern.ch... Failed!"

Since all of my computers connect to everything else without any problem, and, for example, the QuChemPad VM tasks run properly, I wonder what is so special about CVMFS. What is the modem/router NOT doing well with a CVMFS connection, when everything else works?

I will update VirtualBox to the latest version on one of my machines and then try again, but my instinct tells me that this will not solve the problem :-(
If this is so, what else could I do in order to crunch LHC?
I have now installed the latest VirtualBox version, 6.1.36, and, as expected, my problem of not connecting to CVMFS still exists.
Honestly, I now have no idea what else I can do.
Does this now mean that I will no longer be able to crunch LHC VB tasks? :-)
ID: 47185
maeax

Joined: 2 May 07
Posts: 2050
Credit: 151,468,134
RAC: 39,162
Message 47186 - Posted: 27 Aug 2022, 7:03:00 UTC - in response to Message 47185.  

Are there any related events in the Windows event log on your PCs?
ID: 47186
keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,019,766
RAC: 30
Message 47188 - Posted: 27 Aug 2022, 7:31:03 UTC - in response to Message 47185.  

This may or may not apply to you, but I had similar problems a while ago.

At the suggestion of someone (I don't remember if it was here or my ISP help desk), I connected my cruncher machine via copper rather than Wi-Fi, even though I was using 802.11ac.

All of my LHC/ATLAS connection problems went away.
ID: 47188
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47189 - Posted: 27 Aug 2022, 7:40:26 UTC - in response to Message 47188.  

This may or may not apply to you, but I had similar problems a while ago.

At the suggestion of someone (I don't remember if it was here or my ISP help desk), I connected my cruncher machine via copper rather than Wi-Fi, even though I was using 802.11ac.

All of my LHC/ATLAS connection problems went away.
Some of my machines are connected via copper, others via WLAN. My problem exists on all of these machines :-(
ID: 47189
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47190 - Posted: 27 Aug 2022, 7:59:30 UTC - in response to Message 47186.  

Are there any related events in the Windows event log on your PCs?
Nothing special.
ID: 47190
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47197 - Posted: 29 Aug 2022, 13:22:00 UTC - in response to Message 47190.  

good news: all my computers are successfully crunching LHC CM tasks again.
I just gave it a try this late morning, and everything worked fine. Access to cvmfs was not a problem.
No idea what had happened last weekend.
ID: 47197
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47199 - Posted: 29 Aug 2022, 15:49:22 UTC - in response to Message 47197.  

good news: all my computers are successfully crunching LHC CM tasks again.
Sorry for the typo. It should have read "VM tasks", of course. (I noticed the typo only now and could no longer edit my posting.)
ID: 47199
captainjack

Joined: 21 Jun 10
Posts: 39
Credit: 10,032,726
RAC: 10,258
Message 47213 - Posted: 31 Aug 2022, 3:07:59 UTC

For the past two days, I haven't been able to get any ATLAS tasks to run on Windows boxes. I've tried two different Windows 11 computers and one Windows 10 computer. All had previously successfully completed multiple ATLAS tasks. On prior successful tasks, there is a message in the stderr.txt that says:

2022-08-27 17:09:53 (16204): Guest Log: 00:00:00.000603 main 5.2.32 r132073 started. Verbose level = 0
2022-08-27 17:09:59 (16204): Guest Log: CVMFS is ok

Now I get a message that says:

2022-08-30 21:52:02 (16276): Guest Log: 00:00:00.002213 main 5.2.32 r132073 started. Verbose level = 0
2022-08-30 21:52:12 (16276): Guest Log: 00:00:10.017953 timesync vgsvcTimeSyncWorker: Radical guest time change: 18 010 622 070 000ns (GuestNow=1 661 914 332 064 570 000 ns GuestLast=1 661 896 321 442 500 000 ns fSetTimeLastLoop=true )

And then it just sits there and does nothing.

Any suggestions?
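
To put a number on the "Radical guest time change" line, a small Python sketch like the one below can pull the values out of the log. It assumes the line always has the layout shown above, with digits grouped by spaces. `radical_time_change` is a made-up helper name, not a VirtualBox tool.

```python
import re
from typing import Dict, Optional

# A nanosecond value printed as space-separated digit groups, followed by "ns".
_NS = r"([\d ]+?)\s*ns"
RADICAL = re.compile(
    r"Radical guest time change:\s*" + _NS
    + r"\s*\(GuestNow=" + _NS
    + r"\s*GuestLast=" + _NS
)


def _to_int(grouped: str) -> int:
    """Turn '18 010 622 070 000' into the integer 18010622070000."""
    return int(grouped.replace(" ", ""))


def radical_time_change(line: str) -> Optional[Dict[str, int]]:
    """Return the reported jump and both guest timestamps in nanoseconds."""
    match = RADICAL.search(line)
    if match is None:
        return None
    delta, now, last = (_to_int(group) for group in match.groups())
    return {"delta_ns": delta, "now_ns": now, "last_ns": last}


if __name__ == "__main__":
    line = (
        "timesync vgsvcTimeSyncWorker: Radical guest time change: "
        "18 010 622 070 000ns (GuestNow=1 661 914 332 064 570 000 ns "
        "GuestLast=1 661 896 321 442 500 000 ns fSetTimeLastLoop=true )"
    )
    info = radical_time_change(line)
    print("jump in hours:", round(info["delta_ns"] / 1e9 / 3600, 2))
    print("equals GuestNow - GuestLast:",
          info["delta_ns"] == info["now_ns"] - info["last_ns"])
```

For the line quoted above, the jump comes out at about 5 hours and equals GuestNow minus GuestLast. A jump that is so close to a whole number of hours may just be the guest clock being corrected by a time-zone offset rather than the cause of the hang, but that reading is my assumption.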
ID: 47213
maeax

Joined: 2 May 07
Posts: 2050
Credit: 151,468,134
RAC: 39,162
Message 47214 - Posted: 31 Aug 2022, 4:54:01 UTC - in response to Message 47213.  
Last modified: 31 Aug 2022, 4:54:43 UTC

Last night I also had three ATLAS tasks running for 6 hours under Windows 11 Pro and doing nothing.
This happens when CVMFS is not connected correctly, whatever the reason is.
The problem is not on our side.
ID: 47214
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47215 - Posted: 31 Aug 2022, 4:59:56 UTC - in response to Message 47214.  

This happens when CVMFS is not connected correctly, whatever the reason is.
The problem is not on our side.
this seems to happen once in a while - see my postings from a few days ago :-(
CVMFS seems to be extremely touchy.
ID: 47215
captainjack

Joined: 21 Jun 10
Posts: 39
Credit: 10,032,726
RAC: 10,258
Message 47218 - Posted: 2 Sep 2022, 3:32:42 UTC

After reading through several other posts about this kind of problem, I thought I might have network problems. I shut down all computers, the switch and finally the router. Then I restarted the router, then the switch, then the computers. ATLAS tasks work again. What had me confused was that all other internet connections worked just fine, so I didn't think it was a network issue. But a reboot got ATLAS tasks going again. Hope this helps someone else having similar issues.
ID: 47218
Erich56

Joined: 18 Dec 15
Posts: 1672
Credit: 96,815,600
RAC: 126,695
Message 47219 - Posted: 2 Sep 2022, 6:45:32 UTC - in response to Message 47218.  

After reading through several other posts about this kind of problem, I thought I might have network problems. I shut down all computers, the switch and finally the router. Then I restarted the router, then the switch, then the computers. ATLAS tasks work again. What had me confused was that all other internet connections worked just fine, so I didn't think it was a network issue. But a reboot got ATLAS tasks going again. Hope this helps someone else having similar issues.
Well, the situation was similar here about a week ago. Everything else worked fine; only the LHC VM tasks that need access to CVMFS did not work. However, a reboot of the modem/router did NOT help. Only 2 days later everything worked fine again, without any further intervention from my side.
ID: 47219


©2024 CERN