Message boards : ATLAS application : "No storage device attached ..."

Erich56

Joined: 18 Dec 15
Posts: 1545
Credit: 55,559,293
RAC: 74,483
Message 47340 - Posted: 4 Oct 2022, 7:35:23 UTC

Yesterday and today, on just one of my hosts, I had tasks that started normally but stopped using the CPU after less than 2 minutes. They nevertheless kept running and running ... until I eventually happened to notice.

Here are the tasks:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=366570838
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366599960

the following entry in the stderr of the failed tasks caught my eye:

aText={No storage device attached to device slot 0 on port 0 of controller 'Hard Disk Controller'}, preserve=false aResultDetail=0

Does anyone have an idea what's going wrong?

BTW, this is the only one of my machines with the latest versions of both BOINC and Oracle VirtualBox:

BOINC client v7.20.2
VirtualBox 6.1.38

Maybe 6.1.38 has some problems?
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47341 - Posted: 4 Oct 2022, 7:56:21 UTC - in response to Message 47340.  

This happens when the CVMFS connection is not successful.
Theory also had faulty tasks last night.
Sometimes you have to stop such a running ATLAS task yourself; otherwise the task never stops on its own.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2121
Credit: 169,263,210
RAC: 113,897
Message 47342 - Posted: 4 Oct 2022, 8:19:17 UTC - in response to Message 47340.  

Looks like CVMFS works fine but there were Frontier problems last night that couldn't be solved by switching to fail-over servers.
For the past hour my logs have no longer shown Frontier fail-over requests, and things seem to be recovering (slowly).

This affects ATLAS and CMS.

The scripts running on the client side usually can't recover from that.
It would be best to cancel weird-looking tasks and start fresh ones.
Erich56

Joined: 18 Dec 15
Posts: 1545
Credit: 55,559,293
RAC: 74,483
Message 47343 - Posted: 4 Oct 2022, 8:59:42 UTC - in response to Message 47342.  

Thanks, computezrmle, for your analysis and quick reply.
I have now checked my other hosts; it seems none of them were affected :-)
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47345 - Posted: 4 Oct 2022, 17:37:13 UTC - in response to Message 47342.  

computezrmle wrote:
"Looks like CVMFS works fine but there were Frontier problems last night that couldn't be solved by switching to fail-over servers.
For the past hour my logs have no longer shown Frontier fail-over requests, and things seem to be recovering (slowly).

This affects ATLAS and CMS.

The scripts running on the client side usually can't recover from that.
It would be best to cancel weird-looking tasks and start fresh ones."

Believe what you want to believe.
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47346 - Posted: 5 Oct 2022, 6:16:34 UTC - in response to Message 47345.  
Last modified: 5 Oct 2022, 6:18:18 UTC

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10593998
Client state: Aborted by user
Exit status: 203 (0x000000CB) EXIT_ABORTED_VIA_GUI
Computer ID: 10593998
Run time: 14 hours 24 min 57 sec
CPU time: 2 min 9 sec

Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 41.86 GFLOPS
Application version: ATLAS Simulation v2.02 (vbox64_mt_mcore_atlas)
windows_x86_64
This example from a Windows 11 Pro host shows why your answer is not correct for Windows!
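The symptom reported throughout this thread is a task whose run time keeps growing while its CPU time stays at a few seconds or minutes. A minimal Python sketch of that check follows, using the figures from the task quoted above. The function names, the 30-minute grace period and the 5 % threshold are assumptions chosen for illustration; they are not values used by BOINC or by the ATLAS application.

```python
from datetime import timedelta


def cpu_efficiency(run_time: timedelta, cpu_time: timedelta, n_cores: int = 1) -> float:
    """Fraction of the available core time that was actually spent on the CPU."""
    wall_seconds = run_time.total_seconds() * n_cores
    if wall_seconds <= 0:
        return 0.0
    return cpu_time.total_seconds() / wall_seconds


def looks_stalled(run_time: timedelta, cpu_time: timedelta, n_cores: int = 1,
                  min_run: timedelta = timedelta(minutes=30),
                  threshold: float = 0.05) -> bool:
    """Flag a task that has run long enough but used almost no CPU.

    min_run and threshold are hypothetical defaults, not project settings.
    """
    if run_time < min_run:
        return False  # too early to judge; the VM may still be booting
    return cpu_efficiency(run_time, cpu_time, n_cores) < threshold


# Figures from the task above: 14 h 24 min 57 s run time, 2 min 9 s CPU time.
run = timedelta(hours=14, minutes=24, seconds=57)
cpu = timedelta(minutes=2, seconds=9)
print(f"efficiency={cpu_efficiency(run, cpu):.4f} stalled={looks_stalled(run, cpu)}")
```

For multi-core ATLAS tasks, pass the number of cores assigned to the VM as `n_cores`, since CPU time accumulates across all cores.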
Erich56

Joined: 18 Dec 15
Posts: 1545
Credit: 55,559,293
RAC: 74,483
Message 47352 - Posted: 10 Oct 2022, 5:24:29 UTC

Again there was a task with only 1:34 minutes of CPU time that nevertheless kept running ...

https://lhcathome.cern.ch/lhcathome/result.php?resultid=366757195

the stderr shows the following interesting entry:

VBoxManage.exe: error: Could not find a registered machine named 'boinc_9af796eea95542d4'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2773 of file VBoxManageInfo.cpp

Command: VBoxManage -q showhdinfo "C:\ProgramData\BOINC\slots\0/vm_image.vdi"
Exit Code: -2135228412
Output:
VBoxManage.exe: error: Could not find file for the medium 'C:\ProgramData\BOINC\slots\0\vm_image.vdi' (VERR_FILE_NOT_FOUND)
VBoxManage.exe: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 191 of file VBoxManageDisk.cpp


what's going wrong?
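A note on reading the excerpt above: the large negative "Exit Code" and the hexadecimal code in the "Details" line are the same value. The exit code is printed as a signed 32-bit integer, so masking it to 32 bits gives the unsigned form. A short Python sketch (the function name is made up for this example):

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Exit code of the showhdinfo command quoted above.
print(to_unsigned_hex(-2135228412))  # 0x80bb0004, i.e. VBOX_E_FILE_ERROR
```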
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2121
Credit: 169,263,210
RAC: 113,897
Message 47353 - Posted: 10 Oct 2022, 6:11:57 UTC - in response to Message 47352.  

The app_version currently in use doesn't attach a vm_image.vdi.
The related "error" can be ignored.

Instead it uses ATLAS_vbox_2.02_image.vdi which is correctly attached here:
2022-10-09 21:16:36 (14788): 
Command: VBoxManage -q storageattach "boinc_9af796eea95542d4" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi" 
Exit Code: 0
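To tell ignorable failures such as the `showhdinfo` one apart from the commands that matter, it can help to list every command in the stderr together with its exit code. The sketch below assumes the log layout seen in the excerpts in this thread (a `Command:` line followed directly by an `Exit Code:` line); other vboxwrapper versions may format their output differently, and the function name is hypothetical.

```python
import re

# One "Command: ..." line immediately followed by an "Exit Code: N" line.
PAIR = re.compile(
    r'^Command:[ \t]*(?P<cmd>\S.*?)[ \t]*\r?\nExit Code:[ \t]*(?P<code>-?\d+)',
    re.MULTILINE,
)


def commands_with_exit_codes(stderr_text: str) -> list[tuple[str, int]]:
    """Return (command, exit_code) pairs in the order they appear in the log."""
    return [(m["cmd"], int(m["code"])) for m in PAIR.finditer(stderr_text)]


# Sample built from the two log excerpts quoted in this thread.
SAMPLE = r"""Command: VBoxManage -q showhdinfo "C:\ProgramData\BOINC\slots\0/vm_image.vdi"
Exit Code: -2135228412
Output:
2022-10-09 21:16:36 (14788):
Command: VBoxManage -q storageattach "boinc_9af796eea95542d4" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi"
Exit Code: 0
"""

for cmd, code in commands_with_exit_codes(SAMPLE):
    status = "ok" if code == 0 else "FAILED"
    print(f"{status:6} {code:>12}  {cmd[:60]}")
```

In this sample only the `showhdinfo` call fails, while the `storageattach` call for `ATLAS_vbox_2.02_image.vdi` returns 0.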
Erich56

Joined: 18 Dec 15
Posts: 1545
Credit: 55,559,293
RAC: 74,483
Message 47354 - Posted: 10 Oct 2022, 6:23:05 UTC - in response to Message 47353.  

thanks for your quick reply.

So what else could be the cause of the problem?

Is it a coincidence that the problems I have described here lately only occur on the host that recently got a new SSD (after the old one became defective)?
On the other hand, many other ATLAS tasks have worked well on this host since the SSD was replaced.
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47355 - Posted: 10 Oct 2022, 6:44:53 UTC - in response to Message 47354.  

Computer ID: 10795955
Run time: 1 hour 21 min 18 sec
CPU time: 7 sec
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2121
Credit: 169,263,210
RAC: 113,897
Message 47356 - Posted: 10 Oct 2022, 6:46:39 UTC - in response to Message 47354.  

My guess (without evidence!) would be that there is a timing issue while the VM boots.
In that case it can't be solved client side.
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47357 - Posted: 10 Oct 2022, 6:50:11 UTC - in response to Message 47356.  

computezrmle wrote:
"My guess (without evidence!) would be that there is a timing issue while the VM boots.
In that case it can't be solved client side."

This task ran yesterday with the Squid test AND a new Windows installation (from three days ago)!
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47358 - Posted: 10 Oct 2022, 11:04:12 UTC - in response to Message 47357.  

This seems to be a timestamp problem when connecting to CVMFS and the other server.
Could the ATLAS team look at the logs of these servers?
I had 20 to 30 ATLAS tasks running well with Squid before.
Otherwise, when the connection fails, the task needs to exit.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2121
Credit: 169,263,210
RAC: 113,897
Message 47359 - Posted: 10 Oct 2022, 11:13:45 UTC - in response to Message 47358.  

maeax wrote: "... a timestamp problem ..."

A red herring.
It's neither a timestamp problem nor a squid problem.
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47360 - Posted: 10 Oct 2022, 11:44:48 UTC - in response to Message 47359.  

Take a Windows 11 Pro or Workstation PC and see what happens!
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2121
Credit: 169,263,210
RAC: 113,897
Message 47361 - Posted: 10 Oct 2022, 12:38:14 UTC - in response to Message 47360.  

No, thanks.
I'm able to differentiate between the host OS and the guest OS.
Erich56

Joined: 18 Dec 15
Posts: 1545
Credit: 55,559,293
RAC: 74,483
Message 47365 - Posted: 11 Oct 2022, 19:26:20 UTC

I just noticed that all 5 ATLAS tasks on my main cruncher were failing :-(

The VM console cannot be accessed, and under "Properties" I saw that although the tasks had run times between 7 and 8 hours (which is not normal; I just noticed too late), the CPU time was only between 40 and 60 minutes.

Here is an example; the others look the same, so I won't cite them all:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=366808818

excerpt of the stderr:

Command:
VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output:
VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running

2022-10-11 13:48:26 (12484): Error in pause VM for VM: -108
Command:
VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output:
VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running


Does anyone have an idea what happened?
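The "is not currently running" error above means the pause command was sent to a VM that VirtualBox no longer reports as running. One way to confirm this by hand is `VBoxManage showvminfo <name> --machinereadable`, which prints `key="value"` lines including `VMState`. The Python sketch below parses that output. The helper names are made up for this example, the sample text is illustrative rather than taken from the affected host, and this is not how vboxwrapper itself checks the VM.

```python
import re
import subprocess
from typing import Optional


def parse_vm_state(machinereadable: str) -> Optional[str]:
    """Extract the VMState value from 'showvminfo --machinereadable' output."""
    match = re.search(r'^VMState="([^"]*)"', machinereadable, re.MULTILINE)
    return match.group(1) if match else None


def query_vm_state(vm_name: str, vboxmanage: str = "VBoxManage") -> Optional[str]:
    """Ask VirtualBox for the VM state; returns None if the query fails."""
    result = subprocess.run(
        [vboxmanage, "-q", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=False,
    )
    return parse_vm_state(result.stdout) if result.returncode == 0 else None


def can_pause(state: Optional[str]) -> bool:
    """A pause request only makes sense for a VM that is currently running."""
    return state == "running"


# Illustrative output, shortened to the relevant lines (not from the real host).
SAMPLE_OUTPUT = 'name="boinc_4156a33e0a714d73"\nVMState="poweroff"\n'
state = parse_vm_state(SAMPLE_OUTPUT)
print(state, can_pause(state))
```

On a real host, `query_vm_state("boinc_4156a33e0a714d73")` would run the command; it is left uncalled here because it needs VirtualBox installed.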
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47367 - Posted: 12 Oct 2022, 2:01:44 UTC - in response to Message 47365.  
Last modified: 12 Oct 2022, 2:05:58 UTC

Yesterday I was testing the IP connection to my ISP.
When a new ATLAS task starts just at the moment of a disconnect (I saw three faulty ones),
it runs with only a few seconds of CPU time but a very long run time (hours...).
I will test it again today and over the next few days.
I only see it on these two fast Threadrippers.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=6&appid=
maeax

Joined: 2 May 07
Posts: 1692
Credit: 113,013,570
RAC: 308,060
Message 47372 - Posted: 13 Oct 2022, 23:09:30 UTC - in response to Message 47367.  
Last modified: 13 Oct 2022, 23:18:37 UTC

Name: b78KDmkO821nsSi4apGgGQJmABFKDmABFKDmuOfWDmWBTKDmReOcZm_0
Workunit: 196042022
Created: 13 Oct 2022, 8:05:35 UTC
Sent: 13 Oct 2022, 11:49:16 UTC
Report deadline: 21 Oct 2022, 11:49:16 UTC
Received: 13 Oct 2022, 22:52:48 UTC
Computer ID: 10795955
Run time: 7 hours 7 min
CPU time: 10 sec
Validate state: Invalid
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1149
Credit: 7,035,906
RAC: 669
Message 47374 - Posted: 14 Oct 2022, 10:51:46 UTC - in response to Message 47372.  

The cause of that problem has already been discussed in the long thread "ATLAS vbox v2.02": https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885

If you don't want to abort such a task, a possible way to retry it is mentioned here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885&postid=47086#47086

