Thread '"No storage device attached ..."'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 160,947,916 RAC: 43,786	Message 47340 - Posted: 4 Oct 2022, 7:35:23 UTC
Yesterday and today, on only one of my hosts, I had cases where the task started but CPU activity ended after less than 2 minutes, while the task kept running and running until I eventually happened to notice. Here are the tasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366570838
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366599960
The following entry in the stderr of the failed tasks caught my eye:
aText={No storage device attached to device slot 0 on port 0 of controller 'Hard Disk Controller'}, preserve=false aResultDetail=0
Does anyone have an idea what's going wrong?
BTW, this is the only one of my machines with the latest versions of both BOINC and Oracle VirtualBox:
BOINC client v7.20.2
VirtualBox 6.1.38
Maybe 6.1.38 has some problems?
ID: 47340

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47341 - Posted: 4 Oct 2022, 7:56:21 UTC - in response to Message 47340.
This happens when the CVMFS connection is not successful. Theory also had faulty tasks last night. Sometimes you have to stop the running ATLAS task yourself; otherwise the task never stops on its own.
ID: 47341

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,744,775 RAC: 76,143	Message 47342 - Posted: 4 Oct 2022, 8:19:17 UTC - in response to Message 47340.
Looks like CVMFS works fine, but there were Frontier problems last night that couldn't be solved by switching to fail-over servers. For about an hour now my logs haven't shown any Frontier fail-over requests, and things seem to be recovering (slowly).
This affects ATLAS and CMS. The scripts running on the client side usually can't recover from that. It would be best to cancel weird-looking tasks and start fresh ones.
ID: 47342

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 160,947,916 RAC: 43,786	Message 47343 - Posted: 4 Oct 2022, 8:59:42 UTC - in response to Message 47342.
Thanks, computezrmle, for your analysis and quick reply. I have now checked the other hosts; it seems no others were affected :-)
ID: 47343

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47345 - Posted: 4 Oct 2022, 17:37:13 UTC - in response to Message 47342.
computezrmle wrote: "Looks like CVMFS works fine but there were Frontier problems last night that couldn't be solved by switching to fail-over servers. Since an hour ago my logs don't show Frontier fail-over request any more and things seem to recover (slowly). This affects ATLAS and CMS. The scripts running on the client side usually can't recover from that. Best would be to cancel weird looking tasks and start fresh ones."
Believe what you believe.
ID: 47345

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47346 - Posted: 5 Oct 2022, 6:16:34 UTC - in response to Message 47345. Last modified: 5 Oct 2022, 6:18:18 UTC
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10593998
Client state: Aborted by user
Exit status: 203 (0x000000CB) EXIT_ABORTED_VIA_GUI
Computer ID: 10593998
Run time: 14 hours 24 min 57 sec
CPU time: 2 min 9 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 41.86 GFLOPS
Application version: ATLAS Simulation v2.02 (vbox64_mt_mcore_atlas) windows_x86_64
This example from Win11pro shows why your answer is not correct for Windows!!
ID: 47346
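The symptom reported in this thread (long run time, almost no CPU time) can be checked mechanically. Below is a minimal Python sketch using the figures from the post above. The function names and both thresholds (`min_run_time_s`, `max_cpu_ratio`) are assumptions chosen for illustration; they are not BOINC or LHC@home settings.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def looks_stalled(run_time_s: float, cpu_time_s: float,
                  min_run_time_s: float = 1800.0,
                  max_cpu_ratio: float = 0.05) -> bool:
    """Flag a task whose CPU time is tiny compared to its run time.

    Both thresholds are illustrative assumptions, not project values.
    Tasks younger than min_run_time_s are never flagged, because a
    booting VM legitimately uses little CPU at first.
    """
    if run_time_s < min_run_time_s:
        return False
    return cpu_time_s / run_time_s < max_cpu_ratio


# Figures from the post above: 14 h 24 min 57 s run time, 2 min 9 s CPU time.
run_time = to_seconds(14, 24, 57)
cpu_time = to_seconds(0, 2, 9)
print(run_time, cpu_time, looks_stalled(run_time, cpu_time))
```

Multi-core ATLAS tasks normally accumulate more CPU time than run time, so a healthy ratio is well above 1; the cut-off would need tuning before it is used to abort anything automatically.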

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 160,947,916 RAC: 43,786	Message 47352 - Posted: 10 Oct 2022, 5:24:29 UTC
Again there was a task with a CPU time of only 1:34 minutes, but the task continued running ...
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366757195
The stderr shows the following interesting entry:
VBoxManage.exe: error: Could not find a registered machine named 'boinc_9af796eea95542d4'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2773 of file VBoxManageInfo.cpp
Command: VBoxManage -q showhdinfo "C:\ProgramData\BOINC\slots\0/vm_image.vdi"
Exit Code: -2135228412
Output: VBoxManage.exe: error: Could not find file for the medium 'C:\ProgramData\BOINC\slots\0\vm_image.vdi' (VERR_FILE_NOT_FOUND)
VBoxManage.exe: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 191 of file VBoxManageDisk.cpp
What's going wrong?
ID: 47352
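The negative `Exit Code` in the excerpt above is the same value as the hexadecimal error code printed two lines later, read as a signed 32-bit integer. A small Python sketch of the conversion (the helper name is made up for illustration):

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Exit code taken from the stderr excerpt above.
print(exit_code_to_hex(-2135228412))  # 0x80bb0004, i.e. VBOX_E_FILE_ERROR in the log
```

This makes it easier to match a numeric exit code against the `VBOX_E_...` name that VBoxManage reports in the `Details:` line.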

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,744,775 RAC: 76,143	Message 47353 - Posted: 10 Oct 2022, 6:11:57 UTC - in response to Message 47352.
p_version currently in use doesn't attach a vm_image.vdi. The related "error" can be ignored.
Instead it uses ATLAS_vbox_2.02_image.vdi which is correctly attached here:
[pre]2022-10-09 21:16:36 (14788): Command: VBoxManage -q storageattach "boinc_9af796eea95542d4" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi"
Exit Code: 0[/pre]
ID: 47353
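To confirm in a longer stderr that the disk image really was attached, one can search for the `storageattach` command and the `Exit Code` that follows it. The Python sketch below does this with a regular expression. It assumes the log layout seen in this thread (the `Exit Code:` line directly after the `Command:` line); the function name and pattern are illustrative, not part of BOINC or VirtualBox.

```python
import re

# Matches a storageattach command, its --medium argument and the exit code
# that follows. Assumes "Exit Code:" comes right after the command line.
ATTACH_RE = re.compile(
    r'storageattach\s+"(?P<vm>[^"]+)".*?--medium\s+"(?P<medium>[^"]+)"'
    r'\s+Exit Code:\s*(?P<code>-?\d+)',
    re.DOTALL,
)


def find_attached_media(stderr_text: str) -> list:
    """Return (vm_name, medium_path, exit_code) for each storageattach call."""
    return [(m["vm"], m["medium"], int(m["code"]))
            for m in ATTACH_RE.finditer(stderr_text)]


# Log lines quoted in the post above.
sample = r'''2022-10-09 21:16:36 (14788): Command: VBoxManage -q storageattach "boinc_9af796eea95542d4" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi"
Exit Code: 0'''

for vm, medium, code in find_attached_media(sample):
    print(vm, medium, "OK" if code == 0 else f"failed ({code})")
```

An exit code of 0 for the `ATLAS_vbox_2.02_image.vdi` medium matches the statement above that the image is attached correctly.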

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 160,947,916 RAC: 43,786	Message 47354 - Posted: 10 Oct 2022, 6:23:05 UTC - in response to Message 47353.
Thanks for your quick reply. So what else could be the cause of the problem?
Is it a coincidence that the problems I described here lately are happening only on the host which recently got a new SSD (after the former one became defective)?
On the other hand, many other ATLAS tasks have worked well on this host since the SSD was exchanged.
ID: 47354

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47355 - Posted: 10 Oct 2022, 6:44:53 UTC - in response to Message 47354.
Computer ID: 10795955
Run time: 1 hour 21 min 18 sec
CPU time: 7 sec
ID: 47355

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,744,775 RAC: 76,143	Message 47356 - Posted: 10 Oct 2022, 6:46:39 UTC - in response to Message 47354.
My guess (without evidence!) would be that there is a timing issue while the VM boots. In that case it can't be solved on the client side.
ID: 47356

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47357 - Posted: 10 Oct 2022, 6:50:11 UTC - in response to Message 47356.
computezrmle wrote: "My guess (without evidence!) would be that there is a timing issue while the VM boots. In that case it can't be solved client side."
This task ran yesterday with Squid-Test AND a new Windows (from three days ago)!
ID: 47357

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47358 - Posted: 10 Oct 2022, 11:04:12 UTC - in response to Message 47357.
This seems to be a timestamp problem when connecting to CVMFS and the other server. Is it possible to see the logs of these servers from the ATLAS team?
I had 20 - 30 ATLAS tasks running well with Squid before.
Otherwise, when the connection fails, we need an exit.
ID: 47358

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,744,775 RAC: 76,143	Message 47359 - Posted: 10 Oct 2022, 11:13:45 UTC - in response to Message 47358.
maeax wrote: "... a timestamp problem ..."
A red herring. It's neither a timestamp problem nor a squid problem.
ID: 47359

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47360 - Posted: 10 Oct 2022, 11:44:48 UTC - in response to Message 47359.
Take a Windows 11 Pro or Workstation PC and see what happens!
ID: 47360

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,744,775 RAC: 76,143	Message 47361 - Posted: 10 Oct 2022, 12:38:14 UTC - in response to Message 47360.
No, thanks. I'm able to differentiate between the host OS and the guest OS.
ID: 47361

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 160,947,916 RAC: 43,786	Message 47365 - Posted: 11 Oct 2022, 19:26:20 UTC
I just noticed that all 5 ATLAS tasks on my main cruncher were failing :-(
The VM console cannot be accessed, and under "Properties" I saw that although the tasks had run times between 7 and 8 hours (which they normally don't; I just found out too late), the CPU time was only between 40 and 60 minutes.
Here is an example; the others look the same, so I don't need to cite them all:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366808818
Excerpt of the stderr:
Command: VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output: VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running
2022-10-11 13:48:26 (12484): Error in pause VM for VM: -108
Command: VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output: VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running
Does anyone have an idea what happened?
ID: 47365

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47367 - Posted: 12 Oct 2022, 2:01:44 UTC - in response to Message 47365. Last modified: 12 Oct 2022, 2:05:58 UTC
Yesterday I was testing the IP connection to my ISP. When a new ATLAS task starts just at the moment of a disconnect (I saw three faulty ones), it runs with only a few seconds of CPU time but a long duration (hours...).
I will test it again today and in the next days. I am only seeing it on these two fast Threadrippers.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=6&appid=
ID: 47367

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,589,842 RAC: 30,721	Message 47372 - Posted: 13 Oct 2022, 23:09:30 UTC - in response to Message 47367. Last modified: 13 Oct 2022, 23:18:37 UTC
Name: b78KDmkO821nsSi4apGgGQJmABFKDmABFKDmuOfWDmWBTKDmReOcZm_0
Workunit: 196042022
Created: 13 Oct 2022, 8:05:35 UTC
Sent: 13 Oct 2022, 11:49:16 UTC
Report deadline: 21 Oct 2022, 11:49:16 UTC
Received: 13 Oct 2022, 22:52:48 UTC
Computer ID: 10795955
Run time: 7 hours 7 min
CPU time: 10 sec
Validate state: Invalid
ID: 47372

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1553 Credit: 10,081,726 RAC: 1,368	Message 47374 - Posted: 14 Oct 2022, 10:51:46 UTC - in response to Message 47372.
The cause of that problem has already been discussed in the long thread ATLAS vbox v2.02 - https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885
If you don't want to abort such a task, a possible way to retry it is mentioned here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885&postid=47086#47086
ID: 47374