Message boards :
ATLAS application :
"No starage device attached ..."
Message board moderation
Send message Joined: 18 Dec 15 Posts: 1823 Credit: 119,020,746 RAC: 16,628 |
Yesterday and today, on only one of my hosts, I had cases where a task started but CPU activity ended after less than 2 minutes, while the task continued running and running ... until I eventually happened to find out. Here are the tasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366570838
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366599960
The following entry in the stderr of the failed tasks caught my eye:
aText={No storage device attached to device slot 0 on port 0 of controller 'Hard Disk Controller'}, preserve=false aResultDetail=0
Anyone any idea what's going wrong?
BTW, this is the only one of my machines with the latest versions of both BOINC and Oracle VirtualBox:
BOINC client v7.20.2
VirtualBox 6.1.38
Maybe 6.1.38 has some problems?
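If the VM is still registered, the slot named in that error can be inspected directly. A minimal sketch, assuming the `"<controller>-<port>-<device>"` key format of `VBoxManage showvminfo <vm> --machinereadable` output; the demo parses a canned sample so it runs without a live VM:

```shell
# Hedged sketch: inspect what is attached to port 0 / device 0 of the
# "Hard Disk Controller" of a BOINC VM. The key format below mimics
# `VBoxManage showvminfo <vm> --machinereadable` output (an assumption);
# the demo parses a canned sample instead of querying a live VM.
check_slot() {
  # $1 = machine-readable showvminfo output
  line=$(printf '%s\n' "$1" | grep '^"Hard Disk Controller-0-0"=')
  case "$line" in
    '')        echo "slot not present" ;;
    *'"none"') echo "no medium attached" ;;
    *)         echo "attached: ${line#*=}" ;;
  esac
}

sample='"Hard Disk Controller-0-0"="none"'
check_slot "$sample"   # -> no medium attached
```

A real run would capture `VBoxManage showvminfo "boinc_..." --machinereadable` into a variable and pass that in instead of the sample.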
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
This happens when the CVMFS connection is not successful. Theory also had faulty tasks last night. Sometimes you have to stop such a running ATLAS task yourself; otherwise it will never stop on its own.
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
Looks like CVMFS works fine, but there were Frontier problems last night that couldn't be solved by switching to fail-over servers. Since an hour ago my logs don't show Frontier fail-over requests any more, and things seem to recover (slowly). This affects ATLAS and CMS. The scripts running on the client side usually can't recover from that. Best would be to cancel weird-looking tasks and start fresh ones.
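On the client, spotting such an incident comes down to grepping a task's stderr for the fail-over hint. A minimal sketch; the matched phrase and the sample text are assumptions based on the wording quoted in this thread:

```shell
# Hedged sketch: count Frontier fail-over hints in a task's stderr text.
# The phrase matched below is an assumption based on the wording in this
# thread; adapt it to what your logs actually print.
count_failovers() {
  # stdin: stderr/log text; prints the number of matching lines
  grep -ci "fail-*over" || true
}

sample='2022-10-09 03:12:00: Frontier fail-over request to backup server
2022-10-09 03:14:10: Frontier fail-over request to backup server'
printf '%s\n' "$sample" | count_failovers   # -> 2
```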
Send message Joined: 18 Dec 15 Posts: 1823 Credit: 119,020,746 RAC: 16,628 |
Thanks, computezrmle, for your analysis and quick reply. I have now checked the other hosts - as it seems, no others were affected :-)
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
"Looks like CVMFS works fine but there were Frontier problems last night that couldn't be solved by switching to fail-over servers."
Believe what you believe.
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10593998
Client status: Aborted by user
Exit status: 203 (0x000000CB) EXIT_ABORTED_VIA_GUI
Computer ID: 10593998
Run time: 14 hours 24 min 57 sec
CPU time: 2 min 9 sec
Validation state: Invalid
Credit: 0.00
Device peak FLOPS: 41.86 GFLOPS
Application version: ATLAS Simulation v2.02 (vbox64_mt_mcore_atlas) windows_x86_64
This example from Windows 11 Pro shows why your answer is not correct for Windows!!
Send message Joined: 18 Dec 15 Posts: 1823 Credit: 119,020,746 RAC: 16,628 |
Again there was a task with a CPU time of only 1:34 minutes, but the task continued running ...
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366757195
The stderr shows the following interesting entries:
VBoxManage.exe: error: Could not find a registered machine named 'boinc_9af796eea95542d4'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2773 of file VBoxManageInfo.cpp
Command: VBoxManage -q showhdinfo "C:\ProgramData\BOINC\slots\0/vm_image.vdi"
Exit Code: -2135228412
Output: VBoxManage.exe: error: Could not find file for the medium 'C:\ProgramData\BOINC\slots\0\vm_image.vdi' (VERR_FILE_NOT_FOUND)
VBoxManage.exe: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 191 of file VBoxManageDisk.cpp
What's going wrong?
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
The app_version currently in use doesn't attach a vm_image.vdi, so the related "error" can be ignored. Instead it uses ATLAS_vbox_2.02_image.vdi, which is correctly attached here:
2022-10-09 21:16:36 (14788): Command: VBoxManage -q storageattach "boinc_9af796eea95542d4" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi"
Exit Code: 0
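To double-check on a host whether that shared image is actually registered, one could parse `VBoxManage list hdds` output. A minimal sketch; the sample imitates that output format (an assumption) so the demo runs without VirtualBox installed:

```shell
# Hedged sketch: check whether the shared ATLAS image shows up among the
# disks VirtualBox has registered. The sample imitates `VBoxManage list hdds`
# output (an assumption); a real check would pipe that command in instead.
image_registered() {
  # $1 = image file name, stdin = `VBoxManage list hdds` style output
  if grep -q "Location:.*$1"; then
    echo "registered"
  else
    echo "not registered"
  fi
}

sample='UUID:       0123-abcd
Location:   C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome\ATLAS_vbox_2.02_image.vdi
State:      created'
printf '%s\n' "$sample" | image_registered "ATLAS_vbox_2.02_image.vdi"   # -> registered
```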
Send message Joined: 18 Dec 15 Posts: 1823 Credit: 119,020,746 RAC: 16,628 |
Thanks for your quick reply. So what else could be the cause of the problem? Is it a coincidence that the problems I described here lately are happening only on the host which recently got a new SSD (after the former one became defective)? On the other hand, many other ATLAS tasks have run well on this host since the SSD was exchanged.
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
Computer ID: 10795955
Run time: 1 hour 21 min 18 sec
CPU time: 7 sec
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
My guess (without evidence!) would be that there is a timing issue while the VM boots. In that case it can't be solved client side. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
"My guess (without evidence!) would be that there is a timing issue while the VM boots."
This task ran yesterday with the Squid test AND a fresh Windows install (three days ago)!
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
This seems to be a timestamp problem when connecting to CVMFS and the other servers. Is it possible for the ATLAS team to look at the logs of these servers? I had 20-30 ATLAS tasks running well with Squid before. Otherwise, when the connection is wrong, we need an exit.
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
... a timestamp problem ... A red herring. It's neither a timestamp problem nor a squid problem. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
Take a Windows 11 Pro or workstation PC and see what happens!
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
No, thanks. I'm able to differentiate between the host OS and the guest OS. |
Send message Joined: 18 Dec 15 Posts: 1823 Credit: 119,020,746 RAC: 16,628 |
I just noticed that all 5 ATLAS tasks on my main cruncher were failing :-(
The VM console cannot be accessed, and under "Properties" I saw that although the tasks had runtimes between 7 and 8 hours (which they normally don't; I found out too late), the CPU time was only between 40 and 60 minutes. Here is an example; the others look the same, so I don't need to cite them all:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=366808818
Excerpt from the stderr:
Command: VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output: VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running
2022-10-11 13:48:26 (12484): Error in pause VM for VM: -108
Command: VBoxManage -q controlvm "boinc_4156a33e0a714d73" pause
Output: VBoxManage.exe: error: Machine 'boinc_4156a33e0a714d73' is not currently running
Anyone any idea what happened?
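Stuck tasks like these can be spotted mechanically: elapsed time keeps growing while CPU time barely moves. A minimal sketch; the three-column input format and the thresholds are my own assumptions, and a real check would feed it values taken from `boinccmd --get_tasks` or the Manager's task properties:

```shell
# Hedged sketch: flag tasks whose elapsed time vastly exceeds CPU time.
# The input format and the thresholds (>1 h elapsed, <5 min CPU) are
# assumptions, not part of any BOINC tooling.
flag_runaway() {
  # stdin: "name elapsed_seconds cpu_seconds", one task per line
  awk '$3 < 300 && $2 > 3600 { print $1, "looks stuck (elapsed", $2 "s, cpu", $3 "s)" }'
}

printf '%s\n' \
  "task_a 51900 129" \
  "task_b 4800 4650" | flag_runaway
# -> task_a looks stuck (elapsed 51900s, cpu 129s)
```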
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
Yesterday I was testing the IP connection to my ISP. When a new ATLAS task starts just at the moment of such a disconnect (I saw three faulty ones), it runs with only a few seconds of CPU time but hours of elapsed time. I will test it again today and over the next days. I only see this on these two fast Threadrippers.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=6&appid=
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 206 |
Name: b78KDmkO821nsSi4apGgGQJmABFKDmABFKDmuOfWDmWBTKDmReOcZm_0
Workunit: 196042022
Created: 13 Oct 2022, 8:05:35 UTC
Sent: 13 Oct 2022, 11:49:16 UTC
Deadline: 21 Oct 2022, 11:49:16 UTC
Received: 13 Oct 2022, 22:52:48 UTC
Computer ID: 10795955
Run time: 7 hours 7 min
CPU time: 10 sec
Validation state: Invalid
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 573 |
The cause of that problem is already discussed in the long thread "ATLAS vbox v2.02": https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885
If you don't want to abort such a task, a possible way to retry it is mentioned here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5885&postid=47086#47086
©2025 CERN