Message boards : LHCb Application : LHCb/other tasks failing after putting computer into hibernation state?

MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 32719 - Posted: 9 Oct 2017, 15:11:23 UTC
Last modified: 9 Oct 2017, 15:13:53 UTC

After 22 days of uptime with relatively few problems running LHCb or other tasks (just a few errors here and there), today I found that after waking the machine from hibernation and starting 3 LHCb tasks and 1 CMS task, they all promptly failed. I also had one LHCb task in progress from the previous night, over 50% complete, and it failed as well when I resumed it, along with the 3 others I had just begun.
The "Exit status" error varied for each task, but the log in all of them contained:

"2017-10-09 10:36:00 (3196): Guest Log: 10/09/17 10:26:04 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state"

I don't know much about what the log output means or whether it is relevant at all. The only setting I changed between last night and this morning (besides putting my machine into hibernation for the night) was CPU time, which I raised from 50% to 60%. Thanks for any help.

Edit: it seems all of the tasks I attempt are failing almost immediately. I should probably stop trying to run these for now, yeah?
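
For readers who want to pull the relevant lines out of a failed task's log, here is a minimal Python sketch. It assumes only that guest messages are prefixed with "Guest Log: " as in the line quoted above; the function names and the marker strings are illustrative choices, not part of BOINC or the LHC@home tooling.

```python
import re

# Matches everything after the "Guest Log: " prefix seen in the task log.
GUEST_LOG = re.compile(r"Guest Log: (?P<message>.*)$")

# Hypothetical markers worth flagging; extend the tuple as needed.
MARKERS = ("[ERROR]", "HibernationSupportedStates invalid")


def guest_log_messages(stderr_text):
    """Return the message part of every 'Guest Log:' line in the log text."""
    messages = []
    for line in stderr_text.splitlines():
        # Drop surrounding whitespace and any quote marks left by copy/paste.
        match = GUEST_LOG.search(line.strip().strip('"'))
        if match:
            messages.append(match.group("message"))
    return messages


def flagged_messages(stderr_text, markers=MARKERS):
    """Return only the guest messages that contain one of the markers."""
    return [
        message
        for message in guest_log_messages(stderr_text)
        if any(marker in message for marker in markers)
    ]
```

Passing the text of a task's log to `flagged_messages` returns the guest messages that contain one of the markers, which makes it easier to compare several failed tasks side by side.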
ID: 32719
Erich56
Joined: 18 Dec 15
Posts: 1265
Credit: 23,005,935
RAC: 1,958
Message 32721 - Posted: 9 Oct 2017, 16:17:39 UTC - in response to Message 32719.  

Well, right now no tasks other than ATLAS seem to be available anyway, at least according to the Project Status Page:
https://lhcathome.cern.ch/lhcathome/server_status.php
ID: 32721
MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 32722 - Posted: 9 Oct 2017, 16:35:00 UTC - in response to Message 32721.  

Oh weird, when I checked it was all up and running. Would this explain my issue?
ID: 32722
ChristianVirtual
Joined: 14 May 17
Posts: 3
Credit: 1,004,936
RAC: 0
Message 36437 - Posted: 15 Aug 2018, 12:03:16 UTC

Sorry for the resurrection, but I had the same issue:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=204084499

I paused a number of WUs to finish some tasks from a different project, and when I returned, the WUs failed, dumping the work already done.
Is there a "correct" way to hold VM-based WUs? If not, then next time I can dump the WU directly if a change in processing sequence is required on my client.
ID: 36437
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 930
Credit: 39,635,238
RAC: 17,817
Message 36447 - Posted: 15 Aug 2018, 19:15:38 UTC - in response to Message 36437.  

You just got the typical error for the VB tasks that happens once in a while (depending on how many you are running):

Guest Log: [ERROR] Condor exited after 111913s without running a job.

The main thing is that you can check the VB Manager to make sure the tasks are saved and suspended, and then switch to other tasks (the same applies if you have to reboot for any reason).

The one you lost had been running for less than an hour, so it is no big deal; just get new tasks and start over again if you want to run LHCb. Soon I hope to have the multi-core version moved over here too.
ID: 36447
ChristianVirtual
Joined: 14 May 17
Posts: 3
Credit: 1,004,936
RAC: 0
Message 36450 - Posted: 15 Aug 2018, 20:49:11 UTC - in response to Message 36447.  
Last modified: 15 Aug 2018, 20:49:45 UTC

Thanks for the quick response. Any suggestions on the sequence of steps?
1) First suspend the VM, then suspend the WU in BOINC Manager
2) First suspend the WU, then suspend the VM
3) Doesn't matter, just suspend both the WU and the respective VM within a short timeframe
TIA
ID: 36450
MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 36532 - Posted: 22 Aug 2018, 22:45:03 UTC - in response to Message 36450.  

Perhaps a dumb question, but how does one check if tasks are "saved"? How do you suspend the VM separately from the WU?
I've been away from LHC@home and other distributed computing projects for several months now; I forgot a lot of things about this stuff.
ID: 36532
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 930
Credit: 39,635,238
RAC: 17,817
Message 36534 - Posted: 23 Aug 2018, 6:53:50 UTC
Last modified: 23 Aug 2018, 6:56:20 UTC

To make sure the VB tasks are saved, just bring up your VB Manager and you will see whether they are saved, running, or paused.

And regarding the question about the sequence of steps:

Just suspend the WU, and soon after you will see that it was also suspended/saved in the VB Manager.
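
For anyone who prefers the command line to the VB Manager window, the same status check can be sketched in Python. This is a rough sketch under stated assumptions, not something taken from the posts: it assumes `VBoxManage` is on the PATH and that `showvminfo --machinereadable` prints a `VMState="..."` line. The function names and the VM name are placeholders. The query only reads the state and never changes it.

```python
import subprocess


def parse_vm_state(machinereadable_output):
    """Extract the VMState value from `showvminfo --machinereadable` output."""
    for line in machinereadable_output.splitlines():
        key, sep, value = line.partition("=")
        # Match the key exactly so VMStateChangeTime is not picked up.
        if sep and key.strip() == "VMState":
            return value.strip().strip('"')
    return None


def vm_state(vm_name, vboxmanage="VBoxManage"):
    """Read-only status query for one VM; it does not alter the VM."""
    result = subprocess.run(
        [vboxmanage, "showvminfo", vm_name, "--machinereadable"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vm_state(result.stdout)
```

Calling `vm_state("name-of-your-vm")` shortly after suspending the WU in BOINC should then report a state such as `saved`, `paused`, or `running`, matching what the VB Manager shows.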
ID: 36534
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1439
Credit: 74,004,253
RAC: 119,034
Message 36535 - Posted: 23 Aug 2018, 6:55:59 UTC - in response to Message 36450.  

There's a communication chain between different software layers:
BOINC client <--> vboxwrapper <--> VirtualBox Hypervisor <--> LHC VM

The only right way to suspend/resume a VM is via BOINC client.
You may use your VirtualBox Manager for status checks but never use it to suspend a VM.
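
As a hedged illustration of suspending through the BOINC client rather than VirtualBox, the sketch below builds a `boinccmd --task <project URL> <task name> suspend|resume` command in Python. The project URL is an assumption based on the links in this thread, the task name is a placeholder, and the helper names are made up for the example. Depending on the installation, `boinccmd` may need to be run from the BOINC data directory or given the GUI RPC password.

```python
import subprocess

# Assumption: the project URL as registered in the local BOINC client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def task_command(task_name, operation, project_url=PROJECT_URL, boinccmd="boinccmd"):
    """Build the boinccmd argument list for suspending or resuming one task."""
    if operation not in ("suspend", "resume"):
        raise ValueError("operation must be 'suspend' or 'resume'")
    return [boinccmd, "--task", project_url, task_name, operation]


def run_task_command(task_name, operation):
    """Ask the BOINC client to act; it passes the request down to the VM."""
    subprocess.run(task_command(task_name, operation), check=True)
```

The point of the design is that the request enters at the top of the chain, so vboxwrapper and VirtualBox stay in sync with what the BOINC client believes the task is doing.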
ID: 36535