Message boards : Theory Application : New version v300.20

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2519
Credit: 250,955,351
RAC: 127,484
Message 50212 - Posted: 20 May 2024, 6:58:20 UTC - in response to Message 50211.  

Why not?
ID: 50212
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,730
RAC: 9,392
Message 50213 - Posted: 20 May 2024, 7:05:55 UTC - in response to Message 50211.  

Post the output of
mount |grep shm ; ls -hal /dev/shm/

Then prepare for a reboot.
Then reboot.

Why?


Because the moderator wants to help you!

The cause of your problem is that you load too many tasks, want to start them (virtual machines) all at once, and even abort tasks that are still in their startup phase.
Try again with more patience. Before the reboot, you could remove all LHC-related VMs with the VirtualBox Media Manager, but keep the vdi disk files on disk.
ID: 50213
hadron

Joined: 4 Sep 22
Posts: 90
Credit: 15,105,237
RAC: 30,902
Message 50215 - Posted: 20 May 2024, 8:49:59 UTC - in response to Message 50213.  

Chill, OK?
The "instructions" I was given are nearly useless, but as a Windows user, I wouldn't expect you to realize that.
I was given a set of things to do with no explanation whatsoever. "Show the contents of (some directory), then reboot" -- then why bother to ask to see the contents of that directory?
But (lucky me), he posted just enough to give me a clue what to go looking for -- and I'm probably not going to have to reboot my machine at all. However, right now, my job queue has been filled up with Einstein and Rosetta tasks, and until those are gone, it's fairly likely that BOINC won't fetch anything from LHC because I "don't need" it. So I will just have to wait until the job queue is nearly empty, then turn Theory back on, and force a scheduler request for LHC. I'll get back to you on this at the appropriate time.

As for your suggestions, your initial assumptions are simply not true. First, what do you mean by "load too many tasks"? All that is up to BOINC; I have no control over the number of tasks loaded, except via the number of threads I allot to BOINC and the limit on running tasks set in app_config.xml.
Secondly, I don't try to start too many tasks all at once; that again is strictly BOINC's decision to make, and I have no way to control this either.
Finally, I do not abort pending tasks until I have seen at least 3 or 4 consecutive tasks fail for whatever reason. It is up to BOINC to clean up after a task has been aborted, and if that happens properly, there should be nothing left of any task, even ones that have just started running.
However, thanks for your suggestions anyway.
ID: 50215
hadron

Joined: 4 Sep 22
Posts: 90
Credit: 15,105,237
RAC: 30,902
Message 50216 - Posted: 20 May 2024, 11:41:51 UTC

Problem solved. Fortunately, being an experienced Linux user, I was able to use that nearly-useless post to go digging.

There was a lockfile in /dev/shm that was created not very long before the problems started:

# ls -hal /dev/shm/
total 0
drwxrwxrwt  2 root  root    60 May 19 19:09 .
drwxr-xr-x 22 root  root  4.8K May  4 20:47 ..
-rw-------  1 boinc boinc    0 May 11 09:48 boinc_vboxwrapper_lock_e086e43dd21d28b7

I am at UTC-06:00, so that is May 11, 15:48 UTC. The very first Theory task to fail after only 3 minutes was reported at 16:27 UTC.
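
The offset arithmetic above (09:48 at UTC-06:00 is 15:48 UTC) can be double-checked from the shell. This is only a minimal sketch: it assumes GNU date, since the -d option is not available in BSD/macOS date, and the helper name to_utc is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a timestamp with an explicit UTC offset to UTC.
# Assumes GNU date; -d is a GNU extension.
to_utc() {
    date -u -d "$1" '+%Y-%m-%d %H:%M UTC'
}

# Lockfile modification time from the listing above, read as UTC-06:00.
to_utc '2024-05-11 09:48 -0600'
```

The result should agree with the 15:48 UTC quoted above, which is 39 minutes before the first failure reported at 16:27 UTC.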

Now, that very same lockfile is mentioned in the stderr.txt of every single failed Theory task, twice in fact:
...
2024-05-19 18:50:33 (23676): Could not set race mitigation lock.
2024-05-19 18:50:33 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
2024-05-19 18:50:33 (23676): Error: ERR_TIMEOUT
2024-05-19 18:50:33 (23676): Attempts: 48
2024-05-19 18:50:33 (23676): Could not set race mitigation lock in 'create_vm'.
2024-05-19 18:50:33 (23676): Could not create VM
2024-05-19 18:50:33 (23676): ERROR: VM failed to start
2024-05-19 18:50:33 (23676): Powering off VM.
2024-05-19 18:50:33 (23676): Deregistering VM. (boinc_372f2cc2be2d23c7, slot#9)
2024-05-19 18:50:33 (23676): Removing network bandwidth throttle group from VM.
2024-05-19 18:52:04 (23676): Could not set race mitigation lock.
2024-05-19 18:52:04 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
2024-05-19 18:52:04 (23676): Error: ERR_TIMEOUT
2024-05-19 18:52:04 (23676): Attempts: 48
2024-05-19 18:52:04 (23676): Could not set race mitigation lock in 'deregister_vm'.
2024-05-19 18:52:04 (23676): Warning: Will continue without a lock.
2024-05-19 18:52:04 (23676): Removing VM from VirtualBox.
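
To pull the lock name out of a failed task's stderr.txt without reading the whole log, something like the following sketch may help. It assumes the log keeps the "Lockname: '/name'" format shown in the excerpt above and that the name maps to a file of the same name under /dev/shm, as it did in this case; the function name lock_paths_from_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct lock names reported in a vboxwrapper
# stderr.txt, rewritten as paths under /dev/shm.
# Assumption: lines look like "... Lockname: '/boinc_vboxwrapper_lock_<id>'".
lock_paths_from_log() {
    sed -n "s|.*Lockname: '/\([^']*\)'.*|/dev/shm/\1|p" "$1" | sort -u
}

# Usage (the location of stderr.txt is an assumption):
#   lock_paths_from_log stderr.txt
```

Because of sort -u, a lock that is mentioned twice per task, as in the excerpt, is printed only once.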

I believe that, when the task is first being set up, BOINC creates that lockfile for whatever reason the programmers find desirable/necessary, and is then supposed to delete it when it is no longer needed. It looks like this one didn't get deleted for some reason -- a bug perhaps? One that is only rarely encountered?
This all screams, "Delete the file and try again," so that is what I did. No lockfile, no failed tasks -- and definitely no reboot needed; I now have one task waiting to be reported, and 4 more running quite happily.
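
For anyone who wants to look for a leftover lockfile before deleting anything, here is a rough sketch. It only lists candidates and never removes them. The name pattern boinc_vboxwrapper_lock_* is taken from the listing above, the 60-minute default is an arbitrary assumption, and the function name stale_locks is hypothetical; it relies on a find that supports -maxdepth and -mmin (GNU and BSD find do).

```shell
#!/bin/sh
# Hypothetical helper: print vboxwrapper lock files in a directory that are
# older than a given number of minutes. It lists only; it does not delete.
# Assumptions: the "boinc_vboxwrapper_lock_*" naming seen in this thread and
# a 60-minute default age threshold chosen purely for illustration.
stale_locks() {
    dir=${1:-/dev/shm}
    minutes=${2:-60}
    find "$dir" -maxdepth 1 -type f -name 'boinc_vboxwrapper_lock_*' \
        -mmin "+$minutes" -print
}

# Usage:
#   stale_locks /dev/shm 60
```

Whether an old lock is really stale is a judgment call; it seems safest to check while no Theory task is in its startup phase and to remove the file by hand afterwards.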
ID: 50216
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,730
RAC: 9,392
Message 50217 - Posted: 20 May 2024, 12:55:41 UTC - in response to Message 50216.  

Very nice that you solved the problem with your Linux experience.
For inexperienced users, a reboot should have cleared that lockfile too, I suppose.
Therefore just one question: have you not rebooted your system since May 11 09:48?
ID: 50217
hadron

Joined: 4 Sep 22
Posts: 90
Credit: 15,105,237
RAC: 30,902
Message 50218 - Posted: 20 May 2024, 15:04:54 UTC - in response to Message 50217.  

/dev/shm is a tmpfs mount, i.e. a filesystem that exists only in memory, which Linux programs use to pass data to each other efficiently -- so a reboot will clear everything in it.
However, I am a firm believer in not rebooting a system unless it is absolutely necessary, and getting rid of a single miscreant lockfile is not IMO sufficient cause to reboot.
So no, I have not rebooted my system since the 11th. In fact, the last time I did reboot was after a system upgrade on May 4.
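
On most Linux systems /dev/shm is a tmpfs mount, which is why its contents do not survive a reboot. A small sketch for confirming that on a given machine follows; it reads a mounts table in /proc/mounts format, and the function name mount_fstype and its optional second argument are hypothetical, added so the parsing can be tried against a sample file.

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem type mounted at a mount point,
# read from a table in /proc/mounts format (device, mount point, type, ...).
# If several entries match, the last one (the most recent mount) wins.
mount_fstype() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# Usage on a Linux system:
#   mount_fstype /dev/shm
```

If the result is tmpfs, files there live in memory (or swap) only, so a stale lock disappears at the next boot even if nobody deletes it.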
ID: 50218

©2024 CERN