New version v300.20

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert)
Joined: 15 Jun 08 | Posts: 2636 | Credit: 276,041,253 | RAC: 139,400
Message 50212 - Posted: 20 May 2024, 6:58:20 UTC - in response to Message 50211.

Why not?

Crystal Pellet (Volunteer moderator, Volunteer tester)
Joined: 14 Jan 10 | Posts: 1450 | Credit: 9,746,678 | RAC: 953
Message 50213 - Posted: 20 May 2024, 7:05:55 UTC - in response to Message 50211.

Post the output of

    mount | grep shm ; ls -hal /dev/shm/

Then prepare for a reboot. Then reboot.

Why? Because the moderator wants to help you!

The cause of your problem is that you load too many tasks, want to start them (virtual machines) all at once, and even abort tasks that are just in their startup phase. Try it again with more patience.

Before the reboot, maybe you could remove all LHC-related VMs with VirtualBox Media Manager, but keep the vdi disk files on disk.

hadron
Joined: 4 Sep 22 | Posts: 100 | Credit: 17,073,659 | RAC: 1,773
Message 50215 - Posted: 20 May 2024, 8:49:59 UTC - in response to Message 50213.

Quoting Message 50213:

    Post the output of mount | grep shm ; ls -hal /dev/shm/ Then prepare for a reboot. Then reboot. Why? Because the moderator wants to help you! The cause of your problem is that you load too many tasks, want to start them (virtual machines) all at once, and even abort tasks that are just in their startup phase. Try it again with more patience. Before the reboot, maybe you could remove all LHC-related VMs with VirtualBox Media Manager, but keep the vdi disk files on disk.

Chill, OK? The "instructions" I was given are nearly useless, but as a Windows user, I wouldn't expect you to realize that. I was given a set of things to do with no explanation whatsoever. "Show the contents of (some directory), then reboot" -- then why bother to ask to see the contents of that directory? But (lucky me), he posted just enough to give me a clue what to go looking for -- and I'm probably not going to have to reboot my machine at all.

However, right now, my job queue has been filled up with Einstein and Rosetta tasks, and until those are gone, it's fairly likely that BOINC won't fetch anything from LHC because I "don't need" it. So I will just have to wait until the job queue is nearly empty, then turn Theory back on, and force a scheduler request for LHC. I'll get back to you on this at the appropriate time.

As for your suggestions, your initial assumptions are simply not true.

First, what do you mean by "load too many tasks"? All that is up to BOINC; I have no control over the number of tasks loaded, except via the number of threads I allot to BOINC use, as well as by controlling the number of running tasks in app_config.xml.

Second, I don't try to start too many tasks all at once; that again is strictly BOINC's decision to make, and I have no way to control this either.

Finally, I do not abort pending tasks until I have seen at least 3 or 4 consecutive tasks fail for whatever reason. It is up to BOINC to clean up after a task has been aborted, and if that happens properly, there should be nothing left of any task, even ones that have just started running.

However, thanks for your suggestions anyway.

hadron
Joined: 4 Sep 22 | Posts: 100 | Credit: 17,073,659 | RAC: 1,773
Message 50216 - Posted: 20 May 2024, 11:41:51 UTC

Problem solved. Fortunately, being an experienced Linux user, I was able to use that nearly-useless post to go digging. There was a lockfile in /dev/shm that was created not very long before the problems started:

    # ls -hal /dev/shm/
    total 0
    drwxrwxrwt  2 root  root    60 May 19 19:09 .
    drwxr-xr-x 22 root  root  4.8K May  4 20:47 ..
    -rw-------  1 boinc boinc    0 May 11 09:48 boinc_vboxwrapper_lock_e086e43dd21d28b7

I am at UTC-0600 here, so that is May 11, 15:48 UTC. The very first Theory task to fail after only 3 minutes was reported at 16:27 UTC.

Now, that very same lockfile is mentioned in the stderr.txt of every single failed Theory task, twice in fact:

    ...
    2024-05-19 18:50:33 (23676): Could not set race mitigation lock.
    2024-05-19 18:50:33 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
    2024-05-19 18:50:33 (23676): Error: ERR_TIMEOUT
    2024-05-19 18:50:33 (23676): Attempts: 48
    2024-05-19 18:50:33 (23676): Could not set race mitigation lock in 'create_vm'.
    2024-05-19 18:50:33 (23676): Could not create VM
    2024-05-19 18:50:33 (23676): ERROR: VM failed to start
    2024-05-19 18:50:33 (23676): Powering off VM.
    2024-05-19 18:50:33 (23676): Deregistering VM. (boinc_372f2cc2be2d23c7, slot#9)
    2024-05-19 18:50:33 (23676): Removing network bandwidth throttle group from VM.
    2024-05-19 18:52:04 (23676): Could not set race mitigation lock.
    2024-05-19 18:52:04 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
    2024-05-19 18:52:04 (23676): Error: ERR_TIMEOUT
    2024-05-19 18:52:04 (23676): Attempts: 48
    2024-05-19 18:52:04 (23676): Could not set race mitigation lock in 'deregister_vm'.
    2024-05-19 18:52:04 (23676): Warning: Will continue without a lock.
    2024-05-19 18:52:04 (23676): Removing VM from VirtualBox.

I believe that, when the task is first being set up, BOINC sets that lockfile for whatever reason the programmers find desirable/necessary, and then is supposed to delete it when it is no longer necessary. It looks like this one didn't get deleted for some reason -- a bug perhaps? One that is only rarely encountered?

This all screams, "Delete the file and try again," so that is what I did. No lockfile, no failed tasks -- and definitely no reboot needed; I now have one task waiting to be reported, and 4 more running quite happily.
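
For anyone who wants to repeat the diagnosis above, here is a minimal sketch. The function names `lock_names_from_log` and `stale_locks` are made up for illustration and are not part of BOINC or vboxwrapper. It assumes GNU sed and find, and it assumes that a lock file untouched for more than an hour (an arbitrary default) while tasks keep failing is a leftover; that is a guess, not documented behaviour. It only lists candidates and deletes nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of BOINC or vboxwrapper.

# Print the unique lock names found on "Lockname:" lines of a task's stderr.txt.
# The leading slash shown in the log is dropped.
lock_names_from_log() {
    sed -n "s|.*Lockname: '/\{0,1\}\([^']*\)'.*|\1|p" "$1" | sort -u
}

# Print vboxwrapper lock files in a directory (default: /dev/shm) that were
# last modified more than a given number of minutes ago (default: 60).
stale_locks() {
    local dir="${1:-/dev/shm}" minutes="${2:-60}"
    find "$dir" -maxdepth 1 -type f -name 'boinc_vboxwrapper_lock_*' -mmin "+$minutes"
}

# Example use on a live system (left commented out on purpose):
#   lock_names_from_log stderr.txt
#   stale_locks /dev/shm 60
```

If a name printed by the first function also shows up in the output of the second, it is a candidate for the manual cleanup described in the post. Check that no vboxwrapper process is still starting a VM before removing anything.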

Crystal Pellet (Volunteer moderator, Volunteer tester)
Joined: 14 Jan 10 | Posts: 1450 | Credit: 9,746,678 | RAC: 953
Message 50217 - Posted: 20 May 2024, 12:55:41 UTC - in response to Message 50216.

Quoting Message 50216:

    I believe that, when the task is first being set up, BOINC sets that lockfile for whatever reason the programmers find desirable/necessary, and then is supposed to delete it when it is no longer necessary. It looks like this one didn't get deleted for some reason -- a bug perhaps? One that is only rarely encountered? This all screams, "Delete the file and try again," so that is what I did. No lockfile, no failed tasks -- and definitely no reboot needed; I now have one task waiting to be reported, and 4 more running quite happily.

Very nice that you solved the problem with your Linux experience. For inexperienced users, a reboot should have unlocked that file too, I suppose.

Therefore just one question: have you not rebooted your system since May 11 09:48?

hadron
Joined: 4 Sep 22 | Posts: 100 | Credit: 17,073,659 | RAC: 1,773
Message 50218 - Posted: 20 May 2024, 15:04:54 UTC - in response to Message 50217.

/dev/shm is a virtual drive that exists only in memory, used by Linux programs to pass data to each other efficiently -- so a reboot will clear everything in it. However, I am a firm believer in not rebooting a system unless it is absolutely necessary, and getting rid of a single miscreant lockfile is not IMO sufficient cause to reboot.

So no, I have not rebooted my system since the 11th. In fact, the last time I did reboot was after a system upgrade on May 4.
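
The point about /dev/shm living only in memory can be checked on a given machine. The sketch below uses a made-up helper name, `fstype_of`, and assumes a Linux system where /proc/mounts is available. It reads a mount table in /proc/mounts format and prints the filesystem type recorded for a mount point; on typical Linux distributions /dev/shm is reported as tmpfs, though other setups may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only.
# Reads a mount table in /proc/mounts format on standard input and prints
# the filesystem type (third field) of the entry whose mount point
# (second field) equals the first argument.
fstype_of() {
    awk -v mp="$1" '$2 == mp { print $3 }'
}

# Example use on a live system (left commented out on purpose):
#   fstype_of /dev/shm < /proc/mounts
```

A result of tmpfs means the contents are held in memory and do not survive a reboot, which is why a reboot would also have removed the leftover lock file.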

LHC@home