Stuck WU: Waiting for the delivery of SIGUSR1
Author | Message |
---|---|
Send message Joined: 27 Sep 08 Posts: 854 Credit: 697,798,821 RAC: 136,462 |
I get a few WUs stuck on this error every now and then. Any ideas? Since it's inside the VM, I don't think there is a lot I can do. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
I know this is necroposting, but I woke up this morning to 4 Theory tasks that had been stuck for 8 hours with this error. Is there anything that can be done to prevent this? I manually aborted them and the next set of tasks started with no issues. I am running Squid as well. But it still happens with or without the proxy. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
I woke up to 4 more stuck this morning. The only thing I can see in common is that they all seem to have started at roughly the same time (~10-15 spread). Could this be a network issue because too many requests are going to this SIGUSR1 thing at once? |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
Just got some more; this time 4 Theory tasks started at the same time. Could it be Squid caching the response for that endpoint, causing the Theory tasks to hang? |
Send message Joined: 15 Jun 08 Posts: 2567 Credit: 258,157,485 RAC: 118,865 |
CloverField wrote: "Could it be Squid caching the response for that endpoint, causing the Theory tasks to hang?" No. See your own comment. CloverField wrote: "But it still happens ... without the proxy." |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
I checked one of your Theory tasks. I'm seeing entries from BOINC slots 29, 19 and 31 for the same Theory task. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
"I checked one of your Theory tasks." So somehow BOINC is starting the same task 3 times? |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
Where is your Squid? On the same Win10-PC? |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
Yup, on the same PC. |
Send message Joined: 15 Jun 08 Posts: 2567 Credit: 258,157,485 RAC: 118,865 |
2023-02-07 15:05:56 (46556): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB) . . . 2023-02-07 15:06:02 (46556): Setting network throttle for VM. (5120KB) It looks like you have tweaked your network bandwidth settings for the VMs (or BOINC as a whole). This makes no sense since it applies only to outgoing traffic (from the VM perspective), but it may affect the connection timing. You may want to leave those settings unlimited or at default values. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
2023-02-07 15:05:56 (46556): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB) . . . 2023-02-07 15:06:02 (46556): Setting network throttle for VM. (5120KB) I'll try to change those, but I ended up turning them on because I would get latency spikes in my network when tasks uploaded. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
These continue to happen; I get about 5-10 a week. Is there any way we could get some retry logic in the startup, like ATLAS and CMS have, so I don't have to make checking for stuck tasks part of my morning routine? |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
This happens only on Windows; I'm seeing it too. It needs some investigation... |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
Win11pro Theory Waiting for the delivery of SIGUSR1: cernvm-prod.cern.ch and alice.cern.ch |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
Yesterday a lot of tasks were stuck on SIGUSR1, waiting and waiting and waiting... from different servers (alice, sft, ...). |
Send message Joined: 27 Sep 08 Posts: 854 Credit: 697,798,821 RAC: 136,462 |
I just abort them. As discussed before, if you reboot the VM it can come back, but once you have the error it's a timeout getting a WU from the backend. It happens on Linux too, just not as often, I would say. |
Send message Joined: 2 May 07 Posts: 2255 Credit: 174,204,943 RAC: 8,340 |
Yes Toby, Linux is Linux... ;-), but I mostly have Windows. I checked MCPlot from yesterday: Host 10795955 (last month): total jobs 4038 (84 failed, 2%). Host 10797673 (last month): total jobs 352 (6 failed, 2%). So that's good performance. Before Atlas is back, it's an experience from a long time back now for Theory. |
Send message Joined: 27 Sep 08 Posts: 854 Credit: 697,798,821 RAC: 136,462 |
My all-time total is 2% and the last 30 days is also 2%. My observation, though, is that these stuck ones don't get any work, so over time you just end up doing no work as they have all given up trying to get work. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,444,519 RAC: 9,470 |
Got about 4 of these last night. |
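
For anyone who wants to automate the morning check for stuck tasks mentioned above, here is a minimal sketch in Python. It assumes the "Waiting for the delivery of SIGUSR1" message ends up in a log you can read on the host, and that lines carry timestamps in the same format as the log lines quoted in the thread; neither is confirmed here. The function names and the one-hour threshold are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta

# Message text taken from the thread title.
MARKER = "Waiting for the delivery of SIGUSR1"
# Assumed timestamp layout, matching the log lines quoted in the thread.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def waiting_since(log_lines, marker=MARKER):
    """Return when the current run of marker lines began, or None.

    Any later timestamped line without the marker counts as progress
    and resets the result. Lines without a parsable timestamp are skipped.
    """
    since = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # not a timestamped line
        if marker in line:
            if since is None:
                since = stamp
        else:
            since = None  # something else was logged afterwards
    return since


def is_stuck(log_lines, now, max_wait=timedelta(hours=1)):
    """True if the log has been waiting on the marker for at least max_wait."""
    since = waiting_since(log_lines)
    return since is not None and now - since >= max_wait
```

This only flags a candidate; what to do with it (abort the task, as people in this thread do by hand, or restart the VM) is left to whoever runs it.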