Message boards : Theory Application : Stuck WU: Waiting for the delivery of SIGUSR1

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 817
Credit: 661,412,170
RAC: 172,049
Message 46052 - Posted: 12 Jan 2022, 17:52:13 UTC

Every now and then I get a few WUs stuck on this error. Any ideas? Since it's inside the VM, I don't think there is a lot I can do.
ID: 46052
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47748 - Posted: 4 Feb 2023, 13:36:59 UTC

I know this is necroposting, but I woke up this morning to 4 Theory tasks that had been stuck for 8 hours with this error. Is there anything that can be done to prevent this? I manually aborted them and the next set of tasks started with no issues. I am running Squid as well. But it still happens with or without the proxy.
ID: 47748
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47749 - Posted: 5 Feb 2023, 14:29:21 UTC

I woke up to 4 more stuck this morning. The only thing I can see in common is that they all appear to have started at roughly the same time (~10-15 spread). Could this be a network issue because too many requests are going to this SIGUSR1 thing at a time?
ID: 47749
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47752 - Posted: 7 Feb 2023, 20:32:09 UTC

Just got some more; this time 4 Theory tasks started at the same time. Could it be Squid caching the response for that endpoint, causing the Theory tasks to hang?
ID: 47752
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2450
Credit: 232,576,887
RAC: 131,487
Message 47753 - Posted: 7 Feb 2023, 20:53:26 UTC - in response to Message 47752.  

CloverField wrote:
Could it be Squid caching the response for that endpoint, causing the Theory tasks to hang?

No. See your own comment:
CloverField wrote:
But it still happens ... without the proxy.
ID: 47753
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47754 - Posted: 8 Feb 2023, 1:04:51 UTC - in response to Message 47752.  

I have checked one of your Theory tasks.
I see entries from BOINC slots 29, 19 and 31 for the same Theory task.
ID: 47754
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47755 - Posted: 8 Feb 2023, 12:17:43 UTC - in response to Message 47754.  

maeax wrote:
I have checked one of your Theory tasks.
I see entries from BOINC slots 29, 19 and 31 for the same Theory task.

So somehow BOINC is starting the same task 3 times?
ID: 47755
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47756 - Posted: 8 Feb 2023, 12:27:48 UTC - in response to Message 47755.  

Where is your Squid?
On the same Win10-PC?
ID: 47756
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47757 - Posted: 8 Feb 2023, 13:37:24 UTC - in response to Message 47756.  

Yup, on the same PC.
ID: 47757
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2450
Credit: 232,576,887
RAC: 131,487
Message 47758 - Posted: 8 Feb 2023, 13:58:33 UTC - in response to Message 47749.  

2023-02-07 15:05:56 (46556): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
.
.
.
2023-02-07 15:06:02 (46556): Setting network throttle for VM. (5120KB)

It looks like you have tweaked your network bandwidth settings for the VMs (or BOINC as a whole).
This makes no sense since it applies only to outgoing traffic (from the VM perspective), but it may affect the connection timing.

You may leave those settings unlimited or at default values.
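
To check whether a task log shows a throttle like the one above, the log text can be scanned for the "Setting network throttle for VM" line. The following is a minimal Python sketch, assuming the line format matches the excerpt quoted in this post; the function name `find_throttle_settings` is made up for illustration and is not part of BOINC.

```python
import re

# Pattern for the throttle line as it appears in the log excerpt above.
# Assumption: other client versions may word this line differently.
THROTTLE_RE = re.compile(r"Setting network throttle for VM\. \((\d+)KB\)")


def find_throttle_settings(log_text):
    """Return every throttle value (in KB, as int) found in a task log."""
    return [int(m.group(1)) for m in THROTTLE_RE.finditer(log_text)]


# The two log lines quoted in this post, used as sample input.
sample = """\
2023-02-07 15:05:56 (46556): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2023-02-07 15:06:02 (46556): Setting network throttle for VM. (5120KB)
"""

throttles = find_throttle_settings(sample)
print(throttles)
```

An empty list would suggest that no explicit throttle line was logged for that task.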
ID: 47758
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47759 - Posted: 8 Feb 2023, 18:05:12 UTC - in response to Message 47758.  

computezrmle wrote:
2023-02-07 15:05:56 (46556): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
.
.
.
2023-02-07 15:06:02 (46556): Setting network throttle for VM. (5120KB)

It looks like you have tweaked your network bandwidth settings for the VMs (or BOINC as a whole).
This makes no sense since it applies only to outgoing traffic (from the VM perspective), but it may affect the connection timing.

You may leave those settings unlimited or at default values.


I'll try changing those, but I ended up putting them on because I would get latency spikes in my network when tasks uploaded.
ID: 47759
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 47787 - Posted: 25 Feb 2023, 12:29:27 UTC

These continue to happen; I get about 5-10 a week. Is there any way we could get some retry logic in the startup, like ATLAS and CMS have, so I don't have to make checking for stuck tasks part of my morning routine?
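
Until something like that exists in the application, the morning check could in principle be scripted on the host. The sketch below is only a heuristic under stated assumptions: it assumes a stuck VM sits nearly idle, so its CPU time stays tiny compared with its elapsed time, and it assumes `boinccmd --get_tasks` prints `key: value` lines that include `name`, `elapsed task time` and `current CPU time`. Those field names, the thresholds and all function names are assumptions for illustration, so check them against the output of your own client first.

```python
import subprocess


def parse_tasks(text):
    """Turn 'key: value' lines into one dict per task.

    Assumption: every task block starts with a 'name:' line.
    """
    tasks = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # separator lines and headings carry no field
        key, value = key.strip(), value.strip()
        if key == "name":
            tasks.append({})  # a new task block begins here
        if tasks:
            tasks[-1][key] = value
    return tasks


def find_stalled(tasks, min_elapsed=3600.0, max_cpu_ratio=0.05,
                 elapsed_key="elapsed task time",
                 cpu_key="current CPU time"):
    """Return names of tasks that ran long but used almost no CPU.

    Field names and thresholds are assumptions, not documented values.
    """
    stalled = []
    for task in tasks:
        try:
            elapsed = float(task[elapsed_key])
            cpu = float(task[cpu_key])
        except (KeyError, ValueError):
            continue  # field missing or not numeric: skip, don't guess
        if elapsed >= min_elapsed and cpu <= max_cpu_ratio * elapsed:
            stalled.append(task["name"])
    return stalled


def list_stalled_tasks():
    """Query the local BOINC client and return suspicious task names."""
    result = subprocess.run(["boinccmd", "--get_tasks"],
                            capture_output=True, text=True, check=True)
    return find_stalled(parse_tasks(result.stdout))
```

The sketch only reports candidates and does not filter by application. Whether to abort them (by hand, as described earlier in this thread, or via `boinccmd --task <project_url> <task_name> abort`) is best left as a manual decision, since a low CPU share alone does not prove that a task is waiting for SIGUSR1.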
ID: 47787
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47788 - Posted: 25 Feb 2023, 14:16:53 UTC - in response to Message 47787.  

This happens only on Windows; I am seeing it too.
It needs some investigation...
ID: 47788
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47850 - Posted: 13 Mar 2023, 19:19:40 UTC - in response to Message 47788.  

Win11 Pro, Theory, "Waiting for the delivery of SIGUSR1":
cernvm-prod.cern.ch and
alice.cern.ch
ID: 47850
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47859 - Posted: 15 Mar 2023, 6:46:09 UTC - in response to Message 47850.  

Yesterday there were a lot of tasks with SIGUSR1:
waiting and waiting and waiting... from different servers (alice, sft, ...).
ID: 47859
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 817
Credit: 661,412,170
RAC: 172,049
Message 47860 - Posted: 15 Mar 2023, 7:11:10 UTC

I just abort them. As discussed before, if you reboot the VM it can come back, but once you have the error it has timed out getting a WU from the backend.

It happens on Linux too, just not as often, I would say.
ID: 47860
maeax
Joined: 2 May 07
Posts: 2158
Credit: 162,597,990
RAC: 123,102
Message 47861 - Posted: 15 Mar 2023, 7:35:59 UTC - in response to Message 47860.  

Yes Toby,
Linux is Linux... ;-), but I have mostly Windows.
I checked MCPlot from yesterday:
Host 10795955
(last month) total jobs: 4038 (84 failed, 2%)

Host 10797673
(last month) total jobs: 352 (6 failed, 2%)

So that is good performance.
Before ATLAS is back, it's an experience from a long time back now for Theory.
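
For reference, the rounded 2% figures can be recomputed from the job counts listed above. A small Python sketch; the helper name `failure_rate` is made up for illustration:

```python
def failure_rate(failed, total):
    """Return the share of failed jobs as a percentage."""
    return 100.0 * failed / total


# (failed, total) per host, taken from the numbers listed above.
hosts = {
    "10795955": (84, 4038),
    "10797673": (6, 352),
}

rates = {host: failure_rate(f, t) for host, (f, t) in hosts.items()}
for host, rate in rates.items():
    print(f"Host {host}: {rate:.2f}% failed")
```

This gives about 2.08% and 1.70%, which both round to the 2% shown.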
ID: 47861
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 817
Credit: 661,412,170
RAC: 172,049
Message 47863 - Posted: 15 Mar 2023, 21:17:51 UTC

My all-time total is 2% and the last 30 days is also 2%.

My observation, though, is that these stuck ones don't get any work, so over time you end up doing no work because they have all given up trying to get work.
ID: 47863
CloverField
Joined: 17 Oct 06
Posts: 78
Credit: 53,964,709
RAC: 34,815
Message 48118 - Posted: 19 May 2023, 14:01:48 UTC

Got about 4 of these last night.
ID: 48118