EXIT_NO_SUB_TASKS
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1571 Credit: 66,512,737 RAC: 165,339 ![]() ![]() ![]() |
Something seems to have gone wrong with CMS in the past few days. I observe 2 types of problems: 1) Tasks error out after 8-9 minutes with -152 (0xFFFFFF68) ERR_NETOPEN. stderr: 2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] nc: connect to vocms0840.cern.ch port 9618 (tcp) timed out: Operation now in progress 2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] 1 2021-05-03 15:13:53 (11808): Guest Log: [ERROR] Could not connect to Condor server on port 9618 Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=315968114 2) Tasks run for hours and hours, but the Windows Task Manager does not show any CPU usage. What's the problem? I am seeing this on all of my machines, so the problem is definitely not with one of my PCs. |
![]() Send message Joined: 29 Aug 05 Posts: 941 Credit: 6,158,546 RAC: 1,103 ![]() |
Not sure, Erich. There is a marker out for a failed component in our WMAgent, but that's a database issue, I think, and on vocms0267 anyway. I'll message the WMCore team about that. We have an unusual level of jobs being run at the moment, I think because ATLAS jobs are currently limited -- my workflows of 10,000 jobs are being exhausted in little more than a day, compared to the normal 2-3 days. vocms0840 is our condor server, as your error suggests. As far as I can see it is not having terminal failures; I'll ask Fede to take a look if she can. ![]() |
![]() Send message Joined: 28 Sep 04 Posts: 620 Credit: 38,042,340 RAC: 13,349 ![]() ![]() ![]() |
I have not observed any problems running CMS tasks here apart from yesterday when we ran out of jobs. ![]() |
Send message Joined: 18 Dec 15 Posts: 1571 Credit: 66,512,737 RAC: 165,339 ![]() ![]() ![]() |
Thanks for the reply, Ivan. The situation is really strange. I have now tried to ping vocms0840: no problem on any of my PCs. However, after a CMS task starts, it errors out after 8 minutes with the message that it cannot connect to Condor. I tried it on several PCs, always the same. Ping works, CMS tasks fail :-( ATLAS and Theory work well everywhere here. No idea what could be the problem with CMS. |
Send message Joined: 27 Sep 08 Posts: 750 Credit: 570,587,647 RAC: 91,349 ![]() ![]() ![]() |
Just that small outage for me also. @Ivan, I'm running 213 CMS at once, that is a few more than normal for me. |
Send message Joined: 18 Nov 17 Posts: 118 Credit: 46,396,751 RAC: 15,679 ![]() ![]() ![]() |
"I have not observed any problems running CMS tasks here apart from yesterday when we ran out of jobs." I've been running fine too since that moment. |
Send message Joined: 18 Dec 15 Posts: 1571 Credit: 66,512,737 RAC: 165,339 ![]() ![]() ![]() |
I stopped all CMS activities last night, and when I restarted some of them this morning, everything seemed to look good. So hopefully the problem, whatever caused it, is solved now :-) |
Send message Joined: 18 Dec 15 Posts: 1571 Credit: 66,512,737 RAC: 165,339 ![]() ![]() ![]() |
A few minutes ago a task on one of my machines failed after 20 minutes with 207 (0x000000CF) EXIT_NO_SUB_TASKS, and further down: 2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) for 444 (ACTIVATE_CLAIM) 2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) -- perhaps this claim was already removed? 2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: problem finding resource for 403 (DEACTIVATE_CLAIM) 2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 No resources have been claimed for 600 seconds 2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 Shutting down Condor on this machine. What kind of problem is this now? If interested, the complete data is at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316023150 |
Send message Joined: 27 Sep 08 Posts: 750 Credit: 570,587,647 RAC: 91,349 ![]() ![]() ![]() |
I think if I don't touch my computers they run somewhat smoothly; I just leave the allocation of tasks to BOINC. As I said before, I think the flops estimate of the tasks is too low, so I get more than I think I should. I have 237 cores for BOINC, so not quite all for CMS. |
Send message Joined: 27 Sep 08 Posts: 750 Credit: 570,587,647 RAC: 91,349 ![]() ![]() ![]() |
I see the same. I assume it's something to do with us draining the queue? |
![]() Send message Joined: 28 Sep 04 Posts: 620 Credit: 38,042,340 RAC: 13,349 ![]() ![]() ![]() |
I now see a few errors at about the same time Erich56 says he had errors today. I also had some tasks that didn't run the normal 12+ hours but exited sooner. They were still accepted as valid. So something definitely happened just before 14:00 UTC in server land. |
Send message Joined: 27 Sep 08 Posts: 750 Credit: 570,587,647 RAC: 91,349 ![]() ![]() ![]() |
Back this morning. |
![]() Send message Joined: 28 Sep 04 Posts: 620 Credit: 38,042,340 RAC: 13,349 ![]() ![]() ![]() |
Here we go again. No sub tasks available. Luckily the ready to send queue has drained empty to reduce the number of failed tasks. ![]() |
![]() Send message Joined: 29 Aug 05 Posts: 941 Credit: 6,158,546 RAC: 1,103 ![]() |
I'm getting conflicting results from different monitors. WMStats says that there are jobs running but the job graphs say otherwise. On the other hand WMStats says there are lots of problems with our Agent, so its database may be stale. We were warned earlier today about upgrades to the Oracle database which weren't supposed to affect us -- perhaps reality is different. ![]() |
![]() Send message Joined: 29 Aug 05 Posts: 941 Credit: 6,158,546 RAC: 1,103 ![]() |
I've just heard from the WMCore team. The Oracle upgrade did not go smoothly, and our WMAgent is unable to connect to the database, hence the lack of jobs. It's being worked on; more information as it becomes available... |
![]() Send message Joined: 15 Jun 08 Posts: 2177 Credit: 185,285,211 RAC: 187,255 ![]() ![]() ![]() |
Back in the game. Thanks for investigating. |
Send message Joined: 18 Dec 15 Posts: 1571 Credit: 66,512,737 RAC: 165,339 ![]() ![]() ![]() |
Last night Theory ran out of jobs, and after some time the download of new tasks was stopped automatically, which was good. Hope that Ivan can do something this morning :-) |
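
A side note on the "ping works, CMS tasks fail" observation earlier in the thread: ping uses ICMP, while the stderr shows the task trying to open a TCP connection to vocms0840.cern.ch on port 9618, so a successful ping does not prove that port is reachable. Below is a minimal sketch of a TCP-level check that could be run from an affected host. The function name `tcp_port_open` and the 5-second timeout are assumptions made for illustration; this is not what the CMS VM itself runs (the log shows it uses `nc`).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it is not part of BOINC or HTCondor.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return False
```

Calling `tcp_port_open("vocms0840.cern.ch", 9618)` on a machine where ping succeeds but this returns `False` would point at something between the host and the server blocking or dropping TCP traffic on that port, or at the service itself not accepting connections at that moment, rather than at general network connectivity.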
©2023 CERN