Thread 'EXIT_NO_SUB

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,357,685 RAC: 46,082	Message 44863 - Posted: 3 May 2021, 14:23:56 UTC
Something seems to have been going wrong with CMS in the past few days. I observe two types of problems:
1) Tasks error out after 8-9 minutes with -152 (0xFFFFFF68) ERR_NETOPEN. stderr:
2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] nc: connect to vocms0840.cern.ch port 9618 (tcp) timed out: Operation now in progress
2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] 1
2021-05-03 15:13:53 (11808): Guest Log: [ERROR] Could not connect to Condor server on port 9618
Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=315968114
2) Tasks run for hours and hours, but the Windows Task Manager does not show any CPU usage.
What's the problem? I am seeing this on all of my machines, so the problem is definitely not with one of my PCs.
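For anyone who wants to pick the relevant lines out of a long stderr dump, here is a minimal sketch in Python. It assumes every guest line has the shape "timestamp (pid): Guest Log: [LEVEL] message", which is inferred only from the three lines quoted above; real logs may contain other shapes. The names parse_guest_log and errors_only are made up for this illustration and are not part of BOINC or the CMS app.

```python
import re

# Assumed line shape, based only on the lines quoted in the post:
#   2021-05-03 15:13:53 (11808): Guest Log: [ERROR] Could not connect ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\): Guest Log: "
    r"(?:\[(?P<level>[A-Z]+)\] )?(?P<msg>.*)$"
)

def parse_guest_log(text):
    """Return one dict (ts, pid, level, msg) per recognised Guest Log line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries

def errors_only(entries):
    """Keep only the entries tagged [ERROR]."""
    return [e for e in entries if e["level"] == "ERROR"]

# The three stderr lines from the post, one per line.
SAMPLE = """\
2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] nc: connect to vocms0840.cern.ch port 9618 (tcp) timed out: Operation now in progress
2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] 1
2021-05-03 15:13:53 (11808): Guest Log: [ERROR] Could not connect to Condor server on port 9618
"""

if __name__ == "__main__":
    for entry in errors_only(parse_guest_log(SAMPLE)):
        print(entry["ts"], entry["msg"])
```

Lines without a [LEVEL] tag are still parsed; their level is simply None, so they drop out of errors_only.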

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44864 - Posted: 3 May 2021, 17:52:00 UTC - in response to Message 44863.
Not sure, Erich. There is a marker out for a failed component in our WMAgent, but that's a database issue, I think, and on vocms0267 anyway. I'll message the WMCore team about that. We have an unusual level of jobs being run at the moment, I think because ATLAS jobs are currently limited -- my workflows of 10,000 jobs are being exhausted in little more than a day, compared to the normal 2-3 days. vocms0840 is our condor server, as your error suggests. As far as I can see it is not having terminal failures; I'll ask Fede to take a look if she can.

Harri Liljeroos Joined: 28 Sep 04 Posts: 799 Credit: 64,932,297 RAC: 30,783	Message 44865 - Posted: 3 May 2021, 17:52:11 UTC
I have not observed any problems running CMS tasks here apart from yesterday when we ran out of jobs.

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,357,685 RAC: 46,082	Message 44866 - Posted: 3 May 2021, 18:58:40 UTC - in response to Message 44864.
Thanks for the reply, Ivan. The situation is really strange. I have now tried to ping vocms0840: no problem on any of my PCs. However, after a CMS task starts, it errors out after 8 minutes with the message that it cannot connect to Condor. I tried it on several PCs, always the same. Ping works, CMS tasks fail :-( ATLAS and Theory work well everywhere here. No idea what could be the problem with CMS.
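One thing worth keeping in mind here: ping uses ICMP, while the failing step in the log is a TCP connection to port 9618, so a successful ping does not show that the port itself is reachable. A rough way to test the port directly is sketched below in Python. The function name tcp_port_open is invented for this example, and the check runs from the host rather than from inside the VirtualBox guest, so the result may differ from what the task sees.

```python
import socket

def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    DNS failures, refused connections and timeouts all raise OSError
    (or a subclass of it) and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

To probe the server named in the log you would call tcp_port_open("vocms0840.cern.ch", 9618). A False result only says the connection could not be made from this machine at this moment; it does not say where along the path the problem sits.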

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 918 Credit: 779,310,908 RAC: 147,863	Message 44867 - Posted: 3 May 2021, 18:59:30 UTC
Just that small outage for me also. @Ivan, I'm running 213 CMS tasks at once, which is a few more than normal for me.

NOGOOD Joined: 18 Nov 17 Posts: 134 Credit: 59,095,150 RAC: 4,532	Message 44870 - Posted: 3 May 2021, 20:47:22 UTC - in response to Message 44865.
I've been running fine too since then.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44872 - Posted: 4 May 2021, 3:08:08 UTC - in response to Message 44867.
Yes, so I see. To be honest, given the grief that VirtualBox has given me in the past, I wouldn't recommend such a backlog. :-)

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,357,685 RAC: 46,082	Message 44873 - Posted: 4 May 2021, 5:29:28 UTC - in response to Message 44866.
I stopped all CMS activity last night, and when I restarted some tasks this morning, everything seemed to look good. So hopefully the problem, whatever caused it, is solved now :-)

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,357,685 RAC: 46,082	Message 44874 - Posted: 4 May 2021, 14:10:18 UTC
A few minutes ago a task failed on one of my machines after 20 minutes with 207 (0x000000CF) EXIT_NO_SUB_TASKS, and further down:
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) for 444 (ACTIVATE_CLAIM)
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) -- perhaps this claim was already removed?
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 No resources have been claimed for 600 seconds
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 Shutting down Condor on this machine.
What kind of problem is this now? If interested, the complete data is at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316023150
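As a side note on the numbers: the hex value shown in parentheses next to each exit code in this thread is just the same code written as an unsigned 32-bit value, which is why -152 appears as 0xFFFFFF68 (two's complement) and 207 as 0x000000CF. A small Python sketch of the conversion in both directions follows; the helper names as_u32_hex and from_u32_hex are made up for this example.

```python
def as_u32_hex(code):
    """Format a signed 32-bit exit code as 8-digit hex (two's complement)."""
    return f"0x{code & 0xFFFFFFFF:08X}"

def from_u32_hex(text):
    """Convert an 8-digit hex string back to the signed 32-bit value."""
    value = int(text, 16)
    # Values with the top bit set represent negative numbers.
    return value - (1 << 32) if value >= (1 << 31) else value

print(as_u32_hex(-152))            # 0xFFFFFF68  (ERR_NETOPEN in the first post)
print(as_u32_hex(207))             # 0x000000CF  (EXIT_NO_SUB_TASKS)
print(from_u32_hex("0xFFFFFF68"))  # -152
```

So the two reports in this thread are different codes, not the same failure shown in two notations.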

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 918 Credit: 779,310,908 RAC: 147,863	Message 44876 - Posted: 4 May 2021, 17:04:52 UTC - in response to Message 44872.
I think if I don't touch my computers they run somewhat smoothly. I just leave the allocation of tasks to BOINC. As I said before, I think the flops estimate of the tasks is too low, so I get more than I think I should. I have 237 cores for BOINC, so not quite all for CMS.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 918 Credit: 779,310,908 RAC: 147,863	Message 44877 - Posted: 4 May 2021, 17:05:21 UTC - in response to Message 44874.
I see the same. I assume it's something to do with us draining the queue?

Harri Liljeroos Joined: 28 Sep 04 Posts: 799 Credit: 64,932,297 RAC: 30,783	Message 44884 - Posted: 4 May 2021, 20:38:50 UTC Last modified: 4 May 2021, 20:39:17 UTC
I now see a few errors from about the same time Erich56 says he had errors today. I also had some tasks that didn't run the normal 12+ hours but exited sooner. They were still deemed valid. So something definitely happened on the server side just before 14:00 UTC.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 918 Credit: 779,310,908 RAC: 147,863	Message 44902 - Posted: 7 May 2021, 6:58:22 UTC
Back this morning.

Harri Liljeroos Joined: 28 Sep 04 Posts: 799 Credit: 64,932,297 RAC: 30,783	Message 44955 - Posted: 17 May 2021, 17:57:03 UTC
Here we go again. No sub tasks available. Luckily the ready-to-send queue has drained empty, which reduces the number of failed tasks.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44956 - Posted: 17 May 2021, 18:03:49 UTC - in response to Message 44955. Last modified: 17 May 2021, 18:04:17 UTC
I'm getting conflicting results from different monitors. WMStats says that there are jobs running but the job graphs say otherwise. On the other hand WMStats says there are lots of problems with our Agent, so its database may be stale. We were warned earlier today about upgrades to the Oracle database which weren't supposed to affect us -- perhaps reality is different.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44959 - Posted: 17 May 2021, 20:09:09 UTC Last modified: 17 May 2021, 20:09:31 UTC
I have messaged the usual crew, but I don't expect a response until tomorrow -- I think everyone's in the CET time zone, so mostly relaxing at night.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44965 - Posted: 18 May 2021, 9:23:40 UTC - in response to Message 44956. Last modified: 18 May 2021, 9:24:00 UTC
I've just heard from the WMCore team. The Oracle upgrade did not go smoothly, and our WMAgent is unable to connect to the database, hence the lack of jobs. It's being worked on; more information as it becomes available...

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 44967 - Posted: 18 May 2021, 9:49:22 UTC - in response to Message 44965.
The Agent is running again; now we just need to wait for the BOINC server to notice that jobs are available and start sending out tasks.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 300,116,263 RAC: 49,352	Message 44968 - Posted: 18 May 2021, 11:08:19 UTC - in response to Message 44967.
Back in the game. Thanks for investigating.

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,357,685 RAC: 46,082	Message 44999 - Posted: 25 May 2021, 3:35:45 UTC
Last night Theory ran out of jobs, and after some time the download of new tasks was stopped automatically, which was good. Hope that Ivan can do something this morning :-)