Message boards : CMS Application : EXIT_NO_SUB_TASKS

Erich56
Joined: 18 Dec 15
Posts: 1681
Credit: 99,382,019
RAC: 110,776
Message 44863 - Posted: 3 May 2021, 14:23:56 UTC

Something seems to have been going wrong with CMS over the past few days.
I am seeing two types of problems:

1) tasks error out after 8-9 minutes with

-152 (0xFFFFFF68) ERR_NETOPEN

stderr:

2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] nc: connect to vocms0840.cern.ch port 9618 (tcp) timed out: Operation now in progress

2021-05-03 15:13:53 (11808): Guest Log: [DEBUG] 1

2021-05-03 15:13:53 (11808): Guest Log: [ERROR] Could not connect to Condor server on port 9618

example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=315968114

2) tasks are running for hours and hours, but the Windows task manager does not show any CPU usage.

What's the problem?

I see these problems on all of my machines, so the cause is very unlikely to be one particular PC.
ID: 44863
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44864 - Posted: 3 May 2021, 17:52:00 UTC - in response to Message 44863.  

Not sure, Erich. There is a marker out for a failed component in our WMAgent, but that's a database issue, I think, and on vocms0267 anyway. I'll message the WMCore team about that. We have an unusual level of jobs being run at the moment, I think because ATLAS jobs are currently limited -- my workflows of 10,000 jobs are being exhausted in little more than a day, compared to the normal 2-3 days.
vocms0840 is our condor server, as your error suggests. As far as I can see it is not having terminal failures; I'll ask Fede to take a look if she can.
ID: 44864
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,012,052
RAC: 17,254
Message 44865 - Posted: 3 May 2021, 17:52:11 UTC

I have not observed any problems running CMS tasks here apart from yesterday when we ran out of jobs.
ID: 44865
Erich56
Joined: 18 Dec 15
Posts: 1681
Credit: 99,382,019
RAC: 110,776
Message 44866 - Posted: 3 May 2021, 18:58:40 UTC - in response to Message 44864.  

Thanks for the reply, Ivan.
The situation is really strange.
I have now tried to ping vocms0840: no problem on any of my PCs. However, after a CMS task starts, it errors out after 8 minutes with the message that it cannot connect to Condor. I tried it on several PCs, always the same. Ping works, CMS tasks fail :-(

ATLAS and Theory work well everywhere here.

No idea what could be the problem with CMS.
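
A note on "ping works, CMS tasks fail": ping only shows that the host answers ICMP, whereas the stderr quoted earlier in the thread reports a TCP connection to port 9618 timing out. The two can differ, for example if something on the path blocks that port. A minimal sketch of a TCP check follows; the function name `tcp_port_open` and the 10-second timeout are illustrative assumptions, and the host and port are simply the ones from the log.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not part of BOINC or the CMS app.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS lookup failures.
        return False
```

Running `tcp_port_open("vocms0840.cern.ch", 9618)` on the host gives a rough indication only: the task connects from inside the VirtualBox guest, so a result from the host is a proxy for what the guest sees, not a guarantee.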
ID: 44866
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 642,826,630
RAC: 283,572
Message 44867 - Posted: 3 May 2021, 18:59:30 UTC

Just that small outage for me also.

@Ivan, I'm running 213 CMS at once, that is a few more than normal for me.
ID: 44867
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,135,827
RAC: 27,291
Message 44870 - Posted: 3 May 2021, 20:47:22 UTC - in response to Message 44865.  

I'm running fine too since that moment.
ID: 44870
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44872 - Posted: 4 May 2021, 3:08:08 UTC - in response to Message 44867.  

Yes, so I see. To be honest, given the grief that VirtualBox has given me in the past, I wouldn't recommend such a backlog. :-)
ID: 44872
Erich56
Joined: 18 Dec 15
Posts: 1681
Credit: 99,382,019
RAC: 110,776
Message 44873 - Posted: 4 May 2021, 5:29:28 UTC - in response to Message 44866.  

I stopped all CMS activity last night, and when I restarted some tasks this morning, everything seemed to look good.
So hopefully the problem, whatever caused it, is solved now :-)
ID: 44873
Erich56
Joined: 18 Dec 15
Posts: 1681
Credit: 99,382,019
RAC: 110,776
Message 44874 - Posted: 4 May 2021, 14:10:18 UTC

A few minutes ago, a task on one of my machines failed after 20 minutes with

207 (0x000000CF) EXIT_NO_SUB_TASKS

and further down:

2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) for 444 (ACTIVATE_CLAIM)
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: can't find resource with ClaimId (<10.0.2.15:46355>#1620135484#1#...) -- perhaps this claim was already removed?
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:38:32 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 No resources have been claimed for 600 seconds
2021-05-04 15:48:44 (3428): Guest Log: 05/04/21 15:48:40 Shutting down Condor on this machine.

What kind of problem is this now?

If interested, the complete data is at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316023150
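
The last two log lines above appear to say that Condor inside the VM shut itself down because nothing claimed the slot for 600 seconds, which would fit a task ending with EXIT_NO_SUB_TASKS. A small sketch for spotting that pattern in a saved stderr text follows; the function name `idle_timeout_seconds` is hypothetical, and the regular expression is based only on the wording of the line quoted above.

```python
import re

# Matches the idle-shutdown line as it appears in the quoted Guest Log.
IDLE_RE = re.compile(r"No resources have been claimed for (\d+) seconds")


def idle_timeout_seconds(stderr_text: str):
    """Return the reported idle time in seconds, or None if the line is absent.

    Hypothetical helper for illustration; the log format is assumed from
    the single example in this thread.
    """
    match = IDLE_RE.search(stderr_text)
    return int(match.group(1)) if match else None
```

A non-None result suggests the task ended because no job arrived, rather than because of a fault on the volunteer's machine.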
ID: 44874
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 642,826,630
RAC: 283,572
Message 44876 - Posted: 4 May 2021, 17:04:52 UTC - in response to Message 44872.  

I think that if I don't touch my computers they run fairly smoothly; I just leave the allocation of tasks to BOINC. As I said before, I think the flops estimate of the tasks is too low, so I get more than I think I should. I have 237 cores for BOINC, so not quite all of them are used for CMS.
ID: 44876
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 642,826,630
RAC: 283,572
Message 44877 - Posted: 4 May 2021, 17:05:21 UTC - in response to Message 44874.  

I see the same. I assume it's something to do with us draining the queue?
ID: 44877
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,012,052
RAC: 17,254
Message 44884 - Posted: 4 May 2021, 20:38:50 UTC
Last modified: 4 May 2021, 20:39:17 UTC

I now see a few errors from about the same time that Erich56 says he had errors today. I also had some tasks that didn't run the normal 12+ hours but exited sooner. They were still marked as valid. So something definitely happened on the server side just before 14:00 UTC.
ID: 44884
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 642,826,630
RAC: 283,572
Message 44902 - Posted: 7 May 2021, 6:58:22 UTC

Back this morning.
ID: 44902
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,012,052
RAC: 17,254
Message 44955 - Posted: 17 May 2021, 17:57:03 UTC

Here we go again. No sub tasks available. Luckily the ready-to-send queue has drained empty, which reduces the number of failed tasks.
ID: 44955
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44956 - Posted: 17 May 2021, 18:03:49 UTC - in response to Message 44955.  
Last modified: 17 May 2021, 18:04:17 UTC

I'm getting conflicting results from different monitors. WMStats says that there are jobs running but the job graphs say otherwise. On the other hand WMStats says there are lots of problems with our Agent, so its database may be stale. We were warned earlier today about upgrades to the Oracle database which weren't supposed to affect us -- perhaps reality is different.
ID: 44956
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44959 - Posted: 17 May 2021, 20:09:09 UTC
Last modified: 17 May 2021, 20:09:31 UTC

I have messaged the usual crew, but I don't expect a response until tomorrow -- I think everyone's in CET time-zone, so mostly relaxing at night.
ID: 44959
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44965 - Posted: 18 May 2021, 9:23:40 UTC - in response to Message 44956.  
Last modified: 18 May 2021, 9:24:00 UTC

I've just heard from the WMCore team. The Oracle upgrade did not go smoothly, and our WMAgent is unable to connect to the database, hence the lack of jobs. It's being worked on; more information as it becomes available...
ID: 44965
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 44967 - Posted: 18 May 2021, 9:49:22 UTC - in response to Message 44965.  

The Agent is running again; now we just need to wait for the BOINC server to notice that jobs are available and start sending out tasks.
ID: 44967
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,691,661
RAC: 142,797
Message 44968 - Posted: 18 May 2021, 11:08:19 UTC - in response to Message 44967.  

Back in the game.
Thanks for investigating.
ID: 44968
Erich56
Joined: 18 Dec 15
Posts: 1681
Credit: 99,382,019
RAC: 110,776
Message 44999 - Posted: 25 May 2021, 3:35:45 UTC

During the night, Theory ran out of jobs, and after some time the download of new tasks was stopped automatically, which was good.

Hope that Ivan can do something this morning :-)
ID: 44999


©2024 CERN