Message boards : CMS Application : EXIT_NO_SUB_TASKS

Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,738,331
RAC: 61,499
Message 40706 - Posted: 27 Nov 2019, 11:23:48 UTC

Thanks, Ivan, as always, for passing the (not too good) information on to us.
So we will wait and see what happens next week.

What should be done, though, I guess, is to stop tasks from being downloaded.
ID: 40706 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 42
Message 40723 - Posted: 29 Nov 2019, 11:57:46 UTC

OK, thanks to great efforts by the CMS & CERN IT teams, a workaround is in place and we are able to run jobs again! I've submitted a small batch and have jobs running on my boxen. I'll submit a larger batch later, and take the opportunity to increase the job size as the average run-time is less than I would prefer. This should increase our efficiency.
ID: 40723 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2028
Credit: 148,799,928
RAC: 118,273
Message 40725 - Posted: 29 Nov 2019, 14:06:12 UTC - in response to Message 40723.  

Thanks.

Got 1 task that started fine.
What factor do you expect regarding the runtime increase per job?
ID: 40725 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 42
Message 40728 - Posted: 29 Nov 2019, 15:21:35 UTC - in response to Message 40725.  

> Thanks.
>
> Got 1 task that started fine.
> What factor do you expect regarding the runtime increase per job?

I've gone from 5,000 to 10,000 events per job. Given the startup overhead, the run-time should increase by less than a factor of two (the result file should be approximately twice as big, too). Let me know if it causes any problems. It'll take a while for the new jobs to show up, as there are 1,000 of the previous size to get through first.
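
The "less than a factor of two" estimate can be illustrated with a simple linear cost model: a fixed startup overhead plus a constant cost per event. This is only a sketch; the function name and the timing values (600 s startup, 1 s per event) are hypothetical and not measured CMS figures.

```python
def runtime_ratio(startup_s: float, per_event_s: float,
                  old_events: int = 5_000, new_events: int = 10_000) -> float:
    """Ratio of new to old job run-time, assuming run-time = startup + events * cost."""
    old_runtime = startup_s + per_event_s * old_events
    new_runtime = startup_s + per_event_s * new_events
    return new_runtime / old_runtime


# Hypothetical values chosen only for illustration.
ratio = runtime_ratio(startup_s=600.0, per_event_s=1.0)
print(f"{ratio:.2f}")  # prints 1.89
```

With any non-zero startup overhead the ratio stays below 2, and it approaches 2 as the overhead becomes negligible compared with the per-event cost.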
ID: 40728 · Report as offensive     Reply Quote
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,012,642
RAC: 7,120
Message 40743 - Posted: 1 Dec 2019, 19:40:27 UTC

Hello.
Now all my CMS tasks end with error -203 (0xFFFFFF35) ERR_NO_NETWORK_CONNECTION.
Of course, internet connection is fine.
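
For readers wondering how -203 and 0xFFFFFF35 relate: the hexadecimal value is the same exit code written as an unsigned 32-bit (two's complement) number. A minimal sketch, with a helper name that is purely illustrative:

```python
def as_unsigned32(code: int) -> str:
    """Format a signed exit code as an unsigned 32-bit hexadecimal string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(as_unsigned32(-203))  # prints 0xFFFFFF35
```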
ID: 40743 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2028
Credit: 148,799,928
RAC: 118,273
Message 40744 - Posted: 1 Dec 2019, 20:58:46 UTC - in response to Message 40743.  

Checked a couple of your logfiles.
All of them show the same error:
2019-12-01 22:38:45 (16792): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2019-12-01 22:39:05 (16792): Guest Log: [DEBUG] nc: getaddrinfo: Temporary failure in name resolution
2019-12-01 22:39:05 (16792): Guest Log: [DEBUG] 1
2019-12-01 22:39:05 (16792): Guest Log: [ERROR] Could not connect to cern.ch on port 80

That's why the VMs shut down.
Since DNS name resolution works over my internet connection, the problem is probably on your side: check your nameservers, or switch to public ones such as 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google).
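
To tell a DNS failure apart from a blocked port on the host itself, a check along the following lines may help. It is a rough sketch, not the script the VM runs (the log suggests the VM uses `nc`): the function name, the 20-second timeout, and the injectable `resolve`/`connect` parameters are assumptions made for illustration.

```python
import socket


def check_connection(host="cern.ch", port=80, timeout=20.0,
                     resolve=socket.getaddrinfo,
                     connect=socket.create_connection):
    """Return 'ok', or a string describing whether DNS or the TCP connect failed."""
    try:
        # Step 1: name resolution (the step that fails in the log above).
        resolve(host, port)
    except socket.gaierror as exc:
        return f"dns_failure: {exc}"
    try:
        # Step 2: open a TCP connection to the resolved host and port.
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:
        return f"connect_failure: {exc}"
    conn.close()
    return "ok"
```

Calling `check_connection()` with the defaults on the affected machine shows which of the two steps fails; a `dns_failure` result points to the nameserver configuration rather than to the connection as a whole.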
ID: 40744 · Report as offensive     Reply Quote
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,012,642
RAC: 7,120
Message 40745 - Posted: 1 Dec 2019, 21:23:30 UTC - in response to Message 40744.  

Unfortunately, I do not know how to do that. And there was no such problem before...
ID: 40745 · Report as offensive     Reply Quote
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,012,642
RAC: 7,120
Message 40746 - Posted: 1 Dec 2019, 22:05:02 UTC - in response to Message 40745.  

And I stopped receiving ATLAS tasks altogether several days ago... Maybe the reason is the same...
ID: 40746 · Report as offensive     Reply Quote
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,012,642
RAC: 7,120
Message 40747 - Posted: 1 Dec 2019, 22:19:55 UTC - in response to Message 40746.  

It looks like only SixTrack is available for me now. But I did not change my preferences. No ATLAS tasks, no Theory tasks, CMS tasks crash.
ID: 40747 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2028
Credit: 148,799,928
RAC: 118,273
Message 40858 - Posted: 8 Dec 2019, 22:08:58 UTC

@Ivan
Just noticed at the Grafana pages that the number of running CMS jobs has doubled since Sunday afternoon.
Might be that we need a new batch earlier than expected.
ID: 40858 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 42
Message 40863 - Posted: 9 Dec 2019, 8:46:39 UTC - in response to Message 40858.  

> @Ivan
> Just noticed at the Grafana pages that the number of running CMS jobs has doubled since Sunday afternoon.
> Might be that we need a new batch earlier than expected.

Yeah, I've seen that too. I have a batch in the pipeline that's not showing up in WMStats yet. Federica submitted two small tasks last week that appear to have run according to WMStats but I can't find any output in store -- ah, the unmerged result files are on DataBridge, I must be looking in the wrong place on EOS. I've just put in another batch that's not showing up yet either even though the submission is reported as successful. I'll have to double-check my input parameters.
ID: 40863 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 42
Message 40866 - Posted: 9 Dec 2019, 9:57:33 UTC - in response to Message 40863.  

Ah, I think I've found the reason. I'd been playing around with priorities to try to get around the problem we had with condor requests timing out, so all my recent jobs have been submitted with priority 1000. Federica's batches were submitted with the original template value of 600000(!). I submitted another batch at priority 100000 and it's appeared on WMStats, so it looks like the others I have sent are not being acted upon while the current batch is still running at the same priority.
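
The behaviour observed above can be summarised as a toy model: while a batch is running, only pending batches with a strictly higher priority appear to be acted upon. This is merely a restatement of the observation, not a description of how the workload management system actually schedules work; the function and batch names are hypothetical.

```python
def batches_acted_upon(running_priority: int, pending: dict) -> list:
    """Return the pending batches whose priority exceeds that of the running batch."""
    return [name for name, priority in pending.items()
            if priority > running_priority]


# Hypothetical batch names; priorities taken from the post above.
pending = {"batch_a": 1_000, "batch_b": 1_000, "batch_c": 100_000}
print(batches_acted_upon(1_000, pending))  # prints ['batch_c']
```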
ID: 40866 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2028
Credit: 148,799,928
RAC: 118,273
Message 40989 - Posted: 17 Dec 2019, 13:57:47 UTC

Just a reminder.

Once again there are no SixTrack WUs, which results in a significantly higher number of CMS tasks being processed.
=> CMS may need fresh work earlier than expected.
ID: 40989 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2028
Credit: 148,799,928
RAC: 118,273
Message 41295 - Posted: 18 Jan 2020, 9:09:37 UTC

Looks like there are no subtasks in the queue any more due to lots of hosts that switched over from SixTrack.
Is anybody from the project team aware of this?
ID: 41295 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,738,331
RAC: 61,499
Message 41296 - Posted: 18 Jan 2020, 9:21:29 UTC - in response to Message 41295.  

> Looks like there are no subtasks in the queue any more ...

Once again, this leads me to ask whether the automatic stop of the task queue, which was set up earlier for the case when jobs run out, is no longer working.
ID: 41296 · Report as offensive     Reply Quote
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,012,642
RAC: 7,120
Message 41299 - Posted: 18 Jan 2020, 10:14:40 UTC - in response to Message 41296.  

Yes. This is a very important question.
ID: 41299 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,738,331
RAC: 61,499
Message 41348 - Posted: 24 Jan 2020, 20:30:54 UTC

ID: 41348 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Nov 14
Posts: 592
Credit: 21,977,299
RAC: 199
Message 41350 - Posted: 24 Jan 2020, 22:30:58 UTC

I am picking up a whole string of them too. Since they are short, I wouldn't mind so much if there were a few good ones to work on.
But when they are all bad, maybe I should work on WCG.
ID: 41350 · Report as offensive     Reply Quote
rromanchuk

Joined: 11 Jan 20
Posts: 1
Credit: 279,839
RAC: 0
Message 41351 - Posted: 25 Jan 2020, 0:54:05 UTC - in response to Message 41350.  

100% failure here too

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259953906
ID: 41351 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,738,331
RAC: 61,499
Message 41352 - Posted: 25 Jan 2020, 6:27:54 UTC

Still, although raised here before, two questions remain unanswered:

1) Why is the mechanism that should stop the task download queue as soon as no sub-tasks are available no longer working?

2) Is Ivan no longer on board? Before, when problems like the current one came up, he was always very helpful in solving them, as well as other problems concerning CMS. Now, obviously, this is no longer the case :-(
ID: 41352 · Report as offensive     Reply Quote


©2022 CERN