Message boards : CMS Application : EXIT_NO_SUB_TASKS
Joined: 18 Dec 15 · Posts: 1687 · Credit: 102,944,310 · RAC: 125,493
Thanks, Ivan, as always, for passing the (not too good) information on to us. So we will wait and see what happens next week. What should be done in the meantime, though, I guess, is to stop tasks from being downloaded.
Joined: 29 Aug 05 · Posts: 1004 · Credit: 6,268,761 · RAC: 316
OK, thanks to great efforts by the CMS & CERN IT teams, a workaround is in place and we are able to run jobs again! I've submitted a small batch and have jobs running on my boxen. I'll submit a larger batch later, and take the opportunity to increase the job size, as the average run-time is less than I would prefer. This should increase our efficiency.
Joined: 15 Jun 08 · Posts: 2401 · Credit: 225,356,455 · RAC: 123,009
Thanks. Got 1 task that started fine. What factor do you expect regarding the runtime increase per job?
Joined: 29 Aug 05 · Posts: 1004 · Credit: 6,268,761 · RAC: 316
Thanks. I've gone from 5,000 to 10,000 events per job. Given the startup overhead, it should be less than a factor of two (the result file should be approximately twice as big, too). Let me know if it causes any problems. It'll take a while for them to show up, as there are 1,000 jobs of the previous size to get through first.
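For intuition, the "less than a factor of two" estimate follows from a fixed-overhead model of job runtime; a minimal sketch, where the overhead and per-event times are illustrative assumptions rather than measured CMS values:

```python
# Fixed-overhead model: total runtime = startup overhead + events * per-event time.
# overhead_s and per_event_s are illustrative assumptions, not measured CMS values.
def runtime_s(events, overhead_s=600.0, per_event_s=1.0):
    return overhead_s + events * per_event_s

old = runtime_s(5_000)    # previous job size
new = runtime_s(10_000)   # doubled job size
ratio = new / old         # stays below 2 because the overhead is paid only once

print(f"runtime ratio: {ratio:.2f}")
```

The larger the fixed overhead is relative to the per-event work, the further below 2 the ratio falls, which is exactly why bigger jobs improve efficiency.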
Joined: 18 Nov 17 · Posts: 119 · Credit: 51,810,545 · RAC: 22,348
Hello. All my CMS tasks now end with error -203 (0xFFFFFF35) ERR_NO_NETWORK_CONNECTION. Of course, the internet connection is fine.
Joined: 15 Jun 08 · Posts: 2401 · Credit: 225,356,455 · RAC: 123,009
Checked a couple of your logfiles. All of them show the same error:

2019-12-01 22:38:45 (16792): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2019-12-01 22:39:05 (16792): Guest Log: [DEBUG] nc: getaddrinfo: Temporary failure in name resolution
2019-12-01 22:39:05 (16792): Guest Log: [DEBUG] 1
2019-12-01 22:39:05 (16792): Guest Log: [ERROR] Could not connect to cern.ch on port 80

That's why the VMs shut down. Since DNS name resolution works on my internet connection, you may want to check your nameservers or switch to public ones like 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google).
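The VM's probe (name resolution first, then a TCP connect to port 80) can be reproduced on the host to tell the two failure modes apart; a minimal Python sketch, where the default host and port simply mirror the log above:

```python
import socket

def probe(host="cern.ch", port=80, timeout=10.0):
    """Mimic the VM's check: resolve the name, then try a TCP connection.

    Returns "dns" if name resolution fails (the error seen in the log),
    "tcp" if resolution works but no connection succeeds, "ok" otherwise.
    """
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns"   # e.g. "Temporary failure in name resolution"
    for *_ignored, sockaddr in infos:
        try:
            with socket.create_connection(sockaddr[:2], timeout=timeout):
                return "ok"
        except OSError:
            continue
    return "tcp"
```

A "dns" result on an otherwise working connection points at the nameservers, and switching the OS or router to public resolvers such as 1.1.1.1 or 8.8.8.8 is the fix suggested above.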
Joined: 18 Nov 17 · Posts: 119 · Credit: 51,810,545 · RAC: 22,348
Unfortunately, I do not know how to do that. And there was no such problem before...
Joined: 18 Nov 17 · Posts: 119 · Credit: 51,810,545 · RAC: 22,348
And I stopped receiving ATLAS tasks altogether several days ago... Maybe the reason is the same...
Joined: 18 Nov 17 · Posts: 119 · Credit: 51,810,545 · RAC: 22,348
It looks like only SixTrack is available for me now. But I did not change my preferences. No ATLAS tasks, no Theory tasks, and CMS tasks crash.
Joined: 15 Jun 08 · Posts: 2401 · Credit: 225,356,455 · RAC: 123,009
@Ivan Just noticed on the Grafana pages that the number of running CMS jobs has doubled since Sunday afternoon. We might need a new batch earlier than expected.
Joined: 29 Aug 05 · Posts: 1004 · Credit: 6,268,761 · RAC: 316
@Ivan Yeah, I've seen that too. I have a batch in the pipeline that's not showing up in WMStats yet. Federica submitted two small tasks last week that appear to have run according to WMStats, but I can't find any output in store -- ah, the unmerged result files are on DataBridge; I must be looking in the wrong place on EOS. I've just put in another batch that's not showing up yet either, even though the submission is reported as successful. I'll have to double-check my input parameters.
Joined: 29 Aug 05 · Posts: 1004 · Credit: 6,268,761 · RAC: 316
Ah, I think I've found the reason. I'd been playing around with priorities to try to get around the problem we had with condor requests timing out, so all my recent jobs have been submitted with priority 1000. Federica's batches were submitted with the original template value of 600000(!). I submitted another batch at priority 100000 and it's appeared on WMStats, so it looks like the others I have sent are not being acted upon while the current batch is still running at the same priority.
Joined: 15 Jun 08 · Posts: 2401 · Credit: 225,356,455 · RAC: 123,009
Just a reminder: there are again no SixTrack WUs, which results in a significantly higher number of CMS tasks being processed. => CMS may need fresh work earlier than expected.
Joined: 15 Jun 08 · Posts: 2401 · Credit: 225,356,455 · RAC: 123,009
Looks like there are no subtasks in the queue any more due to the many hosts that switched over from SixTrack. Is anybody from the project team aware of this?
Joined: 18 Dec 15 · Posts: 1687 · Credit: 102,944,310 · RAC: 125,493
"Looks like there are no subtasks in the queue any more..." Once again this raises the question of whether the automatic stop of the task-download queue when no jobs are available, which used to be in place, is no longer working.
Joined: 18 Nov 17 · Posts: 119 · Credit: 51,810,545 · RAC: 22,348
Yes. This is a very important question.
Joined: 18 Dec 15 · Posts: 1687 · Credit: 102,944,310 · RAC: 125,493
Again, there have been no subtasks available for the past few hours:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259890562
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259894933
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259853027
Joined: 15 Nov 14 · Posts: 602 · Credit: 24,371,321 · RAC: 0
I am picking up a whole string of them too. Since they are short, I wouldn't mind so much if there were a few good ones to work on. But when they are all bad, maybe I should work on WCG instead.
Joined: 11 Jan 20 · Posts: 1 · Credit: 279,839 · RAC: 0
100% failure here too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259953906
Joined: 18 Dec 15 · Posts: 1687 · Credit: 102,944,310 · RAC: 125,493
Still, although mentioned here before, two questions remain unanswered:
1) Why is the mechanism no longer working that should stop the task-download queue as soon as there are no sub-tasks available?
2) Is Ivan no longer on board? Before, when problems like the current one came up, he was always very helpful in solving these and other problems concerning CMS. Now, obviously, this is no longer the case :-(
©2024 CERN