CMS Tasks Failing

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32954 - Posted: 1 Nov 2017, 6:48:06 UTC Last modified: 1 Nov 2017, 6:48:19 UTC
Again, last night two cases with no connection to Condor; one of them is:
2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.
What's wrong?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 32955 - Posted: 1 Nov 2017, 8:56:50 UTC - in response to Message 32954.
Well, the glib answer is that there's a connectivity problem somewhere. :-/ Exactly where is the question. Nils seems to have no problem, according to a recent post, but he is (I believe) quite local to CERN. I'm not having any problems that I'm aware of, from Heathrow North (but through an academic network).
Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32957 - Posted: 1 Nov 2017, 14:27:44 UTC - in response to Message 32955.
> Well, the glib answer is that there's a connectivity problem somewhere. :-/ Exactly where is the question.
Yes, that's the question.
> Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?
In my case: Vienna, Austria.
What I have done now: I have tried to ping the Condor server (vccondor01.cern.ch) from all my PCs at various times, always with success. So the problem, when it occurs, seems to be there only for a very short time. And not too often. It will be very hard to find out what it really is.
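Editor's note: a successful ping only shows that ICMP traffic gets through, whereas the failing check in the log above is a TCP connection to port 9618. For anyone who wants to log times for the thread, here is a minimal sketch of a TCP probe. The function names `check_tcp` and `report_line` are made up for illustration, and the 30-second default timeout is an assumption taken from the gap between the log timestamps; this is not the check the CMS VM itself runs.

```python
import socket
from datetime import datetime, timezone


def check_tcp(host, port, timeout=30.0):
    """Attempt a single TCP connection and return (ok, detail)."""
    try:
        # create_connection resolves the name and connects, honouring the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers timeouts, refused connections and DNS failures
        return False, f"{type(exc).__name__}: {exc}"


def report_line(host, port, ok, detail):
    """Format one result with a UTC timestamp, ready to paste into a post."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    status = "OK" if ok else "FAIL"
    return f"{stamp} {host}:{port} {status} ({detail})"
```

Calling `check_tcp("vccondor01.cern.ch", 9618)` every few minutes and keeping the `report_line` output for the failures would give the times ivan asked for; the location still has to be added by hand.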

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0
Message 32963 - Posted: 2 Nov 2017, 0:41:30 UTC - in response to Message 32955. Last modified: 2 Nov 2017, 0:43:50 UTC
> Well, the glib answer is that there's a connectivity problem somewhere. :-/ Exactly where is the question. Nils seems to have no problem, according to a recent post, but he is (I believe) quite local to CERN. I'm not having any problems that I'm aware of, from Heathrow North (but through an academic network).
> Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?
My machines (in Missouri) were unable to process Theory for 3 hours, from 14:33 UTC to 17:26 UTC on Nov 1. Some of the WUs took over 2 hours before they failed. 60 WUs across 3 machines just stopped processing, and all but 10 ended in the same variety of ping errors Erich56 listed above.
I was going to say that maybe the DB lock issue from the other thread (https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4496&postid=32952#32952) is related, but that would mean Erich56's WUs would need to have waited to fail from 14:30 UTC till 20 UTC (from Erich56's posted log). Is that a possibility?
@Erich56, did you notice if the WUs that continued to be in RAM were actually using CPU cycles?

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32964 - Posted: 2 Nov 2017, 6:23:12 UTC - in response to Message 32963.
> @Erich56, did you notice if the WU's that continued to be in RAM were actually using CPU cycles?
No, they did NOT.

Harri Liljeroos Joined: 28 Sep 04 Posts: 691 Credit: 45,482,664 RAC: 33,578
Message 32974 - Posted: 3 Nov 2017, 15:49:57 UTC
I haven't done any CMS for a long time but decided to give it a go while sixtrack tasks were not available. I use these single-thread tasks to fill the CPU while running Atlas tasks on 3 CPU threads and some Seti and Einstein tasks on the GPU. The first CMS task went well, but the second one ran successfully for about four hours before failing. The failed task is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163466015
The task was paused 4 times while Boinc switched tasks (more CPU cores were required for running two Einstein tasks on a GPU at the same time). Three times the task was successfully continued, but the 4th time it failed and the task ended in error 206 (0x000000CE) EXIT_INIT_FAILURE. This is an odd error because it had already finished a few jobs. Needless to say, 0 credit was given because of this error. It would be nice if this situation could be recognized and credit given based on the finished jobs. Even better if we can get rid of these errors altogether.
Has the idea ever been contemplated of using Boinc as it is meant to be used, i.e. packing all the necessary files for a task into one or more zip files and downloading them before starting to crunch? This would avoid the constantly required communication with the CERN servers. After the task is finished, upload the results back to CERN in one go, like most other projects do. Just my two cents.

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32976 - Posted: 4 Nov 2017, 7:39:01 UTC - in response to Message 32954.
> Again, last night two cases with no connection to Condor; one of them is:
> 2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
> 2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
> 2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
> 2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
> 2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.
> What's wrong?
Meanwhile, I am experiencing this problem on all 3 computers which I use for CMS crunching (at the beginning, it occurred only on one of them, so I was wondering whether it might have to do with this specific system).
Something must be "shaky" with the Condor Server ... :-(

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 32977 - Posted: 4 Nov 2017, 10:26:51 UTC - in response to Message 32976.
There was a spike in the failed jobs overnight, but I've not seen anything else amiss. We have made some changes lately which seem to have affected the merge jobs but should not be noticeable to volunteers. The current batch has a few hours to run and then some more changes/bug-fixes should come into operation. I can't say yet what effect these will have, so there may be some disruption later today.

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32979 - Posted: 4 Nov 2017, 13:30:13 UTC
Late this morning, some more tasks failed here, under the "title" 207 (0x000000CF) EXIT_NO_SUB_TASKS.
While the connection to the Condor Server was successful, there were some very strange things contained in the stderr, like:
2017-11-04 12:22:14 (780): Guest Log: Did the tarball get created?
2017-11-04 12:22:14 (780): Guest Log: /tmp/CMS_25434_1509769558.937510_0.tgz
2017-11-04 12:22:14 (780): Guest Log: Here is the upload output
2017-11-04 12:22:14 (780): Guest Log: Here is the upload error
2017-11-04 12:22:14 (780): Guest Log: Here is the condor directory
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:11:55 Log last touched time unavailable (No such file or directory)
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 condor_write(): Socket closed when trying to write 2307 bytes to collector vccondor01.cern.ch, fd is 12
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 Buf::write(): condor_write() failed
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 AllReaper unexpectedly called on pid 4096, status 0.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 The STARTD (pid 4096) exited with status 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 All daemons are gone. Exiting.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 ** condor_master (condor_MASTER) pid 4086 EXITING WITH STATUS 0
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Shutting down Condor on this machine.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Got SIGTERM. Performing graceful shutdown.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 shutdown graceful
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 CronJob: 'multicore': Trying to kill illegal PID 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Killing job multicore
...
2017-11-04 12:22:14 (780): Guest Log: [ERROR] No jobs were available to run.
2017-11-04 12:22:14 (780): Guest Log: [INFO] Shutting Down.
Really strange, I have never seen this before. Does anyone have any idea what this is all about?
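Editor's note: stderr dumps like the one above are long, and the few lines that matter are easy to miss. A rough sketch of a filter follows. It assumes the "timestamp (pid): Guest Log: text" layout seen in this thread, and the search strings in `PATTERNS` are simply the ones that came up here, not an official list of failure markers; `interesting_lines` is a hypothetical helper name.

```python
import re

# Layout assumed from the logs quoted in this thread:
#   2017-11-04 12:22:14 (780): Guest Log: <text>
GUEST_LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: (?P<text>.*)$"
)

# Substrings worth flagging; chosen from this thread only (an assumption)
PATTERNS = ("[ERROR]", "condor_write", "timed out")


def interesting_lines(stderr_text):
    """Return (timestamp, text) pairs for Guest Log lines matching PATTERNS."""
    hits = []
    for raw in stderr_text.splitlines():
        match = GUEST_LINE.match(raw.strip())
        if match and any(p in match.group("text") for p in PATTERNS):
            hits.append((match.group("stamp"), match.group("text")))
    return hits
```

Pasting the stderr text of a failed task into `interesting_lines` would, for the log above, keep the two condor_write lines and the final [ERROR] line and drop the rest.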

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 32980 - Posted: 4 Nov 2017, 14:18:44 UTC - in response to Message 32979.
Something's not going right. We should have started picking up jobs from a new WMAgent by now, but we haven't. The queue for the old batch has drained and running jobs are dropping. I submitted a new batch to the old WMAgent but it will be tens of minutes before they start to arrive at the Condor server. I suspect the problem is that the queue may be saturated with merge jobs that haven't reached their third retry yet so production jobs can't run.
Whichever, I suggest setting to No New Tasks for a while until things are clearer. I'll be back from shopping in an hour or so, I'll let you know if the situation is any better then.

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32981 - Posted: 4 Nov 2017, 14:30:44 UTC
Thanks for the information, Ivan.
Some 20 minutes ago, I had another two failing CMS tasks within two minutes.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 32983 - Posted: 4 Nov 2017, 15:25:53 UTC - in response to Message 32981. Last modified: 4 Nov 2017, 15:26:14 UTC
> Thanks for the Information, Ivan.
> Some 20 minutes ago, I had another two failing CMS tasks within two minutes.
That's not good, but it's not an area I can help with AFAIK. On a happier note, jobs from the new batch are making it into the queue, so that panic's over for now. I'll let the CERN crew know so that they can have a look at the new WMAgent next week (we're gradually changing everything to CERN CentOS 7, as Scientific Linux CERN 6 is becoming obsolete).

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 32986 - Posted: 4 Nov 2017, 21:17:41 UTC
Tasks are still erroring out after 10 - 14 minutes: 207 (0x000000CF) EXIT_NO_SUB_TASKS
For one example, see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163663158

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 32988 - Posted: 4 Nov 2017, 22:24:02 UTC - in response to Message 32986.
Those condor_write failures are worrying, and perhaps significant, but alas I'm no HTCondor expert.

Erich56 Joined: 18 Dec 15 Posts: 1729 Credit: 110,081,963 RAC: 84,160
Message 33120 - Posted: 23 Nov 2017, 8:07:29 UTC
Within the last half hour, all CMS tasks have been erroring out after 12-14 minutes - the old problem, so to speak.
This might indicate that there are no jobs available, or is there some other problem?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 33121 - Posted: 23 Nov 2017, 8:42:36 UTC - in response to Message 33120.
> within the last half hour, all CMS tasks are erroring out after 12-14 minutes - the old problem, so to speak.
> Which might indicate that there are no jobs available, or is there any other problem?
You are right, the WMAgent failed about two hours ago and the job queue has drained. I've alerted CERN, but set No New Tasks in the meantime. Hopefully it will be fixed quickly.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 33122 - Posted: 23 Nov 2017, 8:50:14 UTC - in response to Message 33121. Last modified: 23 Nov 2017, 8:52:12 UTC
WMAgent is showing green again (Thanks, Alan!); I'm just waiting for my monitors to show jobs in the queue again.
[Edit] And there they are! We're up again. [/Edit]

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 33124 - Posted: 23 Nov 2017, 13:29:58 UTC
There is a scheduled intervention taking place, so our queue has drained again. The downtime is expected to last another hour or two, so set No New Tasks for a while.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 33126 - Posted: 23 Nov 2017, 17:21:55 UTC
The intervention didn't go smoothly, so they have rolled back to the previous server. As of a few minutes ago the queue was full and jobs were being served, so it should be safe to start accepting new tasks again.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1034 Credit: 6,744,424 RAC: 14,083
Message 33128 - Posted: 24 Nov 2017, 9:02:06 UTC - in response to Message 33126.
At the moment, it doesn't look like any planned interventions will affect us.

LHC@home