Message boards : CMS Application : CMS Tasks Failing
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
Again, last night two cases with no connection to Condor; one of them is:

2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.

What's wrong?
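For anyone who wants to reproduce from the host the probe the VM runs (the `nc` test against port 9618 in the log above), a minimal Python sketch follows. The host name and port come from the log; the function name `can_connect` and everything else are illustrative, not part of the CMS app:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, mirroring the nc-style probe in the VM log."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # timed out, refused, unreachable, DNS failure, ...
        return False

if __name__ == "__main__":
    # Host and port taken from the log above.
    if can_connect("vccondor01.cern.ch", 9618):
        print("Condor server reachable on port 9618")
    else:
        print("Could not connect to Condor server on port 9618")
```

A one-off success here doesn't prove much, since the failures in the log appear to be transient; see the discussion of repeated probes further down the thread.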
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 201
Well, the glib answer is that there's a connectivity problem somewhere. :-/ Exactly where is the question. Nils seems to have no problem, according to a recent post, but he is (I believe) quite local to CERN. I'm not having any problems that I'm aware of, from Heathrow North (but through an academic network). Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
> Well, the glib answer is that there's a connectivity problem somewhere. :-/

Yes, that's the question.

> Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?

In my case: Vienna, Austria. What I have now done: I tried to ping the Condor server (vccondor01.cern.ch) from all my PCs at various times, always with success. So the problem, when it occurs, seems to be there only for a very short time, and not too often. It will be very hard to find out what it really is.
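Since a manual ping will almost always miss an outage that lasts only seconds, something like the sketch below could log timestamped failures so that times and locations from different volunteers become comparable. This is a hypothetical helper, not part of the VM: the names `probe` and `watch` are made up, only the host `vccondor01.cern.ch` and port 9618 come from the logs in this thread:

```python
import socket
import time
from datetime import datetime, timezone

def probe(host="vccondor01.cern.ch", port=9618, timeout=5.0):
    """One TCP probe; returns (utc_timestamp, reachable)."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return stamp, True
    except OSError:
        return stamp, False

def watch(interval=60.0, count=None):
    """Probe every `interval` seconds, printing only the failures.
    UTC timestamps make reports from different time zones comparable."""
    n = 0
    while count is None or n < count:
        stamp, ok = probe()
        if not ok:
            print(f"{stamp} could not reach vccondor01.cern.ch:9618")
        n += 1
        time.sleep(interval)
```

Left running in the background (e.g. `watch()` in a terminal), the printed failure timestamps could then be posted here alongside the location.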
Joined: 5 Nov 15 · Posts: 144 · Credit: 6,301,268 · RAC: 0
> Well, the glib answer is that there's a connectivity problem somewhere. :-/

My machines (in Missouri) were unable to process Theory for 3 hours, from 14:33 UTC to 17:26 UTC on Nov 1. Some of the WUs took over 2 hours before they failed. 60 WUs across 3 machines just stopped processing, and all but 10 ended in the same variety of ping errors Erich56 listed above. I was going to say that maybe the DB-lock issue from the other thread (https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4496&postid=32952#32952) is related, but that would mean Erich56's WUs would have had to wait to fail from 14:30 UTC till 20 UTC (from Erich56's posted log). Is that a possibility?

@Erich56, did you notice if the WUs that continued to be in RAM were actually using CPU cycles?
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
> @Erich56, did you notice if the WU's that continued to be in RAM were actually using CPU cycles?

No, they did NOT.
Joined: 28 Sep 04 · Posts: 732 · Credit: 49,373,095 · RAC: 13,741
I haven't done any CMS for a long time but decided to give it a go while SixTrack tasks were not available. I use these single-thread tasks to fill the CPU while running ATLAS tasks on 3 CPU threads and some SETI and Einstein tasks on the GPU. The first CMS task went well, but the second one ran successfully for about four hours before failing. The failed task is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163466015

The task was paused 4 times while BOINC switched tasks (more CPU cores were required for running two Einstein tasks on the GPU at the same time). Three times the task resumed successfully, but the 4th time it failed, and the task ended in error 206 (0x000000CE) EXIT_INIT_FAILURE. This is an odd error because it had already finished a few jobs. Needless to say, 0 credit was given because of this error. It would be nice if this situation could be recognized and credit given based on the finished jobs. Even better if we could get rid of these errors altogether.

Has the idea ever been contemplated of using BOINC as it is meant to be used, i.e. packing all the necessary files for a task into one or more zip files and downloading them before starting to crunch? This would avoid the constantly required communication with the CERN servers. After a task is finished, the results would be uploaded back to CERN in one go, as most other projects do. Just my two cents.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
> again, last night two cases with no connection to Condor; one of them is:

Meanwhile, I am experiencing this problem on all 3 computers which I use for CMS crunching (at the beginning, it occurred only on one of them, so I was wondering whether it might have to do with that specific system). Something must be "shaky" with the Condor server ... :-(
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 201
There was a spike in the failed jobs overnight, but I've not seen anything else amiss. We have made some changes lately which seem to have affected the merge jobs but should not be noticeable to volunteers. The current batch has a few hours to run, and then some more changes/bug-fixes should come into operation. I can't say yet what effect these will have, so there may be some disruption later today.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
This late morning, some more tasks failed here, under the "title" 207 (0x000000CF) EXIT_NO_SUB_TASKS. While the connection to the Condor server was successful, there were some very strange things in the stderr, like:

2017-11-04 12:22:14 (780): Guest Log: Did the tarball get created?
2017-11-04 12:22:14 (780): Guest Log: /tmp/CMS_25434_1509769558.937510_0.tgz
2017-11-04 12:22:14 (780): Guest Log: Here is the upload output
2017-11-04 12:22:14 (780): Guest Log: Here is the upload error
2017-11-04 12:22:14 (780): Guest Log: Here is the condor directory
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:11:55 ** Log last touched time unavailable (No such file or directory)
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 condor_write(): Socket closed when trying to write 2307 bytes to collector vccondor01.cern.ch, fd is 12
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 Buf::write(): condor_write() failed
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 AllReaper unexpectedly called on pid 4096, status 0.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 The STARTD (pid 4096) exited with status 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 All daemons are gone. Exiting.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 **** condor_master (condor_MASTER) pid 4086 EXITING WITH STATUS 0
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Shutting down Condor on this machine.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Got SIGTERM. Performing graceful shutdown.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 shutdown graceful
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 CronJob: 'multicore': Trying to kill illegal PID 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Killing job multicore
...
2017-11-04 12:22:14 (780): Guest Log: [ERROR] No jobs were available to run.
2017-11-04 12:22:14 (780): Guest Log: [INFO] Shutting Down.

Really strange; I have never seen this before. Does anyone have any idea what this is all about?
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 201
Something's not going right. We should have started picking up jobs from a new WMAgent by now, but we haven't. The queue for the old batch has drained and running jobs are dropping. I submitted a new batch to the old WMAgent, but it will be tens of minutes before the jobs start to arrive at the Condor server. I suspect the problem is that the queue may be saturated with merge jobs that haven't reached their third retry yet, so production jobs can't run. Either way, I suggest setting No New Tasks for a while until things are clearer. I'll be back from shopping in an hour or so; I'll let you know if the situation is any better then.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
Thanks for the information, Ivan. Some 20 minutes ago, I had another two failing CMS tasks within two minutes.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 201
> Thanks for the Information, Ivan.

That's not good, but it's not an area I can help with, AFAIK. On a happier note, jobs from the new batch are making it into the queue, so that panic's over for now. I'll let the CERN crew know so that they can have a look at the new WMAgent next week (we're gradually changing everything to CERN CentOS 7, as Scientific Linux CERN 6 is becoming obsolete).
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
Still, tasks are erroring out after 10-14 minutes with 207 (0x000000CF) EXIT_NO_SUB_TASKS. One example is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163663158
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,983,773 · RAC: 18,214
Within the last half hour, all CMS tasks have been erroring out after 12-14 minutes - the old problem, so to speak. Might this indicate that there are no jobs available, or is there some other problem?
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 201
> within the last half hour, all CMS tasks are erroring out after 12-14 minutes - the old problem, so to speak. Which might indicate that there are no jobs available, or is there any other problem?

You are right, the WMAgent failed about two hours ago and the job queue has drained. I've alerted CERN, but set No New Tasks in the meantime. Hopefully it will be fixed quickly.