Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32954 - Posted: 1 Nov 2017, 6:48:06 UTC
Last modified: 1 Nov 2017, 6:48:19 UTC

again, last night two cases with no connection to Condor; one of them is:

2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.

What's wrong?
ID: 32954
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32955 - Posted: 1 Nov 2017, 8:56:50 UTC - in response to Message 32954.  

Well, the glib answer is that there's a connectivity problem somewhere. :-/
Exactly where is the question. Nils seems to have no problem, according to a recent post, but he is (I believe) quite local to CERN. I'm not having any problems that I'm aware of, from Heathrow North (but through an academic network). Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?
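For anyone gathering times to post, here is a minimal Python sketch (an illustration only, not a project tool) that pulls the timestamps of failed Condor connection checks out of a task's stderr text. The line layout is assumed from the log excerpt quoted earlier in this thread; `condor_failures` and `LINE_RE` are hypothetical names. The log does not say whether its timestamps are local time or UTC, so it is worth stating which when posting.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt in this thread:
# 2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def condor_failures(stderr_text):
    """Return the timestamps of '[ERROR] Could not connect to Condor' lines."""
    hits = []
    for line in stderr_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m["level"] == "ERROR" and "Could not connect to Condor" in m["msg"]:
            hits.append(datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"))
    return hits


sample = """\
2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.
"""

# Print one line per failed connection check found in the sample.
for ts in condor_failures(sample):
    print(ts.isoformat(sep=" "))
```

Posting the resulting times together with a rough location should be enough to look for a geographic pattern.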
ID: 32955
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32957 - Posted: 1 Nov 2017, 14:27:44 UTC - in response to Message 32955.  

> Well, the glib answer is that there's a connectivity problem somewhere. :-/
> Exactly where is the question.

yes, that's the question.

> Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?

in my case: Vienna, Austria.

What I now did: I have tried to ping the Condor server (vccondor01.cern.ch) from all my PCs at various times, always with success.

So the problem, when it occurs, seems to be there only for a very short time. And not too often.

Will be very hard to find out what it really is.
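One caveat on the ping test: ping uses ICMP, whereas the failing check in the log is a TCP connection to port 9618 made with nc, so a successful ping does not show that the port was reachable at that moment. A minimal Python sketch of a closer equivalent is below. `can_connect` is a hypothetical helper, and the 30-second default is only inferred from the gap between the 21:46:06 and 21:46:36 log lines.

```python
import socket


def can_connect(host, port, timeout=30.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and DNS failures
        return False


# Example call, using the host and port from the log in this thread:
#     can_connect("vccondor01.cern.ch", 9618)
```

Running such a check at intervals and noting when it returns False would show whether the outages are as brief as they appear. It only tests the path from the host PC, which may differ from what the VM sees.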
ID: 32957
marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 32963 - Posted: 2 Nov 2017, 0:41:30 UTC - in response to Message 32955.  
Last modified: 2 Nov 2017, 0:43:50 UTC

> Well, the glib answer is that there's a connectivity problem somewhere. :-/
> Exactly where is the question. Nils seems to have no problem, according to a recent post, but he is (I believe) quite local to CERN. I'm not having any problems that I'm aware of, from Heathrow North (but through an academic network). Perhaps if people could post times and locations where this happens, we could get a geographic sense of what the problem might be?

My machines (in Missouri) were unable to process Theory for 3 hours from 14:33 UTC - 17:26 UTC Nov 1. Some of the WU's took over 2 hours before they failed.
60 WU's across 3 machines just stopped processing and all but 10 ended in the same variety of ping errors Erich56 listed above.
I was going to say that maybe the DB lock issue from the other thread (https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4496&postid=32952#32952) is related, but that would mean Erich56's WU's would need to have waited to fail from 14:30 UTC till 20 UTC (from Erich56's posted log).

Is that a possibility?
@Erich56, did you notice if the WU's that continued to be in RAM were actually using CPU cycles?
ID: 32963
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32964 - Posted: 2 Nov 2017, 6:23:12 UTC - in response to Message 32963.  

> @Erich56, did you notice if the WU's that continued to be in RAM were actually using CPU cycles?

no, they did NOT.
ID: 32964
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,373,095
RAC: 13,741
Message 32974 - Posted: 3 Nov 2017, 15:49:57 UTC

I haven't done any CMS for a long time but decided to give it a go while sixtrack tasks were not available. I use these single thread tasks to fill the CPU while running Atlas tasks on 3 CPU threads and some Seti and Einstein tasks on the GPU. The first CMS task went well but the second one ran successfully for about four hours before failing. The failed task is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163466015

The task was paused 4 times while Boinc switched tasks (more CPU cores were required for running two Einstein tasks on a GPU at the same time). Three times the task was successfully continued but the 4th time it failed and the task ended in error 206 (0x000000CE) EXIT_INIT_FAILURE. This is an odd error because it had already finished a few jobs. Needless to say, 0 credit was given because of this error. It would be nice if this situation could be recognized and credit given based on the finished jobs. Even better if we can get rid of these errors altogether.

Has the idea ever been contemplated of using Boinc as it is meant to be used, i.e. pack all the necessary files for a task in one or more zip files and download them before starting to crunch? This would avoid the constantly required communication with the Cern servers. After the task is finished, upload the results back to Cern in one go like most other projects do. Just my two cents.
ID: 32974
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32976 - Posted: 4 Nov 2017, 7:39:01 UTC - in response to Message 32954.  

> again, last night two cases with no connection to Condor; one of them is:
>
> 2017-10-31 21:46:06 (1372): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
> 2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
> 2017-10-31 21:46:36 (1372): Guest Log: [DEBUG] 1
> 2017-10-31 21:46:36 (1372): Guest Log: [ERROR] Could not connect to Condor server on port 9618
> 2017-10-31 21:46:36 (1372): Guest Log: [INFO] Shutting Down.
>
> What's wrong?


meanwhile, I am experiencing this problem on all 3 computers which I use for CMS crunching (at the beginning, it occurred only on one of them, so I was wondering whether it might have to do with this specific system).

Something must be "shaky" with the Condor Server ... :-(
ID: 32976
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32977 - Posted: 4 Nov 2017, 10:26:51 UTC - in response to Message 32976.  

There was a spike in the failed jobs overnight, but I've not seen anything else amiss. We have made some changes lately which seem to have affected the merge jobs but should not be noticeable to volunteers. The current batch has a few hours to run and then some more changes/bug-fixes should come into operation. I can't say yet what effect these will have, so there may be some disruption later today.
ID: 32977
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32979 - Posted: 4 Nov 2017, 13:30:13 UTC

This late morning, some more tasks failed here, under the "title"
207 (0x000000CF) EXIT_NO_SUB_TASKS

while connection to the Condor Server was successful, there were some very strange things contained in the stderr, like:

2017-11-04 12:22:14 (780): Guest Log: Did the tarball get created?
2017-11-04 12:22:14 (780): Guest Log: /tmp/CMS_25434_1509769558.937510_0.tgz
2017-11-04 12:22:14 (780): Guest Log: Here is the upload output
2017-11-04 12:22:14 (780): Guest Log: Here is the upload error
2017-11-04 12:22:14 (780): Guest Log: Here is the condor directory
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:11:55 ** Log last touched time unavailable (No such file or directory)
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 condor_write(): Socket closed when trying to write 2307 bytes to collector vccondor01.cern.ch, fd is 12
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 Buf::write(): condor_write() failed
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 AllReaper unexpectedly called on pid 4096, status 0.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 The STARTD (pid 4096) exited with status 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 All daemons are gone. Exiting.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 **** condor_master (condor_MASTER) pid 4086 EXITING WITH STATUS 0
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Shutting down Condor on this machine.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Got SIGTERM. Performing graceful shutdown.
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 shutdown graceful
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
...
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 CronJob: 'multicore': Trying to kill illegal PID 0
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Cron: Killing all jobs
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Killing job multicore
...
2017-11-04 12:22:14 (780): Guest Log: [ERROR] No jobs were available to run.
2017-11-04 12:22:14 (780): Guest Log: [INFO] Shutting Down.

really strange, I have never seen this before. Does anyone have any idea what this is all about?
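One thing the excerpt does show is that the lines were all flushed to stderr at 12:22:14, while the HTCondor messages inside them carry their own, earlier timestamps. A minimal Python sketch for lining those up is below. `inner_events` and `INNER_RE` are hypothetical names, and the inner date format is assumed to be MM/DD/YY from the lines above. It measures the gap between the first `condor_write` failure and the SIGTERM line; it says nothing about what caused either.

```python
import re
from datetime import datetime

# Matches the HTCondor part of a line such as:
# ... Guest Log: 11/04/17 12:12:33 condor_write(): Socket closed when ...
INNER_RE = re.compile(r"Guest Log: (\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def inner_events(stderr_text):
    """Yield (timestamp, message) for each line carrying an inner Condor timestamp."""
    for line in stderr_text.splitlines():
        m = INNER_RE.search(line)
        if m:
            yield datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S"), m.group(2)


sample = """\
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 condor_write(): Socket closed when trying to write 2307 bytes to collector vccondor01.cern.ch, fd is 12
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:12:33 Buf::write(): condor_write() failed
2017-11-04 12:22:14 (780): Guest Log: 11/04/17 12:22:02 Got SIGTERM. Performing graceful shutdown.
2017-11-04 12:22:14 (780): Guest Log: [ERROR] No jobs were available to run.
"""

events = list(inner_events(sample))
first_failure = next(ts for ts, msg in events if "condor_write" in msg)
shutdown = next(ts for ts, msg in events if "SIGTERM" in msg)

# Time between the first write failure and the shutdown signal.
print(shutdown - first_failure)
```

For the lines quoted above, the gap comes out at roughly nine and a half minutes.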
ID: 32979
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32980 - Posted: 4 Nov 2017, 14:18:44 UTC - in response to Message 32979.  

Something's not going right. We should have started picking up jobs from a new WMAgent by now, but we haven't. The queue for the old batch has drained and running jobs are dropping. I submitted a new batch to the old WMAgent but it will be tens of minutes before they start to arrive at the Condor server.
I suspect the problem is that the queue may be saturated with merge jobs that haven't reached their third retry yet so production jobs can't run.
Whichever, I suggest setting to No New Tasks for a while until things are clearer. I'll be back from shopping in an hour or so, I'll let you know if the situation is any better then.
ID: 32980
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32981 - Posted: 4 Nov 2017, 14:30:44 UTC

Thanks for the Information, Ivan.

Some 20 minutes ago, I had another two failing CMS tasks within two minutes.
ID: 32981
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32983 - Posted: 4 Nov 2017, 15:25:53 UTC - in response to Message 32981.  
Last modified: 4 Nov 2017, 15:26:14 UTC

> Thanks for the Information, Ivan.
>
> Some 20 minutes ago, I had another two failing CMS tasks within two minutes.

That's not good, but it's not an area I can help with AFAIK.

On a happier note, jobs from the new batch are making it into the queue, so that panic's over for now. I'll let the CERN crew know so that they can have a look at the new WMAgent next week (we're gradually changing everything to CERN CentOS 7, as Scientific Linux CERN 6 is becoming obsolete).
ID: 32983
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32986 - Posted: 4 Nov 2017, 21:17:41 UTC

Tasks are still erroring out after 10 - 14 minutes:

207 (0x000000CF) EXIT_NO_SUB_TASKS

one example see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163663158
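When tallying failures like these, the status line can be split into its parts mechanically. A minimal Python sketch, assuming the `decimal (0xHEX) NAME` layout seen in this thread; `parse_exit_status` is a hypothetical helper, and it only cross-checks that the decimal and hex forms agree (0xCF is 207, 0xCE is 206).

```python
import re

# Assumed layout: "<decimal> (0x<hex>) <NAME>", as quoted in this thread.
STATUS_RE = re.compile(r"^(\d+) \(0x([0-9A-Fa-f]+)\) (\w+)$")


def parse_exit_status(text):
    """Split '207 (0x000000CF) EXIT_NO_SUB_TASKS' into (code, name).

    Raises ValueError if the line does not match or if the decimal
    and hex forms of the code disagree.
    """
    m = STATUS_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognised status line: {text!r}")
    code, hex_code, name = int(m.group(1)), int(m.group(2), 16), m.group(3)
    if code != hex_code:
        raise ValueError(f"decimal {code} != hex 0x{hex_code:X}")
    return code, name


print(parse_exit_status("207 (0x000000CF) EXIT_NO_SUB_TASKS"))
print(parse_exit_status("206 (0x000000CE) EXIT_INIT_FAILURE"))
```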
ID: 32986
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32988 - Posted: 4 Nov 2017, 22:24:02 UTC - in response to Message 32986.  

Those condor_write failures are worrying, and perhaps significant but alas I'm no HTCondor expert.
ID: 32988
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 33120 - Posted: 23 Nov 2017, 8:07:29 UTC

within the last half hour, all CMS tasks are erroring out after 12-14 minutes - the old problem, so to speak. Which might indicate that there are no jobs available, or is there any other problem?
ID: 33120
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 33121 - Posted: 23 Nov 2017, 8:42:36 UTC - in response to Message 33120.  

> within the last half hour, all CMS tasks are erroring out after 12-14 minutes - the old problem, so to speak. Which might indicate that there are no jobs available, or is there any other problem?

You are right, the WMAgent failed about two hours ago and the job queue has drained. I've alerted CERN, but set No New Tasks in the meantime. Hopefully it will be fixed quickly.
ID: 33121
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 33122 - Posted: 23 Nov 2017, 8:50:14 UTC - in response to Message 33121.  
Last modified: 23 Nov 2017, 8:52:12 UTC

WMAgent is showing green again (Thanks, Alan!); I'm just waiting for my monitors to show jobs in the queue again.
[Edit] And there they are! We're up again. [/Edit]
ID: 33122
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 33124 - Posted: 23 Nov 2017, 13:29:58 UTC

There is a scheduled intervention taking place, so our queue has drained again. The downtime is expected to last another hour or two, so set No New Tasks for a while.
ID: 33124
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 33126 - Posted: 23 Nov 2017, 17:21:55 UTC

The intervention didn't go smoothly, so they have rolled back to the previous server. As of a few minutes ago the queue was full and jobs were being served, so it should be safe to start accepting new tasks again.
ID: 33126
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 33128 - Posted: 24 Nov 2017, 9:02:06 UTC - in response to Message 33126.  

At the moment, it doesn't look like any planned interventions will affect us.
ID: 33128

