Thread 'CMS Tasks Failing'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31419 - Posted: 15 Jul 2017, 20:21:35 UTC - in response to Message 31417.
"I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today."
okay, so we'll see what the logs show tomorrow.

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31420 - Posted: 15 Jul 2017, 20:25:41 UTC
Another suspicious one got finished a few minutes ago: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151051271
Runtime is 3 hours longer than CPU time. There is definitely something wrong.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31423 - Posted: 15 Jul 2017, 22:20:22 UTC - in response to Message 31418. Last modified: 15 Jul 2017, 22:20:46 UTC
It seems that there is an error in the stage-out phase. I saved stderr.log and stdout.log of my currently running WU. Let me know if they are of interest. Can't hurt to have them. Do you have my Brunel or CERN emails? Otherwise PM me. My WU decided to finish its break while I was typing the message above. So, at the moment it is running normally.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31424 - Posted: 15 Jul 2017, 22:24:41 UTC - in response to Message 31420. Last modified: 15 Jul 2017, 22:34:42 UTC
"Another suspicious one got finished a few minutes ago: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151051271 Runtime is 3 hours longer than CPU time. There is definitely something wrong."
OK, the job that started at 2017-07-15 15:14:14 seems to have taken an excessive amount of time, but didn't actually fail as far as I can see. I'll tickle Laurence, but don't expect any response at this time on a Saturday night.
[Edit] I have to go to sleep soon, so don't expect anything more from me for 9-10 hours, unless inspiration strikes in my dreams! [/Edit]

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31425 - Posted: 16 Jul 2017, 6:09:04 UTC - in response to Message 31424. Last modified: 16 Jul 2017, 6:12:51 UTC
Good morning, Ivan. Here is the next example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=151087801
Runtime 14 hours 6 minutes
CPU time 9 hours 52 minutes
This is totally abnormal and different to what it has been before:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=150602181
Runtime 12 hours 43 minutes
CPU time 11 hours 55 minutes
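
The two results in the post above can be compared by the gap between wall-clock runtime and CPU time. What follows is a minimal Python sketch that uses only the figures quoted in the post; the function name `wall_cpu_gap` and the "CPU efficiency" ratio are illustrative choices, not values the project itself reports.

```python
from datetime import timedelta


def wall_cpu_gap(runtime, cpu_time):
    """Return (gap, efficiency) for one task.

    gap        -- wall-clock time during which the task used no CPU
    efficiency -- CPU time as a fraction of wall-clock runtime
    """
    gap = runtime - cpu_time
    efficiency = cpu_time / runtime  # timedelta / timedelta yields a float
    return gap, efficiency


# Figures copied from the post; the keys are the result IDs in the linked URLs.
tasks = {
    "151087801": (timedelta(hours=14, minutes=6), timedelta(hours=9, minutes=52)),
    "150602181": (timedelta(hours=12, minutes=43), timedelta(hours=11, minutes=55)),
}

for result_id, (runtime, cpu_time) in tasks.items():
    gap, eff = wall_cpu_gap(runtime, cpu_time)
    print(f"result {result_id}: gap {gap}, CPU efficiency {eff:.0%}")
```

For the recent task the gap is 4 hours 14 minutes (about 70% efficiency), against 48 minutes (about 94%) for the earlier one, which is consistent with a task sitting idle for a long stretch rather than computing.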

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31426 - Posted: 16 Jul 2017, 7:52:20 UTC
I found a job of mine that failed, tho' I'm not sure what the failure is yet:
2017-07-15 14:27:45 (33070): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-07-15 14:27:55 (33070): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2017-07-15 14:27:56 (33070): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-07-15 14:27:56 (33070): Guest Log: 2.2.0.0 3473 0 21796 4572 14 1 1688792 10240001 2 65024 0 20 95 20791 3 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.168.202:3125 1
2017-07-15 14:28:03 (33070): Guest Log: [INFO] Reading volunteer information
2017-07-15 14:28:03 (33070): Guest Log: [INFO] Volunteer: ivan (9) Host: 1054
2017-07-15 14:28:03 (33070): Guest Log: [INFO] VMID: e157435d-c4c6-41b0-bda1-b31c0f9afa17
2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2017-07-15 14:28:05 (33070): Guest Log: [INFO] CMS application starting. Check log files.
2017-07-15 14:28:06 (33070): Guest Log: [DEBUG] HTCondor ping
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] 1
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] DC_NOP failed!
2017-07-15 14:28:46 (33070): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:06 recognized DC_NOP as command name, using command 60011.
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:42 WARNING: globus returned with euid 0
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:45 SECMAN: enable_mac has no key to use, failing...
2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor.
2017-07-15 14:28:48 (33070): Guest Log: [INFO] Shutting Down.
2017-07-15 14:28:48 (33070): VM Completion File Detected.
2017-07-15 14:28:48 (33070): VM Completion Message: Could not ping HTCondor.
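
The failure in a log like the one above can be picked out mechanically. This is a rough Python sketch, assuming the task log has been saved as plain text with one entry per line in the `timestamp (pid): message` shape seen in the post. The regular expression and the function name `find_errors` are illustrative and not part of BOINC or the CMS application; the sample lines are copied from the post.

```python
import re

# One entry per line: "YYYY-MM-DD HH:MM:SS (pid): message".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): (?P<msg>.*)$"
)


def find_errors(log_text):
    """Return (timestamp, message) pairs for [ERROR] entries and the
    VM completion message, in the order they appear."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        msg = match.group("msg")
        if "[ERROR]" in msg or msg.startswith("VM Completion Message:"):
            hits.append((match.group("ts"), msg))
    return hits


# A few lines copied from the log in the post.
sample = """\
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] DC_NOP failed!
2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor.
2017-07-15 14:28:48 (33070): Guest Log: [INFO] Shutting Down.
2017-07-15 14:28:48 (33070): VM Completion Message: Could not ping HTCondor.
"""

for ts, msg in find_errors(sample):
    print(ts, msg)
```

Filtering on `[ERROR]` alone would miss the `[DEBUG] DC_NOP failed!` and `SECMAN` lines that precede it, so when diagnosing a real failure it is worth reading the entries just before the first hit as well.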

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31427 - Posted: 16 Jul 2017, 9:09:09 UTC - in response to Message 31418.
"It seems that there is an error in the stage-out phase. I saved stderr.log and stdout.log of my currently running WU. Let me know if they are of interest."
You are exactly right, it's a stage-out problem, connecting to Data Bridge. This was, of course, strongly suggested by the pie graphs in the Job Activity page. I'll pass the files on to Laurence. There's another recurring error, seemingly from a typo in a script, but I'm not sure if it's one of ours, nor if it has serious repercussions.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31429 - Posted: 16 Jul 2017, 9:18:07 UTC - in response to Message 31427.
OK, we've done our bit; now we wait. I'd suggest people set No New Tasks or temporarily transfer to other apps/projects to minimise wasted time.

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31430 - Posted: 16 Jul 2017, 9:44:36 UTC - in response to Message 31429.
"OK, we've done our bit; now we wait. I'd suggest people set No New Tasks or temporarily transfer to other apps/projects to minimise wasted time."
Thank you, Ivan, for your efforts (I knew that something was going wrong).

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31431 - Posted: 16 Jul 2017, 12:14:03 UTC - in response to Message 31430. Last modified: 16 Jul 2017, 13:54:42 UTC
Cheers, Erich; I do my best but I can't do everything. It's just gotten worse, which may or may not indicate that someone's trying to do something. Lunchtime at the cricket Test (England vs. South Africa) so I'm off downtown for my shopping, back in an hour or so.
[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit]

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31433 - Posted: 16 Jul 2017, 14:19:17 UTC - in response to Message 31431.
"[Edit] Well, it's definitely got worse now. 'Watching and waiting,' as the Moody Blues sang. [/Edit]"
OMG - what's going on there?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31435 - Posted: 16 Jul 2017, 15:37:19 UTC - in response to Message 31433. Last modified: 16 Jul 2017, 16:01:33 UTC
"[Edit] Well, it's definitely got worse now. 'Watching and waiting,' as the Moody Blues sang. [/Edit]"
"OMG - what's going on there?"
I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-(

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31436 - Posted: 16 Jul 2017, 16:13:46 UTC - in response to Message 31435.
"I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-("
okay, thanks, I'll do this!

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31437 - Posted: 16 Jul 2017, 16:41:46 UTC
Just now another CMS task got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand). Here too, the discrepancy between total running time and CPU time is nearly 4 1/2 hours.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31440 - Posted: 16 Jul 2017, 17:01:13 UTC - in response to Message 31437.
"Just now another CMS task got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand). Here too, the discrepancy between total running time and CPU time is nearly 4 1/2 hours."
I'm not seeing failures as my tasks run out, but the Dashboard graphs definitely are not showing any hint of green. But then again Dashboard is not always up to date. I'll keep monitoring but I'll probably go to bed relatively early; I've got a lot happening tomorrow.

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31442 - Posted: 16 Jul 2017, 18:44:38 UTC - in response to Message 31440.
"... I've got a lot happening tomorrow."
:-) :-) :-)

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31443 - Posted: 16 Jul 2017, 18:56:20 UTC - in response to Message 31442.
"... I've got a lot happening tomorrow."
":-) :-) :-)"
Tja. A Group meeting. Three reports to write (including CMS@Home). Five or six servers to do security updates on. Two 4 TB disks to RAID into one of the servers if I can get hold of a couple of CMOS batteries for the matched-pair that are having the Intel equivalent of Alzheimer's. And a doctor's appointment. As well as trying to get the current stage-out problem fixed.

Erich56 Joined: 18 Dec 15 Posts: 1989 Credit: 162,837,390 RAC: 91,255	Message 31464 - Posted: 17 Jul 2017, 14:59:58 UTC - in response to Message 31443.
Ivan, any news on the stage-out problem?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31467 - Posted: 17 Jul 2017, 17:20:39 UTC - in response to Message 31464.
"Ivan, any news on the stage-out problem?"
Only that the LHC@Home crew is aware of it and working on it. I'll let you know as soon as I hear anything.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 31483 - Posted: 18 Jul 2017, 14:39:07 UTC - in response to Message 31467.
"Ivan, any news on the stage-out problem?"
"Only that the LHC@Home crew is aware of it and working on it. I'll let you know as soon as I hear anything."
We are staging out jobs again, and the Data Bridge Web interface is working. Please proceed cautiously, but it looks like you can start running tasks again.