CMS Tasks Failing
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today. okay, so we'll see what the logs show tomorrow. |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
Another suspicious one got finished a few minutes ago: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151051271 Runtime is 3 hours longer than CPU time. There is definitely something wrong. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
It seems that there is an error in the stage-out phase. Can't hurt to have them. Do you have my Brunel or CERN emails? Otherwise PM me. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
Another suspicious one got finished a few minutes ago: OK, the job that started at 2017-07-15 15:14:14 seems to have taken an excessive amount of time, but didn't actually fail as far as I can see. I'll tickle Laurence, but don't expect any response this time on a Saturday night. [Edit] I have to go sleep soon, so don't expect anything more from me for 9-10 hours, unless inspiration strikes in my dreams! [/Edit] |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
Good morning, Ivan. Here is the next example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151087801 Runtime 14 hours 6 minutes, CPU time 9 hours 52 minutes. This is totally abnormal and different to what it has been before: https://lhcathome.cern.ch/lhcathome/result.php?resultid=150602181 Runtime 12 hours 43 minutes, CPU time 11 hours 55 minutes. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
I found a job of mine that failed, tho' I'm not sure what the failure is yet: 2017-07-15 14:27:45 (33070): Guest Log: Probing /cvmfs/grid.cern.ch... OK 2017-07-15 14:27:55 (33070): Guest Log: Probing /cvmfs/cms.cern.ch... OK 2017-07-15 14:27:56 (33070): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE 2017-07-15 14:27:56 (33070): Guest Log: 2.2.0.0 3473 0 21796 4572 14 1 1688792 10240001 2 65024 0 20 95 20791 3 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.168.202:3125 1 2017-07-15 14:28:03 (33070): Guest Log: [INFO] Reading volunteer information 2017-07-15 14:28:03 (33070): Guest Log: [INFO] Volunteer: ivan (9) Host: 1054 2017-07-15 14:28:03 (33070): Guest Log: [INFO] VMID: e157435d-c4c6-41b0-bda1-b31c0f9afa17 2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from LHC@home 2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev 2017-07-15 14:28:05 (33070): Guest Log: [INFO] CMS application starting. Check log files. 2017-07-15 14:28:06 (33070): Guest Log: [DEBUG] HTCondor ping 2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] 1 2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] DC_NOP failed! 2017-07-15 14:28:46 (33070): Guest Log: SECMAN:2006:Failed to establish a crypto key. 2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:06 recognized DC_NOP as command name, using command 60011. 2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:42 WARNING: globus returned with euid 0 2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:45 SECMAN: enable_mac has no key to use, failing... 2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor. 2017-07-15 14:28:48 (33070): Guest Log: [INFO] Shutting Down. 2017-07-15 14:28:48 (33070): VM Completion File Detected. 2017-07-15 14:28:48 (33070): VM Completion Message: Could not ping HTCondor. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
It seems that there is an error in the stage-out phase. You are exactly right, it's a stage-out problem, connecting to Data Bridge. This was, of course, strongly suggested by the pie graphs in the Job Activity page. I'll pass the files on to Laurence. There's another recurring error, seemingly from a typo in a script, but I'm not sure if it's one of ours, nor if it has serious repercussions. |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
OK, we've done our bit; now we wait. I'd suggest people set No New Tasks or temporarily transfer to other apps/projects to minimise wasted time. Thank you, Ivan, for your efforts (I knew that something was going wrong). |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
Cheers, Erich; I do my best but I can't do everything. It's just gotten worse, which may or may not indicate that someone's trying to do something. Lunchtime at the cricket Test (England vs. South Africa) so I'm off downtown for my shopping, back in an hour or so. [Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit] |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit] OMG - what's going on there? |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit] I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-( |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-( okay, thanks, I'll do this! |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
Just now another CMS got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand). Also here, the discrepancy between total running time and CPU time is nearly 4 1/2 hours. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
Just now another CMS got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand). I'm not seeing failures as my tasks run out, but the Dashboard graphs definitely are not showing any hint of green. But then again Dashboard is not always up to date. I'll keep monitoring but I'll probably go to bed relatively early, I've got a lot happening tomorrow. |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
... I've got a lot happening tomorrow. :-) :-) :-) |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
... I've got a lot happening tomorrow. Tja. A Group meeting. Three reports to write (including CMS@Home). Five or six servers to do security updates on. Two 4 TB disks to RAID into one of the servers if I can get hold of a couple of CMOS batteries for the matched-pair that are having the Intel equivalent of Alzheimer's. And a doctor's appointment. As well as trying to get the current stage-out problem fixed. |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,422,261 RAC: 27,955 |
Ivan, any news on the stage-out problem? |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,465 |
Ivan, any news on the stage-out problem? We are staging out jobs again, and the Data Bridge Web interface is working. Please proceed cautiously, but it looks like you can start running tasks again. |
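
For anyone wanting to check their own tasks against the symptom discussed in this thread (runtime much longer than CPU time), here is a minimal Python sketch of the arithmetic. The runtime and CPU-time figures are the ones quoted above for results 151087801 and 150602181; the helper names and the 0.8 efficiency threshold are assumptions chosen for illustration, not anything the project itself defines.

```python
# Sketch only: compares wall-clock runtime with CPU time for the two
# results quoted in the thread. Function names and the threshold are
# hypothetical, not part of BOINC or the CMS application.

def to_minutes(hours: int, minutes: int) -> int:
    """Convert an hours/minutes pair into total minutes."""
    return hours * 60 + minutes


def summarize(tasks: dict) -> dict:
    """Return {result_id: (gap_in_minutes, cpu_efficiency)} for each task."""
    summary = {}
    for result_id, (runtime_min, cpu_min) in tasks.items():
        gap = runtime_min - cpu_min          # time not spent computing
        efficiency = cpu_min / runtime_min   # fraction of runtime on CPU
        summary[result_id] = (gap, efficiency)
    return summary


# (runtime, CPU time) in minutes, as reported in the thread.
TASKS = {
    "151087801": (to_minutes(14, 6), to_minutes(9, 52)),
    "150602181": (to_minutes(12, 43), to_minutes(11, 55)),
}

# Assumed cut-off: below this fraction a task is flagged as suspicious.
SUSPICIOUS_BELOW = 0.8

summary = summarize(TASKS)

if __name__ == "__main__":
    for result_id, (gap, efficiency) in summary.items():
        flag = "suspicious" if efficiency < SUSPICIOUS_BELOW else "normal"
        print(f"{result_id}: gap {gap // 60} h {gap % 60} min, "
              f"efficiency {efficiency:.2f} ({flag})")
```

With these figures the gap is 4 h 14 min for the later result against 48 min for the earlier one, which is consistent with a task sitting idle while stage-out to the Data Bridge was failing.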
©2024 CERN