Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31419 - Posted: 15 Jul 2017, 20:21:35 UTC - in response to Message 31417.  

I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today.

okay, so we'll see what the logs show tomorrow.
ID: 31419
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31420 - Posted: 15 Jul 2017, 20:25:41 UTC

Another suspicious one got finished a few minutes ago:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=151051271

Runtime is 3 hours longer than CPU time. There is definitely something wrong.
ID: 31420
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31423 - Posted: 15 Jul 2017, 22:20:22 UTC - in response to Message 31418.  
Last modified: 15 Jul 2017, 22:20:46 UTC

It seems that there is an error in the stage-out phase.
I saved stderr.log and stdout.log of my currently running WU.
Let me know if they are of interest.
Can't hurt to have them. Do you have my Brunel or CERN emails? Otherwise PM me.


My WU decided to finish its break while I was typing the message above.
So, at the moment it is running normally.

ID: 31423
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31424 - Posted: 15 Jul 2017, 22:24:41 UTC - in response to Message 31420.  
Last modified: 15 Jul 2017, 22:34:42 UTC

Another suspicious one got finished a few minutes ago:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=151051271

Runtime is 3 hours longer than CPU time. There is definitely something wrong.

OK, the job that started at 2017-07-15 15:14:14 seems to have taken an excessive amount of time, but didn't actually fail as far as I can see. I'll tickle Laurence, but don't expect any response at this time on a Saturday night.
[Edit] I have to go sleep soon, so don't expect anything more from me for 9-10 hours, unless inspiration strikes in my dreams! [/Edit]
ID: 31424
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31425 - Posted: 16 Jul 2017, 6:09:04 UTC - in response to Message 31424.  
Last modified: 16 Jul 2017, 6:12:51 UTC

good morning, Ivan.

Here is the next example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151087801

Runtime 14 hours 6 minutes
CPU-Time 9 hours 52 minutes

This is totally abnormal and different to what it has been before:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=150602181
Runtime 12 hours 43 minutes
CPU time 11 hours 55 minutes
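
To put the two results side by side, here is a minimal Python sketch of the arithmetic. It uses only the runtime and CPU-time figures quoted above; the helper names are made up for illustration, and "efficiency" here simply means CPU time divided by runtime, not a value reported by BOINC.

```python
def to_minutes(hours, minutes):
    """Convert an hours/minutes pair to total minutes."""
    return hours * 60 + minutes


def gap_and_efficiency(runtime_min, cpu_min):
    """Return (wall-clock minutes not spent on CPU, CPU time / runtime)."""
    return runtime_min - cpu_min, cpu_min / runtime_min


# Figures quoted in the post above.
suspect = gap_and_efficiency(to_minutes(14, 6), to_minutes(9, 52))
earlier = gap_and_efficiency(to_minutes(12, 43), to_minutes(11, 55))

for label, (gap, eff) in (("result 151087801", suspect),
                          ("result 150602181", earlier)):
    print(f"{label}: gap {gap // 60} h {gap % 60:02d} min, "
          f"CPU efficiency {eff:.0%}")
```

For these figures the gap is 4 h 14 min (about 70% efficiency) for the suspicious task, against 48 min (about 94%) for the earlier one.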
ID: 31425
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31426 - Posted: 16 Jul 2017, 7:52:20 UTC

I found a job of mine that failed, tho' I'm not sure what the failure is yet:
2017-07-15 14:27:45 (33070): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-07-15 14:27:55 (33070): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2017-07-15 14:27:56 (33070): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-07-15 14:27:56 (33070): Guest Log: 2.2.0.0 3473 0 21796 4572 14 1 1688792 10240001 2 65024 0 20 95 20791 3 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.168.202:3125 1
2017-07-15 14:28:03 (33070): Guest Log: [INFO] Reading volunteer information
2017-07-15 14:28:03 (33070): Guest Log: [INFO] Volunteer: ivan (9) Host: 1054
2017-07-15 14:28:03 (33070): Guest Log: [INFO] VMID: e157435d-c4c6-41b0-bda1-b31c0f9afa17
2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-07-15 14:28:04 (33070): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2017-07-15 14:28:05 (33070): Guest Log: [INFO] CMS application starting. Check log files.
2017-07-15 14:28:06 (33070): Guest Log: [DEBUG] HTCondor ping
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] 1
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] DC_NOP failed!
2017-07-15 14:28:46 (33070): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:06 recognized DC_NOP as command name, using command 60011.
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:42 WARNING: globus returned with euid 0
2017-07-15 14:28:46 (33070): Guest Log: 07/15/17 14:28:45 SECMAN: enable_mac has no key to use, failing...
2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor.
2017-07-15 14:28:48 (33070): Guest Log: [INFO] Shutting Down.
2017-07-15 14:28:48 (33070): VM Completion File Detected.
2017-07-15 14:28:48 (33070): VM Completion Message: Could not ping HTCondor.
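
For anyone who wants to check their own stderr output for the same pattern, the following is a rough Python sketch that pulls out the error-like lines. The line format is inferred from the excerpt above, and the list of marker strings is an assumption chosen to match this one log, not an official list of failure signatures.

```python
import re

# Matches lines such as:
# 2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (?P<body>.*)$"
)

# Assumption: substrings that looked like trouble in the excerpt above.
MARKERS = ("[ERROR]", "failed", "Failed", "VM Completion Message:")


def find_failures(log_text):
    """Return (timestamp, message) pairs for lines containing a marker."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and any(marker in m.group("body") for marker in MARKERS):
            hits.append((m.group("ts"), m.group("body")))
    return hits


# A few lines copied from the log above.
SAMPLE = """\
2017-07-15 14:28:06 (33070): Guest Log: [DEBUG] HTCondor ping
2017-07-15 14:28:46 (33070): Guest Log: [DEBUG] DC_NOP failed!
2017-07-15 14:28:46 (33070): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2017-07-15 14:28:48 (33070): Guest Log: [ERROR] Could not ping HTCondor.
2017-07-15 14:28:48 (33070): Guest Log: [INFO] Shutting Down.
2017-07-15 14:28:48 (33070): VM Completion Message: Could not ping HTCondor.
"""

failures = find_failures(SAMPLE)
for ts, body in failures:
    print(ts, body)
```

On the sample it reports the DC_NOP failure, the SECMAN crypto-key failure, the [ERROR] line, and the VM completion message, and skips the plain [DEBUG] and [INFO] lines.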

ID: 31426
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31427 - Posted: 16 Jul 2017, 9:09:09 UTC - in response to Message 31418.  

It seems that there is an error in the stage-out phase.
I saved stderr.log and stdout.log of my currently running WU.
Let me know if they are of interest.

You are exactly right, it's a stage-out problem, connecting to Data Bridge. This was, of course, strongly suggested by the pie graphs in the Job Activity page. I'll pass the files on to Laurence. There's another recurring error, seemingly from a typo in a script, but I'm not sure if it's one of ours, nor if it has serious repercussions.
ID: 31427
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31429 - Posted: 16 Jul 2017, 9:18:07 UTC - in response to Message 31427.  

OK, we've done our bit; now we wait. I'd suggest people set No New Tasks or temporarily transfer to other apps/projects to minimise wasted time.
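
No New Tasks can be set from the BOINC Manager, or from the command line with boinccmd. A small Python sketch of the latter follows, with several assumptions: boinccmd is on the PATH, the project URL below matches the one your client has registered (check your own client), and the call is made from somewhere boinccmd can authenticate to the client, for example the BOINC data directory. The function names are made up for illustration, and the script only prints the command unless dry_run is turned off.

```python
import shutil
import subprocess

# Assumption: the project URL exactly as registered in the local client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def boinccmd_args(project_url, allow_new_work):
    """Build the boinccmd argument list for the project operation."""
    op = "allowmorework" if allow_new_work else "nomorework"
    return ["boinccmd", "--project", project_url, op]


def set_new_tasks(project_url, allow_new_work, dry_run=True):
    """Print the command; run it only if dry_run is False and boinccmd exists."""
    args = boinccmd_args(project_url, allow_new_work)
    print(" ".join(args))
    if dry_run:
        return None
    if shutil.which("boinccmd") is None:
        raise FileNotFoundError("boinccmd not found on PATH")
    return subprocess.run(args, check=True)


# Dry run: shows the command that would stop new work being fetched.
set_new_tasks(PROJECT_URL, allow_new_work=False)
```

Calling it later with allow_new_work=True builds the matching allowmorework command for when tasks are safe to run again.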
ID: 31429
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31430 - Posted: 16 Jul 2017, 9:44:36 UTC - in response to Message 31429.  

OK, we've done our bit; now we wait. I'd suggest people set No New Tasks or temporarily transfer to other apps/projects to minimise wasted time.

Thank you, Ivan, for your efforts (I knew that something was going wrong).
ID: 31430
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31431 - Posted: 16 Jul 2017, 12:14:03 UTC - in response to Message 31430.  
Last modified: 16 Jul 2017, 13:54:42 UTC

Cheers, Erich; I do my best but I can't do everything. It's just gotten worse, which may or may not indicate that someone's trying to do something. Lunchtime at the cricket Test (England vs. South Africa) so I'm off downtown for my shopping, back in an hour or so.
[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit]
ID: 31431
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31433 - Posted: 16 Jul 2017, 14:19:17 UTC - in response to Message 31431.  

[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit]

OMG - what's going on there?
ID: 31433
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31435 - Posted: 16 Jul 2017, 15:37:19 UTC - in response to Message 31433.  
Last modified: 16 Jul 2017, 16:01:33 UTC

[Edit] Well, it's definitely got worse now. "Watching and waiting," as the Moody Blues sang. [/Edit]

OMG - what's going on there?

I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-(
ID: 31435
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31436 - Posted: 16 Jul 2017, 16:13:46 UTC - in response to Message 31435.  

I'm afraid I have no idea, Erich, it's well beyond my control. Set No New Tasks, batten down the hatches, and check back tomorrow. :-(

okay, thanks, I'll do this!
ID: 31436
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31437 - Posted: 16 Jul 2017, 16:41:46 UTC

Just now another CMS got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand).
Also here, the discrepancy between total running time and CPU time is nearly 4 1/2 hours.
ID: 31437
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31440 - Posted: 16 Jul 2017, 17:01:13 UTC - in response to Message 31437.  

Just now another CMS got finished and uploaded (one of the last remaining ones I had in the queue on one of my machines). The interesting thing is that they do not fail (which is good, on one hand).
Also here, the discrepancy between total running time and CPU time is nearly 4 1/2 hours.

I'm not seeing failures as my tasks run out, but the Dashboard graphs definitely are not showing any hint of green. But then again Dashboard is not always up to date. I'll keep monitoring but I'll probably go to bed relatively early, I've got a lot happening tomorrow.
ID: 31440
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31442 - Posted: 16 Jul 2017, 18:44:38 UTC - in response to Message 31440.  

... I've got a lot happening tomorrow.

:-) :-) :-)
ID: 31442
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31443 - Posted: 16 Jul 2017, 18:56:20 UTC - in response to Message 31442.  

... I've got a lot happening tomorrow.

:-) :-) :-)

Tja. A Group meeting. Three reports to write (including CMS@Home). Five or six servers to do security updates on. Two 4 TB disks to RAID into one of the servers if I can get hold of a couple of CMOS batteries for the matched-pair that are having the Intel equivalent of Alzheimer's. And a doctor's appointment. As well as trying to get the current stage-out problem fixed.
ID: 31443
Erich56

Joined: 18 Dec 15
Posts: 1788
Credit: 117,662,306
RAC: 81,477
Message 31464 - Posted: 17 Jul 2017, 14:59:58 UTC - in response to Message 31443.  

Ivan, any news on the stage-out problem?
ID: 31464
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31467 - Posted: 17 Jul 2017, 17:20:39 UTC - in response to Message 31464.  

Ivan, any news on the stage-out problem?

Only that the LHC@Home crew is aware of it and working on it. I'll let you know as soon as I hear anything.
ID: 31467
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,690,313
RAC: 6,878
Message 31483 - Posted: 18 Jul 2017, 14:39:07 UTC - in response to Message 31467.  

Ivan, any news on the stage-out problem?

Only that the LHC@Home crew is aware of it and working on it. I'll let you know as soon as I hear anything.

We are staging out jobs again, and the Data Bridge Web interface is working. Please proceed cautiously, but it looks like you can start running tasks again.
ID: 31483


©2024 CERN