Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30554 - Posted: 29 May 2017, 21:10:44 UTC - in response to Message 30551.  
Last modified: 29 May 2017, 21:25:12 UTC

OK, we're up again now. Alan spotted a problem; fixing it may prevent this particular issue in the future.
ID: 30554
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30556 - Posted: 30 May 2017, 7:25:35 UTC
Last modified: 30 May 2017, 7:25:56 UTC

Evidently for the reasons Ivan described above, quite a number of my tasks failed after about 10 minutes yesterday evening (I only noticed a few minutes ago, having been away). I guess the same was true for all other crunchers.

What exactly is this WMAgent doing? Why does it fail so often?
ID: 30556
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2520
Credit: 252,039,324
RAC: 130,172
Message 30557 - Posted: 30 May 2017, 7:36:26 UTC - in response to Message 30543.  

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."


An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.

It may be that CMS (and possibly Theory) hit the brake and are now unable to release it.
Server Status: 10 CMS, 0 Theory (Task data as of 30 May 2017, 7:32:12 UTC)
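
A minimal sketch of how a queue-based kill switch like the one described above could behave. The threshold values, the class name and the use of a separate release threshold are all assumptions made for illustration; the thread does not describe the project's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class KillSwitch:
    """Stops task distribution while the job queue is too low.

    Both thresholds are hypothetical values chosen for illustration.
    """
    stop_below: int = 50      # trip when queued jobs fall below this
    resume_above: int = 200   # release only once the queue has recovered
    sending: bool = True

    def update(self, queued_jobs: int) -> bool:
        """Update the switch from the current queue length; return whether to send tasks."""
        if self.sending and queued_jobs < self.stop_below:
            self.sending = False
        elif not self.sending and queued_jobs > self.resume_above:
            self.sending = True
        return self.sending


switch = KillSwitch()
history = [switch.update(q) for q in (500, 40, 120, 250)]
print(history)  # [True, False, False, True]
```

In this sketch a reading of 120 keeps sending paused because the queue has not yet passed the release threshold. A switch with no release condition at all would stay tripped until someone reset it by hand.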
ID: 30557
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30558 - Posted: 30 May 2017, 8:19:15 UTC - in response to Message 30557.  

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."

An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.

It may be that CMS (and possibly Theory) hit the brake and are now unable to release it.
Server Status: 10 CMS, 0 Theory (Task data as of 30 May 2017, 7:32:12 UTC)

We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress.
ID: 30558
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2520
Credit: 252,039,324
RAC: 130,172
Message 30559 - Posted: 30 May 2017, 9:01:26 UTC - in response to Message 30558.  

We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress.

Right.

According to the server status page there are 6308 "tasks" in progress. I personally prefer to call them "work units (WUs)".
This means that on volunteers' hosts there are 6308 virtual machines which, once running, request "jobs" (currently 900) via the WMAgent.

Those "jobs" are unfortunately not visible on the status page but here.

The current situation differs from other error periods in that there is now a lack of WUs, whereas "normally" there is a lack of jobs.
ID: 30559
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,816,585
RAC: 18,715
Message 30560 - Posted: 30 May 2017, 9:35:35 UTC
Last modified: 30 May 2017, 9:36:57 UTC

The make_work app is shown as stopped on the server status page.

Edit: Oops, now active again.
ID: 30560
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 30562 - Posted: 30 May 2017, 9:39:11 UTC - in response to Message 30557.  
Last modified: 30 May 2017, 9:39:43 UTC

It may be that CMS (and possibly Theory) hit the brake and are now unable to release it.


Correct! A false positive but now fixed.
ID: 30562
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30707 - Posted: 9 Jun 2017, 12:45:13 UTC

Number of running jobs is falling. Can't see any reason yet (other than that it's Friday...).
ID: 30707
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30710 - Posted: 9 Jun 2017, 14:08:07 UTC - in response to Message 30707.  

Number of running jobs is falling. Can't see any reason yet (other than that it's Friday...).

It looks like a CVMFS problem: the site-config mapping is wrong. Laurence's is mapping to T1_CH_CERN (which doesn't exist) instead of T3_CH_Volunteer:

http://lfield.web.cern.ch/lfield/finished_1.log

Beginning CMSSW wrapper script
slc5_amd64_gcc462 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 09-Jun-2017 15:13:34 CEST-----------------------
An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete
process id is 5485 status is 65
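
A minimal sketch of how one might check, from inside a running VM, which site directory the path in the exception resolves to. It assumes that `SITECONF/local` is a symlink to a site directory such as `T3_CH_Volunteer`; the function name and the returned keys are hypothetical, and only the path itself comes from the log above.

```python
import os

# Path taken from the exception message in the log above.
SITE_CONFIG = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"


def check_site_config(config_path: str = SITE_CONFIG,
                      expected_site: str = "T3_CH_Volunteer") -> dict:
    """Report where the 'local' site directory resolves and whether the config exists.

    Assumes config_path has the form <SITECONF>/local/JobConfig/<file> and that
    'local' is a symlink to a site directory (an assumption, not confirmed here).
    """
    local_dir = os.path.dirname(os.path.dirname(config_path))
    # realpath follows the symlink even if its target does not exist.
    resolved_site = os.path.basename(os.path.realpath(local_dir))
    return {
        "resolved_site": resolved_site,
        "matches_expected": resolved_site == expected_site,
        "config_exists": os.path.isfile(config_path),
    }


if __name__ == "__main__":
    print(check_site_config())
```

On a host showing the failure described here, one would expect `resolved_site` to name the wrong site and `config_exists` to be false. On a machine without CVMFS mounted the function simply reports `local` as unresolved.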
ID: 30710
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30711 - Posted: 9 Jun 2017, 15:29:57 UTC - in response to Message 30710.  

Is anyone else seeing this? I just had a new task start and it picked up the right site-config file.
ID: 30711
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30713 - Posted: 9 Jun 2017, 16:48:17 UTC - in response to Message 30711.  

My latest CMS task started an hour ago and, so far, it seems to be working fine.
ID: 30713
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30714 - Posted: 9 Jun 2017, 19:02:25 UTC - in response to Message 30713.  
Last modified: 9 Jun 2017, 21:07:13 UTC

Maybe we just lost a heavy-hitter and Laurence's error was a coincidence?
ID: 30714
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 30715 - Posted: 9 Jun 2017, 19:15:13 UTC - in response to Message 30714.  

It may be my cluster that has fallen over. Won't be able to check until tomorrow.
ID: 30715
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30717 - Posted: 10 Jun 2017, 6:28:16 UTC - in response to Message 30715.  

According to this:

https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php

CMS jobs are still below 400, whereas they were over 1000 two days ago.
ID: 30717
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2520
Credit: 252,039,324
RAC: 130,172
Message 30718 - Posted: 10 Jun 2017, 6:57:17 UTC - in response to Message 30717.  

according to this:

https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php

CMS jobs still below 400 - whereas they were over 1000 two days ago.

It looks like Laurence's cluster runs 600 jobs/h while all other users worldwide run 400 jobs/h.
:-)
ID: 30718
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30730 - Posted: 11 Jun 2017, 6:10:16 UTC - in response to Message 30718.  

according to this:

https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php

CMS jobs still below 400 - whereas they were over 1000 two days ago.

It looks like Laurence's cluster runs 600 jobs/h while all other users worldwide run 400 jobs/h.
:-)


Well, what then accounts for the sudden drop from around 1000 jobs to fewer than 400 on June 9/10?
ID: 30730
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30735 - Posted: 11 Jun 2017, 9:52:05 UTC - in response to Message 30730.  

We're still scratching our heads over that one. Perhaps something will emerge in the new working week.
ID: 30735
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30742 - Posted: 12 Jun 2017, 4:21:48 UTC

Since last night, we are back to over 1000:

https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php
ID: 30742
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1056
Credit: 7,668,768
RAC: 6,679
Message 30748 - Posted: 12 Jun 2017, 8:12:42 UTC - in response to Message 30742.  

...and we're still scratching our heads...
ID: 30748
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,370,076
RAC: 74,781
Message 30750 - Posted: 12 Jun 2017, 8:42:44 UTC - in response to Message 30748.  

...and we're still scratching our heads...

:-))))))
ID: 30750
©2024 CERN