CMS Tasks Failing
Author | Message |
---|---|
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
Obviously due to the reasons described by Ivan above, yesterday evening I had quite a number of tasks that failed after about 10 minutes (which I noticed only a few minutes ago, having been away). I guess the same was true for all other crunchers. What exactly is this WMAgent doing? Why does it fail that often? |
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,880,085 RAC: 39,046 |
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." It may be that CMS (and possibly Theory) hit the brake and are now unable to release it. Server Status: 10 CMS, 0 Theory (Task data as of 30 May 2017, 7:32:12 UTC) |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress. |
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,880,085 RAC: 39,046 |
"We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress." Right. According to the server status page there are 6308 "tasks" in progress. I personally prefer to call them "work units (WUs)". This means that on volunteers' hosts there are 6308 virtual machines that, once running, request "jobs" (currently 900) via the WMAgent. Those "jobs" are unfortunately not visible on the status page but here. The current situation differs from other error periods in that there is now a lack of WUs, whereas "normally" there is a lack of jobs. |
Send message Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 2,013 |
The make_work app is shown as stopped on the server status page. Edit: Oops, now active again. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
"It may be that CMS (and possibly Theory) hit the brake and are now unable to release it." Correct! A false positive, but now fixed. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
The number of running jobs is falling. Can't see any reason yet (other than that it's Friday...). It looks like a cvmfs problem: the site-config mapping is wrong. Laurence's log shows a mapping to T1_CH_CERN (which doesn't exist) instead of T3_CH_Volunteer: http://lfield.web.cern.ch/lfield/finished_1.log Beginning CMSSW wrapper script slc5_amd64_gcc462 scramv1 CMSSW Performing SCRAM setup... Completed SCRAM setup Retrieving SCRAM project... Completed SCRAM project Executing CMSSW cmsRun -j FrameworkJobReport.xml PSet.py ----- Begin Fatal Exception 09-Jun-2017 15:13:34 CEST----------------------- An exception of category 'Incomplete configuration' occurred while [0] Constructing the EventProcessor [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag' Exception Message: Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml ----- End Fatal Exception ------------------------------------------------- Complete process id is 5485 status is 65 |
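For anyone who wants to pull the key facts out of a log like the one quoted in the post above, here is a minimal Python sketch. It is not part of CMSSW or WMAgent. The function name `summarize_fatal_exception` is made up for illustration, and the line breaks in the sample are an assumption, since the forum flattened the original log onto one line.

```python
import re

# Log excerpt taken from the post above. The line breaks are restored by hand
# and are an assumption about the original layout.
LOG = """\
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 09-Jun-2017 15:13:34 CEST-----------------------
An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete
process id is 5485 status is 65
"""


def summarize_fatal_exception(log_text):
    """Return category, message, timestamp and exit status of the first
    fatal exception block, or None if the log contains no such block.

    Hypothetical helper for illustration only."""
    block = re.search(
        r"-+ Begin Fatal Exception (?P<when>.*?)-+\n"
        r"(?P<body>.*?)"
        r"-+ End Fatal Exception",
        log_text,
        re.DOTALL,
    )
    if block is None:
        return None

    body = block.group("body")
    category = re.search(r"category '([^']+)'", body)
    message = re.search(r"Exception Message:\s*(.+)", body)
    # The exit status is printed after the exception block, so search the whole log.
    status = re.search(r"status is (\d+)", log_text)

    return {
        "when": block.group("when").strip(),
        "category": category.group(1) if category else None,
        "message": message.group(1).strip() if message else None,
        "status": int(status.group(1)) if status else None,
    }


if __name__ == "__main__":
    print(summarize_fatal_exception(LOG))
```

Run over many job logs, a summary like this would make it quicker to see whether failures share the same category and the same missing site-local-config path, which is what points to the cvmfs mapping here rather than to the volunteers' hosts.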
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
My latest CMS task started 1 hour ago, and so far it seems to work fine. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
It may be my cluster that has fallen over. Won't be able to check until tomorrow. |
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
According to this: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php CMS jobs are still below 400, whereas they were over 1000 two days ago. |
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,880,085 RAC: 39,046 |
according to this: It looks like Laurence's cluster runs 600 jobs/h while all other users worldwide run 400 jobs/h. :-) |
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
according to this: Well, what does account for the sudden drop from around 1000 jobs to fewer than 400 jobs on June 9/10? |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
|
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,145 RAC: 31,527 |
...and we're still scratching our heads... :-)))))) |
©2024 CERN