Message boards : CMS Application : CMS Tasks Failing
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985
Obviously, due to the reasons described by Ivan above, yesterday evening I had quite a number of tasks that failed after about 10 minutes (which I noticed only a few minutes ago, having been away). The same, I guess, was true for all other crunchers. What exactly is this WMAgent doing? Why does it fail that often?
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."
It may be that CMS (and possibly Theory) hit the brake and are now unable to release it.
Server Status: 10 CMS, 0 Theory (Task data as of 30 May 2017, 7:32:12 UTC)
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 489
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."
We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545
We do seem to be still sending jobs, despite that status report. All my monitors report over 900 jobs in progress.
Right. According to the server status page there are 6308 "tasks" in progress. I personally prefer to call them "work units (WUs)". This means that on volunteers' hosts there are 6308 virtual machines which, once running, request "jobs" (currently 900) via the WMAgent. Those "jobs" are unfortunately not visible on the status page, but here. The current situation differs from other error periods in that there is now a lack of WUs, whereas "normally" there is a lack of jobs.
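A rough sketch of those two supply levels (purely illustrative; the numbers are the ones quoted above, and none of this is the real WMAgent protocol):

# Two independent supply levels, as described above (illustration only):
#   level 1: the BOINC server hands out work units (WUs); each WU boots one VM
#   level 2: every running VM then asks the WMAgent for CMS jobs, one at a time
unsent_wus  = 10      # CMS WUs still available on the BOINC server
running_vms = 6308    # WUs already in progress = VMs on volunteer hosts
queued_jobs = 900     # jobs the WMAgent currently has on offer

print(f"level 1 (BOINC):   {unsent_wus} WUs ready to send, {running_vms} WUs/VMs in progress")
print(f"level 2 (WMAgent): {queued_jobs} jobs for those VMs to pick up")
# A shortage can occur on either level independently: no WUs means no new VMs
# get started; no jobs means running VMs idle and eventually error out.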
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 748
The make_work app is stopped on the server page.
Edit: Oops - now active again.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
It may be that CMS (and possibly Theory) hit the brake and are now unable to release it.
Correct! A false positive, but now fixed.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 489
Number of running jobs is falling. Can't see any reason yet (other than that it's Friday...). It looks like a cvmfs problem; the mapping of the site-config is wrong; Laurence's is mapping to T1_CH_CERN (which doesn't exist) instead of T3_CH_Volunteer: http://lfield.web.cern.ch/lfield/finished_1.log
Beginning CMSSW wrapper script
slc5_amd64_gcc462 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Completed SCRAM project
Executing CMSSW cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 09-Jun-2017 15:13:34 CEST-----------------------
An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete process id is 5485
status is 65
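For anyone who wants to check that mapping from inside a running VM, a minimal sketch along these lines should do; the path is the one from the error above, while the assumption that site-local-config.xml carries a <site name="..."> element is based on the usual SITECONF layout:

# Minimal sketch: report which CMS site /cvmfs/cms.cern.ch/SITECONF/local points to.
# Assumes the usual site-local-config.xml layout with a <site name="..."> element.
import os
import xml.etree.ElementTree as ET

CONF = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"

if not os.path.exists(CONF):
    print("site-local-config.xml not found - same failure mode as in the log above")
else:
    # SITECONF/local is normally a symlink to the directory of the configured site
    print("SITECONF/local ->", os.path.realpath("/cvmfs/cms.cern.ch/SITECONF/local"))
    site = ET.parse(CONF).getroot().find("site")
    if site is not None:
        print("configured site:", site.get("name"))  # should be T3_CH_Volunteer, not T1_CH_CERN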
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985
My latest CMS task started 1 hour ago, and so far it seems to work fine.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
It may be my cluster that has fallen over. Won't be able to check until tomorrow.
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985
According to this: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php
CMS jobs are still below 400, whereas they were over 1000 two days ago.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545
according to this:
It looks like Laurence's cluster runs 600 jobs/h while all other users worldwide run 400 jobs/h. :-)
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985
according to this:
Well, what then accounts for the sudden drop from around 1000 jobs to fewer than 400 jobs on June 9/10?
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985
...and we're still scratching our heads... :-))))))