Message boards : CMS Application : CMS Tasks Failing
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
"Looks like my CMS tasks have (temporary) problems uploading subtask results."

Yes. I fear that we are picking up people with poorly-set-up VirtualBox installations, but I don't have access to logs to try to verify that.

On the other hand, I'm not sure how much I trust the Dashboard displays. For example, at the moment Dashboard says we have 641 jobs -- although I guess that could be 641/hour -- while WMStats says we have 254 jobs running and zero failures for our jobs. Another difference could be that if a job fails and is resubmitted, WMStats doesn't count that as a failure until it has actually failed three times. I need to look into that, but Dashboard can be difficult to navigate.

There is a problem coming up, in that Dashboard is going to be retired in a couple of weeks. The new Grafana-based dashboard is up and running for normal jobs, but we have a ticket in to have it include our jobs too. When that comes through, we'll need to see what information we can glean from it to replace the current jobs plots.

Oh, and the WMAgent developers have confirmed that they have recently introduced length limits on various character strings within the system, hence the error message I (finally?) saw yesterday.
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
I've just done some analysis on a couple of the DataBridge directories where the Volunteer jobs write their results. There's some time overlap between the directories, as 1,000 jobs are allocated to each directory, but the time the results come in depends on how fast the individual jobs were processed.

In all, 1,560 jobs in 6.75 hours -> 231 successful jobs/hour. That doesn't square very well with the current peak in the running-jobs plot, which is showing 1,000/hr, but is much closer to WMStats, which shows 287 jobs running at the moment. On the other hand, there are 1,000 results in some of the directories, and >980 in others, so it looks like direct failures (implying retries) are rare, which also jibes quite well with the zero overall failure rate given by WMStats. Oh, and the size of the result files ranges from ~40 MB to ~75 MB.

Unfortunately it seems that direct access to the old Dashboard is turned off already. We can still access the graphs, but every time I try to access the Web pages I get an Access Forbidden error. I've still not heard back about when we will show up on the new Grafana dashboard.
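[Editor's note: the throughput arithmetic in the post above can be sanity-checked with a few lines; the figures are taken directly from the post, nothing else is assumed.]

```python
# Sanity-check of the job throughput quoted in the post above.
jobs = 1560    # successful results found across the DataBridge directories
hours = 6.75   # wall-clock span those directories cover

rate = jobs / hours  # successful jobs per hour
print(f"{rate:.0f} successful jobs/hour")  # prints "231 successful jobs/hour"
```

This matches the 231/hour figure in the post, and shows how far it sits from the 1,000/hr peak in the running-jobs plot.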
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running-jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see...

[Edit] No, that doesn't stack up, as WMStats would show them as running, and it only shows 273 jobs out in the field at the moment. [/Edit]
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
"Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see..."

No, the "missing" 420 job results have now arrived. WMStats is still batting a better average.
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
After solving the problem with Theory on computer 10570592 by uninstalling the McAfee virus protector, CMS@home tasks started running, but Condor stops running after 64,277 s in two tasks.

Tullio
Joined: 28 Sep 04 · Posts: 739 · Credit: 50,540,073 · RAC: 33,326
I think there is an 18-hour limit after which tasks are terminated even if not complete. The same limit applied to Theory tasks as well, but that was changed to 36 hours a little while ago. So I think this is by design and not an error as such.
Joined: 14 Jan 10 · Posts: 1437 · Credit: 9,614,158 · RAC: 2,399
"After solving the problem with Theory on computer 10570592 by uninstalling the McAfee virus protector, CMS@home tasks start running but condor stops running after 64277 s in two tasks."

    2019-11-18 14:05:42 (12788): VM state change detected. (old = 'Running', new = 'Paused')
    2019-11-19 01:31:52 (12788): VM state change detected. (old = 'Paused', new = 'Running')
    2019-11-19 12:39:58 (2948): VM state change detected. (old = 'Running', new = 'Paused')
    2019-11-20 05:54:38 (2948): VM state change detected. (old = 'Paused', new = 'Running')
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
But I am not stopping and starting. My PCs run 24/7, and the one dedicated to LHC VBox tasks has a Ryzen 5 1400 CPU and 24 GB RAM. Only Atlas@home tasks take all 8 cores. Is this the problem?

Tullio
Joined: 14 Jan 10 · Posts: 1437 · Credit: 9,614,158 · RAC: 2,399
Yes Tullio, that's the problem. When an ATLAS task starts it needs all 8 cores, and the other running tasks get the 'waiting to run' state (paused). If you want to run a mixture of ATLAS, Theory and CMS, you could set up an app_config.xml like this:

    <app_config>
      <project_max_concurrent>8</project_max_concurrent>
      <app>
        <name>ATLAS</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app>
        <name>CMS</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app>
        <name>Theory</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <avg_ncpus>3.000000</avg_ncpus>
        <cmdline>--memory_size_mb 5700</cmdline>
      </app_version>
      <app_version>
        <app_name>CMS</app_name>
        <plan_class>vbox64</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--nthreads 1.000000</cmdline>
        <cmdline>--memory_size_mb 2048</cmdline>
      </app_version>
      <app_version>
        <app_name>Theory</app_name>
        <plan_class>vbox64_theory</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--nthreads 1.000000</cmdline>
        <cmdline>--memory_size_mb 630</cmdline>
      </app_version>
    </app_config>
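[Editor's note: a usage sketch, not part of the post above. The project-directory path is an assumption -- the BOINC data directory varies by OS and install -- so adjust it for your machine.]

```shell
# Hypothetical path: default BOINC data directory on a Linux package install.
# app_config.xml belongs in the project's own directory:
cp app_config.xml /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/

# Ask the running client to re-read its configuration files
# (this also re-reads app_config.xml, no client restart needed):
boinccmd --read_cc_config
```

On Windows the equivalent is BOINC Manager's "Options -> Read config files".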
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
Thanks Crystal. Going by trial and error, I have reduced the number of CPUs to 6 on this host only, which is at my work location. I am now running one Atlas@home task on 6 cores and two Theory tasks on the remaining two cores. I may cut the number of CPUs down to 4; that is enough for Atlas.

Tullio
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
For a few hours, all tasks have failed after about 8 minutes with:

    -152 (0xFFFFFF68) ERR_NETOPEN

Excerpt from stderr:

    Guest Log: [ERROR] Could not connect to Condor server on port 9618
Joined: 24 Oct 04 · Posts: 1184 · Credit: 56,979,704 · RAC: 63,050
"For a few hours, all tasks have failed after about 8 minutes with ..."

The CMS problems have not been repaired yet, so don't waste your time on them.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"The CMS problems have not been repaired yet so don't waste your time on them."

Okay, I just thought that everything was okay now, since all went well for almost a week. But apparently not. Perhaps Ivan could tell us more about the current situation.
Joined: 20 Jun 14 · Posts: 381 · Credit: 238,712 · RAC: 0
"okay, I just thought that everything is okay now, since all went well for almost a week. But probably not so."

Sorry, this was my fault. I switched off a server that I thought we were no longer using, and it turns out we are using it for the test. I have restarted it and will update the test.
Joined: 15 Nov 14 · Posts: 602 · Credit: 24,371,321 · RAC: 0
I was running fine on two machines until the ones sent about 31 Dec 2019, 5:32:08 UTC. Now they are mostly failing. Presumably they are short of work over New Year's.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"Presumably they are short of work over New Year's."

I have the same problem here. Since last night, most of the tasks fail after about 20 minutes. They show:

    207 (0x000000CF) EXIT_NO_SUB_TASKS

So it's clear -- no jobs available. What I am wondering about, though: some (long) time ago a mechanism was introduced by which the task download queue should be stopped as soon as there are no jobs. Obviously, this does not work :-(
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
Since late last evening, some tasks have failed after more than 5 hours with "unknown error code" -- for example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=257257063

    2019-12-31 23:39:18 (19000): Guest Log: [ERROR] Condor ended after 19787 seconds

I have abandoned CMS and switched to Theory.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"I abandoned CMS and switched to Theory."

Before I switch back to CMS -- can anyone tell me whether the problem of jobs (subtasks) not being available has been solved in the meantime?
Joined: 24 Oct 04 · Posts: 1184 · Credit: 56,979,704 · RAC: 63,050
"Before I switch back to CMS -- can anyone tell whether the problem of jobs (subtasks) not available has been solved meanwhile?"

I think it would be best to stick with Theory tasks for now, Erich. I don't see any running here yet on Windows, and on Linux not very often (about 25%).
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"I don't see any running here yet with Windows OS and with Linux not very often (about 25%)"

Thank you, MAGIC, for your quick answer. Strange what figures the server status page shows: 7,931 CMS tasks in progress; rather impossible, right?
©2025 CERN