Message boards : CMS Application : CMS Tasks Failing
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
"Looks like my CMS tasks have (temporary) problems uploading subtask results."

Yes. I fear that we are picking up people with poorly-set-up VirtualBox installations, but I don't have access to logs to try to verify that.

On the other hand, I'm not sure how much I trust the Dashboard displays. For example, at the moment Dashboard says we have 641 jobs -- although I guess that could be 641/hour -- while WMStats says we have 254 jobs running and zero failures for our jobs. Another difference could be that if a job fails and is resubmitted, WMStats doesn't count that as a failure until it has actually failed three times. I need to look into that, but Dashboard can be difficult to navigate.

There is a problem coming up, in that Dashboard is going to be retired in a couple of weeks. The new Grafana-based dashboard is up and running for normal jobs, but we have a ticket in to have it include our jobs too. When that comes through, we'll need to see what information we can glean from it to replace the current jobs plots.

Oh, and the WMAgent developers have confirmed that they have recently introduced length limits on various character strings within the system, hence the error message I (finally?) saw yesterday.
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
I've just done some analysis on a couple of the DataBridge directories where the Volunteer jobs write their results. There's some time overlap between the directories, as 1,000 jobs are allocated to each directory, but the time the results come in depends on how fast the individual jobs were processed.

In all, 1,560 jobs in 6.75 hours -> 231 successful jobs/hour. That doesn't square very well with the current peak in the running-jobs plot, which is showing 1,000/hr, but is much closer to WMStats, which shows 287 jobs running at the moment. On the other hand, there are 1,000 results in some of the directories, and >980 in others, so it looks like direct failures (implying retries) are rare, which also jibes quite well with the zero overall failure rate given by WMStats. Oh, and the size of the result files ranges from ~40 MB to ~75 MB.

Unfortunately it seems that direct access to the old Dashboard is turned off already. We can still access the graphs, but every time I try to access the Web pages I get an Access Forbidden error. I've still not heard back about when we will show up on the new Grafana dashboard.
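[Editor's note: the throughput arithmetic in the post above can be sanity-checked with a few lines; the figures are taken directly from the post, nothing else is assumed.]

```python
# Sanity-check of the job throughput quoted in the post above.
jobs = 1560    # successful results found across the DataBridge directories
hours = 6.75   # wall-clock span those directories cover

rate = jobs / hours  # successful jobs per hour
print(f"{rate:.0f} successful jobs/hour")  # prints "231 successful jobs/hour"
```

This matches the 231/hour figure in the post, and shows how far it sits from the 1,000/hr peak in the running-jobs plot.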
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running-jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see...

[Edit] No, that doesn't stack up, as WMStats would show them as running, and it only shows 273 jobs out in the field at the moment. [/Edit]
Joined: 29 Aug 05 · Posts: 1065 · Credit: 8,198,060 · RAC: 10,259
"Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see..."

No, the "missing" 420 job results have now arrived. WMStats is still batting a better average.
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
After solving the problem with Theory on computer 10570592 by uninstalling the McAfee virus protector, CMS@home tasks started running, but Condor stops running after 64,277 s in two tasks.

Tullio
Joined: 28 Sep 04 · Posts: 739 · Credit: 50,540,073 · RAC: 33,326
I think there is an 18-hour limit after which tasks are terminated even if not complete. The same limit applied to Theory tasks as well, but that was changed to 36 hours a little while ago. So I think this is by design and not an error as such.
Joined: 14 Jan 10 · Posts: 1437 · Credit: 9,614,158 · RAC: 2,399
"After solving the problem with Theory on computer 10570592 by uninstalling the McAfee virus protector, CMS@home tasks start running but condor stops running after 64277 s in two tasks."

    2019-11-18 14:05:42 (12788): VM state change detected. (old = 'Running', new = 'Paused')
    2019-11-19 01:31:52 (12788): VM state change detected. (old = 'Paused', new = 'Running')
    2019-11-19 12:39:58 (2948): VM state change detected. (old = 'Running', new = 'Paused')
    2019-11-20 05:54:38 (2948): VM state change detected. (old = 'Paused', new = 'Running')
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
But I am not stopping and starting. My PCs run 24/7, and the one dedicated to LHC VBox tasks has a Ryzen 5 1400 CPU and 24 GB RAM. Only Atlas@home tasks take all 8 cores. Is this the problem?

Tullio
Joined: 14 Jan 10 · Posts: 1437 · Credit: 9,614,158 · RAC: 2,399
Yes Tullio, that's the problem. When an ATLAS task starts it needs all 8 cores, and the other running tasks get the 'waiting to run' state (paused). If you want to run a mixture of ATLAS, Theory and CMS, you could set up an app_config.xml like this:

    <app_config>
      <project_max_concurrent>8</project_max_concurrent>
      <app>
        <name>ATLAS</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app>
        <name>CMS</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app>
        <name>Theory</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <avg_ncpus>3.000000</avg_ncpus>
        <cmdline>--memory_size_mb 5700</cmdline>
      </app_version>
      <app_version>
        <app_name>CMS</app_name>
        <plan_class>vbox64</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--nthreads 1.000000</cmdline>
        <cmdline>--memory_size_mb 2048</cmdline>
      </app_version>
      <app_version>
        <app_name>Theory</app_name>
        <plan_class>vbox64_theory</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--nthreads 1.000000</cmdline>
        <cmdline>--memory_size_mb 630</cmdline>
      </app_version>
    </app_config>
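[Editor's note: a usage sketch, not part of the post above. The project-directory path is an assumption -- the BOINC data directory varies by OS and install -- so adjust it for your machine.]

```shell
# Hypothetical path: default BOINC data directory on a Linux package install.
# app_config.xml belongs in the project's own directory:
cp app_config.xml /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/

# Ask the running client to re-read its configuration files
# (this also re-reads app_config.xml, no client restart needed):
boinccmd --read_cc_config
```

On Windows the equivalent is BOINC Manager's "Options -> Read config files".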
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
Thanks Crystal. Going by trial and error, I have reduced the number of CPUs to 6 on this host only, which is at my work location. I am now running one Atlas@home task on 6 cores and two Theory tasks on the remaining two cores. I may cut the number of CPUs down to 4; that is enough for Atlas.

Tullio
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
For a few hours, all tasks have failed after about 8 minutes with:

    -152 (0xFFFFFF68) ERR_NETOPEN

Excerpt from stderr:

    Guest Log: [ERROR] Could not connect to Condor server on port 9618
Joined: 24 Oct 04 · Posts: 1184 · Credit: 56,979,704 · RAC: 63,050
"For a few hours, all tasks have failed after about 8 minutes with ..."

The CMS problems have not been repaired yet, so don't waste your time on them.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"The CMS problems have not been repaired yet so don't waste your time on them."

Okay, I just thought that everything was okay now, since all went well for almost a week. But apparently not. Perhaps Ivan could tell us more about the current situation.
Joined: 20 Jun 14 · Posts: 381 · Credit: 238,712 · RAC: 0
"okay, I just thought that everything is okay now, since all went well for almost a week. But probably not so."

Sorry, this was my fault. I switched off a server that I thought we were no longer using, and it turns out we are using it for the test. I have restarted it and will update the test.
Joined: 15 Nov 14 · Posts: 602 · Credit: 24,371,321 · RAC: 0
I was running fine on two machines until the ones sent about 31 Dec 2019, 5:32:08 UTC. Now they are mostly failing. Presumably they are short of work over New Year's.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"Presumably they are short of work over New Year's."

I have the same problem here. Since last night, most of the tasks fail after about 20 minutes. They show:

    207 (0x000000CF) EXIT_NO_SUB_TASKS

So it's clear -- no jobs available. What I am wondering about, though: some (long) time ago a mechanism was introduced by which the task download queue should be stopped as soon as there are no jobs. Obviously, this does not work :-(
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
Since late last evening, some tasks have failed after more than 5 hours with "unknown error code" -- for example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=257257063

    2019-12-31 23:39:18 (19000): Guest Log: [ERROR] Condor ended after 19787 seconds

I have abandoned CMS and switched to Theory.
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"I abandoned CMS and switched to Theory."

Before I switch back to CMS -- can anyone tell me whether the problem of jobs (subtasks) not being available has been solved in the meantime?
Joined: 24 Oct 04 · Posts: 1184 · Credit: 56,979,704 · RAC: 63,050
"Before I switch back to CMS -- can anyone tell whether the problem of jobs (subtasks) not available has been solved meanwhile?"

I think it would be best to stick with Theory tasks for now, Erich. I don't see any running here yet on Windows, and on Linux not very often (about 25%).
Joined: 18 Dec 15 · Posts: 1838 · Credit: 121,617,591 · RAC: 88,689
"I don't see any running here yet with Windows OS and with Linux not very often (about 25%)"

Thank you, MAGIC, for your quick answer. Strange what figures the server status page shows: 7,931 CMS tasks in progress; rather impossible, right?
©2025 CERN