Message boards : CMS Application : CMS Tasks Failing

Previous · 1 . . . 12 · 13 · 14 · 15 · 16 · 17 · 18 . . . 22 · Next

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,311,999
RAC: 1,591
Message 39213 - Posted: 27 Jun 2019, 21:36:26 UTC - in response to Message 39208.  

Looks like my CMS tasks are having (temporary) problems uploading subtask results.

In addition there's a huge red peak in the dashboard graphic:
http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=T3_CH_Volunteer&sitesSort=3&start=null&end=null&timeRange=lastWeek&sortBy=0&granularity=Hourly&generic=0&series=All&type=nwcb

Yes. I fear that we are picking up people with poorly-set-up VirtualBox installations, but I don't have access to logs to verify that. On the other hand, I'm not sure how much I trust the Dashboard displays. For example, at the moment Dashboard says we have 641 jobs -- although I guess that could be 641/hour -- while WMStats says we have 254 jobs running and zero failures for our jobs. Another difference could be that if a job fails and is resubmitted, WMStats doesn't count that as a failure until it has actually failed three times. I need to look at that, but Dashboard can be difficult to navigate.
There is a problem coming up, in that Dashboard is going to be retired in a couple of weeks. The new Grafana-based dashboard is up and running for normal jobs, but we have a ticket in to have it include our jobs too. When that comes up, we'll need to see what information we can glean from it to replace the current jobs plots.
Oh, and the WMAgent developers have confirmed that they have recently introduced length limits on various character strings within the system, hence the error message I (finally?) saw yesterday.
ID: 39213
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,311,999
RAC: 1,591
Message 39224 - Posted: 28 Jun 2019, 20:39:06 UTC - in response to Message 39213.  
Last modified: 28 Jun 2019, 20:43:12 UTC

I've just done some analysis on a couple of the DataBridge directories where the Volunteer jobs write their results. There's some time overlap between the directories, as 1000 jobs are allocated to each directory but the time they come in depends on how fast the individual jobs were processed. In all, 1560 jobs in 6.75 hours -> 231 successful jobs/hour. That doesn't square very well with the current peak in the running jobs plot which is showing 1,000/hr, but is much closer to WMStats which shows 287 jobs running at the moment.
On the other hand, there are 1,000 results in some of the directories, and >980 in others, so it looks like direct failures (implying retries) are rare, which also jibes quite well with the zero overall failure rate given by WMStats.
Oh, the size of the result files ranges from ~40 MB to ~75 MB.
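The throughput arithmetic above can be checked with a quick sketch (the job count and time span are the figures quoted in this post; the script itself is just illustrative):

```python
# Rough throughput check for the DataBridge result directories.
jobs = 1560    # successful results counted across the directories
hours = 6.75   # elapsed time span covered by those directories

rate = jobs / hours
print(f"{rate:.0f} successful jobs/hour")  # prints "231 successful jobs/hour"
```

That is well below the ~1,000/hr peak in the running-jobs plot, but close to the 287 running jobs WMStats reported.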

Unfortunately it seems that direct access to the old dashboard is turned off already. We can access the graphs still, but every time I try to access the Web pages I get an Access Forbidden error. I've still not heard back about when we will show up on the new Grafana dashboard.
ID: 39224
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,311,999
RAC: 1,591
Message 39225 - Posted: 29 Jun 2019, 1:07:57 UTC - in response to Message 39224.  
Last modified: 29 Jun 2019, 1:09:40 UTC

Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see...

[Edit] No, that doesn't stack up, as WMStats would show them as running, and it only shows 273 jobs out in the field at the moment. [/Edit]
ID: 39225
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,311,999
RAC: 1,591
Message 39226 - Posted: 29 Jun 2019, 2:24:33 UTC - in response to Message 39225.  

Hmm, actually there are about 420 missing results in a recent directory, which might account for the spike in the running jobs graph -- someone snarfed a large number of jobs but has yet to return the results. We shall see...

[Edit] No, that doesn't stack up, as WMStats would show them as running, and it only shows 273 jobs out in the field at the moment. [/Edit]

No, the "missing" 420 job results have now arrived. WMStats is still batting a better average.
ID: 39226
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 40551 - Posted: 20 Nov 2019, 12:19:41 UTC
Last modified: 20 Nov 2019, 12:46:09 UTC

After solving the problem with Theory on computer 10570592 by uninstalling the McAfee antivirus, CMS@home tasks started running, but Condor stopped after 64277 s in two tasks.
Tullio
ID: 40551
Harri Liljeroos
Joined: 28 Sep 04
Posts: 683
Credit: 44,065,721
RAC: 17,613
Message 40553 - Posted: 20 Nov 2019, 14:18:33 UTC - in response to Message 40551.  

I think there is an 18-hour limit after which tasks are terminated even if not complete. The same limit applied to Theory tasks as well, but that was changed to 36 hours a little while ago. So I think this is by design and not an error as such.
ID: 40553
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1302
Credit: 8,662,418
RAC: 8,256
Message 40555 - Posted: 20 Nov 2019, 14:24:57 UTC - in response to Message 40551.  

After solving the problem with Theory on computer 10570592 by uninstalling the McAfee antivirus, CMS@home tasks started running, but Condor stopped after 64277 s in two tasks.
Tullio
2019-11-18 14:05:42 (12788): VM state change detected. (old = 'Running', new = 'Paused')
2019-11-19 01:31:52 (12788): VM state change detected. (old = 'Paused', new = 'Running')

2019-11-19 12:39:58 (2948): VM state change detected. (old = 'Running', new = 'Paused')
2019-11-20 05:54:38 (2948): VM state change detected. (old = 'Paused', new = 'Running')
ID: 40555
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 40556 - Posted: 20 Nov 2019, 15:35:25 UTC

But I am not stopping and starting. My PCs run 24/7, and the one dedicated to LHC VBox tasks has a Ryzen 5 1400 CPU and 24 GB RAM. Only Atlas@home tasks take all 8 cores. Is this the problem?
Tullio
ID: 40556
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1302
Credit: 8,662,418
RAC: 8,256
Message 40557 - Posted: 20 Nov 2019, 15:51:17 UTC

Yes Tullio, that's the problem.
When an ATLAS task starts it needs all 8 cores, and other running tasks are put into the 'waiting to run' state (paused).
If you want to run a mixture of ATLAS, Theory and CMS you could set up an app_config.xml like this:
<app_config>
 <project_max_concurrent>8</project_max_concurrent>
 <app>
  <name>ATLAS</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app>
  <name>CMS</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app>
  <name>Theory</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app_version>
  <app_name>ATLAS</app_name>
  <plan_class>vbox64_mt_mcore_atlas</plan_class>
  <avg_ncpus>3.000000</avg_ncpus>
  <cmdline>--memory_size_mb 5700</cmdline>
 </app_version>
 <app_version>
  <app_name>CMS</app_name>
  <plan_class>vbox64</plan_class>
  <avg_ncpus>1.000000</avg_ncpus>
  <cmdline>--nthreads 1.000000</cmdline>
  <cmdline>--memory_size_mb 2048</cmdline>
 </app_version>
 <app_version>
  <app_name>Theory</app_name>
  <plan_class>vbox64_theory</plan_class>
  <avg_ncpus>1.000000</avg_ncpus>
  <cmdline>--nthreads 1.000000</cmdline>
  <cmdline>--memory_size_mb 630</cmdline>
 </app_version>
</app_config>
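A small sketch (the file path below is hypothetical; adjust it to your own BOINC project directory) for checking that an edited app_config.xml is at least well-formed XML before asking the client to re-read it, since a malformed file will just produce a parse error in the client:

```python
# Sanity-check an app_config.xml before reloading it in the BOINC client.
import xml.etree.ElementTree as ET

def check_app_config(path):
    root = ET.parse(path).getroot()  # raises ParseError if not well-formed
    assert root.tag == "app_config", "root element must be <app_config>"
    for app in root.findall("app"):
        name = app.findtext("name")
        limit = app.findtext("max_concurrent")
        print(f"app {name}: max_concurrent={limit}")

# Hypothetical path; adjust to your installation:
# check_app_config("projects/lhcathome.cern.ch_lhcathome/app_config.xml")
```

After the file checks out, the client can pick it up without a restart via the Manager's "Options > Read config files".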
ID: 40557
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 40558 - Posted: 20 Nov 2019, 16:05:32 UTC

Thanks, Crystal. Going by trial and error, I have reduced the number of CPUs to 6 on this host only, which is at my work location. I am now running one Atlas@home task on 6 cores and two Theory tasks on the remaining two cores. I may cut down the number of CPUs to 4; that is enough for Atlas.
Tullio
ID: 40558
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 40801 - Posted: 5 Dec 2019, 20:05:55 UTC

For a few hours, all tasks have failed after about 8 minutes with

-152 (0xFFFFFF68) ERR_NETOPEN

Excerpt from stderr: "Guest Log: [ERROR] Could not connect to Condor server on port 9618"
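The two forms of the exit code above are the same number: -152 is the signed decimal value and 0xFFFFFF68 is its 32-bit two's-complement representation. A quick sketch of the relation:

```python
# BOINC shows exit codes both as a signed decimal and as 32-bit hex:
# -152 (ERR_NETOPEN) appears as 0xFFFFFF68.
code = -152
hex32 = code & 0xFFFFFFFF            # two's-complement 32-bit view
print("0x" + format(hex32, "08X"))   # prints "0xFFFFFF68"

# And back again, from the hex form to the signed code:
signed = hex32 - 0x100000000 if hex32 & 0x80000000 else hex32
print(signed)                        # prints "-152"
```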
ID: 40801
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1133
Credit: 49,900,549
RAC: 7,015
Message 40803 - Posted: 6 Dec 2019, 1:04:07 UTC - in response to Message 40801.  

For a few hours, all tasks have failed after about 8 minutes with

-152 (0xFFFFFF68) ERR_NETOPEN

Excerpt from stderr: "Guest Log: [ERROR] Could not connect to Condor server on port 9618"


The CMS problems have not been repaired yet so don't waste your time on them.
ID: 40803
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 40807 - Posted: 6 Dec 2019, 6:47:13 UTC - in response to Message 40803.  

The CMS problems have not been repaired yet so don't waste your time on them.
Okay, I just thought that everything was okay now, since all had gone well for almost a week. But apparently not.
Perhaps Ivan could tell us more about the current situation.
ID: 40807
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 40809 - Posted: 6 Dec 2019, 9:07:11 UTC - in response to Message 40807.  
Last modified: 6 Dec 2019, 10:15:00 UTC

The CMS problems have not been repaired yet so don't waste your time on them.
Okay, I just thought that everything was okay now, since all had gone well for almost a week. But apparently not.
Perhaps Ivan could tell us more about the current situation.


Sorry, this was my fault. I switched off a server that I thought we were no longer using, but it looks like we are still using it for the test. I have restarted it and will update the test.
ID: 40809
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 41128 - Posted: 31 Dec 2019, 16:56:13 UTC
Last modified: 31 Dec 2019, 16:58:22 UTC

My tasks were running fine on two machines until the ones sent around 31 Dec 2019, 5:32:08 UTC.
Now they are mostly failing.

Presumably they are short of work over New Year's.
ID: 41128
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 41130 - Posted: 31 Dec 2019, 19:40:07 UTC - in response to Message 41128.  

Presumably they are short of work over New Year's.
I have the same problem here. Since last night, most of the tasks have failed after about 20 minutes. They show:
207 (0x000000CF) EXIT_NO_SUB_TASKS

so it's clear - no jobs available. What I am wondering about, though, is that some (long) time ago they introduced a mechanism by which the task download queue should be stopped as soon as there are no jobs. Obviously, this does not work :-(
ID: 41130
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 41131 - Posted: 1 Jan 2020, 6:11:46 UTC - in response to Message 41130.  

Since late last evening some tasks have failed after more than 5 hours with an "unknown error code" - for example:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=257257063

2019-12-31 23:39:18 (19000): Guest Log: [ERROR] Condor ended after 19787 seconds

I abandoned CMS and switched to Theory.
ID: 41131
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 41154 - Posted: 4 Jan 2020, 7:23:43 UTC - in response to Message 41131.  

I abandoned CMS and switched to Theory.
Before I switch back to CMS - can anyone tell me whether the problem of unavailable jobs (subtasks) has been solved in the meantime?
ID: 41154
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1133
Credit: 49,900,549
RAC: 7,015
Message 41155 - Posted: 4 Jan 2020, 7:37:42 UTC - in response to Message 41154.  

I abandoned CMS and switched to Theory.
Before I switch back to CMS - can anyone tell me whether the problem of unavailable jobs (subtasks) has been solved in the meantime?


I think it would be best to stick with Theory tasks for now, Erich.

I don't see any CMS jobs running here yet on Windows, and on Linux not very often (about 25%).
ID: 41155
Erich56

Joined: 18 Dec 15
Posts: 1698
Credit: 105,458,494
RAC: 69,308
Message 41156 - Posted: 4 Jan 2020, 7:53:26 UTC - in response to Message 41155.  

I don't see any CMS jobs running here yet on Windows, and on Linux not very often (about 25%).
Thank you, Magic, for your quick answer.
Strange what figures the server status page shows: 7,931 CMS tasks in progress; rather impossible, right?
ID: 41156


©2024 CERN