CMS jobs are becoming available again

Erich56 · Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,561,083 · RAC: 119,543
Message 38225 - Posted: 12 Mar 2019, 6:24:20 UTC
now the queue is empty again - any new problems?

Crystal Pellet (Volunteer moderator, Volunteer tester) · Joined: 14 Jan 10 · Posts: 1279 · Credit: 8,484,048 · RAC: 1,651
Message 38226 - Posted: 12 Mar 2019, 7:32:17 UTC - in response to Message 38225. Last modified: 12 Mar 2019, 16:02:12 UTC
"now the queue is empty again - any new problems?"
No sub-jobs available.
2781 succeeded, 230 aborted, 162 app failed, 0 pending and 17 running

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38229 - Posted: 12 Mar 2019, 14:48:13 UTC - in response to Message 38225.
"now the queue is empty again - any new problems?"
No, I stayed too long in bed as Storm Gareth started to batter London, and the queue drained. There are more jobs there now.

Henry Nebrensky · Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 623
Message 38238 - Posted: 13 Mar 2019, 10:17:35 UTC - in response to Message 38217.
"What do you mean by 'CPU load'? Do you mean top's %cpu or do you mean the ratio of cpu time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top but the ratio of cpu time to run time is 45,764.33/64,179.97 = 71%."
But I think that your 71% is also a reflection of the wall-clock time taken up by the VM filling its CVMFS cache and so on, i.e. network traffic, rather than just CPU efficiency in the number-crunching phase proper.
Then again, all my CMS tasks failed on a heartbeat error :( even though the same machines run Theory and ATLAS VMs quite happily.
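
For anyone wanting to reproduce the ratio discussed above, here is a minimal Python sketch. The function name cpu_efficiency is purely illustrative; the two inputs are the CPU time and run time quoted in the post, both in seconds.

```python
def cpu_efficiency(cpu_seconds: float, run_seconds: float) -> float:
    """Return CPU time as a fraction of wall-clock run time."""
    if run_seconds <= 0:
        raise ValueError("run time must be positive")
    return cpu_seconds / run_seconds


# Figures quoted in the post: 45,764.33 s CPU time, 64,179.97 s run time.
ratio = cpu_efficiency(45_764.33, 64_179.97)
print(f"CPU time / run time = {ratio:.1%}")
```

Note that this ratio covers the whole task, including any time the VM spends waiting on the network, so it can sit well below the %cpu that top reports during the crunching phase.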

Erich56 · Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,561,083 · RAC: 119,543
Message 38263 - Posted: 18 Mar 2019, 19:21:54 UTC
Although the Server Status page shows 197 tasks available for download, none of my machines can download any. BOINC says "no tasks available for CMS simulation".
BTW: same is true for Theory.
What's going on?

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38270 - Posted: 18 Mar 2019, 21:25:27 UTC - in response to Message 38263.
"Although the Server Status page shows 197 tasks available for download, none of my machines can download any. BOINC says 'no tasks available for CMS simulation'. BTW: same is true for Theory. What's going on?"
Do you already have tasks running? CMS won't download tasks unless there is a BOINC slot for them (i.e. you shouldn't get any tasks in the "Waiting to run" state). You can turn on extra logging in your cc_config.xml file if you want to probe more deeply -- https://boinc.berkeley.edu/wiki/Client_configuration.
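
As a rough illustration of the extra logging mentioned above, the Python sketch below builds a cc_config.xml body with two log flags switched on. The choice of work_fetch_debug and sched_op_debug is an assumption about what is useful here, not something the post prescribes; check the client configuration page linked above for the full list. The helper name build_cc_config is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    # ET.indent exists from Python 3.9; older versions emit a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed flag choice: work-fetch decisions and scheduler request details.
config_xml = build_cc_config(["work_fetch_debug", "sched_op_debug"])
print(config_xml)
```

The printed text would go into cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files (or be restarted) for the flags to take effect.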

Toby Broom (Volunteer moderator) · Joined: 27 Sep 08 · Posts: 807 · Credit: 651,908,250 · RAC: 291,761
Message 38272 - Posted: 18 Mar 2019, 21:42:27 UTC Last modified: 18 Mar 2019, 21:56:38 UTC
I think something changed server side, I see my 44 core machines draining their queues down with the message that "This computer has reached a limit on tasks in progress". In the past they would buffer 50 WU so they were fully stocked and had 6 spare.
See logging:
4457 LHC@home 03/18/19 22:34:25 [work_fetch] REC 33006.218 prio -1.282 can request work
4460 03/18/19 22:34:25 [work_fetch] --- state for CPU ---
4461 03/18/19 22:34:25 [work_fetch] shortfall 1453962.91 nidle 18.00 saturated 0.00 busy 0.00
4467 LHC@home 03/18/19 22:34:25 [work_fetch] share 1.000
4471 LHC@home 03/18/19 22:34:25 [work_fetch] set_request() for CPU: ninst 44 nused_total 26.00 nidle_now 18.00 fetch share 1.00 req_inst 18.00 req_secs 1453962.91
4472 LHC@home 03/18/19 22:34:25 [work_fetch] set_request() for AMD/ATI GPU: ninst 1 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 1.00 req_secs 44064.00
4473 LHC@home 03/18/19 22:34:25 [work_fetch] request: CPU (1453962.91 sec, 18.00 inst) AMD/ATI GPU (44064.00 sec, 1.00 inst)
4474 LHC@home 03/18/19 22:34:25 Sending scheduler request: To fetch work.
4475 LHC@home 03/18/19 22:34:25 Requesting new tasks for CPU and AMD/ATI GPU
4476 LHC@home 03/18/19 22:34:26 Scheduler request completed: got 0 new tasks
4477 LHC@home 03/18/19 22:34:26 No tasks sent
4478 LHC@home 03/18/19 22:34:26 No tasks are available for SixTrack
4479 LHC@home 03/18/19 22:34:26 No tasks are available for sixtracktest
4480 LHC@home 03/18/19 22:34:26 No tasks are available for CMS Simulation
4481 LHC@home 03/18/19 22:34:26 No tasks are available for Theory Simulation
4482 LHC@home 03/18/19 22:34:26 This computer has reached a limit on tasks in progress
It looks like it requests 18 WU, but there is a limit so it doesn't get anything. Not sure what flag shows server responses?
Maybe there is some limit to how many jobs you can run per day? This computer took 55 tasks today.

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38273 - Posted: 19 Mar 2019, 3:45:17 UTC - in response to Message 38272.
"I think something changed server side, I see my 44 core machines draining their queues down with the message that 'This computer has reached a limit on tasks in progress'. In the past they would buffer 50 WU so they were fully stocked and had 6 spare. Maybe there is some limit to how many jobs you can run per day? This computer took 55 tasks today."
There is a limit to how many tasks you can queue, but I'm not sure how it's implemented in LHC@Home -- whether it's a limit per project or an overall limit. I know at SETI@Home the limit is 100 per PC (regardless of how many CPUs) plus 100 per GPU.
There is also a daily quota, usually too large to be noticed. Errored or aborted tasks will incrementally reduce your quota, down to one task per day; conversely, valid tasks increment your quota up to the machine limit. There is usually an output message that the machine has reached its quota of N per day when this happens (I got that when I aborted a bunch of tasks the other day, and it took a little while before I could get more tasks).
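
The daily-quota behaviour described above can be pictured with a toy model in Python. This is only a sketch of the description in the post, not the server's actual code: the step size of one task per result and the name update_quota are assumptions made for illustration.

```python
def update_quota(quota: int, outcome: str, machine_limit: int) -> int:
    """Toy model: adjust a host's daily task quota after one result.

    Assumed step size of 1 per result; the real scheduler may use
    different increments.
    """
    if outcome == "valid":
        # Valid results raise the quota, capped at the machine limit.
        return min(quota + 1, machine_limit)
    if outcome in ("error", "aborted"):
        # Errored or aborted results lower it, but never below 1 per day.
        return max(quota - 1, 1)
    raise ValueError(f"unknown outcome: {outcome!r}")


# Example: aborting a batch of tasks pushes the quota to the floor,
# and valid results then build it back up.
quota = 3
for outcome in ["aborted", "aborted", "aborted", "valid", "valid"]:
    quota = update_quota(quota, outcome, machine_limit=50)
print(quota)
```

Under these assumptions, a burst of aborted tasks is enough to explain a temporary "reached its quota" message even on a machine that normally never notices the quota.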

Erich56 · Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,561,083 · RAC: 119,543
Message 38276 - Posted: 19 Mar 2019, 6:05:37 UTC
there were definitely some changes made on server-side.
Until now, I was able to run as many subproject jobs as my 6+6 (HT) core CPU allowed (32 GB RAM was never a problem, except for ATLAS). So, for example, I had 8 Theory or 8 CMS tasks running simultaneously, plus a few more in the waiting queue.
Currently, 1 CMS + 1 Theory task are running. Although my web settings say "max. number of tasks: 6", and my app_config.xml is set to run 3 CMS and 3 Theory tasks, I cannot download any more CMS or Theory tasks.
The BOINC log says:
19.03.2019 06:50:37 | LHC@home | update requested by user
19.03.2019 06:50:42 | LHC@home | Sending scheduler request: Requested by user.
19.03.2019 06:50:42 | LHC@home | Requesting new tasks for CPU
19.03.2019 06:50:43 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 06:50:43 | LHC@home | No tasks sent
19.03.2019 06:50:43 | LHC@home | No tasks are available for CMS Simulation
19.03.2019 06:50:43 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 06:50:43 | LHC@home | This computer has reached a limit on tasks in progress
19.03.2019 06:50:54 | LHC@home | Sending scheduler request: To fetch work.
19.03.2019 06:50:54 | LHC@home | Requesting new tasks for CPU
19.03.2019 06:50:55 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 06:50:55 | LHC@home | No tasks sent
19.03.2019 06:50:55 | LHC@home | No tasks are available for CMS Simulation
19.03.2019 06:50:55 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 06:50:55 | LHC@home | This computer has reached a limit on tasks in progress
19.03.2019 07:01:26 | LHC@home | update requested by user
19.03.2019 07:01:28 | LHC@home | Sending scheduler request: Requested by user.
19.03.2019 07:01:28 | LHC@home | Requesting new tasks for CPU
19.03.2019 07:01:29 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 07:01:29 | LHC@home | No tasks sent
19.03.2019 07:01:29 | LHC@home | No tasks are available for Theory Simulation
19.03.2019 07:01:29 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 07:01:29 | LHC@home | This computer has reached a limit on tasks in progress
19.03.2019 07:01:39 | LHC@home | Sending scheduler request: To fetch work.
19.03.2019 07:01:39 | LHC@home | Requesting new tasks for CPU
19.03.2019 07:01:40 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 07:01:40 | LHC@home | No tasks sent
19.03.2019 07:01:40 | LHC@home | No tasks are available for Theory Simulation
19.03.2019 07:01:40 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 07:01:40 | LHC@home | This computer has reached a limit on tasks in progress
what irritates me is what Toby Broom was mentioning already: "This computer has reached a limit on tasks in progress" - how come? Who sets this limit of tasks in progress? All my settings are "6", on the webpage as well as in the app_config.xml!
So, something is definitely running wrong somewhere server-side all of a sudden, since yesterday.
Ivan, could you please look into this?
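
When an event log grows long, a small script can pull out the two facts that matter here: which applications reported no work, and whether the "limit on tasks in progress" message appeared. The Python sketch below assumes lines in the "timestamp | project | message" layout of the log in the post above; the function name summarize and the sample list are illustrative only.

```python
LIMIT_MSG = "This computer has reached a limit on tasks in progress"
NO_TASKS_PREFIX = "No tasks are available for "


def summarize(log_lines):
    """Return (sorted apps with no tasks, whether the limit message was seen)."""
    unavailable = set()
    limit_hit = False
    for line in log_lines:
        # Split into timestamp, project, message; skip anything malformed.
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue
        message = parts[2]
        if message == LIMIT_MSG:
            limit_hit = True
        elif message.startswith(NO_TASKS_PREFIX):
            unavailable.add(message[len(NO_TASKS_PREFIX):])
    return sorted(unavailable), limit_hit


# A few lines copied from the log in the post.
sample = [
    "19.03.2019 06:50:43 | LHC@home | No tasks sent",
    "19.03.2019 06:50:43 | LHC@home | No tasks are available for CMS Simulation",
    "19.03.2019 06:50:43 | LHC@home | This computer has reached a limit on tasks in progress",
    "19.03.2019 07:01:29 | LHC@home | No tasks are available for Theory Simulation",
]
apps, limit_hit = summarize(sample)
print(apps, limit_hit)
```

Seeing the limit message alongside every "No tasks are available" line suggests the refusal comes from a task-count limit rather than from an empty queue.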

Toby Broom (Volunteer moderator) · Joined: 27 Sep 08 · Posts: 807 · Credit: 651,908,250 · RAC: 291,761
Message 38277 - Posted: 19 Mar 2019, 6:26:05 UTC Last modified: 19 Mar 2019, 6:26:54 UTC
This morning the PC is down to 4 tasks.
I think that now unlimited = 1 job. This has been configured in the past, so I assume the introduction of native Theory has reset this config.
Since I run ATLAS in a separate BOINC session, it seems that this project is not affected; it is still running 12 at once.
Looking at Erich's results, it would appear the Job system is totally broke, as he has forced 6 jobs and only gets 1.
Lawrence should take a look at the settings.

Erich56 · Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,561,083 · RAC: 119,543
Message 38280 - Posted: 19 Mar 2019, 7:58:43 UTC - in response to Message 38277.
"it would appear the Job system is totally broke"
this is most probably the case :-(

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38281 - Posted: 19 Mar 2019, 8:26:36 UTC - in response to Message 38276.
"there were definitely some changes made on server-side."
"So, something is definitely running wrong somewhere server-side all of a sudden, since yesterday. Ivan, could you please look into this?"
Yes, I'll be sending some e-mails this morning.

Toby Broom (Volunteer moderator) · Joined: 27 Sep 08 · Posts: 807 · Credit: 651,908,250 · RAC: 291,761
Message 38282 - Posted: 19 Mar 2019, 8:39:45 UTC
My PC finished the remaining WUs; now it got 1 task for CMS and one for Theory.
I imagine Lawrence set it to one for the new project, so that if there are problems it doesn't cause too much damage, but it has changed CMS and the regular Theory projects.

Crystal Pellet (Volunteer moderator, Volunteer tester) · Joined: 14 Jan 10 · Posts: 1279 · Credit: 8,484,048 · RAC: 1,651
Message 38283 - Posted: 19 Mar 2019, 9:20:11 UTC
When I set
Max # jobs: No limit
Max # CPUs: No limit
I get a max of 2 tasks / core (tested with Theory Native), but I think this is not wanted for the multi-core running applications CMS and ATLAS.

Erich56 · Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,561,083 · RAC: 119,543
Message 38318 - Posted: 19 Mar 2019, 17:21:53 UTC
I am curious when this problem will be repaired

Toby Broom (Volunteer moderator) · Joined: 27 Sep 08 · Posts: 807 · Credit: 651,908,250 · RAC: 291,761
Message 38321 - Posted: 19 Mar 2019, 18:41:29 UTC - in response to Message 38283.
I'm not sure where the 50-job limit came from; I just worked that out from testing. I assume there aren't many people who run 50 tasks at once.
My settings are:
Max # jobs: No limit
Max # CPUs: 1
I think one task/job per core is a fine limit.
I changed my settings to match yours, and now I get more tasks/jobs.
The reason I use Max # CPUs 1 is that the RAM calculation from BOINC is not correct when set to No limit. E.g. a No limit Theory task takes 32 cores with a RAM usage of 3.06 GB, whereas a 1-core Theory task takes a working set of 0.74 GB. A 32-core WU is nonsense, I imagine?
Option #1: I can dial back the number of cores with app_config, but now the working set is wrong by 4x, so if I have 44 cores BOINC thinks I need 134 GB of RAM and will not run 44 tasks/jobs, whereas in reality I use 33 GB to run 44 cores.
Option #2: I have to use a lower #CPU setting, which as we know at 1 limits the number of jobs to 1. I could have 8, which would give me maybe 16 WUs.
Since there is no multi-core CMS, it runs fine now with No limit settings.
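
The memory arithmetic in the post above can be checked with a few lines of Python. The numbers are the ones quoted (44 tasks, 3.06 GB assumed per task versus 0.74 GB actually used); the function name estimated_ram_gb is hypothetical, and the simple multiplication is a simplification of whatever the client really computes.

```python
def estimated_ram_gb(tasks: int, working_set_gb: float) -> float:
    """Total RAM needed if every task reserves the given working set."""
    return tasks * working_set_gb


TASKS = 44
assumed = estimated_ram_gb(TASKS, 3.06)  # working set of the "No limit" task
actual = estimated_ram_gb(TASKS, 0.74)   # working set of a 1-core task

print(f"BOINC's estimate: {assumed:.0f} GB")
print(f"Actual use:       {actual:.0f} GB")
print(f"Overestimate:     {assumed / actual:.1f}x")
```

This matches the roughly 134 GB versus 33 GB figures in the post, an overestimate of about four times, which is why the client declines to start all 44 tasks.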

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38325 - Posted: 19 Mar 2019, 21:29:55 UTC - in response to Message 38318.
"I am curious when this problem will be repaired"
People have been notified; I'm awaiting responses. It was a public holiday in der Schweiz today, but "Graubünden, Lucerne, Nidwalden, Schwyz, Solothurn, Ticino, Uri, Valais only".

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38326 - Posted: 19 Mar 2019, 21:32:32 UTC - in response to Message 38321.
"Since there is no multi-core CMS then it runs fine now with No Limit settings."
There is in -dev. Specify an N-core VM and it runs N CMS jobs in parallel, but not N threads.

Harri Liljeroos · Joined: 28 Sep 04 · Posts: 675 · Credit: 43,614,265 · RAC: 15,805
Message 38337 - Posted: 20 Mar 2019, 8:15:30 UTC
The CMS jobs graphs are failing both here and at dev.

ivan (Volunteer moderator, Project tester, Volunteer developer, Volunteer tester, Project scientist) · Joined: 29 Aug 05 · Posts: 1006 · Credit: 6,272,044 · RAC: 436
Message 38340 - Posted: 20 Mar 2019, 10:43:15 UTC - in response to Message 38337.
"The CMS jobs graphs are failing both here and at dev."
Yes, I noticed that (they are in essence the same graphs, just presenting the data in different categories). Can't see why yet.
