Message boards : CMS Application : CMS jobs are becoming available again

Previous · 1 · 2 · 3 · 4 · Next

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 38225 - Posted: 12 Mar 2019, 6:24:20 UTC

now the queue is empty again - any new problems?
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 38226 - Posted: 12 Mar 2019, 7:32:17 UTC - in response to Message 38225.  
Last modified: 12 Mar 2019, 16:02:12 UTC

now the queue is empty again - any new problems?
No sub-jobs available. 2781 succeeded, 230 aborted, 162 app failed, 0 pending and 17 running
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38229 - Posted: 12 Mar 2019, 14:48:13 UTC - in response to Message 38225.  

now the queue is empty again - any new problems?

No, I stayed too long in bed as Storm Gareth started to batter London, and the queue drained. There are more jobs there now.
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 38238 - Posted: 13 Mar 2019, 10:17:35 UTC - in response to Message 38217.  

What do you mean by "CPU load"? Do you mean top's %cpu, or the ratio of CPU time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top, but the ratio of CPU time to run time is 45,764.33/64,179.97 = 71%.


But I think that your 71% is also a reflection of the wall-clock time taken up by the VM filling its CVMFS cache and so on, i.e. network traffic, rather than just CPU efficiency in the number crunching phase proper.

There again, all my CMS tasks failed on a heartbeat error :( even though the same machines run Theory and ATLAS VMs quite happily.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 38263 - Posted: 18 Mar 2019, 19:21:54 UTC

Although the Server Status page shows 197 tasks available for download, none of my machines can download any.
BOINC says "no tasks available for CMS simulation".

BTW: same is true for Theory.

What's going on?
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38270 - Posted: 18 Mar 2019, 21:25:27 UTC - in response to Message 38263.  

Although the Server Status page shows 197 tasks available for download, none of my machines can download any.
BOINC says "no tasks available for CMS simulation".

BTW: same is true for Theory.

What's going on?

Do you already have tasks running? CMS won't download tasks unless there is a BOINC slot for them (i.e. you shouldn't get any tasks in the "Waiting to run" state).
You can turn on extra logging in your cc_config.xml file if you want to probe more deeply -- https://boinc.berkeley.edu/wiki/Client_configuration.
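For example, a minimal cc_config.xml that turns on the work-fetch and scheduler-operation debug flags (both are standard client options documented at the wiki page above) might look like this:

```xml
<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>  <!-- log work-fetch decisions -->
    <sched_op_debug>1</sched_op_debug>      <!-- log scheduler requests and replies -->
  </log_flags>
</cc_config>
```

After editing the file, re-read it from the BOINC Manager (Options → Read config files) or restart the client for the flags to take effect.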
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,770,563
RAC: 231,713
Message 38272 - Posted: 18 Mar 2019, 21:42:27 UTC
Last modified: 18 Mar 2019, 21:56:38 UTC

I think something changed server-side; I see my 44-core machines draining their queues down with the message "This computer has reached a limit on tasks in progress". In the past they would buffer 50 WUs, so they were fully stocked and had 6 spare.

See logging:

4457 LHC@home 03/18/19 22:34:25 [work_fetch] REC 33006.218 prio -1.282 can request work
4460 03/18/19 22:34:25 [work_fetch] --- state for CPU ---
4461 03/18/19 22:34:25 [work_fetch] shortfall 1453962.91 nidle 18.00 saturated 0.00 busy 0.00
4467 LHC@home 03/18/19 22:34:25 [work_fetch] share 1.000
4471 LHC@home 03/18/19 22:34:25 [work_fetch] set_request() for CPU: ninst 44 nused_total 26.00 nidle_now 18.00 fetch share 1.00 req_inst 18.00 req_secs 1453962.91
4472 LHC@home 03/18/19 22:34:25 [work_fetch] set_request() for AMD/ATI GPU: ninst 1 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 1.00 req_secs 44064.00
4473 LHC@home 03/18/19 22:34:25 [work_fetch] request: CPU (1453962.91 sec, 18.00 inst) AMD/ATI GPU (44064.00 sec, 1.00 inst)
4474 LHC@home 03/18/19 22:34:25 Sending scheduler request: To fetch work.
4475 LHC@home 03/18/19 22:34:25 Requesting new tasks for CPU and AMD/ATI GPU
4476 LHC@home 03/18/19 22:34:26 Scheduler request completed: got 0 new tasks
4477 LHC@home 03/18/19 22:34:26 No tasks sent
4478 LHC@home 03/18/19 22:34:26 No tasks are available for SixTrack
4479 LHC@home 03/18/19 22:34:26 No tasks are available for sixtracktest
4480 LHC@home 03/18/19 22:34:26 No tasks are available for CMS Simulation
4481 LHC@home 03/18/19 22:34:26 No tasks are available for Theory Simulation
4482 LHC@home 03/18/19 22:34:26 This computer has reached a limit on tasks in progress

It looks like it requests 18 WUs, but there is a limit so it doesn't get anything. Not sure which flag shows the server responses?

Maybe there is some limit to how many jobs you can run per day? This computer took 55 tasks today.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38273 - Posted: 19 Mar 2019, 3:45:17 UTC - in response to Message 38272.  

I think something changed server-side; I see my 44-core machines draining their queues down with the message "This computer has reached a limit on tasks in progress". In the past they would buffer 50 WUs, so they were fully stocked and had 6 spare.

Maybe there is some limit to how many jobs you can run per day? This computer took 55 tasks today.

There is a limit to how many tasks you can queue, but I'm not sure how it's implemented in LHC@home -- whether it's a limit per project or an overall limit. I know at SETI@Home the limit is 100 per PC (regardless of how many CPUs) plus 100 per GPU.
There is also a daily quota, usually too large to be noticed. Errored or aborted tasks will incrementally reduce your quota, down to one task per day; conversely, valid tasks increase your quota back up to the machine limit. There is usually a log message that the machine has reached its quota of N tasks per day when this happens (I got that when I aborted a bunch of tasks the other day, and it took a little while before I could get more tasks).
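The quota behaviour Ivan describes can be sketched roughly as follows. This is only an illustration of the described policy, not the actual server code; the real BOINC scheduler logic is server-side and more involved (some versions double the quota on a valid result rather than adding one), and MACHINE_LIMIT here is an assumed illustrative value, not a real LHC@home setting.

```python
# Hedged sketch of the per-host daily-quota adjustment described above.
MACHINE_LIMIT = 100  # assumed per-host ceiling, for illustration only

def update_daily_quota(quota: int, task_valid: bool) -> int:
    """Return the new tasks-per-day quota after one task outcome."""
    if task_valid:
        # Valid tasks restore the quota, up to the machine limit.
        return min(quota + 1, MACHINE_LIMIT)
    # Errored or aborted tasks cut it, with a floor of one task per day.
    return max(quota - 1, 1)
```

This matches the behaviour in the thread: a burst of aborts drives the quota toward one per day, and it takes a run of valid results before the host can fetch a full load again.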
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 38276 - Posted: 19 Mar 2019, 6:05:37 UTC

There were definitely some changes made server-side.

So far, I was able to run as many subproject jobs as my 6+6 (HT) core CPU allowed (32 GB RAM was never a problem, except for ATLAS).
So, for example, I had 8 Theory or 8 CMS tasks running simultaneously, plus a few more in the waiting queue.

Currently, only 1 CMS and 1 Theory task are running.
Although my web settings say "max. number of tasks: 6", and my app_config.xml is set to run 3 CMS and 3 Theory tasks, I cannot download any more CMS or Theory tasks.

The BOINC log says:

19.03.2019 06:50:37 | LHC@home | update requested by user
19.03.2019 06:50:42 | LHC@home | Sending scheduler request: Requested by user.
19.03.2019 06:50:42 | LHC@home | Requesting new tasks for CPU
19.03.2019 06:50:43 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 06:50:43 | LHC@home | No tasks sent
19.03.2019 06:50:43 | LHC@home | No tasks are available for CMS Simulation
19.03.2019 06:50:43 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 06:50:43 | LHC@home | This computer has reached a limit on tasks in progress
19.03.2019 06:50:54 | LHC@home | Sending scheduler request: To fetch work.
19.03.2019 06:50:54 | LHC@home | Requesting new tasks for CPU
19.03.2019 06:50:55 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 06:50:55 | LHC@home | No tasks sent
19.03.2019 06:50:55 | LHC@home | No tasks are available for CMS Simulation
19.03.2019 06:50:55 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 06:50:55 | LHC@home | This computer has reached a limit on tasks in progress

19.03.2019 07:01:26 | LHC@home | update requested by user
19.03.2019 07:01:28 | LHC@home | Sending scheduler request: Requested by user.
19.03.2019 07:01:28 | LHC@home | Requesting new tasks for CPU
19.03.2019 07:01:29 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 07:01:29 | LHC@home | No tasks sent
19.03.2019 07:01:29 | LHC@home | No tasks are available for Theory Simulation
19.03.2019 07:01:29 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 07:01:29 | LHC@home | This computer has reached a limit on tasks in progress
19.03.2019 07:01:39 | LHC@home | Sending scheduler request: To fetch work.
19.03.2019 07:01:39 | LHC@home | Requesting new tasks for CPU
19.03.2019 07:01:40 | LHC@home | Scheduler request completed: got 0 new tasks
19.03.2019 07:01:40 | LHC@home | No tasks sent
19.03.2019 07:01:40 | LHC@home | No tasks are available for Theory Simulation
19.03.2019 07:01:40 | LHC@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
19.03.2019 07:01:40 | LHC@home | This computer has reached a limit on tasks in progress

What puzzles me is what Toby Broom already mentioned: "This computer has reached a limit on tasks in progress" - how come? Who sets this limit on tasks in progress?
All my settings are "6", on the webpage as well as in the app_config.xml!

So, something definitely started going wrong server-side all of a sudden, since yesterday.
Ivan, could you please look into this?
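For reference, the kind of per-app limit Erich describes is typically set with max_concurrent entries in app_config.xml. A minimal sketch, assuming the project's internal application names are "CMS" and "Theory" -- check client_state.xml for the exact names on your host:

```xml
<app_config>
  <app>
    <name>CMS</name>          <!-- must match the project's internal app name -->
    <max_concurrent>3</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>
```

Note that max_concurrent only caps how many tasks run at once on the client; as this thread shows, it cannot override a server-side limit on tasks in progress.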
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,770,563
RAC: 231,713
Message 38277 - Posted: 19 Mar 2019, 6:26:05 UTC
Last modified: 19 Mar 2019, 6:26:54 UTC

This morning the PC is down to 4 tasks. I think that "No limit" now equals 1 job. This had been configured differently in the past, so I assume the introduction of native Theory has reset this config.

Since I run ATLAS in a separate BOINC session, it seems that project is not affected; it's still running 12 at once.

Looking at Erich's results, it would appear the job system is totally broken, as he has forced 6 jobs and only gets 1.

Lawrence should take a look at the settings.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 38280 - Posted: 19 Mar 2019, 7:58:43 UTC - in response to Message 38277.  

it would appear the job system is totally broken
This is most probably the case :-(
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38281 - Posted: 19 Mar 2019, 8:26:36 UTC - in response to Message 38276.  

There were definitely some changes made server-side.
So, something definitely started going wrong server-side all of a sudden, since yesterday.
Ivan, could you please look into this?

Yes, I'll be sending some e-mails this morning.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,770,563
RAC: 231,713
Message 38282 - Posted: 19 Mar 2019, 8:39:45 UTC

My PC finished the remaining WUs; now it got one task for CMS and one for Theory.

I imagine Lawrence set it to one for the new project, so if there are problems it doesn't cause too much damage, but it has changed CMS and the regular Theory projects as well.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 38283 - Posted: 19 Mar 2019, 9:20:11 UTC

When I set
Max # jobs	No limit
Max # CPUs	No limit
I get a maximum of 2 tasks per core (tested with Theory Native),
but I think this is not desirable for the multi-core applications CMS and ATLAS.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 38318 - Posted: 19 Mar 2019, 17:21:53 UTC

I am curious when this problem will be repaired.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,770,563
RAC: 231,713
Message 38321 - Posted: 19 Mar 2019, 18:41:29 UTC - in response to Message 38283.  

I'm not sure where the 50-job limit came from; I just worked that out from testing. I assume there aren't many people who run 50 tasks at once.

My settings are:

Max # jobs No limit
Max # CPUs 1

I think one task/job per core is a fine limit.

I changed my settings to match yours, and now I get more tasks/jobs.

The reason I use "Max # CPUs 1" is that the RAM calculation from BOINC is not correct when set to "No limit". E.g. a "No limit" Theory task takes 32 cores with a RAM usage of 3.06 GB, vs. a 1-core Theory task with a working set of 0.74 GB.

A 32-core WU is nonsense, I can imagine?

Option #1
I can dial back the number of cores with app_config, but then the working set is wrong by 4x: if I have 44 cores, BOINC thinks I need 134 GB of RAM and so will not run 44 tasks/jobs, whereas in reality I use 33 GB to run 44 cores.

Option #2
I use a lower "Max # CPUs" setting which, as we know, at 1 limits the number of jobs to 1. I could set it to 8, which would maybe give me 16 WUs.

Since there is no multi-core CMS, it runs fine now with the "No limit" settings.
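Option #1 above -- dialling back the VM core count via app_config -- is done with an app_version block. A hedged sketch only: the plan_class value shown here is an assumption and should be copied from the matching entry in your own client_state.xml:

```xml
<app_config>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_mt_mcore</plan_class>  <!-- assumed; verify in client_state.xml -->
    <avg_ncpus>4</avg_ncpus>                  <!-- run the VM with 4 cores -->
  </app_version>
</app_config>
```

As Toby notes, this changes the cores the VM uses but BOINC may still budget RAM using the multi-core working set, which is exactly the 4x discrepancy described above.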
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38325 - Posted: 19 Mar 2019, 21:29:55 UTC - in response to Message 38318.  

I am curious when this problem will be repaired

People have been notified; I'm awaiting responses. It was a public holiday in Switzerland today, but only in "Graubünden, Lucerne, Nidwalden, Schwyz, Solothurn, Ticino, Uri, Valais".
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38326 - Posted: 19 Mar 2019, 21:32:32 UTC - in response to Message 38321.  

Since there is no multi-core CMS then it runs fine now with No Limit settings.

There is in -dev. Specify an N-core VM and it runs N CMS jobs in parallel, but not N threads.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 38337 - Posted: 20 Mar 2019, 8:15:30 UTC

The CMS jobs graphs are failing both here and at dev.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 38340 - Posted: 20 Mar 2019, 10:43:15 UTC - in response to Message 38337.  

The CMS jobs graphs are failing both here and at dev.

Yes, I noticed that (they are in essence the same graphs, just presenting the data in different categories). Can't see why yet.

©2024 CERN