Message boards : CMS Application : Feature Request: wu.rsc_fpops_est adjustment


Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,973
RAC: 232,541
Message 45984 - Posted: 3 Jan 2022, 20:46:12 UTC
Last modified: 3 Jan 2022, 20:48:23 UTC

Hello,

Can I request that CMS adjust the fpops estimate (wu.rsc_fpops_est) for its WUs?

On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS.



I would like the fpops estimate to be adjusted server side to something more like the values used by Theory or ATLAS, since the actual runtimes of all three are similar.

Making this change would reduce the server load from creating WUs, especially when they run out on the backend.

Thanks
ID: 45984
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45985 - Posted: 3 Jan 2022, 21:33:37 UTC - in response to Message 45984.  

On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS.

It looks like you have the dreaded <max_concurrent> problem.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5738&postid=45506#45506

See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384
ID: 45985
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 45986 - Posted: 3 Jan 2022, 22:46:23 UTC

For ATLAS and Theory tasks there are probably additional server-side per-host task limits that prevent them from flooding the computer with tasks even when <max_concurrent> is used in app_config.xml. My 8/16 core host gets only 8 Theory tasks and 16 ATLAS tasks although I am using the <max_concurrent> limitations. The host can handle those tasks before the deadline, as LHC is the only CPU project on it: 12 CPU cores crunch the LHC tasks, 2 cores are reserved to aid the two GPUs, and 2 are kept free for the OS.

CMS seems to be lacking this feature. If I enable CMS on that host I also get hundreds of CMS tasks.
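
For reference, the client-side limit mentioned above lives in an app_config.xml in the project's directory under the BOINC data directory. A minimal sketch, assuming the app's short name on this project really is "ATLAS" (check the actual app names and pick your own limit before using it):

<app_config>
    <app>
        <name>ATLAS</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>

Note the caveat discussed below in this thread: with the current client bug, setting <max_concurrent> for any app can break work fetch for the whole project.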
ID: 45986
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45988 - Posted: 4 Jan 2022, 8:34:51 UTC

Probably the server is confused, because on the 2nd of January there were CMS BOINC tasks available but no CMS jobs for the VM.
BOINC returned those tasks as valid results with only 2-3 minutes of CPU time, causing the server to calculate that you can do a lot of tasks within a short time, and so to send a lot of tasks.
This will settle down after a while, once you return several tasks with the normal runtime.

The other problem is still the BOINC bug with max_concurrent. On GitHub it was solved a month ago, but the fix is not yet in the recommended BOINC version.
ID: 45988
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,973
RAC: 232,541
Message 45998 - Posted: 4 Jan 2022, 17:52:34 UTC - in response to Message 45988.  
Last modified: 4 Jan 2022, 17:56:37 UTC

Probably. I don't limit CMS, but I do for ATLAS; from the discussion on GitHub it seems that if you set a max_concurrent for anything in a project, it breaks the scheduling.

CMS commonly sends a ton of work, so I don't think it correlates with the loss of work. I must hammer the backend when it happens though, pulling down something like 6000 WUs as they error out in 2-3 minutes.

I'll wait for the next BOINC version and see what happens.

Maybe it would still be smart for CMS to make the same server-side changes as well?
ID: 45998
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,946,857
RAC: 137,286
Message 46000 - Posted: 4 Jan 2022, 18:42:51 UTC - in response to Message 45998.  

BOINC's work fetch and credit calculation are closely related.
The major factors are
- the estimated fpops per task
- the runtime per task
- the computer's peak fpops stored in the server DB

The latter is calculated from the first two (besides some minor parameters).
As long as the runtime remains stable across lots of tasks, the peak fpops also remains stable.

Weird things happen - delayed - when the runtimes are far off their usual values; in the case of CMS, when the job queue is empty.
Computers running just a few "empty" tasks will see only a small change in their peak fpops, but computers running lots of tasks will quickly become "crunching monsters".
This is (besides the known bug) a major reason why they sometimes receive tons of tasks - until the peak fpops are down to normal again.


Changing the estimated fpops per task would not change the long-term behaviour.
It would just shift the peak fpops value at which all of that starts.
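
To make the effect concrete, here is a minimal sketch in Python - explicitly not BOINC's real code - assuming the host's apparent speed behaves roughly like the estimated fpops divided by the measured runtime, averaged over recently returned tasks (the per-task estimate is a made-up number):

RSC_FPOPS_EST = 3.6e13  # hypothetical wu.rsc_fpops_est, in floating-point ops

def apparent_gflops(runtimes_s):
    """Average speed implied by returned tasks, in GFLOPS."""
    return sum(RSC_FPOPS_EST / t for t in runtimes_s) / len(runtimes_s) / 1e9

print(apparent_gflops([4 * 3600] * 20))  # twenty ~4 h tasks      -> ~2.5 GFLOPS
print(apparent_gflops([150] * 20))       # twenty 2.5 min "empty" -> ~240 GFLOPS

A scheduler that believes the host has become ~100x faster will also believe it can finish ~100x as many tasks before their deadlines - hence the flood of work until normal runtimes bring the estimate back down.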
ID: 46000
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,973
RAC: 232,541
Message 46006 - Posted: 5 Jan 2022, 20:02:07 UTC - in response to Message 46000.  

My computer was often getting 1000 WUs as per the bug. I assume the bug has been there for a while, but for whatever reason it only ever affected CMS.
ID: 46006
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,130,430
RAC: 104,897
Message 46007 - Posted: 6 Jan 2022, 5:38:02 UTC - in response to Message 46006.  

Is this client parameter a temporary solution?
--fetch_minimal_work
Fetch only enough jobs to use all device instances (CPU, GPU). Used with --exit_when_idle, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.
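
(These are command-line options of the BOINC client, not app_config.xml settings. A minimal usage sketch, assuming the client is started by hand rather than as a service - how options reach a service-managed client varies by platform:

boinc --fetch_minimal_work --exit_when_idle)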
ID: 46007
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46010 - Posted: 6 Jan 2022, 10:02:55 UTC - in response to Message 46006.  

What have you got set in your LHC@Home preferences for "Max # jobs" and "Max # CPUs"? I always match them for each locale. I do recall getting a lot of tasks once when they were mismatched.
ID: 46010
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,973
RAC: 232,541
Message 46014 - Posted: 6 Jan 2022, 17:30:15 UTC - in response to Message 46007.  

Probably the best option for the time being.
ID: 46014
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,973
RAC: 232,541
Message 46016 - Posted: 6 Jan 2022, 17:39:00 UTC - in response to Message 46010.  

I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.
ID: 46016
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 46017 - Posted: 6 Jan 2022, 18:42:26 UTC - in response to Message 46016.  
Last modified: 6 Jan 2022, 18:42:52 UTC

I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.

Running 6 or 7 BOINC clients on one machine could be a solution.
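
A sketch of how a second client instance is usually started on Linux (the data directory and RPC port here are arbitrary examples; each extra client needs its own data directory and its own GUI RPC port):

boinc --allow_multiple_clients --dir /var/lib/boinc2 --gui_rpc_port 31417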
ID: 46017
Evangelos Katikos

Joined: 4 Oct 21
Posts: 10
Credit: 37,777,863
RAC: 179
Message 46018 - Posted: 6 Jan 2022, 19:25:17 UTC - in response to Message 46017.  

Too many posts, too little substance. Only Harri Liljeroos was on point.



Running 6 or 7 BOINC clients on one machine could be a solution.

No, the solution, until a patched BOINC comes out, is for the project administrators to impose a hard limit on workunits in progress, like there is for ATLAS and (probably) Theory. SixTrack seems to behave the same as CMS, but because of its sufficiently small computation times it can get away with it.
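
The server-side knob being asked for here is presumably something like BOINC's per-host job limit in the project's config.xml; a hedged sketch, assuming the standard BOINC server option (the value is an example, and the limit is applied per CPU):

<config>
    <max_wus_in_progress>2</max_wus_in_progress>
</config>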

Until then I use a script that keeps only 50 workunits on board and throws away the rest.
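
The script itself isn't shown; a hypothetical Python sketch of that approach via boinccmd might look like this (the "CMS" name prefix and keeping the first 50 entries of boinccmd's listing are assumptions, and a real script should also skip tasks that are already running):

import re
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
KEEP = 50

# boinccmd --get_tasks prints one "   name: <task_name>" line per task.
out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout
names = re.findall(r"^\s*name: (\S+)$", out, re.MULTILINE)
surplus = [n for n in names if n.startswith("CMS")][KEEP:]

for name in surplus:
    subprocess.run(["boinccmd", "--task", PROJECT_URL, name, "abort"],
                   check=True)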
ID: 46018
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,130,430
RAC: 104,897
Message 46019 - Posted: 6 Jan 2022, 20:42:18 UTC - in response to Message 46018.  

Evangelos,
you can deselect CMS in the LHC-prefs.
It's a better solution than deleting thousands of CMS tasks!
ID: 46019
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46020 - Posted: 6 Jan 2022, 22:07:25 UTC - in response to Message 46016.  

I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.

I seem to recall that we established some while ago that the arbitrary limit of 8 in the CPU/Task preferences could be increased, but there wasn't a need for it at the time. That's something for Laurence or Nils to contemplate; I have no control over that aspect of the project.
ID: 46020
