Message boards : CMS Application : Feature Request: wu.rsc_fpops_est adjustment
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
Hello, can I request that CMS adjusts the rsc_fpops_est for its WUs? On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS. I would like the estimate to be adjusted server side to something more like the values used by Theory or ATLAS, as they all have similar actual runtimes. This enhancement would reduce the server load in creating WUs, especially when they run out on the backend. Thanks
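For readers unfamiliar with the parameter: rsc_fpops_est is set per workunit when the project creates work, e.g. in the workunit part of the input template. A minimal sketch with made-up values (not CMS's actual settings):

```xml
<!-- Hypothetical fragment of a BOINC workunit input template;
     the numbers are illustrative placeholders, not CMS's values. -->
<workunit>
    <!-- estimated total FLOPs per task; the scheduler divides this by
         the host's projected speed to predict the runtime it uses
         during work fetch -->
    <rsc_fpops_est>2e15</rsc_fpops_est>
    <!-- hard upper bound; a task exceeding it is aborted -->
    <rsc_fpops_bound>2e16</rsc_fpops_bound>
</workunit>
```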
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS.

It looks like you have the dreaded <max_concurrent> problem: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5738&postid=45506#45506 See also: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
For ATLAS and Theory tasks there are probably additional server-side, per-host task limits that prevent them from flooding the computer with tasks even when <max_concurrent> is used in app_config.xml. My 8-core/16-thread host gets only 8 Theory tasks and 16 ATLAS tasks although I am using the <max_concurrent> limitations. The host can handle those tasks before the deadline, as LHC is the only CPU project on it: 12 CPU cores crunch those LHC tasks, 2 cores are reserved to aid the two GPUs, and 2 are kept free for the OS. CMS seems to be lacking this feature; if I enable CMS on that host I also get hundreds of CMS tasks.
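Those server-side limits are presumably BOINC's scheduler job limits (the <max_jobs_in_progress> stanza in the project's config.xml). A sketch of what such a limit can look like; the app name and numbers are illustrative guesses, not LHC@home's actual configuration:

```xml
<max_jobs_in_progress>
    <app>
        <app_name>Theory</app_name>
        <cpu_limit>
            <base_limit>1</base_limit>
            <!-- with per_proc the limit scales with the host's CPU count,
                 which would explain 8 tasks on an 8-core host -->
            <per_proc/>
        </cpu_limit>
    </app>
</max_jobs_in_progress>
```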
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038
Probably the server is confused: on the 2nd of January there were CMS BOINC tasks available, but no CMS jobs for the VM. BOINC returned those tasks as valid results with only 2-3 minutes of CPU time, causing the server to calculate that you can do a lot of tasks within a short time and to send a lot of tasks. This will settle after a while, once you return several tasks with the normal runtime. The other problem is still the BOINC bug with max_concurrent. On GitHub it was solved a month ago, but the fix is not yet in the recommended BOINC version.
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
Probably. I don't limit CMS, but I do for ATLAS; from the discussion on GitHub it seems that if you set <max_concurrent> for anything in a project, it breaks the scheduling. CMS commonly sends a ton of work, so I don't think it correlates with the loss of work. I must hammer the backend when it happens though, pulling down something like 6000 WUs as they error out in 2-3 minutes. I'll wait for the next BOINC release and see what happens. It would maybe be smart for CMS to make the same server-side changes as well, though.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391
BOINC's work fetch and credit calculation are closely related. The major factors are:
- the estimated fpops per task
- the runtime per task
- the computer's peak fpops stored in the server DB

The latter is calculated from the first two (besides some minor parameters). As long as the runtime remains stable across lots of tasks, the peak fpops also remains stable. Weird things happen - delayed - when the runtimes are far off their usual values, in the case of CMS when the job queue is empty. Computers running just a few "empty" tasks will see only a small change in their peak fpops, but computers running lots of tasks will quickly become "crunching monsters". This is (besides the known bug) a major reason why they sometimes receive tons of tasks - until the peak fpops is back down to normal. Changing the estimated fpops per task would not change the long-term behaviour; it would just shift the peak fpops value where all of that starts.
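A toy model of that feedback loop in plain Python, with made-up numbers (the real client and scheduler apply additional smoothing and corrections):

```python
# Toy model of the effect described above: the server infers host speed
# from (estimated FLOPs / observed runtime), so abnormally short "empty"
# tasks inflate the apparent speed and hence the next work request.
# All numbers are illustrative, not CMS's real values.

RSC_FPOPS_EST = 2e15          # assumed server-side estimate per task

def apparent_flops(runtime_s: float) -> float:
    """Speed the server would infer from one returned task."""
    return RSC_FPOPS_EST / runtime_s

def tasks_fetched(buffer_days: float, runtime_est_s: float) -> float:
    """Tasks needed to fill the work buffer at the estimated runtime."""
    return buffer_days * 86400 / runtime_est_s

normal = apparent_flops(4 * 3600)   # normal ~4 h task
empty = apparent_flops(150)         # "empty" 2-3 minute task

# The estimated runtime of the next task scales inversely with the
# inferred speed, so the same 0.2-day buffer suddenly "needs" ~100x
# more tasks:
print(tasks_fetched(0.2, RSC_FPOPS_EST / normal))   # ~1.2 tasks
print(tasks_fetched(0.2, RSC_FPOPS_EST / empty))    # ~115 tasks
```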
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
My computer was often getting 1000 WUs as per the bug; I assume the bug has been there for a while, but for whatever reason it only ever affected CMS.
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
Is this client parameter a temporary solution?

> --fetch_minimal_work
> Fetch only enough jobs to use all device instances (CPU, GPU). Used with --exit_when_idle, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.
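For reference: --fetch_minimal_work is a client command-line option rather than an app_config.xml setting; the same switch is available in cc_config.xml, which may be more convenient. A minimal sketch:

```xml
<cc_config>
    <options>
        <!-- fetch only enough jobs to occupy all device instances -->
        <fetch_minimal_work>1</fetch_minimal_work>
    </options>
</cc_config>
```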
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
Probably the best option for the time being.
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038
> I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.

Running 6 or 7 BOINC clients on one machine could be the solution.
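A sketch of how an extra client instance can be started, assuming a reasonably recent client; the data directory and port here are example values. Each instance registers with the scheduler as a separate host, so each gets its own in-progress limit:

```
# second instance with its own data directory and GUI-RPC port
boinc --allow_multiple_clients --dir /var/lib/boinc2 --gui_rpc_port 31418
```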
Joined: 4 Oct 21 Posts: 10 Credit: 43,987,595 RAC: 8,529
Too many posts, too little substance; only Harri Liljeroos was on point.

> Running 6 or 7 BOINC clients on one machine could be the solution.

No. The solution, until a patched BOINC comes out, is for the project administrators to impose a hard limit on workunits in progress, like there is for ATLAS and (probably) Theory. SixTrack seems to be the same as CMS, but because its computation times are sufficiently small it can get away with it. Until then I use a script that keeps only 50 workunits on board and throws away the rest.
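The poster's script isn't shown; below is a rough Python equivalent of the idea, assuming a local client with boinccmd on PATH and the usual `boinccmd --get_tasks` output format. The project URL, the threshold of 50, and the "never started" selection rule are guesses for illustration:

```python
#!/usr/bin/env python3
"""Sketch of a "keep only 50 WUs on board" script (not the poster's)."""
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # assumed project URL
KEEP = 50                                             # threshold from the post

def get_tasks():
    """Parse `boinccmd --get_tasks` output into a list of field dicts."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    tasks, cur = [], None
    for line in out.splitlines():
        stripped = line.strip()
        if stripped.startswith("name: "):   # each task record starts here
            cur = {}
            tasks.append(cur)
        if cur is not None and ": " in stripped:
            key, _, val = stripped.partition(": ")
            cur[key] = val
    return tasks

def main():
    mine = [t for t in get_tasks() if t.get("project URL") == PROJECT_URL]
    # keep the first KEEP tasks; abort surplus tasks that never started
    for t in mine[KEEP:]:
        if float(t.get("fraction done", "0")) == 0.0:
            subprocess.run(["boinccmd", "--task", PROJECT_URL,
                            t["name"], "abort"], check=True)

if __name__ == "__main__":
    main()
```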
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
Evangelos, you can deselect CMS in the LHC prefs. It's a better solution than deleting thousands of CMS tasks!
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
> I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44-56 core computers by a large amount.

I seem to recall that we established a while ago that the arbitrary limit of 8 in the CPU/task preferences could be increased, but there wasn't a need for it at the time. That's something for Laurence or Nils to contemplate; I have no control over that aspect of the project.