Feature Request: wu.rsc_fpops

Author	Message
Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 878 Credit: 745,382,348 RAC: 311,651	Message 45984 - Posted: 3 Jan 2022, 20:46:12 UTC Last modified: 3 Jan 2022, 20:48:23 UTC
Hello, can I request that CMS adjust the fpops estimate for its WUs? On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS. I would like the fpops estimate to be adjusted server-side to something more like the ones from Theory or ATLAS, as they all have similar actual runtimes. Making this enhancement would reduce the server load from creating WUs, especially when they run out on the backend. Thanks

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45985 - Posted: 3 Jan 2022, 21:33:37 UTC - in response to Message 45984.
"On my computers, even with a very low work buffer of 0.2 days, BOINC receives hundreds of WUs when it requests work from CMS."
It looks like you have the dreaded <max_concurrent> problem: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5738&postid=45506#45506
See also: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384

Harri Liljeroos Joined: 28 Sep 04 Posts: 780 Credit: 59,778,881 RAC: 46,504	Message 45986 - Posted: 3 Jan 2022, 22:46:23 UTC
For ATLAS and Theory tasks there are probably additional server-side limits on tasks per host that prevent them from flooding the computer with tasks even when <max_concurrent> is used in app_config.xml. My 8-core/16-thread host gets only 8 Theory tasks and 16 ATLAS tasks although I am using the <max_concurrent> limitations. The host can handle those tasks before the deadline, as LHC is the only CPU project on it: 12 CPU cores crunch those LHC tasks, 2 cores are reserved to aid the two GPUs on it, and 2 are kept free for the OS. CMS seems to be lacking this feature. If I enable CMS on that host I also get hundreds of CMS tasks.
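A per-host cap of the kind suggested in this post can be sketched in a few lines. This is a hypothetical illustration, not the LHC@home server configuration: the function name `tasks_to_send` and the limit values are assumptions, and the thread does not confirm how ATLAS and Theory are actually limited.

```python
def tasks_to_send(requested, in_progress, host_limit=None):
    """Number of tasks a scheduler would hand out for one work request.

    requested   -- tasks the client asks for (hypothetical input)
    in_progress -- tasks this host already holds for the application
    host_limit  -- per-host cap on tasks in progress, or None for no cap
    """
    if host_limit is None:
        # No cap: the request is satisfied in full.
        return requested
    # With a cap: never exceed the remaining headroom, never go negative.
    return max(0, min(requested, host_limit - in_progress))


if __name__ == "__main__":
    # A client asking for 300 tasks, with and without a cap of 16.
    print(tasks_to_send(300, in_progress=0))
    print(tasks_to_send(300, in_progress=0, host_limit=16))
    print(tasks_to_send(300, in_progress=16, host_limit=16))
```

With a cap in place, an oversized request is trimmed regardless of how the client computed it, which would match the behaviour described for ATLAS and Theory on this host.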

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1461 Credit: 9,852,564 RAC: 3,037	Message 45988 - Posted: 4 Jan 2022, 8:34:51 UTC
Probably the server is confused, because on the 2nd of January there were CMS BOINC tasks available, but no CMS jobs for the VM. BOINC returned those tasks as valid results with only 2-3 minutes of CPU time, causing the server to calculate that you can do a lot of tasks within a short time and to send a lot of tasks. This will settle after a while, once you have returned several tasks with the normal runtime.
The other problem is still the BOINC bug with max_concurrent. On GitHub it was fixed a month ago, but the fix is not yet included in the recommended BOINC version.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 878 Credit: 745,382,348 RAC: 311,651	Message 45998 - Posted: 4 Jan 2022, 17:52:34 UTC - in response to Message 45988. Last modified: 4 Jan 2022, 17:56:37 UTC
Probably. I don't limit CMS, but I do limit ATLAS; from the discussion on GitHub it seems that if you set a limit for anything in the project, it breaks the scheduling. CMS commonly sends a ton of work, so I don't think it correlates with the loss of work. I must hammer the backend when that happens, though, pulling down something like 6000 WUs as they error out in 2-3 min. I'll wait for the next BOINC release and see what happens. Maybe it would be smart for CMS to make the same server-side edits as well?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2679 Credit: 286,801,332 RAC: 74,302	Message 46000 - Posted: 4 Jan 2022, 18:42:51 UTC - in response to Message 45998.
BOINC's work fetch and credit calculation are closely related. The major factors are
- the estimated fpops per task
- the runtime per task
- the computer's peak fpops stored in the server DB
The last of these is calculated from the first two (besides some minor parameters). As long as the runtime remains stable over lots of tasks, the peak fpops also remain stable.
Weird things happen, with some delay, when the runtimes are far off their usual values; in the case of CMS, when the job queue is empty. Computers running just a few "empty" tasks will see only a small change in their peak fpops, but computers running lots of tasks will quickly become "crunching monsters". This is (besides the known bug) a major reason why they sometimes receive tons of tasks, until the peak fpops are down to normal again.
Changing the estimated fpops per task would not change the long-term behavior. It would just result in a different peak fpops value at which all of that starts.
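The relationship described in this post can be illustrated with a toy model. It is not BOINC's scheduler code: the function names, the simple averaging rule, and every number are assumptions chosen for illustration. It shows how a run of "empty" tasks that finish in minutes inflates the speed the server derives for a host, and how that shrinks the estimated runtime per task so that far more tasks fit into the same work buffer.

```python
SECONDS_PER_DAY = 86400


def apparent_speed(fpops_est, runtimes_s):
    """Mean speed (fpops/s) implied by finished tasks: estimate / runtime."""
    return sum(fpops_est / t for t in runtimes_s) / len(runtimes_s)


def tasks_to_fill_buffer(fpops_est, speed, buffer_days, cores):
    """How many tasks fit into the work buffer at the assumed speed."""
    est_runtime_s = fpops_est / speed
    buffer_s = buffer_days * SECONDS_PER_DAY * cores
    return int(buffer_s // est_runtime_s)


if __name__ == "__main__":
    FPOPS_EST = 4.32e13    # hypothetical per-task estimate
    NORMAL = [43200] * 20  # 20 tasks that each ran 12 h
    EMPTY = [150] * 20     # 20 "empty" tasks that each ran 2.5 min

    for label, history in (("normal", NORMAL), ("empty", EMPTY)):
        speed = apparent_speed(FPOPS_EST, history)
        n = tasks_to_fill_buffer(FPOPS_EST, speed, buffer_days=0.2, cores=16)
        print(label, n)

    # Scaling the estimate scales the derived speed with it, so the
    # number of tasks fetched under normal runtimes does not change.
    speed_x10 = apparent_speed(10 * FPOPS_EST, NORMAL)
    print(tasks_to_fill_buffer(10 * FPOPS_EST, speed_x10, 0.2, 16))
```

With these made-up numbers, a history of normal runtimes yields a handful of tasks for a 0.2-day buffer on 16 cores, while a history of empty tasks yields well over a thousand. Multiplying the fpops estimate by ten leaves the normal-case count unchanged, which is the point made in the last paragraph of the post.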

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 878 Credit: 745,382,348 RAC: 311,651	Message 46006 - Posted: 5 Jan 2022, 20:02:07 UTC - in response to Message 46000.
My computer was often getting 1000 WUs as per the bug. I assume the bug was there for a while, but for whatever reason it only ever affected CMS.

maeax Joined: 2 May 07 Posts: 2277 Credit: 178,527,209 RAC: 127,551	Message 46007 - Posted: 6 Jan 2022, 5:38:02 UTC - in response to Message 46006.
Is this parameter for the app_config a temporary solution?
--fetch_minimal_work
Fetch only enough jobs to use all device instances (CPU, GPU). Used with --exit_when_idle, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1110 Credit: 9,376,015 RAC: 4,676	Message 46010 - Posted: 6 Jan 2022, 10:02:55 UTC - in response to Message 46006.
What have you got set in your LHC@Home preferences for "Max # jobs" and "Max # CPUs"? I always match them for each locale. I do recall getting a lot of tasks once when they were mismatched.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 878 Credit: 745,382,348 RAC: 311,651	Message 46014 - Posted: 6 Jan 2022, 17:30:15 UTC - in response to Message 46007.
Probably the best option for the time being.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 878 Credit: 745,382,348 RAC: 311,651	Message 46016 - Posted: 6 Jan 2022, 17:39:00 UTC - in response to Message 46010.
I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44- to 56-core computers by a large amount.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1461 Credit: 9,852,564 RAC: 3,037	Message 46017 - Posted: 6 Jan 2022, 18:42:26 UTC - in response to Message 46016. Last modified: 6 Jan 2022, 18:42:52 UTC
"I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44- to 56-core computers by a large amount."
Running 6 or 7 BOINC clients on one machine could be the solution.

Evangelos Katikos Joined: 4 Oct 21 Posts: 10 Credit: 46,458,962 RAC: 12,690	Message 46018 - Posted: 6 Jan 2022, 19:25:17 UTC - in response to Message 46017.
Too many posts, too little substance. Only Harri Liljeroos was on point.
"Running 6 or 7 BOINC clients on one machine could be the solution."
No. Until a patched BOINC comes out, the solution is for the project administrators to impose a hard limit on workunits in progress, as there is for ATLAS and (probably) Theory. SixTrack seems to behave the same as CMS, but its computation times are short enough that it can get away with it. Until then I use a script that keeps only 50 workunits on board and throws away the rest.
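The script mentioned in this post is not shown in the thread. A minimal sketch of the selection step might look like the following; the `Task` record, the function names, and the task names are hypothetical, and the code only builds `boinccmd --task <url> <name> abort` command lines without running them. The project URL used here is an assumption and must match the master URL shown by the client.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    started: bool  # True once the client has begun running the task


def tasks_to_abort(tasks, keep=50):
    """Pick unstarted tasks beyond the `keep` limit.

    Started tasks are never dropped; they only use up slots.
    """
    started = [t for t in tasks if t.started]
    waiting = [t for t in tasks if not t.started]
    room = max(keep - len(started), 0)  # slots left for unstarted tasks
    return waiting[room:]


def abort_commands(tasks, project_url):
    """Build (but do not run) one boinccmd abort call per task."""
    return [["boinccmd", "--task", project_url, t.name, "abort"] for t in tasks]


if __name__ == "__main__":
    # 60 hypothetical tasks, the first 4 already running.
    queue = [Task(f"cms_{i}", started=(i < 4)) for i in range(60)]
    surplus = tasks_to_abort(queue, keep=50)
    # Assumed master URL; check it against the client before use.
    for cmd in abort_commands(surplus, "https://lhcathome.cern.ch/lhcathome/"):
        print(" ".join(cmd))
```

Keeping the selection separate from the command execution makes it possible to inspect what would be discarded first. Aborted tasks still cost the server the work of creating them, so this is a stopgap at best.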

maeax Joined: 2 May 07 Posts: 2277 Credit: 178,527,209 RAC: 127,551	Message 46019 - Posted: 6 Jan 2022, 20:42:18 UTC - in response to Message 46018.
Evangelos, you can deselect CMS in the LHC prefs. It's a better solution than deleting thousands of CMS tasks!

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1110 Credit: 9,376,015 RAC: 4,676	Message 46020 - Posted: 6 Jan 2022, 22:07:25 UTC - in response to Message 46016.
"I have it set to No Limit, as the maximum of 8 in-flight WUs limits my 44- to 56-core computers by a large amount."
I seem to recall that we established some while ago that the arbitrary limit of 8 in the CPU/Task preferences could be increased, but there wasn't a need for it at the time. That's something for Laurence or Nils to contemplate; I have no control over that aspect of the project.

LHC@home