Message boards : CMS Application : Please check your task times and your IPv6 connectivity
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43636 - Posted: 17 Nov 2020, 16:48:08 UTC - in response to Message 43615.  
Last modified: 17 Nov 2020, 16:48:55 UTC

This is all a bit strange to me. As far as I recall, I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40-core "work" servers get no more than 6 tasks at any one time.
Perhaps something's changed since I stopped running them intensively during lockdown(s).

Could you try how many you get if you don't limit the number of tasks? For those of us with more than 8 cores, the 8-task maximum is not enough to keep our machines fully loaded. ATLAS and Theory limit my tasks to 8+8 even with the 'No limit' setting for the number of tasks, but CMS seems different.

Oh, damn! I mustn't have hit "Post Reply" yesterday. On my 40-core machine I first set Max CPUs to unlimited, and only got 6 tasks, with the familiar message that I'd reached a limit. Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
ID: 43636
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 43638 - Posted: 17 Nov 2020, 17:00:51 UTC - in response to Message 43636.  

Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.

That would be a good backup, but it just corrects a BOINC goof. I usually have "Max # jobs No limit", and it is normally not a problem, unless BOINC goes bananas.
It is the same on WCG. It is nice to have the maximum number as an insurance policy.
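
In the meantime, an app_config.xml in the LHC@home project directory can serve as that insurance policy locally. A minimal sketch, assuming the CMS application's short name is "CMS" (check the app_name entries in client_state.xml on your host to be sure); note that max_concurrent caps how many tasks run at once, not how many the scheduler sends:

<!-- app_config.xml: assumes the app's short name is "CMS" -->
<app_config>
  <app>
    <name>CMS</name>
    <!-- run at most 12 CMS tasks at a time -->
    <max_concurrent>12</max_concurrent>
  </app>
  <!-- optional overall cap across all LHC@home applications -->
  <project_max_concurrent>20</project_max_concurrent>
</app_config>

After saving the file, select "Read config files" in the BOINC Manager (or restart the client) for it to take effect.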
ID: 43638
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 43639 - Posted: 17 Nov 2020, 17:25:29 UTC - in response to Message 43636.  
Last modified: 17 Nov 2020, 17:38:35 UTC

So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been distracted by "how do loops work in bash so as to abort most of these jobs" from analysing any significance of the number.
The point is it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify I want to keep two days' work in hand, why is BOINC giving me a month's worth?
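
For what it's worth, such a cleanup loop with boinccmd might look like the sketch below. Assumptions: a default client setup, and that CMS task names start with "CMS_" -- check the names it prints before trusting it with real work:

#!/bin/bash
# Abort all CMS tasks known to the local client - a sketch, not a
# polished tool. Adapt the name pattern to your actual task names.
URL="https://lhcathome.cern.ch/lhcathome/"

boinccmd --get_tasks | awk '/^ *name: /{print $2}' | grep '^CMS_' |
while read -r task; do
    echo "Aborting $task"
    boinccmd --task "$URL" "$task" abort
done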
ID: 43639
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 43640 - Posted: 17 Nov 2020, 18:56:36 UTC - in response to Message 43639.  

So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been distracted by "how do loops work in bash so as to abort most of these jobs" from analysing any significance of the number.
The point is it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify I want to keep two days' work in hand, why is BOINC giving me a month's worth?

I agree with Henry. BOINC already has a cache-size setting that users can set themselves, so why can't LHC@home follow that for CMS tasks as well? SixTrack seems to follow it, but the VirtualBox tasks seem to have a mind of their own.
ID: 43640
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43939 - Posted: 17 Dec 2020, 14:28:31 UTC

Hurrah! Our "problem child" seems to have stopped running CMS jobs, and our failure rate has plummeted. (I can't tell you who [s]he is, but some of you have worked it out anyway.) There were also failures recently in the LogCollect stage, which runs on Laurence's T3_CH_CMSAtHome VM cluster, which apparently were due to the decomissioning of CERN's Castor archive service. The new update to WMAgent cures this, as it now stores explicitly to the EOS storage system. There's a new archive system coming online, I wait to see if this will transition seamlessly.
ID: 43939
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43943 - Posted: 18 Dec 2020, 15:17:25 UTC

Aargh, he's back again...
ID: 43943
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44019 - Posted: 29 Dec 2020, 22:14:40 UTC
Last modified: 29 Dec 2020, 22:15:24 UTC

Hmm, I guess other LHC@Home projects have run out of jobs (see server status). We have a 3× spike in running jobs, which is now filtering down to completed jobs. Just in the last couple of hours our problem child with a 64-core AMD machine has stopped running CMS jobs again (he was averaging about 50 job failures/hour...). The failure-rate graph looks a lot better for the last hour or two.
ID: 44019
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44020 - Posted: 30 Dec 2020, 0:14:06 UTC
Last modified: 30 Dec 2020, 0:18:16 UTC

ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold out until ATLAS is back. This is the first time I've seen ATLAS shut down.

I have installed VirtualBox on my CentOS machines to help out with the high job-failure rate. Strangely, it is not detected by all the BOINC clients, but I have 8 out of 11 hosts running.
ID: 44020
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44021 - Posted: 30 Dec 2020, 9:23:49 UTC

I have received 5 CMS tasks. I know they will fail when Condor reaches 10656. Why?
Tullio
ID: 44021
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44022 - Posted: 30 Dec 2020, 14:45:02 UTC - in response to Message 44020.  

ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold out until ATLAS is back. This is the first time I've seen ATLAS shut down.

I have installed VirtualBox on my CentOS machines to help out with the high job-failure rate. Strangely, it is not detected by all the BOINC clients, but I have 8 out of 11 hosts running.

Have you installed the VirtualBox extension pack? I see this in your logs:
2020-12-30 00:54:47 (3262867): Required extension pack not installed, remote desktop not enabled.
Nevertheless, some machines are returning valid results, but there are a lot of failures too.
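
For anyone checking their own setup: the pack's presence can be verified from the command line, roughly as follows (the installer's file name varies with your VirtualBox version):

# List installed extension packs; empty output means none.
VBoxManage list extpacks

# Install the pack downloaded from virtualbox.org to match your
# VirtualBox version (the file name shown here is an example).
sudo VBoxManage extpack install Oracle_VM_VirtualBox_Extension_Pack-6.1.16.vbox-extpack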
ID: 44022
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44023 - Posted: 30 Dec 2020, 14:50:26 UTC - in response to Message 44021.  

I have received 5 CMS tasks. I know they will fail when Condor reaches 10656. Why?
Tullio

From the times, I guess something is going wrong at the end of the first job, but not enough information shows up in your logs to see what.
ID: 44023
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44024 - Posted: 30 Dec 2020, 17:58:56 UTC - in response to Message 44022.  

I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit the number of concurrently running tasks. I have now installed the extension pack on this one.

The other hosts look to be running fine since I installed VirtualBox, but they don't have the extension pack.
ID: 44024
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44025 - Posted: 30 Dec 2020, 18:34:25 UTC - in response to Message 44024.  
Last modified: 30 Dec 2020, 18:37:28 UTC

I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit the number of concurrently running tasks. I have now installed the extension pack on this one.

The other hosts look to be running fine since I installed VirtualBox, but they don't have the extension pack.

Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.

[Edit] {1} https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5052#39072 [/Edit]
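
The linked thread has the full recommended configuration; as a rough illustration of how little it takes, the core of such a proxy is a handful of squid.conf directives (the values below are placeholders, not the tuned settings from that guide):

# Minimal squid.conf sketch for a LAN caching proxy.
http_port 3128
cache_mem 256 MB
# Large enough to cache the VM image downloads.
maximum_object_size 2 GB
cache_dir ufs /var/spool/squid 20000 16 256
# Adjust the subnet to your LAN.
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all

Each BOINC client then points its HTTP proxy setting at that machine on port 3128.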
ID: 44025
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44027 - Posted: 30 Dec 2020, 21:56:05 UTC - in response to Message 44025.  
Last modified: 30 Dec 2020, 22:23:47 UTC

Yes, that is the one. When I watch the VM it reaches the section where it checks the NTP time, and it fails before it starts to boot. Some of them error out with UROOT no access for input/output (rw); others pass, boot up successfully, and run as they should.
When I start one, wait 20-30 minutes, and then start the next, it works; but when I let BOINC handle it, one task seems to affect the others and keep them busy. Somehow this system does not allow more than one boot-up session concurrently.

I will need to wipe this system when its other tasks are done. I have another host that is identical in OS and hardware, but it runs kernel 4.18.0-147.3.1.el8_1 instead of 4.18.0-147.5.1.el8_1. When I make changes or do a new setup I do exactly the same on all hosts, with the same VirtualBox version on all, so they should be identical. Host 10629847 has 18 CMS tasks running concurrently, while this host could not handle 8.

Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.


I've had Squid running for around 2 years now, and it sure helps a lot with files and latency. I have a 1 Gbit link to 10 hosts on the LAN, but the WAN is limited to 250/250 Mbit. I saw a spike from Squid to the hosts when they fetched the master files and the .vdi file; I hit the limit on local speed, with the spike close to 1 Gbit/s to a host at that time.

Inside the VM it takes 1-2 seconds for small files, but there are more HTTP.HTTP_Proxy flows to vocms s1ral than when I run Theory or ATLAS. I reach 200-300 flows in total right now with only CMS active.

There are no errors at all on the other hosts, while this host has less than a 1/4 success rate at starting CMS. So I don't think it is the network to this host; I believe it is corruption or a permissions problem on this host only.

I have set 'No new tasks' so it won't bother CMS any more.
ID: 44027
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44028 - Posted: 30 Dec 2020, 22:23:22 UTC - in response to Message 44027.  

I have set 'No new tasks' so it won't bother CMS any more.

OK, thanks for looking into this. Your diligence is appreciated.
ID: 44028
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,701
RAC: 138,045
Message 44030 - Posted: 31 Dec 2020, 7:14:46 UTC - in response to Message 44027.  

According to the Grafana monitoring, each CMS job runs a bit longer than 3 h.
At the end it uploads a result file of about 120 MB.

These numbers can be used to estimate how many concurrently running CMS jobs it takes to reach 100 % upload saturation:

1 Mbit/s: 11
5 Mbit/s: 56
10 Mbit/s: 112
20 Mbit/s: 225
50 Mbit/s: 562
250 Mbit/s: 2812


@Gunde
Your computer list shows that you might have had more than 3000 active cores during the last 2 days.
If this is correct and all of them ran CMS, this may have saturated your upload.
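
To show where those figures come from: one job uploads about 120 MB (960 Mbit) over roughly 3 h (10800 s), i.e. about 0.09 Mbit/s per job, so jobs = link speed in Mbit/s * 10800 / 960. A quick sketch that reproduces the table:

#!/bin/bash
# Jobs that saturate the link = mbps / (120 MB * 8 bit/B / 10800 s)
#                             = mbps * 10800 / 960
for mbps in 1 5 10 20 50 250; do
    echo "$mbps Mbit/s: $(( mbps * 10800 / 960 )) jobs"
done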
ID: 44030
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44034 - Posted: 31 Dec 2020, 23:42:43 UTC
Last modified: 31 Dec 2020, 23:44:56 UTC

OK, I see ATLAS jobs are available again, so our job share is starting to decrease.
On the positive side, the 64-core machine that was giving us almost all our primary failures, as reported by Condor (most ran perfectly well when resubmitted to a new machine), is now reporting run and CPU times more in line with what we expect, and the overall monit/grafana "Job Failure" graph is receding into low single-digit percentages.

Happy New Year, everybody. Let's hope we get on top of all our problems, large and small.
ID: 44034
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44035 - Posted: 1 Jan 2021, 0:28:36 UTC - in response to Message 44030.  
Last modified: 1 Jan 2021, 0:32:36 UTC

According to the Grafana monitoring, each CMS job runs a bit longer than 3 h.
At the end it uploads a result file of about 120 MB.

These numbers can be used to estimate how many concurrently running CMS jobs it takes to reach 100 % upload saturation:

1 Mbit/s: 11
5 Mbit/s: 56
10 Mbit/s: 112
20 Mbit/s: 225
50 Mbit/s: 562
250 Mbit/s: 2812


@Gunde
Your computer list shows that you might have had more than 3000 active cores during the last 2 days.
If this is correct and all of them ran CMS, this may have saturated your upload.


Thanks for the info.
I limit most of the hosts in app_config: small hosts to around 10 and big hosts to 20 tasks running concurrently, since cores sit idle if a task is waiting for free memory. I would estimate that all hosts combined could do about 140 tasks concurrently. When I checked the manager it rarely hit this limit, as I run other projects too. Traffic in and out was very low before ATLAS came back.
ID: 44035