Message boards : CMS Application : Please check your task times and your IPv6 connectivity
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

This is all a bit strange to me. As far as I recall I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40-core "work" servers get no more than 6 tasks at any one time.

Oh, damn! I must not have hit "Post Reply" yesterday. On my 40-core machine I first set Max CPUs to unlimited, and only got 6 tasks, with the familiar message that I'd reached a limit. Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
Joined: 15 Nov 14 · Posts: 602 · Credit: 24,371,321 · RAC: 0

> Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.

That would be a good backup, but it just corrects a BOINC goof. I usually have "Max # jobs: No limit", and it is normally not a problem, unless BOINC goes bananas. It is the same on WCG. It is nice to have the maximum number as an insurance policy.
Joined: 13 Jul 05 · Posts: 169 · Credit: 15,000,737 · RAC: 2

> So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.

The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been too distracted by "how do loops work in bash so as to abort most of these jobs" to analyse any significance of the number. The point is that it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify that I want to keep two days' work in hand, why is BOINC giving me a month's worth?
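[For anyone hit by the same flood, here is a minimal sketch of the kind of bash clean-up loop alluded to above. It assumes boinccmd is installed on the host, that the LHC@home master URL is https://lhcathome.cern.ch/lhcathome/, and that the surplus tasks can be recognised by a "CMS_" prefix in their names -- check the output of `boinccmd --get_tasks` before aborting anything, and note that this aborts running CMS tasks as well as queued ones.]

```
# Abort every CMS task known to the local BOINC client.
# Assumptions: master URL and "CMS_" name prefix as described above.
PROJECT_URL="https://lhcathome.cern.ch/lhcathome/"

boinccmd --get_tasks \
  | awk '/^ *name: CMS_/ {print $2}' \
  | while read -r task; do
        echo "Aborting $task"
        boinccmd --task "$PROJECT_URL" "$task" abort
    done
```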
Joined: 28 Sep 04 · Posts: 732 · Credit: 49,363,408 · RAC: 17,955

> So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity. The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been too distracted by "how do loops work in bash so as to abort most of these jobs" to analyse any significance of the number.

I agree with Henry. BOINC already has cache-size settings that users can set themselves, so why can't LHC@home follow those for CMS tasks as well? SixTrack seems to follow them, but the VirtualBox tasks seem to have a mind of their own.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

Hurrah! Our "problem child" seems to have stopped running CMS jobs, and our failure rate has plummeted. (I can't tell you who [s]he is, but some of you have worked it out anyway.) There were also failures recently in the LogCollect stage, which runs on Laurence's T3_CH_CMSAtHome VM cluster; these were apparently due to the decommissioning of CERN's Castor archive service. The new update to WMAgent cures this, as it now stores explicitly to the EOS storage system. There's a new archive system coming online; I'm waiting to see whether it will transition seamlessly.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

Hmm, I guess the other LHC@home projects have run out of jobs (see the server status page). We have a 3x spike in running jobs, which is now filtering down to completed jobs. Just in the last couple of hours our problem child with the 64-core AMD machine has stopped running CMS jobs again (he was averaging about 50 job failures per hour...). The failure-rate graph looks a lot better for the last hour or two.
Joined: 9 Jan 15 · Posts: 151 · Credit: 431,596,822 · RAC: 0

ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold until ATLAS is back. This is the first time I've seen ATLAS shut down. I have installed VirtualBox on my CentOS machines to help out with the high job-failure rate. Strangely, it is not detected by all of the BOINC clients, but I've got 8 out of 11 hosts running.
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0

I have received 5 CMS tasks. I know they will fail when Condor reaches 10656. Why?
Tullio
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

> ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold until ATLAS is back. This is the first time I've seen ATLAS shut down.

Have you installed the VirtualBox extension pack? I see this in your logs:
2020-12-30 00:54:47 (3262867): Required extension pack not installed, remote desktop not enabled.
Nevertheless, some machines are returning valid results, but there are a lot of failures too.
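[For anyone else seeing that message, a rough sketch of checking and installing the extension pack from the command line is below. It assumes a VirtualBox 6.1.x host and the download URL / file-name pattern used at that time; substitute the version that `VBoxManage --version` reports on your machine.]

```
# See which VirtualBox version is installed and whether an extension pack is present
VBoxManage --version
VBoxManage list extpacks

# Example for 6.1.16 -- replace with the version reported above
VER=6.1.16
wget "https://download.virtualbox.org/virtualbox/${VER}/Oracle_VM_VirtualBox_Extension_Pack-${VER}.vbox-extpack"
sudo VBoxManage extpack install --replace "Oracle_VM_VirtualBox_Extension_Pack-${VER}.vbox-extpack"
```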
Joined: 9 Jan 15 · Posts: 151 · Credit: 431,596,822 · RAC: 0

I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit them with a maximum number of concurrently running tasks. I have now installed the extension pack on this one. The other hosts look to be running fine since I installed VirtualBox, but they don't have the extension pack.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

> I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit them with a maximum number of concurrently running tasks. I have now installed the extension pack on this one.

Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.
[Edit] {1} https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5052#39072 [/Edit]
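[As a pointer for others reading along, a local cache along those lines might look roughly like the sketch below. It is only an illustration, assuming a Debian/Ubuntu-style Squid package (whose squid.conf includes /etc/squid/conf.d/*.conf) and a 192.168.1.0/24 LAN; the instructions in the thread linked above remain the authoritative reference.]

```
# Install Squid (use your distribution's equivalent if not apt-based)
sudo apt-get install squid

# Drop a small site-specific config into Squid's conf.d directory
sudo tee /etc/squid/conf.d/lhcathome.conf >/dev/null <<'EOF'
# Allow the BOINC hosts on the LAN to use the proxy
acl lhc_lan src 192.168.1.0/24
http_access allow lhc_lan

# Cache larger objects than the default so VM images and
# conditions data are kept locally
maximum_object_size 2 GB
cache_dir ufs /var/spool/squid 20000 16 256
EOF

sudo systemctl restart squid
```

[Each BOINC client then needs to be pointed at proxy-host:3128 (Squid's default port) via its HTTP-proxy settings.]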
Joined: 9 Jan 15 · Posts: 151 · Credit: 431,596,822 · RAC: 0

Yes, that is the one. When I watch the VM, it reaches the section where it checks the NTP time and fails before it starts to boot. Some of them error out with "UROOT no access for input/output rw"; others pass, boot up successfully and run as they should. When I start one and wait 20-30 minutes before the next, it works, but when I let BOINC handle it, it seems to affect the others and keep them busy. Somehow this system does not allow more than one boot-up session concurrently. I will need to wipe this system when the other tasks are done. I have another host that is identical in OS and hardware but uses kernel 4.18.0-147.3.1.el8_1 instead of 4.18.0-147.5.1.el8_1. When I make changes or do a new setup I do exactly the same on all hosts, with the same VirtualBox version on all, so they should be identical. Host 10629847 has 18 CMS tasks running concurrently, while this host could not handle 8.

> Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.

I've got Squid running and have used it for around two years now; it sure helps a lot with files and latency. I have a 1 Gbit link to 10 hosts on the LAN, but the WAN is limited to 250/250 Mbit. I saw a spike from Squid to the hosts when they fetched the master files and the .vdi file; I hit the limit on local speed, and the spike was close to 1 Gbit/s to a host at that time. Inside the VM it takes 1-2 seconds for small files, but there are more HTTP.HTTP_Proxy flows to vocms/s1ral than when I run Theory or ATLAS. I reach 200-300 flows in total right now with only CMS active. There are no errors at all on the other hosts, while this host has less than a 1/4 success rate in starting CMS. I don't think it is the network to this host; I believe it is corruption or a permission problem on this host only. I have set "No new tasks" so it won't bother CMS any more.
Joined: 15 Jun 08 · Posts: 2541 · Credit: 254,608,838 · RAC: 34,609

According to the Grafana monitoring, each CMS job runs a bit longer than 3 h. At the end it uploads a result file of about 120 MB. These numbers can be used to estimate how many CMS jobs can be run concurrently before reaching 100% upload saturation:

1 Mbit/s: 11
5 Mbit/s: 56
10 Mbit/s: 112
20 Mbit/s: 225
50 Mbit/s: 562
250 Mbit/s: 2812

@Gunde
Your computer list shows that you may have had more than 3000 active cores during the last 2 days. If that is correct and all of them ran CMS, this may have saturated your upload.
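[For anyone who wants to redo the arithmetic for their own link: the table follows from roughly 120 MB ≈ 960 Mbit uploaded per ~3 h (10800 s) job, i.e. about 0.09 Mbit/s of sustained upload per job. A quick sketch using those quoted figures (not measurements of my own):]

```
# Estimate how many CMS jobs saturate a given upload link,
# assuming 960 Mbit uploaded per 10800 s job as quoted above.
for link in 1 5 10 20 50 250; do
    awk -v link="$link" 'BEGIN {
        per_job = 960 / 10800          # average upload per job in Mbit/s
        printf "%4d Mbit/s: %d jobs\n", link, int(link / per_job)
    }'
done
```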
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 298

OK, I see ATLAS jobs are available again, so our job share is starting to decrease. On the positive side, the 64-core machine that was giving us almost all our primary failures as reported by Condor (most ran perfectly well when resubmitted to a new machine) is now reporting run and CPU times more in line with what we expect, and the overall monit/grafana "Job Failure" graph is receding into low single-digit percentages. Happy New Year, everybody. Let's hope we get on top of all our problems, large and small.
Joined: 9 Jan 15 · Posts: 151 · Credit: 431,596,822 · RAC: 0

> According to the Grafana monitoring, each CMS job runs a bit longer than 3 h.

Thanks for the info. I limit most of the hosts in app_config: the small hosts to around 10 and the big hosts to 20 concurrently running tasks, since cores sit idle if they are waiting for free memory. I would estimate that with all hosts combined I could do 140 tasks concurrently. When I checked the manager it rarely hit this limit, as I run other projects too. Traffic in and out was very low before ATLAS got back.
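[For reference, the per-host cap described above is typically done with an app_config.xml in the project directory. A minimal sketch, assuming a default Linux BOINC data directory and the usual LHC@home project-directory name (both may differ on your system), with 10 as a purely illustrative limit:]

```
# Write a per-project concurrency cap for LHC@home
sudo tee /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/app_config.xml >/dev/null <<'EOF'
<app_config>
    <!-- Cap the number of LHC@home tasks that run at the same time -->
    <project_max_concurrent>10</project_max_concurrent>
</app_config>
EOF

# Ask the running client to re-read its configuration files
boinccmd --read_cc_config
```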