Message boards : CMS Application : Please check your task times and your IPv6 connectivity
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43636 - Posted: 17 Nov 2020, 16:48:08 UTC - in response to Message 43615.  
Last modified: 17 Nov 2020, 16:48:55 UTC

This is all a bit strange to me. As far as I recall, I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40-core "work" servers get no more than 6 tasks at any one time.
Perhaps something's changed since I stopped running them intensively during lockdown(s).

Could you try how many you get if you don't limit the number of tasks? For those of us with more than 8 cores, the 8-task maximum is not enough to keep our machines fully loaded. ATLAS and Theory limit my tasks to 8+8 even with the 'No limit' setting for the number of tasks, but CMS seems different.

Oh, damn! I mustn't have hit "Post Reply" yesterday. On my 40-core machine I first set Max CPUs to unlimited, and only got 6 tasks, with the familiar message that I'd reached a limit. Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
ID: 43636
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 43638 - Posted: 17 Nov 2020, 17:00:51 UTC - in response to Message 43636.  

Then I set Max tasks to unlimited, and the jobs started flowing in -- over 90 before I reverted to my original settings, after which the limit message appears again. I've gone through 80 of the tasks so far, and am just finishing off the rest now. So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.

That would be a good backup, but it just corrects a BOINC goof. I usually have "Max # jobs No limit", and it is normally not a problem, unless BOINC goes bananas.
It is the same on WCG. It is nice to have the maximum number as an insurance policy.
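
In the meantime, an app_config.xml in the LHC@home project directory can serve as that insurance policy locally. A minimal sketch, assuming the CMS application's short name is "CMS" (check the app_name entries in client_state.xml on your host to be sure); note that max_concurrent caps how many tasks run at once, not how many the scheduler sends:

<!-- app_config.xml: assumes the app's short name is "CMS" -->
<app_config>
  <app>
    <name>CMS</name>
    <!-- run at most 12 CMS tasks at a time -->
    <max_concurrent>12</max_concurrent>
  </app>
  <!-- optional overall cap across all LHC@home applications -->
  <project_max_concurrent>20</project_max_concurrent>
</app_config>

After saving the file, select "Read config files" in the BOINC Manager (or restart the client) for it to take effect.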
ID: 43638
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 43639 - Posted: 17 Nov 2020, 17:25:29 UTC - in response to Message 43636.  
Last modified: 17 Nov 2020, 17:38:35 UTC

So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been distracted by "how do loops work in bash so as to abort most of these jobs" from analysing any significance of the number.
The point is it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify I want to keep two days' work in hand, why is BOINC giving me a month's worth?
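
For what it's worth, such a cleanup loop with boinccmd might look like the sketch below. Assumptions: a default client setup, and that CMS task names start with "CMS_" -- check the names it prints before trusting it with real work:

#!/bin/bash
# Abort all CMS tasks known to the local client - a sketch, not a
# polished tool. Adapt the name pattern to your actual task names.
URL="https://lhcathome.cern.ch/lhcathome/"

boinccmd --get_tasks | awk '/^ *name: /{print $2}' | grep '^CMS_' |
while read -r task; do
    echo "Aborting $task"
    boinccmd --task "$URL" "$task" abort
done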
ID: 43639
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 43640 - Posted: 17 Nov 2020, 18:56:36 UTC - in response to Message 43639.  

So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been distracted by "how do loops work in bash so as to abort most of these jobs" from analysing any significance of the number.
The point is it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify I want to keep two days' work in hand, why is BOINC giving me a month's worth?

I agree with Henry. BOINC already has a cache-size setting that users can set themselves, so why can't LHC@home follow that for CMS tasks as well? SixTrack seems to follow it, but the VirtualBox tasks seem to have a mind of their own.
ID: 43640
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43939 - Posted: 17 Dec 2020, 14:28:31 UTC

Hurrah! Our "problem child" seems to have stopped running CMS jobs, and our failure rate has plummeted. (I can't tell you who [s]he is, but some of you have worked it out anyway.) There were also failures recently in the LogCollect stage, which runs on Laurence's T3_CH_CMSAtHome VM cluster, which apparently were due to the decomissioning of CERN's Castor archive service. The new update to WMAgent cures this, as it now stores explicitly to the EOS storage system. There's a new archive system coming online, I wait to see if this will transition seamlessly.
ID: 43939
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43943 - Posted: 18 Dec 2020, 15:17:25 UTC

Aargh, he's back again...
ID: 43943
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44019 - Posted: 29 Dec 2020, 22:14:40 UTC
Last modified: 29 Dec 2020, 22:15:24 UTC

Hmm, I guess other LHC@Home projects have run out of jobs (see server status). We have a 3× spike in running jobs, which is now filtering down to completed jobs. Just in the last couple of hours our problem child with a 64-core AMD machine has stopped running CMS jobs again (he was averaging about 50 job failures/hour...). The failure-rate graph looks a lot better for the last hour or two.
ID: 44019
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44020 - Posted: 30 Dec 2020, 0:14:06 UTC
Last modified: 30 Dec 2020, 0:18:16 UTC

ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold out until ATLAS is back. This is the first time I've seen ATLAS shut down.

I have installed VirtualBox on my CentOS machines to help out with the high job-failure rate. Strangely, it is not detected by all the BOINC clients, but I have 8 out of 11 hosts running.
ID: 44020
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44021 - Posted: 30 Dec 2020, 9:23:49 UTC

I have received 5 CMS tasks. I know they will fail when Condor reaches 10656. Why?
Tullio
ID: 44021
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44022 - Posted: 30 Dec 2020, 14:45:02 UTC - in response to Message 44020.  

ATLAS is completely dry and Theory is heavily loaded. Hosts are fetching a lot of CMS now; I hope it can hold out until ATLAS is back. This is the first time I've seen ATLAS shut down.

I have installed VirtualBox on my CentOS machines to help out with the high job-failure rate. Strangely, it is not detected by all the BOINC clients, but I have 8 out of 11 hosts running.

Have you installed the VirtualBox extension pack? I see this in your logs:
2020-12-30 00:54:47 (3262867): Required extension pack not installed, remote desktop not enabled.
Nevertheless, some machines are returning valid results, but there are a lot of failures too.
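
For anyone checking their own setup: the pack's presence can be verified from the command line, roughly as follows (the installer's file name varies with your VirtualBox version):

# List installed extension packs; empty output means none.
VBoxManage list extpacks

# Install the pack downloaded from virtualbox.org to match your
# VirtualBox version (the file name shown here is an example).
sudo VBoxManage extpack install Oracle_VM_VirtualBox_Extension_Pack-6.1.16.vbox-extpack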
ID: 44022
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44023 - Posted: 30 Dec 2020, 14:50:26 UTC - in response to Message 44021.  

I have received 5 CMS tasks. I know they will fail when Condor reaches 10656. Why?
Tullio

From the times, I guess something is going wrong at the end of the first job, but not enough information shows up in your logs to see what.
ID: 44023
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44024 - Posted: 30 Dec 2020, 17:58:56 UTC - in response to Message 44022.  

I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit the number of concurrently running tasks. I have now installed the extension pack on this one.

The other hosts look to be running fine since I installed VirtualBox, but they don't have the extension pack.
ID: 44024
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44025 - Posted: 30 Dec 2020, 18:34:25 UTC - in response to Message 44024.  
Last modified: 30 Dec 2020, 18:37:28 UTC

I have one host that bothers me (hostid 10629638). I have been monitoring it since yesterday, and it works if I start the tasks slowly and limit the number of concurrently running tasks. I have now installed the extension pack on this one.

The other hosts look to be running fine since I installed VirtualBox, but they don't have the extension pack.

Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.

[Edit] {1} https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5052#39072 [/Edit]
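
The linked thread has the full recommended configuration; as a rough illustration of how little it takes, the core of such a proxy is a handful of squid.conf directives (the values below are placeholders, not the tuned settings from that guide):

# Minimal squid.conf sketch for a LAN caching proxy.
http_port 3128
cache_mem 256 MB
# Large enough to cache the VM image downloads.
maximum_object_size 2 GB
cache_dir ufs /var/spool/squid 20000 16 256
# Adjust the subnet to your LAN.
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all

Each BOINC client then points its HTTP proxy setting at that machine on port 3128.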
ID: 44025
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44027 - Posted: 30 Dec 2020, 21:56:05 UTC - in response to Message 44025.  
Last modified: 30 Dec 2020, 22:23:47 UTC

Yes, that is the one. When I watch the VM it reaches the section where it checks the NTP time, and it fails before it starts to boot. Some of them error out with UROOT no access for input/output (rw); others pass, boot up successfully, and run as they should.
When I start one, wait 20-30 minutes, and then start the next, it works; but when I let BOINC handle it, one task seems to affect the others and keep them busy. Somehow this system does not allow more than one boot-up session concurrently.

I will need to wipe this system when its other tasks are done. I have another host that is identical in OS and hardware, but it runs kernel 4.18.0-147.3.1.el8_1 instead of 4.18.0-147.5.1.el8_1. When I make changes or do a new setup I do exactly the same on all hosts, with the same VirtualBox version on all, so they should be identical. Host 10629847 has 18 CMS tasks running concurrently, while this host could not handle 8.

Yes, I think 10629638 was the one that worried me. I did have a thought: do you have enough bandwidth to support as many jobs as you are running? There is a large amount of data downloaded at the start of each job (conditions database, etc.) and of course there is 70 MB or so of results returned at the end of the job. If you check through the message board, there are instructions on how to set up a caching proxy on a local machine{1}, which greatly reduces the amount of initial downloads that must come through the external network.


I've had Squid running for around 2 years now, and it sure helps a lot with files and latency. I have a 1 Gbit link to 10 hosts on the LAN, but the WAN is limited to 250/250 Mbit. I saw a spike from Squid to the hosts when they fetched the master files and the .vdi file; I hit the limit on local speed, with the spike close to 1 Gbit/s to a host at that time.

Inside the VM it takes 1-2 seconds for small files, but there are more HTTP.HTTP_Proxy flows to vocms s1ral than when I run Theory or ATLAS. I reach 200-300 flows in total right now with only CMS active.

There are no errors at all on the other hosts, while this host has less than a 1/4 success rate at starting CMS. So I don't think it is the network to this host; I believe it is corruption or a permissions problem on this host only.

I have set 'No new tasks' so it won't bother CMS any more.
ID: 44027
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44028 - Posted: 30 Dec 2020, 22:23:22 UTC - in response to Message 44027.  

I have set 'No new tasks' so it won't bother CMS any more.

OK, thanks for looking into this. Your diligence is appreciated.
ID: 44028
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,701
RAC: 138,045
Message 44030 - Posted: 31 Dec 2020, 7:14:46 UTC - in response to Message 44027.  

According to the Grafana monitoring, each CMS job runs a bit longer than 3 h.
At the end it uploads a result file of about 120 MB.

These numbers can be used to estimate how many concurrently running CMS jobs it takes to reach 100 % upload saturation:

1 Mbit/s: 11
5 Mbit/s: 56
10 Mbit/s: 112
20 Mbit/s: 225
50 Mbit/s: 562
250 Mbit/s: 2812


@Gunde
Your computer list shows that you might have had more than 3000 active cores during the last 2 days.
If this is correct and all of them ran CMS, this may have saturated your upload.
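
To show where those figures come from: one job uploads about 120 MB (960 Mbit) over roughly 3 h (10800 s), i.e. about 0.09 Mbit/s per job, so jobs = link speed in Mbit/s * 10800 / 960. A quick sketch that reproduces the table:

#!/bin/bash
# Jobs that saturate the link = mbps / (120 MB * 8 bit/B / 10800 s)
#                             = mbps * 10800 / 960
for mbps in 1 5 10 20 50 250; do
    echo "$mbps Mbit/s: $(( mbps * 10800 / 960 )) jobs"
done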
ID: 44030
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 44034 - Posted: 31 Dec 2020, 23:42:43 UTC
Last modified: 31 Dec 2020, 23:44:56 UTC

OK, I see ATLAS jobs are available again, so our job share is starting to decrease.
On the positive side, the 64-core machine that was giving us almost all our primary failures, as reported by Condor (most ran perfectly well when resubmitted to a new machine), is now reporting run and CPU times more in line with what we expect, and the overall monit/grafana "Job Failure" graph is receding into low single-digit percentages.

Happy New Year, everybody. Let's hope we get on top of all our problems, large and small.
ID: 44034
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 44035 - Posted: 1 Jan 2021, 0:28:36 UTC - in response to Message 44030.  
Last modified: 1 Jan 2021, 0:32:36 UTC

According to the Grafana monitoring, each CMS job runs a bit longer than 3 h.
At the end it uploads a result file of about 120 MB.

These numbers can be used to estimate how many concurrently running CMS jobs it takes to reach 100 % upload saturation:

1 Mbit/s: 11
5 Mbit/s: 56
10 Mbit/s: 112
20 Mbit/s: 225
50 Mbit/s: 562
250 Mbit/s: 2812


@Gunde
Your computer list shows that you might have had more than 3000 active cores during the last 2 days.
If this is correct and all of them ran CMS, this may have saturated your upload.


Thanks for the info.
I limit most of the hosts in app_config: small hosts to around 10 and big hosts to 20 tasks running concurrently, since cores sit idle if a task is waiting for free memory. I would estimate that all hosts combined could do about 140 tasks concurrently. When I checked the manager it rarely hit this limit, as I run other projects too. Traffic in and out was very low before ATLAS came back.
ID: 44035