Message boards : CMS Application : CMS computation error in 30 seconds every time

Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47677 - Posted: 15 Jan 2023, 10:02:12 UTC - in response to Message 47668.  

Did CMS used to work like this?
Yes, for years.
I never noticed it before, but I have recently added a lot more RAM so I'm probably running more CMS than before. Some time this year, when my 6Mbit upload changes to 220Mbit, I won't have this problem. It seems you outpaced UK internet.

We are currently running 614 CMS jobs via BOINC, while CERN and affiliated datacentres are running more than 114000 jobs (not via BOINC).
If the datacentres are doing 99.46% of the work, why do you bother with Boinc at all? Why not just give all the work to them?
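For reference, the 99.46% figure follows directly from the two job counts quoted above. A minimal sketch of the arithmetic, assuming the datacentre count is exactly 114000 (the post says "more than", so the real BOINC share would be slightly smaller); the function name `share_percent` is made up for illustration:

```python
# Job counts as quoted in the thread.
BOINC_JOBS = 614
DATACENTRE_JOBS = 114_000  # "more than 114000", so treat this as a lower bound


def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part / total


total_jobs = BOINC_JOBS + DATACENTRE_JOBS
boinc_share = share_percent(BOINC_JOBS, total_jobs)
datacentre_share = share_percent(DATACENTRE_JOBS, total_jobs)

print(f"BOINC: {boinc_share:.2f} %, datacentres: {datacentre_share:.2f} %")
```

So BOINC volunteers account for roughly half a percent of the running jobs at that moment.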

Is there an easy way to get equal numbers of Theory and Atlas as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.
Since the standard BOINC server is used it sends out what is next in the task queue and accepted by the requesting client.
Since I ask for anything and only ever get CMS, this suggests a lot of folk are avoiding CMS (due to RAM, bandwidth, or personal preference), or you're putting a lot more CMS into the queue than anything else. I can only assume this means you'd prefer CMS to be done over Atlas and Theory, so I'm concentrating on that wherever possible.

Best would be to run multiple BOINC clients on the same box and connect them to different venues.
Each venue can then be set to run either ATLAS/CMS/Theory.
Since I have 8 PCs, it's easier just to change x of them over to Atlas (well actually I changed them over to "anything but CMS" which gave me Atlas). That must be your second most important thing.
ID: 47677
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,932,134
RAC: 137,676
Message 47678 - Posted: 15 Jan 2023, 11:12:07 UTC - in response to Message 47677.  

1.
I clearly stated that 2:30 min of CPU time indicates that the task did nothing useful.
Ivan recently explained that/why the project team decided not to treat those tasks as invalid.
Nonetheless you should be made aware.


2.
If the datacentres are doing 99.46% of the work, why do you bother with Boinc at all? Why not just give all the work to them?

It's like the clockwork of a Swiss watch: even the smallest pieces are required to make it work as a whole.


3.
Since I ask for anything and only ever get CMS, this suggests a lot of folk are avoiding CMS (due to RAM, bandwidth, or personal preference), or you're putting a lot more CMS into the queue than anything else. I can only assume this means you'd prefer CMS to be done over Atlas and Theory, so I'm concentrating on that wherever possible.
.
.
.
Since I have 8 PCs, it's easier just to change x of them over to Atlas (well actually I changed them over to "anything but CMS" which gave me Atlas). That must be your second most important thing.

This is how the standard BOINC server works (also for years and often explained, just listen).
In addition CERN runs 2 server instances side by side for load balancing and a client gets one of them in random order.

The result is very simple and usually easy to understand:
You get what is next in the queue of a randomly contacted server.
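As a toy illustration of that rule: a minimal sketch, assuming two server instances with entirely made-up queue contents (CMS-heavy and unshuffled, purely to make the effect visible). The functions `next_task` and `simulate` are hypothetical and are not the BOINC scheduler.

```python
import random
from collections import Counter, deque


def next_task(queue, accepted):
    """Remove and return the first queued task the client accepts, else None."""
    for index, app in enumerate(queue):
        if app in accepted:
            del queue[index]
            return app
    return None


def simulate(accepted, requests, rng):
    """Send `requests` work requests, each to a randomly chosen server instance."""
    # Hypothetical queue contents; real queues are not published in this thread.
    servers = [
        deque(["CMS"] * 80 + ["ATLAS"] * 10 + ["Theory"] * 10),
        deque(["CMS"] * 80 + ["Theory"] * 10 + ["ATLAS"] * 10),
    ]
    received = Counter()
    for _ in range(requests):
        queue = rng.choice(servers)  # the client lands on a random instance
        app = next_task(queue, accepted)
        if app is not None:
            received[app] += 1
    return received


rng = random.Random(0)
anything = simulate({"CMS", "ATLAS", "Theory"}, 40, rng)
no_cms = simulate({"ATLAS", "Theory"}, 20, rng)
print("accept anything:", dict(anything))
print("anything but CMS:", dict(no_cms))
```

With queues like these, a client that accepts anything only ever sees CMS, while a client that excludes CMS is handed whatever comes next among the rest.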



4.
I don't know if the other 7 PCs would end up going through the VPN or not.

So, you don't even know whether you send your packets through a VPN or not?
Besides the fact that the project data would be able to pass through a VPN, it is recommended NOT to do so, as it slows down the performance between CERN and the VMs.
CERN uses Cloudflare as CDN provider to keep as much data as possible close to the volunteer's home (UK in your case).
This makes it as fast as possible.
A local Squid would make it even faster.
Based on your description you would need to set it up in your garage.


What you are doing is forcing additional hops between your location, the VPN provider (Norway) and Cloudflare (now also Norway!).
Well done!

In addition:
Do you know whether the VPN provider allows packets via all ports required by CERN?
If not this would be another reason NOT to use that VPN.
ID: 47678
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47679 - Posted: 15 Jan 2023, 12:39:10 UTC - in response to Message 47678.  

1.
I clearly stated that 2:30 min of CPU time indicates that the task did nothing useful.
Ivan recently explained that/why the project team decided not to treat those tasks as invalid.
Nonetheless you should be made aware.
I cannot find any explanation why they should fail just because of a slow internet connection.

I'm monitoring anything doing stupidly small amounts of CPU time and will rectify it. It seems it's usually because of my maxed out uplink (and presumably very impatient CMS programs).

2.
If the datacentres are doing 99.46% of the work, why do you bother with Boinc at all? Why not just give all the work to them?

It's like the clockwork of a Swiss watch: even the smallest pieces are required to make it work as a whole.
If we're doing work they can't, that would make sense. But adding a fraction of a percent to their speed isn't worthwhile. Would you heat your house with a 15kW furnace then add a 100W heater as well?

3.
You get what is next in the queue of a randomly contacted server.
That doesn't answer the question. You're inputting Theory, CMS, and Atlas into the queue, yet I only get CMS out unless I ask for something else. Illogical.

So, you don't even know whether you send your packets through a VPN or not?
Why would I need to know that? The VPN is to protect only this machine, what it does to the garage is irrelevant. I installed the VPN on one machine. I don't know if it protects just that machine, or anything on a Windows bridge as well. But I can check.... No, it only protects the one machine (which happens to be the one you linked to a task of). I just used https://whatismyipaddress.com/ on this machine and a garage machine while the VPN was running on this machine, and it only changed the public IP on this machine.

Besides the fact that the project data would be able to pass through a VPN, it is recommended NOT to do so, as it slows down the performance between CERN and the VMs.
I doubt it, I can download at my full 32Mbit through the VPN, it's fast.

CERN uses Cloudflare as CDN provider to keep as much data as possible close to the volunteer's home (UK in your case).
This makes it as fast as possible.
Are you saying Cloudflare caches data so I don't have to go all the way to Switzerland for it? I thought Cloudflare was just a DDOS protection mechanism.

A local Squid would make it even faster.
Only downloads. Most of what I see is uploads. And Boinc should be doing this anyway.

Based on your description you would need to set it up in your garage.
Why would it matter if it's in the garage or the house? The length of 40Gbit cable between the two buildings makes no difference.

What you are doing is forcing additional hops between your location, the VPN provider (Norway) and Cloudflare (now also Norway!).
Well done!
I'm paying for the VPN. From your point of view all you're doing is sending work to an "LHC user" in Norway. The extra hops are at the expense of the VPN provider.

In addition:
Do you know whether the VPN provider allows packets via all ports required by CERN?
If not this would be another reason NOT to use that VPN.
It has something called port forwarding which I have switched on, so anything should get through. Is there a test I can try?

The VPN is only on occasionally anyway.
ID: 47679
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,109,667
RAC: 104,249
Message 47680 - Posted: 15 Jan 2023, 14:39:41 UTC

What if you let only one PC (inside the house or outside in the garage) run CMS for a whole day,
and the other PCs run Atlas?

You can move CMS to another PC the next day, so it is easier to find out whether they all run well or which one shows problems.

For me, CMS has more network traffic and needs more stable networking in the LAN.

I have a CentOS 9 VM for Squid.
Mostly Atlas on Win11pro and Win11 Workstation. 70 MBits ISP and 7-9 TByte in one month.
My LAN has 1GBit for all five PCs.
ID: 47680
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47681 - Posted: 15 Jan 2023, 15:38:32 UTC - in response to Message 47680.  

What if you let only one PC (inside the house or outside in the garage) run CMS for a whole day,
and the other PCs run Atlas?
I had them all on CMS, I'm changing some over to Atlas until my internet speed isn't limiting CMS.

You can move CMS to another PC the next day, so it is easier to find out whether they all run well or which one shows problems.
They all work with CMS, but if all at once, the internet connection speed gets in the way and the impatient CMS program gets upset.

70 MBits ISP and 7-9 TByte in one month.
I have 32Mbits down, but only 6Mbits up. The up is limiting CMS. There is no monthly limit.

I will soon be getting (sometime this year, I've seen the engineers digging) 1000Mbit down 220Mbit up. Then there will be no problems :-)

My LAN has 1GBit for all five PCs.
I have 40Gbit cables, but only 1Gbit switches and network adapters. Faster adapters are very expensive.
ID: 47681
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,109,667
RAC: 104,249
Message 47682 - Posted: 15 Jan 2023, 16:17:14 UTC

6Mbit upload is really slow. I have 37 MBit upload and have been waiting a year for 250 MBit (Super-VDSL).
OK, let's see who is the first one to get more speed ;-))
ID: 47682
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47683 - Posted: 15 Jan 2023, 16:50:08 UTC - in response to Message 47682.  

6Mbit upload is really slow. I have 37 MBit upload and have been waiting a year for 250 MBit (Super-VDSL).
OK, let's see who is the first one to get more speed ;-))
I suppose it makes sense for most people to have it asymmetrical, but I'd rather have more upload and less download. Then again when I'm downloading TV or computer games, maybe not.

Your Super-VDSL - will that be 250 both ways? If so, I will have more download than you :-P
ID: 47683
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,109,667
RAC: 104,249
Message 47684 - Posted: 15 Jan 2023, 17:38:38 UTC - in response to Message 47683.  
Last modified: 15 Jan 2023, 18:29:16 UTC

Your Super-VDSL - will that be 250 both ways? If so, I will have more download than you :-P

No, 40 MBit for upload as before.
Streaming the Premier League at the moment: 20 interruptions in one half.
Not much fun.
ID: 47684
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47685 - Posted: 15 Jan 2023, 18:03:55 UTC - in response to Message 47684.  

No, 40 MBit for upload as before.
Streaming the Premier League at the moment: 20 interruptions in one half.
Not much fun.
If you have 40Mbit upload, presumably download is at least 40? How can 40 not be enough for streaming?
ID: 47685
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,109,667
RAC: 104,249
Message 47686 - Posted: 15 Jan 2023, 18:31:11 UTC - in response to Message 47685.  

At the moment I have no Ethernet cable free, only WiFi for the LG OLED.
ID: 47686
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47687 - Posted: 15 Jan 2023, 19:05:27 UTC - in response to Message 47686.  

At the moment I have no Ethernet cable free, only WiFi for the LG OLED.
Can you not get an unmanaged gigabit switch for €10? I have one in the garage for 7 computers.

Wifi isn't so good if you need to go through walls.
ID: 47687
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47689 - Posted: 16 Jan 2023, 12:30:59 UTC

I think I've sorted everything, although I notice Atlas only uses just under 2 cores (those used to use almost 8), and Theory finishes quicker than it used to. If anyone can check my most recent tasks from CMS, Atlas, and Theory, and confirm they're doing a sensible amount of work, it would be much appreciated.
ID: 47689
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 47693 - Posted: 17 Jan 2023, 7:37:24 UTC - in response to Message 47674.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured.
The publicly available logfiles don't say why, but there's no way to complete a CMS subtask within only 2:30 min CPU-time.
My guess would be that some internet data requested by deeper level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level.
It would require a CERN expert to look through those deeper level logs.

    2023-01-15 04:45:14 (2360): Guest Log: [INFO] Requesting an idtoken from LHC@home
    2023-01-15 05:22:19 (2360): Guest Log: [INFO] glidein exited with return value 0.


Looks to me like it timed out retrieving the idtoken.


@P.H.
Since you set up very unusual packet redirections you may have forgotten to forward all required ports in both directions.

ID: 47693
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,932,134
RAC: 137,676
Message 47694 - Posted: 17 Jan 2023, 8:06:27 UTC - in response to Message 47689.  

Recent results look fine.
Much better than the days before.


Peter Hucker wrote:
I have FTTC --> ISP router --> Windows 11 PC with 2 bridged ethernets --> unmanaged switch --> 7 Windows 11 PCs

It is still highly recommended to run a local Squid.
Especially since you run lots of CMS VMs and many other LHC VMs.
Best would be to directly connect a Squid box to the switch most of your crunchers are connected to (may be marked above).
If that switch is in your garage, fine, set up the Squid also in the garage.


Peter Hucker wrote:
I will soon be getting (sometime this year, I've seen the engineers digging) 1000Mbit down 220Mbit up. Then there will be no problems

Bandwidth is not the only critical factor.
Hence, it makes no sense to focus on that and claim everything will be fine after an upgrade.
Latency and the number of concurrent TCP connections also play important roles due to how CVMFS and Frontier work.
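To illustrate why bandwidth alone may not settle it, here is a simplified model, assuming many small objects fetched one after another, where each fetch costs one round trip plus the transfer time. All numbers (object count, object size, round-trip times) are hypothetical, and the model ignores TCP handshakes, slow start and parallel connections; it is not a description of how CVMFS or Frontier actually schedule their requests.

```python
def fetch_seconds(objects: int, size_bytes: int, rtt_s: float, bandwidth_bps: float) -> float:
    """Total time for sequential fetches: one round trip plus transfer per object."""
    per_object = rtt_s + size_bytes * 8 / bandwidth_bps
    return objects * per_object


OBJECTS = 1000        # hypothetical number of small requests
SIZE_BYTES = 10_000   # hypothetical size of each object

scenarios = {
    "32 Mbit/s, 30 ms RTT": (0.030, 32e6),
    "1000 Mbit/s, 30 ms RTT": (0.030, 1000e6),
    "LAN proxy, 1 ms RTT, 1 Gbit/s": (0.001, 1000e6),
}
for name, (rtt, bandwidth) in scenarios.items():
    print(f"{name}: {fetch_seconds(OBJECTS, SIZE_BYTES, rtt, bandwidth):.2f} s")
```

Under these assumptions the jump from 32 Mbit/s to 1000 Mbit/s saves only about 2.4 seconds out of 32.5, because the round trips dominate, while answering the same requests from a cache on the LAN cuts the total to about a second.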
ID: 47694
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,932,134
RAC: 137,676
Message 47695 - Posted: 17 Jan 2023, 8:27:27 UTC - in response to Message 47693.  

ivan wrote:
Looks to me like it timed out retrieving the idtoken.

I would expect this to be reported before glidein starts.
https://gitlab.cern.ch/vc/vm/-/blob/master/sbin/bootstrap-idtoken#L147-L156
https://gitlab.cern.ch/vc/vm/-/blob/master/bin/boinc-idtoken
ID: 47695
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47696 - Posted: 17 Jan 2023, 9:36:44 UTC - in response to Message 47693.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured.
The publicly available logfiles don't say why, but there's no way to complete a CMS subtask within only 2:30 min CPU-time.
My guess would be that some internet data requested by deeper level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level.
It would require a CERN expert to look through those deeper level logs.

    2023-01-15 04:45:14 (2360): Guest Log: [INFO] Requesting an idtoken from LHC@home
    2023-01-15 05:22:19 (2360): Guest Log: [INFO] glidein exited with return value 0.


Looks to me like it timed out retrieving the idtoken.

It was probably doing that just as I switched the VPN on or off. If I do that while a web browser is loading a page, Opera gives me the error "network has changed" then reloads a few seconds later. Does CMS not retry?
ID: 47696
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47697 - Posted: 17 Jan 2023, 9:58:25 UTC - in response to Message 47694.  
Last modified: 17 Jan 2023, 9:59:54 UTC

Recent results look fine.
Much better than the days before.
All I changed was making sure the uplink isn't maxed out 100% of the time, which points to CMS having a problem if it can't get through first time. Why is it more impatient than any other program connecting to the internet?

Are my Atlases ok? On defaults, they only use 2 cores out of the 8 Boinc gives them. I see yours do too. I told Boinc to only give it 2 (in nthreads and avg_ncpus), and it still uses 2, although that does mean I use 10GB for 2 cores, so until I upgrade the 24 core machines from 64GB to 128GB, that's a bit limiting.
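A rough check of the RAM limit described above, assuming 10GB per 2-core Atlas task as stated and ignoring whatever the OS and other tasks need (so real numbers would be somewhat lower). The function name `atlas_capacity` is hypothetical:

```python
def atlas_capacity(ram_gb: int, cores: int, gb_per_task: int = 10, cores_per_task: int = 2):
    """Return (tasks, busy_cores) when both RAM and core count cap the task count."""
    by_ram = ram_gb // gb_per_task
    by_cores = cores // cores_per_task
    tasks = min(by_ram, by_cores)
    return tasks, tasks * cores_per_task


for ram in (64, 128):
    tasks, busy = atlas_capacity(ram, cores=24)
    print(f"{ram} GB: {tasks} tasks, {busy} of 24 cores busy")
```

On those figures 64GB keeps only half of a 24-core machine busy with 2-core Atlas tasks, and 128GB is enough to fill it.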

Are my Theories ok? They don't seem to run for long.

10th place on credit per day pleases me, I guess I've got them doing something useful. I'm almost going as fast as you!
https://www.boincstats.com/stats/3/user/list/12/0/0

Peter Hucker wrote:
I have FTTC --> ISP router --> Windows 11 PC with 2 bridged ethernets --> unmanaged switch --> 7 Windows 11 PCs
It is still highly recommended to run a local Squid.
Especially since you run lots of CMS VMs and many other LHC VMs.
Best would be to directly connect a Squid box to the switch most of your crunchers are connected to (may be marked above).
If that switch is in your garage, fine, set up the Squid also in the garage.
Since my downloads don't look that big, I'm not sure it would help much, only uploads are a problem. But if it takes a load off your servers I'll give it a go. Not sure why you want it connected to that switch. Wherever it is, it will be 1Gbit (the speed of the switch and the network adapters in every machine) between it and all computers. Actually, I guess that means my downloads would be relatively instant instead of waiting for 32Mbit. I'd prefer to set up Squid on the house computer, it's the fastest, the most reliable, and sits on a UPS. And actually, that is connected directly to that switch, but along a 20 metre 40Gbit ethernet cable. As you can see from the "diagram" I posted above, all 8 PCs are on the switch, it's just the internet tagged on the side. So in fact putting Squid on the house PC would be both on the switch and closest to the internet.

Peter Hucker wrote:
I'll soon be getting (sometime this year, I've seen the engineers digging) 1000Mbit down 220Mbit up. Then there will be no problems
Bandwidth is not the only critical factor.
Hence, it makes no sense to focus on that and claim everything will be fine after an upgrade.
Latency and the number of concurrent TCP connections also play important roles due to how CVMFS and Frontier work.
By latency I take it you mean ping time? That should be better, since they're taking the 150 metres of aluminium wire out of the equation. I'm sure gigabit internet will be able to cope just fine with all 126 cores on CMS.
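A back-of-envelope look at the uplink share per task, assuming one CMS task per core (126 tasks), an even split of the uplink, and a made-up result size of 50 MB; neither the task count per core nor the file size comes from the project, so treat the output as an illustration only.

```python
def per_task_uplink_kbit(uplink_mbit: float, tasks: int) -> float:
    """Even split of the uplink across concurrently running tasks, in kbit/s."""
    return uplink_mbit * 1000 / tasks


def upload_seconds(size_mb: float, rate_kbit: float) -> float:
    """Seconds needed to upload size_mb megabytes at rate_kbit kbit/s."""
    return size_mb * 8 * 1000 / rate_kbit


TASKS = 126      # assumption: one task per core
RESULT_MB = 50   # hypothetical result size, not taken from the thread

for uplink in (6, 220):
    rate = per_task_uplink_kbit(uplink, TASKS)
    seconds = upload_seconds(RESULT_MB, rate)
    print(f"{uplink} Mbit/s up: {rate:.1f} kbit/s per task, {seconds:.0f} s per {RESULT_MB} MB")
```

Under these assumptions each task gets under 50 kbit/s on the 6Mbit line, against roughly 1.7 Mbit/s on the 220Mbit line.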
ID: 47697
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47698 - Posted: 17 Jan 2023, 12:15:22 UTC

Update: Seems downloads don't always go very fast if upload is maxed out. So I'll definitely install squid today. Perhaps some tasks are waiting on downloads because upload is busy. Crappy internet connection? Or just the way TCP ACK packets work?
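On the ACK question: a download needs a steady trickle of acknowledgements going back up, and if those queue behind a saturated uplink the download stalls. A rough estimate of that trickle, assuming 1500-byte data packets, one 40-byte ACK for every two packets, and no link-layer overhead; these are textbook-style assumptions, not measurements of this connection.

```python
def ack_uplink_bps(download_bps: float, packet_bytes: int = 1500,
                   packets_per_ack: int = 2, ack_bytes: int = 40) -> float:
    """Uplink bandwidth consumed by TCP ACKs for a download at download_bps."""
    data_packets_per_s = download_bps / (packet_bytes * 8)
    acks_per_s = data_packets_per_s / packets_per_ack
    return acks_per_s * ack_bytes * 8


DOWNLOAD_BPS = 32e6  # the 32Mbit downlink mentioned in the thread
UPLINK_BPS = 6e6     # the 6Mbit uplink mentioned in the thread

needed = ack_uplink_bps(DOWNLOAD_BPS)
print(f"ACK traffic: {needed / 1e6:.2f} Mbit/s "
      f"({100 * needed / UPLINK_BPS:.1f} % of the uplink)")
```

On these assumptions a full-speed download needs roughly 0.43 Mbit/s of uplink just for ACKs, which is easy to lose when the uplink is already full of result uploads.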
ID: 47698
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,932,134
RAC: 137,676
Message 47699 - Posted: 17 Jan 2023, 12:44:48 UTC - in response to Message 47697.  

As for the recent valids:
My comment was for CMS as well as ATLAS and Theory.

As for Theory runtimes:
It has been posted many times by many volunteers that Theory can have runtimes from a few seconds to a couple of days.
There's nothing unusual in your logs.

As for the impatience:
The most impatient part of the equation appears to sit in front of your computer.
Just accept that (be aware: example values!) it is better to run 70 % of the max. possible tasks but return reliable results than to go for 80 % but return 50 % trash.
Patience would include not switching between things like VPN-on/VPN-off just for fun.
It has been explained many times that all LHC apps (except SixTrack) prefer and rely on stable network connections.

As for your internet connection:
You still don't accept that there's life beyond download bandwidth (your recent post suggests you may have started to think about that) and a local Squid is helpful even if your internet is the fastest you can get. This would help both the next hop (e.g. *.openhtc.io) and your own LAN.
Comparing credits and the setup between your computers and mine is completely useless - especially in case of ATLAS since mine are running ATLAS native which has different requirements.

As for your LAN switch:
Yes, connect your Squid to that switch, but do it!
And be patient and careful when you set it up.
Doing it in a hurry will fail.
ID: 47699
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47700 - Posted: 17 Jan 2023, 13:21:59 UTC - in response to Message 47699.  
Last modified: 17 Jan 2023, 13:22:26 UTC

As for the impatience:
The most impatient part of the equation appears to sit in front of your computer.
Just accept that (be aware: example values!) it is better to run 70 % of the max. possible tasks but return reliable results than to go for 80 % but return 50 % trash.
Patience would include not switching between things like VPN-on/VPN-off just for fun.
It has been explained many times that all LHC apps (except SixTrack) prefer and rely on stable network connections.
Why? Why can't they just retry or be more patient? Why can't they accept it's likely the internet connection is also used for other things, including many more instances of itself! This is a very basic thing. If you don't get an answer, try again. I can manage it when I make a phone call, for example.
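The kind of retry meant here can be sketched generically as retry with exponential backoff. This is only an illustration of the idea; it says nothing about what the CMS scripts actually do, and `retry_with_backoff` and `flaky_request` are hypothetical names.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a network-style error wait 1x, 2x, 4x... base_delay and retry."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as err:  # ConnectionError and TimeoutError are subclasses
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Usage example: a stand-in request that fails twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network has changed")
    return "idtoken"


delays = []
result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, "after waits of", delays)
```

The `sleep` parameter is injectable so the example runs instantly; in real use the default `time.sleep` would apply the waits.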

As for your internet connection:
You still don't accept that there's life beyond download bandwidth (your recent post suggests you may have started to think about that) and a local Squid is helpful even if your internet is the fastest you can get. This would help both the next hop (e.g. *.openhtc.io) and your own LAN.
Since it only helps downloads, I can't see it doing much to help how many tasks I can run at once. It will certainly ease some load at the LHC end, and give me the files a bit faster than otherwise, so I'll do it anyway. But I don't expect much more work to get done here because of it.

Comparing credits and the setup between your computers and mine is completely useless - especially in case of ATLAS since mine are running ATLAS native which has different requirements.
What's going on with me not being able to get them to use 8 cores at once like they used to?

As for the comparison, I was just using credits to see I was succeeding in getting things working well. 10th in the world must mean something. I'm not a show off credit hoarder, I use credits to see how much useful science I'm doing.

As for your LAN switch:
Yes, connect your Squid to that switch, but do it!
It will happen to be directly connected to the switch, but I don't see why it would be a problem if it was 5 switches away. If it's not going out of my property, it's a lot faster than the internet and saves LHC bandwidth.

And be patient and careful when you set it up.
Doing it in a hurry will fail.
Understood.
ID: 47700