Message boards : Number crunching : Setting up a local Squid to work with LHC@home - HowTo
computezrmle
Message 42987 - Posted: 9 Jul 2020, 14:20:02 UTC
Last modified: 9 Jul 2020, 14:21:21 UTC

Older comments regarding a Squid configuration can be found here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4611

New comments and questions should be posted here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474



1. Introduction

All LHC@home tasks except SixTrack require permanent access to various software repositories and databases.
The total amount of data stored there is huge - much too huge to be downloaded completely by each client.
Hence, CERN (together with other scientific facilities around the world) distributes the data through CVMFS (https://cvmfs.readthedocs.io) and Frontier (http://frontier.cern.ch/), two systems that allow each client to download just the parts of the data required to run a task.

Since the tasks are organized in series that use similar sets of data, and since CVMFS and Frontier transfer the data via HTTP, standard proxy caches like Squid (http://www.squid-cache.org) can be used to make the distribution more efficient.
The result is a multi-tier proxy network made of hundreds of Squid proxies that are part of the Worldwide LHC Computing Grid (WLCG => https://wlcg.web.cern.ch).

Nonetheless, one bottleneck remains: the internet section between the volunteer's router and the ISP's backbone.
Even nowadays this section is in nearly all cases slower than the local LAN or the main internet backbones most WLCG Squids are attached to. To reduce the impact of that bottleneck, a local Squid can be run inside the volunteer's LAN, which keeps most of the data close to the place where it is used.

Squid is available for Linux as well as for Windows, which together represent the majority of systems running LHC@home tasks.
It can be installed on the computer that runs the tasks or on a separate machine; the latter is recommended for larger installations. For performance reasons it is not recommended to install Squid in a virtual machine.



2. Hardware Requirements

1 CPU core (even much less in most cases)
256 MB RAM dedicated to Squid
2-4 GB RAM headroom to be used by the OS disk cache (shared with other processes)
15-30 GB disk space (the example configures a 20 GB disk cache)
Wired LAN; not slower than Squid's client devices; wi-fi is not recommended




3. Getting Squid

A recent Squid package is included in most modern Linux distributions and other Unix-like OSs.
It should be used if the Squid version is at least 3.5.27 or 4.9.
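
A recent package can usually be installed via the distribution's package manager, e.g. "apt install squid" on Debian/Ubuntu or "dnf install squid" on Fedora. The installed version can then be checked by running:
squid -v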

System admins who plan to integrate their Squid into WLCG should consider using the Frontier Squid package instead:
https://twiki.cern.ch/twiki/bin/view/Frontier/FrontierOverview

Volunteers running Windows should use the Squid package provided by Diladele:
https://squid.diladele.com

All others should check the Squid website:
http://www.squid-cache.org/Download




4. Squid Configuration Changes

All Squid packages ship with a default configuration file squid.conf that usually allows basic internet access once the local network has been given permission. This basic configuration should be replaced with the BOINC-optimized configuration at the bottom of this post.

Some installation packages, including Diladele's Windows installer, tend to automatically start Squid at the end of the installation procedure. Hence, Squid should explicitly be stopped now and given a grace period of at least 30 seconds to shut down.


On Linux run as root:
systemctl stop squid.service
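
Whether the service has really stopped can be verified by running (the service name may differ slightly between distributions):
systemctl status squid.service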



On Windows the installer creates 2 new icons on the desktop:
- "Squid Terminal"
- "Squid Server Tray"

Open "Squid Terminal" as Administrator and run:
squid -k shutdown

Then wait until both "squid.exe" entries disappear from Task Manager's Processes tab.
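
Alternatively the check can be done from the "Squid Terminal" using standard Windows tools; no output means Squid has stopped:
tasklist | findstr /i squid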



5. Preparing the Disk Cache Directory

Using a disk cache may be disabled in the default configuration, but it is enabled in the squid.conf below.
This requires that the top-level directory given on the "cache_dir" line exists before Squid starts. In addition, the Squid process must have write access to this directory.


On Linux:
If "/var/cache/squid" is given at the "cache_dir" line, "/var/cache/squid" must exist.


On Windows:
If "/var/cache/squid" is given at the "cache_dir" line and Squid had been installed in "C:\Squid", "C:\Squid\var\cache\squid" must exist.

The complete disk cache structure can then be created by running:
squid -z



To purge and recreate the disk cache, Squid must first be stopped.
Then remove all files and directories below "/var/cache/squid" and run "squid -z".
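
On Linux the complete purge/recreate cycle might look like this (a sketch, assuming a systemd based system and the cache_dir from the squid.conf below):
systemctl stop squid.service
rm -rf /var/cache/squid/*
squid -z
systemctl start squid.service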



6. Basic Tests

To avoid unpermitted internet access, Squid does not forward requests from network devices that are not explicitly allowed. To test whether a local computer can make internet requests via the local Squid, start a browser on that computer and configure it to use the local Squid (Squid's hostname or IP and its TCP port). It should now be possible to visit arbitrary internet pages. In addition, Squid's access.log should list all requests.
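
The same test can be done from a shell instead of a browser, e.g. with curl. A sketch, assuming the Squid box is reachable as 192.168.0.10 (replace with your own hostname or IP):
curl -x http://192.168.0.10:3128 -I http://info.cern.ch/

The command should print an HTTP status line and the request should appear in access.log.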

If Squid denies requests from a local computer that should have internet access, squid.conf has to be checked and changes have to be activated by running:
squid -k reconfigure




7. Dealing with Huge Logfiles

Since Squid writes a line to access.log for every request, that file can grow by hundreds of MB per day on heavily used systems. Although writing an access.log can be turned off, it is recommended to rotate the logfiles instead. Squid's native command to rotate the logfiles cache.log and access.log is:
squid -k rotate

With default settings Squid keeps a history of up to 10 old logfiles beside the ones that are currently in use.

On Linux systems logfile rotation is usually configured via the logrotate utility.
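
Where the distribution does not already ship a logrotate snippet for Squid, a simple daily cron job in root's crontab can trigger Squid's native rotation instead (a sketch; the binary path is an assumption and may differ):
0 0 * * * /usr/sbin/squid -k rotate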

On Windows systems the following command creates a simple scheduled task that does the rotation every night:
SCHTASKS /Create /RU SYSTEM /SC DAILY /TN "Squid\RotateSquidLogs" /TR "C:\Squid\bin\squid.exe -k rotate" /ST 00:00 /RL HIGHEST

It should be modified according to local needs.



8. Connecting the BOINC Client

If a computer is allowed to use the local Squid, a BOINC client on that computer can be configured to send requests via Squid by entering the hostname or IP of the Squid machine and the TCP port Squid is listening on (usually 3128) into the BOINC Manager's proxy form "Options -> Other Options -> HTTP Proxy". In addition, the checkbox "Connect via HTTP Proxy" has to be checked.

When the settings are saved, BOINC shows the following messages:
Using proxy info from GUI
Using HTTP proxy squid_hostname_or_IP:3128
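
On headless machines the same settings can be applied with the boinccmd tool. A sketch, assuming the usual positional argument order of --set_proxy_settings (HTTP proxy settings first, then SOCKS settings; verify with "boinccmd --help" on the client version in use):
boinccmd --set_proxy_settings squid_hostname_or_IP 3128 "" "" "" 0 "" "" ""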


All LHC@home tasks automatically read the proxy settings from the BOINC client and use them for their CVMFS and Frontier configuration.

Exception:
Native tasks on Linux use an independent local CVMFS client. This client requires the following entry in /etc/cvmfs/default.local:
CVMFS_HTTP_PROXY="http://squid_hostname_or_IP:3128;DIRECT"
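
Whether the native CVMFS client really uses the proxy can be verified once a repository is mounted. A sketch, assuming the atlas.cern.ch repository:
cvmfs_config stat -v atlas.cern.ch

The output should list the local Squid as the active proxy.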




9. Acknowledgements

Thanks to maeax and Harri Liljeroos for running the Windows test configuration.



10. Basic squid.conf to be used for BOINC

This squid.conf replaces the default configuration file shipped with the installation packages.
Some parameters still have to be changed according to the local network environment.
See the comments for details.

# Squid configuration for BOINC
# Based on squid version 3.5
# See also: http://www.squid-cache.org/

# Every line starting with "#" represents a comment.

# Define your local hosts/networks here.
# If neither "crunchers" nor "localnet" is set, none of your devices will be permitted to use the proxy.
# The examples show the principle.
# For advanced options read the Squid documentation.
#
# Examples:
#
# Either enter a list of IPs representing your computers that are permitted to use the proxy.
# Each IP on a separate line.
# acl crunchers src 198.51.100.20
# acl crunchers src 198.51.100.31
# acl crunchers src 198.51.100.37
# acl crunchers src 198.51.100.42
#
# Or enter complete network ranges.
# Be aware that this may permit devices like printers or TVs that you may not want to use the proxy.
# acl localnet src 192.0.2.0/24
# acl localnet src 198.51.100.0/24
# acl localnet src 203.0.113.0/24



acl SSL_ports port 443
acl Safe_ports port 80
acl Safe_ports port 443
acl Safe_ports port 1025-65535	# unregistered ports

acl CONNECT method CONNECT



follow_x_forwarded_for allow localhost
follow_x_forwarded_for deny all





#
# Start of extra section 1
# Requests that need special handling

# worldcommunitygrid doesn't like it if data is taken from the local cache
acl wcg_nocache dstdomain .worldcommunitygrid.org
cache deny wcg_nocache


# if CVMFS uses geoapi, ensure it's checked directly
acl cvmfs_geoapi urlpath_regex -i ^/+cvmfs/+[0-9a-z._~-]+/+api/+[0-9a-z._~-]+/+geo/+[0-9a-z._~-]+/+[0-9a-z.,_~-]+
cache deny cvmfs_geoapi


# avoids polluting the disk cache with typical onetimers, e.g. ATLAS job data
acl boinc_nocache urlpath_regex -i /download[0-9a-z._~-]*/+[0-9a-z._~-]+/+.+
cache deny boinc_nocache


# seriously: do NOT cache that!
# Based on a frontier cache suggestion
acl PragmaNoCache req_header Pragma no-cache
cache deny PragmaNoCache

# End of extra section 1
#



#
# Start of extra section 2
# parent cache configuration
#
# ATLAS tasks route frontier requests via predefined WLCG proxy chains including load balancing and fail-over.
# The following lines ensure those proxy chains are respected by a local squid as intended by the CERN ATLAS team.

acl request_via_atlasfrontier_chain url_regex -i ^http://+atlasfrontier[1-4]?-ai\.cern\.ch:8000/+[^/]+

cache_peer atlas-db-squid.grid.uio.no parent 3128 0 no-query no-digest weighted-round-robin no-netdb-exchange connect-timeout=7 connect-fail-limit=1
cache_peer_access atlas-db-squid.grid.uio.no allow request_via_atlasfrontier_chain

cache_peer dcache.ijs.si parent 3128 0 no-query no-digest weighted-round-robin no-netdb-exchange connect-timeout=7 connect-fail-limit=1
cache_peer_access dcache.ijs.si allow request_via_atlasfrontier_chain

cache_peer atlasfrontier-ai.cern.ch parent 8000 0 no-query no-digest no-netdb-exchange connect-fail-limit=1
cache_peer_access atlasfrontier-ai.cern.ch allow request_via_atlasfrontier_chain

never_direct allow request_via_atlasfrontier_chain

# End of extra section 2
#


acl Purge method PURGE


http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localhost manager
http_access deny manager
http_access deny to_localhost

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
# Depending on the definition of "crunchers" or "localnet" above at least 1 of the following lines must be uncommented.
# Examples:
# http_access allow crunchers
# http_access allow localnet


http_access allow localhost
# Last "http_access" line.
# Order matters, hence all "http_access" lines following this one will be ignored.
http_access deny all



# http_port
# Don't bind it to an IP that is accessible from outside unless you know what you are doing.
# Examples:
# http_port localhost:3128
#
# This assumes 198.51.100.99 is the external IP of the Squid box
# http_port 198.51.100.99:3128
#
# default setting that binds Squid to all IPs of the Squid box
http_port 3128


# A MUST on Windows.
# If unsure try the LAN IP of your internet router.
# Avoid using external DNS here.
# On Linux this option shouldn't be necessary.
dns_nameservers 198.51.100.1


max_filedescriptors 4096


# Required OFF for intercepted traffic from LHCb VMs
client_dst_passthru off


# You don't believe this is enough?
# For sure, it is!
cache_mem 256 MB
maximum_object_size_in_memory 24 KB
memory_replacement_policy heap GDSF


# Keep it large enough to store vdi files in the cache.
# See extra section 1 for how to avoid onetimers eating up your disk storage.
# min-size=xxx keeps very small files away from your disk
# 20000 limits the disk cache to 20 GB
cache_replacement_policy heap LFUDA
maximum_object_size 6144 MB
cache_dir aufs /var/cache/squid 20000 16 64 min-size=7937


# default=10
logfile_rotate 10

# logformat has to be changed according to your needs and the capabilities of your logfile analyser
logformat my_awstats %>A %lp %ui %un [%tl] "%rm %>ru HTTP/%rv" %>Hs %st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
access_log stdio:/var/log/squid/access.log logformat=my_awstats
#access_log none
strip_query_terms off

coredump_dir none
ftp_user anonymous@


# max_stale 1 week  #default
# extended to be prepared for a project reset
max_stale 37 days

# At least 1 refresh_pattern line is required to avoid the ancient default settings
# be conservative
# don't violate the HTTP standards
refresh_pattern .	0	0%	0

store_avg_object_size 1800 KB

shutdown_lifetime 0 seconds


# booster 1!
collapsed_forwarding on


# booster 2!
client_persistent_connections on
server_persistent_connections on


log_icp_queries off


dns_defnames on
dns_v4_first on

forwarded_for transparent

##### End of squid.conf
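
After the file has been adapted, its syntax can be validated before (re)starting Squid by running:
squid -k parse
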
computezrmle
Message 44222 - Posted: 29 Jan 2021, 10:33:06 UTC

When should a local HTTP proxy like Squid be used?

As explained in the OP, LHC@home tasks make lots of HTTP requests to external repositories and DBs - many more requests than all other known BOINC projects combined.

In the past it was suggested to use a local proxy if more than 10 worker nodes were attached to the project.
Recent changes to CVMFS reduced this number to 5 worker nodes.
The official CVMFS configuration repository states this:
"For individual clients (laptops, clusters < 5 nodes),
use a site proxy where possible ..."



How are worker nodes calculated?

Example 1:
A volunteer has 1 computer and runs 3 Theory tasks concurrently which are all singlecore.
=> This counts as 3 worker nodes.

Example 2:
A volunteer has 3 computers and runs 3 + 4 Theory tasks and 2 CMS tasks concurrently.
=> This counts as 3 + 4 + 2 = 9 worker nodes.

Example 3:
A volunteer has 2 computers and concurrently runs
- 3 Theory tasks on computer 1
- 2 ATLAS tasks on computer 2 using a 4-core setup
=> This counts as 3 + (2 * 4) = 11 worker nodes.

Example 4:
A volunteer has 2 32-core computers and concurrently runs 3 ATLAS tasks on each of them in an 8-core setup
=> This counts as 2 * 3 * 8 = 48 worker nodes.
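
In general:
#worker nodes = sum over all concurrently running tasks of the #cores used by each task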



Is it a MUST to use a local proxy?

Nobody can or will be forced.
The suggestion is based on the experience of the people running the CVMFS/Frontier systems.
The worker node counts should be seen as a guideline, especially in the case of examples 1/2/3.
Example 4 would be far above the limit; hence, that volunteer should definitely use a local proxy.

As a rule of thumb:
The more worker nodes are running, the more the suggestion should be seen as a must.



Benefits

An increasing proxy hitrate should be visible if >10 worker nodes are running.
16 worker nodes should already result in a hitrate around 25 %.
80-90 worker nodes typically result in a hitrate >90 %.
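
On Linux the current hitrate can be read from Squid's cache manager interface, e.g. (a sketch; requires the squidclient tool and manager access from localhost, which the squid.conf from the OP allows):
squidclient mgr:info | grep -i hit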

A single Squid instance can serve hundreds of worker nodes.
