Message boards : ATLAS application : docker-ce 29.x breaks ATLAS tasks: tmpfs mount failure and the fix
Joined: 25 Mar 24 · Posts: 3 · Credit: 1,191,600 · RAC: 5,619
# CPU Affinity Management and Core Rotation

*Applies to: BOINC 8.x, Linux, multi-core systems with sustained compute workloads*

---

## The Problem

BOINC does not manage CPU affinity for its worker processes. On a multi-core system running multiple simultaneous tasks, the Linux scheduler will distribute work across cores as it sees fit — which in practice often means a small number of cores carry most of the load while others sit idle. Over time this produces thermal hotspots, uneven hardware wear, and, on systems with SMT (hyperthreading), interference between threads sharing physical cores.

On this host, before any affinity management, 2-4 cores were regularly hitting 68-71°C while adjacent cores remained near idle. After the affinity script was deployed, sustained temperatures dropped to the 26-53°C range, with load distributed visibly across all cores.

A secondary problem: even with affinity pinning, always assigning the same heavy process to the same cores concentrates electromigration and thermal stress on those physical cores over months of continuous operation. Core rotation addresses this.

---

## Design Decisions

### Why not equal distribution?

The naive approach is to divide cores evenly — with 14 cores and 2 workers, give each 7. This ignores the reality that BOINC workers vary enormously in their CPU consumption:

- An ATLAS GEANT4 simulation may run at 800-900% CPU (8-9 cores' worth)
- A MilkyWay N-body task may run at 700% CPU
- An Einstein gravitational wave search may run at 90% CPU (nearly single-threaded, GPU-assisted)

Giving a 90% CPU process 7 cores wastes those cores. Giving an 800% process only 7 cores artificially throttles it. Proportional allocation gives each process a share of available cores that matches its actual demand.

### Why not per-thread pinning?

An early iteration of the script attempted to pin individual threads within a process to individual cores.
This produced excessive overhead — the script spent most of its time enumerating threads and issuing taskset calls — and the benefit over process-level pinning was marginal. Linux's scheduler handles intra-process thread distribution well once the process is confined to a set of cores. Process-level pinning was retained.

### The 20% CPU threshold

Processes below 20% CPU are not considered compute workers and are not pinned. This filters out BOINC's own utility processes (gawk, sleep, the boinccmd wrapper chain), which briefly appear as children of boinc-client but consume negligible CPU and should not be constrained to a subset of cores.

---

## How the Script Works

### Worker detection

Each cycle, the script builds a list of candidate PIDs by combining two sources:

- All descendants of the boinc-client process (recursive pgrep walk)
- Any processes matching known ATLAS/CVMFS name patterns (see ATLAS_ORPHAN_PROBLEM.md)

The combined, deduplicated list is then filtered by the 20% CPU threshold to produce the final worker list.

### Proportional core allocation

For each worker, the script calculates its share of total cores in proportion to its share of total CPU consumption across all workers. A minimum of 2 cores is guaranteed to any worker regardless of its relative share, to avoid assigning a single core to a multi-threaded process. The allocation is calculated as:

`allocated_cores = max(int((worker_cpu / total_cpu) * total_cores), MIN_CORES)`

The last worker in the list receives all remaining cores in the window — this ensures no cores are left unassigned due to integer rounding.

### Core rotation

Without rotation, the proportional allocator always starts at core 0. A heavy process consuming 800% CPU would always land on cores 0-7. Over months of continuous operation, those cores accumulate disproportionate thermal and electromigration stress.

The rotation mechanism maintains a counter that advances by ROTATION_STEP cores each cycle.
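The proportional rule can be sketched as a small shell function. The function and variable names here are illustrative, not taken from the author's actual script:

```shell
#!/usr/bin/env bash
# Sketch of the proportional allocation formula described above.
# Usage: allocate_cores TOTAL_CORES MIN_CORES cpu% [cpu% ...]
# Prints one core count per worker, in input order.
allocate_cores() {
    local total_cores=$1 min_cores=$2
    shift 2
    local total_cpu=0 c share
    for c in "$@"; do
        total_cpu=$(( total_cpu + c ))
    done
    for c in "$@"; do
        # int((worker_cpu / total_cpu) * total_cores) via integer truncation
        share=$(( c * total_cores / total_cpu ))
        if (( share < min_cores )); then
            share=$min_cores          # enforce the MIN_CORES floor
        fi
        echo "$share"
    done
}

# Example: an 800% ATLAS worker and a 90% Einstein worker on 14 cores
allocate_cores 14 2 800 90
```

Here the 800% worker receives int(800/890 × 14) = 12 cores and the 90% worker is lifted to the MIN_CORES floor of 2. In the real script, the last worker additionally absorbs any cores left over by integer rounding.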
The starting position for core assignment each cycle is:

`core_cursor = ROTATION_COUNTER % TOTAL_CORES`

On a 14-core system with ROTATION_STEP=2 and POLL_INTERVAL=10 seconds, the assignment window completes a full rotation across all cores every 70 seconds. Over a day of continuous operation, every core receives roughly equal aggregate load from heavy processes.

The script also issues `renice -n 19` alongside every taskset call. This ensures that even an escaped ATLAS orphan running at 900% CPU remains at the lowest scheduler priority, yielding to any interactive or system process that needs CPU time.

### Change detection

The script tracks which PIDs have been pinned. It reassigns cores every cycle regardless (to enforce rotation), but logs a "change detected" message when the worker set changes — a new PID appearing, an existing PID dying, or a PID dropping below the CPU threshold. This makes the journal useful for tracking task lifecycle.

---

## Configuration Parameters

All tunable parameters are at the top of the script:

- **POLL_INTERVAL** — seconds between cycles. Default 10. Lower values make the script more responsive to new workers but increase CPU overhead from the monitoring itself. On a system where tasks run for hours, 10-30 seconds is appropriate.
- **CPU_THRESHOLD** — minimum %CPU to be considered a compute worker. Default 20.0. Raise this if utility processes are being incorrectly identified as workers; lower it if lightweight tasks are being ignored.
- **MIN_CORES** — minimum cores to assign any single worker. Default 2. Prevents a low-CPU worker from being squeezed onto a single core when a high-CPU worker dominates the proportional calculation.
- **ROTATION_STEP** — cores to advance the window each cycle. Default 2. Higher values produce faster rotation but less time on each core window per cycle. On a 14-core system, step 2 completes a full rotation in 7 cycles (70 seconds at POLL_INTERVAL=10).
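The rotation arithmetic is simple enough to sketch directly. Names are hypothetical; the real script keeps its counter across polling cycles rather than in a loop:

```shell
#!/usr/bin/env bash
# Sketch of the rotation cursor: the window start advances ROTATION_STEP cores
# per cycle and wraps modulo TOTAL_CORES, so every core eventually leads the window.
TOTAL_CORES=14
ROTATION_STEP=2
counter=0
for cycle in 1 2 3 4 5 6 7 8; do
    cursor=$(( counter % TOTAL_CORES ))
    echo "cycle $cycle: window starts at core $cursor"
    counter=$(( counter + ROTATION_STEP ))
done
```

Cycle 7 starts the window at core 12 and cycle 8 wraps back to core 0, i.e. a full rotation every 7 cycles, which is 70 seconds at POLL_INTERVAL=10.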
---

## Systemd Integration

The script runs as a systemd service under `boinc-affinity.service`. Key unit settings:

- **AmbientCapabilities=CAP_SYS_NICE** — required for taskset and renice to operate on processes owned by other users (including root-owned ATLAS orphans) without running the entire script as root.
- **CapabilityBoundingSet=CAP_SYS_NICE** — limits the service to only this capability, following the principle of least privilege.
- **Restart=always** — the script is designed to run continuously. If it exits for any reason (including a clean exit triggered by SIGTERM when boinc-client restarts for a scheduler cycle), systemd restarts it after RestartSec=15 seconds.
- **After=boinc-client.service** — start ordering only, not a hard dependency. The script handles boinc-client not being present gracefully (it waits and retries). The BindsTo dependency present in an earlier version was removed because it caused the affinity service to die whenever boinc-client restarted for scheduler cycles.
- **ExecStartPre=/bin/sleep 5** — gives boinc-client time to spawn its initial workers before the affinity script starts scanning for them.

---

## Known Limitations

**Binary name logging** — the script reads /proc/PID/exe to get the worker's binary name for log output. For processes owned by root (ATLAS orphans), this requires root access. The service runs with CAP_SYS_NICE but not as root, so binary names for root-owned processes appear as an empty string in the journal. Pinning and renicing still work correctly — this is cosmetic only.

**Single-parent multi-threaded workers** — `taskset -a` on a process PID pins all threads of that process to the specified core set (plain `taskset -p` changes only the named thread, although threads spawned after the call inherit its affinity mask). However, the CPU usage reported by ps for the parent PID is the sum of all thread activity, so a 14-thread process showing 900% CPU will be treated as a single worker consuming 900% and allocated cores accordingly.
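The per-thread affinity masks can be verified directly from /proc. A minimal sketch, assuming `taskset` (util-linux) is available and core 0 is in the allowed set; it uses the current shell's own PID as the example target:

```shell
#!/usr/bin/env bash
# Sketch: confirm every thread of a process carries the expected affinity mask
# by reading Cpus_allowed_list from each entry under /proc/PID/task.
taskset -a -pc 0 $$ >/dev/null     # pin all threads of this shell to core 0
for t in /proc/$$/task/*/status; do
    grep '^Cpus_allowed_list' "$t"
done
```

Each line of output should report the same core list for every thread, which is what makes process-level pinning sufficient for the script's purposes.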
**Workers appearing below threshold mid-task** — some tasks throttle themselves during I/O phases or checkpointing and may temporarily drop below the 20% threshold. The script will unpin them during that cycle; they will be re-detected and re-pinned on the next cycle when CPU usage rises again. This is normal and harmless.

---

## Tuning for Different Hardware

On systems with more cores, consider increasing ROTATION_STEP proportionally to maintain a similar rotation period. On a 28-core system, ROTATION_STEP=4 would give a similar 70-second full rotation.

On systems with fewer cores (8 or fewer), MIN_CORES=2 may be too high if you regularly run 3 or more simultaneous workers. Reduce it to 1 if the allocation arithmetic is being distorted by the minimum floor.

If your workload is dominated by a single heavy multi-threaded task, rotation provides the most benefit. If your workload is many lightweight single-threaded tasks, proportional allocation provides the most benefit. The script handles both.
©2026 CERN