Message boards : ATLAS application : docker-ce 29.x breaks ATLAS tasks: tmpfs mount failure and the fix
superkali

Joined: 25 Mar 24
Posts: 3
Credit: 1,191,600
RAC: 5,619
Message 53182 - Posted: 18 Mar 2026, 7:06:08 UTC

# CPU Affinity Management and Core Rotation

*Applies to: BOINC 8.x, Linux, multi-core systems with sustained compute workloads*

---

## The Problem

BOINC does not manage CPU affinity for its worker processes. On a multi-core system running
multiple simultaneous tasks, the Linux scheduler will distribute work across cores as it sees
fit — which in practice often means a small number of cores carry most of the load while
others sit idle. Over time this produces thermal hotspots, uneven hardware wear, and on
systems with SMT (hyperthreading), interference between threads sharing physical cores.

On this host, before any affinity management, 2-4 cores were regularly hitting 68-71°C while
adjacent cores remained near idle. After the affinity script was deployed, sustained
temperatures dropped to the 26-53°C range with load distributed visibly across all cores.

A secondary problem: even with affinity pinning, always assigning the same heavy process to
the same cores concentrates electromigration and thermal stress on those physical cores over
months of continuous operation. Core rotation addresses this.

---

## Design Decisions

### Why not equal distribution?

The naive approach is to divide cores evenly — with 14 cores and 2 workers, give each 7.
This ignores the reality that BOINC workers vary enormously in their CPU consumption:

- An ATLAS GEANT4 simulation may run at 800-900% CPU (8-9 cores worth)
- A MilkyWay N-body task may run at 700% CPU
- An Einstein gravitational wave search may run at 90% CPU (nearly single-threaded, GPU-assisted)

Giving a 90% CPU process 7 cores wastes those cores. Giving an 800% process only 7 cores
artificially throttles it. Proportional allocation gives each process a share of available
cores that matches its actual demand.

### Why not per-thread pinning?

An early iteration of the script attempted to pin individual threads within a process to
individual cores. This produced excessive overhead — the script spent most of its time
enumerating threads and issuing taskset calls — and the benefit over process-level pinning
was marginal. Linux's scheduler handles intra-process thread distribution well once the
process is confined to a set of cores. Process-level pinning was retained.

### The 20% CPU threshold

Processes below 20% CPU are not considered compute workers and are not pinned. This filters
out BOINC's own utility processes (gawk, sleep, the boinccmd wrapper chain) which briefly
appear as children of boinc-client but consume negligible CPU and should not be constrained
to a subset of cores.

---

## How the Script Works

### Worker detection

Each cycle, the script builds a list of candidate PIDs by combining two sources:

- All descendants of the boinc-client process (recursive pgrep walk)
- Any processes matching known ATLAS/CVMFS name patterns (see ATLAS_ORPHAN_PROBLEM.md)

The combined deduplicated list is then filtered by the 20% CPU threshold to produce the
final worker list.
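The two-source detection pass can be sketched as below. This is an illustrative reimplementation, not the actual script: `descendants_of`, `cpu_qualifies`, `list_workers`, and the `athena|cvmfs2` name patterns are assumed names, and the real ATLAS/CVMFS patterns live in ATLAS_ORPHAN_PROBLEM.md.

```shell
#!/usr/bin/env bash
# Sketch of the worker-detection cycle: recursive descendant walk of
# boinc-client, union with name-pattern matches, dedup, then CPU filter.
CPU_THRESHOLD=20.0

# Recursively list all descendant PIDs of $1 (the pgrep walk).
descendants_of() {
    local child
    for child in $(pgrep -P "$1" 2>/dev/null); do
        echo "$child"
        descendants_of "$child"
    done
}

# True when a %CPU reading qualifies a process as a compute worker.
cpu_qualifies() {  # $1 = %CPU as printed by ps
    awk -v c="$1" -v t="$CPU_THRESHOLD" 'BEGIN { exit !(c >= t) }'
}

# Build the deduplicated candidate list, then filter by CPU threshold.
list_workers() {
    local boinc_pid pid cpu
    boinc_pid=$(pgrep -x boinc-client | head -n1)
    [ -n "$boinc_pid" ] || return 0
    { descendants_of "$boinc_pid"
      pgrep -f 'athena|cvmfs2'      # assumed ATLAS/CVMFS name patterns
    } | sort -un | while read -r pid; do
        cpu=$(ps -o %cpu= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -n "$cpu" ] && cpu_qualifies "$cpu" && echo "$pid"
    done
}
```

The `sort -un` handles deduplication of PIDs that appear in both sources, and the `2>/dev/null` on `ps` covers workers that exit between enumeration and sampling.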

### Proportional core allocation

For each worker, the script calculates its share of total cores proportionally to its share
of total CPU consumption across all workers. A minimum of 2 cores is guaranteed to any
worker regardless of its relative share, to avoid assigning a single core to a multi-threaded
process.

The allocation is calculated as:

allocated_cores = max( int( (worker_cpu / total_cpu) * total_cores ), MIN_CORES )

The last worker in the list receives all remaining cores in the window — this ensures no
cores are left unassigned due to integer rounding.
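The allocation rule, including the MIN_CORES floor and the last-worker remainder, can be sketched as a small awk pass over "pid cpu" pairs. The function name and input format are illustrative; only the arithmetic follows the text above.

```shell
#!/usr/bin/env bash
TOTAL_CORES=14
MIN_CORES=2

# Reads "pid cpu" lines on stdin, prints "pid ncores" per worker.
allocate_cores() {
    awk -v total="$TOTAL_CORES" -v min="$MIN_CORES" '
        { pid[NR] = $1; cpu[NR] = $2; sum += $2 }
        END {
            used = 0
            for (i = 1; i <= NR; i++) {
                if (i == NR)
                    n = total - used   # last worker absorbs rounding slack
                else {
                    n = int(cpu[i] / sum * total)
                    if (n < min) n = min
                }
                used += n
                print pid[i], n
            }
        }'
}
```

For example, the ATLAS-plus-Einstein mix from the earlier section (800% and 90% CPU on 14 cores) allocates int(800/890 × 14) = 12 cores to the first worker and the remaining 2 to the last.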

### Core rotation

Without rotation, the proportional allocator always starts at core 0. A heavy process
consuming 800% CPU would always land on cores 0-7. Over months of continuous operation, those
cores accumulate disproportionate thermal and electromigration stress.

The rotation mechanism maintains a counter that advances by ROTATION_STEP cores each cycle.
The starting position for core assignment each cycle is:

core_cursor = ROTATION_COUNTER % TOTAL_CORES

On a 14-core system with ROTATION_STEP=2 and POLL_INTERVAL=10 seconds, the assignment window
completes a full rotation across all cores every 70 seconds. Over a day of continuous
operation, every core receives roughly equal aggregate load from heavy processes.
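The cursor arithmetic can be sketched as follows, assuming the parameter names from this post; `window_for` and `advance_cursor` are illustrative helper names, and the wraparound behaviour past the last core is an assumption about how the window is mapped onto core IDs.

```shell
#!/usr/bin/env bash
TOTAL_CORES=14
ROTATION_STEP=2
ROTATION_COUNTER=0

# Print $2 core IDs starting at cursor $1, wrapping past the last core,
# in the comma-separated form taskset -cp expects.
window_for() {
    local i ids=""
    for ((i = 0; i < $2; i++)); do
        ids+="${ids:+,}$(( ($1 + i) % TOTAL_CORES ))"
    done
    echo "$ids"
}

# Advance the rotation counter by ROTATION_STEP each cycle, modulo core count.
advance_cursor() {
    ROTATION_COUNTER=$(( (ROTATION_COUNTER + ROTATION_STEP) % TOTAL_CORES ))
}
```

Each worker would then be pinned and deprioritised with something like `taskset -cp "$(window_for "$cursor" "$n")" "$pid"` followed by `renice -n 19 -p "$pid"`, matching the renice behaviour described below.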

The script also issues renice -n 19 alongside every taskset call. This ensures that even
an escaped ATLAS orphan running at 900% CPU remains at the lowest scheduler priority,
yielding to any interactive or system process that needs CPU time.

### Change detection

The script tracks which PIDs have been pinned. It reassigns cores every cycle regardless
(to enforce rotation), but logs a "change detected" message when the worker set changes —
a new PID appearing, an existing PID dying, or a PID dropping below the CPU threshold.
This makes the journal useful for tracking task lifecycle.
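The bookkeeping behind that log line can be sketched as an order-insensitive set comparison against the previous cycle. The function and variable names here are illustrative, not taken from the actual script.

```shell
#!/usr/bin/env bash
PREV_WORKERS=""

# $@ = current worker PIDs; returns 0 (and logs) only when the set changed.
log_if_changed() {
    local now
    now=$(printf '%s\n' "$@" | sort -n | tr '\n' ' ')
    if [ "$now" = "$PREV_WORKERS" ]; then
        return 1    # same worker set as last cycle: reassign silently
    fi
    echo "change detected: [${PREV_WORKERS}] -> [${now}]" >&2
    PREV_WORKERS=$now
    return 0
}
```

Sorting before comparison makes the check insensitive to enumeration order, so only genuine arrivals, deaths, and threshold crossings produce a journal entry.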

---

## Configuration Parameters

All tunable parameters are at the top of the script:

POLL_INTERVAL — seconds between cycles. Default 10. Lower values make the script more
responsive to new workers but increase CPU overhead from the monitoring itself. On a system
where tasks run for hours, 10-30 seconds is appropriate.

CPU_THRESHOLD — minimum %CPU to be considered a compute worker. Default 20.0. Raise this
if utility processes are being incorrectly identified as workers. Lower it if lightweight
tasks are being ignored.

MIN_CORES — minimum cores to assign any single worker. Default 2. Prevents a low-CPU worker
from being squeezed onto a single core when a high-CPU worker dominates the proportional
calculation.

ROTATION_STEP — cores to advance the window each cycle. Default 2. Higher values move the
assignment window around the core set faster, at the cost of less consecutive time on any
one group of cores. On a 14-core system, step 2 completes a full rotation in 7 cycles
(70 seconds at POLL_INTERVAL=10).
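Taken together, the defaults described above would appear at the top of the script roughly like this (the names come from this post; the values are the stated defaults):

```shell
#!/usr/bin/env bash
POLL_INTERVAL=10      # seconds between cycles
CPU_THRESHOLD=20.0    # minimum %CPU to count as a compute worker
CPU_THRESHOLD=20.0
MIN_CORES=2           # floor on cores assigned to any single worker
ROTATION_STEP=2       # cores the assignment window advances per cycle
```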

---

## Systemd Integration

The script runs as a systemd service under `boinc-affinity.service`. Key unit settings:

AmbientCapabilities=CAP_SYS_NICE — required for taskset and renice to operate on processes
owned by other users (including root-owned ATLAS orphans) without running the entire script
as root.

CapabilityBoundingSet=CAP_SYS_NICE — limits the service to only this capability, following
the principle of least privilege.

Restart=always — the script is designed to run continuously. If it exits for any reason
(including a clean exit triggered by SIGTERM when boinc-client restarts for a scheduler
cycle), systemd restarts it after RestartSec=15 seconds.

After=boinc-client.service — start ordering only, not a hard dependency. The script handles
boinc-client not being present gracefully (it waits and retries). The BindsTo dependency
that was present in an earlier version was removed because it caused the affinity service
to die whenever boinc-client restarted for scheduler cycles.

ExecStartPre=/bin/sleep 5 — gives boinc-client time to spawn its initial workers before
the affinity script starts scanning for them.
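Assembling the settings above, the unit file might look like the sketch below. The script path and the Description/Install sections are assumptions; the capability, restart, ordering, and ExecStartPre lines are the ones described in this section.

```ini
# /etc/systemd/system/boinc-affinity.service (illustrative; path assumed)
[Unit]
Description=BOINC CPU affinity and core rotation
After=boinc-client.service

[Service]
ExecStartPre=/bin/sleep 5
ExecStart=/usr/local/bin/boinc-affinity.sh
AmbientCapabilities=CAP_SYS_NICE
CapabilityBoundingSet=CAP_SYS_NICE
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target
```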

---

## Known Limitations

Binary name logging — the script reads /proc/PID/exe to get the worker's binary name for
log output. For processes owned by root (ATLAS orphans), this requires root access. The
service runs with CAP_SYS_NICE but not as root, so binary names for root-owned processes
appear as an empty string in the journal. Pinning and renicing still work correctly — this
is cosmetic only.

Single-parent multi-threaded workers — taskset on a process PID pins all threads of that
process to the specified core set. This is correct behaviour: child threads inherit the
affinity mask. However, the CPU usage reported by ps for the parent PID is the sum of all
thread activity, so a 14-thread process showing 900% CPU will be treated as a single worker
consuming 900% and allocated cores accordingly.

Workers appearing below threshold mid-task — some tasks throttle themselves during I/O
phases or checkpointing and may temporarily drop below the 20% threshold. The script will
unpin them during that cycle. They will be re-detected and re-pinned on the next cycle when
CPU usage rises again. This is normal and harmless.

---

## Tuning for Different Hardware

On systems with more cores, consider increasing ROTATION_STEP proportionally to maintain
a similar rotation period. On a 28-core system, ROTATION_STEP=4 would give a similar 70-second
full rotation.

On systems with fewer cores (8 or fewer), MIN_CORES=2 may be too high if you regularly run
3 or more simultaneous workers. Reduce to 1 if allocation arithmetic is being distorted by
the minimum floor.

If your workload is dominated by a single heavy multi-threaded task, rotation provides the
most benefit. If your workload is many lightweight single-threaded tasks, proportional
allocation provides the most benefit. The script handles both.



©2026 CERN