1) Message boards : CMS Application : No resends on failures (Message 49847)
Posted 2 days ago by computezrmle
Post:
Is there a reason why the project doesn't resend these?

Yes.
At the project server the resend quota is intentionally set to 1 for vbox apps.
This is because BOINC tasks are only "envelopes" that launch a VM.
The scientific jobs are then done inside the VM.
2) Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs (Message 49837)
Posted 2 days ago by computezrmle
Post:
As always, the BOINC tasks are only envelopes created by a server script (or an independent backend system).
Prod and dev each run their own script, independent from the other one.

CMS jobs are created and administered by a rather complex backend workflow.
There is 1 active workflow feeding both prod and dev.
To maintain separate workflows would be a huge effort and there are not enough volunteers on dev to guarantee a steady return rate.

Although the CMS vdi is the same for prod and dev (hence it can run 1-core or n-core), the BOINC app plus the partly hardwired job startup scripts on prod configure the VM to accept only single-core jobs.
On dev the BOINC app is a full multicore app (which can also run 1-core jobs).

ATM the workflow queue contains jobs that are configured to run on 2-core systems.
Hence, they can run on dev but fail on prod.


I'm sure Ivan and the CMS team are working on a solution to work out the parameters that are necessary to run stable multicore jobs.
Once this is done I expect the single-core CMS app on prod will be replaced by a multicore app.

Be patient.
Give them the time they need.
3) Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs (Message 49828)
Posted 3 days ago by computezrmle
Post:
It needs to be clarified whether

1. a workflow batch at the backend must be configured to run on n cores before any work is sent out

2. a task on a volunteer VM can forward its own #cores to the CMS app, and CMS uses this #cores.
Like:
2-core VM -> 2-core CMS
4-core VM -> 4-core CMS


Sending out fixed n-core CMS tasks to a VM not running n cores makes no sense.
4) Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs (Message 49826)
Posted 3 days ago by computezrmle
Post:
Got a 2-core CMS on a 4-core VM connected to -dev.

After a long setup phase (~18 min) with CPU usage between idle and 100 % (= 1 core), cmsRun switched to ~200 %.
This indicates it uses 2 cores inside the VM.
Monitoring data on the host confirms this.


The long setup phase is not an error, as
- the box runs another BOINC client running lots of Theory tasks
- the CMS task itself made lots of internet requests to update CVMFS/Frontier data


Unfortunately, all monitoring consoles in the VM except console 1 and console 3 (top) do not work.
5) Message boards : Theory Application : New native version v300.08 (Message 49820)
Posted 5 days ago by computezrmle
Post:
The log entries should look a bit like these:
06:14:37 CET +01:00 2024-03-23: cranky: [INFO] Starting runc container.
06:14:38 CET +01:00 2024-03-23: cranky: [INFO] To get some details on systemd level run
06:14:38 CET +01:00 2024-03-23: cranky: [INFO] systemctl status Theory_2743-2785673-11_0.scope
06:14:38 CET +01:00 2024-03-23: cranky: [INFO] mcplots runspec: boinc pp jets 7000 80,-,1060 - herwig++ 2.7.1 UE-EE-5 100000 11
06:14:38 CET +01:00 2024-03-23: cranky: [INFO] ----,^^^^,<<<~_____---,^^^,<<~____--,^^,<~__;_
06:21:28 CET +01:00 2024-03-23: cranky: [INFO] Pausing systemd unit Theory_2743-2785673-11_0.scope
06:22:27 CET +01:00 2024-03-23: cranky: [INFO] Resuming systemd unit Theory_2743-2785673-11_0.scope
06:32:49 CET +01:00 2024-03-23: cranky: [INFO] Pausing systemd unit Theory_2743-2785673-11_0.scope
06:32:58 CET +01:00 2024-03-23: cranky: [INFO] Resuming systemd unit Theory_2743-2785673-11_0.scope
06:33:24 CET +01:00 2024-03-23: cranky: [INFO] Pausing systemd unit Theory_2743-2785673-11_0.scope
06:34:03 CET +01:00 2024-03-23: cranky: [INFO] Resuming systemd unit Theory_2743-2785673-11_0.scope
07:42:57 CET +01:00 2024-03-23: cranky: [INFO] Container Theory_2743-2785673-11_0 finished with status code 0.
07:42:57 CET +01:00 2024-03-23: cranky: [INFO] Preparing output.
07:42:58 (102851): cranky exited; CPU time 5042.031816
07:42:58 (102851): called boinc_finish(0)



Yours look weird:
06:16:59 AWST +08:00 2024-03-23: cranky-0.1.4: [INFO] mcplots runspec: boinc pp jets 13000 260 - pythia6 6.428 ambt1 100000 9
06:16:59 AWST +08:00 2024-03-23: cranky-0.1.4: [INFO] ----,^^^^,<<<~_____---,^^^,<<~____--,^^,<~__;_
07:39:34 (590135): wrapper (7.15.26016): starting
07:39:34 (590135): wrapper (7.15.26016): starting
.
.
.
time="2024-03-23T07:39:38+08:00" level=error msg="container with id exists: Theory_2743-2733248-9_1"

It looks like the task started from scratch (for an unknown reason).
It finally failed because runc didn't remove the container id from the 1st attempt.


Which systemctl version do you use (must be at least v246)?
Please post the output of "systemctl --version" plus the status output of a currently running Theory task.
You get the latter via a command like this:
systemctl --no-pager status Theory_2743-2733248-9_1.scope
6) Message boards : Theory Application : New native version v300.08 (Message 49812)
Posted 5 days ago by computezrmle
Post:
Looks good.


Since a major goal of that version is to make suspend/resume work via systemd, you may want to test this.

Select a currently running task in BOINC manager (or your preferred BOINC tool) and pause the task.
Test this with a task that has already started the container (see stderr.txt).
Then this should happen:

1.
You should find a corresponding line in the task's stderr.txt

2.
run the "systemctl status ..." command shown in stderr.txt (press 'q' to exit the pager).
The output should mention the scope as "frozen".


A while later resume the task via the BOINC management tool.
Check again stderr.txt and the scope status.


Hint:
Although it would be possible to manually freeze/thaw the scope via systemctl, this should not be done because BOINC will not be notified.
Hence, always use BOINC for this.
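The scope check described above can be sketched as a guarded shell snippet; the unit name is an example taken from my own log, so yours will differ.

```shell
# Example unit name; take the real one from the task's stderr.txt.
UNIT="Theory_2743-2785673-11_0.scope"
if command -v systemctl >/dev/null 2>&1; then
  # A frozen scope is flagged in the status output (systemd >= 246).
  systemctl --no-pager status "$UNIT" 2>/dev/null | grep -i "frozen" \
    || echo "scope not frozen (or unit not found)"
else
  echo "systemctl not available"
fi
```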
7) Message boards : Theory Application : Native Theory 300.08 configuration issue (Message 49810)
Posted 6 days ago by computezrmle
Post:
OK, I see where it comes from.
You shouldn't use that any more for (mainly) the following reasons:

1.
The thread explains settings for cgroups v1.
These can't be mixed with cgroups v2.

2.
The recent cranky app uses systemd-run to start its main part as a systemd scope.
This method delegates suspend/resume to systemd which implicitly uses cgroups v2.
Hence, users don't need to directly fiddle around with cgroups(v1) stuff any more.

Furthermore, cgroups v1 support is already deprecated in systemd and as a result in all Linux distros using it.



As for the systemd version

This is what the original maintainer's manpage states.
See:
https://www.freedesktop.org/software/systemd/man/latest/systemd-run.html
--unit=, -u
    Use this unit name instead of an automatically generated one.
    Added in version 206.

--property=, -p
    Sets a property on the scope or service unit that is created. This option takes an assignment in the same format as systemctl(1)'s set-property command.
    Added in version 211.
    
--slice-inherit
    Make the new .service or .scope unit part of the inherited slice. This option can be combined with --slice=.
    An inherited slice is located within systemd-run slice. Example: if systemd-run slice is foo.slice, and the --slice= argument is bar, the unit will be placed under the foo-bar.slice.
    Added in version 246.

The latter might be a problem since cranky uses "--slice-inherit" and your version reports v239.
You may need to upgrade systemd or use a more recent Linux distro.

Hint:
Systemd v246 was released in July 2020.
https://lwn.net/Articles/827675/
Hence, more than 3 years before this cranky version.
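The version check discussed above can be sketched in shell; the echoed sample string is an assumption mimicking "systemd-run --version" output on an older distro, so replace the echo with the real command on your system.

```shell
# Extract the version number (2nd field of the 1st line) and compare it
# against v246, the first release that supports --slice-inherit.
ver=$(echo "systemd 239 (239-74.el8)" | awk 'NR==1 {print $2}')
if [ "$ver" -lt 246 ]; then
  echo "systemd $ver is too old for --slice-inherit (needs >= 246)"
fi
```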
8) Message boards : Theory Application : Native Theory 300.08 configuration issue (Message 49806)
Posted 7 days ago by computezrmle
Post:
Your computers (hence their logs) are not visible to other volunteers.
Please make them visible in your prefs.


using systemctl edit boinc-client.service to set:

ProtectControlGroups=no

You modified "ProtectControlGroups"?
Why?
The usual suggestion is to replace "ProtectSystem=strict" with "ProtectSystem=full".

Did you set other hardening options?
If so, they may stop BOINC from working.
Start with the settings your Linux vendor ships at installation time.


I also added this to /etc/fstab:

tmpfs  /sys/fs/cgroup  tmpfs  rw,nosuid,nodev,noexec,mode=755  0  0

Why?
Cgroups are kernel-internal administrative structures.
If enabled, they are automatically mounted at /sys/fs/cgroup.
There's usually no need to mount them via fstab or force them through tmpfs.



Note that in LHCATHOMEBOINC_03 I changed -u to --unit=, because -u was giving me an unknown option error on my OS.

This looks weird for the following reasons:

1.
The cranky script calls systemd-run with "-u", which MUST match the Cmnd_Alias in the sudoers file.
If there's no match, the command will not be recognized by sudo.
But you don't have a match, since you did not modify the command within the cranky script, did you?

2.
"--unit" and its short form "-u" were both introduced in the same systemd version.
Either both are allowed to be used or none.

3.
Systemd-run, as called by cranky, also uses the "-p" option.
That option was introduced after the "unit" options.
If "-p" works, there's no reason why both "unit" options shouldn't.

Please post the output of "systemd-run --version".
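Point 1 above can be illustrated with a hypothetical sudoers fragment; the alias name, path, and pattern here are made up, and the real file shipped with the app may differ:

```
# Hypothetical sudoers sketch: sudo only accepts the command when it
# matches the Cmnd_Alias verbatim, so "-u" vs "--unit=" matters.
Cmnd_Alias CRANKY_CMDS = /usr/bin/systemd-run -u Theory_*
boinc ALL=(ALL) NOPASSWD: CRANKY_CMDS
```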
9) Message boards : CMS Application : CMS VM errors out of the blue (Message 49796)
Posted 8 days ago by computezrmle
Post:
This is what has most likely happened:

CMS had no tasks for a while.
At restart your computer got a couple of them and started all concurrently.
This caused a race condition in Virtualbox while it attached the virtual disk.

Now the VirtualBox media registry is in an inconsistent state and needs to be cleaned up manually.


1. Ensure no vbox task is currently running
2. Not a must, but stop BOINC to avoid it starting any vbox tasks while you do the next steps
3. Use the account running BOINC to open the VirtualBox Media Manager
4. Remove the affected disk entry (here: CMS_2022_09_07_prod.vdi) and its children; do NOT remove the parent vdi file when asked!
5. Restart BOINC
6. Start 1 (only 1!) fresh CMS task and wait until it has registered its disk
7. Start other CMS tasks (even concurrently)
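Steps 3-4 can also be done on the command line via VBoxManage; this is a guarded sketch (a no-op where VirtualBox is not installed), and the UUIDs on your system will differ.

```shell
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage list hdds   # locate the affected disk and its child entries
  # VBoxManage closemedium disk <child-UUID>   # deregister a child entry
  # Do NOT delete the parent CMS_2022_09_07_prod.vdi file itself.
else
  echo "VBoxManage not installed"
fi
```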
10) Message boards : CMS Application : Does anyone else never restart your machine running CMS? (Message 49794)
Posted 8 days ago by computezrmle
Post:
VirtualBox apps keep complex configuration problems away from volunteers who can't deal with them.
That's just one reason why CMS doesn't distribute a native app.
Another one is that they run on various platforms.

Native apps don't always solve the problems.
Your Linux computer returned 4 valid CMS tasks (vbox!) but did not return a single valid Theory native task for 3 days.
Instead it failed (so far) 283 of them.
All because you didn't install (and configure) a local CVMFS client.
And the reason is written in every log:
14:53:31 CDT -05:00 2024-03-17: cranky-0.1.4: [ERROR] Can't find 'cvmfs_config'.
14:53:31 CDT -05:00 2024-03-17: cranky-0.1.4: [ERROR] This usually means a local CVMFS client is not installed
14:53:31 CDT -05:00 2024-03-17: cranky-0.1.4: [ERROR] although it is a MUST to get data from online repositories.



As for your CMS errors on Windows:
Nearly all within the last few days have been caused by a temporary system outage at CERN.
But your total CPU time "wasted" during this period sums up to less than 1 h.
Far from the 80+ hrs you claim.
11) Message boards : Theory Application : Problem of the day (Message 49789)
Posted 8 days ago by computezrmle
Post:
Just killed another rogue phojet eating up >60GB RAM.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=408012071

Theory_2687-2528715-1157_1
21:49:00 CET +01:00 2024-03-19: cranky: [INFO] mcplots runspec: boinc pp jets 13000 430 - phojet 1.12a default 100000 1157
12) Message boards : Theory Application : Problem of the day (Message 49788)
Posted 9 days ago by computezrmle
Post:
Today I had another phojet task continuously eating up all available RAM.
So far >60 GB RAM within less than 30 min runtime.

runRivet.log shows >35000 lines like this:
0 events processed

and very few lines like this:
Rivet.AnalysisHandler: WARN Sub-event weight list has 2000 elements: are the weight numbers correctly set in the input events?
13) Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs (Message 49781)
Posted 11 days ago by computezrmle
Post:
This explains why I got only singlecore jobs here and on -dev this afternoon, although the VMs were all configured to run 4 cores.
14) Questions and Answers : Getting started : Output File Absent Ubuntu 23.10 (Message 49735)
Posted 20 days ago by computezrmle
Post:
Is the user account running BOINC a member of the group "vboxusers"?


If yes, check the steps below.


Check your BOINC systemd unit file for this entry:
ProtectSystem=strict


If this is set, it is too strict.

Do this:
1. stop BOINC
2. create an override file "/etc/systemd/system/boinc-client.service.d/override.conf" containing the following 2 lines:
[Service]
ProtectSystem=full
3. run "sudo systemctl daemon-reload"
4. start BOINC and try the next VirtualBox task
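Steps 1-4 can be sketched as a small shell sequence; it only touches the system when run as root under a live systemd, otherwise it just prints the override file it would write.

```shell
OVR_DIR=/etc/systemd/system/boinc-client.service.d
if [ -d /run/systemd/system ] && [ "$(id -u)" -eq 0 ]; then
  systemctl stop boinc-client
  mkdir -p "$OVR_DIR"
  printf '[Service]\nProtectSystem=full\n' > "$OVR_DIR/override.conf"
  systemctl daemon-reload
  systemctl start boinc-client
else
  # Dry run: show the override file content.
  printf '[Service]\nProtectSystem=full\n'
fi
```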



If this doesn't succeed, post your complete boinc-client.service file.
15) Questions and Answers : Getting started : Output File Absent Ubuntu 23.10 (Message 49733)
Posted 20 days ago by computezrmle
Post:
From your logs:
Most likely, the VirtualBox COM server is not running or failed to start.
This is a problem with VirtualBox, not a problem caused by this project.


looks like it's working now!

All of a sudden?
Other volunteers having the same problem may be interested in how you solved it.
Could you post what you did?
16) Questions and Answers : Getting started : Output File Absent Ubuntu 23.10 (Message 49730)
Posted 20 days ago by computezrmle
Post:
I checked the box next to "Should LHC@home show your computers on its web site?"

OK.


I'm not sure exactly what you need

Your computer(s) are still shown as hidden.
If you attached the computer in question via an account manager like Science United, disconnect it from the account manager, connect it directly to this project, and run a task.
Then your computer(s) will be visible as links under your account.
That's what I ask for before I continue.
17) Questions and Answers : Getting started : Output File Absent Ubuntu 23.10 (Message 49728)
Posted 20 days ago by computezrmle
Post:
A local CVMFS installation is not required if you run VirtualBox apps.

Instead, make your computer visible to other volunteers in your project prefs and post a link to it.
If this is not possible because it is attached via Science United, disable that first and attach the computer directly.
18) Message boards : ATLAS application : hits file upload fails immediately (Message 49700)
Posted 23 days ago by computezrmle
Post:
Let them finish and upload the smaller logfile.
When the huge logfile upload gets stuck, cancel that upload (only the upload, not the task!).
That way you may get credits for the task (worked for 2 of them from my hosts that got stuck recently).

The scientific work gets lost but may be rescheduled by the backend systems.
19) Message boards : ATLAS application : hits file upload fails immediately (Message 49688)
Posted 25 days ago by computezrmle
Post:
Did some tests to find out whether an upload size limit exists.
It does.
:-(


Looks like files > 1024 MB do not upload to lhcathome-upload.cern.ch.

Still unclear whether the limit is set
- at the project server
- at the client side, e.g. hardwired or implicitly a libcurl limit

Since there will not be a quick solution in any case, tasks producing an upload file > 1024 MB are lost and should be cancelled.



As for the Squid workaround mentioned in other posts:
client_request_buffer_max_size xyz MB

During the tests the value xyz was set to 100.
Nonetheless, files larger than that but < 1024 MB uploaded fine.

Only if the option is not set in squid.conf do uploads via Squid get stuck.
Looks like the option just needs to be there.

Squid version used: v6.6 on Linux
Other Squid versions (especially 5.x) may behave differently.
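The workaround boils down to a one-line squid.conf fragment; 100 MB is just the value used in the tests above.

```
# squid.conf (tested with v6.6): the mere presence of this directive
# made uploads work, even for files larger than the configured value.
client_request_buffer_max_size 100 MB
```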
20) Message boards : ATLAS application : hits file upload fails immediately (Message 49686)
Posted 25 days ago by computezrmle
Post:
What makes the project's downloads faster is the local cache Squid provides for HTTP objects.
Its proxy function comes on top automatically.

SOCKS proxies usually don't cache anything; does yours?
If not, you may try a Squid between your clients and the SOCKS proxy.

Either this:
Internet <-> local Router <-> SOCKS <-> Squid <-> local clients

Or this:
Internet <-> local Router <-> Squid <-> local clients




©2024 CERN