1) Message boards : ATLAS application : Extreme event processing times (Message 49707)
Posted 5 Mar 2024 by hadron
Post:
This morning I've several tasks running with the 'normal' 400 events,
but after some normal runtimes, I now have tasks with processing times for each separate event up to 6700 seconds.
Since the logging from ALT-F2 is still stuck, I've no idea of the average event runtime.

I can't access Alt-F2 on my system either. However, there may be a way to bypass that and still get all the info you need.
Are you running Linux on your system? If so, is BOINC running as a system service? If it is, I can give you instructions so you can access the same information using the VirtualBox Manager.
2) Message boards : Theory Application : file_xfer_error (Message 49697)
Posted 5 Mar 2024 by hadron
Post:
The default maximum run time for a Theory task is 10 days (the same as the deadline). After that it gets aborted automatically.

I have never had any Theory task fail because it ran into the maximum run time. I have had numerous tasks run for 9 days and around 22 or 23 hours, then fail for no apparent reason. In all instances, I do not recall total CPU time ever being more than one hour behind elapsed time; moreover, the tasks have always run to about 99.95% completion, only to fail with a "computation error".
Memory on this next bit is a little foggy, but I do believe the most common reason for failure has been "too many results".
3) Message boards : Theory Application : file_xfer_error (Message 49696)
Posted 5 Mar 2024 by hadron
Post:
After about 10 days of work, these seemed to fail on completion. I saw a few marked as failed and watched this one tick over to check, and sure enough it failed at 100%.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=406318361

It's giving a file_xfer_error?

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>Theory_2687-2527341-808_1_r686114138_result</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

If you look up a bit, you will find this:
2024-03-04 19:29:47 (2245072): Status Report: Job Duration: '864000.000000'
2024-03-04 19:29:47 (2245072): Status Report: Elapsed Time: '860427.000000'
2024-03-04 19:29:47 (2245072): Status Report: CPU Time: '4690.400000'


Note the CPU time. In 10 days, the CPU has been in use for barely one hour and 18 minutes.
Ordinarily, I think the CPU time should always lag elapsed time by less than one hour. After an hour, if CPU time doesn't increase very nearly as fast as elapsed time, personally I believe there is no point in keeping the task running -- just abort it and get another one.
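
To put the Status Report numbers in perspective, here is a small shell sketch that converts the reported seconds into days, hours and minutes, and works out CPU time as a share of elapsed time. The function names fmt_duration and cpu_share are made up for illustration; the input values are the ones from the log above.

```shell
#!/bin/sh
# Convert a seconds value (fractional part ignored) to "Nd HHh MMm SSs".
fmt_duration() {
    s=${1%.*}                      # strip the ".000000" part of the log value
    printf '%dd %02dh %02dm %02ds\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# CPU time as a percentage of elapsed time, to one decimal place.
cpu_share() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", 100 * cpu / wall }'
}

fmt_duration 860427.000000            # Elapsed Time from the log
fmt_duration 4690.400000              # CPU Time from the log
cpu_share 4690.400000 860427.000000   # CPU share in percent
```

For this task the elapsed time comes out at 9d 23h 00m 27s and the CPU time at 0d 01h 18m 10s, a CPU share of about 0.5%.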
4) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49662)
Posted 26 Feb 2024 by hadron
Post:
Read the kernel docs and the systemd docs.

You remind me of the guy in the bad old days of DOS and no internet, who steadfastly refused to document his code.
His rationale? "People who can't read code have no business running a computer."
5) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49636)
Posted 24 Feb 2024 by hadron
Post:
Especially on Linux a few cgroups tweaks via systemd can be set to ensure CPU cycles are not lost during an ATLAS setup but instead given to other running tasks.
This slightly slows down an individual task but increases the total throughput of the computer.

More detail, please.
6) Message boards : Theory Application : Latest errors on tasks (Message 49616)
Posted 22 Feb 2024 by hadron
Post:
In your tasks:
Command:
VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
When it cannot close the medium, it will also not open it. The problem is not with BOINC, but VirtualBox.
You have to use VirtualBox Manager to solve your problem. Follow the steps in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6112&postid=49574 but, instead of the CMS .vdi, remove Theory_2023_12_13.vdi. Don't delete the file itself.

That seems to have done it. When I looked in Tools/Media I found there were 2 orphaned Theory tasks listed under the .vdi file -- I assume those were the 2 child media the log referred to.
The Theory .vdi couldn't be removed from the VB manager until those were gone.
Thanks for the detailed reply. I would never have found this without it.
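
For anyone who prefers the command line to Tools/Media, the following is a rough sketch of how the child media might be found. It assumes that "VBoxManage list hdds" prints one block per disk with "UUID:", "Parent UUID:" and "Location:" lines; the helper name list_children is made up, and I have not checked this against every VirtualBox version. If BOINC runs as a system service, the VBoxManage commands probably need to be run as the boinc user.

```shell
#!/bin/sh
# Print the UUIDs of all media whose parent is the disk at path $1.
# Reads the output of `VBoxManage list hdds` on stdin (layout assumed, see above).
list_children() {
    awk -v parent="$1" '
        /^UUID:/        { uuid = $2 }
        /^Parent UUID:/ { puuid = $3 }
        /^Location:/    {
            loc = $0; sub(/^Location:[ \t]+/, "", loc)
            uuids[++n] = uuid; parents[n] = puuid
            if (loc == parent) want = uuid      # remember the parent disk UUID
        }
        END {
            for (i = 1; i <= n; i++)
                if (want != "" && parents[i] == want) print uuids[i]
        }'
}

# Hypothetical usage, not run here:
# sudo -u boinc VBoxManage list hdds |
#     list_children /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi
# Each UUID printed could then be unregistered with
#     sudo -u boinc VBoxManage closemedium <uuid>
# before closing the parent .vdi the same way (without deleting the file).
```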
7) Message boards : Theory Application : Latest errors on tasks (Message 49613)
Posted 22 Feb 2024 by hadron
Post:
I have set LHC to "No new tasks" until the current batch of CMS tasks are all finished. Then I'm going to remove LHC and add it back into my client, to see if that might get rid of the error. I'm not holding my breath, but who knows?

I reset the project, then tried a few Theory tasks. No luck, they still fail with the same error.
8) Message boards : Theory Application : Latest errors on tasks (Message 49611)
Posted 21 Feb 2024 by hadron
Post:
Yellow Triangle in Virtualbox program?
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media

However, look at the last entries in the VM trace log:

Command: VBoxManage -q unregistervm "boinc_be4d931171c5cdf6" --delete 
Exit Code: 0
Output:
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

18:26:06 (16682): called boinc_finish(-2135228404)

It looks like termination with extreme prejudice. I never see the tasks until they've been shut down and are waiting to be reported.
9) Message boards : Theory Application : Latest errors on tasks (Message 49608)
Posted 21 Feb 2024 by hadron
Post:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=219902109
194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Before me: 1 (0x00000001) Unknown error code
Before them: -2135228404 (0x80BB000C) Unknown error code

And then this (I am the first, it is being resent): https://lhcathome.cern.ch/lhcathome/result.php?resultid=406297248 Same thing...aborted by client

Really? Come on.....

You are getting the same error that I have been seeing for the past several days, on every Theory task:

VBoxManage.exe: error: Cannot attach medium 'D:\data\projects\lhcathome.cern.ch_lhcathome\Theory_2023_12_13.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later

See, for example, https://lhcathome.cern.ch/lhcathome/result.php?resultid=406303951

I have set LHC to "No new tasks" until the current batch of CMS tasks are all finished. Then I'm going to remove LHC and add it back into my client, to see if that might get rid of the error. I'm not holding my breath, but who knows?
10) Message boards : Theory Application : Problem of the day (Message 49580)
Posted 16 Feb 2024 by hadron
Post:
New error today. Nearly 30 of them, all reported just after 0900 UTC 16 Feb:

2024-02-16 03:19:00 (32167): Adding storage controller(s) to VM.
2024-02-16 03:19:00 (32167): Adding virtual disk drive to VM. (Theory_2023_12_13.vdi)
2024-02-16 03:19:05 (32167): Error in deregister parent vdi for VM: -2135228404
Command:
VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee nsISupports
VBoxManage: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

2024-02-16 03:19:05 (32167): Could not create VM
2024-02-16 03:19:05 (32167): ERROR: VM failed to start
2024-02-16 03:19:05 (32167): Powering off VM.
2024-02-16 03:19:05 (32167): Deregistering VM. (boinc_d0135c6cd87fd305, slot#12)
2024-02-16 03:19:05 (32167): Removing network bandwidth throttle group from VM.
2024-02-16 03:19:05 (32167): Removing VM from VirtualBox.


and then from the VM trace log:

2024-02-16 03:19:00 (32167): 
Command: VBoxManage -q storageattach "boinc_d0135c6cd87fd305" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 785 of file VBoxManageStorageController.cpp

2024-02-16 03:19:00 (32167): 
Command: VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Exit Code: -2135228404
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee nsISupports
VBoxManage: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

Note in particular the line that says "...can only be attached to machines that were created with VirtualBox 4.0 or later", which is very strange because I am running version 7.0.12
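
A side note on reading these logs: the negative exit codes and the hex codes in the "Details" lines are the same 32-bit value, once signed and once unsigned. A minimal shell sketch of the conversion (the function name to_hex32 is made up for illustration):

```shell
#!/bin/sh
# Print a signed 32-bit exit code as the unsigned hex value VBoxManage reports.
to_hex32() {
    # Masking with 0xFFFFFFFF keeps the low 32 bits, turning the negative
    # number into its unsigned two's-complement equivalent.
    printf '0x%08x\n' $(( $1 & 0xFFFFFFFF ))
}

to_hex32 -2135228404   # exit code of the closemedium command
to_hex32 -2135228409   # exit code of the storageattach command
```

This prints 0x80bb000c (VBOX_E_OBJECT_IN_USE) and 0x80bb0007 (VBOX_E_INVALID_OBJECT_STATE), matching the codes in the log above.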
11) Message boards : Theory Application : Problem of the day (Message 49535)
Posted 13 Feb 2024 by hadron
Post:
2024-01-23 05:30:32 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [INFO] Checking CVMFS.
2024-01-23 05:30:33 (21140): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2024-01-23 05:30:33 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.


The same is happening here, except that I am running everything under VBox.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=406100207
12) Message boards : CMS Application : Could not get X509 credentials (Message 49515)
Posted 11 Feb 2024 by hadron
Post:
Can this problem with getting a proxy credential from LHC be avoided by installing a local proxy server?
If so, how would one go about this in Linux?
13) Message boards : CMS Application : Could not get X509 credentials (Message 49504)
Posted 10 Feb 2024 by hadron
Post:
I've got 12 tasks with logs that look just like that.
14) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49401)
Posted 5 Feb 2024 by hadron
Post:
Take your statement and exchange sender/receiver to get:
"When CERN sends thousands of large files at full speed and only 2 recipients report occasional problems, then it's not a problem at the CERN end."

Do you see the problem?
Without deeper investigation, neither is valid.

Perhaps only 2 of us a) have noticed a problem, b) are interested in trying to resolve it, or c) have even noticed it.
In fact, I only became interested in this thread when I noticed that the finger was always being pointed at the home user's system or router, or at his/her ISP. My own experience shows that not to be the case, but you seem quite intent on ignoring or dismissing my real-life experience in favour of your explanation.

OK, I don't expect you to accept this point of view.
Instead I expect complaints about the "only 2".
Well, replace it with a few more, but compare that with the up to 7.32k jobs ATLAS was recently running concurrently via non-grid-BOINC as shown here.

I will start looking for a problem on my end, once you adequately address my observation that, at the same time as a large ATLAS task is coming down in BOINC at 100 KB/s or less, I can simultaneously grab a file of the same size, or larger, at 20 to 30 MB/s.

Note that I really don't care much about what happens after all the files for one task are downloaded and the task is ready to be started. The problem I've seen does not lie there; rather, the problem is with the initial download of the files necessary to run the task -- and it occurs with ATLAS tasks alone, even when all other network operations on my system are functioning normally.
15) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49396)
Posted 5 Feb 2024 by hadron
Post:
@computezrmle

I have enough download bandwidth for 35MB/s. When I see an ATLAS file coming in at 100KB/s at the same time as some non-BOINC download (say from Usenet or whatever) is coming in at 20-25MB/s, I know that there is nothing wrong with my end of the connections.
When 2 ATLAS tasks come in at the same time, and I watch one of them struggle to reach 100KB/s while the other comes in at 10MB/s, I know there is nothing wrong with my end of the connections.

This is not a problem at the user's end.
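
For anyone who wants numbers rather than impressions, here is a rough sketch of how two simultaneous downloads could be timed with curl. It relies on curl's "%{speed_download}" write-out variable (average bytes per second); the helper names and the URLs in the usage comment are placeholders, not real ATLAS locations.

```shell
#!/bin/sh
# Format a bytes-per-second figure as KB/s or MB/s (1 KB = 1024 bytes here).
fmt_rate() {
    awk -v bps="$1" 'BEGIN {
        if (bps >= 1048576) printf "%.1f MB/s\n", bps / 1048576
        else                printf "%.1f KB/s\n", bps / 1024
    }'
}

# Download a URL to /dev/null and report the average speed curl measured.
measure() {
    fmt_rate "$(curl -s -o /dev/null -w '%{speed_download}' "$1")"
}

# Hypothetical usage: start both at the same time and compare the results.
# measure "https://example.org/some-large-file" &
# measure "<URL of an ATLAS input file>" &
# wait
```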
16) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49382)
Posted 3 Feb 2024 by hadron
Post:
The problem most certainly does not lie outside CERN.

Just a statement. It is not valid evidence for blaming CERN either.
The relevant point is that a speed drop can happen anywhere between the connection endpoints.

It doesn't strike you as odd that the problem occurs _only_ with ATLAS tasks?
17) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49369)
Posted 3 Feb 2024 by hadron
Post:
I have had VDSL Hybrid for three months.

Since hybrid solutions (partly) use wireless connections this makes them less stable.
The speed promise is an up-to-promise which means the real speed can temporarily be much lower.
For years the ISP's boards have been full of complaints about hybrid speed dropping close to zero.

I live in western Canada, so a German ISP being responsible is nonsense.
I have cable -- coax link to the telephone pole in the back alley, then optical fiber all the way to CERN. I too see these ridiculously low speeds from time to time, and they only happen with ATLAS tasks.

The problem most certainly does not lie outside CERN.
18) Questions and Answers : Getting started : Broken Website FAQ's and GPU activation (Message 49324)
Posted 31 Jan 2024 by hadron
Post:
The FAQ page works fine here with Firefox.

Same here.
19) Questions and Answers : Unix/Linux : code erreur 1 (Message 49258)
Posted 24 Jan 2024 by hadron
Post:
Good evening, I just tried, but I think there is a syntax error.
Thank you.


pascal@pascal-MS-7D07:~$ sudo groups boinc
[sudo] Mot de passe de pascal :
boinc : boinc video render
pascal@pascal-MS-7D07:~$ sudo groupmod -a -U boinc vboxusers.
groupmod : option invalide -- 'a'
Utilisation : groupmod [options] GROUP

My apologies.
That should have been "usermod" not "groupmod".
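
Note that usermod also takes different options from groupmod. A minimal sketch of the corrected form, assuming the standard "usermod -a -G <group> <user>" syntax; the helper in_group is made up for illustration, and the user and group names are the ones from this thread:

```shell
#!/bin/sh
# Succeeds if user $1 is a member of group $2.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Hypothetical usage (needs root; not run here). -a appends, -G names the
# supplementary group, and the user name comes last, with no trailing period.
# in_group boinc vboxusers || sudo usermod -a -G vboxusers boinc
# in_group "$USER" boinc   || sudo usermod -a -G boinc "$USER"
```

New group memberships only take effect for sessions started afterwards, so log out and back in (and restart the BOINC service) after changing them.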
20) Questions and Answers : Unix/Linux : code erreur 1 (Message 49246)
Posted 23 Jan 2024 by hadron
Post:
Is user boinc a member of group vboxusers? Run

sudo groups boinc

to find out.
If not, run this to add boinc to vboxusers:

sudo groupmod -a -U boinc vboxusers.

Also, if you want to use the BOINC Manager GUI, make sure your user is a member of group boinc:

sudo groupmod -a -U <your_user_name> boinc

Then run

kdesu -u boinc /usr/bin/boincmgr

to start the manager.




©2024 CERN