Message boards : Theory Application : New version v300.20

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 50078 - Posted: 29 Apr 2024, 14:32:41 UTC

This new version has an updated vboxwrapper and the images are cloned to ensure a unique ID.
ID: 50078
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50081 - Posted: 29 Apr 2024, 15:18:45 UTC - in response to Message 50078.  

<multiattach_vdi_file>Theory_2024_04_29_dev.xml</multiattach_vdi_file>
ID: 50081
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,367,266
RAC: 17,281
Message 50082 - Posted: 29 Apr 2024, 17:59:56 UTC

All seem to fail with error:
Command:
VBoxManage -q showhdinfo "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/Theory_2024_04_29_dev.xml" 
Output:
VBoxManage.exe: error: Could not find file for the medium 'C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome\Theory_2024_04_29_dev.xml' (VERR_FILE_NOT_FOUND)
VBoxManage.exe: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 179 of file VBoxManageDisk.cpp

2024-04-29 20:56:12 (21480): Could not create VM
2024-04-29 20:56:12 (21480): ERROR: VM failed to start
2024-04-29 20:56:12 (21480): Powering off VM.
2024-04-29 20:56:12 (21480): Deregistering VM. (boinc_106b2e9625f4f065, slot#5)
2024-04-29 20:56:12 (21480): Removing network bandwidth throttle group from VM.
2024-04-29 20:56:12 (21480): Removing VM from VirtualBox.

ID: 50082
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,009
RAC: 20,590
Message 50083 - Posted: 29 Apr 2024, 18:23:07 UTC

Same problem here.
ID: 50083
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 50085 - Posted: 29 Apr 2024, 19:44:12 UTC - in response to Message 50083.  

Sorry, I started from the wrong XML file. The new version, v300.30, will fix this.
ID: 50085
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50087 - Posted: 29 Apr 2024, 22:11:51 UTC

All the new Theory tasks are failing immediately on openSUSE Leap 15.5, and presumably on all earlier versions as well.
This appears to be because the new tasks require a version of glibc that is not yet available in the main repos (2.31 is the current version, while the tasks appear to require 2.34).
It looks like this situation will persist until Leap 15.6 is released in early June; that release should include glibc 2.38 -- at least, that is what I am reading in the 15.6 repos.

Anyone running openSUSE 15.5 or earlier has two options:
1) stop fetching Theory tasks until you have upgraded your system;
2) replace the OS with either the slowroll version (http://download.opensuse.org/slowroll/) or with Tumbleweed.
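
To check whether a given Linux host meets the glibc requirement described above, the installed version can be queried from Python. This is only a minimal sketch: the 2.34 threshold is the poster's estimate, not a confirmed project requirement, and `parse_version` and `meets_requirement` are hypothetical helper names.

```python
import platform

# Assumed minimum, taken from the post above; not a confirmed project requirement.
REQUIRED_GLIBC = (2, 34)

def parse_version(text):
    """Turn a version string such as '2.31' into a tuple of ints, e.g. (2, 31)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())

def meets_requirement(version_text, required=REQUIRED_GLIBC):
    """Return True if the given glibc version is at least the required one."""
    return parse_version(version_text) >= required

if __name__ == "__main__":
    # libc_ver() returns a (name, version) pair; both are empty strings if unknown.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"glibc {version}: sufficient = {meets_requirement(version)}")
    else:
        print("Could not determine the glibc version on this system.")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.9" as newer than "2.34".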
ID: 50087
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50169 - Posted: 11 May 2024, 23:55:48 UTC

For the past 90 minutes, all Theory tasks have failed on my system after 3 minutes.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=410987839
ID: 50169
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,266
Message 50170 - Posted: 12 May 2024, 7:47:38 UTC - in response to Message 50169.  

For the past 90 minutes, all Theory tasks have failed on my system after 3 minutes.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=410987839
Could you retry after you have deleted the Theory-vdi's from VirtualBox Media manager without removing them from disk?
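
For reference, the Media Manager step can also be done from the command line with `VBoxManage closemedium`. The sketch below only builds and prints the command. It assumes `VBoxManage` is on the PATH, the image path is a hypothetical placeholder, and VirtualBox may refuse the call if the image is still attached to a VM or has child images.

```python
import subprocess

def closemedium_command(vdi_path, delete_file=False):
    """Build the VBoxManage call that unregisters a disk image.

    Without --delete the image is only removed from VirtualBox's media
    registry; the file itself stays on disk.
    """
    command = ["VBoxManage", "closemedium", "disk", vdi_path]
    if delete_file:
        command.append("--delete")
    return command

def unregister(vdi_path):
    """Run the command; raises CalledProcessError if VBoxManage reports an error."""
    return subprocess.run(closemedium_command(vdi_path), check=True)

# Hypothetical path, for illustration only.
print(closemedium_command("/path/to/Theory_example.vdi"))
```

Leaving `delete_file` at its default matches the request above: the entry disappears from the Media Manager while the file remains in the BOINC project directory.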
ID: 50170
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50173 - Posted: 13 May 2024, 9:32:32 UTC - in response to Message 50170.  

For the past 90 minutes, all Theory tasks have failed on my system after 3 minutes.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=410987839
Could you retry after you have deleted the Theory-vdi's from VirtualBox Media manager without removing them from disk?

Did you mean to delete the Theory*.vdi's from the media manager? I had a task where the VM crashed, but it was still listed in the media manager. I deleted that, but left the Theory*.vdi entries alone.
I just had a bunch of Theory tasks fail with the same error, and one of them left an orphan task file in the media manager.
My primary interest in LHC is the CMS stuff anyway, so I'm probably not going to re-enable Theory tasks.
ID: 50173
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50174 - Posted: 13 May 2024, 12:19:20 UTC - in response to Message 50173.  

... My primary interest in LHC is the CMS stuff ...

For nearly a week CMS sent only 4-core jobs.
Since then each CMS task your computer ran did a basic setup, ran 2 short benchmarks and shut itself down without doing any scientific work.

To run those 4-core jobs the VM must be configured to allocate at least 4 cores (e.g. via the web prefs).
Your VMs report 1 core:
2024-05-13 05:24:06 (31720): Setting CPU Count for VM. (1)
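
The core count a VM was started with can be pulled out of a task log automatically. A minimal sketch, assuming the log line always has exactly the format quoted above (other wrapper versions may word it differently); `cpu_count_from_log` is a hypothetical helper name.

```python
import re

# Pattern modeled on the single log line quoted above; other versions may differ.
CPU_LINE = re.compile(r"Setting CPU Count for VM\. \((\d+)\)")

def cpu_count_from_log(log_text):
    """Return the last CPU count reported in the log, or None if no such line exists."""
    matches = CPU_LINE.findall(log_text)
    return int(matches[-1]) if matches else None

sample = "2024-05-13 05:24:06 (31720): Setting CPU Count for VM. (1)"
print(cpu_count_from_log(sample))  # prints 1
```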
ID: 50174
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50180 - Posted: 14 May 2024, 0:58:45 UTC - in response to Message 50174.  

... My primary interest in LHC is the CMS stuff ...

For nearly a week CMS sent only 4-core jobs.
Since then each CMS task your computer ran did a basic setup, ran 2 short benchmarks and shut itself down without doing any scientific work.

Could you then explain to me why, over the past week, my system has run and completed over 1000 CMS tasks, and received proper credit for same? Or is this observation about the previous week?

To run those 4-core jobs the VM must be configured to allocate at least 4 cores (e.g. via the web prefs).
Your VMs report 1 core:
2024-05-13 05:24:06 (31720): Setting CPU Count for VM. (1)

The last time I configured my LHC account for multi-core tasks was shortly after I joined. I had been receiving tasks from all 3 projects before I made that change; afterwards, all I got from LHC was multi-core ATLAS tasks.
Furthermore, the credit given for multi-core tasks is pathetic. Credit is awarded for total run-time, not CPU time.
Well, let's look at that. Suppose we have a task that will run on one core for 4 hours. Close enough for government work, the CPU time will also be 4 hours. Now suppose that same task runs on 4 cores. It will require only 1 hour of run time and will consume the same 4 hours of CPU time, yet it will be given only 1/4 the credit, even though it has consumed the same amount of computer resources as the first task. Where is the incentive to run 4-core tasks instead of 1-core tasks when the credit is only 1/4 as much?
Please do not mistake me for a credit whore, because I am not; I simply think that my computer resources should be worth the same no matter how much real time it takes to complete the work. Getting the finished results back to the people who want them in the shortest possible time is not the primary concern in any of this; in fact, it is of no concern to me except that it be returned before the task's allotted time expires.
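
The arithmetic in this post can be written out as a short calculation. This is a sketch only: it assumes credit is strictly proportional to wall-clock run time, as the post claims, and that work scales perfectly across cores. The rate of 1 credit per hour is an arbitrary placeholder, not the project's actual credit formula, and the function names are hypothetical.

```python
def run_time_hours(cpu_hours, cores):
    """Wall-clock hours, assuming the work scales perfectly across cores."""
    return cpu_hours / cores

def credit_by_run_time(cpu_hours, cores, rate_per_hour=1.0):
    """Credit if it is awarded per wall-clock hour (the post's description)."""
    return run_time_hours(cpu_hours, cores) * rate_per_hour

def credit_by_cpu_time(cpu_hours, rate_per_hour=1.0):
    """Credit if it were awarded per CPU hour instead."""
    return cpu_hours * rate_per_hour

# The 4 CPU-hour task from the post, run on 1 core and on 4 cores.
for cores in (1, 4):
    print(cores, credit_by_run_time(4.0, cores), credit_by_cpu_time(4.0))
```

Under these assumptions the 4-core run earns 1.0 by run time against 4.0 for the single-core run, while a CPU-time basis gives 4.0 in both cases.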
ID: 50180
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50182 - Posted: 14 May 2024, 7:34:52 UTC - in response to Message 50180.  

I already explained what your VMs did:
"... ran a basic setup, ran 2 short benchmarks and shut itself down ..."

That's why the runtime/CPU time is so low.
Real job processing takes 2-6 h (averages).
Compare that to computers running 4-core VMs.

The short runtime pattern is typical for a VM not getting a job via WMAgent.
The same pattern can be seen when the backend queue is empty.

BOINC credits are given for valid envelopes no matter whether the VM processed a scientific job or not.
BOINC simply does not understand the various return codes from the deeper script levels.
Hence those are mostly hidden and a "success" is reported.


In addition:
- Multicore is new for CMS, hence some backend settings need to be tested/adjusted.
- Credit issues have been discussed multiple times for years. Find related posts and try to understand them.
ID: 50182
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50183 - Posted: 14 May 2024, 12:21:49 UTC - in response to Message 50182.  

I already explained what your VMs did:
"... ran a basic setup, ran 2 short benchmarks and shut itself down ..."

That's why the runtime/CPU time is so low.
Real job processing takes 2-6 h (averages).
Compare that to computers running 4-core VMs.

The short runtime pattern is typical for a VM not getting a job via WMAgent.
The same pattern can be seen when the backend queue is empty.

BOINC credits are given for valid envelopes no matter whether the VM processed a scientific job or not.
BOINC simply does not understand the various return codes from the deeper script levels.
Hence those are mostly hidden and a "success" is reported.

Please assume you are talking to someone with zero understanding of the inner workings of Boinc and VBox. That "ran a basic setup...." bit was so obscure to me as to be meaningless.
Your latest seems to be suggesting that it is somehow my fault that the VM was not getting a job via WMAgent. I fail to see how that is even possible. I was set up to receive single-core tasks, that is what I got, and they ran as long as they ran. I should not (could not?) have been receiving 4-core tasks that somehow got run in only 1 thread.

OK, so I set my preferences to 4 CPU tasks; it took a lot of fiddling to get my client to realize it had nothing from LHC in the job queue (it kept telling me I didn't need anything from this site) -- basically I had to turn off both Einstein and Rosetta -- and even then I had to turn off Atlas and Theory before I got any CMS tasks. Now I have CMS tasks running on 4 threads -- but that is 1/3 of the total capability of my machine, and I still have Einstein and Rosetta to bring back in. It looks like I will be pretty much stuck with 2, maybe 3, CMS tasks tops -- and that doesn't even take into consideration that I might wish to bring Atlas and Theory back into the picture.
I sure hope it doesn't take too long before the client gets it all sorted out -- ATM, the CMS tasks have been running for about an hour, and the client is telling me they still have almost 17 hours to go before completion.

In addition:
- Multicore is new for CMS, hence some backend settings need to be tested/adjusted.
- Credit issues have been discussed multiple times for years. Find related posts and try to understand them.

I hardly have the time to go searching through years of threads in multiple forums to find relevant threads, even if I start guessing at what search phrases *might* possibly find relevant material.
HOWEVER... this is hardly an issue I am going to pursue any further. The only important thing right now is to get my system doing the projects I want to run.
ID: 50183
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50205 - Posted: 20 May 2024, 1:27:26 UTC

Hello-ooooo? Is anyone looking at this stuff? Preferably someone who can fix it?
Theory tasks are still failing after only 3 minutes. This one is from a week ago:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411034415

And this one is from a few minutes ago:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411228620
ID: 50205
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 50206 - Posted: 20 May 2024, 2:20:06 UTC - in response to Message 50205.  

VBoxManage: error: Could not find a bandwidth group named 'boinc_372f2cc2be2d23c7_net'
Average upload rate 6525877.03 KB/sec
Average download rate 7079.84 KB/sec
Can you take a look at the networking for this VM?
ID: 50206
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50207 - Posted: 20 May 2024, 2:39:07 UTC - in response to Message 50206.  

VBoxManage: error: Could not find a bandwidth group named 'boinc_372f2cc2be2d23c7_net'
Average upload rate 6525877.03 KB/sec
Average download rate 7079.84 KB/sec
Can you take a look at the networking for this VM?

And how would I go about doing that? That VM was removed from my system an hour ago.
ID: 50207
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 50208 - Posted: 20 May 2024, 2:46:38 UTC - in response to Message 50207.  

I have no idea what is going wrong with your network.
You can limit network parameters in openSUSE, for example.
ID: 50208
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50209 - Posted: 20 May 2024, 3:05:59 UTC - in response to Message 50208.  

I have no idea what is going wrong with your network.
You can limit network parameters in openSUSE, for example.

There's absolutely nothing wrong with my network.
The error is about one or more files that are not present, not about networking errors.
ID: 50209
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 50210 - Posted: 20 May 2024, 5:42:29 UTC - in response to Message 50209.  

Post the output of
mount |grep shm ; ls -hal /dev/shm/

Then prepare for a reboot.
Then reboot.
ID: 50210
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 12,040
Message 50211 - Posted: 20 May 2024, 6:20:55 UTC - in response to Message 50210.  

Post the output of
mount |grep shm ; ls -hal /dev/shm/

Then prepare for a reboot.
Then reboot.

Why?
ID: 50211

©2024 CERN