Message boards : Theory Application : Theory Failure Ratio Explodes

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2425
Credit: 227,491,123
RAC: 130,153
Message 48574 - Posted: 19 Sep 2023, 7:03:08 UTC

The overall Theory failure ratio rose to 100% this morning:
http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
ID: 48574

Anton
Joined: 26 Nov 10
Posts: 11
Credit: 1,435,923
RAC: 0
Message 48575 - Posted: 19 Sep 2023, 11:35:15 UTC - in response to Message 48574.  

Hello!
There was an update to the new version of scientific software yesterday.
The update has issues, and this is the reason for the failures. I am now looking for a solution and will post an update once it is fixed.
Thank you for the notice!
ID: 48575

Aurum
Joined: 12 Jun 18
Posts: 126
Credit: 53,906,164
RAC: 26,149
Message 48578 - Posted: 19 Sep 2023, 12:13:50 UTC - in response to Message 48575.  

> There was an update to the new version of scientific software yesterday.

Can't you remove the defective WUs from the work server queue or must they all fail several times each?
ID: 48578

Erich56
Joined: 18 Dec 15
Posts: 1691
Credit: 104,607,242
RAC: 100,490
Message 48582 - Posted: 19 Sep 2023, 14:38:30 UTC - in response to Message 48578.  

> > There was an update to the new version of scientific software yesterday.
>
> Can't you remove the defective WUs from the work server queue or must they all fail several times each?

Obviously, the faulty tasks were removed.
So this is the first time, as far back as I can remember, that no tasks from any LHC subproject are available :-(
ID: 48582

Erich56
Joined: 18 Dec 15
Posts: 1691
Credit: 104,607,242
RAC: 100,490
Message 48583 - Posted: 19 Sep 2023, 19:01:44 UTC

New tasks were sent out, and again they were faulty.

I wonder why such a batch is not tested before it is distributed???
ID: 48583

maeax
Joined: 2 May 07
Posts: 2121
Credit: 159,926,969
RAC: 70,085
Message 48584 - Posted: 19 Sep 2023, 19:16:24 UTC - in response to Message 48583.  

Erich56, this is the easy answer.
Have you ever worked in IT?
I have.
ID: 48584

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 810
Credit: 654,497,622
RAC: 259,773
Message 48585 - Posted: 20 Sep 2023, 6:15:48 UTC

It's still bad for me.
ID: 48585

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2425
Credit: 227,491,123
RAC: 130,153
Message 48586 - Posted: 20 Sep 2023, 6:34:35 UTC
Last modified: 20 Sep 2023, 7:21:13 UTC

I recently started a bunch of fresh tasks.
So far, only 2 have failed within 20 s with status code 1.
All others either succeeded very quickly (status code 0) or are running fine.

<edit>
Meanwhile, the failure rate is increasing again.

Theory revision 2390 -> runs fine
Theory revision 2638 -> fails
</edit>
ID: 48586

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,859,718
RAC: 1,434
Message 48588 - Posted: 20 Sep 2023, 7:23:52 UTC

Hi,

My Theory jobs are still failing (hostID=10834815).

I see different errors:
runRivet
Setting environment...
grep: /etc/redhat-release: No such file or directory
./runRivet.sh: line 33: /cvmfs/sft.cern.ch/lcg/releases/LCG_102b_ATLAS_28/../gcc/11.3.0/x86_64-slc6/setup.sh: No such file or directory
ERROR: fail to set environment (gcc)

or

make: *** [yoda2flat-split.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile yoda2flat-split

or

just hangs at
Running job shoud appear here.
[INFO] Container 'runc' finished with status code 1.

When I shutdown the VM it reports job as finished (which is strange)...

Is there any special procedure in place to recover from this?

Best regards.
ID: 48588

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2425
Credit: 227,491,123
RAC: 130,153
Message 48589 - Posted: 20 Sep 2023, 7:34:33 UTC - in response to Message 48588.  

> Is there any special procedure in place to recover from this?

You can't do anything.
The errors are caused by deeper-level scientific scripts.
The developers are already aware and are working on a solution.


> When I shutdown the VM it reports job as finished (which is strange)...

Not really strange, since from BOINC's perspective (higher level) the tasks don't fail.
Nonetheless, it might be a good idea to stop requesting fresh work until the problem is solved, since those very short runtimes will sooner or later confuse BOINC's work fetch algorithm (as well as its credit calculation).
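
Stopping work fetch can also be scripted through BOINC's command-line tool. The following is only a minimal sketch: it assumes `boinccmd` is on the PATH, that the client is attached under the project URL shown, and that no GUI RPC password is needed. The function names are illustrative, not part of BOINC.

```python
import subprocess

# Assumption: the URL under which the BOINC client is attached to LHC@home.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def work_fetch_command(project_url, allow):
    """Build the boinccmd call that allows or blocks new work for one project."""
    # "nomorework" and "allowmorework" are boinccmd project operations.
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def set_work_fetch(project_url, allow):
    """Run the command; raises CalledProcessError if boinccmd reports failure."""
    subprocess.run(work_fetch_command(project_url, allow), check=True)
```

Calling `set_work_fetch(PROJECT_URL, allow=False)` corresponds to "No new tasks" in the BOINC Manager; `allow=True` reverses it once the problem is solved.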
ID: 48589

maeax
Joined: 2 May 07
Posts: 2121
Credit: 159,926,969
RAC: 70,085
Message 48590 - Posted: 20 Sep 2023, 11:03:51 UTC - in response to Message 48588.  

Hi broz69,
is it possible to make your PCs visible to us volunteers? (See the LHCatHome preferences.)
ID: 48590

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,859,718
RAC: 1,434
Message 48591 - Posted: 20 Sep 2023, 15:37:59 UTC - in response to Message 48590.  

They're visible now.
ID: 48591

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,859,718
RAC: 1,434
Message 48592 - Posted: 20 Sep 2023, 15:40:23 UTC - in response to Message 48591.  
Last modified: 20 Sep 2023, 15:45:02 UTC

At the moment both my Windows computers are set to "No new tasks".

Only Linux is running (native). Linux only has Theory_2390 jobs.
ID: 48592

maeax
Joined: 2 May 07
Posts: 2121
Credit: 159,926,969
RAC: 70,085
Message 48593 - Posted: 20 Sep 2023, 15:49:57 UTC - in response to Message 48591.  

Thank you for making them visible.
I am using VirtualBox 7.0.6 with BOINC 7.24.1 from boinc.berkeley.edu.
I don't know whether 7.0.10 causes problems.
You can test without Squid to see if there is a conflict.
ID: 48593

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 810
Credit: 654,497,622
RAC: 259,773
Message 48595 - Posted: 20 Sep 2023, 16:31:12 UTC

Linux seems to be working again on my computers; Windows is still rocky.

As CM said, I doubt it's anything to do with BOINC or VirtualBox.
ID: 48595

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,859,718
RAC: 1,434
Message 48596 - Posted: 20 Sep 2023, 20:42:52 UTC
Last modified: 20 Sep 2023, 20:45:17 UTC

Hi,

It's as computezrmle wrote: 2638 jobs fail and 2390 jobs are OK.
In the morning, the jobs were not all 2390:
Theory_2638 - 35 failed
Theory_2637 - 26 failed
Theory_2636 - 24 failed
Theory_2390 - 1 OK

In the evening I got some Theory jobs, all of them 2390.
Theory_2390-1109174-576, Theory_2390-1100306-576, Theory_2390-1140982-576, Theory_2390-1099685-576 finished OK without proxy. But so did others with proxy. So it doesn't seem to be a problem with the proxy, VirtualBox, or BOINC.
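
A per-revision tally like the one above can be produced from a list of task results. This is a minimal sketch, assuming task names follow the pattern `Theory_<revision>-<number>-<number>` (inferred only from the names in this post); `revision_of` and `tally_by_revision` are illustrative names.

```python
import re
from collections import Counter

# Assumed task-name pattern, inferred from names such as "Theory_2390-1109174-576".
TASK_NAME = re.compile(r"^Theory_(\d+)-\d+-\d+$")


def revision_of(task_name):
    """Return the revision number embedded in a task name, or None."""
    match = TASK_NAME.match(task_name.strip())
    return int(match.group(1)) if match else None


def tally_by_revision(results):
    """Map revision -> (ok, failed) from (task_name, succeeded) pairs."""
    ok, failed = Counter(), Counter()
    for task_name, succeeded in results:
        revision = revision_of(task_name)
        if revision is None:
            continue  # skip names that do not match the assumed pattern
        (ok if succeeded else failed)[revision] += 1
    return {rev: (ok[rev], failed[rev]) for rev in sorted(set(ok) | set(failed))}


# The four evening tasks named in this post, all reported as finished OK.
evening = [
    ("Theory_2390-1109174-576", True),
    ("Theory_2390-1100306-576", True),
    ("Theory_2390-1140982-576", True),
    ("Theory_2390-1099685-576", True),
]
print(tally_by_revision(evening))  # {2390: (4, 0)}
```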

So I'll just wait until the people at LHC find a solution.

Thanks.
ID: 48596

Erich56
Joined: 18 Dec 15
Posts: 1691
Credit: 104,607,242
RAC: 100,490
Message 48638 - Posted: 23 Sep 2023, 7:29:26 UTC - in response to Message 48586.  
Last modified: 23 Sep 2023, 7:30:39 UTC

> Theory revision 2390 -> runs fine
> Theory revision 2638 -> fails

The 2638 tasks still come in once in a while. It would have been nice if they had been sorted out ...
ID: 48638

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2425
Credit: 227,491,123
RAC: 130,153
Message 48640 - Posted: 23 Sep 2023, 7:46:16 UTC - in response to Message 48638.  

They will be sorted out automatically, but it will take some time.
Better this way than to cancel tasks in progress (see ATLAS).
ID: 48640


©2024 CERN