Message boards : ATLAS application : Bad WUs?

Harri Liljeroos
Joined: 28 Sep 04
Posts: 590
Credit: 33,916,711
RAC: 19,106
Message 45748 - Posted: 25 Nov 2021, 19:10:59 UTC - in response to Message 45747.  

I have some also. Here's one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=334410656 Probably more to come later.
ID: 45748

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45749 - Posted: 25 Nov 2021, 20:47:38 UTC - in response to Message 45748.  

I have some also. Here's one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=334410656 Probably more to come later.
Okay, I see. So these are most probably misconfigured WUs, and there is no reason for me to be afraid that something's wrong with my system.

The bad thing, though, is that such a bad WU can run for hours and hours, overnight, ... thus blocking a slot for nothing.
ID: 45749

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45750 - Posted: 26 Nov 2021, 6:13:37 UTC - in response to Message 45749.  

Overnight, there were some more of these faulty tasks :-(
ID: 45750

maeax
Joined: 2 May 07
Posts: 1589
Credit: 66,552,416
RAC: 238,488
Message 45751 - Posted: 26 Nov 2021, 7:25:49 UTC - in response to Message 45750.  

Something you could test: BOINC has an upgrade, for Windows only, from 7.16.11 to 7.16.20,
in case it is involved in these faulty ATLAS tasks.
I have also upgraded VirtualBox to 6.1.30.
I also saw some faulty tasks before on Win10pro and Win11pro.
ID: 45751

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45752 - Posted: 26 Nov 2021, 7:35:04 UTC - in response to Message 45751.  

Something you could test: BOINC has an upgrade, for Windows only, from 7.16.11 to 7.16.20,
in case it is involved in these faulty ATLAS tasks.
I have also upgraded VirtualBox to 6.1.30.
I also saw some faulty tasks before on Win10pro and Win11pro.
okay, I could try this, thanks for the hint.

What I am also seeing now are tasks that are faulty in a different way than before: they do use full CPU power (in contrast to the faulty tasks from before), but the VM console shows zero events processed for as long as the task is running.
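
The two failure patterns described here (a task that barely uses the CPU, and one that uses full CPU but never processes an event) could be told apart with a small check like the one below. This is only a sketch: the function name, the one-hour grace period and the 5% CPU threshold are assumptions for illustration, and how the elapsed time, CPU time and event count are obtained (for example, read by hand from BOINC Manager and the VM console) is left open.

```python
def classify_task(elapsed_s, cpu_s, events_done, cores=1,
                  min_elapsed_s=3600, idle_ratio=0.05):
    """Classify a running task as 'too_early', 'idle', 'no_events' or 'ok'.

    All thresholds are hypothetical defaults, not project settings.
    """
    # Give a freshly started task time to boot its VM before judging it.
    if elapsed_s < min_elapsed_s:
        return "too_early"
    # Share of the available CPU time that the task has actually used.
    cpu_ratio = cpu_s / (elapsed_s * cores)
    if cpu_ratio < idle_ratio:
        return "idle"        # runs for hours, but hardly uses the CPU
    if events_done == 0:
        return "no_events"   # full CPU use, yet zero events processed
    return "ok"


print(classify_task(8 * 3600, 54, 0))         # idle
print(classify_task(8 * 3600, 7 * 3600, 0))   # no_events
```

A task flagged as idle or no_events would still need a manual look before aborting it, since the thresholds are guesses.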
ID: 45752

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2027
Credit: 148,381,289
RAC: 117,711
Message 45753 - Posted: 26 Nov 2021, 7:50:55 UTC - in response to Message 45750.  

This example shows that the WU fails on Windows as well as on Linux (running native):

Erich's task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334527292

Corresponding WU:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=176712440

I checked a few other WUs, and all of them failed on all computers they were sent to.
Hence, it's more likely a faulty batch than a local issue.
ID: 45753

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45755 - Posted: 26 Nov 2021, 9:32:19 UTC

I keep getting WUs which fail for all kinds of reasons.
This one here has now failed after 6 minutes: https://lhcathome.cern.ch/lhcathome/result.php?resultid=334540111

So something seems to be going rather wrong with ATLAS at present.
ID: 45755

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45757 - Posted: 26 Nov 2021, 16:47:35 UTC

I am surprised that, although it has been known since this morning that there is obviously a batch of faulty WUs, no steps have been taken to get them removed or stopped.
ID: 45757

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45801 - Posted: 7 Dec 2021, 4:34:13 UTC - in response to Message 45747.  

Within the past few hours, I had several tasks in a row where CPU time was less than 1 minute, but the task kept running forever.
See here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334449376

if I interpret the stderr correctly, the problem was

2021-11-25 14:19:01 (27532): Guest Log: 00:00:10.004806 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 587 877 475 000ns (GuestNow=1 637 846 340 764 851 000 ns GuestLast=1 637 849 928 642 326 000 ns fSetTimeLastLoop=true )

Has anyone else had the same experience?
Last night, I had another task with:

2021-12-06 23:23:04 (16356): Guest Log: 00:00:10.018997 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 587 915 035 000ns (GuestNow=1 638 829 384 890 054 000 ns GuestLast=1 638 832 972 805 089 000 ns fSetTimeLastLoop=true )

CPU time was only 54 seconds, yet the task did not stop automatically; it ran all night, thus wasting a slot for nothing.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=335188584

What kind of failure is this "Radical guest time change" thing, and what causes it?
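
The numbers in those log lines can at least be checked with a few lines of Python. This is a minimal sketch, assuming the line format is exactly as pasted above; the function name parse_time_jump and the regular expression are made up for illustration and are not part of VirtualBox or BOINC. It shows that the reported change equals GuestNow minus GuestLast.

```python
import re

# Pattern for the vgsvcTimeSyncWorker message as pasted above. The log groups
# digits with spaces, so each capture group accepts digits and spaces.
PATTERN = re.compile(
    r"Radical guest time change:\s*(-?[\d ]+?)\s*ns\s*"
    r"\(GuestNow=([\d ]+?)\s*ns\s+GuestLast=([\d ]+?)\s*ns"
)


def parse_time_jump(line):
    """Return (delta_ns, guest_now_ns, guest_last_ns), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return tuple(int(g.replace(" ", "")) for g in m.groups())


line = ("Guest Log: 00:00:10.004806 timesync vgsvcTimeSyncWorker: "
        "Radical guest time change: -3 587 877 475 000ns "
        "(GuestNow=1 637 846 340 764 851 000 ns "
        "GuestLast=1 637 849 928 642 326 000 ns fSetTimeLastLoop=true )")

delta_ns, now_ns, last_ns = parse_time_jump(line)
print(delta_ns == now_ns - last_ns)          # True
print(round(delta_ns / 1e9 / 60, 1), "min")  # -59.8 min
```

So in this task the guest clock was set back by just under one hour shortly after the VM started; the second log line gives almost the same value. Why the jump is so close to one hour cannot be told from the log line alone.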
ID: 45801

maeax
Joined: 2 May 07
Posts: 1589
Credit: 66,552,416
RAC: 238,488
Message 45802 - Posted: 7 Dec 2021, 5:30:32 UTC - in response to Message 45801.  

what kind of failure is this "Radical guest time change" thing - what's the cause for it?

This is not essential.
When one out of hundreds of ATLAS tasks has this error and runs for several hours, we can live with it.
I also had one last week. We can only keep watching it over time, atm.
ID: 45802

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45803 - Posted: 7 Dec 2021, 11:28:45 UTC - in response to Message 45802.  

what kind of failure is this "Radical guest time change" thing - what's the cause for it?

When one out of hundreds of ATLAS tasks has this error and runs for several hours, we can live with it.
Unfortunately, the error rate is much higher than just 1 out of hundreds.
I had one last night, and I had the next one just now.
ID: 45803

maeax
Joined: 2 May 07
Posts: 1589
Credit: 66,552,416
RAC: 238,488
Message 45804 - Posted: 7 Dec 2021, 11:36:20 UTC - in response to Message 45803.  

I have one on Win11pro atm, with 2 cores and 13 hours of runtime. My faulty-task counter is now at TWO!
Hopefully it is the last and only one for this week :-).
ID: 45804

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45805 - Posted: 7 Dec 2021, 11:52:35 UTC - in response to Message 45804.  

I got the next one right now (so the third one today).

What I notice is that with these faulty tasks, the vm_image.vdi is only about 2.5GB in size, as opposed to about 3.3GB for the others.
Also, when trying to open the VM console, the localhost login does not work.

So these are the two characteristics of this kind of faulty task, and I now have no other choice than to check each time a new task starts :-(
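
The size check, at least, could be scripted instead of done by hand. The sketch below assumes that each running task keeps its vm_image.vdi in its own subdirectory of the BOINC slots directory, and it uses a cutoff of 3 000 000 000 bytes simply because that lies between the two sizes mentioned above; the function name and the cutoff are illustrative assumptions, not anything provided by BOINC or the project.

```python
from pathlib import Path


def find_small_images(slots_dir, cutoff_bytes=3_000_000_000):
    """Return (path, size_in_bytes) for each vm_image.vdi below the cutoff."""
    small = []
    # One subdirectory per slot is assumed, e.g. slots/0, slots/1, ...
    for vdi in sorted(Path(slots_dir).glob("*/vm_image.vdi")):
        size = vdi.stat().st_size
        if size < cutoff_bytes:
            small.append((vdi, size))
    return small


# Example call; the path is an assumption and depends on the installation:
# for path, size in find_small_images(r"C:\ProgramData\BOINC\slots"):
#     print(path, round(size / 1e9, 2), "GB")
```

A task that has only just started may also still have a small image, so a hit is a reason to look at the VM console rather than proof of a faulty task.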
ID: 45805

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 436
Credit: 117,893,361
RAC: 4,236
Message 45806 - Posted: 7 Dec 2021, 11:58:26 UTC

The rate of this failure has risen since yesterday to more than 10 for me :-(


Supporting BOINC, a great concept !
ID: 45806

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45807 - Posted: 7 Dec 2021, 15:48:39 UTC - in response to Message 45806.  

just now, I got the next one.
So this was the fourth one since last night. I am afraid more will follow :-(
ID: 45807

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 436
Credit: 117,893,361
RAC: 4,236
Message 45808 - Posted: 7 Dec 2021, 15:52:49 UTC - in response to Message 45807.  

just now, I got the next one.
So this was the fourth one since last night. I am afraid more will follow :-(

I'm sure more will follow!


ID: 45808

maeax
Joined: 2 May 07
Posts: 1589
Credit: 66,552,416
RAC: 238,488
Message 45809 - Posted: 7 Dec 2021, 16:27:10 UTC - in response to Message 45808.  

I have stopped ATLAS for Windows and have sent a PM to David.
ID: 45809

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 436
Credit: 117,893,361
RAC: 4,236
Message 45810 - Posted: 8 Dec 2021, 12:16:42 UTC

This morning I had to cancel more than 15 WUs that were hanging around.

Sorry, but it sucks !


ID: 45810

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 436
Credit: 117,893,361
RAC: 4,236
Message 45811 - Posted: 8 Dec 2021, 12:48:21 UTC

Okay, time for me to take a break from ATLAS.


ID: 45811

Erich56
Joined: 18 Dec 15
Posts: 1516
Credit: 46,504,379
RAC: 59,638
Message 45812 - Posted: 8 Dec 2021, 13:06:55 UTC - in response to Message 45810.  

This morning I had to cancel more than 15 WUs that were hanging around.
Sorry, but it sucks !
Indeed, at this point it's a waste of time and resources :-(
I am surprised that no one is stopping these faulty tasks.
ID: 45812


©2022 CERN