Message boards : ATLAS application : Another batch of faulty WUs?

Dave Peachey

Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 30069 - Posted: 26 Apr 2017, 21:42:52 UTC
Last modified: 26 Apr 2017, 21:51:10 UTC

I'm starting to see a number of WUs terminating early and giving a "Validate error" in the results.

Examples include:
- Workunit 65970381
- Workunit 65971624
- Workunit 65971430
each of which has the common parameters:
- name includes text string ..Su7Ccp2YYBZmABFKDmABFKDm3INKDm..
- taskID = 10995533
and all of which are terminating early (anything from 10 to 30 minutes elapsed run-time).
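One way to check whether another failing task belongs to the same batch is to test its name for the common string. A minimal sketch in shell, assuming the task names are available as plain text; `BATCH_MARKER` and `in_batch` are hypothetical names used only for illustration, and the marker is the string quoted above with the elided `..` ends left out:

```shell
#!/bin/sh
# Hypothetical batch marker: the common string quoted in the post above,
# without the leading and trailing ".." (those are elisions, not part of the name).
BATCH_MARKER='Su7Ccp2YYBZmABFKDmABFKDm3INKDm'

# in_batch NAME: succeeds when NAME contains the marker as a literal substring.
in_batch() {
    case "$1" in
        *"$BATCH_MARKER"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print only those names passed as arguments that carry the marker.
for name in "$@"; do
    if in_batch "$name"; then
        printf '%s\n' "$name"
    fi
done
```

Run with task names as arguments, it prints only the ones from this batch; with no arguments it prints nothing.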

As this is a relatively new batch of WUs (created around 10:00 UTC today) and I haven't had any/many wingmen report results, I don't know whether this is "just me" or a symptom of another batch of faulty WUs.

Having said all of the above, I have also had some successes with WUs bearing these parameters so that would suggest it isn't necessarily a completely faulty batch and that maybe some other factors are involved (although my machine is generally stable so I don't believe the fault is in the hardware/software set-up).

Is anyone else seeing the same or similar behaviour with WUs having these parameters?

Dave
ID: 30069
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30072 - Posted: 27 Apr 2017, 6:02:45 UTC - in response to Message 30069.  

ID: 30072
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,382
RAC: 102,152
Message 30076 - Posted: 27 Apr 2017, 9:29:05 UTC
Last modified: 27 Apr 2017, 9:58:19 UTC

I, too, had such a task this morning; it errored out after 17 minutes:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=136754666
ID: 30076
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 30096 - Posted: 27 Apr 2017, 21:08:19 UTC

ID: 30096
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 30443 - Posted: 20 May 2017, 13:00:06 UTC

This task has not run successfully for more than five users:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=66206901
ID: 30443
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 31169 - Posted: 29 Jun 2017, 5:49:45 UTC - in response to Message 30443.  

This task has not run successfully for more than five users:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=72503000
ID: 31169
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 35597 - Posted: 21 Jun 2018, 6:06:52 UTC
Last modified: 21 Jun 2018, 6:07:31 UTC

This task has not run successfully for four users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=97252597
ID: 35597
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 35599 - Posted: 21 Jun 2018, 7:57:43 UTC - in response to Message 35597.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=198796123

And yours is the one that ran for 22 hours with 2 cores; it ran for most of that time and then had a VirtualBox crash, Axel.

On June 15th we were having problems with these ATLAS tasks, but after that they started running Valids again.

I know it is better if they crash right away, like those other computers did, since wasting 22 hours with the 2 cores can be a _______, but when it is just the one task and then you get back to running Valids, it makes it a bit better.

It is hard for that computer to keep up with your Ryzen, which is running nicely. I see the credits are running closer to average, but there is nothing wrong with getting extra credits with that machine!
ID: 35599
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 35602 - Posted: 21 Jun 2018, 8:45:26 UTC

Ah... you are here, too.
Yes, not a lucky situation.
BTW, yesterday I found that in VirtualBox, Ctrl+T reboots the VM if it has crashed.
It has helped for one VM so far.
Sometimes you have no luck, and then bad luck comes along on top of it.
ID: 35602
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 36016 - Posted: 23 Jul 2018, 19:32:13 UTC
Last modified: 23 Jul 2018, 19:39:44 UTC

This WU has ended in error for three users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=99518942
Number four has just finished this WU. I was too quick with the message, sorry.
Oh, it was a Ryzen, too.
ID: 36016
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,956,330
RAC: 136,952
Message 36017 - Posted: 23 Jul 2018, 20:07:18 UTC - in response to Message 36016.  

Number four has just finished this WU. I was too quick with the message, sorry.
Oh, it was a Ryzen, too.

It also failed but with error 65.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=203063922

The recent batch obviously needs more RAM for 1 core and 2 core setups.
ID: 36017
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 38982 - Posted: 26 May 2019, 18:13:25 UTC

This task has not run successfully for more than three volunteers:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=114857980
ID: 38982
Aurum

Joined: 12 Jun 18
Posts: 126
Credit: 52,457,949
RAC: 23,953
Message 39288 - Posted: 5 Jul 2019, 16:24:51 UTC

I'm seeing a lot of failed ATLAS WUs today. I can't see why, so I thought I'd reinstall the wrapper packages, or whatever they're called:
wget https://github.com/singularityware/singularity/releases/download/2.6.0/singularity-2.6.0.tar.gz ; tar xvf singularity-2.6.0.tar.gz ; cd singularity-2.6.0 ; sudo apt install libarchive-dev ; ./configure --prefix=/usr/local ; make ; sudo make install

wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb ; sudo dpkg -i cvmfs-release-latest_all.deb ; rm -f cvmfs-release-latest_all.deb ; sudo apt-get update ; sudo apt-get install cvmfs ; sudo apt install glibc-doc open-iscsi watchdog

sudo wget https://lhcathomedev.cern.ch/lhcathome-dev/download/default.local -O /etc/cvmfs/default.local ; sudo cvmfs_config setup ; echo "/cvmfs /etc/auto.cvmfs" | sudo tee /etc/auto.master.d/cvmfs.autofs > /dev/null ; sudo systemctl restart autofs ; cvmfs_config probe

I noticed this message that I haven't seen before.
alice.cern.ch: Unloading Fuse module
atlas-condb.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Re-Loading Fuse module
atlas-condb.cern.ch: Re-Loading Fuse module
atlas.cern.ch: Re-Loading Fuse module
cernvm-prod.cern.ch: Re-Loading Fuse module
grid.cern.ch: Re-Loading Fuse module
sft.cern.ch: Re-Loading Fuse module
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
I also see that cvmfs is being upgraded. How do we civilians know when we need to upgrade ATLAS code???
The following packages will be upgraded:
  cvmfs
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Also, how do we know we have the latest correct version of singularity??? I have 2.6.0 now.
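The thread does not say which versions ATLAS actually requires, but comparing an installed version against a chosen minimum can be sketched as follows. This is a minimal sketch, assuming GNU `sort` (for `-V`) is available; `version_ge` and `WANTED_SINGULARITY` are hypothetical names, and the value 2.6.0 is only a placeholder taken from the post above, not a known requirement:

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is at least version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder minimum (assumption): the thread does not state what ATLAS needs.
WANTED_SINGULARITY='2.6.0'

# Only query singularity when it is actually installed.
if command -v singularity >/dev/null 2>&1; then
    installed=$(singularity --version)
    installed=${installed##* }    # keep the last word if the output has a prefix
    installed=${installed%%-*}    # drop anything after a dash, if present
    if version_ge "$installed" "$WANTED_SINGULARITY"; then
        echo "singularity $installed is at least $WANTED_SINGULARITY"
    else
        echo "singularity $installed is older than $WANTED_SINGULARITY"
    fi
fi
```

For the cvmfs package on Debian or Ubuntu, `apt-cache policy cvmfs` shows the installed version next to the candidate the repository offers.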
ID: 39288
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 39684 - Posted: 22 Aug 2019, 7:45:14 UTC

I see a lot of failed jobs on the Grafana graphics around 16:00 yesterday. What happened at that time? There are also some download errors yesterday, but for me they happened later. I assume that the time shown on Grafana is either UTC or CET.
ID: 39684
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,956,330
RAC: 136,952
Message 39685 - Posted: 22 Aug 2019, 9:12:36 UTC - in response to Message 39684.  

I assume that the time shown on Grafana is either UTC or CET.

Grafana shows the timeframe near the top right corner of the page.
At least on my browser ATLAS pages are shown in UTC, CMS pages are shown in local time.
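Since one set of pages is in UTC and the other in local time, converting a UTC timestamp helps when lining up Grafana events with local logs. A minimal sketch, assuming GNU `date` (for `-d`); `utc_to_local` is a hypothetical helper name, and the example date assumes the 16:00 spike mentioned above was on 21 Aug 2019:

```shell
#!/bin/sh
# utc_to_local "YYYY-MM-DD HH:MM": print a UTC timestamp in the local time zone
# (taken from the TZ variable or the system setting). Uses GNU date's -d option.
utc_to_local() {
    date -d "$1 UTC" '+%Y-%m-%d %H:%M %Z'
}

# Example: the 16:00 UTC spike, shown in this machine's local time.
utc_to_local '2019-08-21 16:00'
```

Setting `TZ` before the call shows the same moment in another zone without changing system settings.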
ID: 39685
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 39687 - Posted: 22 Aug 2019, 9:41:37 UTC - in response to Message 39685.  
Last modified: 22 Aug 2019, 9:43:48 UTC

Ah, so it does. UTC for me also.
Edit: Too bad that all the data disappeared due to a 'Bad Gateway (502)' error.
ID: 39687
csbyseti

Joined: 6 Jul 17
Posts: 22
Credit: 29,430,354
RAC: 0
Message 39944 - Posted: 17 Sep 2019, 9:39:04 UTC

Currently more than 80% of my ATLAS WUs are faulty; they stop working after a few minutes. With such big download sizes, that is not really funny.

Is it a problem with the WU data set or a problem with the LHC infrastructure?

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10488512&offset=0&show_names=0&state=0&appid=14

or

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10493650&offset=0&show_names=0&state=0&appid=14
ID: 39944
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 43289 - Posted: 28 Aug 2020, 6:09:52 UTC

ID: 43289
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,147,301
RAC: 105,597
Message 43304 - Posted: 4 Sep 2020, 12:28:37 UTC

ID: 43304
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,956,330
RAC: 136,952
Message 43450 - Posted: 29 Sep 2020, 21:06:14 UTC

Looks like we got a faulty batch or backend system.
There's an increasing number of tasks reporting this kind of error:
<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>HXRNDmUgIhxn9Rq4apoT9bVoABFKDmABFKDmLQdXDmABFKDm0yWkAn_EVNT.22646322._000379.pool.root.1</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>

@David Cameron
Be so kind as to investigate.
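To count how many tasks hit this error, the file name and error code can be pulled out of saved stderr text. A minimal sketch, assuming the output has the same layout as the snippet above with one tag per line; `xfer_errors` is a hypothetical helper name:

```shell
#!/bin/sh
# xfer_errors: read stderr text on stdin and print one line per failed transfer,
# in the form "<file name><TAB><error code>".
# Assumes each <file_name> line is followed by its <error_code> line.
xfer_errors() {
    sed -n \
        -e 's:.*<file_name>\(.*\)</file_name>.*:\1:p' \
        -e 's:.*<error_code>\(.*\)</error_code>.*:\1:p' |
    paste - -
}
```

For example, `xfer_errors < stderr.txt` (where `stderr.txt` is a hypothetical file holding the saved output) lists each affected input file next to its error code.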
ID: 43450

