Thread 'Another batch of faulty WUs?'

Author	Message
Dave Peachey Send message Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0	Message 30069 - Posted: 26 Apr 2017, 21:42:52 UTC Last modified: 26 Apr 2017, 21:51:10 UTC
I'm starting to see a number of WUs terminating early and giving a "Validate error" in the results. Examples include:
- Workunit 65970381
- Workunit 65971624
- Workunit 65971430
each of which has the common parameters:
- name includes the text string ..Su7Ccp2YYBZmABFKDmABFKDm3INKDm..
- taskID = 10995533
and all of which are terminating early (anything from 10 to 30 minutes elapsed run-time).
As these are a relatively new batch of WUs (created around 10:00 UTC today) and few, if any, of my wingmen have reported results yet, I don't know whether this is "just me" or a symptom of another batch of faulty WUs.
Having said all of the above, I have also had some successes with WUs bearing these parameters, which suggests it isn't necessarily a completely faulty batch and that some other factors may be involved (although my machine is generally stable, so I don't believe the fault is in the hardware/software set-up).
Is anyone else seeing the same or similar behaviour with WUs having these parameters?
Dave
ID: 30069 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1563 Credit: 10,136,688 RAC: 1,023	Message 30072 - Posted: 27 Apr 2017, 6:02:45 UTC - in response to Message 30069. From the same batch, I had one successful: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136702475 ID: 30072 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1996 Credit: 164,987,273 RAC: 71,601	Message 30076 - Posted: 27 Apr 2017, 9:29:05 UTC Last modified: 27 Apr 2017, 9:58:19 UTC I, too, had such a task this morning, it errored out after 17 minutes: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136754666 ID: 30076 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 30096 - Posted: 27 Apr 2017, 21:08:19 UTC I had a few from the same batch that went through with no problem: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136760061 https://lhcathome.cern.ch/lhcathome/result.php?resultid=136759824 https://lhcathome.cern.ch/lhcathome/result.php?resultid=136759670 We are the product of random evolution. ID: 30096 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 30443 - Posted: 20 May 2017, 13:00:06 UTC This task is failing for more than five users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=66206901 ID: 30443 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 31169 - Posted: 29 Jun 2017, 5:49:45 UTC - in response to Message 30443. This task is failing for more than five users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=72503000 ID: 31169 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 35597 - Posted: 21 Jun 2018, 6:06:52 UTC Last modified: 21 Jun 2018, 6:07:31 UTC This task is failing for four users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=97252597 ID: 35597 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1325 Credit: 102,096,365 RAC: 145,413	Message 35599 - Posted: 21 Jun 2018, 7:57:43 UTC - in response to Message 35597. https://lhcathome.cern.ch/lhcathome/result.php?resultid=198796123 And yours is the one that ran 22 hours with 2 cores; it did run for most of that time and then started having a VB crash, Axel. On June 15th we were having problems with these ATLAS tasks, but after that they started running Valids again. I know it is better if they crash right away, like those other computers did, since wasting 22 hours with the 2 cores can be a _______, but when it is just the one task and then you get back to running Valids, it makes it a bit better. It is hard for that computer to keep up with your Ryzen, which is running nicely, and I see the credits are running closer to average, but there is nothing wrong with getting extra credits with that machine! ID: 35599 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 35602 - Posted: 21 Jun 2018, 8:45:26 UTC Ah... you are also here. Yes, not a lucky situation. By the way, yesterday I found that in VirtualBox, Strg+T (Ctrl+T) reboots a VM if it has crashed. It has helped for one VM so far. Sometimes you have no luck, and then bad luck comes on top of it. ID: 35602 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 36016 - Posted: 23 Jul 2018, 19:32:13 UTC Last modified: 23 Jul 2018, 19:39:44 UTC This WU has ended in error for three users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=99518942 Number four finished this WU just now. I was too quick with the message, sorry. Oh, it was a Ryzen, too. ID: 36016 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2768 Credit: 308,258,975 RAC: 116,070	Message 36017 - Posted: 23 Jul 2018, 20:07:18 UTC - in response to Message 36016. Number four finished this WU in the moment. Was too fast with the message, sorry. Oh, was a Ryzen, too. It also failed but with error 65. https://lhcathome.cern.ch/lhcathome/result.php?resultid=203063922 The recent batch obviously needs more RAM for 1 core and 2 core setups. ID: 36017 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 38982 - Posted: 26 May 2019, 18:13:25 UTC This task is failing for more than three volunteers: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=114857980 ID: 38982 · Reply Quote

Aurum Send message Joined: 12 Jun 18 Posts: 142 Credit: 57,421,670 RAC: 1	Message 39288 - Posted: 5 Jul 2019, 16:24:51 UTC
I'm seeing a lot of failed ATLAS WUs today. I can't see why, so I thought I'd reinstall the wrapper packages, or whatever they're called:
wget https://github.com/singularityware/singularity/releases/download/2.6.0/singularity-2.6.0.tar.gz ; tar xvf singularity-2.6.0.tar.gz ; cd singularity-2.6.0 ; sudo apt install libarchive-dev ; ./configure --prefix=/usr/local ; make ; sudo make install
wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb ; sudo dpkg -i cvmfs-release-latest_all.deb ; rm -f cvmfs-release-latest_all.deb ; sudo apt-get update ; sudo apt-get install cvmfs ; sudo apt install glibc-doc open-iscsi watchdog
sudo wget https://lhcathomedev.cern.ch/lhcathome-dev/download/default.local -O /etc/cvmfs/default.local ; sudo cvmfs_config setup ; echo "/cvmfs /etc/auto.cvmfs" | sudo tee /etc/auto.master.d/cvmfs.autofs > /dev/null ; sudo systemctl restart autofs ; cvmfs_config probe
I noticed this message that I haven't seen before:
alice.cern.ch: Unloading Fuse module
atlas-condb.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Re-Loading Fuse module
atlas-condb.cern.ch: Re-Loading Fuse module
atlas.cern.ch: Re-Loading Fuse module
cernvm-prod.cern.ch: Re-Loading Fuse module
grid.cern.ch: Re-Loading Fuse module
sft.cern.ch: Re-Loading Fuse module
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
I also see that cvmfs is being upgraded. How do we civilians know when we need to upgrade ATLAS code???
The following packages will be upgraded: cvmfs
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Also, how do we know we have the latest correct version of singularity??? I have 2.6.0 now. ID: 39288 · Reply Quote
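
Editor's note: for anyone comparing their own output with the reload messages quoted in the post above, here is a minimal Python sketch that counts how many repositories were re-loaded and how many "Reload CRASHED!" lines followed. The line breaks in `RELOAD_LOG` are an assumption (the forum flattened the original output), and `summarise_reload` is a hypothetical helper name, not part of CernVM-FS.

```python
import re

# Reload output as quoted in the post; the line breaks are assumed.
RELOAD_LOG = """\
alice.cern.ch: Unloading Fuse module
atlas-condb.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Re-Loading Fuse module
atlas-condb.cern.ch: Re-Loading Fuse module
atlas.cern.ch: Re-Loading Fuse module
cernvm-prod.cern.ch: Re-Loading Fuse module
grid.cern.ch: Re-Loading Fuse module
sft.cern.ch: Re-Loading Fuse module
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
"""


def summarise_reload(log):
    """Return (repositories that were re-loaded, number of crash lines)."""
    reloaded = re.findall(r"^(\S+): Re-Loading Fuse module$", log, re.MULTILINE)
    crashes = len(re.findall(r"^Reload CRASHED!", log, re.MULTILINE))
    return reloaded, crashes


reloaded, crashes = summarise_reload(RELOAD_LOG)
print(f"{len(reloaded)} repositories re-loaded, {crashes} crash messages")
for repo in reloaded:
    print(" ", repo)
```

In the quoted output the two counts match, which would be consistent with one crash message per repository, though the log itself does not say which message belongs to which repository.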

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 817 Credit: 66,932,764 RAC: 28,530	Message 39684 - Posted: 22 Aug 2019, 7:45:14 UTC I see a lot of failed jobs on the Grafana graphics around 16:00 yesterday. What happened at that time? There are also some download errors yesterday, but for me they happened later. I assume that the time shown on Grafana is either UTC or CET. ID: 39684 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2768 Credit: 308,258,975 RAC: 116,070	Message 39685 - Posted: 22 Aug 2019, 9:12:36 UTC - in response to Message 39684. I assume that the time shown on Grafana is either UTC or CET. Grafana shows the timeframe near the top right corner of the page. At least on my browser ATLAS pages are shown in UTC, CMS pages are shown in local time. ID: 39685 · Reply Quote
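
Editor's note: to illustrate why the UTC versus local time question matters for the "around 16:00 yesterday" reading, here is a small Python sketch. It assumes the spike was on 21 Aug 2019 (the day before the question was posted) and that "CET" in August effectively means CEST, i.e. UTC+2; both are assumptions, not statements from the thread. Fixed offsets are used so the sketch does not depend on a time zone database.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: Central European Summer Time, UTC+2, applies in August.
CEST = timezone(timedelta(hours=2), "CEST")

# Reading 1: the Grafana axis is in UTC.
spike_if_utc = datetime(2019, 8, 21, 16, 0, tzinfo=UTC)
print("16:00 UTC  =", spike_if_utc.astimezone(CEST).strftime("%H:%M %Z"))

# Reading 2: the Grafana axis is in Central European local time.
spike_if_cest = datetime(2019, 8, 21, 16, 0, tzinfo=CEST)
print("16:00 CEST =", spike_if_cest.astimezone(UTC).strftime("%H:%M %Z"))
```

The two readings differ by two hours, which is enough to matter when lining up a spike of failed jobs against the timestamps in a local BOINC log.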

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 817 Credit: 66,932,764 RAC: 28,530	Message 39687 - Posted: 22 Aug 2019, 9:41:37 UTC - in response to Message 39685. Last modified: 22 Aug 2019, 9:43:48 UTC Ah, so it does. UTC for me also. [edit]Too bad that all the data disappeared for 'Bad Gateway (502)' error[/edit] ID: 39687 · Reply Quote

csbyseti Send message Joined: 6 Jul 17 Posts: 22 Credit: 29,430,354 RAC: 0	Message 39944 - Posted: 17 Sep 2019, 9:39:04 UTC Currently more than 80% of my ATLAS WUs are faulty; they stop working after a few minutes. With such big download sizes, that is not really funny. Is it a problem with the WU data set or with the LHC infrastructure? https://lhcathome.cern.ch/lhcathome/results.php?hostid=10488512&offset=0&show_names=0&state=0&appid=14 or https://lhcathome.cern.ch/lhcathome/results.php?hostid=10493650&offset=0&show_names=0&state=0&appid=14 ID: 39944 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 43289 - Posted: 28 Aug 2020, 6:09:52 UTC This WU is failing for all users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=144139052 ID: 43289 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 697	Message 43304 - Posted: 4 Sep 2020, 12:28:37 UTC This WU is failing for all users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=144480648 ID: 43304 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2768 Credit: 308,258,975 RAC: 116,070	Message 43450 - Posted: 29 Sep 2020, 21:06:14 UTC
Looks like we got a faulty batch or backend system. There's an increasing number of tasks reporting this kind of error:
[pre]<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>HXRNDmUgIhxn9Rq4apoT9bVoABFKDmABFKDmLQdXDmABFKDm0yWkAn_EVNT.22646322._000379.pool.root.1</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>[/pre]
@David Cameron Be so kind as to investigate.
ID: 43450 · Reply Quote
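
Editor's note: when checking many results for this kind of download error, the file name and error code can be pulled out of the stderr text with a short script. The following is a minimal Python sketch, not a BOINC tool; `extract_transfer_errors` is a hypothetical helper name, and the regular expression assumes the `<file_xfer_error>` layout shown in the post above. A regular expression is used rather than an XML parser because the payload sits inside a CDATA section and the excerpt has no single root element.

```python
import re

# Stderr excerpt as quoted in the post above.
STDERR_EXCERPT = """\
<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>HXRNDmUgIhxn9Rq4apoT9bVoABFKDmABFKDmLQdXDmABFKDm0yWkAn_EVNT.22646322._000379.pool.root.1</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>
"""

# Assumed layout: file name, then a numeric code followed by a reason in parentheses.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>.*?)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>.*?)\)</error_code>",
    re.DOTALL,
)


def extract_transfer_errors(stderr_text):
    """Return a list of (file_name, error_code, reason) tuples."""
    return [
        (m.group("name"), int(m.group("code")), m.group("reason"))
        for m in XFER_ERROR.finditer(stderr_text)
    ]


for name, code, reason in extract_transfer_errors(STDERR_EXCERPT):
    print(f"{code} ({reason}): {name}")
```

Grouping the extracted file names across several failed results would show whether the failures cluster on a few input files, which might help distinguish a faulty batch from a backend problem.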