ATLAS application : Another batch of faulty WUs?
Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0
I'm starting to see a number of WUs terminating early and giving a "Validate error" in the results. Examples include:
- Workunit 65970381
- Workunit 65971624
- Workunit 65971430

each of which has the common parameters:
- name includes the text string ..Su7Ccp2YYBZmABFKDmABFKDm3INKDm..
- taskID = 10995533

and all of which are terminating early (anything from 10 to 30 minutes of elapsed run-time). As these are a relatively new batch of WUs (created around 10:00 UTC today) and I haven't had many wingmen report results yet, I don't know whether this is "just me" or a symptom of another batch of faulty WUs.

Having said all of the above, I have also had some successes with WUs bearing these parameters, which suggests it isn't necessarily a completely faulty batch and that maybe some other factors are involved (although my machine is generally stable, so I don't believe the fault lies in my hardware/software set-up).

Is anyone else seeing the same or similar behaviour with WUs having these parameters?

Dave
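For anyone who wants to check whether tasks from this batch are currently sitting on their own host, a minimal sketch along these lines may help. It assumes a Linux client with the default Debian/Ubuntu BOINC data directory (/var/lib/boinc-client) and uses the name fragment quoted above as the batch identifier; adjust both to match your own setup.

```bash
#!/usr/bin/env bash
# List local tasks whose names contain the common name fragment of this batch.
# BOINC_DIR is the default Debian/Ubuntu data directory; adjust if yours differs.
BOINC_DIR=/var/lib/boinc-client
PATTERN='Su7Ccp2YYBZmABFKDmABFKDm3INKDm'

# Task and workunit names are stored in <name> tags inside client_state.xml.
grep -o "<name>[^<]*${PATTERN}[^<]*</name>" "${BOINC_DIR}/client_state.xml" \
    | sed 's/<[^>]*>//g' \
    | sort -u
```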
Joined: 14 Jan 10 Posts: 1446 Credit: 9,709,772 RAC: 763
From the same batch, I had one successful task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136702475
Joined: 18 Dec 15 Posts: 1862 Credit: 130,872,641 RAC: 102,842
I, too, had such a task this morning; it errored out after 17 minutes: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136754666
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
I had a few from the same batch that went through with no problem:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=136760061
https://lhcathome.cern.ch/lhcathome/result.php?resultid=136759824
https://lhcathome.cern.ch/lhcathome/result.php?resultid=136759670

We are the product of random evolution.
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This task has failed for more than five users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=66206901
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This task has failed for four users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=97252597
Joined: 24 Oct 04 Posts: 1195 Credit: 61,995,969 RAC: 83,411
https://lhcathome.cern.ch/lhcathome/result.php?resultid=198796123

Axel, yours is the one that ran for 22 hours with 2 cores, computed for most of that time, and then started having a VirtualBox crash. On June 15th we were having problems with these ATLAS tasks, but after that they started returning Valids again. I know it is better if they crash right away like they did on those other computers, since wasting 22 hours with the 2 cores can be a _______, but when it is just the one task and you then get back to running Valids it makes it a bit better.

It's hard for that computer to keep up with your Ryzen, which is running nicely. I see the credits are getting closer to average, but there's nothing wrong with getting extra credits with that machine!
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
Ah... you are also here. Yes, not a lucky situation. BTW, I found yesterday that in VirtualBox, Ctrl+T ("Strg+T" on a German keyboard) can reboot a VM after it has crashed. That has helped for one VM so far. Sometimes you have no luck, and then bad luck comes on top of it.
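If the shortcut does not help, a crashed BOINC VM can also be reset from the command line with VBoxManage. A minimal sketch, assuming the VM name is copied from the output of `VBoxManage list vms`; the name used below is only a placeholder:

```bash
#!/usr/bin/env bash
# Show all registered VMs and which of them are currently running.
VBoxManage list vms
VBoxManage list runningvms

# Hard-reset a specific VM; "boinc_atlas_vm" is a placeholder name,
# copy the real one from the list above.
VBoxManage controlvm "boinc_atlas_vm" reset

# If the VM is completely stuck, power it off instead and let the
# BOINC client handle the failed task:
# VBoxManage controlvm "boinc_atlas_vm" poweroff
```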
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This WU has errored for three users: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=99518942

Number four has just finished this WU. I was too quick with the message, sorry. Oh, it was a Ryzen, too.
Joined: 15 Jun 08 Posts: 2628 Credit: 267,291,999 RAC: 129,263
[quote]Number four has just finished this WU. I was too quick with the message, sorry.[/quote]
It also failed, but with error 65.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=203063922

The recent batch obviously needs more RAM for 1-core and 2-core setups.
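For anyone who wants to see how much memory their ATLAS VM is actually being given, the running VMs can be inspected with VBoxManage while a task is active. A minimal sketch, assuming nothing beyond VirtualBox being the hypervisor in use:

```bash
#!/usr/bin/env bash
# Print the name and configured memory size of every running VirtualBox VM.
# ATLAS vbox tasks show up here while they are being computed.
VBoxManage list runningvms | sed -E 's/.*\{(.*)\}/\1/' | while read -r uuid; do
    VBoxManage showvminfo "${uuid}" --machinereadable | grep -E '^(name|memory)='
    echo
done
```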
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This task has failed for more than three volunteers:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=114857980
Joined: 12 Jun 18 Posts: 126 Credit: 53,906,164 RAC: 0
I'm seeing a lot of failed ATLAS WUs today. I can't see why, so I thought I'd reinstall the wrapper packages, or whatever they're called:

wget https://github.com/singularityware/singularity/releases/download/2.6.0/singularity-2.6.0.tar.gz
tar xvf singularity-2.6.0.tar.gz
cd singularity-2.6.0
sudo apt install libarchive-dev
./configure --prefix=/usr/local
make
sudo make install

wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
sudo dpkg -i cvmfs-release-latest_all.deb
rm -f cvmfs-release-latest_all.deb
sudo apt-get update
sudo apt-get install cvmfs
sudo apt install glibc-doc open-iscsi watchdog

sudo wget https://lhcathomedev.cern.ch/lhcathome-dev/download/default.local -O /etc/cvmfs/default.local
sudo cvmfs_config setup
# note: with "sudo echo ... > file" the redirect runs without root privileges; "echo ... | sudo tee file" avoids that
sudo echo "/cvmfs /etc/auto.cvmfs" > /etc/auto.master.d/cvmfs.autofs
sudo systemctl restart autofs
cvmfs_config probe

I noticed this message that I haven't seen before:

alice.cern.ch: Unloading Fuse module
atlas-condb.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Waiting for the delivery of SIGUSR1...
alice.cern.ch: Re-Loading Fuse module
atlas-condb.cern.ch: Re-Loading Fuse module
atlas.cern.ch: Re-Loading Fuse module
cernvm-prod.cern.ch: Re-Loading Fuse module
grid.cern.ch: Re-Loading Fuse module
sft.cern.ch: Re-Loading Fuse module
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.
Reload CRASHED! CernVM-FS mountpoints unusable.

I also see that cvmfs is being upgraded. How do we civilians know when we need to upgrade the ATLAS code???

The following packages will be upgraded:
  cvmfs
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Also, how do we know we have the latest correct version of singularity??? I have 2.6.0 now.
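On the two questions: both can usually be checked locally with the standard apt, cvmfs and singularity client tools. A minimal sketch, assuming the Debian/Ubuntu packages installed above; this is not project-specific guidance, just the generic checks:

```bash
#!/usr/bin/env bash
# Installed versions of the two components.
singularity --version
dpkg -l cvmfs | tail -n 1

# Pending package upgrades, e.g. a newer cvmfs release waiting in the repo.
apt list --upgradable 2>/dev/null | grep -i cvmfs

# Sanity-check the local CernVM-FS setup and the repositories ATLAS needs.
sudo cvmfs_config chksetup
cvmfs_config probe atlas.cern.ch atlas-condb.cern.ch grid.cern.ch

# After a crashed reload, killing the mount helpers and wiping the cache
# often brings the mountpoints back without a reboot.
sudo cvmfs_config killall
sudo cvmfs_config wipecache
cvmfs_config probe
```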
Joined: 28 Sep 04 Posts: 759 Credit: 53,689,956 RAC: 42,267
I see a lot of failed jobs on the Grafana graphics around 16:00 yesterday. What happened at that time? There were also some download errors yesterday, but for me they happened later. I assume that the time shown on Grafana is either UTC or CET.
Joined: 15 Jun 08 Posts: 2628 Credit: 267,291,999 RAC: 129,263
[quote]I assume that the time shown on Grafana is either UTC or CET.[/quote]
Grafana shows the timeframe near the top right corner of the page. At least in my browser, ATLAS pages are shown in UTC and CMS pages in local time.
Joined: 28 Sep 04 Posts: 759 Credit: 53,689,956 RAC: 42,267
Ah, so it does. UTC for me also.

[edit]Too bad that all the data disappeared behind a 'Bad Gateway (502)' error[/edit]
Joined: 6 Jul 17 Posts: 22 Credit: 29,430,354 RAC: 0
Currently I'm getting more than 80% faulty ATLAS WUs; they stop working after a few minutes. With such big download sizes, that's not really funny. Is it a problem with the WU data set or with the LHC infrastructure?
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10488512&offset=0&show_names=0&state=0&appid=14
or
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10493650&offset=0&show_names=0&state=0&appid=14
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This WU has failed for all users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=144139052
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295
This WU has failed for all users:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=144480648
Joined: 15 Jun 08 Posts: 2628 Credit: 267,291,999 RAC: 129,263
Looks like we got a faulty batch or backend system. There's an increasing number of tasks reporting this kind of error:

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>HXRNDmUgIhxn9Rq4apoT9bVoABFKDmABFKDmLQdXDmABFKDm0yWkAn_EVNT.22646322._000379.pool.root.1</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>

@David Cameron
Be so kind as to investigate.
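A way to check from the volunteer side whether the server really refuses to serve that input file: a minimal sketch, assuming the download URL for the file is still recorded in client_state.xml on an affected host (the data directory path is the Debian/Ubuntu default) and using the file name quoted in the error above.

```bash
#!/usr/bin/env bash
# Look up the download URL(s) BOINC recorded for the failing input file,
# then ask the server for the HTTP status without downloading the file.
BOINC_DIR=/var/lib/boinc-client
FILE='HXRNDmUgIhxn9Rq4apoT9bVoABFKDmABFKDmLQdXDmABFKDm0yWkAn_EVNT.22646322._000379.pool.root.1'

grep -o 'http[^<"]*'"${FILE}"'[^<"]*' "${BOINC_DIR}/client_state.xml" | sort -u |
while read -r url; do
    echo "checking ${url}"
    curl -sS -o /dev/null -I -w '%{http_code}\n' "${url}"
done
```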