Message boards : ATLAS application : Last days a lot of validate errors or No Hits file produced

ktamail666

Joined: 11 Jul 06
Posts: 6
Credit: 2,915,386
RAC: 1,464
Message 50968 - Posted: 30 Oct 2024, 12:57:01 UTC

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,824,076
RAC: 62,588
Message 50970 - Posted: 31 Oct 2024, 8:42:28 UTC

Around 50% of my ATLAS tasks are not valid; this is quite high compared with the normal range of errors.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50971 - Posted: 31 Oct 2024, 11:08:52 UTC - in response to Message 50970.  

Around 50% of my ATLAS tasks are not valid; this is quite high compared with the normal range of errors.
And the valid ones don't produce a valid HITS file, e.g. https://lhcathome.cern.ch/lhcathome/result.php?resultid=415615260
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,824,076
RAC: 62,588
Message 50972 - Posted: 31 Oct 2024, 12:44:44 UTC - in response to Message 50971.  

I didn't check the valid ones ;). In that case, it seems ATLAS is quite broken at the moment.
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 50973 - Posted: 31 Oct 2024, 15:10:20 UTC - in response to Message 50972.  

2024-10-31 12:36:42 (13544): Guest Log: Looking for outputfile HITS.41843707._006391.pool.root.1
2024-10-31 12:36:42 (13544): Guest Log: HITS file was successfully produced

For me no problems so far.

BOINC 8.0.2, VirtualBox 7.0.14
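
For anyone wanting to check many task logs at once, here is a minimal sketch. It assumes the two phrases in the excerpt above ("Looking for outputfile" and "HITS file was successfully produced") are worded the same way in your logs; the function name `hits_status` is only illustrative.

```python
import re

# Phrases copied from the log excerpt in this post; other task or pilot
# versions may word them differently (assumption).
SUCCESS_MARKER = "HITS file was successfully produced"
OUTPUT_PATTERN = re.compile(r"Looking for outputfile (HITS\.\S+)")


def hits_status(log_text: str) -> dict:
    """Report which HITS file the log looked for and whether it claims success."""
    match = OUTPUT_PATTERN.search(log_text)
    return {
        "expected_file": match.group(1) if match else None,
        "hits_produced": SUCCESS_MARKER in log_text,
    }


sample = """\
2024-10-31 12:36:42 (13544): Guest Log: Looking for outputfile HITS.41843707._006391.pool.root.1
2024-10-31 12:36:42 (13544): Guest Log: HITS file was successfully produced
"""
print(hits_status(sample))
```

A task that validated but has `hits_produced` set to False would be one of the "valid without HITS file" cases discussed in this thread.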
Saturn911

Joined: 3 Nov 12
Posts: 59
Credit: 142,189,853
RAC: 39,114
Message 50974 - Posted: 31 Oct 2024, 19:28:05 UTC - in response to Message 50973.  


For me no problems so far.


+1
HITS file was successfully produced
(native_mt) x86_64-pc-linux-gnu
Harri Liljeroos

Joined: 28 Sep 04
Posts: 732
Credit: 49,367,266
RAC: 17,281
Message 50975 - Posted: 31 Oct 2024, 22:10:02 UTC

If you follow the link to the original Grafana data from the Atlas jobs graph page, you'll find that Boinc_mcore is producing about 5% successful results. The rest are not valid.
Lem Novantotto

Joined: 24 May 23
Posts: 43
Credit: 2,624,143
RAC: 6,750
Message 50977 - Posted: 1 Nov 2024, 9:10:24 UTC - in response to Message 50968.  

With the native version I also got many valid but "No HITS result produced" results:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415553020
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415552887
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415552906
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415552965
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415553015
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415553016
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415553017
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415552861

As far as I can see:
CRITICAL | max running time (10000s) minus grace time (180s) has been exceeded - time to abort pilot

Apparently I've been having this same kind of issue since 26/27 Oct, with quite a few tasks running on my slow computer (where I also fake more CPUs than it actually has, so as to run more tasks concurrently and avoid dead time in the "starting up" and "finishing" stages). On my fast one, though, very few.
--
Bye
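
Read literally, the CRITICAL line quoted in this post means the pilot gives up once the run passes 10000 s minus 180 s. A minimal sketch of that arithmetic, assuming the message always has the wording shown (the helper names are illustrative only):

```python
import re

# Pattern for the pilot message quoted in this post; the wording comes from
# that single log line and may differ between pilot versions (assumption).
LIMIT_PATTERN = re.compile(
    r"max running time \((\d+)s\) minus grace time \((\d+)s\)"
)


def effective_limit_seconds(message: str) -> int:
    """Return max running time minus grace time, in seconds."""
    match = LIMIT_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not contain the expected limits")
    max_running, grace = (int(value) for value in match.groups())
    return max_running - grace


def as_hms(seconds: int) -> str:
    """Format a number of seconds as H:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


line = ("CRITICAL | max running time (10000s) minus grace time (180s) "
        "has been exceeded - time to abort pilot")
limit = effective_limit_seconds(line)
print(limit, as_hms(limit))  # 9820 2:43:40
```

So a task that cannot finish its payload in roughly 2 hours 44 minutes would hit this abort, which fits slow or oversubscribed hosts being affected more.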
PDW

Joined: 7 Aug 14
Posts: 27
Credit: 10,000,233
RAC: 238
Message 50979 - Posted: 1 Nov 2024, 10:29:45 UTC - in response to Message 50977.  

I haven't looked at all the logs, but the 20+ native tasks I have checked all had a HITS file.
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 50982 - Posted: 1 Nov 2024, 12:45:57 UTC - in response to Message 50977.  

Apparently I've been having this same kind of issue since 26/27 Oct, with quite a few tasks running on my slow computer (where I also fake more CPUs than it actually has, so as to run more tasks concurrently and avoid dead time in the "starting up" and "finishing" stages). On my fast one, though, very few.

You can change your preferences to run only Atlas.
You can also run just one Atlas task with, for example, 4 CPUs to see if it works.
Lem Novantotto

Joined: 24 May 23
Posts: 43
Credit: 2,624,143
RAC: 6,750
Message 50983 - Posted: 1 Nov 2024, 13:12:22 UTC - in response to Message 50982.  

You can change your preferences to run only Atlas.
You can also run just one Atlas task with, for example, 4 CPUs to see if it works.


When I checked last time, I had the impression things were already getting much better.

I'm running only Atlas on my slow pc (which is not too slow, however), when Atlas work is available.

IMHO the better option is to give each Atlas task as much CPU time and as many cores as possible, since this issue seems to be time-related.
--
Bye
Dark Angel

Joined: 7 Aug 11
Posts: 104
Credit: 25,221,969
RAC: 14,189
Message 51037 - Posted: 10 Nov 2024, 8:07:38 UTC

I'm up to 670+ invalid tasks with my modest hardware, all failing around the 1-minute mark. Others will have far more, with significant amounts of time and power wasted. Could someone perhaps stop loading these misconfigured units until things get fixed?
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 51039 - Posted: 10 Nov 2024, 12:10:44 UTC - in response to Message 51037.  

The best idea is to deselect this project until CERN IT has an answer.
I have stopped working on this project.
No idea how long these problems will last.
rob

Joined: 4 Mar 11
Posts: 29
Credit: 3,848,900
RAC: 13
Message 51042 - Posted: 10 Nov 2024, 16:34:27 UTC - in response to Message 51037.  

No need to disconnect from the project; simply select "No new tasks" and abort any tasks you have. Then sit back and wait until the project staff announce that the problem is solved.
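
For headless hosts, the same two steps can be done with `boinccmd` instead of the Manager. A minimal sketch that only builds and prints the command lines; it assumes `boinccmd` is on the PATH, that the project is attached under the URL below (check the URL your own client shows), and the task name is a placeholder.

```python
import shlex

# Assumed master URL; use whatever URL your client lists for the project.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_new_tasks_command(project_url: str = PROJECT_URL) -> list:
    """Command line equivalent to selecting "No new tasks" for one project."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def abort_task_command(task_name: str, project_url: str = PROJECT_URL) -> list:
    """Command line that aborts a single task by name."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


# EXAMPLE_TASK_NAME_0 is a placeholder, not a real task name.
commands = [no_new_tasks_command(), abort_task_command("EXAMPLE_TASK_NAME_0")]
for argv in commands:
    print(shlex.join(argv))  # printed only; nothing is executed here
```

To actually run one, pass the list to `subprocess.run(argv, check=True)`. Once the project staff give the all-clear, the `allowmorework` operation reverses the first command.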
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 51051 - Posted: 14 Nov 2024, 21:51:02 UTC

The same old story:

2024-11-14 22:40:37 (3440): Guest Log: *** Error codes and diagnostics ***
2024-11-14 22:40:37 (3440): Guest Log: "exeErrorCode": 65,
2024-11-14 22:40:37 (3440): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (8); Logfile error in log.EVNTtoHITS: \"Unable to identify specific exception\"",
2024-11-14 22:40:37 (3440): Guest Log: "pilotErrorCode": 1305,
2024-11-14 22:40:37 (3440): Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-11-14 21:40:01,625 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (8); Logfile error in log.EVNTtoHITS: \"Unable to identify specific exception\"",


https://lhcathome.cern.ch/lhcathome/result.php?resultid=416572140
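
To tally which error codes show up across many failed tasks, here is a minimal sketch that pulls the two numeric fields out of log text like the excerpt above. It assumes the `"exeErrorCode"` and `"pilotErrorCode"` lines keep this format; the function name is illustrative.

```python
import re

# Matches the two numeric fields shown in the excerpt in this post.
CODE_PATTERN = re.compile(r'"(exeErrorCode|pilotErrorCode)":\s*(\d+)')


def error_codes(log_text: str) -> dict:
    """Map each error-code field found in the log to its integer value."""
    return {name: int(value) for name, value in CODE_PATTERN.findall(log_text)}


sample = '''\
2024-11-14 22:40:37 (3440): Guest Log: *** Error codes and diagnostics ***
2024-11-14 22:40:37 (3440): Guest Log: "exeErrorCode": 65,
2024-11-14 22:40:37 (3440): Guest Log: "pilotErrorCode": 1305,
'''
print(error_codes(sample))  # {'exeErrorCode': 65, 'pilotErrorCode': 1305}
```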
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,202
RAC: 20,001
Message 51055 - Posted: 15 Nov 2024, 6:26:19 UTC - in response to Message 51051.  

I am surprised that no one over there has noticed by now that all the tasks sent out recently are faulty. How can this be?
microchip

Joined: 27 Jun 06
Posts: 8
Credit: 2,592,725
RAC: 2,080
Message 51057 - Posted: 15 Nov 2024, 11:24:58 UTC
Last modified: 15 Nov 2024, 11:25:24 UTC

My desktop successfully completed 2 ATLAS WUs today and 1 yesterday. Seems it's back OK now, no? https://lhcathome.cern.ch/lhcathome/results.php?hostid=10859153
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,202
RAC: 20,001
Message 51058 - Posted: 15 Nov 2024, 12:26:27 UTC - in response to Message 51057.  

...Seems it's back OK now, no?
except that according to the server status page there are no tasks available for download
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,202
RAC: 20,001
Message 51060 - Posted: 15 Nov 2024, 15:36:17 UTC - in response to Message 51058.  

...Seems it's back OK now, no?
except that according to the server status page there are no tasks available for download
I've been trying for a few hours now to get Atlas tasks - without success :-(
How did you manage to get tasks?
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,202
RAC: 20,001
Message 51069 - Posted: 16 Nov 2024, 7:23:57 UTC - in response to Message 51060.  

...Seems it's back OK now, no?
except that according to the server status page there are no tasks available for download
I've been trying for a few hours now to get Atlas tasks - without success :-(
How did you manage to get tasks?
After some time, one of my hosts did receive a task, and it worked well.