Message boards : ATLAS application : Validate error on all tasks, and short run time with 1 core only

Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42638 - Posted: 27 May 2020, 22:44:08 UTC
Last modified: 27 May 2020, 22:47:52 UTC

Atlas used to work on all my computers. I just upgraded two of them with extra RAM. They're dual Xeon machines which now have 36GB instead of 20GB of RAM. But I'm getting this: https://lhcathome.cern.ch/lhcathome/results.php?userid=55945&offset=0&show_names=0&state=5&appid=14 - all the tasks are only using 1 CPU core instead of 8, and finishing within 15 minutes, then causing a validate error. Any way to find out what's wrong? All I changed was adding more RAM (which has been tested by Memtest).

Looking at one of the task logs, I see this from https://lhcathome.cern.ch/lhcathome/result.php?resultid=275499071:

*****
2020-05-27 23:12:03 (3224): Guest Log: *** Error codes and diagnostics ***

2020-05-27 23:12:03 (3224): Guest Log: "exeErrorCode": 65,

2020-05-27 23:12:03 (3224): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: \"DetectorStore FATAL in sysInitialize(): standard std::exception is caught\"",

2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorCode": 1165,

2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorDiag": "Local output file is missing",
*****
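
For anyone who wants to pull these fields out of a task log without reading it line by line, here is a minimal sketch. It assumes the log lines keep the `Guest Log: "key": value,` shape shown in the excerpt above; the function name `extract_diagnostics` and the regular expression are illustrative only and are not part of BOINC or the ATLAS software.

```python
import json
import re

# Assumed line shape, taken from the excerpt above:
#   <timestamp> (<pid>): Guest Log: "<key>": <value>,
# Lines without a quoted key after "Guest Log:" are ignored.
LINE_RE = re.compile(r'Guest Log:\s*"(?P<key>\w+)":\s*(?P<value>.+?),?\s*$')


def extract_diagnostics(log_text):
    """Collect the "key": value pairs found in Guest Log lines (hypothetical helper)."""
    fields = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        raw = match.group("value")
        try:
            # Values look like JSON scalars (numbers or quoted strings).
            fields[match.group("key")] = json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to the raw text if the value is not valid JSON.
            fields[match.group("key")] = raw
    return fields


if __name__ == "__main__":
    sample = (
        '2020-05-27 23:12:03 (3224): Guest Log: "exeErrorCode": 65,\n'
        '2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorCode": 1165,\n'
    )
    for key, value in extract_diagnostics(sample).items():
        print(key, value)
```

Comparing the extracted `pilotErrorDiag` text across several failed tasks is a quick way to see whether they all fail for the same reason.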
ID: 42638
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42639 - Posted: 27 May 2020, 23:45:18 UTC - in response to Message 42638.  

Same problem on my other computer, which has not changed apart from upgrading VirtualBox and its extensions to the latest version.

I shall cease Atlas tasks until someone tells me what's happened.
ID: 42639
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,928,310
RAC: 137,674
Message 42645 - Posted: 28 May 2020, 6:07:31 UTC - in response to Message 42639.  

... until someone tells me what's happened.

Just look around and read other posts.
The answer might already be there:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5438&postid=42630
ID: 42645
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 42652 - Posted: 28 May 2020, 13:00:58 UTC - in response to Message 42645.  

As you can read in the above link, there was a major database outage at CERN yesterday evening which affected the BOINC servers and pretty much all of ATLAS' distributed computing services. Unfortunately, among the last things to come back were the Frontier database servers, which the ATLAS tasks read data from while they are running. So although we were able to submit tasks here, they would all fail straight away. Things should be working OK now. Sorry for the inconvenience.
ID: 42652
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42654 - Posted: 28 May 2020, 14:03:01 UTC - in response to Message 42645.  
Last modified: 28 May 2020, 14:27:59 UTC

... until someone tells me what's happened.

Just look around and read other posts.
The answer might already be there:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5438&postid=42630


I did search first, but the search function in these forums isn't the best, and I was looking for a specific Atlas problem. It never dawned on me to link this problem to the outage I noticed when I couldn't even get on the forums yesterday.

As you can read in the above link, there was a major database outage at CERN yesterday evening which affected the BOINC servers and pretty much all of ATLAS' distributed computing services. Unfortunately, among the last things to come back were the Frontier database servers, which the ATLAS tasks read data from while they are running. So although we were able to submit tasks here, they would all fail straight away. Things should be working OK now. Sorry for the inconvenience.


No problem. I just thought I was doing something wrong, and I didn't want to send back hundreds of useless results.

And now processing with 8 cores per Atlas task, so I assume all is ok.
ID: 42654
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 42822 - Posted: 10 Jun 2020, 8:18:13 UTC

During the night and this morning I've had quite a lot of invalid tasks (>25). All those tasks have failed with other hosts as well. The error for all of them seems to be 'Error: Service 'control' failed to initialize: VERR_INVALID_PARAMETER'.

Here's one of them: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277054761
ID: 42822
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 42835 - Posted: 11 Jun 2020, 17:23:33 UTC
Last modified: 11 Jun 2020, 18:04:05 UTC

For the past 2 hours I have been getting nothing but validation errors on ATLAS native tasks!

The problem seems to be:
"pilotErrorDiag": "Transform not found:/bin/bash: Sim_tf.py: command not found\n"

There are a lot of other hosts out there with failing tasks!
ID: 42835
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 42838 - Posted: 11 Jun 2020, 20:14:07 UTC - in response to Message 42835.  

Now all my ATLAS tasks (Windows, VirtualBox) are failing with the error I posted before. I have put ATLAS on hold for a while.
ID: 42838
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 42852 - Posted: 12 Jun 2020, 18:25:22 UTC
Last modified: 12 Jun 2020, 18:26:15 UTC

In addition to the validation errors on ATLAS, I am now having trouble getting other LHC work units.

BOINC tells me:
Fr 12 Jun 2020 20:04:56 CEST | LHC@home | Scheduler request failed: HTTP gateway timeout

Uploading results seems to be fine.
ID: 42852
Aurum
Joined: 12 Jun 18
Posts: 126
Credit: 52,457,949
RAC: 23,953
Message 42854 - Posted: 12 Jun 2020, 22:35:26 UTC - in response to Message 42852.  
Last modified: 12 Jun 2020, 23:27:41 UTC

In addition to the validation errors on ATLAS, I am now having trouble getting other LHC work units.

I had trouble too until I read the BOINC messages for LHC and saw that no CPU work was being requested because the queue was full and none was needed. I suspended the unstarted WUs and LHC immediately downloaded a boatload. Hopefully this batch won't fail instantly.

Edit: Not looking good: Valids zero, Invalids 73. Validation error.
ID: 42854
Atomic Booty
Joined: 11 Sep 05
Posts: 2
Credit: 275,738
RAC: 0
Message 42856 - Posted: 12 Jun 2020, 23:48:15 UTC

Validate error on all my ATLAS tasks today.
ID: 42856
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,832
RAC: 102,026
Message 42858 - Posted: 13 Jun 2020, 10:13:38 UTC

same here, on 2 machines so far:

2020-06-13 12:00:42 (10588): Guest Log: "pilotErrorDiag": "Transform not found:/bin/bash: Sim_tf.py: command not found\n"

for more information: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277530075
ID: 42858
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,832
RAC: 102,026
Message 42862 - Posted: 13 Jun 2020, 18:40:49 UTC - in response to Message 42858.  

same here, on 2 machines so far:

2020-06-13 12:00:42 (10588): Guest Log: "pilotErrorDiag": "Transform not found:/bin/bash: Sim_tf.py: command not found\n"

for more information: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277530075

I have now tried ATLAS on a third computer, with the same problem: the task failed after 12 minutes :-(

see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277581917

Can anyone tell me what causes this problem?
ID: 42862
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 42894 - Posted: 20 Jun 2020, 13:20:21 UTC

Hello.

Is ATLAS running fine now?
ID: 42894
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42896 - Posted: 20 Jun 2020, 14:23:00 UTC - in response to Message 42894.  

The last time I tried ATLAS was 5 days ago and that task ran fine.
ID: 42896
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 42905 - Posted: 22 Jun 2020, 20:48:33 UTC
Last modified: 22 Jun 2020, 20:49:16 UTC

I have six ATLAS tasks running on a new PC with an Intel i5 CPU. I had ordered an HP desktop with an AMD Ryzen 5 3500 CPU, which would have given me 8 cores, but they sent me one with an Intel CPU. It's a long way from the Intel PII Deschutes I used in the Nineties. It came with 8 GB of RAM, and I brought it to 12 GB by putting a second 4 GB module in the second slot. It has a 128 GB SSD plus a 1 TB hard disk. Wait and see.
Tullio
ID: 42905