Message boards : ATLAS application : Validate error on all tasks, and short run time with 1 core only

Peter Hucker
Joined: 12 Aug 06
Posts: 231
Credit: 1,451,601
RAC: 12,266
Message 42638 - Posted: 27 May 2020, 22:44:08 UTC
Last modified: 27 May 2020, 22:47:52 UTC

Atlas used to work on all my computers. I just upgraded two of them with extra RAM. They're dual Xeon machines which now have 36GB instead of 20GB of RAM. But I'm getting this: https://lhcathome.cern.ch/lhcathome/results.php?userid=55945&offset=0&show_names=0&state=5&appid=14 - all the tasks are only using 1 CPU core instead of 8, and finishing within 15 minutes, then causing a validate error. Any way to find out what's wrong? All I changed was adding more RAM (which has been tested by Memtest).

Looking at one of the task logs, I see this from https://lhcathome.cern.ch/lhcathome/result.php?resultid=275499071:

*****
2020-05-27 23:12:03 (3224): Guest Log: *** Error codes and diagnostics ***
2020-05-27 23:12:03 (3224): Guest Log: "exeErrorCode": 65,
2020-05-27 23:12:03 (3224): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: \"DetectorStore FATAL in sysInitialize(): standard std::exception is caught\"",
2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorCode": 1165,
2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorDiag": "Local output file is missing",
*****
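
For anyone who wants to pull these fields out of a task log without reading it line by line, here is a minimal Python sketch. It assumes each diagnostic sits on its own "Guest Log:" line in the `"key": value,` shape shown above. The function name `extract_error_fields` is made up for illustration and is not part of BOINC or the ATLAS pilot.

```python
import json
import re

# Matches lines such as:
#   2020-05-27 23:12:03 (3224): Guest Log: "exeErrorCode": 65,
# The line shape is an assumption based on the excerpt in this post.
FIELD_RE = re.compile(r'Guest Log:\s*"(?P<key>\w+)":\s*(?P<value>.+?),?\s*$')


def extract_error_fields(log_text):
    """Return a dict of the "key": value pairs found on Guest Log lines."""
    fields = {}
    for line in log_text.splitlines():
        match = FIELD_RE.search(line)
        if not match:
            continue  # banner lines and unrelated output are skipped
        raw = match.group("value")
        try:
            # Values look like JSON scalars (numbers or quoted strings).
            fields[match.group("key")] = json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to the raw text if the value is not valid JSON.
            fields[match.group("key")] = raw
    return fields


SAMPLE = r'''
2020-05-27 23:12:03 (3224): Guest Log: *** Error codes and diagnostics ***
2020-05-27 23:12:03 (3224): Guest Log: "exeErrorCode": 65,
2020-05-27 23:12:03 (3224): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: \"DetectorStore FATAL in sysInitialize(): standard std::exception is caught\"",
2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorCode": 1165,
2020-05-27 23:12:03 (3224): Guest Log: "pilotErrorDiag": "Local output file is missing",
'''

if __name__ == "__main__":
    for key, value in extract_error_fields(SAMPLE).items():
        print(key, "=", value)
```

The codes come back as integers and the diagnostics as plain strings, so they can be compared across several failed tasks.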
ID: 42638

Peter Hucker
Joined: 12 Aug 06
Posts: 231
Credit: 1,451,601
RAC: 12,266
Message 42639 - Posted: 27 May 2020, 23:45:18 UTC - in response to Message 42638.  

I see the same problem on my other computer, which has not changed apart from upgrading VirtualBox and its extensions to the latest version.

I shall cease Atlas tasks until someone tells me what's happened.
ID: 42639

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1607
Credit: 94,453,637
RAC: 98,679
Message 42645 - Posted: 28 May 2020, 6:07:31 UTC - in response to Message 42639.  

Peter Hucker wrote: "... until someone tells me what's happened."

Just look around and read other posts.
The answer might already be there:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5438&postid=42630
ID: 42645

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 329
Credit: 11,202,583
RAC: 5,081
Message 42652 - Posted: 28 May 2020, 13:00:58 UTC - in response to Message 42645.  

As you can read in the above link, there was a major database outage at CERN yesterday evening which affected the BOINC servers and pretty much all of ATLAS' distributed computing services. Unfortunately, one of the last things to come back was the Frontier database servers, which the ATLAS tasks read data from while they are running. So although we were able to submit tasks here, they would all fail straight away. Things should be working OK now. Sorry for the inconvenience.
ID: 42652

Peter Hucker
Joined: 12 Aug 06
Posts: 231
Credit: 1,451,601
RAC: 12,266
Message 42654 - Posted: 28 May 2020, 14:03:01 UTC - in response to Message 42645.  
Last modified: 28 May 2020, 14:27:59 UTC

computezrmle wrote: "Just look around and read other posts. The answer might already be there:"
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5438&postid=42630


I did search first, but the search function in these forums isn't the best, and I was looking for a specific Atlas problem. It never dawned on me to link this problem to the outage I noticed when I couldn't even get on the forums yesterday.

David Cameron wrote: "As you can read in the above link, there was a major database outage at CERN yesterday evening which affected the BOINC servers and pretty much all of ATLAS' distributed computing services. Unfortunately, one of the last things to come back was the Frontier database servers, which the ATLAS tasks read data from while they are running. So although we were able to submit tasks here, they would all fail straight away. Things should be working OK now. Sorry for the inconvenience."


No problem. I just thought I was doing something wrong, and I didn't want to send back hundreds of useless results.

Tasks are now processing with 8 cores per ATLAS task, so I assume all is OK.
ID: 42654

Harri Liljeroos
Joined: 28 Sep 04
Posts: 484
Credit: 25,883,776
RAC: 14,652
Message 42822 - Posted: 10 Jun 2020, 8:18:13 UTC

During the night and this morning I've had quite a lot of invalid tasks (>25). All of those tasks have failed on other hosts as well. The error for all of them seems to be 'Error: Service 'control' failed to initialize: VERR_INVALID_PARAMETER'.

Here's one of them: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277054761
ID: 42822

djoser
Joined: 30 Aug 14
Posts: 129
Credit: 10,082,584
RAC: 11,472
Message 42835 - Posted: 11 Jun 2020, 17:23:33 UTC
Last modified: 11 Jun 2020, 18:04:05 UTC

For the past 2 hours I have been getting nothing but validation errors on ATLAS native tasks!

The problem seems to be:
"pilotErrorDiag": "Transform not found:/bin/bash: Sim_tf.py: command not found\n"

There are a lot of other hosts out there with failing tasks!
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 42835

Harri Liljeroos
Joined: 28 Sep 04
Posts: 484
Credit: 25,883,776
RAC: 14,652
Message 42838 - Posted: 11 Jun 2020, 20:14:07 UTC - in response to Message 42835.  

Now all my ATLAS tasks (Windows, VirtualBox) are failing with the error I posted before. I have put ATLAS on hold for a while.
ID: 42838

djoser
Joined: 30 Aug 14
Posts: 129
Credit: 10,082,584
RAC: 11,472
Message 42852 - Posted: 12 Jun 2020, 18:25:22 UTC
Last modified: 12 Jun 2020, 18:26:15 UTC

In addition to the validation errors on ATLAS, I now have trouble getting other LHC work units.

BOINC tells me:
Fr 12 Jun 2020 20:04:56 CEST | LHC@home | Scheduler request failed: HTTP gateway timeout

Uploading results seems to be fine.
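
A line like the one above can be split into its parts with a few lines of Python. This is only a sketch: it assumes the "timestamp | project | message" layout seen in this post, and the names `EventLine`, `parse_event_line` and `is_scheduler_failure` are hypothetical helpers, not BOINC functions. The timestamp is kept as a plain string because the weekday abbreviation ("Fr" here) depends on the client's locale.

```python
from typing import NamedTuple, Optional


class EventLine(NamedTuple):
    timestamp: str  # left unparsed; the format depends on the client locale
    project: str
    message: str


def parse_event_line(line: str) -> Optional[EventLine]:
    """Split 'timestamp | project | message' into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the message.
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return EventLine(*parts)


def is_scheduler_failure(event: EventLine) -> bool:
    """True for messages like 'Scheduler request failed: ...'."""
    return event.message.startswith("Scheduler request failed")


if __name__ == "__main__":
    sample = ("Fr 12 Jun 2020 20:04:56 CEST | LHC@home | "
              "Scheduler request failed: HTTP gateway timeout")
    event = parse_event_line(sample)
    if event is not None and is_scheduler_failure(event):
        print(event.project, "->", event.message)
```

Counting how many such lines appear per project in a saved event log would show whether the timeouts affect only LHC@home or every attached project.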
ID: 42852

Aurum
Joined: 12 Jun 18
Posts: 92
Credit: 37,970,693
RAC: 0
Message 42854 - Posted: 12 Jun 2020, 22:35:26 UTC - in response to Message 42852.  
Last modified: 12 Jun 2020, 23:27:41 UTC

djoser wrote: "In addition to the validation errors on ATLAS, I now have trouble getting other LHC work units."
I had trouble too until I read the LHC messages in BOINC and saw that no CPU work was requested because the queue was full and none was needed. I suspended the unstarted WUs and LHC immediately downloaded a boatload. Hopefully this batch won't fail instantly.

Edit: Not looking good: Valids zero, Invalids 73. Validation error.
ID: 42854

Atomic Booty
Joined: 11 Sep 05
Posts: 2
Credit: 275,147
RAC: 1
Message 42856 - Posted: 12 Jun 2020, 23:48:15 UTC

Validate error on all my ATLAS tasks today.
ID: 42856

Erich56
Joined: 18 Dec 15
Posts: 1322
Credit: 24,349,814
RAC: 10,050
Message 42858 - Posted: 13 Jun 2020, 10:13:38 UTC

Same here, on 2 machines so far:

2020-06-13 12:00:42 (10588): Guest Log: "pilotErrorDiag": "Transform not found:/bin/bash: Sim_tf.py: command not found\n"

for more information: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277530075
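
Three different error strings have been quoted in this thread so far, and it helps to know which one a given failed task shows. The sketch below simply searches a task log for those strings. The labels and the names `SIGNATURES` and `classify` are invented for illustration, and the labels only name the symptom, not its cause.

```python
# Error strings quoted in this thread, mapped to short illustrative labels.
SIGNATURES = {
    "DetectorStore FATAL in sysInitialize()": "detectorstore_fatal",
    "VERR_INVALID_PARAMETER": "vbox_invalid_parameter",
    "Transform not found": "transform_not_found",
}


def classify(log_text):
    """Return the labels of all known signatures found in log_text."""
    return [label for needle, label in SIGNATURES.items() if needle in log_text]


if __name__ == "__main__":
    line = (r'2020-06-13 12:00:42 (10588): Guest Log: "pilotErrorDiag": '
            r'"Transform not found:/bin/bash: Sim_tf.py: command not found\n"')
    print(classify(line))
```

An empty list means the log shows none of the three strings, so the task probably failed for a reason not yet reported here.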
ID: 42858

Erich56
Joined: 18 Dec 15
Posts: 1322
Credit: 24,349,814
RAC: 10,050
Message 42862 - Posted: 13 Jun 2020, 18:40:49 UTC - in response to Message 42858.  

I have now tried ATLAS on a third computer. Same problem: the task failed after 12 minutes :-(

See here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277581917

Can anyone tell me what causes this problem?
ID: 42862

NOGOOD
Joined: 18 Nov 17
Posts: 103
Credit: 31,974,136
RAC: 18,737
Message 42894 - Posted: 20 Jun 2020, 13:20:21 UTC

Hello.

Is ATLAS running fine now?
ID: 42894

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 990
Credit: 6,426,380
RAC: 581
Message 42896 - Posted: 20 Jun 2020, 14:23:00 UTC - in response to Message 42894.  

The last time I tried ATLAS was 5 days ago and that task ran fine.
ID: 42896

tullio
Joined: 19 Feb 08
Posts: 634
Credit: 3,874,092
RAC: 1,108
Message 42905 - Posted: 22 Jun 2020, 20:48:33 UTC
Last modified: 22 Jun 2020, 20:49:16 UTC

I have six ATLAS tasks running on a new PC with an Intel i5 CPU. I had ordered an HP desktop with an AMD Ryzen 5 3500 CPU, which would have given me 8 cores, but they sent me one with an Intel CPU. It's been a long time since the Intel PII Deschutes I used in the Nineties. The PC came with 8 GB of RAM, and I brought it to 12 GB by putting a second 4 GB module in the second slot. It has a 128 GB SSD plus a 1 TB hard disk. Wait and see.
Tullio
ID: 42905