Message boards : ATLAS application : Last days a lot of validate errors or No Hits file produced
greg_be

Joined: 28 Dec 08
Posts: 341
Credit: 4,924,084
RAC: 1,303
Message 51228 - Posted: 30 Nov 2024, 18:56:05 UTC - in response to Message 51055.  

I am wondering that no one over there by now has noticed that all the tasks which were sent out within the recent past are faulty. How can this be?


And again yesterday and today.
I had a page's worth of ATLAS tasks all come up as invalid.
Here is an example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=417650730
This was not resent to anyone.
I also notice the quorum is set to 1 on this and all the rest.

I also had a page full of tasks crash on me. The quorum on those was also set to 1. I can't make out whether it was my system or the task that crashed. But I do know that for the past few days I have been getting a dead computer (still running, but with a black screen). Not sure if it was this stuff causing that problem or something in Windows.
In any case I uninstalled VirtualBox and installed it fresh again. Will see if that helps.
ID: 51228
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1439
Credit: 9,618,700
RAC: 2,133
Message 51263 - Posted: 10 Dec 2024, 7:56:16 UTC

I still have the same issue, also with the current batch of tasks.
The results are either validate errors or valid but with no HITS file produced.
Most of the time no events are processed, but very rarely a task manages to start event processing, like this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=417953594
But after about 100 of the 250 events, the task suddenly stopped.
ID: 51263
Erich56

Joined: 18 Dec 15
Posts: 1838
Credit: 121,990,104
RAC: 92,614
Message 51264 - Posted: 10 Dec 2024, 9:32:43 UTC - in response to Message 51263.  

I have now checked the tasks from my hosts that have run Atlas in the past few days (some other hosts are crunching Theory).
In most cases a HITS file was produced; there were just 2 or 3 where it said "no HITS file produced".
ID: 51264
Saturn911

Joined: 3 Nov 12
Posts: 61
Credit: 145,649,139
RAC: 117,285
Message 51265 - Posted: 10 Dec 2024, 10:52:43 UTC

Atlas native has been going well since December 8th... nonstop.
ID: 51265
CloverField

Joined: 17 Oct 06
Posts: 89
Credit: 57,479,053
RAC: 9,771
Message 51266 - Posted: 10 Dec 2024, 13:07:41 UTC

Has anyone else been able to run more than one atlas task at a time? The instant I do I end up with the yellow hard drive triangle in virtual box and all my other atlas tasks fail with computation errors until I clean it up.
ID: 51266
Erich56

Joined: 18 Dec 15
Posts: 1838
Credit: 121,990,104
RAC: 92,614
Message 51267 - Posted: 10 Dec 2024, 13:24:32 UTC - in response to Message 51266.  

CloverField wrote:
Has anyone else been able to run more than one atlas task at a time? The instant I do I end up with the yellow hard drive triangle in virtual box and all my other atlas tasks fail with computation errors until I clean it up.

On several of my hosts I run more than one Atlas task at a time. So I have no idea what exactly might be the problem on your host, but it looks like some misconfiguration of VirtualBox. Which version are you running? Maybe an update to a newer one might help.
ID: 51267
Erich56

Joined: 18 Dec 15
Posts: 1838
Credit: 121,990,104
RAC: 92,614
Message 51268 - Posted: 10 Dec 2024, 13:26:15 UTC - in response to Message 51265.  

Saturn911 wrote:
Atlas native has been going well since December 8th... nonstop.

What I also notice with the latest Atlas tasks: console 2 and 3 are working again the same way they used to a long time ago.
ID: 51268
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2567
Credit: 258,574,994
RAC: 119,363
Message 51269 - Posted: 10 Dec 2024, 13:45:16 UTC

CloverField wrote:
Has anyone else been able to run more than one atlas task at a time? ...

Unlike CMS/Theory ATLAS vbox still runs an older vboxwrapper where this bug is not fixed.
The CERN BOINC team is aware but it looks like nobody from the ATLAS team wants to create a fresh app_version.

Workaround:
- clean the VirtualBox media registry
- start a single ATLAS task
- once the vdi is registered start other ATLAS tasks
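
For the first step of the workaround, a minimal Python sketch of how the registry could be inspected from the command line. It assumes that `VBoxManage list hdds` prints blank-line separated blocks of `Key: value` lines (UUID, State, Location, ...) and that `VBoxManage closemedium disk <uuid>` removes one entry. The helper names and the `ATLAS` name hint are made up for illustration, so check the output on your own host before removing anything.

```python
import subprocess


def list_hdds():
    """Return the raw text of `VBoxManage list hdds` (needs VirtualBox on PATH)."""
    result = subprocess.run(
        ["VBoxManage", "list", "hdds"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_hdd_list(text):
    """Split the listing into one dict per disk; entries are blank-line separated."""
    disks, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                disks.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            current[key.strip()] = value.strip()
    if current:
        disks.append(current)
    return disks


def find_disks(disks, name_hint="ATLAS"):
    """Return entries whose Location mentions name_hint (case-insensitive)."""
    hint = name_hint.lower()
    return [d for d in disks if hint in d.get("Location", "").lower()]


def closemedium_command(uuid):
    """Build the command that removes one disk entry from the registry."""
    return ["VBoxManage", "closemedium", "disk", uuid]
```

One way to use it would be to print the State and Location of every match from `find_disks(parse_hdd_list(list_hdds()))` first, and only then pass `closemedium_command(...)` to `subprocess.run` for entries you are sure about, probably while no ATLAS VM is running.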



Erich56 wrote:
... console 2 and 3 are working again ...

They work for "run 2" tasks but don't for "run 3" tasks since the logfile structure has changed.
So far the recent tasks are "run 2" tasks.

"Run 3" was the last major change before David Cameron left CERN.
The required changes are not too complicated, but ... (same as above).
ID: 51269
CloverField

Joined: 17 Oct 06
Posts: 89
Credit: 57,479,053
RAC: 9,771
Message 51270 - Posted: 10 Dec 2024, 13:47:07 UTC - in response to Message 51267.  

Erich56 wrote:
CloverField wrote:
Has anyone else been able to run more than one atlas task at a time? The instant I do I end up with the yellow hard drive triangle in virtual box and all my other atlas tasks fail with computation errors until I clean it up.
on several of my hosts I run more than one Atlas task at a time. So no idea what exactly might be the problem on your host, but it looks like some misconfiguration of the VirtualBox. Which version are you running? Maybe an update to a newer one might help.


So I went and updated virtual box a couple of weeks ago to see if it would fix the issue. I'm running virtual box 7.1.4. I'm just baffled because it would run multiple tasks happily until about a month ago.
ID: 51270
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2567
Credit: 258,574,994
RAC: 119,363
Message 51271 - Posted: 10 Dec 2024, 14:14:18 UTC - in response to Message 51270.  

CloverField wrote:
... I'm running virtual box 7.1.4 ...

This version runs fine.

The reason why it sometimes fails is a race condition when you had no ATLAS tasks and then start at least 2 of them concurrently.
If you are lucky the timings do not cause the race condition and everything works fine.
Otherwise the media registry gets corrupted and stays corrupted until you manually clean it.

Vboxwrapper 26208 includes a patch that avoids the race condition.
CMS/Theory use a beta version that already includes that patch:
https://github.com/BOINC/boinc/pull/5571
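
To illustrate why timing matters in a race like the one described above, here is a toy Python sketch of the general pattern: several tasks start together and each wants to register the same disk image. It is not taken from vboxwrapper or the linked pull request, and `ToyMediaRegistry` and the file name are invented for the example. It only shows that doing the "is it registered?" check and the registration under one lock lets exactly one task do the work.

```python
import threading


class ToyMediaRegistry:
    """Hypothetical stand-in for a shared media registry; not VirtualBox code."""

    def __init__(self):
        self._lock = threading.Lock()
        self.registered = set()

    def register_once(self, path):
        """Register path if missing; return True only for the call that did it."""
        # The membership check and the insert happen under one lock, so two
        # tasks starting together cannot both see the disk as "missing".
        with self._lock:
            if path in self.registered:
                return False
            self.registered.add(path)
            return True


def start_tasks(registry, path, n_tasks):
    """Start n_tasks threads that all try to register the same disk image."""
    outcomes = []

    def worker():
        outcomes.append(registry.register_once(path))

    threads = [threading.Thread(target=worker) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Starting a single ATLAS task first and waiting until the vdi is registered has a similar effect by hand: only one task ever performs the registration.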
ID: 51271
CloverField

Joined: 17 Oct 06
Posts: 89
Credit: 57,479,053
RAC: 9,771
Message 51272 - Posted: 10 Dec 2024, 15:18:04 UTC - in response to Message 51271.  

... I'm running virtual box 7.1.4 ...

This version runs fine.

The reason why it sometimes fails is a race condition when you had no ATLAS tasks and then start at least 2 of them concurrently.
If you are lucky the timings do not cause the race condition and everything works fine.
Otherwise the media registry gets corrupted and stays corrupted until you manually clean it.

Vboxwrapper 26208 includes a patch that avoids the race condition.
CMS/Theory use a beta version that already includes that patch:
https://github.com/BOINC/boinc/pull/5571

Thanks for the explanation. I'll lock atlas at one task until it gets the patch.
ID: 51272
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1439
Credit: 9,618,700
RAC: 2,133
Message 51275 - Posted: 12 Dec 2024, 17:06:17 UTC

I was very hopeful that I could finally return a valid task with the HITS file, because all 50 out of 50 events were processed and the console showed the HITS file being processed, but after returning the result no HITS file was seen.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=418119537
From the result:
2024-12-12 17:13:47 (12592): Guest Log: *** Error codes and diagnostics ***
2024-12-12 17:13:47 (12592): Guest Log: "exeErrorCode": 68,
2024-12-12 17:13:47 (12592): Guest Log: "exeErrorDiag": "Fatal error in athena logfile: \"Long ERROR message at line 2945 (see jobReport for further details)\"",
2024-12-12 17:13:47 (12592): Guest Log: "pilotErrorCode": 1305,
2024-12-12 17:13:47 (12592): Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-12-12 16:11:09,915 CRITICAL Transform executor raised TransformLogfileErrorException: Fatal error in athena logfile: \"Long ERROR message at line 2945 (see jobReport for further details)\"",



The next one, https://lhcathome.cern.ch/lhcathome/result.php?resultid=418133093, had no event processing at all.
2024-12-12 17:39:35 (8864): Guest Log: *** Error codes and diagnostics ***
2024-12-12 17:39:35 (8864): Guest Log: "exeErrorCode": 65,
2024-12-12 17:39:35 (8864): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"GeoModelSvc FATAL in sysInitialize(): standard std::exception is caught\"",
2024-12-12 17:39:35 (8864): Guest Log: "pilotErrorCode": 1305,
2024-12-12 17:39:35 (8864): Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-12-12 16:36:58,835 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"GeoModelSvc FATAL in sysInitia",
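
When comparing many results, the numeric codes in Guest Log lines like the ones above can be pulled out with a few lines of Python. This is only a sketch: it assumes the stderr text has already been saved locally and that the code lines keep the `"...ErrorCode": <number>` shape shown here. The function name is made up.

```python
import re

# Matches lines such as:  ... Guest Log: "exeErrorCode": 68,
CODE_RE = re.compile(r'Guest Log:\s+"(?P<key>\w+ErrorCode)":\s*(?P<value>\d+)')


def extract_error_codes(stderr_text):
    """Return a dict like {"exeErrorCode": 68, "pilotErrorCode": 1305}."""
    codes = {}
    for line in stderr_text.splitlines():
        match = CODE_RE.search(line)
        if match:
            # A later line with the same key overwrites an earlier one.
            codes[match.group("key")] = int(match.group("value"))
    return codes
```

Run on the two results quoted in this post, it would separate the exeErrorCode 68 case from the exeErrorCode 65 case, which share pilotErrorCode 1305.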
ID: 51275
GDB

Joined: 29 Dec 17
Posts: 1
Credit: 3,585,756
RAC: 2,019
Message 51278 - Posted: 14 Dec 2024, 0:07:43 UTC - in response to Message 51269.  

CloverField wrote:
Has anyone else been able to run more than one atlas task at a time? ...

Unlike CMS/Theory ATLAS vbox still runs an older vboxwrapper where this bug is not fixed.
The CERN BOINC team is aware but it looks like nobody from the ATLAS team wants to create a fresh app_version.

Workaround:
- clean the VirtualBox media registry
- start a single ATLAS task
- once the vdi is registered start other ATLAS tasks



Erich56 wrote:
... console 2 and 3 are working again ...

They work for "run 2" tasks but don't for "run 3" tasks since the logfile structure has changed.
So far the recent tasks are "run 2" tasks.

"Run 3" was the last major change before David Cameron left CERN.
The required changes are not too complicated, but ... (same as above).



How do you clean VB media registry?
ID: 51278
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2567
Credit: 258,574,994
RAC: 119,363
Message 51279 - Posted: 14 Dec 2024, 6:55:50 UTC - in response to Message 51278.  

GDB wrote:
How do you clean VB media registry?

This has been explained a couple of times, e.g. here for CMS:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6127&postid=49796

If your ATLAS vdi is affected, remove that entry from the list.
ID: 51279
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2567
Credit: 258,574,994
RAC: 119,363
Message 51280 - Posted: 14 Dec 2024, 9:00:40 UTC - in response to Message 51275.  

Just to ensure the errors are not caused by something unexpected:
Did you recently check the health of
- the disk (hardware)
- the filesystem
- the ATLAS vdi file?
ID: 51280


©2025 CERN