Message boards : ATLAS application : Last days a lot of validate errors or No Hits file produced

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50692 - Posted: 3 Oct 2024, 9:57:11 UTC

Over the last 2 days I had 34 ATLAS tasks: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10690380&offset=0&show_names=0&state=0&appid=14

11 were valid and had produced a HITS-file as a result
10 were valid, but had no HITS-file
13 ended in validate error
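
As a quick check of the tally above: the three counts add up to the 34 tasks, and only about a third of them returned a HITS-file. A minimal Python sketch of that arithmetic (the dictionary keys are labels chosen for illustration, not BOINC state names):

```python
# Outcome counts for the 34 ATLAS tasks reported above.
# The key names are illustrative labels, not BOINC state names.
outcomes = {
    "valid_with_hits": 11,
    "valid_no_hits": 10,
    "validate_error": 13,
}

total = sum(outcomes.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


shares = {name: share(n, total) for name, n in outcomes.items()}

for name, pct in shares.items():
    print(f"{name}: {outcomes[name]} of {total} ({pct} %)")
```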

Most of the unsuccessful results had the following lines in the result:

Guest Log: "exeErrorCode": 65,
Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP\"",
Guest Log: "pilotErrorCode": 1305,
Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-01 13:49:24,161 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions d",


Some of the unsuccessful ones had these lines, mostly after running for almost the normal time:

Guest Log: "exeErrorCode": 68,
Guest Log: "exeErrorDiag": "Fatal error in athena logfile: \"Logfile error in log.EVNTtoHITS: \"ToolSvc.G4AtlasDetectorConstructionTool 0 FATAL Failed to initialize SDs for worker thread\"\"",
Guest Log: "pilotErrorCode": 1305,
Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-02 12:57:36,663 CRITICAL Transform executor raised TransformLogfileErrorException: Fatal error in athena logfile: \"Logfile error in log.EVNTtoHITS: \"ToolSvc.G4AtlasDetectorConstructionTool 0 FATAL Failed to initialize",
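
To see which codes dominate across many results, the error lines can be counted with a small script. This is only a sketch under the assumption that the lines look like the "Guest Log" excerpts above; the function name and the sample list are made up for illustration:

```python
import re
from collections import Counter

# Matches lines such as:  Guest Log: "exeErrorCode": 65,
# The *Diag lines are deliberately not matched.
CODE_RE = re.compile(r'"(exeErrorCode|pilotErrorCode)":\s*(\d+)')


def count_error_codes(lines):
    """Count (field, code) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CODE_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample input copied from the two excerpts in this post.
sample = [
    'Guest Log: "exeErrorCode": 65,',
    'Guest Log: "pilotErrorCode": 1305,',
    'Guest Log: "exeErrorCode": 68,',
    'Guest Log: "pilotErrorCode": 1305,',
]

codes = count_error_codes(sample)
for (field, code), n in sorted(codes.items()):
    print(f"{field} {code}: {n}")
```

In practice the lines would come from the saved stderr output of each result rather than from a hard-coded list.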
ID: 50692
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50693 - Posted: 3 Oct 2024, 11:39:24 UTC - in response to Message 50692.  

For the last few days I have been seeing successful ATLAS tasks on Win11 Pro.
Only the download speed is often between 20 and 50 kbps.
I'm checking at the moment what the reason is.
ID: 50693
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50694 - Posted: 3 Oct 2024, 13:18:03 UTC - in response to Message 50693.  
Last modified: 3 Oct 2024, 16:30:01 UTC

The initialisation of an ATLAS-task now takes 42 minutes (normally 13 min.) before the actual event processing begins.

I'm not the only one with a lot of errors.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10630013&offset=0&show_names=0&state=0&appid=14
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10294367&offset=0&show_names=0&state=5&appid=14
ID: 50694
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50695 - Posted: 4 Oct 2024, 7:25:47 UTC - in response to Message 50694.  

2024-10-04 09:09:11 (86044): Detected: BOINC client v8.0.4
2024-10-04 09:09:11 (86044): Detected: VirtualBox VboxManage Interface (Version: 7.1.2)
I have BOINC 8.0.2 and VirtualBox 7.0.14.
Is this the reason for the ATLAS errors?
ID: 50695
PDW

Joined: 7 Aug 14
Posts: 23
Credit: 9,927,401
RAC: 19,052
Message 50697 - Posted: 4 Oct 2024, 7:55:04 UTC - in response to Message 50694.  

The initialisation of an ATLAS-task now takes 42 minutes (normally 13 min.) before the actual event processing begins.

I'm not the only one with a lot of errors.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10630013&offset=0&show_names=0&state=0&appid=14
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10294367&offset=0&show_names=0&state=5&appid=14
At least the second one is getting a huge amount of credit for the tasks that do look like they work.
ID: 50697
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50700 - Posted: 4 Oct 2024, 8:44:58 UTC - in response to Message 50695.  
Last modified: 4 Oct 2024, 8:49:43 UTC

2024-10-04 09:09:11 (86044): Detected: BOINC client v8.0.4
2024-10-04 09:09:11 (86044): Detected: VirtualBox VboxManage Interface (Version: 7.1.2)
I have BOINC 8.0.2 and VirtualBox 7.0.14.
Is this the reason for the ATLAS errors?
I don't use BOINC v8.0.4 but v8.0.2.

VBox 7.1.2 solved the issue with the very slow network interface, and developer computezrmle fixed the remote desktop issue in vboxwrapper.
The new vboxwrapper should be released by the LHC admins. So those errors are not 'version' related.

The reason for those validate errors and valids (but no HITS-file) is in the ATLAS exe-code:

"exeErrorCode": 65
"exeErrorCode": 68
"pilotErrorCode": 1305
ID: 50700
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2520
Credit: 251,308,353
RAC: 119,939
Message 50701 - Posted: 4 Oct 2024, 9:02:43 UTC

Errors 65/68 point to issues in lower-level ATLAS scripts.
In most cases they are caused by configuration errors on the submitter side and affect a whole batch.
In rare cases (and only if they affect vbox tasks) they are caused by a VM not having enough RAM.

So, if all wingmen's tasks succeed, try to slightly increase the RAM given to the VMs.
Otherwise be patient until the faulty batch is done.
ID: 50701
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50702 - Posted: 4 Oct 2024, 9:33:28 UTC - in response to Message 50701.  

So, if all wingmen's tasks succeed, try to slightly increase the RAM given to the VMs.
For ATLAS-tasks there is no wing(wo)man: max # of error/total/success tasks 1, 1, 1
ID: 50702
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50747 - Posted: 8 Oct 2024, 12:27:02 UTC - in response to Message 50702.  

ID: 50747
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50780 - Posted: 13 Oct 2024, 16:35:09 UTC

More errors and results without a HITS-file than valids:

2024-10-13 17:16:19 (12584): Guest Log: *** Error codes and diagnostics ***
2024-10-13 17:16:19 (12584): Guest Log: "exeErrorCode": 65,
2024-10-13 17:16:19 (12584): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP\"",
2024-10-13 17:16:19 (12584): Guest Log: "pilotErrorCode": 1305,
2024-10-13 17:16:19 (12584): Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-13 15:15:42,348 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions d",
ID: 50780
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50781 - Posted: 13 Oct 2024, 16:58:59 UTC - in response to Message 50780.  

I have stopped downloading ATLAS tasks since 12 Oct.
All tasks need a restart inside the VM in VirtualBox.
Otherwise they show 0.0 % CPU use the whole time.

There must have been something wrong with the combination of BOINC (8.0.2) and VirtualBox (7.0.14) since then.
ID: 50781
Harri Liljeroos

Joined: 28 Sep 04
Posts: 719
Credit: 48,219,148
RAC: 30,122
Message 50782 - Posted: 13 Oct 2024, 17:50:53 UTC

No problem here with BOINC 8.0.2 and VirtualBox 7.0.6 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10858236 or with BOINC 7.16.5 and VirtualBox 5.2.44 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10509390
ID: 50782
mmonnin

Joined: 22 Mar 17
Posts: 60
Credit: 13,864,571
RAC: 42,395
Message 50892 - Posted: 23 Oct 2024, 21:40:45 UTC

More tasks, with the native app. I thought it was my system, as it hasn't run ATLAS in some time but is still running Theory OK. But I hadn't seen a validate error due to a configuration issue before.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415205612
ID: 50892
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50931 - Posted: 26 Oct 2024, 16:10:18 UTC
Last modified: 27 Oct 2024, 16:59:06 UTC

All 90 ATLAS-tasks from yesterday and today were unsuccessful.
35 tasks got the invalid status, the other 55 were validated OK,
but none of those 'valid' tasks returned a HITS-file.
Most tasks didn't even start the event processing. Six, however, did start
event processing after a 'lazy' init phase of over 150 minutes,
but those tasks stopped after about 10 to 30 of the 50 events.

Those six:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415298686
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415298434
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415294829
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415291440
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415268645
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415244533

Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415372751
ID: 50931
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50932 - Posted: 26 Oct 2024, 22:56:44 UTC - in response to Message 50931.  

I think it was the data center.
It is now running OK with 2 or 3 ATLAS tasks with 6 CPUs.
In a few hours I will let 10 tasks with 6 CPUs run.
ID: 50932
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50958 - Posted: 28 Oct 2024, 15:45:13 UTC - in response to Message 50931.  

All 90 ATLAS-tasks from yesterday and today were unsuccessful...
I had already increased the RAM to 5120 MB without success.
Of the tasks that reached the event processing part, the highest number of processed events was 40 out of 50.
Then the processing suddenly ended and the task was returned without a HITS-file.
I looked around and saw that other crunchers return valid results with a HITS-file, so I will try again and have increased the RAM to 5700 MB.
ID: 50958
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50959 - Posted: 28 Oct 2024, 15:55:53 UTC

This app_config.xml runs without problems.
<app_config>
  <app>
    <name>ATLAS</name>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <max_file_xfers_per_project>3</max_file_xfers_per_project>
    <avg_ncpus>6</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 4750</cmdline>
  </app_version>
  <project_max_concurrent>10</project_max_concurrent>
</app_config>
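
Before dropping such a file into the project directory, it may help to check that it parses and to see what the settings add up to. A minimal Python sketch using only the standard library; `APP_CONFIG` is a trimmed copy of the values above wrapped in an `<app_config>` root element, and `summarize` is a hypothetical helper name. The totals are an upper bound that assumes all concurrent tasks are ATLAS VMs:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the settings from the post, for illustration only.
APP_CONFIG = """\
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>6</avg_ncpus>
    <cmdline>--memory_size_mb 4750</cmdline>
  </app_version>
  <project_max_concurrent>10</project_max_concurrent>
</app_config>
"""


def summarize(xml_text):
    """Parse app_config XML and return (cpus, memory_mb, max_concurrent)."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    version = root.find("app_version")
    cpus = int(version.findtext("avg_ncpus"))
    # The memory size is passed as a command-line option in <cmdline>.
    tokens = version.findtext("cmdline").split()
    memory_mb = int(tokens[tokens.index("--memory_size_mb") + 1])
    max_concurrent = int(root.findtext("project_max_concurrent"))
    return cpus, memory_mb, max_concurrent


cpus, memory_mb, max_concurrent = summarize(APP_CONFIG)
print(f"{max_concurrent} tasks x {cpus} CPUs = {max_concurrent * cpus} CPUs")
print(f"{max_concurrent} tasks x {memory_mb} MB = {max_concurrent * memory_mb} MB RAM")
```

A file with a missing opening or closing tag makes `ET.fromstring` raise `ParseError`, which is a cheap way to catch copy-and-paste mistakes.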

In a few hours I will test 10 ATLAS tasks with 6 CPUs.
At the moment WCG needs some time before the BOINC Manager is free to work on ATLAS.
ID: 50959
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50964 - Posted: 29 Oct 2024, 15:34:08 UTC

I finally got a task that did 50 events, but now the job had to do 100 events,
so after 66 events it showed the same behaviour. All 4 event-processing processes suddenly went from 99% to 0%.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415487700

Valid but no HITS-file.

2024-10-29 14:53:36 (6260): Status Report: Elapsed Time: '6000.000000'
2024-10-29 14:53:36 (6260): Status Report: CPU Time: '247.031250'
2024-10-29 15:59:45 (6260): Guest Log:  *** Job finished ***
2024-10-29 15:59:45 (6260): Guest Log:  *** The last 20 lines of the pilot log: ***
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:32,829 | INFO     | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>']
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:34,849 | INFO     | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>']
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:36,865 | INFO     | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>']
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:37,388 | INFO     | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>']
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:37,388 | INFO     | [job] queue monitor thread has finished
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:38,904 | INFO     | caller=run is remaining thread - safe to abort (names=['<_MainThread(MainThread, started 139780515440448)>'])
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO     | all workflow threads have been joined
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO     | end of generic workflow (traces error code: 0)
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO     | traces error code: 0
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO     | pilot has finished (exit code=0, shell exit code=0)
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,950 [wrapper] ==== pilot stdout END ====
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,952 [wrapper] ==== wrapper stdout RESUME ====
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,955 [wrapper] pilotpid: 5995
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,957 [wrapper] Pilot exit status: 0
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,971 [wrapper] pandaids: 6382409099
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,986 [wrapper] cleanup supervisor_pilot 14668 5996
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,988 [wrapper] Test setup, not cleaning
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,991 [wrapper] ==== wrapper stdout END ====
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,993 [wrapper] ==== wrapper stderr END ====
2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,998 [wrapper] apfmon messages muted
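
The two Status Report lines at the top of this log put the CPU time at roughly 4 % of the elapsed time after 6000 seconds. A rough Python sketch, assuming the line format shown above, that extracts both values and computes the ratio (the helper name `cpu_ratio` is made up; the log does not say what CPU Time is summed over, so treat the ratio only as a rough indicator):

```python
import re

# Matches e.g.:  Status Report: Elapsed Time: '6000.000000'
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_ratio(lines):
    """Return CPU time divided by elapsed time from Status Report lines."""
    values = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values["CPU Time"] / values["Elapsed Time"]


# The two lines copied from the log above.
sample = [
    "2024-10-29 14:53:36 (6260): Status Report: Elapsed Time: '6000.000000'",
    "2024-10-29 14:53:36 (6260): Status Report: CPU Time: '247.031250'",
]

ratio = cpu_ratio(sample)
print(f"CPU time / elapsed time = {ratio:.3f}")
```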
ID: 50964
maeax

Joined: 2 May 07
Posts: 2228
Credit: 173,739,920
RAC: 20,336
Message 50965 - Posted: 29 Oct 2024, 17:58:03 UTC

Your link shows this ATLAS task:
Run time 2 hours 46 min 33 sec
CPU time 1 hour 59 min 52 sec
Validate state Valid
Credit 1,333.29
Successful
ID: 50965
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 50966 - Posted: 29 Oct 2024, 20:53:36 UTC - in response to Message 50965.  

Your link shows this ATLAS task:
Run time 2 hours 46 min 33 sec
CPU time 1 hour 59 min 52 sec
Validate state Valid
Credit 1,333.29
Successful
Valid but no HITS-file
ID: 50966


©2024 CERN