Last days a lot of validate errors or No Hits file produced
Author | Message |
---|---|
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
In the last 2 days I had 34 ATLAS tasks: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10690380&offset=0&show_names=0&state=0&appid=14 11 were valid and had produced a HITS-file as a result; 10 were valid, but had no HITS-file; 13 ended in a validate error. Most of the unsuccessful results had the following lines in the result: Guest Log: "exeErrorCode": 65, Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP\"", Guest Log: "pilotErrorCode": 1305, Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-01 13:49:24,161 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions d", Some of the unsuccessful ones had these lines, mostly after running for almost the normal time: Guest Log: "exeErrorCode": 68, Guest Log: "exeErrorDiag": "Fatal error in athena logfile: \"Logfile error in log.EVNTtoHITS: \"ToolSvc.G4AtlasDetectorConstructionTool 0 FATAL Failed to initialize SDs for worker thread\"\"", Guest Log: "pilotErrorCode": 1305, Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-02 12:57:36,663 CRITICAL Transform executor raised TransformLogfileErrorException: Fatal error in athena logfile: \"Logfile error in log.EVNTtoHITS: \"ToolSvc.G4AtlasDetectorConstructionTool 0 FATAL Failed to initialize", |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
For the last few days I have been seeing successful ATLAS tasks on Win11 Pro. Only the download speed is often between 20 and 50 kbps. I am checking at the moment what the reason is. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
The initialisation of an ATLAS task now takes 42 minutes (normally 13 min.) before the actual event processing begins. I'm not the only one with a lot of errors: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10630013&offset=0&show_names=0&state=0&appid=14 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10294367&offset=0&show_names=0&state=5&appid=14 |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
2024-10-04 09:09:11 (86044): Detected: BOINC client v8.0.4 2024-10-04 09:09:11 (86044): Detected: VirtualBox VboxManage Interface (Version: 7.1.2) For me it is BOINC 8.0.2 and VirtualBox 7.0.14. Is this the reason for the ATLAS problems? |
Joined: 7 Aug 14 Posts: 27 Credit: 10,000,233 RAC: 290 |
"The initialisation of an ATLAS task now takes 42 minutes (normally 13 min.) before the actual event processing begins." At least the second one is getting a huge amount of credit for the ones that do look like they work. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"2024-10-04 09:09:11 (86044): Detected: BOINC client v8.0.4" I don't use BOINC v8.0.4 but v8.0.2. VBox 7.1.2 solved the issue with the very slow network interface, and developer computezrmle fixed the remote desktop issue in vboxwrapper. The new vboxwrapper should be released by the LHC admins. So those errors are not 'version' related. The reason for those validate errors and valids (but no HITS-file) is in the ATLAS exe-code: "exeErrorCode": 65, "exeErrorCode": 68, "pilotErrorCode": 1305 |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
Errors 65/68 point to issues in deeper-level ATLAS scripts. In most cases they are caused by configuration errors on the submitter side and affect a whole batch. In rare cases (and if they affect vbox only) they are caused by a VM not having enough RAM. So, if all wingmen's tasks succeed, try to slightly increase the RAM given to the VMs. Otherwise be patient until the faulty batch is done. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"So, if all wingmen's tasks succeed, try to slightly increase the RAM given to the VMs." For ATLAS tasks there is no wing(wo)man: max # of error/total/success tasks 1, 1, 1 |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
|
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
More errors and results without a HITS-file than valids: 2024-10-13 17:16:19 (12584): Guest Log: *** Error codes and diagnostics *** 2024-10-13 17:16:19 (12584): Guest Log: "exeErrorCode": 65, 2024-10-13 17:16:19 (12584): Guest Log: "exeErrorDiag": "Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP\"", 2024-10-13 17:16:19 (12584): Guest Log: "pilotErrorCode": 1305, 2024-10-13 17:16:19 (12584): Guest Log: "pilotErrorDiag": "Failed to execute payload:PyJobTransforms.transform.execute 2024-10-13 15:15:42,348 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (1); Logfile error in log.EVNTtoHITS: \"IOVDbFolder FATAL Conditions d", |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
I have stopped downloading ATLAS tasks since 12 Oct. All tasks need a restart inside the VM in VirtualBox; otherwise they show 0.0 % CPU use the whole time. Something must have been wrong with the combination of BOINC (8.0.2) and VirtualBox (7.0.14) since then. |
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281 |
No problem here with Boinc 8.0.2 and VirtualBox 7.0.6 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10858236 or with Boinc 7.16.5 and Virtualbox 5.2.44 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10509390 |
Joined: 22 Mar 17 Posts: 64 Credit: 14,576,403 RAC: 1,276 |
More such tasks, here with the native app. I was thinking it was my system, as it hasn't run ATLAS in some time but is still running Theory OK. But I hadn't seen a validate error due to a configuration issue before. https://lhcathome.cern.ch/lhcathome/result.php?resultid=415205612 |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
All 90 ATLAS tasks from yesterday and today were unsuccessful. 35 tasks got the invalid status; the other 55 were validated OK, but none of those 'valid' tasks returned a HITS-file. Most tasks didn't even start the event processing. Six, however, did event processing after a 'lazy' init phase of over 150 minutes, but those tasks stopped after about 10 to 30 of the 50 events. Those six: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415298686 https://lhcathome.cern.ch/lhcathome/result.php?resultid=415298434 https://lhcathome.cern.ch/lhcathome/result.php?resultid=415294829 https://lhcathome.cern.ch/lhcathome/result.php?resultid=415291440 https://lhcathome.cern.ch/lhcathome/result.php?resultid=415268645 https://lhcathome.cern.ch/lhcathome/result.php?resultid=415244533 Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415372751 |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
I think it was the data center. It is now running OK with 2 or 3 ATLAS tasks at 6 CPUs each. In a few hours I will let 10 tasks with 6 CPUs each run. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"All 90 ATLAS tasks from yesterday and today were unsuccessful..." I had already increased the RAM to 5120 MB without success. Of the tasks that reached the event processing part, the highest number of processed events was 40 out of 50. Then the processing suddenly ended and the task was returned without a HITS-file. I looked around and see that other crunchers return valid results with the HITS-file, so I will try again and have increased the RAM to 5700 MB. |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
This app_config.xml runs without problems: `<app_config> <app> <name>ATLAS</name> </app> <app_version> <app_name>ATLAS</app_name> <avg_ncpus>6</avg_ncpus> <plan_class>vbox64_mt_mcore_atlas</plan_class> <cmdline>--memory_size_mb 4750</cmdline> </app_version> <project_max_concurrent>10</project_max_concurrent> </app_config>` In a few hours I will test 10 ATLAS tasks with 6 CPUs each. At the moment WCG needs some time before the BOINC Manager is free to work on ATLAS. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
I finally got a task that did 50 events, but now the job had to do 100 events, so after 66 events there was the same behaviour: all 4 event-processing processes suddenly went from 99% to 0%. https://lhcathome.cern.ch/lhcathome/result.php?resultid=415487700 Valid but no HITS-file. 2024-10-29 14:53:36 (6260): Status Report: Elapsed Time: '6000.000000' 2024-10-29 14:53:36 (6260): Status Report: CPU Time: '247.031250' 2024-10-29 15:59:45 (6260): Guest Log: *** Job finished *** 2024-10-29 15:59:45 (6260): Guest Log: *** The last 20 lines of the pilot log: *** 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:32,829 | INFO | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>'] 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:34,849 | INFO | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>'] 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:36,865 | INFO | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>'] 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:37,388 | INFO | waiting for thread to finish: ['<_MainThread(MainThread, started 139780515440448)>', '<ExcThread(queue_monitor, started 139780071335680)>'] 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:37,388 | INFO | [job] queue monitor thread has finished 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:38,904 | INFO | caller=run is remaining thread - safe to abort (names=['<_MainThread(MainThread, started 139780515440448)>']) 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO | all workflow threads have been joined 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO | end of generic workflow (traces error code: 0) 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO | traces error code: 0 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,905 | INFO | pilot has finished (exit code=0, shell exit code=0) 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,950 [wrapper] ==== pilot stdout END ==== 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,952 [wrapper] ==== wrapper stdout RESUME ==== 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,955 [wrapper] pilotpid: 5995 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,957 [wrapper] Pilot exit status: 0 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,971 [wrapper] pandaids: 6382409099 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,986 [wrapper] cleanup supervisor_pilot 14668 5996 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,988 [wrapper] Test setup, not cleaning 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,991 [wrapper] ==== wrapper stdout END ==== 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,993 [wrapper] ==== wrapper stderr END ==== 2024-10-29 15:59:45 (6260): Guest Log: 2024-10-29 14:59:43,998 [wrapper] apfmon messages muted |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
Using your link shows this ATLAS task: Run time 2 hours 46 min 33 sec. CPU time 1 hour 59 min 52 sec. Validate state: Valid. Credit 1,333.29. Successful. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"Using your link shows this ATLAS task:" Valid, but no HITS-file. |
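
Since a task in this thread can be reported as valid and still carry error codes such as 65, 68 and 1305 in its log, it may help to tally those codes over several saved task logs. The following is only a minimal sketch in Python, assuming the log text has been copied by hand from the task result pages and that the fields look like the `"exeErrorCode": 65,` lines quoted above. The function names `extract_error_codes` and `tally_error_codes` are hypothetical and not part of BOINC or the ATLAS pilot.

```python
import re
from collections import Counter

# Matches fields such as:  "exeErrorCode": 65,   or   "pilotErrorCode": 1305,
# The two field names are taken from the log excerpts quoted in this thread.
CODE_RE = re.compile(r'"(exeErrorCode|pilotErrorCode)":\s*(\d+)')


def extract_error_codes(log_text):
    """Return the set of (field_name, code) pairs found in one task's log text."""
    return {(name, int(value)) for name, value in CODE_RE.findall(log_text)}


def tally_error_codes(logs):
    """Count in how many task logs each (field_name, code) pair appears."""
    counts = Counter()
    for log_text in logs:
        counts.update(extract_error_codes(log_text))
    return counts


# Two shortened excerpts in the format quoted above (diagnostic text omitted).
log_a = '''
2024-10-13 17:16:19 (12584): Guest Log: "exeErrorCode": 65,
2024-10-13 17:16:19 (12584): Guest Log: "pilotErrorCode": 1305,
'''
log_b = '''
Guest Log: "exeErrorCode": 68,
Guest Log: "pilotErrorCode": 1305,
'''

if __name__ == "__main__":
    for (field, code), n in sorted(tally_error_codes([log_a, log_b]).items()):
        print(f"{field} {code}: seen in {n} task log(s)")
```

Each pair is counted once per log, so a code that is repeated inside a single task log does not inflate the tally. A log without any of these fields simply contributes nothing.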
©2024 CERN