ATLAS application: Wrong WU?
Author | Message |
---|---|
Joined: 12 Jul 11 Posts: 95 Credit: 1,129,876 RAC: 0 |
Hi, I was forced to cancel this WU after more than 2 days of supposed crunching (6 cores). I then realized it had not been using any CPU for a long time: it was stuck at 99.9x% (not moving) for more than one day. When I looked at the console, it was full of hexadecimal garbage (machine language, perhaps?). What I see in the log is not always nice. Do you think it was really a bad WU? |
Joined: 18 Dec 15 Posts: 1843 Credit: 126,745,274 RAC: 130,059 |
Can any of the experts tell me what was wrong with this 2-core ATLAS WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=187842098 I had to abort it this morning, after some 22 hours of runtime, when I realized that its CPU usage had dropped to 0. When I opened the VM console, the upper part said: init: rc main process (29317) terminated with status 126. Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000007 What I had already noticed last evening was that, when opening console 3, I saw 4 athena.py processes running (instead of 2), and the RAM usage was close to 8000 MB, which is the figure I set in the app_config for the 2-core ATLAS tasks. So maybe sometime during the night these 8000 MB were reached, causing the task to fail? What I am wondering is: why does a 2-core task use 4 athena.py processes? Why does a 2-core task need more than 8000 MB RAM? Was this task misconfigured to begin with? Too bad for the wasted CPU time (22 hours for 2 cores) :-((( |
Joined: 12 Jul 11 Posts: 95 Credit: 1,129,876 RAC: 0 |
That's the beauty of VM-based applications: great advantages from a support point of view ("easy" multi-platform), but you have two extra layers that can cause failures: the VM itself, plus the extra layer of communication between BOINC and the VM (the wrapper)... I've never used app_config to limit the RAM of running tasks; maybe this parameter is not properly handled by the combination BOINC / VM / LHC? I only use the 2 parameters on the website to limit tasks to one and cores to 6 out of 8. |
Joined: 14 Jan 10 Posts: 1440 Credit: 9,663,908 RAC: 1,357 |
Can any of the experts tell me what was wrong with this 2-core ATLAS WU: There are often more athena.py processes, but only the 2 designated ones will run at 100% CPU. The strange thing I see in your result is that the line Starting ATLAS job. (PandaID=3905621127 taskID=13756616) appears 3 times, 2 of them within the same minute. Normally that line is shown only once. |
Joined: 18 Dec 15 Posts: 1843 Credit: 126,745,274 RAC: 130,059 |
I currently have another strange 3-core ATLAS task running: in console 3, I see 6 active athena.py processes. And of the total 10,000 MB RAM which I allocated via app_config (which is normally more than sufficient), more than 9,000 MB are already used up. At this point, 45 events have been processed (as shown on console 2), and the task has been running for almost 8 hours. I am wondering whether I should kill it right away, since I suspect that the same thing will happen as described in my posting above from April 21: the process will most likely run out of RAM :-((( No idea what's happening with the ATLAS tasks lately. Some of them seem to be faulty :-( |
Joined: 12 Jul 11 Posts: 95 Credit: 1,129,876 RAC: 0 |
What are "console 3" and "console 2"? On my Mac (using CoRD) I don't see any choice or option when I open the console... |
Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0 |
Here is a post from David Cameron which explains how to do it: We have added some information on the processed events in ATLAS tasks on consoles inside the VM. Some improvements have been made since then, but the procedure hasn't changed: F1 --> console 1, F2 --> console 2, and so on ... |
Joined: 18 Dec 15 Posts: 1843 Credit: 126,745,274 RAC: 130,059 |
Earlier I wrote: I am just having another strange 3-core ATLAS task running - in console 3, I see 6 active athena.py. In order to rescue the task, I applied a procedure suggested by Crystal Pellet in this posting: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4467&postid=32777#32777 (besides increasing the rsc_disk_bound value, I also increased the rsc_memory_bound value). What I overlooked, though, was that by adding a "0" to the disk value, I increased it by a factor of 10 (which would definitely not have been necessary); thus, after restarting BOINC plus the ATLAS task, the manager immediately showed a notice to the effect that the rsc_disk_bound value exceeded the disk space (or so), and the task was hence aborted :-( This was really annoying after a crunching time of 8 hours with 3 cores. Waste of resources :-( I am still wondering why there have lately been such faulty 3-core tasks, whereas no such problem occurred with 2-core tasks. |
Joined: 12 Jul 11 Posts: 95 Credit: 1,129,876 RAC: 0 |
Thanks Philippe!! I knew about the console but not about the Fn keys! I'm currently running an LHC-dev Theory simulation WU on my Mac, and there are actually 8 different pages of various types of information in the console (for that app), including a (working) top page! Most of them are completely obscure to me, but that's great :) Thanks again. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
I have a new computer with 8 CPU cores and 16 threads. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=96506115 OS: Windows 8 Pro. Both BOINC 7.10.2 with VirtualBox 5.2.8 and BOINC 7.8.3 with VirtualBox 5.1.26 finish ATLAS tasks in 11 minutes with credits, but without processing any events. I have no cc_config.xml. VirtualBox reports 4.4 GByte for the BOINC VM. Console F2 shows the line saying that events will appear here, but no events are computed. Does anyone have a good idea what the reason could be? Thank you. Edit: SVM hardware acceleration (AMD-V) is enabled. |
Joined: 15 Jun 08 Posts: 2607 Credit: 262,635,671 RAC: 139,546 |
It may be one of those ATLAS batches that need slightly more RAM during startup. You could configure 4800 MB via app_config.xml. From your log: 2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.execute 2018-06-06 07:14:28,355 INFO EVNTtoHITS executor returns 65 2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,272 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65) 2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,289 INFO Scanning logfile log.EVNTtoHITS for errors 2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:29,653 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider" 2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:32,829 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider") ... Have no cc_config.xml. ... I guess you mean app_config.xml? |
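The 4800 MB suggestion could be written as an app_config.xml along the following lines. This is only a sketch, not a verified configuration: the app name, the plan class and the --memory_size_mb option are copied from the app_config.xml posted further down in this thread, and the avg_ncpus value of 2 is likewise just carried over from that file and would need to match the number of cores actually wanted.

```xml
<!-- Sketch only: values other than 4800 are copied from the app_config.xml
     posted later in this thread and are assumptions for any other host. -->
<app_config>
  <app>
    <name>ATLAS</name>
    <!-- run at most one ATLAS task at a time -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <!-- assumed core count; adjust to the local setup -->
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- RAM given to the VM, in MB (the value suggested above) -->
    <cmdline>--memory_size_mb 4800</cmdline>
  </app_version>
</app_config>
```

The file belongs in the LHC@home project folder inside the BOINC data directory, and the client has to re-read its config files (or be restarted) before the next VM starts, otherwise the old memory size is still used.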
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
Guess you mean app_config.xml? OMG, a typo. I will run a test with more RAM. At the moment I am migrating to Win 8.1. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
I have now migrated to Win10 Pro and set up an app_config.xml. The task again finished as successful after a short time. https://lhcathome.cern.ch/lhcathome/result.php?resultid=197671368 Edit: The next step, tomorrow, is the Linux native app on SL69. |
Joined: 15 Jun 08 Posts: 2607 Credit: 262,635,671 RAC: 139,546 |
The RAM setting is still 4400 MB. Are you sure you reloaded app_config.xml before the VM started? BTW: the CPU throttle is set to 95%. I would set it to 100% to make sure this doesn't cause the error. |
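The throttle is normally changed in the BOINC Manager computing preferences or in the project's web preferences. As a hedged alternative, a local override file could set it; the sketch below assumes a global_prefs_override.xml in the BOINC data directory and shows only the one relevant element, so it is an illustration rather than a complete preferences file.

```xml
<!-- Sketch only: local preference override, assuming the standard
     global_prefs_override.xml in the BOINC data directory. -->
<global_preferences>
  <!-- "Use at most N % of CPU time"; 100 disables throttling -->
  <cpu_usage_limit>100</cpu_usage_limit>
</global_preferences>
```

Settings in this file take precedence over the web preferences on that host, so it is worth removing the entry again once the test is done.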
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
Yes, I think so, but... I will check it after the Win10 Pro updates. Installing, migrating and testing was enough for today. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
Normally I have no app_config for ATLAS. Upgraded to VirtualBox 5.2.12. Windows 10 Pro is now at 10.0.17134. The network bridge for the Intel Gigabit network card was not detected by the old VirtualBox; before, I only had Realtek network cards. Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CernVM-FS. I hope Atlasathome/boinc_conf can be reactivated by CERN IT. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CernVM-FS. The website is a placeholder, but the link to download the SL69 files still works. So the SL69 native app is now running ATLAS with the Intel network card on the AMD board. I will run a new test with Windows tomorrow. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
Will run a new test with Windows tomorrow. This message comes first (BOINC 7.8.3 and VirtualBox 5.2.12): Error creating VirtualBox instance! rc = 0x80004002 https://lhcathome.cern.ch/lhcathome/result.php?resultid=198376361 Task finished successfully? Setting Memory Size for VM. (4400MB) - OK, I will define an app_config.xml and upgrade BOINC to 7.10.2 for the next run. Edit: BTW, SL69 runs ATLAS correctly on the same computer. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
This message comes first (BOINC 7.8.3 and VirtualBox 5.2.12). This bug is registered in the VirtualBox bug tracker: #17795: defect: Failed to instantiate CLSID_VirtualBox w/ IVirtualBox, CLSID_VirtualBox w/ ... (new) ... com/en-us/kb/316911 . with the code E_NOINTERFACE (0x80004002) and the component VirtualBoxClientWrap and the interface {d2937a8e-cb8d-4382-90ba-b7da78a74573} I tried to run the program with different compatibility settings like Windows Vista. I also tried to "repair" it with the insta ... By besutoxu, 05/31/2018 09:49:41 AM |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
I have this app_config.xml: <app_config> <app> <name>ATLAS</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>ATLAS</app_name> <avg_ncpus>2</avg_ncpus> <plan_class>vbox64_mt_mcore_atlas</plan_class> <cmdline>--memory_size_mb 7000</cmdline> </app_version> </app_config> with BOINC 7.10.2 and VirtualBox 5.2.12 on this new computer: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10548292 I don't know why the credit is several thousand points after finishing an ATLAS task. EDIT: No SSD, only HDD! The ATLAS task has a HITS file. |
©2025 CERN