Thread 'Wrong WU ?'

Author	Message
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 12 Jul 11 Posts: 120 Credit: 1,451,119 RAC: 0	Message 34422 - Posted: 19 Feb 2018, 22:34:01 UTC Hi, I was forced to cancel this WU after more than 2 days of supposed crunching (6 cores). I realized it had not been using any CPU for a long time: it was stuck at 99,9x% (not moving) for more than one day, and when I looked at the console it was full of hexadecimal garbage - sorry, machine language? What I see in the log is not always nice. Do you think it was really a bad WU? ID: 34422 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,749,507 RAC: 107,196	Message 35060 - Posted: 21 Apr 2018, 5:45:58 UTC Can any of the experts tell me what was wrong with this 2-core ATLAS WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=187842098 I had to abort it this morning, after some 22 hours of runtime, when I realized that its CPU usage had dropped to 0. When I opened the VM console, the upper part said: init: rc main process (29317) terminated with status 126. Kernel panic - not syncing: attempted to kill init! exitcode=0x0000007 What I had already noticed last evening was that console 3 showed 4 athena.py running (instead of 2), and the RAM usage was close to 8000MB, which is the figure I set in the app_config for the 2-core ATLAS tasks. So maybe sometime during the night these 8000MB were reached, causing the task to fail? What I am wondering is: why does a 2-core task use 4 athena.py processes? Why does a 2-core task need more than 8000MB RAM? Was this task misconfigured to begin with? Too bad for the wasted CPU time (22 hours for 2 cores) :-((( ID: 35060 · Reply Quote

[AF>Le_Pommier] Jerome_C2005 Send message Joined: 12 Jul 11 Posts: 120 Credit: 1,451,119 RAC: 0	Message 35061 - Posted: 21 Apr 2018, 8:29:48 UTC Last modified: 21 Apr 2018, 8:30:09 UTC That's the beauty of VM-based applications: great advantages from a support point of view ("easy" multi-platform), but you have two extra layers that can cause failures, the VM itself plus the extra layer of communication between boinc and the VM (the wrapper)... I've never used app_config to limit the RAM of running tasks; maybe this parameter is not properly handled by the combination boinc / VM / LHC? I only use the 2 parameters on the website to limit tasks to one and cores to 6 out of 8. ID: 35061 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1497 Credit: 9,990,921 RAC: 929	Message 35075 - Posted: 22 Apr 2018, 19:12:19 UTC - in response to Message 35060. Can anyone of the experts tell me what was wrong with this 2-core ATLAS WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=187842098 There are often more athena.py processes, but only the 2 designated ones will run at 100% CPU. The strange thing I see in your result is that this line appears 3 times: Starting ATLAS job. (PandaID=3905621127 taskID=13756616), with 2 of them in the same minute. Normally that line is only shown once. ID: 35075 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,749,507 RAC: 107,196	Message 35085 - Posted: 24 Apr 2018, 11:24:50 UTC I am just having another strange 3-core ATLAS task running - in console 3, I see 6 active athena.py. And from the total 10.000MB RAM which I allocated via app_config (which is normally more than sufficient), more than 9.000 are used up already. At this point, 45 events have been processed (as shown via console 2), the task has been running for almost 8 hours. I am wondering whether I should kill it right away, since I suspect that the same thing will happen as described in my posting above from April 21 - the process will run out of RAM, most likely :-((( No idea what's happening with the ATLAS tasks lately. Some of them seem to be faulty :-( ID: 35085 · Reply Quote

[AF>Le_Pommier] Jerome_C2005 Send message Joined: 12 Jul 11 Posts: 120 Credit: 1,451,119 RAC: 0	Message 35086 - Posted: 24 Apr 2018, 11:41:17 UTC What is "console 3, console 2" ? On my Mac (using CoRD) I don't see any choice / option when I open the console... ? ID: 35086 · Reply Quote

PHILIPPE Send message Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0	Message 35088 - Posted: 24 Apr 2018, 17:23:24 UTC - in response to Message 35086. Last modified: 24 Apr 2018, 17:25:40 UTC Here is a post from David Cameron which explains how to do it: We have added some information on the processed events in ATLAS tasks on consoles inside the VM. To show the consoles, go to the advanced view of BOINC manager, select a running ATLAS task and you should see the button "Show VM Console" on the left menu. If you do not see this button you may need to install the VirtualBox extension pack and/or install remote desktop software such as CoRD on Mac OS or xfreerdp on Linux. There should be remote desktop software included by default on Windows but maybe someone else can confirm this. When you click "Show VM Console" you should see a terminal window with a login prompt. If you press Alt-F2 (Alt-Fn-F2 on Mac) you should see a screen like this: NOTE you will only see this information after the task has been running for some time, i.e. has simulated at least 1 event. So please wait up to 30 minutes for information to appear. This output shows the number of events processed by each core, as well as the time per event and the average time per event so far. Each core has its own independent counter which is why you see the event numbers repeated. In the example there are 4 cores and with 100 events per task each core will process 25 events. This information therefore can give you an estimate of how long the task will run. We are working on putting the "top" output into console 3 (Alt-F3) but it doesn't quite work perfectly yet. Some improvements have been made since then, but the way to do it hasn't changed... F1 --> console 1, F2 --> console 2, and so on ... ID: 35088 · Reply Quote
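As a rough illustration of the estimate mentioned in the quoted post, here is a minimal Python sketch. It assumes, as the post describes, that the events are split evenly across the cores (100 events on 4 cores means 25 each) and that the average time per event shown on console 2 stays roughly constant. The function name and the sample numbers are hypothetical, not taken from any ATLAS tool.

```python
def estimate_remaining_minutes(total_events, cores, events_done_per_core,
                               avg_minutes_per_event):
    """Rough remaining wall-clock time for a multi-core task.

    Assumes an even split of events across cores, as in the quoted
    example (100 events / 4 cores = 25 events per core).
    """
    events_per_core = total_events / cores
    # Events still to be processed by each core (never negative).
    remaining = [max(events_per_core - done, 0) for done in events_done_per_core]
    # The cores run in parallel, so the core with the most work left
    # determines when the task finishes.
    return max(remaining) * avg_minutes_per_event


# Hypothetical counters read from console 2 of a 4-core task.
minutes = estimate_remaining_minutes(100, 4, [10, 9, 11, 10], 12.0)
print(f"about {minutes:.0f} minutes left")
```

The real run time will drift from this figure, since the time per event varies from event to event and the task also needs time for startup and for writing its results.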

Erich56 Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,749,507 RAC: 107,196	Message 35089 - Posted: 24 Apr 2018, 19:18:16 UTC - in response to Message 35085. before I wrote: I am just having another strange 3-core ATLAS task running - in console 3, I see 6 active athena.py. And from the total 10.000MB RAM which I allocated via app_config (which is normally more than sufficient), more than 9.000 are used up already. At this point, 45 events have been processed (as shown via console 2), the task has been running for almost 8 hours. I am wondering whether I should kill it right away, since I suspect that the same thing will happen as described in my posting above from April 21 - the process will run out of RAM, most likely :-((( No idea what's happening with the ATLAS tasks lately. Some of them seem to be faulty :-( In order to rescue the task I applied a procedure suggested by Crystal Pellet in this posting: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4467&postid=32777#32777 (besides increasing the rsc_disk_bound value, I also increased the rsc_memory_bound value). What I overlooked, though, was that by adding a "0" to the disk value, I increased it by a factor of 10 (which would definitely not have been necessary); thus, after restarting BOINC plus the ATLAS task, the manager immediately showed a notice to the effect that the disk bound value exceeded the available disk space (or so), and the task was hence aborted :-( This was really annoying, after a crunching time of 8 hours with 3 cores. Waste of resources :-( I am still wondering why lately there have been such faulty 3-core tasks, whereas no such problem occurred with 2-core tasks. ID: 35089 · Reply Quote

[AF>Le_Pommier] Jerome_C2005 Send message Joined: 12 Jul 11 Posts: 120 Credit: 1,451,119 RAC: 0	Message 35098 - Posted: 26 Apr 2018, 21:00:17 UTC Thanks Philippe !! I knew about the console but not about the Fn functions ! I'm currently running a LHC-dev theory simulation WU on my Mac and there are actually 8 different pages of various types of information in the console (for that app), including a (working) top page ! Most of them are completely obscure for me but that's great :) Thanks again. ID: 35098 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35434 - Posted: 6 Jun 2018, 6:08:25 UTC Last modified: 6 Jun 2018, 6:17:03 UTC I have a new computer with 8 CPUs and 16 threads. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=96506115 OS: Windows 8 pro. Boinc 7.10.2 with Virtualbox 5.2.8 and Boinc 7.8.3 with Virtualbox 5.1.26 both finish Atlas tasks in 11 minutes with credits, but without processing any events. I have no cc_config.xml. Virtualbox says 4.400 GByte for the Boinc VM. The console shows the line saying that events will appear here in F2, but no events are computed. Does anyone have an idea what the reason could be? Thank you. Edit: SVM hardware acceleration (AMD-V) is enabled. ID: 35434 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 39,191	Message 35435 - Posted: 6 Jun 2018, 6:52:27 UTC - in response to Message 35434. Could be one of those ATLAS batches that need slightly more RAM during startup. You may configure 4800 MB via app_config.xml. From your log:
[pre]
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.execute 2018-06-06 07:14:28,355 INFO EVNTtoHITS executor returns 65
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,272 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,289 INFO Scanning logfile log.EVNTtoHITS for errors
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:29,653 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:32,829 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
[/pre]
... Have no cc_config.xml. ... Guess you mean app_config.xml? ID: 35435 · Reply Quote
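For anyone who wants to spot this kind of failure without reading the whole task log, the following Python sketch scans log text for the transform exit code and for quoted FATAL messages. The two sample lines are copied from the log quoted above; the function name and the regular expressions are illustrative assumptions, not part of BOINC or the ATLAS software.

```python
import re

# Two lines copied from the task log quoted in the post above.
SAMPLE_LOG = '''
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:29,653 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:32,829 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
'''


def summarize_failures(log_text):
    """Return (set of exit codes, set of quoted FATAL messages) found in the log."""
    # Matches e.g. "exiting early with exit code 65".
    exit_codes = {int(code) for code in re.findall(r"exit code (\d+)", log_text)}
    # Matches double-quoted fragments that contain the word FATAL.
    fatal_messages = set(re.findall(r'"([^"]*FATAL[^"]*)"', log_text))
    return exit_codes, fatal_messages


codes, fatals = summarize_failures(SAMPLE_LOG)
print("exit codes:", sorted(codes))
for message in sorted(fatals):
    print("fatal:", message)
```

A task that ends after a few minutes with a non-zero transform exit code like this one has not simulated any events, even if BOINC reports it as finished.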

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35436 - Posted: 6 Jun 2018, 8:15:30 UTC - in response to Message 35435. Guess you mean app_config.xml? OMG, typo, will make a test with more RAM. At the moment a migration to Win 8.1. ID: 35436 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35437 - Posted: 6 Jun 2018, 12:49:46 UTC - in response to Message 35436. Last modified: 6 Jun 2018, 12:59:19 UTC I have now migrated to Win10pro and added an app_config.xml. The task again finished "successfully" after a short time. https://lhcathome.cern.ch/lhcathome/result.php?resultid=197671368 Edit: The next step, tomorrow, is the Linux native app on SL69. ID: 35437 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 39,191	Message 35438 - Posted: 6 Jun 2018, 13:46:12 UTC - in response to Message 35437. RAM setting is still 4400 MB. Are you sure you reloaded the app_config.xml before the VM start? BTW: The CPU throttle is set to 95%. I would set it to 100% to ensure this doesn't cause the error. ID: 35438 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35439 - Posted: 6 Jun 2018, 13:48:45 UTC Yes, I think so, but... I will check it after the Win10pro updates. That was enough installing, migrating and testing for today.... ID: 35439 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35445 - Posted: 7 Jun 2018, 5:04:46 UTC - in response to Message 35439. Normally I have no app_config for Atlas. Upgraded to Virtualbox 5.2.12. Windows10pro is now (10.0.17134). The network bridge for the Intel Gigabit network card was not detected by the old Virtualbox. Before, I only had Realtek network cards. Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CERNVM-FS. I hope Atlasathome/boinc_conf can be reactivated by CERN IT. ID: 35445 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35448 - Posted: 7 Jun 2018, 21:03:10 UTC - in response to Message 35445. Last modified: 7 Jun 2018, 21:04:11 UTC Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CERNVM-FS. I hope Atlasathome/boinc_conf can be reactivated by CERN IT. The website is a placeholder, but the link to download the SL69 files still works. So the SL69 native app is now running Atlas with the Intel network card on the AMD board. I will make a new test with Windows tomorrow. ID: 35448 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35449 - Posted: 8 Jun 2018, 5:47:06 UTC - in response to Message 35448. Last modified: 8 Jun 2018, 5:53:51 UTC I will make a new test with Windows tomorrow. With Boinc 7.8.3 and Virtualbox 5.2.12, this message comes first: Error creating VirtualBox instance! rc = 0x80004002 https://lhcathome.cern.ch/lhcathome/result.php?resultid=198376361 Task finished successfully? Setting Memory Size for VM. (4400MB) - OK, I will define an app_config.xml and upgrade Boinc to 7.10.2 for the next run. Edit: btw, SL69 is running Atlas correctly on the same computer. ID: 35449 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35483 - Posted: 11 Jun 2018, 17:14:05 UTC - in response to Message 35449. This message is coming first: Boinc 7.8.3 and Virtualbox 5.2.12 Error creating VirtualBox instance! rc = 0x80004002 This bug is registered in the VirtualBox bug tracker: #17795: defect: Failed to instantiate CLSID_VirtualBox w/ IVirtualBox, CLSID_VirtualBox w/ ... (new) ... com/en-us/kb/316911 . with the code E_NOINTERFACE (0x80004002) and the component VirtualBoxClientWrap and the interface {d2937a8e-cb8d-4382-90ba-b7da78a74573} I tried to run the program with different compatibility settings like Windows Vista. I also tried to "repair" it with the insta ... By besutoxu, 05/31/2018 09:49:41 AM ID: 35483 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 277	Message 35533 - Posted: 16 Jun 2018, 9:45:10 UTC Last modified: 16 Jun 2018, 10:25:06 UTC I have this app_config.xml:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 7000</cmdline>
  </app_version>
</app_config>
Boinc 7.10.2 and Virtualbox 5.2.12 for this new computer: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10548292 I don't know why the credits run to several thousand after finishing an Atlas task. EDIT: No SSD, only HDD! The Atlas tasks have a HITS file. ID: 35533 · Reply Quote
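Since an earlier reply asked whether the app_config.xml had really been picked up (the task log still showed 4400 MB), a quick way to see what the file itself requests is to parse it. The Python sketch below reads the configuration shown in the post above and extracts the value passed with --memory_size_mb; the function name is hypothetical, and the XML is embedded as a string only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The app_config.xml from the post above, embedded for a self-contained example.
APP_CONFIG = """
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 7000</cmdline>
  </app_version>
</app_config>
"""


def requested_memory_mb(xml_text, app_name="ATLAS"):
    """Return the --memory_size_mb value set for app_name, or None if absent."""
    root = ET.fromstring(xml_text)
    for version in root.findall("app_version"):
        if version.findtext("app_name") != app_name:
            continue
        # The cmdline element holds space-separated wrapper arguments.
        tokens = (version.findtext("cmdline") or "").split()
        if "--memory_size_mb" in tokens:
            position = tokens.index("--memory_size_mb")
            if position + 1 < len(tokens):
                return int(tokens[position + 1])
    return None


print("memory requested in app_config.xml:", requested_memory_mb(APP_CONFIG), "MB")
```

If the value printed here differs from the "Setting Memory Size for VM" line in the task log, the configuration was probably not reloaded before the VM was created.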