Message boards : ATLAS application : Wrong WU ?
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 0
Message 34422 - Posted: 19 Feb 2018, 22:34:01 UTC

Hi

I had to cancel this WU after more than 2 days of supposed crunching (6 cores). I realized it had not been using any CPU for a long time: it was stuck at 99.9x% (not moving) for more than a day, and when I looked at the console it was full of hexadecimal garbage (sorry, machine language?).

What I see in the log is not always nice.

Do you think it was really a bad WU ?
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,350,120
RAC: 41,051
Message 35060 - Posted: 21 Apr 2018, 5:45:58 UTC

Can anyone of the experts tell me what was wrong with this 2-core ATLAS WU:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=187842098

I had to abort it this morning, after some 22 hours of runtime, when I realized that its CPU usage had dropped to 0. When I opened the VM console, the upper part said:

init: rc main process (29317) terminated with status 126.
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000007


What I had already noticed last evening was that, when opening console 3, I saw 4 athena.py processes running (instead of 2), and RAM usage was close to 8000MB, which is the figure I set in the app_config for 2-core ATLAS tasks. So maybe sometime during the night these 8000MB were reached, causing the task to fail?

What I am wondering is: why does a 2-core task use 4 athena.py processes? Why does a 2-core task need more than 8000MB RAM?

Was this task mis-configured to begin with? Too bad for the wasted CPU time (22 hours for 2 cores) :-(((
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 0
Message 35061 - Posted: 21 Apr 2018, 8:29:48 UTC
Last modified: 21 Apr 2018, 8:30:09 UTC

That's the beauty of VM-based applications: great advantages from a support point of view ("easy" multi-platform), but you have two extra layers that can cause failures: the VM itself, plus the communication layer between BOINC and the VM (the wrapper)...

I've never used app_config to limit RAM of running tasks, maybe this parameter is not properly handled by the combination boinc / VM / LHC ?

I only use the 2 parameters on the website to limit tasks to one and cores to 6 out of 8.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1093
Credit: 6,827,780
RAC: 791
Message 35075 - Posted: 22 Apr 2018, 19:12:19 UTC - in response to Message 35060.  

Can anyone of the experts tell me what was wrong with this 2-core ATLAS WU:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=187842098

There are often more athena.py processes, but only the 2 designated ones will run at 100% CPU.
The strange thing I see in your result is that this line appears 3 times: Starting ATLAS job. (PandaID=3905621127 taskID=13756616),
two of them within the same minute. Normally that line is shown only once.
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,350,120
RAC: 41,051
Message 35085 - Posted: 24 Apr 2018, 11:24:50 UTC

I am just having another strange 3-core ATLAS task running - in console 3, I see 6 active athena.py.
And from the total 10.000MB RAM which I allocated via app_config (which is normally more than sufficient), more than 9.000 are used up already.
At this point, 45 events have been processed (as shown via console 2), the task has been running for almost 8 hours.
I am wondering whether I should kill it right away, since I suspect that the same thing will happen as described in my posting above from April 21: most likely the process will run out of RAM :-(((

No idea what's happening with the ATLAS tasks lately. Some of them seem to be faulty :-(
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 0
Message 35086 - Posted: 24 Apr 2018, 11:41:17 UTC

What is "console 3, console 2" ? On my Mac (using CoRD) I don't see any choice / option when I open the console... ?
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 35088 - Posted: 24 Apr 2018, 17:23:24 UTC - in response to Message 35086.  
Last modified: 24 Apr 2018, 17:25:40 UTC

Here is a post from David Cameron which explains how to do it:

We have added some information on the processed events in ATLAS tasks on consoles inside the VM.

To show the consoles, go to the advanced view of BOINC manager, select a running ATLAS task and you should see the button "Show VM Console" on the left menu. If you do not see this button you may need to install the VirtualBox extension pack and/or install remote desktop software such as CoRD on Mac OS or xfreerdp on Linux. There should be remote desktop software included by default on Windows but maybe someone else can confirm this.

When you click "Show VM Console" you should see a terminal window with a login prompt. If you press Alt-F2 (Alt-Fn-F2 on Mac) you should see a screen like this:


NOTE you will only see this information after the task has been running for some time, i.e. has simulated at least 1 event. So please wait up to 30 minutes for information to appear.

This output shows the number of events processed by each core, as well as the time per event and the average time per event so far. Each core has its own independent counter which is why you see the event numbers repeated. In the example there are 4 cores and with 100 events per task each core will process 25 events each. This information therefore can give you an estimate of how long the task will run.

We are working on putting the "top" output into console 3 (Alt-F3) but it doesn't quite work perfectly yet.
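As a rough illustration of the runtime estimate described in the quoted post (not part of the original post), here is a minimal Python sketch. It assumes events are split evenly across cores, as in the 4-core, 100-event example. The function name and the sample readings are hypothetical.

```python
import math


def estimate_remaining_seconds(total_events, n_cores, events_done_per_core, avg_seconds_per_event):
    """Rough time left, assuming events are split evenly across cores."""
    # Each core gets an equal share, e.g. 100 events / 4 cores = 25 events per core.
    events_per_core = math.ceil(total_events / n_cores)
    # The task ends when the slowest core finishes, so take the largest backlog.
    backlog = max(max(events_per_core - done, 0) for done in events_done_per_core)
    return backlog * avg_seconds_per_event


# Hypothetical readings from console 2: four cores, 300 s per event on average.
remaining = estimate_remaining_seconds(100, 4, [10, 12, 11, 9], 300.0)
print(f"about {remaining / 3600:.1f} hours left")
```

The per-core counters and the average time per event would be read by hand from console 2; the sketch only does the arithmetic.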


Some improvements have been made since then, but the procedure hasn't changed...

F1 --> console 1
F2 --> console 2
and so on ...
Erich56

Joined: 18 Dec 15
Posts: 1511
Credit: 42,350,120
RAC: 41,051
Message 35089 - Posted: 24 Apr 2018, 19:18:16 UTC - in response to Message 35085.  

Earlier I wrote:

I am just having another strange 3-core ATLAS task running - in console 3, I see 6 active athena.py.
And from the total 10.000MB RAM which I allocated via app_config (which is normally more than sufficient), more than 9.000 are used up already.
At this point, 45 events have been processed (as shown via console 2), the task has been running for almost 8 hours.
I am wondering whether I should kill it right away, since I suspect that the same thing will happen as described in my posting above from April 21: most likely the process will run out of RAM :-(((

No idea what's happening with the ATLAS tasks lately. Some of them seem to be faulty :-(

In order to rescue the task I applied a procedure suggested by Crystal Pellet in this posting:

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4467&postid=32777#32777

(besides increasing the rsc_disk_bound value, I also increased the rsc_memory_bound value).
What I overlooked, though, was that by adding a "0" to the disk value, I increased it by a factor of 10 (which would definitely not have been necessary). So, after restarting BOINC plus the ATLAS task, the manager immediately showed a notice to the effect that the disk_bound value exceeded the disk space (or so), and the task was therefore aborted :-(
This was really annoying, after a crunching time of 8 hours with 3 cores. Waste of resources :-(
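The arithmetic behind that mistake can be sketched as follows. This is a minimal Python illustration with made-up numbers; `scale_bound` is a hypothetical helper, and BOINC's own check may work differently.

```python
def scale_bound(current_bytes, factor, free_bytes):
    """Return a raised bound, refusing values larger than the free disk space."""
    new_bound = int(current_bytes * factor)
    if new_bound > free_bytes:
        raise ValueError(
            f"new bound {new_bound} exceeds free space {free_bytes}; use a smaller factor"
        )
    return new_bound


# Hypothetical values: an 8 GB bound on a disk with 50 GB free.
old_bound = 8_000_000_000
free = 50_000_000_000

print(scale_bound(old_bound, 1.5, free))  # a modest increase fits

# Appending a "0" to the number is the same as factor 10, which does not fit here.
try:
    scale_bound(old_bound, 10, free)
except ValueError as err:
    print("rejected:", err)
```

On a real machine the free space could be read with `shutil.disk_usage(path).free` before editing any value by hand.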

I am still wondering why lately there have been such faulty 3-core tasks, whereas no such problem occurred with 2-core tasks.
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 0
Message 35098 - Posted: 26 Apr 2018, 21:00:17 UTC

Thanks Philippe !! I knew about the console but not about the Fn functions !

I'm currently running a LHC-dev theory simulation WU on my Mac and there are actually 8 different pages of various types of information in the console (for that app), including a (working) top page !

Most of them are completely obscure for me but that's great :)

Thanks again.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35434 - Posted: 6 Jun 2018, 6:08:25 UTC
Last modified: 6 Jun 2018, 6:17:03 UTC

I have a new computer with 8 CPU cores and 16 threads.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=96506115

OS: Windows 8 Pro. With both BOINC 7.10.2 / VirtualBox 5.2.8 and BOINC 7.8.3 / VirtualBox 5.1.26,
ATLAS tasks finish in 11 minutes with credit, but without processing any events.

Have no cc_config.xml. VirtualBox says 4.400 GByte for the BOINC VM.

The console on F2 shows the line "events will appear here", but no events are computed.

Does anyone have an idea what the reason for this could be?
Thank you.
Edit: SVM hardware acceleration (AMD-V) is enabled.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1965
Credit: 139,424,685
RAC: 85,926
Message 35435 - Posted: 6 Jun 2018, 6:52:27 UTC - in response to Message 35434.  

It may be one of those ATLAS batches that needs slightly more RAM during startup.
You may configure 4800 MB via app_config.xml.


From your log:
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.execute 2018-06-06 07:14:28,355 INFO EVNTtoHITS executor returns 65
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,272 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,289 INFO Scanning logfile log.EVNTtoHITS for errors
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:29,653 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr     FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.transform.execute 2018-06-06 07:14:32,829 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr     FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
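To pick such lines out of a long result log, something like the following Python sketch may help. It assumes the line layout seen in the excerpt above ("Guest Log:", a logger name, a timestamp, a level, then the message); `find_problems` is a hypothetical helper, not part of BOINC or the ATLAS software.

```python
import re

# Assumed layout: "<host time> (<pid>): Guest Log: <logger> <guest time> <LEVEL> <message>"
GUEST_LINE = re.compile(
    r"Guest Log: (?P<logger>\S+) "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>INFO|WARNING|ERROR|CRITICAL) "
    r"(?P<message>.*)"
)


def find_problems(log_text, levels=("ERROR", "CRITICAL")):
    """Return (level, message) pairs for guest log lines at the given levels."""
    problems = []
    for line in log_text.splitlines():
        match = GUEST_LINE.search(line)
        if match and match.group("level") in levels:
            problems.append((match.group("level"), match.group("message")))
    return problems


# Two lines copied from the log excerpt above.
sample = """\
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.execute 2018-06-06 07:14:28,355 INFO EVNTtoHITS executor returns 65
2018-06-06 07:17:42 (3636): Guest Log: PyJobTransforms.trfExe.validate 2018-06-06 07:14:29,272 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
"""

for level, message in find_problems(sample):
    print(level, message)
```

Lines that do not follow the assumed layout are simply skipped, so the pattern would need adjusting for other log formats.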



... Have no cc_config.xml. ...

Guess you mean app_config.xml?
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35436 - Posted: 6 Jun 2018, 8:15:30 UTC - in response to Message 35435.  

Guess you mean app_config.xml?

OMG, typo. I will run a test with more RAM.
At the moment I am migrating to Win 8.1.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35437 - Posted: 6 Jun 2018, 12:49:46 UTC - in response to Message 35436.  
Last modified: 6 Jun 2018, 12:59:19 UTC

I have now migrated to Win10pro, with an app_config.xml.
The task also finished successfully in a short time.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=197671368
Edit: The next step, tomorrow, is the Linux native app on SL69.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1965
Credit: 139,424,685
RAC: 85,926
Message 35438 - Posted: 6 Jun 2018, 13:46:12 UTC - in response to Message 35437.  

RAM setting is still 4400 MB.
Are you sure you reloaded the app_config.xml before the VM start?

BTW:
The CPU throttle is set to 95%.
I would set it to 100% to ensure this doesn't cause the error.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35439 - Posted: 6 Jun 2018, 13:48:45 UTC

Yes, I think so,
but... I will check it after the Win10pro updates.
That was enough installing, migrating and testing for today...
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35445 - Posted: 7 Jun 2018, 5:04:46 UTC - in response to Message 35439.  

Normally I have no app_config for ATLAS.
Upgraded to VirtualBox 5.2.12. Windows10pro is now (10.0.17134).
The network bridge for the Intel Gigabit network card was not detected by the old VirtualBox.
Before, I only had Realtek network cards.
Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CERNVM-FS.
I hope Atlasathome/boinc_conf can be reactivated by CERN IT.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35448 - Posted: 7 Jun 2018, 21:03:10 UTC - in response to Message 35445.  
Last modified: 7 Jun 2018, 21:04:11 UTC

Atlasathome.cern.ch/boinc_conf is not available, so I need another way to install CERNVM-FS.
I hope Atlasathome/boinc_conf can be reactivated by CERN IT.

The website is a placeholder, but the link to download the SL69 files works.
So the SL69 native app is now running ATLAS with the Intel network card on the AMD board.
Will make a new test with Windows tomorrow.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35449 - Posted: 8 Jun 2018, 5:47:06 UTC - in response to Message 35448.  
Last modified: 8 Jun 2018, 5:53:51 UTC

Will make a new test with Windows tomorrow.

This message is coming first: Boinc 7.8.3 and Virtualbox 5.2.12
Error creating VirtualBox instance! rc = 0x80004002
https://lhcathome.cern.ch/lhcathome/result.php?resultid=198376361
Task finished successfully?
Setting Memory Size for VM. (4400MB) - OK, I will define an app_config.xml and
upgrade BOINC to 7.10.2 for the next run.
Edit: BTW, SL69 runs ATLAS correctly on the same computer.
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35483 - Posted: 11 Jun 2018, 17:14:05 UTC - in response to Message 35449.  

This message is coming first: Boinc 7.8.3 and Virtualbox 5.2.12
Error creating VirtualBox instance! rc = 0x80004002

This bug has been registered by VirtualBox:
#17795: defect: Failed to instantiate CLSID_VirtualBox w/ IVirtualBox, CLSID_VirtualBox w/ ... (new)
... com/en-us/kb/316911 . with the code E_NOINTERFACE (0x80004002) and the component VirtualBoxClientWrap and the interface {d2937a8e-cb8d-4382-90ba-b7da78a74573} I tried to run the program with different compatibility settings like Windows Vista. I also tried to "repair" it with the insta ...
By besutoxu — 05/31/2018 09:49:41 AM
maeax

Joined: 2 May 07
Posts: 1508
Credit: 48,236,172
RAC: 119,461
Message 35533 - Posted: 16 Jun 2018, 9:45:10 UTC
Last modified: 16 Jun 2018, 10:25:06 UTC

Have this app_config.xml:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 7000</cmdline>
  </app_version>
</app_config>
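Since it was unclear earlier in the thread whether the RAM setting was actually picked up, a quick sanity check of the file itself can be sketched in Python. This only verifies that the XML is well formed and shows which value it requests; it does not prove that BOINC has re-read the file. `requested_memory_mb` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The app_config.xml from the post above.
APP_CONFIG = """\
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 7000</cmdline>
  </app_version>
</app_config>
"""


def requested_memory_mb(xml_text, app_name="ATLAS"):
    """Return the --memory_size_mb value requested for app_name, or None."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    for version in root.iter("app_version"):
        if version.findtext("app_name") != app_name:
            continue
        args = (version.findtext("cmdline") or "").split()
        if "--memory_size_mb" in args:
            idx = args.index("--memory_size_mb")
            if idx + 1 < len(args):
                return int(args[idx + 1])
    return None


print(requested_memory_mb(APP_CONFIG))
```

The value found here can then be compared with the "Setting Memory Size for VM" line in the task's result log.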

Boinc_7.10.2 and Virtualbox 5.2.12 for this new Computer:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10548292

I don't know why the credit is several thousand points after finishing an ATLAS task.
EDIT: No SSD, only HDD!
The ATLAS task has a HITS file.


©2022 CERN