ATLAS issues

Author	Message
tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36554 - Posted: 25 Aug 2018, 7:47:02 UTC All Atlas tasks on my main Linux host produce a HITS file, unlike those on the Windows 10 PC. Tullio

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36555 - Posted: 25 Aug 2018, 9:14:36 UTC - in response to Message 36554. Quote: "All Atlas tasks on my main Linux host produce a HITS file, unlike those on the Windows 10 PC. Tullio" HITS file reporting in stderr output from ATLAS VBox tasks is totally unreliable. Here is one of your recently completed ATLAS VBox tasks... https://lhcathome.cern.ch/lhcathome/result.php?resultid=205493869. The stderr output says nothing about HITS. It doesn't say HITS file successfully produced, nor does it say failure to produce HITS. However, if you check the bigpanda report for that task at https://bigpanda.cern.ch/job?pandaid=3993741609 you see that it did produce a HITS file.

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36556 - Posted: 25 Aug 2018, 12:59:18 UTC - in response to Message 36555. Last modified: 25 Aug 2018, 13:00:16 UTC Thanks bronco. Atlas and SixTrack tasks are the only ones running on both a Windows 10 PC and two Linux hosts. All the others fail with "condor job not running". Yet the PC has 22 GB RAM and 4 cores (but the Windows Task Manager says 2 cores and 4 logical processors) while the Linux boxen have only 2 cores and 8 GB RAM. I am running 2 core tasks on the Windows 10 PC and one core tasks only on the Linux boxen, with SuSE Leap 42.3 and 15.0. Why the version number of SuSE Leap went back from 42.3 to 15.0 I don't know, but I suspect it has something to do with SuSE Linux Enterprise System, which is now 15.0 and is optimized to connect to Microsoft Azure Cloud, which I certainly won't do. Tullio

maeax Joined: 2 May 07 Posts: 2266 Credit: 175,669,348 RAC: 2,570	Message 36557 - Posted: 25 Aug 2018, 15:41:02 UTC - in response to Message 36556. Yes Tullio, OpenSuse 15.0 has the same kernel as the Enterprise version.

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36558 - Posted: 25 Aug 2018, 15:48:16 UTC - in response to Message 36557. I am running on it two Einstein@home Continuous Gravitational Wave Search tasks which, according to Bruce Allen, chief of Einstein@home, and his wife Maria Alessandra Papa, lead scientist for gravitational wave searches, I should not have received, but I got them and am running them. Tullio

Erich56 Joined: 18 Dec 15 Posts: 1868 Credit: 136,076,137 RAC: 90,780	Message 36559 - Posted: 25 Aug 2018, 19:24:44 UTC - in response to Message 36555. Quote: "HITS file reporting in stderr output from ATLAS VBox tasks is totally unreliable. Here is one of your recently completed ATLAS VBox tasks... https://lhcathome.cern.ch/lhcathome/result.php?resultid=205493869. The stderr output says nothing about HITS. It doesn't say HITS file successfully produced, nor does it say failure to produce HITS. However, if you check the bigpanda report for that task at https://bigpanda.cern.ch/job?pandaid=3993741609 you see that it did produce a HITS file." I have had the same experience many times. So, yes, whatever stderr is saying may not mean a thing. Maybe it depends on the BOINC version and/or the VBox version whether stderr shows correctly or not. Because in my case, a HITS file is shown every time now in the stderr since I updated BOINC and VBox. In any case, one can always check back in bigpanda: what's shown there is fact.
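The distinction drawn in the posts above (stderr reporting success, reporting failure, or saying nothing at all) can be sketched as a small check. This is only an illustration: the function name `hits_status` and the keyword matching are assumptions, the real stderr wording may differ, and a "not reported" result still has to be checked against the bigpanda page for the job.

```python
def hits_status(stderr_text):
    """Classify what a task's stderr text says about the HITS file.

    Returns "produced", "failed", or "not reported". The last value only
    means stderr is silent; the job may still have produced a HITS file.
    The keywords below are assumptions, not the actual log format.
    """
    for line in stderr_text.splitlines():
        lowered = line.lower()
        if "hits" not in lowered:
            continue  # line does not mention HITS at all
        if "success" in lowered:
            return "produced"
        if "fail" in lowered:
            return "failed"
    return "not reported"


# Invented sample text, not copied from a real task.
print(hits_status("VM started\nVM shut down cleanly"))
```

The sample input mentions HITS nowhere, so it falls into the third case, which is the situation bronco describes for task 205493869.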

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36560 - Posted: 26 Aug 2018, 1:55:14 UTC - in response to Message 36559. My BOINC on the Windows PC is 7.12.1 and VBox 5.2.18. The Linux host has BOINC 7.8.3 and VBox 5.2.16. It always reports a HITS file in the stderr.txt file. Tullio

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36561 - Posted: 26 Aug 2018, 7:16:50 UTC My latest Windows task reports a HITS file. Tullio

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1199 Credit: 67,425,905 RAC: 89,512	Message 36562 - Posted: 26 Aug 2018, 7:43:59 UTC - in response to Message 36561. Quote: "My latest Windows task reports a HITS file. Tullio" https://lhcathome.cern.ch/lhcathome/result.php?resultid=206025343

Erich56 Joined: 18 Dec 15 Posts: 1868 Credit: 136,076,137 RAC: 90,780	Message 36563 - Posted: 26 Aug 2018, 11:59:53 UTC - in response to Message 36562. Quote: "https://lhcathome.cern.ch/lhcathome/result.php?resultid=206025343" Seeing this here too, I am wondering what the notice "2018-08-26 03:58:46 (7576): Error creating VirtualBox instance! rc = 0x80004002" right at the beginning of the stderr means. I have this in all of my ATLAS tasks, regardless of which VBox version I have been using over the past years.

maeax Joined: 2 May 07 Posts: 2266 Credit: 175,669,348 RAC: 2,570	Message 36564 - Posted: 26 Aug 2018, 12:35:34 UTC - in response to Message 36563. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4620&postid=35483#35483

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36565 - Posted: 26 Aug 2018, 16:19:49 UTC In the 14 August issue of "Nature" there is an article about the Atlas strategy. At least some in the Atlas Collaboration want to search not only already simulated events, like those processed by BOINC users, but all kinds of events. While this will require more processing power on the one hand, on the other it may diminish the importance of what we are doing. I haven't read any comment on this subject here. Tullio

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36566 - Posted: 26 Aug 2018, 16:44:49 UTC - in response to Message 36564. Quote: "https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4620&postid=35483#35483" translation: do not allow your spirit to be caught up in the madness of VBox, run ATLAS native and rejoice on the path of least resistance

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 36567 - Posted: 26 Aug 2018, 18:18:51 UTC - in response to Message 36565. Quote: "In the 14 August issue of 'Nature' there is an article about the Atlas strategy. At least some in the Atlas Collaboration want to search not only already simulated events, like those processed by BOINC users, but all kinds of events. While this will require more processing power on the one hand, on the other it may diminish the importance of what we are doing. I haven't read any comment on this subject here. Tullio" Here is the link: https://www.nature.com/articles/d41586-018-05972-7 But why can't we help out with AI using BOINC? On GPUGrid, their Quantum Chemistry project uses BOINC to train their system for machine learning. They can then use the results to run on GPUs in-house. They are estimating energies and forces, but I don't know why it could not be applied to other areas. http://www.gpugrid.net/forum_thread.php?id=4707&nowrap=true#49606

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36568 - Posted: 27 Aug 2018, 8:10:49 UTC - in response to Message 36567. I am running GPUGRID, both the CPU tasks and GPU tasks, on a Linux box. On my Windows 10 PC, the GTX 1050 Ti (not overclocked) gets too hot (80 C) and the computing stops. Tullio

maeax Joined: 2 May 07 Posts: 2266 Credit: 175,669,348 RAC: 2,570	Message 36569 - Posted: 27 Aug 2018, 8:26:37 UTC - in response to Message 36566. Quote: "translation: do not allow your spirit to be caught up in the madness of VBox, run ATLAS native and rejoice on the path of least resistance" Without VBox, Atlas and Windows have a problem. Tullio, I now have OpenSuse 15.0 active. WCG..., but I will be testing Atlas native!

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36570 - Posted: 27 Aug 2018, 10:19:37 UTC - in response to Message 36569. OK, let me know if you succeed in running Atlas native on Leap 15.0. It is running long Einstein@home tasks very well. Tullio

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 36605 - Posted: 1 Sep 2018, 23:33:33 UTC I am running a one-core Atlas task on my HP Linux laptop with an AMD E-450 CPU. Strangely enough, the "top" command shows a CPU usage for VBoxHeadless that can reach 117%. Tullio

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36606 - Posted: 2 Sep 2018, 2:20:41 UTC - in response to Message 36605. Not so strange. It happens here too. In fact, if you watch closely enough and for long enough you'll see that it happens on every host. It's not an indication that something has gone awry. It's because CPU cycles are hard to count (and account for). Top doesn't see everything, and if you do "man top" and pore over the minutiae you'll see that top isn't even the accountant. Top is merely the poor SOB that collects the reports from other accountants and tries to assemble all the details into a sensible report for users. The final report is close but it's not 100% accurate. Also, tasks sometimes run on more than 1 core briefly even though they are "assigned" a single core. The whole notion of core assignment and core affinity is far more complicated than the average user realizes. In the BOINC world we bandy the term "assigned cores" around as if it's written in stone, but in reality it's just a convenient concept that assists BOINC devs in creating more or less accurate algorithms and code for estimating how many tasks can be downloaded and completed before deadline.
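A rough illustration of the arithmetic behind a reading above 100%: in its usual per-core mode, top expresses a process's CPU time as a percentage of one core and adds up all of the process's threads, so a single-vCPU VM with a couple of busy helper threads can exceed 100%. The sketch below is not how top is implemented; the function name `cpu_percent` and the per-thread figures are invented for the example.

```python
def cpu_percent(thread_cpu_seconds, wall_seconds):
    """Percent of ONE core used by a process over a sampling interval.

    thread_cpu_seconds: CPU time (user + system) that each thread of the
        process consumed during the interval.
    wall_seconds: length of the interval in wall-clock seconds.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * sum(thread_cpu_seconds) / wall_seconds


# Hypothetical 3-second sample of a VM process: one thread running the
# guest's single virtual CPU flat out, plus two helper threads doing I/O.
sample = [3.0, 0.3, 0.21]
print(f"{cpu_percent(sample, 3.0):.0f}%")  # prints 117%
```

With these assumed numbers the guest's one "assigned" core accounts for 100% and the helper threads supply the remaining 17%.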

cIsCo Joined: 30 Aug 18 Posts: 3 Credit: 1,002 RAC: 0	Message 36706 - Posted: 14 Sep 2018, 11:50:32 UTC I have recently started to devote some of my computer's time to LHC Atlas jobs. Most of the jobs have failed, and the indication comes only after more than 10 hours of processing time have been given to them. It would be better if the code could detect when things are going wrong and give proper messages so the issue can be resolved, rather than the job just terminating or going invalid. Recent example: a job showed an estimate of 16 hours at the start, then ran for nearly 2 days, and at the end all I got was a big ZERO with a Validate error message. 1) There isn't any proper indication when something is going wrong. I don't understand what failed in the validation, but I doubt every sub-job had issues. 2) The completion time estimate should be somewhat accurate. https://lhcathome.cern.ch/lhcathome/result.php?resultid=206483510 - Can someone take a look and let me know what exactly failed here? Thanks.

LHC@home