Message boards : Sixtrack Application : Inconclusive, valid/invalid results
Author | Message |
---|---|
Joined: 30 Nov 16 Posts: 1 Credit: 9,189,774 RAC: 0 |
I receive messages 31064 and 31102 from LHC@home. My BOINC is running under Ubuntu 16.04 (Xenial) with Oracle VM VirtualBox 5.1.22 (Qt 5.5.1). I should be up to date. How can I help you with my PC? Do you have a checklist I can use to check my install? I am only an end user at this point in my understanding. Eric92. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "I receive 31064 and 31102 message from LHC@home" I'll be in touch soonest. I see you are apparently running Linux 4.8.0-58-generic. We think we have a SixTrack problem with this Linux version... Please don't do anything for the moment and I'll let you know. Eric. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
I found another workunit that could not be validated because it reached the maximum of 5 tries. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=71783751 |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "I found another workunit that could not be validated cause of the max number of 5 tries." Thanks, it should have been re-submitted. Sadly, some volunteers may lose credit because of this. Eric. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
It's still happening with new tasks. Look at this machine: All tasks for computer 10489186. State: All (5352) · In progress (0) · Validation pending (3726) · Validation inconclusive (1070) · Valid (3) · Invalid (502) · Error (51) ...and yes, it is running Linux 4.8, on a machine created on the 11th of July. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "It's still happening with new tasks. Look at this machine: All tasks for computer 10489186" Well, the weekend is coming up... Although kernel 4.8.0 is a NECESSARY but NOT SUFFICIENT condition, I am going to ban them all, with a NEWS item. (The vast majority of hosts are Windows.) This should help clear the backlog. You should receive my interim report shortly and a full report Monday at the latest. Thanks a lot. Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "Well the weekend is coming up....although 4.8.0 is a NECESSARY but" P.S. I think it needs to be Intel Family 6 as well. We shall see. |
Joined: 27 Sep 08 Posts: 850 Credit: 692,823,409 RAC: 77,584 |
I agree: on my Xeon v3 with Linux 4.8 I got good results. I think it must be a software problem, as the same CPUs must exist on Windows and have no problems as far as we can see? |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
Quote: "I agree on my Xeon v3 with the 4.8 Linux I got good results." Can you point me to the hostID? I don't see a Linux machine on your hosts list. |
Joined: 27 Sep 08 Posts: 850 Credit: 692,823,409 RAC: 77,584 |
I put Windows on afterwards to verify there wasn't any OS difference. I have a Broadwell-E, so it is not affected by the HT issues. https://lhcathome.cern.ch/lhcathome/results.php?hostid=9961528 |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Quote: "I think it must be a software problem as the same cpus must exist on windows and have no problems as far as we can see?" If this is indeed the hyper-threading problem that I referred to in my post in the News section, then Windows has already been patched. That discussion is on the SETI forum, where I learned about it. http://setiathome.berkeley.edu/forum_thread.php?id=81641&postid=1875423#1875423 |
Joined: 27 Sep 08 Posts: 850 Credit: 692,823,409 RAC: 77,584 |
From what I can read, it's not fixed on Windows, even in the preview builds of Windows 10, where some people were comparing against the Linux microcode updates that have fixed it. On Linux distros you can update from the repositories. Fortunately, the big motherboard companies have included the fix in their BIOSes, so it is fixed at the hardware level. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "I agree on my Xeon v3 with the 4.8 Linux I got good results." Got another clue about SixTrack crashing from a volunteer. Now for the "transients". "IMPORTANT! Apologies for the excessive ban. This is now corrected, but it took more time than I had available yesterday. The Project Management page at CERN is far too slow to be usable, so I had to use my own scripts to access the sixt_production database. I have now "unbanned" 770 hosts but maintained the ban for 73. I have found: 98707 (all_linux) Linux hosts; 4331 (allnew48) Linux hosts with kernel 4.8.*; 3161 (allfamily6) Linux hosts with kernel 4.8.* and Intel Family 6 processor(s). Of the "banned" hosts (max_results_day=-1), of which there are 1203, 843 are running kernel 4.8.* on Intel Family 6. I then went to PCBE13978 (more disk space, and the validator logs), looked for all Invalids in the validator logs, and checked all 843 hosts against them. (I had to use nohup a lot, as they are digging up the roads and my Internet connection is being broken regularly, or is it lxplus@CERN???) Anyway, to cut a long story short (and I can't remember how to italicise or emphasise with this interface :-( ): I have found that 73 hosts account for 204,184 Invalid results out of a total of 258,725, i.e. almost 79% of all Invalids. No time to make a plot, but here are the Invalid counts for each of the 73 hosts: 39852 21733 20055 19813 19601 18485 7587 5425 5360 4848 4651 4266 4196 3791 2325 2293 1953 1825 1802 1789 1620 1535 1103 880 731 730 729 696 598 383 369 369 367 355 338 308 308 305 240 104 96 63 54 34 33 33 31 19 18 10 9 8 7 7 5 5 4 4 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 ...and the HostIds in the same order: 10452223 10480022 10486162 10487841 10485156 10484503 10480909 10454365 10484659 10484606 10486251 10483458 10477752 10481907 10484752 10487436 10484663 10487212 10453783 10485912 10485911 10485913 10405110 10485905 10485907 10485906 10456121 10487210 10485908 10482829 10453149 10452598 10453254 10453494 10452614 10476277 10453157 10453507 10454458 10488834 10481344 10481733 10485179 10487938 10487900 10487190 10480804 10482592 10475984 10480775 10475982 10453730 10475983 10455704 10488196 10478598 10487688 10476101 10452585 10451971 10421428 10408937 10486716 10479782 10417991 10489602 10489459 10484733 10451832 10449556 10416774 10415082 10396588. Needless to say, I shall be looking VERY closely at at least the first few of these 73 hosts! Hint: the first "englab" system has already been banned for a considerable time :-) However, we have an even more urgent problem with "transient" errors and incorrect validation. I MUST look at that and write a report for Monday at the latest. Eric. (Have to take a break. Too hot at the pool, too sunny to read my screen, and my battery is flat!)" |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
Eric, I have a small idea: Theory, CMS and LHCb use a CERN Linux 4.1.x kernel. ATLAS uses a CERN Linux 3.10.x kernel. I run openSUSE 13.2 with Linux kernel 3.16.x and openSUSE 42.2 with Linux kernel 4.4.x. Is it possible to let SixTrack run only on Linux kernels older than 4.5.x or 4.6.x? These new Linux kernels (for example 4.8) do not have the same stability and are not bug-free. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "Eric," Not so small, many thanks. I have in fact banned only hosts running kernel 4.8.0 or higher, and only if they are producing errors. More news soonest. |
Joined: 21 Aug 07 Posts: 46 Credit: 1,503,835 RAC: 0 |
You probably also need to look at Host 10388131. It is a Linux 3.19.0-32-generic/AMD FX(tm)-8300 Eight-Core Processor [Family 21 Model 2 Stepping 0] machine. It has a high count of inconclusives and invalids. And, it is still getting new tasks. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "You probably also need to look at Host 10388131. It is a Linux 3.19.0-32-generic/AMD FX(tm)-8300 Eight-Core Processor [Family 21 Model 2 Stepping 0] machine. It has a high count of inconclusives and invalids. And, it is still getting new tasks." Right, I haven't forgotten. I shall try to look at this fully tomorrow. (I am still chasing the hyperthreading...) However, the main source of Hostid 10388131 errors appears to be "NULL" (empty) result files. This has been a thorn in my flesh for many years. Some are produced by local server errors, but others come from a volunteer host. Since the beginning of July they are being rejected. So far I have found 36,766 empty result files. They come from more than 16,488 non-unique hosts. More news tomorrow. I am NOT banning 10388131 (yet). I think we have a serious infrastructure problem; in fact I know we have, with "transient"/open file errors. Interesting that maybe AMDs have a hyperthreading problem too... However, Host 10388131 is mainly producing NULL results, which are rejected. In addition there were more than 300,000 NULL results due to the so-called "transient" error problem. Not yet sure how to separate these results. Thanks and more soonest. Eric. |
Joined: 21 Aug 07 Posts: 46 Credit: 1,503,835 RAC: 0 |
Quote: "Right, I haven't forgotten. I shall try and look at this fully tomorrow." You obviously have a great memory. OTOH, I have CRS. Although Hostid 10388131 had been reported previously, I just noticed it again, for the first time, today, when it caused my inconclusive count to go up by 1. Sorry for being redundant. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Quote: "Right, I haven't forgotten. I shall try and look at this fully tomorrow." I DON'T have a good memory, but we are very hot on this with the volunteer and his host. No need to apologise. I would rather have too much info than not enough. :-) Eric. P.S. I am sure you will be validated in the end! |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
This is what we are chasing with Uwe... http://www.tomshardware.co.uk/hyperthreading-kaby-lake-skylake-skylake-x,news-56085.html Hyperthreading, which schedules two logical threads on one physical core, has been a boon to computing since its 2002 debut, but it hasn't been without its headaches. After 15 years, we could logically expect the kinks to be ironed out, but according to Henrique de Moraes Holschuh, a Debian Linux developer, Kaby Lake and Skylake processors have a serious flaw in their hyperthreading implementation. This warning advisory is relevant for users of systems with the Intel processors code-named "Skylake" and "Kaby Lake". These are: the 6th and 7th generation Intel Core processors (desktop, embedded, mobile and HEDT), their related server processors (such as Xeon v5 and Xeon v6), as well as select Intel Pentium processor models. [...] This advisory is about a processor/microcode defect recently identified on Intel Skylake and Intel Kaby Lake processors with hyper-threading enabled. This defect can, when triggered, cause unpredictable system behavior: it could cause spurious errors, such as application and system misbehavior, data corruption, and data loss. Intel's errata list for the recent Skylake-X processors (unearthed by Hot Hardware), provide a bit more insight into the nuts and bolts of the issue. Problem: Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (eg RAX, EAX or AX for AH) may cause unpredictable system behaviour. This can only happen when both logical processors on the same physical processor are active. Implication: Due to this erratum, the system may experience unpredictable system behavior. Workaround: It is possible for the BIOS to contain a workaround for this erratum. 
It appears the problem is confined to the sixth-generation Skylake and seventh-generation Kaby Lake processors, but it spans from desktop and mobile processors to Xeon models. The errata apply to any operating system, so it can also impact Windows users. The defect can lead to data loss or corruption and erratic system behavior. Unfortunately, the scope of the issue isn't well-defined. Specific code patterns in applications will trigger the defect, and as yet, there isn't a list of specific software to avoid. For now, Holschuh recommends disabling hyperthreading to circumvent the issue, but that isn't an acceptable long-term fix. There are microcode fixes available for the Kaby Lake and Skylake processors through system vendors, which means you might have to wait for a BIOS/UEFI update to rectify the issue. According to the Debian post, for Kaby Lake processors that entails a BIOS/UEFI that fixes "Intel processor errata KBL095, KBW095 or the similar one for Kaby Lake," and for Skylake you'll need a fix for "Intel erratum SKW144, SKL150, SKX150, SKZ7." Mark Shinwell, an OCaml toolchain developer, discovered the bug earlier this year, but Intel hasn't responded to his queries. Intel did issue microcode updates in the interim. It's worth mentioning that we aren't aware of the extent of the issue and how much it will impact everyday desktop users. Skylake debuted in August 2015, so if there were a considerable number of mainstream desktop applications that trigger the errata, it would have likely already been thrust into the spotlight. We do recommend caution, though, until we learn how many motherboard vendors have already issued the fix in BIOS/UEFI updates. For now, it's best to disable hyperthreading if you handle sensitive data, particularly in business applications. We've sent along the requisite request to Intel for more information and will update accordingly. |
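
For volunteers who want a rough idea of whether a Linux host falls into the group described in the advisory quoted above, here is a minimal Python sketch. It takes the text of /proc/cpuinfo and reports whether the first CPU entry is Intel Family 6, has one of the Skylake or Kaby Lake model numbers, and has hyperthreading active (siblings greater than cpu cores). The function names and the model-number set (78, 94, 85, 142, 158) are assumptions made for illustration and do not come from this thread; check them against the Debian advisory and Intel's errata lists. The sketch does not check whether a microcode or BIOS fix is already installed.

```python
# Assumption: model numbers as commonly reported for the affected CPUs
# (Skylake 78/94, Skylake-X 85, Kaby Lake 142/158). Verify before relying on them.
AFFECTED_MODELS = {78, 94, 85, 142, 158}


def parse_cpuinfo(text):
    """Return the key/value fields of the first processor entry."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def may_be_affected(text):
    """True if the CPU matches the advisory's description and HT is active."""
    cpu = parse_cpuinfo(text)
    is_intel = cpu.get("vendor_id") == "GenuineIntel"
    family = int(cpu.get("cpu family", -1))
    model = int(cpu.get("model", -1))
    # Hyperthreading is active when there are more logical CPUs per package
    # ("siblings") than physical cores ("cpu cores").
    hyperthreading = int(cpu.get("siblings", 1)) > int(cpu.get("cpu cores", 1))
    return is_intel and family == 6 and model in AFFECTED_MODELS and hyperthreading


# Hypothetical /proc/cpuinfo excerpt, for illustration only.
SAMPLE = """vendor_id : GenuineIntel
cpu family : 6
model : 94
siblings : 8
cpu cores : 4
"""

print(may_be_affected(SAMPLE))
```

On a real host, pass the contents of /proc/cpuinfo to `may_be_affected` instead of `SAMPLE`. A positive result only means the host matches the description; whether SixTrack results are actually corrupted is a separate question.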
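
The figures in Eric's report on the 73 banned hosts can be cross-checked with a few lines of Python. The counts and the total of 258,725 Invalids are copied from his post; the variable names are only illustrative. The sketch sums the per-host counts, computes their share of all Invalids, and also shows how much of that comes from the six worst hosts.

```python
# Invalid counts per host, in the order given in the post.
invalid_counts = [
    39852, 21733, 20055, 19813, 19601, 18485, 7587, 5425, 5360, 4848,
    4651, 4266, 4196, 3791, 2325, 2293, 1953, 1825, 1802, 1789,
    1620, 1535, 1103, 880, 731, 730, 729, 696, 598, 383,
    369, 369, 367, 355, 338, 308, 308, 305, 240, 104,
    96, 63, 54, 34, 33, 33, 31, 19, 18, 10,
    9, 8, 7, 7, 5, 5, 4, 4, 3, 3,
    3, 3, 2, 2, 2, 1, 1, 1, 1, 1,
    1, 1, 1,
]
TOTAL_INVALIDS = 258_725  # total Invalids quoted in the post

banned_total = sum(invalid_counts)
share = banned_total / TOTAL_INVALIDS

# The list is already sorted in descending order, but sort anyway to be safe.
top6 = sum(sorted(invalid_counts, reverse=True)[:6])

print(f"hosts: {len(invalid_counts)}")
print(f"invalids from these hosts: {banned_total} ({share:.1%} of all)")
print(f"invalids from the 6 worst hosts: {top6} ({top6 / TOTAL_INVALIDS:.1%} of all)")
```

The sum reproduces the 204,184 quoted in the post, and the distribution is very uneven: a handful of hosts account for most of the Invalids.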
©2024 CERN