Message boards : Sixtrack Application : Inconclusive, valid/invalid results

Alpha

Joined: 30 Nov 16
Posts: 1
Credit: 9,189,774
RAC: 0
Message 31362 - Posted: 12 Jul 2017, 7:06:21 UTC - in response to Message 31064.  

I received messages 31064 and 31102 from LHC@home.
My BOINC is running under Ubuntu 16.04 + Oracle VM VirtualBox V5.1.22 (Qt5.5.1) with a Xenial install. I should be up to date.
How can I help you with my PC?
Do you have a checklist to check my install?
I am only an end user at this point of my understanding.
Eric92.
ID: 31362
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31367 - Posted: 12 Jul 2017, 10:01:06 UTC - in response to Message 31362.  

I'll be in touch soonest. I see you are apparently running
Linux 4.8.0-58-generic. We think we have a SixTrack
problem with this Linux version... Please don't do anything for
the moment and I'll let you know. Eric.

ID: 31367
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,344
RAC: 2,123
Message 31374 - Posted: 13 Jul 2017, 9:43:17 UTC

I found another workunit that could not be validated because it reached the maximum of 5 tries.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=71783751
ID: 31374
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31376 - Posted: 13 Jul 2017, 16:25:59 UTC - in response to Message 31374.  

Thanks, it should have been re-submitted.
Sadly, some volunteers may lose credit because of this. Eric.

ID: 31376
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,344
RAC: 2,123
Message 31383 - Posted: 14 Jul 2017, 8:33:53 UTC

It's still happening with new tasks. Look at this machine: All tasks for computer 10489186

State: All (5352) · In progress (0) · Validation pending (3726) · Validation inconclusive (1070) · Valid (3) · Invalid (502) · Error (51)

.. and yes, it's running Linux 4.8, on a machine created on the 11th of July.
ID: 31383
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31387 - Posted: 14 Jul 2017, 13:03:16 UTC - in response to Message 31383.  

Well, the weekend is coming up... Although 4.8.0 is a NECESSARY but
NOT SUFFICIENT condition, I am going to ban them all, with a NEWS item
(the vast majority of Hosts are Windows). This should help clear the
backlog. You should receive my interim report shortly and a full
report Monday latest. Thanks a lot. Eric.



ID: 31387
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31388 - Posted: 14 Jul 2017, 13:08:41 UTC - in response to Message 31387.  

P.S. I think it needs to be Intel Family 6 as well. We shall see.


ID: 31388
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 803
Credit: 649,933,414
RAC: 241,755
Message 31391 - Posted: 14 Jul 2017, 15:38:05 UTC

I agree; on my Xeon v3 with Linux 4.8 I got good results.

I think it must be a software problem, as the same CPUs must exist on Windows and have no problems as far as we can see?
ID: 31391
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,344
RAC: 2,123
Message 31428 - Posted: 16 Jul 2017, 9:12:07 UTC - in response to Message 31391.  

I agree on my Xeon v3 with the 4.8 Linux I got good results.


Can you point me to the hostID? I don't see a Linux machine on your hosts list.
ID: 31428
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 803
Credit: 649,933,414
RAC: 241,755
Message 31432 - Posted: 16 Jul 2017, 13:36:14 UTC - in response to Message 31428.  
Last modified: 16 Jul 2017, 16:44:45 UTC

I put Windows on afterwards to validate that there wasn't any OS difference. I have a Broadwell-E, so it is not affected by the HT issues.



https://lhcathome.cern.ch/lhcathome/results.php?hostid=9961528
ID: 31432
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31434 - Posted: 16 Jul 2017, 15:07:47 UTC - in response to Message 31391.  
Last modified: 16 Jul 2017, 15:08:56 UTC

I think it must be a software problem as the same cpus must exist on windows and have no problems as far as we can see?

If this is indeed the hyper-threading problem that I referred to in my post in the News section, then Windows has already been patched. That discussion is on the SETI forum, where I learned about it.
http://setiathome.berkeley.edu/forum_thread.php?id=81641&postid=1875423#1875423
ID: 31434
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 803
Credit: 649,933,414
RAC: 241,755
Message 31438 - Posted: 16 Jul 2017, 16:43:57 UTC

From what I can read, it's not fixed on Windows, even in the preview builds of Windows 10, when some people were comparing them to the Linux microcode updates, which have fixed it.

On the Linux distros you can update from the repositories.

Fortunately the big motherboard companies have included the fix in their BIOSes, so it is fixed at the hardware level.
ID: 31438
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31439 - Posted: 16 Jul 2017, 16:49:13 UTC - in response to Message 31391.  

Got another clue about SixTrack crashing from a volunteer.

Now for the "transients".

"IMPORTANT!

Apologies for the excessive ban. This is now corrected
but took more time than I had available yesterday.
The Project Management page at CERN is far too slow to be usable.
I had to use my own scripts to access the sixt_production database.

I have now "unbanned" 770 hosts but maintained the ban
for the remaining 73 "banned" hosts.
I have found

98707 all_linux Linux Hosts
4331 allnew48 Linux Hosts with Kernel 4.8.*
3161 allfamily6 Linux Hosts with Kernel 4.8.* and Intel Family 6 Processor(s)

Of the "banned" (max_results_day=-1) of which there are 1203,
843 are running Kernel 4.8.* on Intel Family 6.

Now I went to PCBE13978 (more disk space, and the validator logs)
and looked for all Invalids in the validator logs.
Then checked all 843 Hosts in the Invalids.
(Had to use nohup a lot as they are digging up the roads
and my Internet connection is being broken regularly,
or is it lxplus@CERN???)

Anyway, to cut a long story short, and I can't remember how to italicise or
emphasise with this interface :-(

I have found that 73 hosts account for 204,184 Invalid Results
==============================================
out of a Total of 258,725, i.e. almost 79% of all Invalids.
=========================================

No time to make a plot, but here are the Invalid counts for each of
the 73 Hosts.

39852 21733 20055 19813 19601 18485 7587
5425 5360 4848 4651 4266 4196 3791
2325 2293 1953 1825 1802 1789 1620
1535 1103 880 731 730 729 696
598 383 369 369 367 355 338
308 308 305 240 104 96 63
54 34 33 33 31 19 18
10 9 8 7 7 5 5
4 4 3 3 3 3 2
2 2 1 1 1 1 1
1 1 1

....and the HostIds in the same order....

10452223 10480022 10486162 10487841 10485156 10484503 10480909
10454365 10484659 10484606 10486251 10483458 10477752 10481907
10484752 10487436 10484663 10487212 10453783 10485912 10485911
10485913 10405110 10485905 10485907 10485906 10456121 10487210
10485908 10482829 10453149 10452598 10453254 10453494 10452614
10476277 10453157 10453507 10454458 10488834 10481344 10481733
10485179 10487938 10487900 10487190 10480804 10482592 10475984
10480775 10475982 10453730 10475983 10455704 10488196 10478598
10487688 10476101 10452585 10451971 10421428 10408937 10486716
10479782 10417991 10489602 10489459 10484733 10451832 10449556
10416774 10415082 10396588

Needless to say I shall be looking VERY closely at at least the first
few of these 73 Hosts! Hint: the 1st "englab" system has been banned
for a considerable time already :-)

However we have an even more urgent problem with
"Transient" errors and incorrect Validation.
I MUST look at that and write a report for Monday latest.
Eric. (Have to take a break. Too hot at the pool, and too sunny
to read my screen, and my battery is flat!)"

[mcintosh@lxplus007 ~]$
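
The figures quoted above can be cross-checked with a few lines of Python. This is only a sketch for readers who want to redo the arithmetic: the two lists are copied from the post, the variable names are made up for illustration, and nothing here is taken from the project's own scripts.

```python
# Invalid-result counts for the 73 hosts, in the order given in the post.
invalid_counts = [
    39852, 21733, 20055, 19813, 19601, 18485, 7587,
    5425, 5360, 4848, 4651, 4266, 4196, 3791,
    2325, 2293, 1953, 1825, 1802, 1789, 1620,
    1535, 1103, 880, 731, 730, 729, 696,
    598, 383, 369, 369, 367, 355, 338,
    308, 308, 305, 240, 104, 96, 63,
    54, 34, 33, 33, 31, 19, 18,
    10, 9, 8, 7, 7, 5, 5,
    4, 4, 3, 3, 3, 3, 2,
    2, 2, 1, 1, 1, 1, 1,
    1, 1, 1,
]

# Host IDs in the same order as the counts.
host_ids = [
    10452223, 10480022, 10486162, 10487841, 10485156, 10484503, 10480909,
    10454365, 10484659, 10484606, 10486251, 10483458, 10477752, 10481907,
    10484752, 10487436, 10484663, 10487212, 10453783, 10485912, 10485911,
    10485913, 10405110, 10485905, 10485907, 10485906, 10456121, 10487210,
    10485908, 10482829, 10453149, 10452598, 10453254, 10453494, 10452614,
    10476277, 10453157, 10453507, 10454458, 10488834, 10481344, 10481733,
    10485179, 10487938, 10487900, 10487190, 10480804, 10482592, 10475984,
    10480775, 10475982, 10453730, 10475983, 10455704, 10488196, 10478598,
    10487688, 10476101, 10452585, 10451971, 10421428, 10408937, 10486716,
    10479782, 10417991, 10489602, 10489459, 10484733, 10451832, 10449556,
    10416774, 10415082, 10396588,
]

total_invalids = 258_725  # project-wide total quoted in the post

# Pair each host with its count and compute the share of all Invalids.
per_host = dict(zip(host_ids, invalid_counts))
subtotal = sum(invalid_counts)
share = subtotal / total_invalids

print(f"{len(invalid_counts)} hosts, {subtotal} invalids, {share:.1%} of {total_invalids}")
```

The listed counts do add up to the 204,184 quoted in the post, which is the "almost 79%" of 258,725.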


ID: 31439
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,777,751
RAC: 128,475
Message 31441 - Posted: 16 Jul 2017, 17:17:53 UTC

Eric,

I have a small idea:

Theory, CMS and LHCb use a CERN-Linux 4.1.x.
ATLAS uses a CERN-Linux 3.10.x.

For me, it's openSUSE 13.2 with Linux kernel 3.16.x and openSUSE 42.2 with Linux kernel 4.4.x.

Is it possible to let SixTrack run only on Linux kernels older than 4.5.x or 4.6.x?

These new Linux kernels (for example 4.8) do not have the same stability and are not bug-free.
ID: 31441
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31450 - Posted: 17 Jul 2017, 9:42:09 UTC - in response to Message 31441.  

Not so small, many thanks. I have in fact banned only 4.8.0 or higher,
and only if producing an error. More news soonest.
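
For anyone wanting to reproduce that rule on their own host list, here is a minimal sketch of the condition as described (kernel 4.8.0 or higher, and only if the host is producing errors). The function names and the `produced_errors` flag are hypothetical; the actual ban is set in the project database and its scripts are not shown in this thread.

```python
import re

# Threshold named in the post: kernels at or above this version are candidates.
BAN_FROM = (4, 8, 0)


def parse_kernel_version(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a release string such as '4.8.0-58-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '4.4') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_ban_candidate(release: str, produced_errors: bool) -> bool:
    """True only when both conditions hold: kernel >= 4.8.0 and the host produced errors."""
    return produced_errors and parse_kernel_version(release) >= BAN_FROM


print(is_ban_candidate("4.8.0-58-generic", produced_errors=True))
print(is_ban_candidate("3.19.0-32-generic", produced_errors=True))
```

Comparing version tuples rather than strings avoids the trap where "4.10" sorts before "4.8" alphabetically.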

ID: 31450
Stick

Joined: 21 Aug 07
Posts: 46
Credit: 1,503,661
RAC: 2
Message 31460 - Posted: 17 Jul 2017, 13:21:52 UTC

You probably also need to look at Host 10388131. It is a Linux 3.19.0-32-generic/AMD FX(tm)-8300 Eight-Core Processor [Family 21 Model 2 Stepping 0] machine. It has a high count of inconclusives and invalids. And, it is still getting new tasks.
ID: 31460
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31466 - Posted: 17 Jul 2017, 16:48:42 UTC - in response to Message 31460.  

Right, I haven't forgotten. I shall try and look at this fully tomorrow.
(I am still chasing the Hyperthreading......)

However the main source of Hostid 10388131 errors appears to be "NULL",
empty, result files. This has been a thorn in my flesh for many years. Some are produced
due to local server errors, but others come from a volunteer host. Since the beginning of July they are being rejected.
So far I have found 36,766 empty result files. They come from more than 16,488
non-unique hosts. More news tomorrow. I am NOT banning 10388131
(yet). I think we have a serious infrastructure problem; in fact I know we
have with "transient"/open file errors.

Interesting that maybe AMDs have a hyperthreading problem too....
However, Host 10388131 is mainly producing NULL results, which are rejected.
In addition there were more than 300,000 NULL results due to the
so-called "transient" error problem. Not yet sure how to separate these results.

Thanks and more soonest. Eric.

ID: 31466
Stick

Joined: 21 Aug 07
Posts: 46
Credit: 1,503,661
RAC: 2
Message 31471 - Posted: 17 Jul 2017, 18:50:07 UTC - in response to Message 31466.  

Right, I haven't forgotten. I shall try and look at this fully tomorrow.
(I am still chasing the Hyperthreading......)

You obviously have a great memory. OTOH, I have CRS. Although Hostid 10388131 had been reported previously, I just noticed it again, for the first time, today, when it caused my inconclusive count to go up by 1. Sorry for being redundant.
ID: 31471
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31487 - Posted: 18 Jul 2017, 16:41:56 UTC - in response to Message 31471.  

I DON'T have a good memory, but we are very hot on this with the
volunteer and his HOST. No need to apologise. I would rather have
too much info than not enough. :-) . Eric.
P.S. I am sure you will be validated in the end!

ID: 31487
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31501 - Posted: 19 Jul 2017, 16:04:05 UTC

This is what we are chasing with Uwe...

http://www.tomshardware.co.uk/hyperthreading-kaby-lake-skylake-skylake-x,news-56085.html


Hyperthreading, which schedules two logical threads on one physical core, has
been a boon to computing since its 2002 debut, but it hasn't been without its
headaches. After 15 years, we could logically expect the kinks to be ironed out,
but according to Henrique de Moraes Holschuh, a Debian Linux developer, Kaby
Lake and Skylake processors have a serious flaw in their hyperthreading
implementation.

This warning advisory is relevant for users of systems with the
Intel processors code-named "Skylake" and "Kaby Lake". These are:
the 6th and 7th generation Intel Core processors (desktop, embedded,
mobile and HEDT), their related server processors (such as Xeon v5
and Xeon v6), as well as select Intel Pentium processor models.
[...]
This advisory is about a processor/microcode defect recently
identified on Intel Skylake and Intel Kaby Lake processors with
hyper-threading enabled. This defect can, when triggered, cause
unpredictable system behavior: it could cause spurious errors, such
as application and system misbehavior, data corruption, and data
loss.

Intel's errata list for the recent Skylake-X processors (unearthed by Hot
Hardware) provides a bit more insight into the nuts and bolts of the issue.

Problem: Under complex micro-architectural conditions, short loops
of less than 64 instructions that use AH, BH, CH or DH registers as
well as their corresponding wider register (eg RAX, EAX or AX for
AH) may cause unpredictable system behaviour. This can only happen
when both logical processors on the same physical processor are
active.

Implication: Due to this erratum, the system may experience
unpredictable system behavior.
Workaround: It is possible for the BIOS to contain a workaround for
this erratum.

It appears the problem is confined to the sixth-generation Skylake and
seventh-generation Kaby Lake processors, but it spans from desktop and mobile
processors to Xeon models. The errata apply to any operating system, so it can
also impact Windows users. The defect can lead to data loss or corruption and
erratic system behavior. Unfortunately, the scope of the issue isn't
well-defined. Specific code patterns in applications will trigger the defect,
and as yet, there isn't a list of specific software to avoid.

For now, Holschuh recommends disabling hyperthreading to circumvent the issue,
but that isn't an acceptable long-term fix. There are microcode fixes available
for the Kaby Lake and Skylake processors through system vendors, which means you
might have to wait for a BIOS/UEFI update to rectify the issue. According to the
Debian post, for Kaby Lake processors that entails a BIOS/UEFI that fixes "Intel
processor errata KBL095, KBW095 or the similar one for Kaby Lake," and for
Skylake you'll need a fix for "Intel erratum SKW144, SKL150, SKX150, SKZ7."

Mark Shinwell, an OCaml toolchain developer, discovered the bug earlier this
year, but Intel hasn't responded to his queries. Intel did issue microcode
updates in the interim.

It's worth mentioning that we aren't aware of the extent of the issue and how
much it will impact everyday desktop users. Skylake debuted in August 2015, so
if there were a considerable number of mainstream desktop applications that
trigger the errata, it would have likely already been thrust into the spotlight.

We do recommend caution, though, until we learn how many motherboard vendors
have already issued the fix in BIOS/UEFI updates. For now, it's best to disable
hyperthreading if you handle sensitive data, particularly in business
applications. We've sent along the requisite request to Intel for more
information and will update accordingly. 
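
For volunteers on Linux who want a rough first check of whether a host falls in the group described above, here is a minimal sketch that parses `/proc/cpuinfo`-style text. Several things are assumptions: the set of Family 6 model numbers (78 and 94 for Skylake, 142 and 158 for Kaby Lake, 85 for Skylake-X) should be verified against the Debian advisory and Intel's errata lists, the sample text is made up for illustration, and the check says nothing about whether a fixed microcode or BIOS is already installed.

```python
# Assumed Family 6 model numbers for the affected generations (verify before relying on them).
AFFECTED_MODELS = {78, 94, 142, 158, 85}

# Made-up sample in the layout Linux uses for /proc/cpuinfo (first logical CPU only).
SAMPLE = """\
processor\t: 0
vendor_id\t: GenuineIntel
cpu family\t: 6
model\t\t: 94
model name\t: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
siblings\t: 8
cpu cores\t: 4

processor\t: 1
vendor_id\t: GenuineIntel
"""


def parse_first_cpu(cpuinfo_text: str) -> dict[str, str]:
    """Return the key/value fields of the first processor entry."""
    fields: dict[str, str] = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first entry
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def may_be_affected(fields: dict[str, str]) -> bool:
    """Intel Family 6, a listed model, and more logical siblings than physical cores (HT on)."""
    is_intel = fields.get("vendor_id") == "GenuineIntel"
    family = int(fields.get("cpu family", "-1"))
    model = int(fields.get("model", "-1"))
    ht_on = int(fields.get("siblings", "0")) > int(fields.get("cpu cores", "0"))
    return is_intel and family == 6 and model in AFFECTED_MODELS and ht_on


print(may_be_affected(parse_first_cpu(SAMPLE)))
```

On a real Linux host the same functions can be fed the contents of `/proc/cpuinfo` instead of `SAMPLE`. A `True` result only means the CPU is in the family the advisory talks about with hyperthreading active; it does not show that the erratum is actually being triggered.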
ID: 31501


©2024 CERN