[WARNING] Intel Skylake/Kaby Lake processors: broken hyper-threading



We found a thread on the Debian mailing list that documents an issue with Intel processors of both the Skylake and Kaby Lake families.  You can read the thread yourself for the full details, but here is the synopsis as given in the mailing list post:

 

Quote

This advisory is about a processor/microcode defect recently identified
on Intel Skylake and Intel Kaby Lake processors with hyper-threading
enabled.  This defect can, when triggered, cause unpredictable system
behavior: it could cause spurious errors, such as application and system
misbehavior, data corruption, and data loss.

It was brought to the attention of the Debian project that this defect
is known to directly affect some Debian stable users (refer to the end
of this advisory for details), thus this advisory.

Please note that the defect can potentially affect any operating system
(it is not restricted to Debian, and it is not restricted to Linux-based
systems).  It can be either avoided (by disabling hyper-threading), or
fixed (by updating the processor microcode).

Due to the difficult detection of potentially affected software, and the
unpredictable nature of the defect, all users of the affected Intel
processors are strongly urged to take action as recommended by this
advisory.

 

Due to the nature of this issue, we are recommending all affected users do the following:

 

  1. Read the Debian mailing list post regarding this issue to confirm your CPU is affected.
  2. Check to see if there is a BIOS update available for your hardware.
  3. If no BIOS update is available, disable Hyperthreading in your system BIOS immediately.

 

We are looking into a way for users to apply a microcode update as a workaround that patches this bug on a per-boot basis. Until then, users with these systems should consider it risky to continue using the Hyperthreading feature.
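For anyone who wants a quick first check before reading the advisory, here is a minimal sketch that parses Linux `/proc/cpuinfo` text and reports whether Hyperthreading appears to be on and whether the CPU model falls in the Skylake/Kaby Lake range. The model numbers in `SKYLAKE_KABY_LAKE_MODELS` and the function names are our own assumptions for illustration, and the check ignores stepping, so treat the result as a hint and confirm against the Debian post.

```python
# Assumption: family 6 model numbers commonly associated with Skylake and
# Kaby Lake parts. Confirm against the Debian advisory before relying on it.
SKYLAKE_KABY_LAKE_MODELS = {78, 94, 85, 142, 158}


def parse_cpuinfo(text):
    """Return the fields of the first processor entry in /proc/cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def hyperthreading_enabled(fields):
    """True when the package exposes more logical CPUs than physical cores."""
    # "siblings" counts logical CPUs per package, "cpu cores" physical cores.
    return int(fields["siblings"]) > int(fields["cpu cores"])


def possibly_affected(fields):
    """Rough check only: Intel, family 6, listed model, Hyperthreading on."""
    return (
        fields.get("vendor_id") == "GenuineIntel"
        and int(fields["cpu family"]) == 6
        and int(fields["model"]) in SKYLAKE_KABY_LAKE_MODELS
        and hyperthreading_enabled(fields)
    )
```

On a Linux box you could feed it the real file, for example `possibly_affected(parse_cpuinfo(open("/proc/cpuinfo").read()))`.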

Link to comment

Thanks for posting this. Please do keep us in the loop with any microcode updates that become available for UnRAID.

 

My Skylake-based machine has been suffering from intermittent crashes for no discernible reason since the get-go, specifically when hyperthreading is enabled, so I'm looking forward to seeing whether a microcode update will fix it. I doubt my board manufacturer will release a BIOS update this late in the game :(

Link to comment
10 minutes ago, giantkingsquid said:

Thanks for posting this. Please do keep us in the loop with any microcode updates that become available for UnRAID.

 

My Skylake-based machine has been suffering from intermittent crashes for no discernible reason since the get-go, specifically when hyperthreading is enabled, so I'm looking forward to seeing whether a microcode update will fix it. I doubt my board manufacturer will release a BIOS update this late in the game :(

 

giant -

 

Are you taking the advice to disable hyperthreading? If you were having issues and disabling the hyperthreading resolves them, it would be overwhelmingly likely that you are being impacted by the bug. But if disabling hyperthreading does not resolve the issues, you similarly have pretty overwhelming evidence that the problem lies elsewhere.

 

I remember that early in the hyperthreading era, hyperthreading was actually detrimental to performance, and it was recommended to disable it. I just did another search, and whether hyperthreading is of value seems very dependent on the nature of the applications that are running. The more things you run, it seems, the more likely it is to have a positive impact.

 

Clearly having 2 real cores gives 1x+1x=2x performance. Having 2 virtual cores per physical core nominally gives 0.5x + 0.5x + 0.5x + 0.5x=2x performance. And you may get 0.6x+0.6x+0.6x+0.6x=2.4x, but you may also get 0.4x+0.4x+0.4x+0.4x=1.6x. You could also get 0.6x+0.6x+0.4x+0.4x=2x :)

 

Generally I would say that one thread running 2x as fast is going to perform better than 2 threads running 1/2 as fast on a single task. It takes a lot more time and effort to code an application to use multiple threads efficiently, and the "overhead" of the threading logic could easily eat up the performance gains that hyperthreading might deliver. So even if the two HT cores deliver 1.2x the power, the app may run no faster, and could be slower. That same app running on 2 physical cores might go quite a lot faster than on one real core and its virtual sibling. So what's the OS to do: give an app that wants two threads two real cores, or give it a real core with its virtual one? If it got 2 different cores, it might power through a processing-intensive task 2x as fast.

 

So the jury is still out on whether a particular user would lose anything, or in fact gain something, by disabling hyperthreading. Either way - it will likely not be night and day.

Link to comment

I have disabled hyperthreading and the system has not crashed for several hours now, but then again it has run for >40 days without crashing as well, and it has also crashed after a few hours. Very difficult to test for. I honestly doubt that this bug is my problem, but it's another box to tick off I suppose.

 

Thanks for the interesting write up.

 

A hypothetical for those in the know:

 

If I had a Debian VM running on unRAID, and used the Debian microcode patch at VM boot, would the VM be patched, would both the host and VM be patched, or would nothing be patched?

Link to comment

Just a further comment. This must be an edge case of enormous unlikelihood. Think of the number of computers based on those chips, and the millions upon millions of hours of testing and use. If a specific user were to randomly hit this issue once, it would be unlikely. But to hit it repeatedly within hours of booting - if that were the kind of symptom, it would have been found and fixed long ago IMO.

 

I remember in college there was an assignment to implement a very, very basic multi-process "OS". This was on a now ancient Z80 processor. The trick was to interrupt a process (easy to do), and then precisely remember the processor state. Since each thing you do changes the processor state, it was a bit tricky not to destroy the state of the registers as you stored things away, and then to put things back precisely so that the interrupted process was oblivious to the fact that it was interrupted. We were running like 10 processes in parallel, giving each a time slice to run. The running processes were all the same - a mathematical calculation of some kind. It took probably 20 seconds, displayed the same answer, and looped to do the same thing over and over. I remember it being fun, because you could run several iterations and it would work, and then you'd get a weird answer being displayed and you'd have to figure out what you were not preserving. After several test runs I got it running consistently, and was letting it run and run, several minutes - perfect. I was the first one in the lab to get it running, and was chatting with the TA when I got a wrong answer displayed. Crap. It had run a long time. Went through the code line by line, with the TA, and it was perfect. Others started finishing and ran theirs for several minutes, but left on for 15 minutes or more, each would show one wrong answer. The TA couldn't explain it, and we went home confused.

 

Next class the teacher explained what happened. There is an instruction, DAA (decimal adjust accumulator, if memory serves), that is dependent on an invisible carry flag that the prior instruction could set. He then explained a method to preserve even that state, which until then we had no idea existed.

 

So maintaining state is tricky, and even if you think every nuance is covered, it may not be. I have to think this issue is similar to my college experience (with infinitely more complexity): some extremely nuanced situation where state is not properly preserved.
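To make the DAA story concrete, here is a toy sketch of the failure mode. It is a simplified model written for illustration, not exact Z80 behavior: the `Cpu` class, the flag names, and the `bcd_add` helper are all assumptions of mine. It adds two BCD bytes, simulates an interrupt landing between the ADD and the DAA, and shows what happens when the context switch restores the accumulator but not the flags.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    a: int = 0                 # accumulator (8 bits)
    half_carry: bool = False   # carry out of bit 3, set by ADD
    carry: bool = False        # carry out of bit 7, set by ADD


def add(cpu, value):
    """Binary ADD that also records the carry and half-carry flags."""
    cpu.half_carry = ((cpu.a & 0x0F) + (value & 0x0F)) > 0x0F
    total = cpu.a + value
    cpu.carry = total > 0xFF
    cpu.a = total & 0xFF


def daa(cpu):
    """Simplified decimal adjust after an addition; it depends on the flags."""
    adjust = 0
    if (cpu.a & 0x0F) > 9 or cpu.half_carry:
        adjust |= 0x06
    if cpu.a > 0x99 or cpu.carry:
        adjust |= 0x60
        cpu.carry = True
    cpu.a = (cpu.a + adjust) & 0xFF


def bcd_add(x, y, restore_flags):
    """Add two BCD bytes with a simulated interrupt between ADD and DAA."""
    cpu = Cpu(a=x)
    add(cpu, y)
    # Interrupt: the scheduler saves the state of the running process.
    saved_a, saved_flags = cpu.a, (cpu.half_carry, cpu.carry)
    # Another process runs and clobbers the registers and flags.
    cpu.a, cpu.half_carry, cpu.carry = 0, False, False
    # Resume: the accumulator comes back, the flags only if we saved them.
    cpu.a = saved_a
    if restore_flags:
        cpu.half_carry, cpu.carry = saved_flags
    daa(cpu)
    return cpu.a
```

With the flags restored, `bcd_add(0x19, 0x28, True)` gives `0x47` (19 + 28 = 47). With them dropped, `bcd_add(0x19, 0x28, False)` gives `0x41`, while `bcd_add(0x12, 0x34, False)` still gives the right `0x46`. The bug only shows when the interrupt lands at that exact spot and the operands happen to carry out of the low digit, which is why it could run for many minutes before one wrong answer appeared.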

Link to comment

Interesting.  I'm running the upcoming Skylake Purley Xeon.  I guess that's what they call the Xeon E5 v5 in the note; however, the nomenclature for this upcoming CPU has changed. The current chip reports this CPU info:

 

Quote

Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

 

And the CPU instruction set flags ( http://i.imgur.com/o6Y8LWp.png ) show it definitely supports HyperThreading (ht), so it looks like it's affected by the bug.

 

I've hammered the box pretty hard but have not encountered any stability issues.. maybe I should run unRAID on it for a bit :D

 

 

 

 

Edited by Nicktdot
Link to comment
3 hours ago, tdallen said:

Hmm.  If you were building/recommending a new system right now, what would you do?

 

It will get fixed. And I believe the loss of hyperthreading is not much of a loss at all.

 

I would buy the processor that made the most sense to me, and if it was impacted, would disable hyperthreading for now.

 

Does anyone have any evidence that hyperthreading makes a significant difference in performance in transcoding, VM execution, or any of the heavy CPU-intensive operations people do on their unRAID arrays? I have not found anything more than theoretical. Eight hoses that each carry 1/2 the water of 4 larger ones - seems like we're not going to move much more water. ;)

Link to comment
17 hours ago, bjp999 said:

Does anyone have any evidence that hyperthreading makes a significant difference in performance in transcoding

Here's a data point.  I ran a Handbrake encode on a 24GB .mkv file using the MP4/H.264 Normal profile under Windows 10.  It ran on a Core i7-4790 first with Hyperthreading enabled, then with it disabled.  Source and target were both on the local SSD.  Time with 8 cores was 32:38, time with 4 cores was 40:17.  All cores were maxed, disk i/o as low as expected, as was memory utilization.

 

Handbrake encoding logs show that it is aware of the cores it has to work with.  So, I retested both and looked in Process Explorer.  It was interesting to note that HandbrakeCLI.exe spooled up 39 threads when it had 8 cores and only 29 threads when it had 4 cores.  Could Handbrake have worked faster under 4 cores if it had spooled up 39 threads?  I doubt it, but it's a variable.

 

So, here's a single data point on a Haswell chip - feel free to draw conclusions or not, your call - 23% performance degradation on a multi-threaded, CPU intensive operation with Hyperthreading disabled.
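For anyone who wants to check the arithmetic, here is a small sketch using the two encode times above. The helper name `to_seconds` is just for illustration. Note that the figure depends on how you frame it: the encode takes about 23% longer without Hyperthreading, which is the same thing as roughly 19% lower throughput.

```python
def to_seconds(mmss):
    """Convert a 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


ht_on = to_seconds("32:38")   # 8 logical cores, Hyperthreading enabled
ht_off = to_seconds("40:17")  # 4 cores, Hyperthreading disabled

# Extra wall-clock time relative to the Hyperthreading run.
extra_time_pct = (ht_off - ht_on) / ht_on * 100

# Drop in throughput (work per second) relative to the Hyperthreading run.
throughput_loss_pct = (1 - ht_on / ht_off) * 100

print(f"{extra_time_pct:.1f}% longer, {throughput_loss_pct:.1f}% lower throughput")
```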

Link to comment

Thanks for running the test @tdallen! It is clear that Handbrake, at least, benefits from the extra threads. Why did it pick 39 threads with 8 cores vs 29 with 4? Was there some intelligence? Would the 4 core version have been faster with more threads running on 4 cores? Don't know and doesn't really matter. Clearly, there is a pretty significant advantage with Handbrake and probably other apps too. Apps get optimized based on what runs in the real world, and the real world runs hyperthreaded!

 

It certainly seems worth getting this fixed! And that's what our users really wanted to know.

Link to comment
  • 2 weeks later...

Does this also include Apollo Lake? I have been having issues since the beginning...

 

Jul 19 04:40:08 unRAIDTower root: Fix Common Problems: Error: Machine Check Events detected on your server
Jul 19 04:40:08 unRAIDTower root: mcelog: Family 6 Model 92 CPU: only decoding architectural errors

 

thanks

Link to comment
1 hour ago, yippy3000 said:

I saw Intel released a public fix for Linux, is this something I can install directly or does it need to be made part of Unraid for it to stay after updates?  If it is ok to install, dumb question, but how?

Check your motherboard's site for a BIOS update if possible

Link to comment
3 hours ago, yippy3000 said:

There is an update but there aren't any release notes so I asked support and they said they did not think it included the micro-code update.

 

Is there any way to tell if I am running the fixed micro-code for the CPU after I do the BIOS update?

Is it a Supermicro? I wish they had release notes for BIOS updates.
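On the question of telling which microcode is running: on Linux, `/proc/cpuinfo` reports a `microcode` field for each logical CPU, so one option is to note the value before the BIOS update and compare it afterwards. Below is a minimal sketch; the function name is hypothetical, and the revision that actually contains the fix is not stated in this thread, so check the Debian advisory for the number to compare against.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The field is printed in hex, e.g. "0xba".
            revisions.add(int(value.strip(), 16))
    return revisions
```

For example, `microcode_revisions(open("/proc/cpuinfo").read())` run before and after the update shows whether the revision changed. A set with more than one value would mean the logical CPUs disagree, which is worth investigating on its own.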

Link to comment
