vfio: Unable to power on device, stuck in D3 (GPU passthrough issue)


amelius


Hi, 

 

I've been trying to get my GPU(s) to properly pass through to my VMs, and I keep encountering two weird things.

 

1) If I don't reboot between VM startups, I get a weird error: "internal error: Unknown PCI header type '127'"

 

But more problematically,

2) "vfio: Unable to power on device, stuck in D3" seems to happen in the logs whenever I boot up a VM with gpu passthrough, and the GPU doesn't get passed through, nothing shows up on screens, and if I check what the output is in VNC, it doesn't appear in device manager for windows, and for ubuntu, the whole OS seems to hang on login. 

 

System Specs:

Threadripper 1950X

Asus ROG Zenith Extreme Motherboard

64 GB DDR4-3000 memory

3x Samsung 960 Evo (this is my array)

2x GTX 1080 Ti Founders Edition (what I'm trying to pass through: one to a Windows 10 VM, one to an Ubuntu 16.04 VM)

 

So far I've tried blacklisting the GPUs and manually specifying the ROM dump. Both VMs use OVMF and Q35, and both work fine when only VNC is specified as the graphics adapter. I've also tried hiding KVM from the guest, to avoid the Nvidia issue where the GPU doesn't work if the driver sees KVM, but I'm not sure I did that right.

 

VM XML files are attached. 
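The relevant bits look roughly like this, if it helps (the PCI address and ROM path below are placeholders, not my exact values):

<features>
  ...
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
...
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <!-- placeholder address, use the GPU's actual bus/slot/function -->
    <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
  </source>
  <!-- placeholder path to the ROM dump -->
  <rom file='/mnt/user/isos/1080ti.rom'/>
</hostdev>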

 

Syslinux config: 

 

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel /bzimage
  append iommu=pt vfio-pci.ids=10de:1b06 initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
 

So, anyone got any ideas? 

ubuntuvm.xml

windowsvm.xml


LOL. Did you pull the trigger on Threadripper as soon as the passthrough fix was announced? I ran into exactly the same issues today. (Stuck in D3, and need to reboot after each VM boot.)

 

System Specs:

  • Threadripper 1950X
  • ASRock - X399 Taichi
  • 64 GB DDR4 memory
  • Lots of drives
  • 2x GTX 960

 

I'm trying to pass through one of my GPUs to a Windows 10 VM, and the other to a different Windows 10 VM. I have a 3rd GPU in the first slot, and I use that for my Unraid bootup.

 

Before the server rebuild, I was able to pass the NVIDIA GPUs through with my Xeon processor as long as I did the ACS override. I would hear the fans spin up when the VM started. No such luck this time.
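(By "the ACS override" I just mean the usual kernel flag on the append line, roughly:)

append pcie_acs_override=downstream initrd=/bzroot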

 

The first problem I ran into was the ROM error, so I followed the instructions in this video on how to download a ROM and edit it to work with KVM.

 

I didn't try blacklisting the devices in the kernel, but I did try adding disable_idle_d3=1 to the boot options. No luck.
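In syslinux terms that was roughly this on the append line (assuming the vfio-pci module-parameter spelling):

append vfio-pci.disable_idle_d3=1 initrd=/bzroot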

 

Attaching my VM XML.

juggernaut-2017-11-04.xml


I was going to try blacklisting my cards in the kernel boot params, but both my cards have the same IDs, so I'm not sure if "vfio-pci.ids=10de:1401,10de:0fba" will work...

 

$ lspci -nn | grep NVIDIA
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev ff)
09:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fba] (rev ff)
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev ff)
41:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fba] (rev ff)
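If the shared IDs turn out to be a problem, the workaround I've seen is binding by PCI address instead of by ID. Just a sketch I haven't tried yet, using the addresses from the lspci output above:

# unbind the first card from whatever driver currently has it (skip if unbound)
echo 0000:09:00.0 > /sys/bus/pci/devices/0000:09:00.0/driver/unbind
# force vfio-pci for this one device only
echo vfio-pci > /sys/bus/pci/devices/0000:09:00.0/driver_override
# re-probe so vfio-pci claims it; repeat for 09:00.1 (the audio function)
echo 0000:09:00.0 > /sys/bus/pci/drivers_probe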

 


So after struggling with Unraid and digging through the forums, I ended up pivoting to ESXi, which has no problem with GPU passthrough (though the configuration is a pain) but a surprising amount of difficulty passing through USB devices (also surmountable). The only downside is the lack of convenient software RAID support.

6 minutes ago, amelius said:

So after struggling with Unraid and digging through the forums, I ended up pivoting to ESXi, which has no problem with GPU passthrough (though the configuration is a pain) but a surprising amount of difficulty passing through USB devices (also surmountable). The only downside is the lack of convenient software RAID support.

 

I thought Nvidia passthrough was a no-go with ESXi? Also, ESXi doesn't show temperatures for the motherboard etc., does it?

20 minutes ago, mikeyosm said:

 

I thought Nvidia passthrough was a no-go with ESXi? Also, ESXi doesn't show temperatures for the motherboard etc., does it?

No idea where you heard that. Sure, that's what their site *claims*, but in reality it's not an issue: you just need to set hypervisor.cpuid.v0 = FALSE and it's all fine. ESXi also handles that D3 issue without a problem, since you can control how it powers PCI devices on and off. (Tip: if you want to reuse the same GPU without rebooting, don't do a forced power-off on a VM that has a GPU passed through; only a proper guest shutdown makes the card available again to that same (or another) VM without rebooting the host.) If you want a guide to passthrough on ESXi, https://www.reddit.com/r/Amd/comments/72ula0/tr1950x_gtx_1060_passthrough_with_esxi/ has a rough outline that works perfectly. I tested it with my configuration and got both Windows 10 and Ubuntu 16.04 up and running with a GTX 1080 Ti passed through to each (on Ubuntu I hit an annoying login loop I've seen before, but that's just Xorg and the Nvidia drivers not playing nice, not a passthrough problem).
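(In case anyone needs the exact syntax, it goes into the VM's Configuration Parameters / .vmx file like this:)

hypervisor.cpuid.v0 = "FALSE"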

 

6 hours ago, coppit said:

I didn't try blacklisting the devices in the kernel, but I did try adding disable_idle_d3=1 to the boot options. No luck.

Tried that; it didn't help. I looked around, and it seems this is an issue with KVM virtualization, and the only hypervisors that work are a) the Windows hypervisor and b) ESXi. I've tested ESXi and it's working well for me. If you want to make use of your system rather than wait for fixes for this issue, you might want to give ESXi a shot.

 

3 minutes ago, amelius said:

No idea where you heard that. Sure, that's what their site *claims*, but in reality it's not an issue: you just need to set hypervisor.cpuid.v0 = FALSE and it's all fine. ESXi also handles that D3 issue without a problem, since you can control how it powers PCI devices on and off. (Tip: if you want to reuse the same GPU without rebooting, don't do a forced power-off on a VM that has a GPU passed through; only a proper guest shutdown makes the card available again to that same (or another) VM without rebooting the host.) If you want a guide to passthrough on ESXi, https://www.reddit.com/r/Amd/comments/72ula0/tr1950x_gtx_1060_passthrough_with_esxi/ has a rough outline that works perfectly. I tested it with my configuration and got both Windows 10 and Ubuntu 16.04 up and running with a GTX 1080 Ti passed through to each (on Ubuntu I hit an annoying login loop I've seen before, but that's just Xorg and the Nvidia drivers not playing nice, not a passthrough problem).

 

Ah, I see, that's good to know. I did read somewhere that hypervisor.cpuid.v0 = FALSE disables certain performance enhancements within the W10 VM, so I was reluctant to use that parameter. How about temperature monitoring? What does ESXi / vCenter show in terms of temperatures for your devices?

Just now, mikeyosm said:

Ah, I see, that's good to know. I did read somewhere that hypervisor.cpuid.v0 = FALSE disables certain performance enhancements within the W10 VM, so I was reluctant to use that parameter. How about temperature monitoring? What does ESXi / vCenter show in terms of temperatures for your devices?

I haven't really bothered with temperature monitoring yet; it supposedly might require extra drivers (I'm not sure), but I don't really care, since I have a custom watercooling loop with 600 W more thermal dissipation capability than the balls-to-the-wall TDP my system components can generate overclocked. As for hypervisor.cpuid.v0 = FALSE disabling performance enhancements: maybe it does, but when I benchmarked against another rig with a 1080 Ti in it, the difference was pretty negligible (and even then, the one that won out simply had a slightly higher overclock). I also tested a game and saw maybe a 2 fps difference at 4K and 1440p. I'd put any potential performance hit squarely in the "entirely imperceptible" category.

2 hours ago, mikeyosm said:

Bugger that. I pulled the trigger on an Asus Zenith Extreme X399 but not on a processor yet.

I'm debating whether to return it and go X299 to avoid all the TR4 passthrough issues... Advice?

 

Wait... Am I to understand that when folks report success with GPU passthrough on the Ryzen threads, none of it is with Threadripper? Ugh. I assumed all Ryzen chips would work at this point!

Edited by coppit
11 minutes ago, coppit said:

 

Wait... Am I to understand that when folks report success with GPU passthrough on the Ryzen threads, none of it is with Threadripper? Ugh. I assumed all Ryzen chips would work at this point!

Yeah, Threadripper still has issues with KVM-based virtualization. So far, the only thing I've seen work is ESXi, which uses a different virtualization stack entirely. I've heard the Windows hypervisor also works, but I haven't had a reason to test that.

  • 1 month later...

 

On 06/11/2017 at 9:48 PM, amelius said:

Idk where you heard that, sure, that's what their site *claims* but in reality, it's totally not an issue, you just need to set hypervisor.cpuid.v0  = FALSE and it's all fine. Also, ESXi handles that D3 issue no problem, since you can set the way it handles turning PCI devices on and off. (Tip, if you want to not reboot between using the same GPU, make sure you don't do a forced shutdown on a VM that has a GPU passed through, only a proper shutdown will make it available again for that same (or another) vm without a reboot of the host. If you want a guide that outlines passing through in ESXi, https://www.reddit.com/r/Amd/comments/72ula0/tr1950x_gtx_1060_passthrough_with_esxi/ has a rough outline that works perfectly. I tested it with my configuration and had no issues with getting up and running on both Windows 10 and Ubuntu 16.04 with a GTX 1080 Ti passed through on each one (though on Ubuntu I ran into an annoying login loop i've run into before, but that's just an issue with Xorg and Nvidia drivers not playing nice, not an issue with the passthrough though.)

 

 

So are you saying you can pass GeForce GTX cards through on ESXi? I have always read that was not possible. When did that change?

8 hours ago, heratic said:

 

 

So are you saying you can pass GeForce GTX cards through on ESXi? I have always read that was not possible. When did that change?

I have no idea if or when it changed, but as long as you set the property I listed above it works fine. I have three 1080 Tis and a Titan V passed through, all working. The only caveat is that if you don't shut down a VM gracefully (i.e. you power it off instead of shutting it down), the GPU associated with that VM won't work until you reboot the whole host.

  • 4 weeks later...
  • 1 year later...

Not to dredge up ancient history, but I am now having the "stuck in D3" issue after upgrading the BIOS from v14 to v18 on my MSI X470 Gaming M7 motherboard with a 2nd-gen Ryzen 2700X CPU, using an Nvidia EVGA 1070 card. This just happened last weekend (Mar 2019). The BIOS update seemed innocuous enough, but it did quite a number on my VM setup... In fact, I cannot pass through my GPU or sound card without errors.

 

Sounds like all the TR folks went with a BIOS update in late 2017 and got everything working. Why this cropped up for me now is beyond me! I am going to buy a super cheap secondary GPU to run Unraid on, so I can get my main GPU onto a VM again... Any other options you can see for me?

 

 

 

(screenshot attached)

Edited by mattz
Deleted extra screen shot.
  • 5 weeks later...
On 3/22/2019 at 11:49 PM, mattz said:

Not to dredge up ancient history, but I am now having the "stuck in D3" issue after upgrading the BIOS from v14 to v18 on my MSI X470 Gaming M7 motherboard with a 2nd-gen Ryzen 2700X CPU, using an Nvidia EVGA 1070 card. This just happened last weekend (Mar 2019). The BIOS update seemed innocuous enough, but it did quite a number on my VM setup... In fact, I cannot pass through my GPU or sound card without errors.

 

Sounds like all the TR folks went with a BIOS update in late 2017 and got everything working. Why this cropped up for me now is beyond me! I am going to buy a super cheap secondary GPU to run Unraid on, so I can get my main GPU onto a VM again... Any other options you can see for me?

 

Did you get anywhere with this?

I've tried 3 different cards now, all with exactly the same result (as described in the first post):

R5 230

GT 710

GTX 1070

 

I had a GT 760 I was passing through without any trouble, but "things" started acting up. The eventual solution was to pull the 760, but not before I tried a system BIOS update. It sounds like that was a mistake, and I can't flash back.

 

Currently running the 710 as the system GPU and passing through the 230.

If I don't try to pass through a ROM, the screen never lights up; "vfio: Unable to power on device, stuck in D3" appears in the log, but the VM does eventually start (I get a Steam notification).

If I shut down the VM, I cannot restart it.

Turning off the VM service and then restarting it gives a "libvirt failed to start" error.

 

After any attempt to start a VM, the server will not restart/reboot and I have to hit the reset button or power cycle.

 

1700X on a Prime X370-Pro
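For what it's worth, the host-side power state can be checked from the command line; something like this (the PCI address is just an example, not necessarily my card's) shows whether the card really is sitting in D3, and a remove/rescan is worth trying before a full power cycle:

# check the device's reported power state (look for "Status: D0" vs "Status: D3")
lspci -vv -s 0a:00.0 | grep -i -A3 'Power Management'
# remove the device and rescan the bus to see if it comes back awake
echo 1 > /sys/bus/pci/devices/0000:0a:00.0/remove
echo 1 > /sys/bus/pci/rescan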

Edited by shuruga2
added my hardware

@shuruga2 - Sorry about the delayed response. The only success I had with this was downgrading the BIOS back to an earlier version, and that works like a charm. You should be able to do this with some *unsupported* software.

 

Check the post I started here; other folks have helped me with my MSI BIOS downgrade, but I think someone mentioned Asus (which is your Prime X370, right?):

 

18 hours ago, mattz said:

@shuruga2 - Sorry about the delayed response. The only success I had with this was downgrading the BIOS back to an earlier version, and that works like a charm. You should be able to do this with some *unsupported* software.

 

Check the post I started here; other folks have helped me with my MSI BIOS downgrade, but I think someone mentioned Asus (which is your Prime X370, right?):

 

Thanks, I'll take a look and see what I can do

  • 4 weeks later...
  • 3 weeks later...

Hello,

I have the same issue, though with another mobo (Asus).

I downgraded as well, but maybe too far.

To isolate the AGESA / AM4 ComboPI versions that are causing the issue, can you send me the links to your MSI BIOSes, please?

Regards,

Greg Bahde

