Windows VM updates have brought down UNRAID twice now?

btrcp2000 · December 8, 2017

Twice in the past several months my Win10 VM has wanted to update itself; last night was the big Fall Creators one. Both times the GUI became unresponsive, the shares disappeared from the network, and I could not get it to respond to PuTTY. The only thing I can do is a hard reset and parity check. Since I couldn't get to the GUI or PuTTY, the best I could think of is the attached "screenshot" (literally a photo of the monitor screen). Diags are from after the hard reset; not sure if that is helpful, but I didn't know how else to get them before resetting. Up and running again with the VM successfully updated, so hopefully the parity check goes well. Obviously I'm doing something wrong with these updates. How should I handle this in the future?

tia

unraid-diagnostics-20171208-0835.zip

btrcp2000 · December 13, 2017

Happening again. Win10 wanted to update again, and I cannot reach the GUI or data on shares. Don't know about ssh because I can't remember the IP of the unraid server, and the router is a unifi running in a docker, which is of course also inaccessible.

Does anyone have any insight?

btrcp2000 · December 13, 2017

Let it go all night but it never recovered, so I had to hard reset once again and it's now running a parity check. This time a drive came up with offline uncorrectable errors upon reboot.

What should I be doing differently with these windows updates?

btrcp2000 · December 14, 2017

Does anyone have any feedback at all on this? I have no idea where to begin, and it's just a matter of time before Windows will update again. I now have a months-old drive showing an offline uncorrectable error; running an extended SMART test now.

How could a VM updating/restarting be affecting the GUI?

Squid · December 14, 2017

The "waiting for lo to become free" message is nothing at all to worry about.

You're being inundated with messages like

Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv/PrettyinPink-17334270-0.ts (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported

The IP address of your server is 192.168.1.100, but all of those refused mounts are from 192.168.1.101. Unfortunately I know nothing at all about NFS sharing, which it appears you're using.
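To see which client is generating the refused mounts and which paths it keeps asking for, the syslog lines can be tallied. The following is only a minimal sketch: the function name `summarize_refused_mounts` and the regular expression are illustrative assumptions based on the four lines quoted above, not part of any Unraid tooling.

```python
import re
from collections import Counter

# Pattern written against the rpc.mountd lines quoted in this post (assumption:
# other refused-mount lines in the log follow the same layout).
REFUSED = re.compile(
    r"rpc\.mountd\[\d+\]: refused mount request from "
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) for (?P<path>\S+)"
)


def summarize_refused_mounts(lines):
    """Count refused NFS mount requests per client IP and per requested path."""
    clients, paths = Counter(), Counter()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            clients[match["client"]] += 1
            paths[match["path"]] += 1
    return clients, paths


# The four sample lines from the diagnostics, copied verbatim.
SAMPLE = [
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv/PrettyinPink-17334270-0.ts (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported",
]

clients, paths = summarize_refused_mounts(SAMPLE)
for ip, count in clients.most_common():
    print(ip, count)
```

Run against the full syslog instead of `SAMPLE`, this would show whether 192.168.1.101 is the only machine making these requests.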

The error in your screenshot referring to loop0 says that your docker.img is having problems, but the next line says that the cache pool is also having problems (which probably caused the docker.img problem).

Beyond that, nothing particularly jumps out at me in perusing the logs.

@johnnie.black is the real expert here on btrfs and cache pools

JorgeB · December 14, 2017

Difficult to say for sure with just the screenshot but likely the docker image is the only problem.

btrcp2000 · December 14, 2017

Okay, NFS is a remnant of an abandoned project, so I can fix that, and I can get rid of the Sagetv docker as I was going to redo it from scratch anyway. But why would an OS update of a VM trigger issues for the whole server? If I reboot the VM from the GUI when it doesn't have an update to do, it goes down and comes back up just fine.

Squid · December 14, 2017

It shouldn't. Some users have had major OS updates crash the VM unless the number of cores assigned to the VM is set to 1. (I've never had that problem.)

btrcp2000 · December 14, 2017

I uninstalled the NFS stuff from the VM that I wasn't using anymore. Windows of course found another tiny update, so I went ahead and let it reboot since this was not a major update. UNRAID down again, and I can confirm no network shares, no GUI, no SSH, no VMs; it's like it's not even there. All I can do is pull the plug; even IPMI is not allowing an orderly shutdown (see attached).

Has no one seen this before?

btrcp2000 · December 18, 2017

I finally decided it was time to abandon the problem Win10 VM, so I spent the weekend going back to a fresh Win7 one, hoping that fewer updates would lessen the likelihood. All great until just now. The VM wanted to reboot after a video card driver update, and the whole thing once again went silent: no dockers, no VMs, no ssh, no shares. Had to power cycle via IPMI, so I'm now on my 10th parity check in the past week, no exaggeration, and my newest server-grade hard drive is now showing an offline uncorrectable error.

I really need some help here. How do I capture logs that will survive these hard resets so I can figure out what is happening?
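One general way to keep a log across a hard reset is to keep appending it to storage that persists across reboots, forcing each write to the device. The sketch below is only an illustration under assumptions: the function names are hypothetical, and the paths in the usage comment (`/var/log/syslog` as the live log, `/boot` as the flash drive mount) are assumptions about the server's layout that should be checked first.

```python
import os
import time


def mirror_new_bytes(src_path, dst_path, offset=0):
    """Append any bytes of src_path beyond `offset` to dst_path.

    Returns the new offset so the caller can pass it to the next call.
    """
    if os.path.getsize(src_path) < offset:
        # Source shrank (rotated or truncated), so start again from the top.
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Ask the OS to push the data to the device so it is not left
            # in a cache that a hard reset would discard.
            os.fsync(dst.fileno())
    return offset + len(chunk)


def follow(src_path, dst_path, interval=5.0):
    """Mirror the log forever, polling every `interval` seconds."""
    offset = 0
    while True:
        offset = mirror_new_bytes(src_path, dst_path, offset)
        time.sleep(interval)


# Hypothetical usage (paths are assumptions, adjust to the actual system):
# follow("/var/log/syslog", "/boot/syslog-mirror.txt")
```

Polling with an explicit `fsync` trades some extra writes to the destination for a copy that should contain everything logged up to a few seconds before the hang.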

pwm · December 18, 2017

Note that hard power cycling should normally not cause any offline uncorrectable errors.

A HDD normally makes use of rotational energy to make sure it can finish an ongoing sector write if it is currently writing a sector when the power is lost. It's buffered data that the drive hasn't started to write that will be lost, together with the data cached in the OS and in-transit between OS and disk.

And at least enterprise-level SSDs have capacitors to make sure they have enough stored energy to finish a pending flash block write and to panic-store any internal state. So for an SSD it would normally also be cached data in the disk+OS and in-transit data between computer and disk that will be lost.

The above is the reason why better RAID cards support battery-backed cache, so the OS can hand off a larger set of disk updates and get a transfer barrier confirming that the data has been accepted by the disk subsystem. The RAID card then makes sure that the in-transit/buffered data will be sent a second time to the disks after power is restored.

The reason unRAID needs to scan the parity after an unclean shutdown is the huge problem (impossible with normal file systems without hardware help) of getting the OS to flush all cached data for multiple drives and getting the individual drives to complete their individual writes.

Anyway, I haven't seen any white paper where the disk manufacturers show actual test results of their power-loss strategies. So it's just possible your offline uncorrectable error was caused by the drive not having enough power to finish an ongoing sector write, resulting in the sector containing garbage. Product management likes to mention a beautiful list of features but often without backing their claims with hard data. Enterprise disks would seldom have to suffer hard power losses since enterprise servers are running with UPS.

btrcp2000 · December 18, 2017

Ok, thanks for clarifying. I guess that part is not as scary as it might seem, and UNRAID is not really complaining about the 1 error; it just shows it in SMART reporting.

Still need some guidance on the root issue of VMs crashing the server on reboots.

NewDisplayName · December 18, 2017

Check if u did that:

btrcp2000 · December 18, 2017

Yes, that's what I followed and I am successfully passing through two tv capture cards and a USB 3.0 card (set to 2.0). Is there something specific in there I should look for?
