Windows VM updates have brought down UNRAID twice now?


Twice in the past several months my Win10 VM has wanted to update itself; last night was the big Fall Creators Update. Both times the GUI became unresponsive, the shares disappeared from the network, and the server would not respond to PuTTY. The only thing I could do was a hard reset and parity check. Since I couldn't get to the GUI or PuTTY, the best I could think of was the attached "screenshot" (literally a photo of the monitor). The diagnostics are from after the hard reset; I'm not sure they are helpful, but I didn't know how else to get them before resetting. I'm up and running again with the VM successfully updated, so hopefully the parity check goes well. Obviously I'm doing something wrong with these updates. How should I handle this in the future?

 

tia

20171208_060816.jpg

unraid-diagnostics-20171208-0835.zip

Link to comment

Does anyone have any feedback at all on this? I have no idea where to begin, and it's just a matter of time before Windows updates again. I now have a months-old drive showing an offline uncorrectable error, and I'm running an extended SMART test on it now.

 

How could a VM updating/restarting be affecting the GUI? 

Link to comment

The "waiting for lo to become free" message is nothing at all to worry about.

You're being inundated with messages like:

Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv/PrettyinPink-17334270-0.ts (/): not exported
Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported

 

 

Your server's IP address is 192.168.1.100, but all of those refused mounts come from 192.168.1.101. Unfortunately I know nothing at all about NFS sharing, which it appears you're using.
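To see which client and which paths account for the flood, the refused requests can be tallied from the syslog. This is only a minimal Python sketch, assuming the line format quoted above; the function name `count_refused_mounts` and the `sample` list are made up for illustration and are not part of unRAID.

```python
import re
from collections import Counter

# Matches rpc.mountd refusals in the format quoted above.
REFUSED = re.compile(
    r"rpc\.mountd\[\d+\]: refused mount request from (?P<client>\S+) "
    r"for (?P<path>\S+) \([^)]*\): not exported"
)

def count_refused_mounts(lines):
    """Return (per-client counts, per-path counts) for refused NFS mounts."""
    clients, paths = Counter(), Counter()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            clients[match["client"]] += 1
            paths[match["path"]] += 1
    return clients, paths

# The four lines quoted in this post, used as sample input.
sample = [
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia/tv/PrettyinPink-17334270-0.ts (/): not exported",
    "Dec  8 06:37:23 UNRAID rpc.mountd[4620]: refused mount request from 192.168.1.101 for /sagemedia (/): not exported",
]

clients, paths = count_refused_mounts(sample)
print(clients.most_common())   # busiest clients first
print(paths.most_common())     # most requested paths first
```

On a live system the same function could be fed an open log file instead of `sample`; the location of that file (for example `/var/log/syslog`) is an assumption to check on your own server.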

 

The error in your screenshot referring to loop0 says that your docker.img is having problems, but the next line says that the cache pool is also having problems (which probably caused the docker.img problem).

 

Beyond that, nothing particularly jumps out at me in perusing the logs.

 

@johnnie.black is the real expert here on btrfs and cache pools

Link to comment

Okay, NFS is a remnant of an abandoned project, so I can fix that, and I can get rid of the SageTV docker as I was going to redo it from scratch anyway. But why would an OS update of a VM trigger issues for the whole server? If I reboot the VM from the GUI when it doesn't have an update to do, it goes down and comes back up just fine.

Link to comment

I uninstalled the NFS stuff from the VM that I wasn't using anymore. Windows of course found another tiny update, so I went ahead and let it reboot since this was not a major update. UNRAID went down again, and I can confirm no network shares, no GUI, no SSH, no VMs; it's like it's not even there. All I can do is pull the plug, since even IPMI is not allowing an orderly shutdown (see attached).

Has no one seen this before?

image.png

Link to comment

I finally decided it was time to abandon the problematic Win10 VM, so I spent the weekend going back to a fresh Win7 one, hoping that fewer updates would make a crash less likely. All was great until just now. The VM wanted to reboot after a video card driver update, and the whole thing once again went silent: no dockers, no VMs, no SSH, no shares. I had to power cycle via IPMI, so I'm now on my 10th parity check in the past week, no exaggeration, and my newest server-grade hard drive is now showing an offline uncorrectable error.

 

I really need some help here. How do I capture logs that will survive these hard resets so I can figure out what is happening? 

Link to comment

Note that hard power cycling should normally not cause any offline uncorrectable errors.

 

An HDD normally makes use of rotational energy to make sure it can finish an ongoing sector write if it is in the middle of writing a sector when the power is lost. What will be lost is the buffered data that the drive hasn't started to write, together with the data cached in the OS and in transit between OS and disk.

 

And at least enterprise-level SSDs have capacitors to make sure they have enough stored energy to finish a pending flash block write and to panic-store any internal state. So for an SSD, too, it would normally be the data cached in the disk and OS, and the data in transit between computer and disk, that will be lost.

 

The above is the reason why better RAID cards support battery-backed cache: the OS can hand off a larger set of disk updates and get a transfer barrier confirming that the data has been accepted by the disk subsystem. The RAID card then makes sure that the in-transit/buffered data will be sent a second time to the disks after power is restored.

 

The reason unRAID needs to scan the parity after an unclean shutdown is the huge problem (impossible with normal file systems without hardware help) of getting the OS to flush all cached data for multiple drives and getting each drive to complete its individual writes.
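A toy illustration of why that scan is needed, assuming single parity computed as a bytewise XOR across the data disks. The disk contents and the helper names `xor_parity` and `parity_errors` are made up for the example and are not unRAID code. If a data write lands but power is lost before the matching parity write, the stored parity no longer matches the data, and only a full scan can find where.

```python
from functools import reduce

def xor_parity(disks):
    """Bytewise XOR of equally sized data disks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

def parity_errors(disks, parity):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = xor_parity(disks)
    return [i for i, (e, p) in enumerate(zip(expected, parity)) if e != p]

# Three hypothetical data disks, four bytes each.
disks = [
    bytearray(b"\x01\x02\x03\x04"),
    bytearray(b"\x10\x20\x30\x40"),
    bytearray(b"\xaa\xbb\xcc\xdd"),
]
parity = xor_parity(disks)
clean = parity_errors(disks, parity)   # parity matches after a clean write

# Simulate an unclean shutdown: the data write reaches disk 1,
# but the matching parity update is lost.
disks[1][2] = 0x99
dirty = parity_errors(disks, parity)   # offsets that need a parity correction

print("before power loss:", clean)
print("after power loss: ", dirty)
```

Nothing on the disks themselves marks which offsets went stale, which is why the check has to read every disk from start to finish.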

 

Anyway - I haven't seen any white paper where the disk manufacturers show actual test results of their power-loss strategies. So it's just possible your offline uncorrectable error was caused by the drive not having enough power to finish an ongoing sector write, resulting in the sector containing garbage. Product management likes to mention a beautiful list of features, but often without backing the claims with hard data. Enterprise disks would seldom have to suffer hard power losses, since enterprise servers run on a UPS.

Link to comment
