brando56894 Posted July 18, 2017 (edited)

I just switched to unRAID from FreeNAS and I'm liking it, but I'm running into an issue where, after setting up a VM for hours and getting everything to work as expected, something later on absolutely destroys the filesystem and makes the VM unusable. I'm pretty familiar with libvirt/KVM/QEMU since I use it at work and have used it on my server and laptop.

First off I decided to try a pre-built image called OpenFLIXR, which I found on here. After wasting hours getting it working, it would just lock up: a simple "reboot" from the unRAID UI wouldn't do anything, nor would issuing commands via the virsh shell, so I would be forced to do a hard shutdown... which would lead to filesystem errors. After running fsck and clearing the errors, then rebooting and remounting it again, it would be fine for a while. Then I would try to shut down and it would just hang again, forcing me to do another hard shutdown, which resulted in more errors. After getting extremely frustrated I decided to nuke the VM and disk image and start fresh, since I didn't need a lot of things in that image anyway.

So about 24 hours ago I created a fresh Arch VM, and the installation, post-setup and everything was working fine. No hangs during reboots, no filesystem errors, nothing. It was downloading everything at breakneck speeds, and after a few hours I thought I finally had it working well, so I went to sleep. When I checked on it just now, NZBget was throwing errors that it couldn't write to the queue directory; I checked it and all permissions were fine. It turns out that at some point the filesystem had been remounted read-only. Confused, I looked at dmesg: right after I went to sleep it had started throwing filesystem and I/O errors, and apparently stopped doing everything at that point. I figured a simple fsck would fix it like it had before, but nope, not this time; this thing was completely FUBAR.
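A silent read-only remount like that is easy to catch by watching the options field in /proc/mounts. A minimal sketch (my own illustration, not something from the thread):

```python
def readonly_mounts(mounts_text):
    """Return mount points whose option list includes 'ro',
    given text in the /proc/mounts format (device, mountpoint,
    fstype, comma-separated options, dump, pass)."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Split the options field so 'errors=remount-ro' is not
        # mistaken for the 'ro' flag itself.
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            hits.append(fields[1])
    return hits

# On a live system: readonly_mounts(open("/proc/mounts").read())
sample = "/dev/vda1 / ext4 ro,relatime,errors=remount-ro 0 0"
print(readonly_mounts(sample))  # -> ['/']
```

Dropping that into a cron job or a monitoring check would have flagged the problem hours before NZBget started failing.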
Instead of the normal output it told me "unable to set superblock flags on /dev/vda1". So I was thinking "great, it's totally destroyed", but I did find a recovery process on the Ubuntu Forums that worked for another user, so I attempted to force it to use another superblock. That seemed to work, since it fixed a bunch of the errors... then it told me it couldn't write the changes to the disk. Just for the hell of it, I rebooted, and of course it wouldn't boot at all. So there goes a few hours' worth of work. I don't want to set up another VM until I figure out what the hell the issue is, because this is extremely annoying.

Here are the initial errors from dmesg this morning:

[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 8761448
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 0
[Mon Jul 17 09:22:26 2017] Aborting journal on device vda1-8.
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 0
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 16880640
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109824, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109825, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109826, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109827, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109828, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109829, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109830, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109831, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109832, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109833, lost async page write
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 16881112
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 18035984
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 10360
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 4196360
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 2048
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 2048
[Mon Jul 17 09:22:26 2017] EXT4-fs error (device vda1): ext4_journal_check_start:60: Detected aborted journal
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): previous I/O error to superblock detected
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): Remounting filesystem read-only
[Mon Jul 17 09:22:26 2017] JBD2: Error -5 detected when updating journal superblock for vda1-8.
[Mon Jul 17 09:22:26 2017] EXT4-fs error (device vda1): ext4_journal_check_start:60: Detected aborted journal
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): previous I/O error to superblock detected
[Mon Jul 17 09:50:18 2017] EXT4-fs (vda1): pa ffff88004db0fc98: logic 128, phys. 405504, len 128
[Mon Jul 17 09:50:18 2017] EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3823: group 12, free 94, pa_free 82
[Mon Jul 17 09:50:18 2017] EXT4-fs (vda1): previous I/O error to superblock detected

Here's my VM config:

<domain type='kvm'>
  <name>Pirate</name>
  <uuid>57d80625-ec2f-2a32-0e5d-27f6f4b5581f</uuid>
  <description>Pirating Apps and Media Servers</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Arch" icon="arch.png" os="arch"/>
  </metadata>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='4'/>
    <vcpupin vcpu='4' cpuset='5'/>
    <vcpupin vcpu='5' cpuset='6'/>
    <vcpupin vcpu='6' cpuset='8'/>
    <vcpupin vcpu='7' cpuset='9'/>
    <vcpupin vcpu='8' cpuset='10'/>
    <vcpupin vcpu='9' cpuset='12'/>
    <vcpupin vcpu='10' cpuset='13'/>
    <vcpupin vcpu='11' cpuset='14'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-2.7'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='6' threads='2'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/domains/Pirate/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x03' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x02' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f8:dc:c9'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' websocket='-1' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
</domain>

My hardware is as follows:

- SuperMicro X10SDV-F-0 w/ Xeon D-1540 (2 GHz)
- 2x 32 GB Samsung DDR4 ECC DIMMs
- 1x 256 GB Samsung 960 Evo NVMe drive as my cache drive
- 3x unused SSDs (Samsung 950 Evo [SATA], Intel S3700, Crucial M4)
- JBOD: 7x 4 TB HGST, 2x 1 TB WD Red, 1x 1 TB WD Green, 1x 1 TB WD Black

The SMART status of all the drives is Pass, and my parity drive is showing 736 for Reallocated Sector Count, so I don't think my disks are the issue. The first VM image was on the cache drive; the second one is most likely on one of the array disks, since it's in my User folder. I think this is more of an issue with the VirtIO driver than anything else. Any thoughts?

Edited July 18, 2017 by brando56894
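For anyone hitting the same "unable to set superblock flags" failure: e2fsck can be pointed at a backup superblock with -b, and under ext2/3/4's default sparse_super policy the backups sit at predictable block numbers (group 1, plus groups that are powers of 3, 5 and 7). A sketch of how those candidates are computed, assuming the default sparse_super layout; `mke2fs -n /dev/vda1` will print the real list without writing anything:

```python
def backup_superblocks(total_blocks, blocks_per_group, block_size=4096):
    """Block numbers of backup superblocks under the default
    sparse_super policy: group 1 plus powers of 3, 5 and 7.
    With 1 KiB blocks the superblock sits one block into the
    group; with larger blocks it is the group's first block."""
    offset = 1 if block_size == 1024 else 0
    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g * blocks_per_group < total_blocks:
            groups.add(g)
            g *= base
    return sorted(g * blocks_per_group + offset for g in groups
                  if g * blocks_per_group < total_blocks)

# Classic 1 KiB-block layout: first backup is the well-known block 8193
print(backup_superblocks(100000, 8192, 1024))
# A 4 KiB-block filesystem: first backup at block 32768
print(backup_superblocks(1000000, 32768, 4096))
```

If one of those numbers works, e.g. `fsck.ext4 -b 32768 -B 4096 /dev/vda1`, the filesystem may be recoverable. In this thread's case the underlying device was failing writes, so nothing fsck repaired could actually be written back.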
itimpi Posted July 18, 2017 (edited)

You should provide the system diagnostics (Tools -> Diagnostics) zip file to see if anyone can spot something there.

On first observation of your comments, I would be worried about any drive that had 736 reallocated sectors. Although in theory reallocated sectors should not matter, I find that once they get above a relatively small value the drive often becomes unreliable. Anyway, the diagnostics mentioned above will include SMART reports for all drives, so others can comment. There is a strong suggestion from your comments that there may be some sort of drive-related issue.

Edited July 18, 2017 by itimpi
brando56894 (Author) Posted July 18, 2017

Here ya go. The only drive that shows that error is the parity drive, and the first image was hosted on the cache drive, so I don't really see how one would affect the other, since the cache drive isn't protected by the parity drive, right?

tower-diagnostics-20170718-0238.zip
JorgeB Posted July 18, 2017

Logs are full of timeout errors on the NVMe device, try re-seating it:

Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 332 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 333 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 334 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 335 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 336 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 337 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 338 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 339 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
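Timeout bursts like this are easy to miss in a large syslog. A small helper (my own sketch, not an unRAID tool) that tallies NVMe I/O timeout events per controller:

```python
import re

def nvme_timeouts(log_text):
    """Count 'I/O ... timeout' events per NVMe controller in a
    syslog-style log, keyed by controller name (e.g. 'nvme0')."""
    counts = {}
    pattern = re.compile(r"nvme (nvme\d+): I/O \d+ QID \d+ timeout")
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

# On a live system: nvme_timeouts(open("/var/log/syslog").read())
sample = """\
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 332 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 333 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
"""
print(nvme_timeouts(sample))  # -> {'nvme0': 2}
```

A rising count over time points at a connection, firmware, or thermal problem on that specific device rather than at the filesystem inside the VM.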
brando56894 (Author) Posted July 18, 2017

Hmm, that's odd considering it was working fine in FreeNAS two days ago and I haven't touched the server at all... I'll give it a try when I get home, just for the hell of it.
SSD Posted July 18, 2017

unRAID does tend to stress the hardware more than other systems. In the old days, I used round IDE cables in Windows all the time with absolutely no problem. They were nothing but trouble in unRAID. If the connection is not rock solid, unRAID will find it!
JorgeB Posted July 18, 2017

Not sure what would cause them, as I've never seen these errors before on NVMe devices, but they are not normal and almost certainly a hardware issue.
HellDiverUK Posted July 18, 2017

960 Evo? Hmmm... the controller on those runs toasty; you might try sticking on a little heatsink. I use cheap RAM heatsinks that are sold for use on a Raspberry Pi on my 950 Pro, which stops it throttling.
brando56894 (Author) Posted July 18, 2017

I already have three on it, along with a 120 mm fan blowing on it. I reseated it this morning around 9 AM and there aren't any timeouts in dmesg now, but then again it hasn't really been doing anything.