HDD Image Is Consistently FUBAR



I just switched to unRAID from FreeNAS and I'm liking it, but I'm running into an issue where, after spending hours setting up a VM and getting everything working as expected, something later on absolutely destroys the filesystem and makes the VM unusable. I'm pretty familiar with libvirt/KVM/QEMU since I use them at work and have used them on my server and laptop.

 

First off, I decided to try a pre-built image called OpenFLIXR, which I found on here. After wasting hours getting it working, it would just lock up: a simple "reboot" from the unRAID UI wouldn't do anything, nor would issuing commands via the virsh shell, so I would be forced to do a hard shutdown, which would lead to filesystem errors. After running fsck and clearing the errors, then rebooting and remounting it, it would be fine for a while. Then I would try to shut down, it would hang again, and another hard shutdown would produce more errors.
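
For reference, the dance I went through each time looked roughly like this (the VM name matches the config further down):

virsh list --all          # VM still shows as "running" even though it's hung
virsh reboot Pirate       # graceful reboot request - ignored when hung
virsh shutdown Pirate     # ACPI shutdown request - also ignored
virsh destroy Pirate      # hard power-off, the only thing that worked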

 

After getting extremely frustrated I decided to nuke the VM and disk image and start fresh, since I didn't need a lot of what was in that image anyway. So about 24 hours ago I created a fresh Arch VM, and the installation, post-setup, and everything else worked fine: no hangs during reboots, no filesystem errors, nothing. It was downloading everything at breakneck speeds, and after a few hours I thought I finally had it working well, so I went to sleep. When I checked on it just now, NZBGet was throwing errors that it couldn't write to the queue directory; I checked and all the permissions were fine. It turns out that at some point the filesystem had been remounted read-only. Confused, I looked at dmesg: right after I went to sleep it started throwing filesystem and I/O errors, and apparently everything stopped at that point.
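
For anyone hitting the same thing, a quick way to confirm the read-only state from inside the guest is something like:

findmnt -no OPTIONS /               # shows "ro,..." instead of "rw,..." once the fs flips
dmesg | grep -iE 'ext4|i/o error'   # surfaces errors like the ones pasted below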

 

I figured a simple fsck would fix it like before, but not this time: this thing was completely FUBAR. Instead of the normal output it told me "unable to set superblock flags on /dev/vda1". I was thinking "great, it's totally destroyed", but I found a recovery process on the Ubuntu Forums that had worked for another user, so I tried forcing fsck to use a backup superblock. That seemed to work, since it fixed a bunch of the errors... then it told me it couldn't write the changes to the disk. Just for the hell of it I rebooted, and of course it wouldn't boot at all. So there went a few hours' worth of work.
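
The recovery attempt boiled down to roughly this (the backup superblock location depends on the block size and fs layout, so list them first):

mke2fs -n /dev/vda1         # dry run only: prints the backup superblock locations, writes nothing
e2fsck -b 32768 /dev/vda1   # re-run the check against one of the backup superblocks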

 

I don't want to set up another VM until I figure out what the hell the issue is, because this is extremely annoying.

 

Here are the initial errors from dmesg this morning:

[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 8761448
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 0
[Mon Jul 17 09:22:26 2017] Aborting journal on device vda1-8.
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 0
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 16880640
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109824, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109825, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109826, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109827, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109828, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109829, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109830, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109831, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109832, lost async page write
[Mon Jul 17 09:22:26 2017] Buffer I/O error on dev vda1, logical block 2109833, lost async page write
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 16881112
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 18035984
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 10360
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 4196360
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 2048
[Mon Jul 17 09:22:26 2017] blk_update_request: I/O error, dev vda, sector 2048
[Mon Jul 17 09:22:26 2017] EXT4-fs error (device vda1): ext4_journal_check_start:60: Detected aborted journal
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): previous I/O error to superblock detected
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): Remounting filesystem read-only
[Mon Jul 17 09:22:26 2017] JBD2: Error -5 detected when updating journal superblock for vda1-8.
[Mon Jul 17 09:22:26 2017] EXT4-fs error (device vda1): ext4_journal_check_start:60: Detected aborted journal
[Mon Jul 17 09:22:26 2017] EXT4-fs (vda1): previous I/O error to superblock detected
[Mon Jul 17 09:50:18 2017] EXT4-fs (vda1): pa ffff88004db0fc98: logic 128, phys. 405504, len 128
[Mon Jul 17 09:50:18 2017] EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3823: group 12, free 94, pa_free 82
[Mon Jul 17 09:50:18 2017] EXT4-fs (vda1): previous I/O error to superblock detected

 

Here's my VM config:

<domain type='kvm'>
  <name>Pirate</name>
  <uuid>57d80625-ec2f-2a32-0e5d-27f6f4b5581f</uuid>
  <description>Pirating Apps and Media Servers</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Arch" icon="arch.png" os="arch"/>
  </metadata>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='4'/>
    <vcpupin vcpu='4' cpuset='5'/>
    <vcpupin vcpu='5' cpuset='6'/>
    <vcpupin vcpu='6' cpuset='8'/>
    <vcpupin vcpu='7' cpuset='9'/>
    <vcpupin vcpu='8' cpuset='10'/>
    <vcpupin vcpu='9' cpuset='12'/>
    <vcpupin vcpu='10' cpuset='13'/>
    <vcpupin vcpu='11' cpuset='14'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-2.7'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='6' threads='2'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/domains/Pirate/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x03' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x02' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f8:dc:c9'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' websocket='-1' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
</domain>
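
Side note on the config above: the vdisk is using cache='writeback'. If caching turns out to be a factor, switching to direct I/O would only mean changing the driver line, e.g.:

<driver name='qemu' type='raw' cache='none' io='native'/>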

 

My hardware is as follows:

SuperMicro X10SDV-F-0 w/ Xeon D-1540 (8 cores/16 threads at 2 GHz)

2x 32 GB Samsung DDR4 ECC DIMMs

1x 256 GB Samsung 960 Evo NVMe drive as my cache drive

3x unused SSDs (Samsung 950 Evo [SATA], Intel S3700, Crucial M4)

JBOD: 7x 4 TB HGST, 2x 1 TB WD Red, 1x 1 TB WD Green, 1x 1 TB WD Black

 

The SMART status of all the drives is Pass (my parity drive is showing a Reallocated Sector Count of 736), so I don't think my disks are the issue. The first VM image was on the cache drive; the second one is most likely on one of the array disks, since it lives in my user share. I think this is more of an issue with the VirtIO driver than anything else.
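
Since user shares fan out across the cache drive and the array disks, something like this should show which physical device actually holds the image:

ls -l /mnt/cache/domains/Pirate/vdisk1.img /mnt/disk*/domains/Pirate/vdisk1.img 2>/dev/null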

 

Any thoughts?


You should provide the system diagnostics (Tools->Diagnostics) zip file to see if anyone can spot something there.

 

On a first reading of your comments, I would be worried about any drive that has 736 reallocated sectors. Although in theory reallocated sectors should not matter, I find that once they climb above a relatively small value the drive often becomes unreliable. In any case, the diagnostics mentioned above will include SMART reports for all drives so others can comment. Your description strongly suggests some sort of drive-related issue.
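
If you want to pull just that attribute from the console, something along these lines works (substitute the right device name):

smartctl -A /dev/sdb | grep -i reallocated   # SMART attribute 5, Reallocated_Sector_Ct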


Logs are full of timeout errors on the NVMe device; try re-seating it:

 

Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 332 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 333 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 334 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 335 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 336 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 337 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 338 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: I/O 339 QID 7 timeout, aborting
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:13 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
Jul 17 04:31:14 Tower kernel: nvme nvme0: completing aborted command with status: 0000
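
If re-seating doesn't help, the drive's own health data is worth a look too, assuming smartmontools/nvme-cli are available:

smartctl -a /dev/nvme0      # smartmontools understands NVMe devices
nvme smart-log /dev/nvme0   # media errors, temperature, error-log entry count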

 


unRAID does tend to stress the hardware more than other systems. In the old days I used round IDE cables in Windows all the time with absolutely no problem; they were nothing but trouble in unRAID.

If the connection is not rock solid, unRAID will find it!
