Harro Posted August 29, 2017

I was changing a disk out, and when I rebooted I got this from another disk:

Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1
fatal error -- Input/output error

I ran xfs_repair -v on the disk and got no errors. The short SMART report passed, and I am now running an extended test. Anything else I can try? I am running testdisk on the disk I was replacing, hoping to recover the files, since I now have 2 red-balled drives and only 1 parity drive; I have been working and saving to get a second parity drive. I guess when it rains it pours... a little humor, since I am in TX.

current.zip
JorgeB Posted August 29, 2017 (edited)

You have 2 disabled disks with single parity, so unRAID can't emulate them. I don't have time to check all the SMART reports at the moment, but if you believe the disks are OK you'll need to do a new config and re-sync parity. If they are failing, or you are not sure, wait for someone to check SMART.

Edited August 29, 2017 by johnnie.black
Harro Posted August 29, 2017 Author

6 minutes ago, johnnie.black said:
You have 2 disabled disks with single parity, so unRAID can't emulate them. I don't have time to check all the SMART reports at the moment, but if you believe the disks are OK you'll need to do a new config and re-sync parity. If they are failing, or you are not sure, wait for someone to check SMART.

I am running an extended SMART test on the disk that failed after the reboot for the disk replacement. If I do a new config, I will lose one data drive for sure, correct?
JorgeB Posted August 29, 2017

Disk14 is disabled but its SMART looks fine; disk12 is missing, so I'm guessing that one really failed? You can try this, provided no array data changed since the 2nd failure:

- Tools -> New Config -> Retain current configuration: All -> Apply
- Re-assign old disk14 and a new disk for disk12
- Check both "parity is already valid" and "maintenance mode" before starting the array
- Start the array
- Stop the array, unassign disk12
- Start the array, check that emulated disk12 mounts and its contents look correct
- Stop the array, reassign new disk12
- Start the array to begin the rebuild
Harro Posted August 30, 2017 Author

On 8/29/2017 at 4:03 PM, johnnie.black said:
Disk14 is disabled but its SMART looks fine; disk12 is missing, so I'm guessing that one really failed? You can try this, provided no array data changed since the 2nd failure:
- Tools -> New Config -> Retain current configuration: All -> Apply
- Re-assign old disk14 and a new disk for disk12
- Check both "parity is already valid" and "maintenance mode" before starting the array
- Start the array
- Stop the array, unassign disk12
- Start the array, check that emulated disk12 mounts and its contents look correct
- Stop the array, reassign new disk12
- Start the array to begin the rebuild

Well, I followed these steps, and on startup disk14, disk15 and parity went offline: 14 and 15 show as unmountable and parity as a red ball. All of these are 8TB archive drives, and none of them show errors in their SMART reports. Either all these drives suck or my unRAID is dying a slow, painful death. All dockers are stopped, the mover is disabled, and everything is running on minimal resources.

At the moment I am copying files off my old disk 12 onto another unRAID system. I took the drive out and hooked it up to that machine with a USB connector. It looks like I may be in for a long haul if I have to do that with the other 2 disks, each with 4TB of data on them.

Harro.zip
JorgeB Posted August 30, 2017

Multiple disk errors are usually controller/cable/backplane related; that is much more likely than multiple disks failing at the same time.

Aug 29 16:44:20 Tower kernel: md: disk0 read error, sector=8589984024
Aug 29 16:44:20 Tower kernel: md: disk12 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk14 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk0 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk12 read error, sector=8589984040
Aug 29 16:44:20 Tower kernel: md: disk14 read error, sector=8589984040
Aug 29 16:53:02 Tower kernel: md: disk0 read error, sector=3907034824
Aug 29 16:53:02 Tower kernel: md: disk14 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk15 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk0 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk14 read error, sector=3907034840
Aug 29 16:53:02 Tower kernel: md: disk15 read error, sector=3907034840
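As a rough illustration of the point about simultaneous errors, the md read errors quoted above can be grouped by timestamp to see which disks reported failures in the same second. This is only a sketch: the regular expression assumes the exact syslog line layout shown in this thread, and the names `MD_READ_ERROR`, `group_read_errors` and `simultaneous_failures` are made up for the example, not part of unRAID.

```python
import re
from collections import defaultdict

# Assumed layout: "Aug 29 16:44:20 Tower kernel: md: disk0 read error, sector=123"
MD_READ_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+md:\s+"
    r"(?P<disk>disk\d+) read error, sector=(?P<sector>\d+)"
)


def group_read_errors(lines):
    """Map each timestamp to the set of disks that logged a read error then."""
    by_stamp = defaultdict(set)
    for line in lines:
        match = MD_READ_ERROR.match(line.strip())
        if match:
            by_stamp[match.group("stamp")].add(match.group("disk"))
    return dict(by_stamp)


def simultaneous_failures(lines, min_disks=2):
    """Return timestamps at which at least `min_disks` disks failed together."""
    result = {}
    for stamp, disks in group_read_errors(lines).items():
        if len(disks) >= min_disks:
            # Sort numerically so disk2 comes before disk12.
            result[stamp] = sorted(disks, key=lambda name: int(name[4:]))
    return result
```

Feeding it the lines from the syslog (for example `simultaneous_failures(open("syslog.txt"))`, where `syslog.txt` is a hypothetical file name) would list each second in which several disks errored at once, which is the pattern that points at shared hardware rather than at the individual disks.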
Harro Posted August 30, 2017 Author

5 hours ago, johnnie.black said:
Multiple disk errors are usually controller/cable/backplane related; that is much more likely than multiple disks failing at the same time.
Aug 29 16:44:20 Tower kernel: md: disk0 read error, sector=8589984024
Aug 29 16:44:20 Tower kernel: md: disk12 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk14 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk0 read error, sector=8589984032
Aug 29 16:44:20 Tower kernel: md: disk12 read error, sector=8589984040
Aug 29 16:44:20 Tower kernel: md: disk14 read error, sector=8589984040
Aug 29 16:53:02 Tower kernel: md: disk0 read error, sector=3907034824
Aug 29 16:53:02 Tower kernel: md: disk14 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk15 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk0 read error, sector=3907034832
Aug 29 16:53:02 Tower kernel: md: disk14 read error, sector=3907034840
Aug 29 16:53:02 Tower kernel: md: disk15 read error, sector=3907034840

I saw all those errors. What confuses me is that the parity disk is plugged into the motherboard, while the other disks are running off the HP H220 controller. I guess I will order new breakout cables and a new controller card. My goal is to shrink my array to 12 drives, all 8TB. With 8 onboard ports, which cards would you recommend?
JorgeB Posted August 30, 2017

The HP H220 is an LSI 9207 clone, so it should work fine with unRAID, though they are still on firmware p13; try updating both to p20. I also see timeout errors on the onboard controller:

Aug 29 12:58:58 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 29 12:58:58 Tower kernel: ata1.00: failed command: DEVICE CONFIGURATION OVERLAY
Aug 29 12:58:58 Tower kernel: ata1.00: cmd b1/c2:00:00:00:00/00:00:00:00:00/40 tag 2 pio 512 in
Aug 29 12:58:58 Tower kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 29 12:58:58 Tower kernel: ata1.00: status: { DRDY }
Aug 29 12:58:58 Tower kernel: ata1: hard resetting link
Aug 29 12:59:01 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 29 12:59:01 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 29 12:59:01 Tower kernel: ata1.00: configured for UDMA/133
Aug 29 12:59:01 Tower kernel: ata1: EH complete
Aug 29 12:59:09 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 29 12:59:09 Tower kernel: ata3.00: failed command: DEVICE CONFIGURATION OVERLAY
Aug 29 12:59:09 Tower kernel: ata3.00: cmd b1/c2:00:00:00:00/00:00:00:00:00/40 tag 1 pio 512 in
Aug 29 12:59:09 Tower kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 29 12:59:09 Tower kernel: ata3.00: status: { DRDY }
Aug 29 12:59:09 Tower kernel: ata3: hard resetting link
Aug 29 12:59:12 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 29 12:59:12 Tower kernel: ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 29 12:59:12 Tower kernel: ata3.00: configured for UDMA/133
Aug 29 12:59:12 Tower kernel: ata3: EH complete

These issues can also be power supply related.
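For anyone wanting to check their own syslog for the same kind of onboard-controller timeouts, a small sketch like the one below can tally them per ATA port. It assumes the kernel message layout seen in the excerpt above (a "failed command" line followed by a "res ... (timeout)" line for the same port); `summarize_ata_events` and the pattern names are hypothetical helpers written for this example.

```python
import re
from collections import defaultdict

# Patterns assume the libata message layout shown in the log excerpt above.
FAILED = re.compile(r"kernel:\s+(ata\d+)(?:\.\d+)?: failed command: (.+)$")
TIMEOUT = re.compile(r"kernel:\s+res .*\(timeout\)")
LINK_UP = re.compile(r"kernel:\s+(ata\d+): SATA link up ([\d.]+) Gbps")


def summarize_ata_events(lines):
    """Collect failed commands, timeouts and link speeds per ATA port."""
    summary = defaultdict(
        lambda: {"failed": [], "timeouts": 0, "link_gbps": []}
    )
    last_port = None  # the "res" line carries no port, so remember the last one
    for raw in lines:
        line = raw.strip()
        failed = FAILED.search(line)
        link = LINK_UP.search(line)
        if failed:
            last_port = failed.group(1)
            summary[last_port]["failed"].append(failed.group(2))
        elif TIMEOUT.search(line) and last_port is not None:
            summary[last_port]["timeouts"] += 1
        elif link:
            summary[link.group(1)]["link_gbps"].append(float(link.group(2)))
    return {port: dict(info) for port, info in summary.items()}
```

The per-port summary only makes the pattern easier to see; it does not say whether the cause is a cable, the controller or the power supply.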
Harro Posted September 3, 2017 Author

On 8/30/2017 at 3:41 PM, johnnie.black said:
The HP H220 is an LSI 9207 clone, so it should work fine with unRAID, though they are still on firmware p13; try updating both to p20.

I have updated the firmware to the latest from HP. It only says V 15.10.10.00, as of April of this year.

On 8/30/2017 at 3:41 PM, johnnie.black said:
These issues can also be power supply related.

I have also ordered a new 750 W power supply and breakout cables. After the firmware update I started unRAID back up but have not started the array. All disks are seen, but parity has a red X saying "ALL DATA ON THIS DISK WILL BE ERASED WHEN ARRAY IS STARTED". That is OK, but disk 12, which is the disk I replaced, still shows "Device is emulated". If I start the array, I assume the parity sync will start and I will lose the data from disk 12?
JorgeB Posted September 3, 2017

On 29/08/2017 at 10:03 PM, johnnie.black said:
- Tools -> New Config -> Retain current configuration: All -> Apply
- Re-assign old disk14 and a new disk for disk12
- Check both "parity is already valid" and "maintenance mode" before starting the array
- Start the array
- Stop the array, unassign disk12
- Start the array, check that emulated disk12 mounts and its contents look correct
- Stop the array, reassign new disk12
- Start the array to begin the rebuild

Repeat the procedure; it should still work if the read errors are solved.
Harro Posted September 6, 2017 Author

On 9/3/2017 at 11:09 AM, johnnie.black said:
Repeat the procedure; it should still work if the read errors are solved.

I have done the procedure once again. This time around disk 15 fell out of the array. I have replaced the PSU with a 750 W unit and replaced all the breakout cables with new ones. I took all disks except the parity and cache drives off the motherboard connections and am running them off the adapters. I pulled disk 15 out and am transferring its files to my backup array over USB.

I have four 8TB drives that came in yesterday, which I will preclear, but I am wondering how to go about introducing them into the broken array. Introduce one as the missing disk 15, let parity rebuild, and then continue down the line with each one? Or something else?
JorgeB Posted September 6, 2017

16 minutes ago, Harro said:
Introduce one as the missing disk 15, let parity rebuild, and then continue down the line with each one? Or something else?

You can only use the above procedure if there's a single bad disk. Either reconnect the original disk15 and try again (you need to find the reason disks keep dropping), or do a new config with the good disks, then mount the replaced disks outside the array and try to recover all possible data.
Harro Posted September 6, 2017 Author

8 minutes ago, johnnie.black said:
You can only use the above procedure if there's a single bad disk. Either reconnect the original disk15 and try again (you need to find the reason disks keep dropping), or do a new config with the good disks, then mount the replaced disks outside the array and try to recover all possible data.

At this point I am thinking of pulling another 2 disks out, shrinking the array, and running a parity sync on the shrunk array. I would transfer the files from the disks I pull out to my backup, then reintroduce those disks into my main server one by one with a clean format. Does that sound like a good game plan? If disk 15 goes offline again after this, the only thing left to look at is the backplane of the Norco 5x3 drive cage.
JorgeB Posted September 6, 2017

Sounds good to me.