Disk in Error State - Next Steps Clarification

LFletcher · November 22, 2017

Hi,

Disk 7 has gone into an error state with a nice big red cross next to it.

I followed the steps in this section of the troubleshooting guide, https://wiki.lime-technology.com/Troubleshooting#What_do_I_do_if_I_get_a_red_X_next_to_a_hard_disk.3F and have the diagnostics from before and after the reboot (see attached).

From the syslog, this is when the issue occurred:

Nov 22 18:23:31 unraid kernel: sd 1:0:12:0: task abort: SUCCESS scmd(ffff8807e28d1080)
Nov 22 18:23:31 unraid kernel: sd 1:0:12:0: [sdn] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Nov 22 18:23:31 unraid kernel: sd 1:0:12:0: [sdn] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 a4 92 1b 98 00 00 00 08 00 00
Nov 22 18:23:31 unraid kernel: blk_update_request: I/O error, dev sdn, sector 2761038744
Nov 22 18:23:31 unraid kernel: md: disk7 read error, sector=2761038680
Nov 22 18:23:31 unraid kernel: md: disk7 read error, sector=5824529848
Nov 22 18:23:31 unraid kernel: md: disk7 read error, sector=5824529856
Nov 22 18:23:31 unraid kernel: md: disk7 read error, sector=5824529864
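As a cross-check on the log lines above, the CDB bytes can be decoded by hand. Opcode 0x88 is the SCSI READ(16) command, which carries a big-endian 8-byte LBA in bytes 2-9 and a 4-byte transfer length in bytes 10-13. The helper name `decode_read16_cdb` below is made up for illustration; this is only a rough sketch of the decoding, not anything unRAID itself runs.

```python
def decode_read16_cdb(cdb_hex: str) -> dict:
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    return {
        "lba": int.from_bytes(cdb[2:10], "big"),      # starting sector on the device
        "blocks": int.from_bytes(cdb[10:14], "big"),  # number of sectors requested
    }


cdb = decode_read16_cdb("88 00 00 00 00 00 a4 92 1b 98 00 00 00 08 00 00")
print(cdb)  # {'lba': 2761038744, 'blocks': 8}

# Difference between the device sector and the first md sector in the log.
print(cdb["lba"] - 2761038680)  # 64
```

The decoded LBA matches the sector in the blk_update_request line. The first md error is 64 sectors lower, which would be consistent with the data partition starting 64 sectors into the device, although the log itself does not say so.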

Looking at the smart info for disk 7 I can see that the Reallocated_Sector_Ct isn't great.

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       480

I ran a short SMART test on the drive after the reboot and it appeared to get stuck at 90%. The reallocated sector count increased to:

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       536
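To put a number on the change, the raw value can be pulled out of the two attribute rows quoted in this post. This is a small sketch that assumes the usual ten-column smartctl attribute layout, with RAW_VALUE as the tenth column; `raw_value` is just an illustrative helper name.

```python
def raw_value(attr_line: str) -> int:
    """Return the RAW_VALUE column (10th field) of a smartctl attribute row."""
    fields = attr_line.split()
    return int(fields[9])


before_line = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       480"
after_line = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       536"

print(raw_value(after_line) - raw_value(before_line))  # 56
```

So 56 more sectors were reallocated between the two readings.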

So at this stage I assume I better RMA the drive back to Seagate as if it's not dead yet it will be soon?

I have 2 questions with regards to replacing the drive as this will be the first time I've had to do it with unRaid and I don't want to do anything stupid and lose any data.

I have no idea if anything was copying specifically to this drive at the time of the failure, but I was moving stuff off my cache into the main array. How would I know if any of the data I was copying at the time has become corrupt - or more to the point, how does unRaid deal with a write failure?

Looking at the re-enable a drive section (https://wiki.lime-technology.com/Troubleshooting#Re-enable_the_drive) seems to indicate the data went to an emulated drive, so should be ok and I won't have to hunt for the file(s) which may now be corrupt - is that assumption correct?

Having read the Replacing a Data Drive section (https://wiki.lime-technology.com/Replacing_a_Data_Drive), is the following procedure correct for replacing the bad drive?

1. Stop the array
2. Unassign the old drive if still assigned (to unassign, set it to No Device)
3. Power down
4. [ Optional ] Pull the old drive (you may want to leave it installed for Preclearing or testing)
5. Install the new drive
6. Power on
7. Assign the new drive in the slot of the old drive
8. Go to the Main -> Array Operation section
9. Put a check in the Yes, I'm sure checkbox (next to the information indicating the drive will be rebuilt), and click the Start button

Does the checkbox mentioned in step 9 appear once you have unassigned the old drive and assigned the new one? I ask because this option is currently available with the array rebooted and stopped.

Thanks for any help, it's very much appreciated.

unraid-diagnostics-20171122-2037.zip

unraid-diagnostics-20171122-2209.zip

JorgeB · November 22, 2017

LFletcher said:

So at this stage I assume I better RMA the drive back to Seagate as if it's not dead yet it will be soon?

I would replace it ASAP.

LFletcher said:

I have no idea if anything was copying specifically to this drive at the time of the failure, but I was moving stuff off my cache into the main array. How would I know if any of the data I was copying at the time has become corrupt - or more to the point, how does unRaid deal with a write failure?

Judging by the way it happened, it started with a read error. When this happens, unRAID uses parity plus all the other data disks to calculate what should be there and tries to write the data back to those sectors, so it doesn't look like it happened while writing new files to that disk. If it did, you'd need to have checksums or be using btrfs to check for corruption. unRAID does its best to start writing to the emulated disk without losing anything, but depending on how a disk fails, corruption is possible in the file being written at the moment it switches to the emulated disk.
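To illustrate the "parity plus all other data disks" calculation, here is a toy sketch assuming plain single XOR parity. The block contents and the `xor_blocks` helper are invented for the example; it is a simplified model, not unRAID's actual driver code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data disks, two bytes each, and their parity block.
data = [b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

# Pretend disk index 1 returned a read error: rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

The same calculation is what lets a missing disk be emulated: every read of that disk is answered by XORing parity with the corresponding sectors of all the other disks.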

LFletcher said:

is the following procedure correct for replacing the bad drive;

Yes, but you can skip step 2.

LFletcher · November 23, 2017

Thanks for the response.

So in summary some or no data on disk 7 may or may not now be corrupt once I restore it back to a new disk?

And I should also check the new data which was copying off the cache when the issue occurred, just in case.

Are there any tools which would assist with checking the files?

In the past I've used mediainfo as that won't show container info if the file is corrupt.

I assume I could create a disk share once the restore is complete and just scan that?

JorgeB · November 23, 2017

2 minutes ago, LFletcher said:

So in summary some or no data on disk 7 may or may not now be corrupt once I restore it back to a new disk?

Most likely there's no corrupt data, but without checksums or a btrfs filesystem there's no way of knowing for sure.

3 minutes ago, LFletcher said:

Are there any tools which would assist with checking the files?

For xfs disks you can use the Dynamix File Integrity plugin.
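The general idea behind checksum-based checking can be sketched in a few lines. This is not how the plugin is implemented, just a generic illustration: record a hash per file while the data is known to be good, then compare against a fresh scan later. The function names and the /mnt/disk7 path in the note below are assumptions for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """Return files present in both manifests whose digests differ."""
    return sorted(name for name in old if name in new and old[name] != new[name])
```

For example, `build_manifest("/mnt/disk7")` could be saved before a problem and compared with a later scan via `changed_files(saved, current)`. A manifest only helps if it was created before the failure, which is why this can't answer the question for files that were never hashed.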
