Disk with errors (but green) during parity rebuild


23 minutes ago, steve1977 said:

So, need to switch to a new disk?

 

Yes, if it fails the extended SMART test.

 

24 minutes ago, steve1977 said:

Is my array still safe as long as the currently ongoing parity build completes? As mentioned, the drive still shows "green" and the parity rebuild is underway.

 

No, because as I mentioned earlier, parity is not valid.

Link to comment

Note that a disk reporting one offline uncorrectable sector isn't necessarily toast.

 

But it does mean that, statistically, there is an increased probability that the drive will develop more errors - or fail completely - within a limited time span. Some disks simply get a bad sector because of a surface defect that wasn't noticed during the original factory scan. The danger is that the problem isn't just a tiny spot but a larger bad surface area, or an issue with the head or other parts of the drive, in which case the drive is dangerous to continue to use.

 

It also means there is one sector that can't be read out correctly because the error correction code (ECC) for that sector isn't enough to correct the bit errors. If you already know the contents of that sector and overwrite it, the disk can use a spare sector to store the correct data, giving your array a full set of disks with all correct data again.

 

As johnnie.black notes, you most definitely do not want to rebuild your parity at this stage, since the current parity is one way to recompute the contents that should have been stored in the offline uncorrectable sector (unless you happen to have a backup of the file that uses this specific disk sector).
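
To make the reasoning concrete, here is a toy sketch assuming single XOR parity, with one byte standing in for the same sector on each of three data disks. The values and variable names are made up for illustration, and the 0 used for the unreadable sector is only a stand-in for whatever ends up in the parity calculation after a failed read.

```shell
#!/bin/sh
# Toy model: one byte stands in for the same sector on each data disk.
d1=$((0x3C)); d2=$((0xA5)); d3=$((0x5F))   # d2 plays the disk with the bad sector
parity=$((d1 ^ d2 ^ d3))                    # parity computed while d2 was still readable

# d2 becomes unreadable: the old parity plus the other disks give it back.
recovered=$((parity ^ d1 ^ d3))
echo "d2 recovered from old parity: $recovered (original: $d2)"

# A parity sync done now has no valid data to read for d2.
# ASSUMPTION: 0 stands in for whatever value enters the calculation instead.
new_parity=$((d1 ^ 0 ^ d3))
rebuilt=$((new_parity ^ d1 ^ d3))
echo "d2 rebuilt from new parity:   $rebuilt (original: $d2)"
```

The first result matches the original byte. The second does not, which is the reason for keeping the old parity until the sector has been dealt with.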

 

Anyway - after an extended SMART test, the disk will be able to tell which sector it found the first error on. The test may also find further bad sectors, which would increase the count.

Link to comment

The result was as expected - the offline uncorrectable sector will stay uncorrectable - only a direct write to that address has a chance to clear the error counter.

 

You did get to know that the drive didn't find any more errors over the first 60% of the surface - and you got the address of that uncorrectable sector - LBA 91525368.

 

I would recommend running a selective test that starts from the next sector and scans the rest of the drive, to see if more errors show up.

 

If you connect using ssh, you can run smartctl and specify

smartctl -t select,91525369-max /dev/<drive>

and the drive will continue from the first sector after the error to the end of the drive.
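
A small sketch of the same step, assuming the failing LBA is the 91525368 reported above. It only computes the span; /dev/sdX is a placeholder, so the smartctl lines are left commented out until the real device name is filled in.

```shell
#!/bin/sh
# LBA of the uncorrectable sector reported by the extended self-test.
bad_lba=91525368
# Start one sector past it so the scan does not stop on the known error again.
start_lba=$((bad_lba + 1))
span="${start_lba}-max"
echo "selective span: $span"

# Placeholder device name: replace sdX before uncommenting.
# smartctl -t select,"$span" /dev/sdX   # start the selective self-test
# smartctl -l selective /dev/sdX        # status of the selective span
# smartctl -l selftest /dev/sdX         # self-test log, including LBA of first error
```

If the selective test stops on another error, the same arithmetic can be repeated with the newly reported LBA.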

Link to comment
3 minutes ago, steve1977 said:

Would this give me better result than just replacing the disk?

 

If parity has finished syncing, replacing the disk is also an option, but unless you have checksums (or the disk is btrfs), the rebuilt disk will have some corrupt file(s) and you'll have no way of knowing which ones.

 

By copying/moving the data manually you'll know which files need to be restored.

Link to comment

Copying from the corrupt disk shouldn't be problematic. If you use rsync, for example, it will continue with the other files after a read error. And unless the last 40% of the drive has more errors, only one single file will fail to copy. Another advantage of rsync is that it is well suited to restarting the copy if it gets interrupted for some reason.
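
A minimal sketch of such a copy. The paths are hypothetical (an array disk mounted at /mnt/disk3 and an unassigned disk at /mnt/disks/spare), and copy_disk is just a name made up for this example, so adjust all of them to your system.

```shell
#!/bin/sh
# HYPOTHETICAL paths: SRC is the failing array disk, DST the unassigned disk.
SRC=${SRC:-/mnt/disk3/}
DST=${DST:-/mnt/disks/spare/}
LOG=${LOG:-/tmp/rsync-disk-copy.log}

copy_disk() {
    # -a preserves permissions, ownership and timestamps.
    # --log-file records every transferred file and every error.
    # After a read error rsync carries on with the remaining files and
    # typically exits with status 23 (partial transfer) at the end.
    rsync -a --log-file="$LOG" "$SRC" "$DST"
}
```

Calling copy_disk and then searching the log for error lines should show which file(s) could not be read. Running it a second time only transfers what is still missing or different, which is what makes it easy to resume after an interruption.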

 

It is normally also possible to look up which file is using the specific LBA that the SMART test indicated. Exactly how to do that depends on the file system in use. If this is a file you have a backup of or do not care about, then you can overwrite the file and have a large probability of zeroing the unrecoverable sector count.
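
Whatever the file system, the lookup usually starts with the same arithmetic: turning the whole-disk LBA into an offset inside the partition. A rough sketch follows, where the partition start, sector size and block size are all assumptions that need to be checked against the real disk (for example with fdisk -l) before the numbers mean anything.

```shell
#!/bin/sh
lba=91525368       # whole-disk LBA from the SMART self-test log
part_start=64      # ASSUMPTION: first sector of the data partition
sector_size=512    # ASSUMPTION: logical sector size in bytes
block_size=4096    # ASSUMPTION: file system block size in bytes

# Position of the bad sector relative to the start of the partition.
rel_sector=$((lba - part_start))
byte_offset=$((rel_sector * sector_size))
# File system block containing that byte (integer division rounds down).
fs_block=$((byte_offset / block_size))
echo "partition sector $rel_sector, byte offset $byte_offset, fs block $fs_block"
```

The step from that offset to a file name is file system specific and is not shown here.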

 

If you have read out all the data you can recover from the problematic disk, then you could also have unRAID restore the damaged file by replacing the problematic disk with a new disk and letting unRAID recompute the content from parity and the other data disks.

 

The main thing is that you want to keep as much redundancy as possible for as long as possible. Rebuilding the parity now would make the parity computed based on failed sector(s) of the problematic disk. And replacing the disk will leave you vulnerable to other issues for the full time until unRAID has recovered the complete content using the parity data.

Link to comment

Let me clarify - my parity rebuild has completed (the SMART test was run after the rebuild completed). Actually, the error only occurred during the rebuild.

 

If I were to move the files to a UD, this would take quite a few hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back. Also, how is rsync different from "mv -r"?

 

I am not fully clear on the exact suggested next steps. Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity.

Link to comment
1 minute ago, steve1977 said:

Let me clarify - my parity rebuild has completed (the SMART test was run after the rebuild completed). Actually, the error only occurred during the rebuild.

And as I already mentioned, this is why the current parity is not 100% valid.

 

3 minutes ago, steve1977 said:

If I were to move the files to a UD, this would take quite a few hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back

Not quite following here.

 

4 minutes ago, steve1977 said:

Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity.

 

AFAIK xfs_repair has no option to scan the complete filesystem, and even if it could identify the files that sit on the bad sectors, you couldn't rebuild them from parity because your current parity is not valid.

Link to comment

I am still not 100% sure whether I fully understand, but let me follow the next steps:

 

* I can add an additional 6TB disk as unassigned disk (UD)

 

* I can then move all files from the "corrupt" disk to the UD ("mv -r")

 

* I can then pull the corrupt disk and put the UD into the old "corrupt" slot

 

* Reconfigure and rebuild parity

 

 

Makes sense?

Link to comment
