New drives reporting errors (SOLVED)


kno


I have shrunk my Unraid setup by removing an old drive that had errors. Since I have added new, bigger hard drives, my storage needs are covered and I do not need to replace it with a new disk, so removing it seemed like the best option. I followed the instructions.

 

A new parity sync has just started, and after only a few minutes I already have errors on three drives (two new drives and one that is a couple of months old). Obviously I am concerned. Should I be, or is it normal for new drives to have minor issues in a sector or two? I have just recovered from drive issues with my old HDDs, which I solved with your good help. Now I am concerned again.

 

Also, is it safe to run SMART short and extended tests during parity rebuild?

 

Edit: Diagnostics deleted.

 

2017-09-14.jpg

Edited by kno
Link to comment
5 hours ago, johnnie.black said:

 

No, run an extended test on those disks.

 

Can I run an extended test on the drives while parity is being rebuilt? Is that safe?

 

The three drives in question are not the parity drive, but they are being read to write the parity.

Link to comment

All disks passed the extended test, so the issue is elsewhere. That is kind of expected, since it's rare to have multiple failures at the same time, but it's good to rule the disks out.

 

The problem is most likely cable related, but it could also be a bad/failing power supply, controller, etc. Start by checking/replacing all cables and then start a new parity sync.

Link to comment
  • 2 weeks later...

I was out travelling for a week, so I could not do any work on the Unraid server.


Yesterday evening I swapped the cables for the disks in question (1, 2 and 5) for new 6 Gb/s cables with locking latches (they came with a new Asus motherboard). The old cables were marked only "Serial ATA" with no bandwidth rating (the standard red/orange cables that came with the old motherboard), so they are probably rated for a lower bandwidth. The controllers in the Unraid server do not support 6 Gb/s, so the cables “should” not be the problem, but there might be issues, especially since it is the new drives that are giving me trouble.


I started a parity rebuild. No errors occurred during the first hours, whereas last time errors occurred within 30-45 minutes.
This morning one of the drives reported 128 errors, though probably not at the same position as last time. I have attached the diagnostics. The read errors are at 04:40 in the log.


I can see a lot of other errors as well. These are related to the drive that has been unassigned from the array. The drive is still plugged into the SATA controller. What do these errors mean?

 

Based on this new information, what do you think can be the problem now? What should be my new action?
-Abort parity rebuild (I guess the parity will still be corrupt due to read error on disk 5)?
-Move disk 5 to another SATA port?
-Change out PSU? I think I have another in storage. What wattage should the PSU ideally be for an 11-disk server?
-Change out the SATA controllers (this is probably an expensive option, so hopefully to be avoided)?
-Other suggestions?

 

Edit: Diagnostics deleted.

 

Edited by kno
Link to comment
2 hours ago, kno said:

Based on this new information, what do you think can be the problem now? What should be my new action?
-Abort parity rebuild (I guess the parity will still be corrupt due to read error on disk 5)?
-Move disk 5 to another SATA port?
-Change out PSU? I think I have another in storage. What wattage should the PSU ideally be for an 11-disk server?
-Change out the SATA controllers (this is probably an expensive option, so hopefully to be avoided)?
-Other suggestions?

 

-disconnect the unassigned bad disk

-move/swap disk5 to a port on another controller

-could be PSU, for 11 disks I'd say a 500/550W Corsair/Seasonic would be good

-an LSI controller to replace the Sil3124 would also be good, though in the previous run one of the affected disks was on the onboard controller, so it may not help with the errors; it would help with performance.

-there's a theory still under testing that 6TB WD Reds model WD60EFRX-68L0BN1 have a firmware issue that causes errors during heavy activity, these errors appear mainly on these disks but it can also cause errors on different model disks when one of these is in use.

 

 

Link to comment
1 hour ago, Benson said:

First, please enable SMART monitoring of counter "199"; then you can notice anything abnormal as soon as it happens.

PS: this counter can't be cleared.

 

I am not sure what that means. 199 is one of the SMART error logging messages right? Isn't this being logged already?

 

Under attributes I can see:

#    Attribute name        Flag    Value  Worst  Threshold  Type     Updated  Failed  Raw value
199  UDMA CRC error count  0x0032  200    200    000        Old age  Always   Never   521
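
For anyone who wants to pull the raw value out of a row like the one above with a script, here is a minimal Python sketch. It assumes the usual SMART attribute column order (ID, name, flag, value, worst, threshold, type, updated, failed, raw value). The function name `parse_attribute_row` is made up for illustration and is not part of Unraid or smartmontools.

```python
import re

# One attribute row, in the assumed column order:
# ID, name, flag, value, worst, threshold, type, updated, failed, raw value.
# The name and the type may contain spaces, so the hex flag and the known
# type strings are used as fixed points.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>.+?)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<threshold>\d+)\s+"
    r"(?P<type>Old[ _]age|Pre-fail)\s+(?P<updated>\S+)\s+"
    r"(?P<failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attribute_row(line):
    """Split one SMART attribute row into a dict; numeric fields become ints."""
    match = ROW.match(line)
    if match is None:
        raise ValueError("not a SMART attribute row: %r" % line)
    row = match.groupdict()
    for key in ("id", "value", "worst", "threshold", "raw"):
        row[key] = int(row[key])
    return row


row = parse_attribute_row(
    "199 UDMA CRC error count 0x0032 200 200 000 Old age Always Never 521"
)
print(row["name"], row["raw"])  # UDMA CRC error count 521
```

The raw value (521 here) is the number to watch. The normalized value/worst/threshold columns stay at 200/200/000 in this row even though errors have been counted.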

 

I opened the computer case and removed my old disk 4. I also noticed that one of the two black ground wires on the SATA power cable that goes to disk 5 was loose, so I exchanged the cable for a new one. I do not know if this could have caused the errors. I also moved disk 5 to a SATA port on another controller.

 

Now I am going to try a new parity rebuild.

Link to comment

He means you should add 199 to the monitored attributes:

 

Settings -> Disk Settings -> Global SMART Settings -> Default SMART attribute notifications:

 

If this attribute increases by 2 or more in a short period of time, it usually means there's a bad SATA cable. The counter never resets, so any old errors will still show up; you can acknowledge them on the dashboard.

Link to comment

Ok. 199 warning added.

 

From what I understand this is checking if data has been transferred correctly or if the data needs to be resent to the HDD?

 

I guess a few UDMA errors should be expected, right? This is not really a problem, as the data will be resent. However, if many errors occur, this indicates continuously bad transfers, so something is wrong with the transmission, such as cable bending/breakage, EMI/RFI, damage to the cable causing a reduction in bandwidth, etc. Correct?
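
To make the "check and resend" idea concrete, here is a small Python sketch of a checksum-protected transfer with retries. It is only an illustration: `zlib.crc32` stands in for the link-layer CRC, and the function names and the retry limit are made up, not how a SATA controller is actually implemented.

```python
import zlib


def send_frame(payload):
    """Sender side: return the payload together with its CRC."""
    return payload, zlib.crc32(payload)


def frame_is_intact(payload, crc):
    """Receiver side: recompute the CRC and compare it with the one sent."""
    return zlib.crc32(payload) == crc


def flip_bit(data, bit_index):
    """Simulate a transmission glitch by flipping a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def transfer(payload, corrupt_first_attempts=0, max_retries=3):
    """Send payload, resending on CRC mismatch.

    Returns (received payload, number of CRC errors counted).
    """
    crc_errors = 0
    for attempt in range(max_retries + 1):
        data, crc = send_frame(payload)
        if attempt < corrupt_first_attempts:
            data = flip_bit(data, 0)  # pretend the cable garbled this attempt
        if frame_is_intact(data, crc):
            return data, crc_errors
        crc_errors += 1  # analogous to the counter in attribute 199
    raise OSError("transfer failed after %d retries" % max_retries)


print(transfer(b"parity block"))
print(transfer(b"parity block", corrupt_first_attempts=2))
```

In the second call the data still arrives intact, but two errors are counted along the way. That matches the reasoning above: an occasional error costs only a resend, while a steadily climbing count points at the transmission path.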

Link to comment

Ok, very interesting.

 

I checked all my disks. All but three have 0 errors.

 

Disk 5: 521

Disk 7: 17

Disk 8: 16

 

This indicates that I have had some trouble earlier. I cannot explain disk 7 or 8. Disk 5 was the drive that had the most errors before I changed it last month. Since the current disk 5 is almost new, there must have been some serious problem with the cable or controller. I will keep an eye on these values.
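
Since the counter never resets, keeping an eye on it means comparing against these baseline values rather than against zero. A minimal Python sketch of that check, using the "increase of 2 or more" rule of thumb mentioned earlier in the thread; the function name and the later readings are hypothetical, made up for illustration.

```python
def crc_increases(baseline, current, threshold=2):
    """Return {disk: increase} for disks whose CRC count rose by >= threshold."""
    flagged = {}
    for disk, count in current.items():
        increase = count - baseline.get(disk, 0)
        if increase >= threshold:
            flagged[disk] = increase
    return flagged


# Baseline: the raw values of attribute 199 noted above.
baseline = {"disk5": 521, "disk7": 17, "disk8": 16}

# Hypothetical later reading, invented for this example.
later = {"disk5": 521, "disk7": 20, "disk8": 17}

print(crc_increases(baseline, later))  # {'disk7': 3}
```

With these made-up numbers only disk 7 would be flagged; disk 5 stays at 521, which would suggest the cable swap helped.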

 

I am trying to build new parity now. It will take a while.

Link to comment
