Announced death of an apparently healthy HDD



Not the first time this has happened to me and I'm sure it won't be the last. Just sharing so other users are aware that while SMART is usually good at predicting (or confirming) HDD issues, sometimes a healthy SMART report does not equal a healthy disk.

 

This disk was parity2 on one of my servers. Last week I noticed writes were a lot slower than usual, and eventually I even got a few read errors, but the disk wasn't disabled and SMART looked fine. I still suspected a disk problem, so I ran diskspeed and it confirmed there was a noticeable performance dip at the end of the disk. I then took the disk out and ran MHDD, my favorite tool to test a disk.

 

I made a little video for anyone interested. For those unfamiliar with MHDD, the important info is on the right side: ACT (current disk read speed), below that the individual sector delays (it's normal to have a few >50ms sectors, but many like that, or sectors >150ms, is not normal), and finally the percentage below that, which shows the current disk position.
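
For anyone who wants to tally delays in a similar way outside MHDD, here is a minimal Python sketch. The class edges are an assumption, loosely following the delay classes mentioned in this thread (>50ms, >150ms, >500ms), and `bucket_delays` is a hypothetical helper, not part of MHDD or any other tool.

```python
from bisect import bisect_right

# Upper edges (in ms) of the delay classes. These are assumed values,
# loosely following the classes discussed in this thread.
EDGES_MS = [3, 10, 50, 150, 500]
LABELS = ["<3ms", "<10ms", "<50ms", "<150ms", "<500ms", ">500ms"]


def bucket_delays(delays_ms):
    """Count how many block reads fall into each delay class.

    A delay that sits exactly on an edge is counted in the slower class.
    """
    counts = {label: 0 for label in LABELS}
    for delay in delays_ms:
        counts[LABELS[bisect_right(EDGES_MS, delay)]] += 1
    return counts


# Made-up delays, purely for illustration.
print(bucket_delays([1, 2, 8, 40, 60, 200, 900]))
```

On a healthy disk almost everything should land in the fastest classes; a growing count in the slower ones is the pattern described above.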

 

The video starts at the beginning of the disk, where speed is normal at ~150MB/s. I then skip to 50%, where speed is again normal for that position, about 125MB/s. Finally I skip to 90%, and here the problem is clearly visible: many slow sectors, with read speed dropping to <1MB/s in between zones with normal speed for that position.

 

So I preemptively replaced an apparently healthy disk with very few power-on hours that is certainly going to die soon.

 

Video:

https://www.dropbox.com/s/szg0arqlrbwypc3/slow_sectors.avi?dl=0

Model Family:     Western Digital Blue
Device Model:     WDC WD40EZRZ-00WN9B0
Serial Number:    WD-WCC4E0TN08K1

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       101
  3 Spin_Up_Time            0x0027   198   169   021    Pre-fail  Always       -       7058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       87
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       422
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   121   114   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
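
To make the "SMART looks fine" point concrete, here is a small Python sketch that parses attribute rows in the `smartctl -A` layout shown above and lists the usual media-related attributes with a non-zero raw value. The shortlist of watched IDs (5, 196, 197, 198) is an assumption for illustration, not an official rule, and the function names are hypothetical.

```python
# Attributes that commonly point at media trouble when their raw value is
# non-zero. This shortlist is an assumption made for this example.
WATCHED_IDS = {5, 196, 197, 198}


def parse_smart_table(text):
    """Parse smartctl-style attribute rows into {id: (name, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit check.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not a plain zero."""
    return {i: v for i, v in attrs.items() if i in WATCHED_IDS and v[1] != "0"}


# A few rows copied from the report above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       422
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""
print(nonzero_watched(parse_smart_table(sample)))  # prints {}
```

An empty result here is exactly the situation in this post: nothing reallocated or pending, yet the disk is clearly struggling.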

 

 

 

WDC_WD40EZRZ-00WN9B0_WD-WCC4E0TN08K1-20170409-1113.txt

slow_sectors.png

Edited by johnnie.black

Interesting! Thanks for posting!!

 

I'll have to look into that tool!

 

Thought about short stroking the disk? I used to do it back in the days when we had a 2TB limit on drive sizes and I wanted to buy 3TB drives because of their attractive price point (and expecting support to be added eventually).

 

Creating the HPA works best on a motherboard port, but once it's in place, any controller will respect it.

 

I have not been a WD fan since the 1TB green days. Those rocked, but the 2TB greens were not nearly as good. Since then the reviews have not been kind, yet WD seems to enjoy some type of premium status. The blacks have had a mystique not borne out in reliability. I've lost track of the endless color variety.

2 hours ago, bjp999 said:

Thought about short stroking the disk?

 

Never considered that. I fear the slow sectors will expand, but I may try it when they are only at the end and use the disk in my backup server.

 

1 hour ago, Spies said:

MHDD looks rather interesting,

 

It is, and it has other features, like bad sector remapping, trying to fix long delays, etc., but my favorite part is the surface scan, as it gives a great indication of a disk's real health, unlike an extended SMART test or the manufacturer tools.

 

It has one big problem: it's an older program, so it doesn't support AHCI or HBAs. It only works on onboard ports set to IDE (and master only, not slave), so it's not practical to use on the server; I have to remove the disk and run it on my test server. @jbartlett talked about trying to do something similar, and I hope he or anyone else can do it in the future.
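
As a rough idea of what an in-place scan could look like, here is a minimal Python sketch that times plain sequential reads through the OS. It is not how MHDD works and not anyone's actual tool; `scan_read_delays` and `slow_blocks` are hypothetical names, the 1MiB block size and 50ms threshold are arbitrary, and it assumes Linux. Reads go through the page cache, so timings are only meaningful on data that has not been read recently.

```python
import os
import time


def scan_read_delays(path, block_size=1024 * 1024, max_blocks=None):
    """Read `path` sequentially and return a list of (offset, delay_ms).

    Note: reads pass through the OS page cache, so a second run over the
    same data can look much faster than the disk really is.
    """
    results = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while max_blocks is None or len(results) < max_blocks:
            start = time.perf_counter()
            data = os.read(fd, block_size)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if not data:  # end of file or device
                break
            results.append((offset, elapsed_ms))
            offset += len(data)
    finally:
        os.close(fd)
    return results


def slow_blocks(results, threshold_ms=50.0):
    """Keep only the blocks that took longer than `threshold_ms` to read."""
    return [(off, ms) for off, ms in results if ms > threshold_ms]
```

Usage would be something like `slow_blocks(scan_read_delays("/dev/sdX", max_blocks=1000))`, where `/dev/sdX` is a placeholder and reading a raw device normally requires root.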

 

Based on my experience, I believe these slow sectors are not that uncommon, although usually much less severe than on the WD above. Below is another video from another disk I retired early. In this case it's not because I think it's going to fail immediately (though it's just a question of time before some of those slow sectors turn into bad sectors), but because it was affecting my turbo write speeds: turbo write can't be faster than your slowest disk at any position, and I was noticing some slowdowns to <50MB/s. The slow sectors start almost at the beginning of the disk and are spread out through the entire disk.

 

https://www.dropbox.com/s/bwrl4nz4hgjnu7r/slow_sectors_2.avi?dl=0
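
The "can't be faster than your slowest disk at any position" point can be sketched in a few lines of Python. The numbers are made up for illustration and `array_write_speed` is a hypothetical helper; it only models the lockstep limit, not anything else unRAID does.

```python
def array_write_speed(per_disk_speeds):
    """Effective speed (MB/s) at each sampled position.

    `per_disk_speeds` maps a disk name to its speeds sampled at the same
    relative positions. With all disks working in lockstep, the array can
    go no faster than the slowest disk at each position.
    """
    return [min(speeds) for speeds in zip(*per_disk_speeds.values())]


# Made-up numbers, purely for illustration.
speeds = {
    "parity": [150, 125, 80],
    "disk1": [140, 45, 75],  # one slow zone in the middle
    "disk2": [160, 130, 90],
}
print(array_write_speed(speeds))  # [140, 45, 75]
```

A single disk with a slow zone drags the whole array down at that position, even if every other disk is fine there.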

 

 

Edited by johnnie.black

WD40EZRZ?  Hmm... I got 4 of those about a year ago and used them for about a month.  2 died: totally dead one day, not a peep, no sign of life.  One of the others is noisy, and the last one has bad sectors.

 

I've RMA'ed 3 so far, just waiting on them coming back.

 

I think the EZRZ is a bit shit in all honesty.


On the same topic, I'm converting my last and oldest server to btrfs. This is now my backup server, so it uses my oldest disks, and I'm already used to replacing one or two of those Samsungs per year (I still have some more 2TB spares to go through before I start replacing them with new larger disks). I just got some read errors on disk1. SMART still looks OK, but I bet that when I replace it and run MHDD there will be a lot of slow sectors.

 

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1241
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   065   025    Pre-fail  Always       -       10271
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       649
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7290
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       477
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       13662262
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1530
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   059   000    Old_age   Always       -       36 (Min/Max 12/41)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1676
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       666

 

Screenshot 2017-04-10 13.34.45.png

  • 2 weeks later...

Finally got around to replacing disk1 on the server from the screenshot above. I had to replace disk12 at the same time because it redballed last week with a pending sector (yay for dual parity!). Disk1 gave me read errors twice, and as I suspected it's full of slow sectors.

 

SMART report still looks fine:


 

Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9GZB05057


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1782
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   065   025    Pre-fail  Always       -       10171
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       746
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7404
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       483
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       13662262
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1742
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   058   000    Old_age   Always       -       26 (Min/Max 12/42)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1677
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       764

 

But compare its full MHDD scan with that of a still healthy disk of the same model. Notice the differences in average speed, scan time and, above all, the number of sectors with delays >50ms. These are a bad sign, especially those with a delay >500ms; it's just a question of time before some turn into bad sectors.

 

These slow sectors were enough to give read errors and bad performance in the worst affected areas, but at the moment there are no bad sectors and the disk would pass an extended SMART test.

 

 

 

 

 

HD204UI_5057.png

HD204UI_9309.png


Unless you have changed something, 500msec should not be a read error. It is slow, but the timeout is measured in seconds and defaults to 30 seconds. This is the reason for TLER: the 30 seconds is wasted time when the array can rebuild from other drives far faster.

30 minutes ago, c3 said:

Unless you have changed something, 500msec should not be a read error. It is slow, but the timeout is measured in seconds and defaults to 30 seconds. This is the reason for TLER: the 30 seconds is wasted time when the array can rebuild from other drives far faster.

 

I didn't change anything, and notice that it's >500ms: those sectors took more than half a second to read, not exactly half a second. I'm not certain whether there was a read timeout, or whether there really was a read error at that time and unRAID successfully wrote the sector back so the disk wasn't disabled. Either way, I'm as certain as I can be that those read errors were caused by the disk, and it's not the first time this has happened with a disk in similar condition.


Yes, I did notice it said <500ms and >500ms, and 30,000ms is a long way from 500ms. TLER is 7,000ms.

 

If response time is being used to determine drive health, the environment needs to be carefully controlled. Disk drives do not report when they recalibrate for temperature, vibration, or even noise abatement, all of which are possible sources of a >500ms response time without being indicative of drive health. These are transient conditions, and repeating the test would show different results.

 

It is possible the sector was written near off-track and delivers persistently slow performance. It can be corrected by rewriting the sector. And there are products...

 

Or the sector could be difficult to read due to the media being scuffed, particles flying about, etc. Most of which lead to permanent drive failure.

 

Another direction to take is to issue additional reads. Disk drives fail, and the data must survive, so data is protected by additional writes. By issuing additional reads, the failure or performance degradation is mitigated. If the data is protected by single or double parity, a stripe read can be used to respond. Classically this was done serially: if the first read was slow or returned an error, then more reads were issued. Things like TLER were put in place to limit the performance impact. But the additional reads can be issued in parallel. Then the data is returned by the first of A) the block returned from the single target disk, or B) the block rebuilt as soon as enough of the stripe is returned. Either should be accomplished well below the performance threshold.
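
The parallel variant can be sketched with simulated disks in Python. This is only an illustration under stated assumptions: single XOR parity, "disks" that are just delayed coroutines, and hypothetical names throughout (`read_block`, `hedged_read`). It is not how unRAID or any particular product implements it.

```python
import asyncio
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


async def read_block(data, delay_s):
    """Stand-in for a disk read: returns `data` after `delay_s` seconds."""
    await asyncio.sleep(delay_s)
    return data


async def hedged_read(target, peers):
    """Return (source, block) from whichever path finishes first.

    `target` and each entry of `peers` are (data, delay_s) pairs; `peers`
    holds the other data blocks of the stripe plus the parity block.
    """
    async def direct():
        return "direct", await read_block(*target)

    async def rebuild():
        blocks = await asyncio.gather(*(read_block(*p) for p in peers))
        return "rebuilt", xor_blocks(blocks)

    tasks = [asyncio.create_task(direct()), asyncio.create_task(rebuild())]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:  # the slower path is no longer needed
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


d1, d2 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks([d1, d2])

# Target disk is slow (0.2s); the rest of the stripe answers in 10ms.
print(asyncio.run(hedged_read((d1, 0.2), [(d2, 0.01), (parity, 0.01)])))
```

With a slow target disk the rebuilt block wins; with a healthy target the direct read wins, and the cost is only the extra reads on the other disks.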

 

9 minutes ago, c3 said:

If response time is being used to determine drive health, the environment needs to be carefully controlled. Disk drives do not report when they recalibrate for temperature, vibration, or even noise abatement, all of which are possible sources of a >500ms response time without being indicative of drive health. These are transient conditions, and repeating the test would show different results.

 

I can repeat the scan 10 times and the results will be practically identical, and the same goes for disks with normal response times. Have you ever used that program? If you try it you'll see the results are very consistent.

27 minutes ago, c3 said:

Yes, I have used MHDD, and many others like it

 

Sorry, but from the way you talked about different results every time I assumed you had never used it, because the results are always repeatable.

 

In terms of predicting failure, I can only say that in my experience, when disks start to get slow sectors like those above, it's just a question of time before they fail. But like all predictions it's not always right, e.g., a disk can keep working with slow sectors for a long time, and most disks fail before slow sectors start showing.

4 hours ago, c3 said:

 

It is possible the sector was written near off-track and delivers persistently slow performance. It can be corrected by rewriting the sector. And there are products...

 

I also covered a possible cause for a non-fatal, persistently slow block, and products for remediation.

 

Your concern for both the reported errors and the measured performance is valid. If the manufacturer/source will replace it, that is a great plan.

Edited by c3