JorgeB Posted April 9, 2017 (edited)
This isn't the first time this has happened to me and I'm sure it won't be the last. I'm just sharing so other users are aware that while SMART is usually good at predicting (or confirming) HDD issues, sometimes a healthy SMART report does not equal a healthy disk.

This disk was parity2 on one of my servers. Last week I noticed writes were a lot slower than usual, and eventually I even got a few read errors, but the disk wasn't disabled and SMART looked fine. I still suspected a disk problem, so I ran diskspeed, and it confirmed there was a noticeable performance dip near the end of the disk. I then took the disk out and ran MHDD, my favorite tool for testing a disk.

I made a little video for anyone interested. For those unfamiliar with MHDD, the important info is on the right side: ACT (current disk read speed), below that the individual sector delays (it's normal to have a few >50ms sectors, but many like that, or sectors >150ms, are not normal), and finally the percentage below that, which shows the current disk position.

The video starts at the beginning of the disk, where speed is normal at ~150MB/s. I then skip to 50%, where speed is again normal for that position, about 125MB/s. Finally I skip to 90%, and here the problem is clearly visible: many slow sectors, with read speed dropping to <1MB/s in between zones with normal speed for that position.

So I preemptively replaced an apparently healthy disk with very few power-on hours that is certainly going to die soon.
Video: https://www.dropbox.com/s/szg0arqlrbwypc3/slow_sectors.avi?dl=0

Model Family:     Western Digital Blue
Device Model:     WDC WD40EZRZ-00WN9B0
Serial Number:    WD-WCC4E0TN08K1

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       101
  3 Spin_Up_Time            0x0027   198   169   021    Pre-fail  Always       -       7058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       87
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       422
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   121   114   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

WDC_WD40EZRZ-00WN9B0_WD-WCC4E0TN08K1-20170409-1113.txt

Edited April 9, 2017 by johnnie.black
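The rule of thumb above (a few >50ms sectors are normal, many of them or anything >150ms is not) can be sketched in a few lines of Python. This is only an illustration: the 50/150/500ms boundaries are the ones mentioned in this thread, while the function names and the `max_few` cut-off for what counts as "a few" are assumptions, and the delays are assumed to be per-block read times in milliseconds collected by whatever tool you use.

```python
from collections import Counter

# Bucket upper bounds in milliseconds, taken from the thresholds
# discussed in the thread (not necessarily MHDD's exact buckets).
BUCKETS_MS = (50, 150, 500)
LABELS = [f"<{b}ms" for b in BUCKETS_MS] + [f">={BUCKETS_MS[-1]}ms"]


def bucket_label(delay_ms):
    """Return the label of the bucket a single block delay falls into."""
    for bound in BUCKETS_MS:
        if delay_ms < bound:
            return f"<{bound}ms"
    return LABELS[-1]


def summarize_delays(delays_ms):
    """Count how many block delays fall into each bucket."""
    counts = Counter(bucket_label(d) for d in delays_ms)
    return {label: counts.get(label, 0) for label in LABELS}


def looks_suspicious(summary, max_few=10):
    """Apply the rule of thumb: any block at 150ms or more, or more than
    `max_few` blocks in the 50-150ms range, is treated as a warning sign.
    `max_few` is an arbitrary assumption, not a value from the thread."""
    very_slow = summary["<500ms"] + summary[">=500ms"]
    return very_slow > 0 or summary["<150ms"] > max_few
```

For example, `summarize_delays([3, 10, 60, 200, 700])` puts two blocks in the fastest bucket and one in each of the others, and `looks_suspicious` flags that summary because of the 200ms and 700ms blocks.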
SSD Posted April 9, 2017
Interesting! Thanks for posting!! I'll have to look into that tool!

Thought about short stroking the disk? I used to do it back in the days when we had a 2TB limit on drive sizes, and I wanted to buy 3TB drives due to their attractive price point (and expecting support to be added eventually). Creating the HPA works best on a motherboard port, but once in place, any controller will respect it.

I have not been a WD fan since the 1TB green days. They rocked. But the 2TB greens were not nearly as good. And since then the reviews have not been kind, yet they seem to enjoy some type of premium status. The blacks have had some mystique not borne out in reliability. Lost track of the endless color variety.
Spies Posted April 9, 2017
MHDD looks rather interesting. I tend to use SeaTools to check customer drives, but it doesn't give any information besides passed or failed.
JorgeB Posted April 9, 2017 Author (edited)
2 hours ago, bjp999 said: "Thought about short stroking the disk?"

Never considered that. I fear that the slow sectors will expand, but I may try it when they are only at the end, and use the disk in my backup server.

1 hour ago, Spies said: "MHDD looks rather interesting,"

It is, and it has other features, like bad sector remap, trying to fix long delays, etc., but my favorite part is the surface scan, as it gives a great indication of a disk's real health, unlike the extended SMART test or the manufacturer tools. It has one big problem: it's an older program, so it doesn't support AHCI or HBAs. It only works on onboard ports set to IDE (and master only, not slave), so it's not practical to use on the server; I have to remove the disk and run it on my test server. @jbartlett talked about trying to do something similar, hope he or anyone else can do it in the future.

Based on my experience, I believe these slow sectors are not that uncommon, although usually much less severe than on the WD above. Below is another video from another disk I retired early, in this case not because I think it's going to fail immediately (though it's just a question of time for some of those slow sectors to turn into bad sectors), but because it was affecting my turbo write speeds, since turbo write can't be faster than your slowest disk at any position, and I was noticing some slowdowns to <50MB/s. Slow sectors start almost at the beginning of the disk and are spread out through the entire disk.

https://www.dropbox.com/s/bwrl4nz4hgjnu7r/slow_sectors_2.avi?dl=0

Edited April 9, 2017 by johnnie.black
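The point about turbo write being limited by the slowest disk at any position can be illustrated with a small sketch. It assumes you already have read-speed samples (MB/s) for each disk taken at the same relative positions; the function name and the numbers are made up for the example and are not measurements from the disks in this thread.

```python
def array_speed_profile(disk_profiles):
    """Given one list of speeds (MB/s) per disk, all sampled at the same
    relative positions, return the best speed the whole array can reach
    at each position, i.e. the speed of the slowest disk there."""
    return [min(speeds) for speeds in zip(*disk_profiles)]


# Hypothetical profiles sampled at 0%, 50% and 90% of the disk.
healthy_disk = [150, 125, 90]
disk_with_slow_sectors = [150, 40, 85]

bottleneck = array_speed_profile([healthy_disk, disk_with_slow_sectors])
```

Here `bottleneck` comes out as `[150, 40, 85]`: a single disk with a slow zone at the 50% mark drags the whole array down at that position, even though every other disk is fine.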
Spies Posted April 9, 2017
We have plenty of old crappy machines in our stock to run it in native IDE mode. It will be more of a sanity check when SeaTools passes a drive but the system feels very slow.
HellDiverUK Posted April 9, 2017
WD40EZRZ? Hmm... I got 4 of those about a year ago. I used them for about a month. 2 died. Totally dead one day, not a peep, no sign of life. One of the others is noisy, and the last one has bad sectors. I've RMA'ed 3 so far, just waiting on them coming back. I think the EZRZ is a bit shit in all honesty.
JorgeB Posted April 10, 2017 Author
On the same topic, I'm converting my last and oldest server to btrfs. This is now my backup server, so it uses my oldest disks, and I'm already used to replacing one or two of those Samsungs per year (I still have some more 2TB spares to go through before I start replacing them with new larger disks). I just got some read errors on disk1. SMART still looks OK, but I bet that when I replace it and run MHDD there will be a lot of slow sectors.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1241
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   065   025    Pre-fail  Always       -       10271
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       649
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7290
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       477
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       13662262
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1530
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   059   000    Old_age   Always       -       36 (Min/Max 12/41)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1676
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       666
jbartlett Posted April 11, 2017
This topic is fascinating. I still plan on working on the disk scanning/mapping tool, but I've been working on another project not related to UNRAID.
JorgeB Posted April 25, 2017 Author
Finally got around to replacing disk1 on the server from the screenshot above. I had to replace disk12 at the same time because it redballed last week with a pending sector (yay for dual parity!). Disk1 gave me read errors twice and, as I suspected, it's full of slow sectors. The SMART report still looks fine:

Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9GZB05057

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1782
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   065   025    Pre-fail  Always       -       10171
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       746
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7404
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       483
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       13662262
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1742
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   058   000    Old_age   Always       -       26 (Min/Max 12/42)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1677
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       764

But compare a full MHDD scan with that of a still healthy disk of the same model. Notice the differences in the average speed, the time taken for the scan and, mostly, the quantity of sectors with delays >50ms. These are a bad sign, especially those with a delay >500ms; it's just a question of time for some to turn into bad sectors. These slow sectors were enough to give read errors and bad performance in the worst affected areas, but ATM there are no bad sectors and the disk would pass an extended SMART test.
c3 Posted April 26, 2017
Unless you have changed something, 500msec should not be a read error. It is slow, but the timeout is measured in seconds and defaults to 30 seconds. This is the reason for TLER: the 30 seconds is wasted time when the array can rebuild from other drives far faster.
JorgeB Posted April 26, 2017 Author
30 minutes ago, c3 said: "Unless you have changed something, 500msec should not be a read error. It is slow, but the timeout is measured in seconds and defaults to 30 seconds. This is the reason for TLER: the 30 seconds is wasted time when the array can rebuild from other drives far faster."

I didn't change anything, and notice that it's >500ms. That means those sectors took more than half a second to read, not exactly half a second. I'm not certain whether there was a read timeout, or whether at that time there really was a read error but unRAID successfully wrote that sector back so the disk wasn't disabled. Still, I'm as certain as I can be that those read errors were caused by the disk, and it's not the first time this has happened with a disk in similar condition.
c3 Posted April 27, 2017
Yes, I did notice it said <500ms and >500ms, and 30,000ms is a long way from 500ms. TLER is 7,000ms.

If response time is being used to determine drive health, the environment needs to be carefully controlled. Disk drives do not report when they re-calibrate for temperature, vibration, or even noise abatement. These are possible sources for the >500ms response time without being indicative of drive health. They are transient conditions. Repeating the test would show different results.

It is possible the sector was written near off track, and so delivers persistently slow performance. It can be corrected by rewriting the sector. And there are products... Or the sector could be difficult to read due to the media being scuffed, particles flying about, etc., most of which lead to permanent drive failure.

Another direction to take is to issue additional reads. Disk drives fail, and the data must survive. So data is protected by additional writes. By issuing additional reads, the failure or performance degradation is mitigated. If the data is protected by single or double parity, a stripe read can be used to respond. Classically this was done serially: if the first read was slow or an error, then more reads were issued. Things like TLER were put in place to limit the performance impact. But the additional reads can be issued in parallel. Then the data is returned by the first of A) the block returned from the single target disk, or B) the block rebuilt as soon as enough of the stripe is returned. Either of which should be accomplished well below the performance threshold.
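The parallel-read idea described above can be sketched roughly as follows. This is a toy illustration, not how unRAID or any particular product does it: it assumes single (XOR) parity, and `read_with_fallback`, `read_target` and `read_peers` are made-up names, with plain callables standing in for real disk reads.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_fallback(read_target, read_peers):
    """Issue the target read and all peer reads in parallel.

    read_target: callable returning the wanted block from its own disk.
    read_peers:  callables returning the other data blocks of the stripe
                 plus the parity block.
    Returns (block, source), where source is "direct" if the target disk
    answered first, or "rebuilt" if the stripe was complete first.
    """
    pool = ThreadPoolExecutor(max_workers=len(read_peers) + 1)
    try:
        target = pool.submit(read_target)
        peers = [pool.submit(reader) for reader in read_peers]
        pending = {target, *peers}
        while pending:
            _, pending = wait(pending, return_when=FIRST_COMPLETED)
            # Case A: the target disk returned the block itself.
            if target.done() and target.exception() is None:
                return target.result(), "direct"
            # Case B: every peer returned, so the block can be rebuilt.
            if all(p.done() and p.exception() is None for p in peers):
                return xor_blocks([p.result() for p in peers]), "rebuilt"
        raise OSError("target read failed and stripe could not be rebuilt")
    finally:
        # Do not wait for a slow read that is no longer needed.
        pool.shutdown(wait=False)
```

With a target reader that sleeps for half a second and peers that answer immediately, the call returns the rebuilt block without waiting for the slow disk; with a fast target, the direct read wins. A real implementation would also have to deal with cancelling or draining the outstanding reads, which this sketch ignores.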
JorgeB Posted April 27, 2017 Author
9 minutes ago, c3 said: "If response time is being used to determine drive health, the environment needs to be carefully controlled. Disk drives do not report when they re-calibrate for temperature, vibration, or even noise abatement. These are possible sources for the >500ms response time without being indicative of drive health. They are transient conditions. Repeating the test would show different results."

I can repeat the scan 10 times and the results will be practically identical, same for the disks with normal response times. Have you ever used that program? If you try it you'll see the results are very consistent.
c3 Posted April 27, 2017
Yes, I have used MHDD, and many others like it. I just don't spend a lot of time trying to predict failure or performance of single components. I work to achieve data durability and performance under all conditions at scale.
JorgeB Posted April 27, 2017 Author
27 minutes ago, c3 said: "Yes, I have used MHDD, and many others like it"

Sorry, but from the way you talked of different results every time, I assumed you had never used it, because the results are always repeatable.

In terms of predicting failure, I can only say that in my experience, when disks start to get slow sectors like those above, it's just a question of time before they fail. But like all predictions it's not always right, i.e., a disk can keep working with slow sectors for a long time, and most disks fail before slow sectors start showing.
c3 Posted April 27, 2017 (edited)
4 hours ago, c3 said: "It is possible the sector was written near off track, and so delivers persistently slow performance. It can be corrected by rewriting the sector. And there are products..."

I did also cover a possible cause for a non-fatal, persistently slow block, and products for remediation. Your concern for both the reported errors and the measured performance is valid. If the manufacturer/source will replace it, that is a great plan.

Edited April 27, 2017 by c3
JorgeB Posted May 6, 2017 Author
Another one. I noticed the parity check taking much longer than it should, with speed dropping to 10MB/s several times.