SSD Posted May 29, 2017
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
I don't like these. I have seen them accompanying other problems and see them as a sign of trouble, especially if they are new.
When a pending sector gets written to, the drive will test to see if the write worked or not. If it did, it will unmark the sector as pending and allow it to stay in service. I think this is an incredibly poor decision on the part of firmware makers. A freshly written sector may appear fine right after writing, but if that spot is weak, a day/week/month later it can be unreadable again. If the drive detects a sector as bad, it should not f@# around with it when it has the opportunity to remap. In fact, it should try to reallocate the sectors around it as well. I believe we'd start to see the reallocated sectors stop increasing if drives were more proactive in taking sectors out of service. Heck, it could reallocate a few tracks. My theory is that continuing to mess with a bad sector is like picking a scab, and the problem gets bigger and bigger.
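For anyone who wants to watch this attribute from a script, here is a minimal Python sketch that splits one attribute row into named fields. It assumes the ten-column layout shown in the post above; the names `SmartAttribute` and `parse_attribute_row` are made up for illustration and are not part of any tool.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    """One row of the attribute table (hypothetical container for this sketch)."""
    attr_id: int
    name: str
    flag: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw_value: str


def parse_attribute_row(line: str) -> SmartAttribute:
    # The first nine columns contain no spaces; the raw value may,
    # so split at most nine times and keep the remainder whole.
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"expected 10 columns, got {len(parts)}: {line!r}")
    ident, name, flag, value, worst, thresh, kind, updated, failed, raw = parts
    return SmartAttribute(
        int(ident), name, flag, int(value), int(worst), int(thresh),
        kind, updated, failed, raw.strip(),
    )


# The row quoted in the post above.
row = parse_attribute_row(
    "200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1"
)
print(row.name, "raw value:", row.raw_value)
```

Comparing `raw_value` between two saved reports is one simple way to notice when the count is new, which is the case the post flags as most worrying.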
PeteB (Author) Posted May 30, 2017
The parity check finished with 0 errors and the error count against all drives is 0, so it looks like reads are OK. The SMART counters are as follows:
Offline Uncorrectable: 1
Current Pending Sector count: 0
Reallocated Sector count: 0
To me the counters seem to contradict each other. I'd expect the Reallocated Sector Count to be 1.
With two clean parity checks now, I think it's reasonable to assume that the array is OK. As there are no further errors and two parity checks have completed successfully, I'm leaning towards doing the replacement in two steps, i.e.: replace/rebuild parity (run parity check) and then replace drive 6, rather than do a parity swap. The reason for this is that if I put the new parity drive in place of the drive to be removed, it's not on the motherboard but rather on my LSI HBA, and I'd prefer it be connected to the motherboard port.
I'm a bit concerned about shuffling drives around into different physical positions (i.e.: new parity into the old parity's physical position and old parity into the removed drive's physical position) and then performing the parity swap. Whilst my reading seems to indicate this is OK, I wonder whether it will lead to some additional risk that I can avoid.
Thoughts?
garycase Posted May 30, 2017
53 minutes ago, PeteB said:
"To me the counters seem to contradict each other. I'd expect the Reallocated Sector Count should be 1."
No, as Johnnie noted, during the first parity check UnRAID attempted to re-write the sector -- and eventually succeeded, so the pending counter was reset for that sector, and no reallocation was done. Then you did another check, and everything read just fine, so there were no errors seen.
Nevertheless, based on the failed SMART test -- and the likelihood that that sector could indeed be "weak" (as bjp noted) and fail again -- I'd definitely replace the drive (as you are doing).
garycase Posted May 30, 2017
... Note that had the UnRAID re-write of that sector failed, THEN it would have been reallocated ... and your parity checks would still have been just fine, since UnRAID was able to read the data from the other drives and re-write the sector (whether or not it was to a reallocated location). The "clue" as to exactly what happened is that the reallocated count didn't increment.
PeteB (Author) Posted May 30, 2017
Thanks for the response, garycase. I guess my point is that when the problem re-occurred the offline uncorrectable count went from 0 to 1 and the pending sector count went from 0 to 1. I see your point that the sector was recovered and written, as the pending sector count went to 0 and the reallocated sector count stayed on 0.
I guess where the counters seem contradictory is that the offline uncorrectable count stayed at 1. I understand this means that the sector had been re-allocated from the spare area and so writes should have gone to this area. Maybe this counter is telling us that there WAS an uncorrectable sector but since it has been recovered the reallocated count stayed at 0?
Appreciate your response as it's helping me heaps with my understanding.
garycase Posted May 30, 2017
I'm not certain, but I believe the offline uncorrectable effectively means "I couldn't correct the data from a read via ECC" => this would be the "pending sector". But UnRAID "corrected" it by re-writing it with data regenerated from the other disks in the array ... and when it was (finally, after several tries) successfully written, then the pending flag was cleared, and it was not reallocated. But the status showing that you had the issue (the "offline uncorrectable") remains, so you know it happened.
That's at least consistent with what I've seen a few times -- whether that's the correct interpretation I simply don't know for sure.
In any event, since you're replacing the drive, it's somewhat moot. (And I would ALWAYS replace any drive that fails SMART.)
JorgeB Posted May 30, 2017
1 hour ago, garycase said:
I'm not certain, but I believe the offline uncorrectable effectively means "I couldn't correct the data from a read via ECC" => this would be the "pending sector". But UnRAID "corrected" it by re-writing it with data regenerated from the other disks in the array ... and when it was (finally, after several tries) successfully written, then the pending flag was cleared, and it was not reallocated. But the status showing that you had the issue (the "offline uncorrectable") remains, so you know it happened.
That's also what I think happened.
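The interpretation in the posts above can be written down as a toy model. This is only a sketch of what the thread proposes, not how any particular firmware is documented to behave: it assumes a failed read raises both the pending and offline uncorrectable counts, that a write to the sector clears the pending flag, that only a failed write triggers a reallocation, and that the offline uncorrectable count is not reduced afterwards. The class and method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SectorCounters:
    """Toy model of three SMART counters, per the interpretation in this thread."""
    current_pending: int = 0
    offline_uncorrectable: int = 0
    reallocated: int = 0

    def read_failed(self) -> None:
        # Assumption: ECC could not recover the sector, so it is flagged
        # pending and the event is logged as uncorrectable.
        self.current_pending += 1
        self.offline_uncorrectable += 1

    def rewrite(self, succeeded: bool) -> None:
        # Assumption: writing to a pending sector always clears the flag.
        if self.current_pending == 0:
            return
        self.current_pending -= 1
        if not succeeded:
            # Only a failed write forces a remap to a spare sector.
            self.reallocated += 1


counters = SectorCounters()
counters.read_failed()            # the sector becomes unreadable
counters.rewrite(succeeded=True)  # the array re-writes it from parity
print(counters)
```

Under these assumptions the model ends at pending 0, offline uncorrectable 1, reallocated 0, which matches the counters PeteB reported. Calling `rewrite(succeeded=False)` instead gives the branch garycase describes, where the reallocated count would have incremented.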
SSD Posted May 30, 2017
I think that's true, as far as it goes, but it implies that the outcome was somehow purposeful and "normal", and neither is true.
We had a situation where a sector was marked pending. On the next parity check, it generated some 500 consecutive sector read errors. The pending sector was returned to service. And nothing is now reallocated or pending. Was this the SMART system at its finest?
I think this is a good example of the limitations of SMART. We tend to think the sum total of disk problems is sector issues. But there are heads, actuator motors, and other parts that are not sector specific. These can fail too. The best indicator I know of for some type of mechanical issue is the multi zone error. The description is vague, but often when I see that incremented, it's a sign of screwy things to come. And we had something screwy here, clearly.
I'll give a possible set of facts that fits the data. Not saying this is what happened, only a possibility. An actuator motor is failing, resulting in it getting stuck or in imprecise head positioning. This first happened and a sector was marked pending because the surface under the head required extra effort and retries. In the parity check it was worse. Attempt after attempt to move the heads failed, creating consecutive read errors back to the OS, without ever attempting to actually read a sector. The pending sector got cleared.
It is not true that SMART sees all and tells all. The attributes are pretty dumb, actually. Some give the current state (like temperature and current pending sectors); others just count things (like offline uncorrectable and CRC errors) and never decrement. Not every problem that leads to an OS read error is logged!
5 hours ago, garycase said:
"(and I would ALWAYS replace any drive that fails SMART)."
Kinda funny. What exactly do you mean by "failed SMART"? The drive had one pending sector, and now has none. The offline uncorrectable? It generated no read error and did not repeat. When SMART fails a disk it says so, and I don't think that happened here.
I'd actually like to see the new SMART report. PeteB, can you post it?
garycase Posted May 30, 2017
On 5/28/2017 at 3:56 AM, PeteB said:
The extended smart test failed. The log has the following in it:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     41081         110815568
# 2  Short offline       Completed: read failure       90%     41081         110815568
# 3  Short offline       Completed without error       00%     23169         -
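To pull the failing entries and the first-error LBA out of a log like the one quoted above, a short Python sketch follows. It assumes rows shaped exactly like these three and only recognises statuses that begin with "Completed"; other status strings would need their own handling. `SelfTestEntry` and `parse_self_test_log` are hypothetical names used only here.

```python
import re
from typing import List, NamedTuple, Optional

# One log row: "# <num> <description> <status> <remaining>% <hours> <lba or ->"
ROW = re.compile(
    r"^\s*#\s*(\d+)\s+(.+?)\s+(Completed.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


class SelfTestEntry(NamedTuple):
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]

    @property
    def failed(self) -> bool:
        # In the rows shown, failed tests carry "failure" in the status text.
        return "failure" in self.status


def parse_self_test_log(text: str) -> List[SelfTestEntry]:
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header lines and anything unrecognised are skipped
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append(SelfTestEntry(
            int(num), desc, status, int(remaining), int(hours),
            None if lba == "-" else int(lba),
        ))
    return entries


# The log rows quoted in the post above.
LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     41081         110815568
# 2  Short offline       Completed: read failure       90%     41081         110815568
# 3  Short offline       Completed without error       00%     23169         -
"""

entries = parse_self_test_log(LOG)
for entry in entries:
    if entry.failed:
        print(entry.description, "failed at LBA", entry.first_error_lba)
```

Both failed tests stop at the same LBA and the same power-on hour count, which is the detail the rest of the discussion turns on.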
PeteB (Author) Posted May 31, 2017
Hi bjp999,
I've attached the SMART report now that I've successfully replaced my parity drive. I'm about to start a parity check and then I'll replace the drive that had the SMART error.
Thank you everyone for your advice.
WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531-20170531-1838.txt
SSD Posted June 1, 2017
Nothing of note in the SMART report beyond what you had noted.
Historically the read test failures have been due to spindown, and I have never thought of them as definitive. Now that unRaid is forcing the spindown to be disabled, this is a much better test!
The second parity check, with that long string of read errors to the OS and no pending or reallocated sectors, led me to the failure conclusion independent of that test. I might have run one more parity check, expecting that the read errors would not have occurred in the same locations, if at all. That would have further confirmed my suspicion that this is an intermittent mechanical problem in the drive.
This case is interesting and highlights that drives can and do fail with few if any hints in the SMART attributes. It's not that often we see these.
I will say again, the multi zone error is often an indicator that something funky is going on with a drive. Even one is enough to pique my attention.
Thanks, PeteB, for indulging me by running the extra parity check! Good luck with the replacement.