PeteB Posted May 28, 2017

I just upgraded my server from 6.3.3 to 6.3.5. After the update downloaded and installed, I shut down all the Dockers, stopped the array and rebooted the server from the reboot button. After the reboot, one of my hard disks came back with a warning:

unRAID Disk 6 SMART health [197]: 28-05-2017 13:03 Warning [MOOSE] - current pending sector is 1 WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)
unRAID Disk 6 SMART health [198]: 28-05-2017 13:03 Warning [MOOSE] - offline uncorrectable is 1 WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)

There are no errors marked against the drive in the MAIN display tab. I'm intending to replace this hard drive shortly; I just want to understand the messages. I think the first message indicates a pending sector (i.e. one sector which needs to be remapped), and I think the second message indicates it has been remapped to the spare area. Does this mean that my data at that sector is OK (i.e. moved to the spare area), or does the pending sector count of 1 mean that it is not readable?

I've attached diagnostics. The disk in question is disk 6. I'd appreciate any advice to help me understand whether my data is OK or not.

moose-diagnostics-20170528-1307.zip
JorgeB Posted May 28, 2017

Run an extended SMART test; if it fails, replace the disk ASAP.
PeteB Posted May 28, 2017 (edited)

Thanks very much for the reply, Johnnie. I'll do that.

With regards to the pending sector count of 1, does that mean I have a corrupt sector on the disk? I know SMART says the sector has been reallocated, but the pending sector count is still at 1, so I'm guessing that it hasn't been rewritten? Does this mean I should do a disk recovery?

Edited May 28, 2017 by PeteB
JorgeB Posted May 28, 2017

If there really is a pending sector, it means it wasn't reallocated and there's a sector that can't be read. This can prevent a successful rebuild if another disk fails, which is why it's important to replace the disk if the pending sector is real; the extended SMART test will confirm it.
itimpi Posted May 28, 2017

It is also worth pointing out that a pending sector means a read failure on the sector. Reallocation is only attempted when a write to the sector is done and the write fails. Sometimes the write succeeds and the pending sector status is cleared without a reallocation taking place.
PeteB Posted May 28, 2017

Thanks very much, Johnnie and itimpi, for the replies. Sorry the following is a bit long-winded, but I just want to make sure I understand what is happening and the best next steps.

The extended SMART test failed. The log has the following in it:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%        41081          110815568
# 2  Short offline       Completed: read failure       90%        41081          110815568
# 3  Short offline       Completed without error       00%        23169          -

Just to make sure I fully understand what's happening, I've written my understanding below. Would you mind confirming I've got it all correct?

1. The fact that the pending sector count is 1 means that there was a read failure on the sector.
2. The offline uncorrectable sector count of 1 means that reallocation has taken place.
3. The pending sector count remaining at 1 means that the sector can't be read and that, as such, any data at that sector is unreadable. If it had been rewritten to the spare area then the pending sector count would have gone down to 0.
4. The extended SMART test failure means that the disk should definitely be replaced.
5. An outstanding pending sector means that recovery of another hard disk is compromised.

Looks like I've got two choices:

1. Perform the parity swap process, as I have a brand new 8TB drive ready to go. (Pity the 3TB didn't hold on just a little bit longer, as I was going to replace it next weekend.)
2. Purchase a replacement hard drive, perform a disk recovery, and then later on swap out the parity drive for the new 8TB and redeploy the old parity drive into the array to expand it.

I'm happy to buy a new hard drive to replace the bad one if that presents the least risk for my data. What would you recommend?
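As an aside for anyone reading later: the verdict and the failing LBA can be pulled out of a smartctl self-test log mechanically. A minimal Python sketch (the `first_error_lba` helper is hypothetical, not part of any unRAID tooling; the log text is copied from the post above):

```python
import re

# Self-test log text, copied from the smartctl output quoted above.
LOG = """\
# 1  Extended offline    Completed: read failure       90%     41081         110815568
# 2  Short offline       Completed: read failure       90%     41081         110815568
# 3  Short offline       Completed without error       00%     23169         -
"""

def first_error_lba(log_text):
    """Return the LBA of the first recorded read failure, or None if none."""
    for line in log_text.splitlines():
        # Columns after "read failure": Remaining%, LifeTime(hours), LBA_of_first_error
        m = re.search(r"read failure\s+(\d+)%\s+(\d+)\s+(\d+)", line)
        if m:
            return int(m.group(3))
    return None

print(first_error_lba(LOG))  # -> 110815568
```

Here both the extended and short tests report the same first-error LBA, which is consistent with a single bad spot on the platter.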
JorgeB Posted May 28, 2017

46 minutes ago, PeteB said:
1. The fact that the pending sector count is 1 means that there was a read failure on the sector.

Yes, and it's confirmed by the SMART test, although, rarely, there are false positives.

47 minutes ago, PeteB said:
2. The offline uncorrectable sector count of 1 means that reallocation has taken place.

No, that would be the reallocated sector count.

47 minutes ago, PeteB said:
The pending sector count remaining at 1 means that the sector can't be read and that as such any data at that sector is unreadable.

Correct.

47 minutes ago, PeteB said:
4. The extended SMART test failure means that the disk should definitely be replaced.

Yes.

48 minutes ago, PeteB said:
5. An outstanding pending sector means that recovery for another hard disk is compromised.

Yes; unRAID will try to finish the rebuild, but one or more files will be corrupt.

49 minutes ago, PeteB said:
I'm happy to buy a new hard drive to replace the bad one if that presents the least risk for my data. What would you recommend?

The safest would be to replace it ASAP. I would do the parity swap, but if you can get a disk in a day or two that's also a good option.
SSD Posted May 28, 2017

We cannot say that a pending sector actually caused a read error, only that the disk has determined that a sector should be reallocated at the next opportunity. Perhaps a read generated several retries and was ultimately successful with ECC. But the drive is saying: next time the user wants to write to this sector, I want to take this sector out of service and substitute a spare sector for it.

Has unRAID reported an error on the drive? That would be a true indication of a read error. Otherwise, somehow, the drive returned what it believed to be correct data.

The extended read test is not something I typically use in diagnostics of pending sectors. I believe that spindown needs to be disabled to run it correctly. I would be the last one to question @johnnie.black, but I'm not really able to comment on the results. Below is what I normally advise, and you could take it as a second diagnostic process to try.

In theory, one pending sector that does not generate OS read errors is not the end of the world. But it is extremely common that a single pending sector is just the beginning. I had one drive with 2 pending sectors that never got worse, never generated a read error, and I just monitored it. But I have had or seen a hundred others where it was the beginning of a death by a thousand cuts. Still, one pending sector would not cause me to immediately replace an expensive drive without trying to let the drive heal itself.

My advice is to run parity checks. If you can run three successful checks in a row with the problem not getting worse, I would tend to trust the drive. I would not be surprised to see the pending sector go away. I've actually seen a couple of very ugly sets of pending sectors in the thousands that mysteriously cleared with no apparent harm then or in the future. It happened a few times; I can't explain why.
I've also seen pending sectors get reallocated (although this shouldn't really happen on a read). unRAID is actually designed, when it gets a read error, to use the other drives to figure out the sector contents, and then do a write back to the offending drive. In theory, this should give the drive the ability to remap the sector. I do think the error count would increment in unRAID if this happened.

But like I said, my normal advice (and what I do myself) is to run parity checks and give the drive a chance to resolve the issue. If the problem gets worse after every parity check, and after 4 or 5 checks I'm still seeing the count increasing, even by a very small number, I'll consider the drive bad and replace it. But if the counts stay constant for 3 checks, I keep it in service and pay attention to the attribute getting worse. Very, very few drives pass this test; the ones that do are normally newer disks.

As an aside, I believe that drives should act more aggressively to address pending sectors. Instead of marking just the one, they should take it as the center and remap a small "black hole" of sectors. They should not wait for a write; if they ultimately get a good read, they should remap then. Drives, especially NAS drives, are working under a strict time limit on responding, so a tighter integration between the OS-level driver and the drive's SMART system would allow such actions to take place without the OS timing out and kicking disks. This would be a very big change and unlikely to come, but it would be a better architecture IMO.

Good luck!
JorgeB Posted May 28, 2017 Share Posted May 28, 2017 1 minute ago, bjp999 said: We cannot say that a pending sector actually caused a read error. True, but it definitely will in the next time unRAID needs to read it, e.g., during a disk rebuild, or if the user tries to read the file that it's using it. 3 minutes ago, bjp999 said: The extended read test is not something I typically don't use in diagnostics of pending sectors. I believe that spindown needs to be disabled to run it correctly. Short test is usually enough to confirm a read error when there are pending sectors, but the extended is more reliable, since before v6.2 unRAID disables spin down during an extended test, but even if it didn't you wouldn't get a read error, would get a host reset error. Quote Link to comment
JorgeB Posted May 28, 2017 Share Posted May 28, 2017 There's another thing the OP could try, but I'm not a big fan, is rebuilding to the same disk, there's a big change the pending sector will be reallocate, but in my experience there's also a big change of the disk getting more pending sectors in the near future, personally I would replace it, maybe try the rebuild if the disk was on a backup server. Quote Link to comment
SSD Posted May 28, 2017 Share Posted May 28, 2017 2 minutes ago, johnnie.black said: True, but it definitely will in the next time unRAID needs to read it, e.g., during a disk rebuild, or if the user tries to read the file that it's using it. Short test is usually enough to confirm a read error when there are pending sectors, but the extended is more reliable, since before v6.2 unRAID disables spin down during an extended test, but even if it didn't you wouldn't get a read error, would get a host reset error. I'd question the word "definitely". Most you get from me is "possibly" Does spindown need to be disabled to run the extended test? Quote Link to comment
JorgeB Posted May 28, 2017 Share Posted May 28, 2017 Just now, bjp999 said: I'd question the word "definitely". Most you get from me is "possibly" Unless the user tries to write to that sector so it can be remapped a read of it will fail 100 out of 100 tries, but I respect your opinion. 3 minutes ago, bjp999 said: Does spindown need to be disabled to run the extended test? Yes, but unRAID now does this automatically, IIRC it started on v6.1.something Quote Link to comment
SSD Posted May 28, 2017 Share Posted May 28, 2017 8 minutes ago, johnnie.black said: Unless the user tries to write to that sector so it can be remapped a read of it will fail 100 out of 100 tries, but I respect your opinion. So a subsequent parity check will cause a read error to the OS, 100% of the time? I'd like to see if that happens in this case. It would actually be a very good thing. Because unRAID would reconstruct the sector(s) with read errors, write them back to the problem disk, SMART would remap the sectors, and all would be good. In theory this is what should happen. But unfortunately I have never seen it happen. Quote Link to comment
JorgeB Posted May 28, 2017 Share Posted May 28, 2017 3 minutes ago, bjp999 said: So a subsequent parity check will cause a read error to the OS, 100% of the time? Yes, if the user did a parity check now (non correct would be safer), that sector would give a read error, unRAID would try to right it back and one of 3 things would happen: 1-write fails and the disk redballs 2-write succeeds and the sector is realoccated 3-write succeeds but the sector remains pending (this can happen with some disk/firmwares leading then to false positives) With options 2 or 3 a subsequent parity check may or not find more (or the same) errors. Quote Link to comment
SSD Posted May 28, 2017 Share Posted May 28, 2017 PeteB, can you do this test? I'd like to see what happens. First - do we see an increment on the error count (indicating unRAID got an error on the disk) Second - do we see a remap, clear, or continued pending Quote Link to comment
PeteB Posted May 28, 2017

Running a non-correcting parity check now. For me this usually runs for 16 hours. I'll report back when/if it completes.
PeteB Posted May 29, 2017 (edited)

The non-correcting parity check has completed with 0 errors (see below). The pending sector count returned to 0 and a SMART short test completed successfully; it had previously failed. From that perspective all seems good, BUT there are 60 errors in the error column against the hard drive. The relevant syslog entries are below. Subtracting the first byte from the last one comes to 472 bytes, so I interpret this as all the read errors occurring within 1 sector. Not sure if this is correct, but if it is, then this goes hand in hand with the single reallocated sector.

Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors? What do you think with regards to the state of the hard drive now? I've attached the SMART report.

Messages re parity check
=====================
unRAID Parity check: 29-05-2017 00:18 Notice [MOOSE] - Parity check started Size: 6 TB
unRAID array errors: 29-05-2017 05:52 Warning [MOOSE] - array has errors Array has 1 disk with read errors
unRAID Disk 6 SMART message [197]: 29-05-2017 06:05 Notice [MOOSE] - current pending sector returned to normal value WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)
unRAID Parity check: 29-05-2017 16:34 Notice [MOOSE] - Parity check finished (0 errors) Duration: 16 hours, 15 minutes, 19 seconds. Average speed: 102.6 MB/s

WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531-20170529-1635.txt

Log messages:
=============
May 29 05:51:21 Moose kernel: blk_update_request: critical medium error, dev sdj, sector 4405782864
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782800
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782808
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782816
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782824
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782832
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782840
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782848
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782856
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782864
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782872
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782880
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782888
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782896
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782904
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782912
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782920
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782928
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782936
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782944
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782952
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782960
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782968
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782976
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782984
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782992
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783000
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783008
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783016
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783024
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783032
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783040
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783048
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783056
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783064
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783072
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783080
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783088
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783096
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783104
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783112
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783120
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783128
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783136
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783144
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783152
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783160
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783168
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783176
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783184
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783192
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783200
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783208
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783216
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783224
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783232
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783240
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783248
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783256
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783264
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783272

Edited May 29, 2017 by PeteB
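The arithmetic behind that error count can be checked directly. A minimal sketch (the first/last sector numbers are taken from the syslog above; the 8-sector stride is the 4 KiB request size visible in the log):

```python
# First and last failing sector numbers, taken from the syslog entries above.
first, last, step = 4405782800, 4405783272, 8  # step: 8 x 512 B = one 4 KiB request

entries = (last - first) // step + 1  # one "md: disk6 read error" line per 4 KiB block
span_sectors = last - first           # the span is measured in 512-byte sectors, not bytes

print(entries)       # 60, matching the 60 errors in the Errors column
print(span_sectors)  # 472
```

So the 472 is a span of 512-byte sectors rather than bytes, i.e. the errors cover roughly 236 KiB of contiguous disk, not a single sector.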
JorgeB Posted May 29, 2017 Share Posted May 29, 2017 20 hours ago, johnnie.black said: 2-write succeeds and the sector is realoccated Option two happened, pending sector was was re-written by unRAID and reallocated by the disk. You should run another non correcting check to confirm that disk is really OK for now. 1 Quote Link to comment
JorgeB Posted May 29, 2017 Share Posted May 29, 2017 1 hour ago, PeteB said: Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors? You need to stop and re-start the array to clear the errors. 1 hour ago, PeteB said: What do you think with regards to the state of the hard drive now? It's difficult to predict but in my experience once a disk gets one or more pending sectors it's much more likely to get more in the near future, if there are more errors on the next check replace it, if not it's up to you, depends on how important is your data, if you have backups, etc. 1 Quote Link to comment
PeteB Posted May 29, 2017

Thanks for the reply, johnnie.black. This disk is going to be replaced tomorrow, so at this stage it's just academic. If it provides any helpful information, I'll run another parity check; otherwise I'll just be doing a parity swap tomorrow night and bringing my 8TB drive into the array as the parity. This disk will then get a long-deserved break from service.
JorgeB Posted May 29, 2017 Share Posted May 29, 2017 Just now, PeteB said: If it provides any helpful information, I'll run another parity check Yes, if you don't mind, another non correcting check, more out of curiosity and also if there are more errors you now you need to replace it for sure. Quote Link to comment
PeteB Posted May 29, 2017

Running now. Will report back in 16 hours.
JorgeB Posted May 29, 2017 Share Posted May 29, 2017 (edited) 2 hours ago, johnnie.black said: Option two happened, pending sector was was re-written by unRAID and reallocated by the disk. Just want to add here that the sector may have re-used and not reallocated, SMART still shows 0 reallocated sectors (although some firmwares not always update that correctly), so probably the write on the previous pending sector was successful so it was re-used, if that was the case it's much more likely to turn pending again on the next read as disks are much more likely to error on reads than writes. Edited May 29, 2017 by johnnie.black Quote Link to comment
SSD Posted May 29, 2017 Share Posted May 29, 2017 3 hours ago, PeteB said: The non-correcting parity check has completed with 0 errors (see below). The pending sector count returned to 0 and a smart short test completed successfully. It had previously failed. From that perspective all seems good, BUT there's 60 errors in the error column against the hard drive. Relevant Syslog entries below. Subtracting the first byte from the last one comes to 472 bytes, so I interpret this as all the read errors occurring within 1 sector. Not sure if this is correct, but if it is, then this goes hand in hand with the single reallocated sector. Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors? What do you think with regards to the state of the hard drive now? What happened to the reallocated sector count? An unRAID block is 4k (8 512 byte sectors), so we are talking about ~500 sectors, not bytes. But just because unRAID fails for the block, doesn't mean every sector in that block is bad - just 1 is enough. But you should see between 60 and 500 reallocated sectors. Interesting test! 1 Quote Link to comment
PeteB Posted May 29, 2017

The Reallocated Sector count is 0, along with the Current Pending Sector count. The Offline Uncorrectable count is 1. Still running the parity check at the moment; it's up to 33%. No errors yet.

WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531-20170529-1635.txt