Current Pending Sector Count



I just upgraded my server from 6.3.3 to 6.3.5.

 

After the update downloaded and installed, I shut down all the Dockers, stopped the array, and rebooted the server from the reboot button.

 

After reboot, one of my hard disks came back with a warning:

 

unRAID Disk 6 SMART health [197]: 28-05-2017 13:03
Warning [MOOSE] - current pending sector is 1
WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)
 
unRAID Disk 6 SMART health [198]: 28-05-2017 13:03
Warning [MOOSE] - offline uncorrectable is 1
WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)

 

There are no errors marked against the drive on the MAIN display tab.

 

I'm intending to replace this hard drive shortly. Just want to understand the messages:

 

I think the first message indicates a pending sector (i.e. one sector which needs to be remapped), and I think the second message indicates it has been remapped to the spare area.

 

Does this mean that my data at that sector is ok (i.e. moved to the spare area), or does the pending sector count of 1 mean that it is not readable?

 

I've attached my diagnostics. The disk in question is disk 6.

 

Appreciate any advice to help me understand whether my data is ok or not.

 

moose-diagnostics-20170528-1307.zip


Thanks very much for the reply, Johnnie. I'll do that.

 

With regards to the pending sector count of 1, does that mean I have a corrupt sector on the disk? I mean, I know SMART says the sector has been reallocated, but the pending sector count is still at 1, so I'm guessing that it hasn't been rewritten?

 

Does this mean I should do a disk recovery?

 


It is also worth pointing out that a pending sector means a read failure on the sector. Reallocation is only attempted when a write to the sector is done and the write fails. Sometimes the write succeeds and the pending sector status is cleared without a reallocation taking place.
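
For anyone who wants to keep an eye on this from the console, the relevant counters can be pulled with smartctl; a minimal sketch, using the device name (sdj) from the report above:

# Show the raw counts for pending, reallocated and offline-uncorrectable sectors
smartctl -A /dev/sdj | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct|Offline_Uncorrectable'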


Thanks very much Johnnie and itimpi for the replies. Sorry, the following is a bit long-winded, but I just want to make sure I understand what is happening and the best next steps.

 

The extended SMART test failed. The log has the following in it:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     41081         110815568
# 2  Short offline       Completed: read failure       90%     41081         110815568
# 3  Short offline       Completed without error       00%     23169         -
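
As an aside, I gather the failing LBA reported above can be probed directly to reproduce the read error; a rough sketch, assuming the drive is still sdj and using a direct (uncached) read:

# Attempt a raw 512-byte read at the LBA the self-test flagged;
# a pending sector here should return an I/O error rather than data
dd if=/dev/sdj of=/dev/null bs=512 skip=110815568 count=1 iflag=direct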
 

 

Just to make sure I fully understand what's happening, I've written my understanding below. Would you mind confirming I've got it all correct?

 

1. The fact that the pending sector count is 1 means that there was a read failure on the sector.

2. The offline uncorrectable sector count of 1 means that reallocation has taken place.

3. The pending sector count remaining at 1 means that the sector can't be read and that as such any data at that sector is unreadable. If it had been rewritten to the spare area then the pending sector count would have gone down to 0.

4. The extended SMART test failure means that the disk should definitely be replaced.

5. An outstanding pending sector means that recovery for another hard disk is compromised.

 

Looks like I've got two choices:

1. Perform the parity swap process, as I have a brand new 8TB drive ready to go. (Pity the 3TB didn't hold on just a little bit longer, as I was going to replace it next weekend :( )

2. Purchase a replacement hard drive, perform a disk recovery, and then later swap out the parity drive for the new 8TB and redeploy the old parity drive into the array to expand it.

 

I'm happy to buy a new hard drive to replace the bad one if that presents the least risk for my data. What would you recommend?

 

46 minutes ago, PeteB said:

1. The fact that the pending sector count is 1 means that there was a read failure on the sector.

 

Yes, and it's confirmed by the SMART test, although, rarely, these can be false positives.

 

47 minutes ago, PeteB said:

2. The offline uncorrectable sector count of 1 means that reallocation has taken place.

 

No, that would be the reallocated sector count

 

47 minutes ago, PeteB said:

The pending sector count remaining at 1 means that the sector can't be read and that as such any data at that sector is unreadable

 

Correct

 

47 minutes ago, PeteB said:

4. The extended SMART test failure means that the disk should definitely be replaced.

 

Yes

 

48 minutes ago, PeteB said:

5. An outstanding pending sector means that recovery for another hard disk is compromised.

 

Yes, unRAID will try to finish the rebuild but one or more files will be corrupt.

 

49 minutes ago, PeteB said:

I'm happy to buy a new hard drive to replace the bad one if that presents the least risk for my data. What would you recommend?

 

The safest option would be to replace it ASAP. I would do the parity swap, but if you can get a disk in a day or two that's also a good option.


We cannot say that a pending sector actually caused a read error. We can only say that the disk has determined that a sector should be reallocated at the next opportunity. Perhaps a read generated several retries and was ultimately successful with ECC. But the drive is saying ... next time the user wants to write to this sector, I want to take this sector out of service and substitute a spare sector for it.
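
(As a footnote: that "remap on next write" can be forced by hand with hdparm, though it zeroes the sector, so it only makes sense once the file there has been recovered or parity can rebuild it. A rough sketch, using the LBA from the self-test log above:)

# DANGER: overwrites the sector, destroying whatever data it held.
# A successful write normally either clears the pending flag or remaps the sector.
hdparm --write-sector 110815568 --yes-i-know-what-i-am-doing /dev/sdj
# Then re-check the counters:
smartctl -A /dev/sdj | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'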

 

Has unRAID reported an error on the drive? That would be a true indication of a read error. Otherwise, somehow, the drive returned what it believed to be correct data.

 

The extended read test is not something I typically use when diagnosing pending sectors. I believe that spindown needs to be disabled to run it correctly. I would be the last one to question @johnnie.black, but I'm not really able to comment on the results. Below is what I normally advise, which you could take as a second diagnostic process to try.

 

In theory, one pending sector that does not generate OS read errors is not the end of the world. But it is extremely common for a single pending sector to be just the beginning. I had one drive with 2 pending sectors that never got worse and never generated a read error, and I just monitored it. But I have had or seen a hundred others where it was the beginning of a death by a thousand cuts. Still, one pending sector would not cause me to immediately replace an expensive drive without trying to let the drive heal itself.

 

My advice is to run parity checks. If you can run three successful checks in a row with the problem not getting worse, I would tend to trust the drive. I would not be surprised to see the pending sector go away. I've actually seen a couple of very ugly sets of pending sectors in the thousands that mysteriously cleared, with no apparent harm then or in the future. It's happened a few times; I can't explain why. I've also seen pending sectors get reallocated (although this shouldn't really happen on a read). unRAID is actually designed, when it gets a read error, to use the other drives to figure out the sector contents and then write them back to the offending drive. In theory, this should give the drive the ability to remap the sector. I do think the error count would increment in unRAID if this happened.

 

But like I said, my normal advice (and what I do myself) is to run parity checks and give the drive a chance to resolve the issue. If the problem gets worse after every parity check, and after 4 or 5 checks I'm still seeing the count increase, even by a very small number, I'll consider the drive bad and replace it. But if the counts stay constant for 3 checks, I keep the drive in service and pay attention to the attribute getting worse. Very, very few drives pass this test; the ones that do are normally newer disks.
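
If you want to watch the attribute while those checks run, a simple polling loop from the console is enough; a minimal sketch (device name taken from the diagnostics above):

# Print the pending-sector count every 10 minutes during a parity check
while true; do
    date
    smartctl -A /dev/sdj | grep Current_Pending_Sector
    sleep 600
done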

 

As an aside, I believe that drives should act more aggressively to address pending sectors. Instead of marking just the one, they should take it as the center and remap a small "black hole" of sectors. They should not wait for a write - if they ultimately get a good read, they should remap then. Drives, especially NAS drives, are working under a strict time limit for responding, so a tighter integration between the OS-level driver and the drive's SMART system would allow such actions to take place without the OS timing out and kicking disks. This would be a very big change and is unlikely to come, but it would be a better architecture IMO.

 

Good luck!

1 minute ago, bjp999 said:

We cannot say that a pending sector actually caused a read error.

 

True, but it definitely will the next time unRAID needs to read it, e.g., during a disk rebuild, or if the user tries to read the file that is using it.

 

3 minutes ago, bjp999 said:

The extended read test is not something I typically use when diagnosing pending sectors. I believe that spindown needs to be disabled to run it correctly.

 

A short test is usually enough to confirm a read error when there are pending sectors, but the extended test is more reliable. Since before v6.2, unRAID has disabled spin-down during an extended test, but even if it didn't, you wouldn't get a read error; you would get a host reset error.
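
(For reference, both tests can also be kicked off from the console; a sketch:)

smartctl -t short /dev/sdj     # a couple of minutes
smartctl -t long /dev/sdj      # several hours on a 3TB drive
smartctl -l selftest /dev/sdj  # view the self-test log when finished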


There's another thing the OP could try, though I'm not a big fan of it: rebuilding to the same disk. There's a big chance the pending sector will be reallocated, but in my experience there's also a big chance of the disk getting more pending sectors in the near future. Personally I would replace it, and maybe try the rebuild if the disk were on a backup server.

2 minutes ago, johnnie.black said:

True, but it definitely will the next time unRAID needs to read it, e.g., during a disk rebuild, or if the user tries to read the file that is using it.

 

A short test is usually enough to confirm a read error when there are pending sectors, but the extended test is more reliable. Since before v6.2, unRAID has disabled spin-down during an extended test, but even if it didn't, you wouldn't get a read error; you would get a host reset error.

 

I'd question the word "definitely". The most you'll get from me is "possibly" :)

 

Does spindown need to be disabled to run the extended test?

 

Just now, bjp999 said:

I'd question the word "definitely". The most you'll get from me is "possibly" :)

 

Unless the user writes to that sector so it can be remapped, a read of it will fail 100 times out of 100, but I respect your opinion. :)

 

3 minutes ago, bjp999 said:

Does spindown need to be disabled to run the extended test?

 

Yes, but unRAID now does this automatically; IIRC it started in v6.1.something.

8 minutes ago, johnnie.black said:

Unless the user writes to that sector so it can be remapped, a read of it will fail 100 times out of 100, but I respect your opinion. :)

 

So a subsequent parity check will cause a read error to the OS, 100% of the time? I'd like to see if that happens in this case. It would actually be a very good thing, because unRAID would reconstruct the sector(s) with read errors, write them back to the problem disk, SMART would remap the sectors, and all would be good. In theory this is what should happen, but unfortunately I have never seen it happen.

 

3 minutes ago, bjp999 said:

So a subsequent parity check will cause a read error to the OS, 100% of the time?

 

Yes. If the user did a parity check now (non-correcting would be safer), that sector would give a read error, unRAID would try to write it back, and one of three things would happen:

 

1 - the write fails and the disk redballs

2 - the write succeeds and the sector is reallocated

3 - the write succeeds but the sector remains pending (this can happen with some disks/firmwares, leading to false positives)

 

With options 2 or 3, a subsequent parity check may or may not find more (or the same) errors.
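
A quick way to tell options 2 and 3 apart afterwards is to compare the two counters; a sketch:

# Pending back to 0 with reallocated incremented => option 2;
# pending still non-zero => option 3 (assuming the disk didn't redball)
smartctl -A /dev/sdj | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'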


PeteB, can you do this test? I'd like to see what happens.

 

First - do we see an increment in the error count (indicating unRAID got an error on the disk)?

 

Second - do we see a remap, a clear, or a continued pending sector?


The non-correcting parity check has completed with 0 errors (see below). The pending sector count returned to 0 and a SMART short test completed successfully; it had previously failed. From that perspective all seems good, BUT there are 60 errors in the error column against the hard drive. Relevant syslog entries are below. Subtracting the first byte from the last one comes to 472 bytes, so I interpret this as all the read errors occurring within 1 sector. Not sure if this is correct, but if it is, then this goes hand in hand with the single reallocated sector.

 

Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors?

 

What do you think with regards to the state of the hard drive now?

 

I've attached the SMART report.

 

Messages re Parity check

=====================

unRAID Parity check: 29-05-2017 00:18
Notice [MOOSE] - Parity check started
Size: 6 TB
 
unRAID array errors: 29-05-2017 05:52
Warning [MOOSE] - array has errors
Array has 1 disk with read errors
 
unRAID Disk 6 SMART message [197]: 29-05-2017 06:05
Notice [MOOSE] - current pending sector returned to normal value
WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531 (sdj)
 
unRAID Parity check: 29-05-2017 16:34
Notice [MOOSE] - Parity check finished (0 errors)
Duration: 16 hours, 15 minutes, 19 seconds. Average speed: 102.6 MB/s

WDC_WD30EZRX-00MMMB0_WD-WCAWZ2181531-20170529-1635.txt

 

Log Messages:

=============

May 29 05:51:21 Moose kernel: blk_update_request: critical medium error, dev sdj, sector 4405782864
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782800
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782808
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782816
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782824
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782832
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782840
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782848
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782856
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782864
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782872
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782880
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782888
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782896
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782904
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782912
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782920
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782928
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782936
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782944
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782952
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782960
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782968
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782976
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782984
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405782992
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783000
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783008
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783016
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783024
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783032
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783040
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783048
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783056
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783064
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783072
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783080
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783088
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783096
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783104
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783112
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783120
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783128
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783136
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783144
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783152
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783160
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783168
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783176
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783184
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783192
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783200
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783208
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783216
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783224
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783232
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783240
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783248
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783256
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783264
May 29 05:51:21 Moose kernel: md: disk6 read error, sector=4405783272
1 hour ago, PeteB said:

Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors?

 

You need to stop and re-start the array to clear the errors.

 

1 hour ago, PeteB said:

What do you think with regards to the state of the hard drive now?

 

It's difficult to predict, but in my experience once a disk gets one or more pending sectors it's much more likely to get more in the near future. If there are more errors on the next check, replace it; if not, it's up to you. It depends on how important your data is, whether you have backups, etc.


Thanks for the reply, johnnie.black.

 

This disk is going to be replaced tomorrow, so at this stage it's just academic. 

 

If it provides any helpful information I'll run another parity check; otherwise I'll just do the parity swap tomorrow night and bring my 8TB drive into the array as parity.

 

This disk will then get a long-deserved break from service. ;)

 

2 hours ago, johnnie.black said:

Option two happened, the pending sector was re-written by unRAID and reallocated by the disk.

 

Just want to add here that the sector may have been re-used and not reallocated. SMART still shows 0 reallocated sectors (although some firmwares don't always update that correctly), so probably the write to the previously pending sector was successful and it was re-used. If that was the case, it's much more likely to turn pending again on the next read, as disks are much more likely to error on reads than on writes.

3 hours ago, PeteB said:

The non-correcting parity check has completed with 0 errors (see below). The pending sector count returned to 0 and a SMART short test completed successfully; it had previously failed. From that perspective all seems good, BUT there are 60 errors in the error column against the hard drive. Relevant syslog entries are below. Subtracting the first byte from the last one comes to 472 bytes, so I interpret this as all the read errors occurring within 1 sector. Not sure if this is correct, but if it is, then this goes hand in hand with the single reallocated sector.

 

Would it be correct to assume that if I clear the statistics and run a second parity check I would expect the errors column to be 0 and have no read errors?

 

What do you think with regards to the state of the hard drive now?

 

What happened to the reallocated sector count?

 

An unRAID block is 4k (eight 512-byte sectors), so we are talking about ~500 sectors, not bytes. But just because unRAID fails the block doesn't mean every sector in that block is bad - just 1 is enough. But you should see between 60 and 500 reallocated sectors.
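
The arithmetic is easy to sanity-check in the shell, using the first and last failing sectors from the syslog above:

# 60 failed 4k blocks of 8 sectors each span 480 sectors
echo $(( (4405783272 - 4405782800) / 8 + 1 ))        # 60 blocks
echo $(( ((4405783272 - 4405782800) / 8 + 1) * 8 ))  # 480 sectors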

 

Interesting test!

