Help reading SMART report?



I am trying out unRAID for the first time.  Just built a 30-drive, 21TB array.  Took about 17 hours or so to run parity (I was starting from scratch... why was that necessary?).  Anyway, it's finally up, and I have a drive reporting some errors.  It doesn't complete the SMART short or extended tests, so I assume it's a paperweight.  However, it's still green on the unRAID dashboard?  Can someone please advise and tell me what to look out for?  I already researched the reallocated sector count - too early to tell if that's still climbing - and I don't know about the rest.  Thanks!

 

A snippet follows:

 

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       26087299
  3 Spin_Up_Time            0x0003   100   099   085    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       639
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       7
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       86661325
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13820
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       622
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   054   045    Old_age   Always       -       34 (Min/Max 27/44)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2021682
194 Temperature_Celsius     0x0022   034   046   000    Old_age   Always       -       34 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   046   042   000    Old_age   Always       -       26087299
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9437 (60 20 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       884065040
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       391664078
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 13806 hours (575 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      05:01:29.737  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:01:29.722  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:01:29.628  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      05:01:29.627  READ LOG EXT
  60 00 00 ff ff ff 4f 00      05:01:26.927  READ FPDMA QUEUED

Error 7 occurred at disk power-on lifetime: 13806 hours (575 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      05:01:26.927  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:01:26.888  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      05:01:26.823  READ LOG EXT
  60 00 00 ff ff ff 4f 00      05:01:24.164  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:01:23.460  READ FPDMA QUEUED
 

 

 


OK, thanks for confirming.  So even though under the disk properties it says "SMART check passed" (which is weird, because I can't even run the test???) you would replace it.  Conventional wisdom I've read on the forums says if the reallocated sectors don't increase I am probably still good - but if I ever see pending sectors, that basically means replace immediately?


A pending sector basically means that the firmware has detected a problematic sector that it intends to reallocate on the next write to that sector. But when it actually does a write to that sector, it does a final check to see if it should really reallocate, and sometimes decides it's good enough and doesn't remap it.
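If you want to watch whether those pending sectors clear or turn into reallocations, here is a rough sketch that polls smartctl for the interesting counters. It's only an illustration, nothing official - it assumes smartmontools is installed, needs root, and /dev/sdb and the one-hour interval are just examples:

import re
import subprocess
import time

DEVICE = "/dev/sdb"   # example device - point this at your actual drive
INTERVAL = 3600       # example interval: poll once an hour
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def read_attrs(device):
    # Parse raw values of the watched attributes out of `smartctl -A` output.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            # RAW_VALUE may carry a suffix like "34 (Min/Max 27/44)"; keep the leading integer
            attrs[fields[1]] = int(re.match(r"\d+", fields[9]).group())
    return attrs

last = read_attrs(DEVICE)
print(time.strftime("%F %T"), last)
while True:
    time.sleep(INTERVAL)
    now = read_attrs(DEVICE)
    for name in WATCHED:
        if now.get(name) != last.get(name):
            print(time.strftime("%F %T"), name, "changed:",
                  last.get(name), "->", now.get(name))
    last = now

If Current_Pending_Sector falls while Reallocated_Sector_Ct rises, the drive remapped; if pending falls with no new reallocations, the firmware decided those sectors were good enough after all.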

 

And unRAID has some pretty cool handling of bad sector reads - if one happens, it reconstructs the sector from all the other disks plus parity and then WRITES that sector back to the disk with the error. The disk is supposed to remap that sector, and now it has the perfect data to write there. This all sounds very good in theory, and if disks would just read the rulebook and play their part, it would work great. But most disks simply work great and have no sector issues, and when sector issues do start to crop up, they are more a symptom of the illness than the illness itself. Because despite the drive's remapping and marking sectors for reallocation, the real problem is that some mechanical failure is happening. Maybe the heads aren't being positioned quite accurately.
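To make the principle concrete, here is a toy sketch in Python of single-parity XOR reconstruction, with in-memory byte blocks standing in for disks. This is only an illustration of the math, not unRAID's actual code:

import os

SECTOR = 512  # bytes per sector, the classic size

def xor_blocks(blocks):
    # XOR a list of equal-length byte blocks together, byte by byte.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Pretend array: three data "disks" and one parity "disk", one sector each.
data_disks = [bytearray(os.urandom(SECTOR)) for _ in range(3)]
parity_disk = bytearray(xor_blocks(data_disks))

# Simulate an unreadable sector on disk 1 by forgetting its contents.
original = bytes(data_disks[1])
data_disks[1] = None

# Reconstruction: XOR the same sector from every OTHER disk plus parity...
peers = [d for d in data_disks if d is not None] + [parity_disk]
rebuilt = xor_blocks(peers)

# ...then write the rebuilt sector back to the failing disk, which is what
# gives the drive its chance to remap the bad sector on the write.
data_disks[1] = bytearray(rebuilt)
assert rebuilt == original  # the data comes back bit-perfect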

 

Whatever the reasons, what we know is that once a disk that is a bit older, and previously had no reallocated sectors, starts to develop them, it tends to keep developing them. And it is time to replace the disk. Yes, there are tests you can run and ways to try to put a disk back in service, and maybe once in a blue moon it will work for a while, but generally it will disappoint and you'll be looking to replace it.


The larger issue here is that unRAID requires perfection from all remaining array drives to correctly rebuild a missing drive. Leaving a questionable drive in the array will bite you at the worst possible moment - at least that's been my personal experience. You may wish to search and read up on how parity does its magic; once you understand that, many of the things concerning drive failures, rebuilds and parity syncs will become clear.

 

Let me just plant the seed by telling you that the parity disk by itself contains NO sensible data in most cases, and that is the way it has to be. Recovery takes ALL array drives, read over their entire capacity, combined - not just the parity disk and the failed drive, and not just the areas of the disks containing data.
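A tiny worked example to plant that seed, with hypothetical one-byte "disks" (real parity does the same thing, bit by bit, across the whole drive):

# Three data "disks" of one byte each, plus single XOR parity.
d0, d1, d2 = 0b10110100, 0b01101001, 0b11000011
parity = d0 ^ d1 ^ d2        # 0b00011110 - by itself, just noise

# Disk 1 dies. XOR parity with EVERY surviving data disk to get it back:
recovered = parity ^ d0 ^ d2
assert recovered == d1

# Leave even one surviving disk out and you get garbage, not d1:
assert (parity ^ d0) != d1

Every surviving drive has to read back perfectly over its whole surface for that recovery to work, which is exactly why a questionable drive left in the array is so dangerous.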

 

I would recommend never having multiple empty drives in the array; only add the drives you need to store data. Even an empty drive failing will put your remaining data in jeopardy.

 

30 older, untested drives all in the array is a recipe for data loss.


jonathanm - Understood about using old drives.  Unfortunately, I am replacing my full 8-bay Synology, currently at 32TB, with this install.  I need those old drives to get enough space to complete the migration.  Anything that appears bad from the get-go I certainly won't use for the migration... but any drives that appear OK will need to be used.  Once that's done, I will rip the good disks out of the Synology (higher-capacity WD Reds) and put those in the unRAID array.  So this is temporary.

 

trurl - Correct - brand new install and still learning.  I will configure. :)

