inh Posted September 29, 2013

Noticed that disk 5 had redballed yesterday, RIGHT BEFORE packing up the server to move... I'm at the new place now and started looking into it. This is the first error I've ever had with unRAID, so I want to make sure I'm doing things right...

First thing I did was pull a SMART report:

root@Tower:~# smartctl -a -A /dev/sde
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00AZ6B0
Serial Number:    WD-WCC070107459
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 28 15:19:15 2013 HST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   249   246   021    Pre-fail  Always       -       9525
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1267
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6654
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       88
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   140   140   000    Old_age   Always       -       181499
194 Temperature_Celsius     0x0022   123   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Then I ran a long self test:

root@Tower:~# smartctl -t long /dev/sde
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Sat Sep 28 19:37:07 2013

Use smartctl -X to abort test.

When I pulled the SMART report again after 5 hours, I noticed that it didn't complete; it failed with a read error:

root@Tower:~# smartctl -a -A /dev/sde
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00AZ6B0
Serial Number:    WD-WCC070107459
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 28 19:40:00 2013 HST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (52380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   249   246   021    Pre-fail  Always       -       9525
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1267
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6659
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       88
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   140   140   000    Old_age   Always       -       181501
194 Temperature_Celsius     0x0022   124   111   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      6655         476481360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

At this point I feel like this is a bad drive, but maybe someone could shed some light on it. Thank you.
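Editor's sketch: the overall health line says PASSED, but a few raw values in the attribute table above are nonzero. The snippet below is only an illustration of how such rows could be picked out of the text. The watch list `WATCHED_IDS` and the helper names are hypothetical choices made for this example, not part of smartctl or unRAID, and the sample rows are copied from the report above.

```python
import re

# A few attribute rows copied from the report above (raw value is the last column).
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   140   140   000    Old_age   Always       -       181499
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

# Hypothetical watch list (an assumption for this sketch): attribute IDs
# whose raw value is normally expected to stay at 0 on a healthy drive.
WATCHED_IDS = {5, 196, 197, 198}

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for every row that matches."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def flag_nonzero(attrs, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is greater than zero."""
    return {i: attrs[i] for i in sorted(watched) if i in attrs and attrs[i][1] > 0}


if __name__ == "__main__":
    for attr_id, (name, raw) in flag_nonzero(parse_attributes(REPORT)).items():
        print(f"{attr_id} {name}: raw value {raw}")
```

On the rows shown, this picks out Current_Pending_Sector and Offline_Uncorrectable, which is the part of the report worth a second look alongside the failed long test.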
garycase Posted September 29, 2013

The SMART report looks fine. I suspect you may have a loose cable or a poorly seated drive (if it's in a hot-swap cage). Re-seat the cables and/or the drive, then Start the array and see if the drive is still red-balled (it likely will be). Then:

1. Stop the array.
2. Unassign the drive.
3. Start the array (it will now show a missing drive).
4. Stop the array and re-assign the drive to the same slot.
5. Start the array ... it will now rebuild the drive.

This will only work if you've got good parity -- hopefully you've been running routine parity checks and have good parity.
garycase Posted September 29, 2013

Note: After the rebuild, run a parity check to confirm all went well with the rebuild. Also, if the rebuild fails, then there's a problem with the drive that SMART hasn't detected -- if that's the case, you need to replace the drive ... and let UnRAID do the rebuild on the replacement.
inh Posted September 29, 2013

It is in a hot-swap cage, and it's a wonderful, high-quality Norco one at that! I tried moving bays, as well as re-seating it, etc. It stays redballed. Parity should be good; I've never had any issues before, and a parity check is done monthly. I was concerned about rebuilding onto a possibly failed drive because it seems odd that it would fail to complete the SMART test.
garycase Posted September 29, 2013

Whoops! Sorry about that -- I overlooked that you had posted several windows of the SMART outputs. I had only looked at the SMART report itself. With the drive failing the SMART long test, I'd definitely replace it. I suppose that could have been due to a loose cable -- so you may want to try reseating it and then rerun the SMART test => but if it fails again, I wouldn't bother with a rebuild ... I'd just replace it.
inh Posted September 29, 2013

Glad we're on the same page. Just to be sure, I moved the drive to another section of the case where the drives have been working fine, and am re-running the long test. I'll check the report in the morning and see if there's anything new.
inh Posted September 29, 2013

Still failing the test during a read:

root@Tower:~# smartctl -a -A /dev/sdd
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00AZ6B0
Serial Number:    WD-WCC070107459
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Sep 29 06:19:54 2013 HST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (52380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   246   021    Pre-fail  Always       -       8100
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1269
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6669
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   140   140   000    Old_age   Always       -       181502
194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      6659         476481366
# 2  Extended offline    Completed: read failure       90%      6655         476481360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Looks like it's time to replace the drive, or is this kind of read failure somewhat common?
This is one of my newer drives; my oldest ones are still going strong. I always run at least two, if not three, preclears on them when I first get them, and these are the first errors I've seen.
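Editor's sketch: the two failed runs in the self-test log stop at LBAs that are very close together. The snippet below pulls the `LBA_of_first_error` column out of the two log lines quoted in the thread and compares them. The 8:1 ratio of logical to physical sectors is an assumption (512-byte logical sectors on 4096-byte physical sectors); the smartctl output in this thread does not report sector sizes, so treat the last step as illustrative only. The function names are made up for the example.

```python
import re

# The two failed entries, copied from the self-test log in the post above.
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure       90%      6659         476481366
# 2  Extended offline    Completed: read failure       90%      6655         476481360
"""

# Assumption (not stated anywhere in the thread): 8 logical LBAs
# per physical sector, i.e. 512-byte logical / 4096-byte physical.
LOGICAL_PER_PHYSICAL = 8

FAILED_ROW = re.compile(r"^#\s*\d+\s+.*read failure\s+\d+%\s+\d+\s+(\d+)\s*$")


def first_error_lbas(log_text):
    """Return LBA_of_first_error for every 'read failure' row, in log order."""
    lbas = []
    for line in log_text.splitlines():
        match = FAILED_ROW.match(line)
        if match:
            lbas.append(int(match.group(1)))
    return lbas


def physical_sector(lba, ratio=LOGICAL_PER_PHYSICAL):
    """Index of the physical sector holding this LBA, under the assumed ratio."""
    return lba // ratio


if __name__ == "__main__":
    newest, oldest = first_error_lbas(SELFTEST_LOG)
    print("LBAs apart:", newest - oldest)
    print("same physical sector (assumed 4K):",
          physical_sector(newest) == physical_sector(oldest))
```

If the 4K assumption holds, both runs tripped over the same physical sector, which would fit a single localized bad spot rather than errors spread across the platter. That is a reading of the numbers, not something the log itself states.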
garycase Posted September 29, 2013

No, it should finish with no problem. Time to replace the drive.
inh Posted October 7, 2013

After running shred on the drive for three cycles and then preclearing it twice, hoping to run it into the ground, I ran a SMART test and it has actually improved...

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00AZ6B0
Serial Number:    WD-WCC070107459
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Oct 7 04:45:47 2013 HST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 242) Self-test routine in progress...
                                        20% of test remaining.
Total time to complete Offline
data collection:                (52380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   246   021    Pre-fail  Always       -       9083
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1278
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6802
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       98
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   140   140   000    Old_age   Always       -       181522
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6771         -
# 2  Extended offline    Completed: read failure       90%      6659         476481366
# 3  Extended offline    Completed: read failure       90%      6655         476481360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Kind of interesting... I'm doubtful whether I should warranty it now. Has anyone seen this happen before?
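Editor's sketch: to see exactly what "improved" between the September 29 report and the October 7 report, the raw values can be compared side by side. The numbers below are copied from those two reports; the dictionary layout and the `changed` helper are just one hypothetical way to do the comparison.

```python
# Raw values copied from the Sep 29 report (before shred and preclear).
BEFORE = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 2,
    "Offline_Uncorrectable": 1,
    "Multi_Zone_Error_Rate": 1,
}

# Raw values copied from the Oct 7 report (after shred and preclear).
AFTER = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 1,
    "Multi_Zone_Error_Rate": 0,
}


def changed(before, after):
    """Return {name: (old, new)} for attributes present in both whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    for name, (old, new) in changed(BEFORE, AFTER).items():
        print(f"{name}: {old} -> {new}")
```

Current_Pending_Sector dropped from 2 to 0 while Reallocated_Sector_Ct stayed at 0, and Offline_Uncorrectable is unchanged at 1. One common reading is that the full-disk writes let the drive rewrite the pending sectors in place without remapping them, but the reports alone do not prove that.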
misterbeetz Posted October 7, 2013

From your SMART report: 181,522 head parks (Load_Cycle_Count) over 6,802 power-on hours. That amounts to 26.7 head parks per hour. I believe the constant parking is possibly what messed up your drive. I know people here will disagree with me on this, but as someone who has been using only WD Green drives, I think I've been seeing a similar pattern myself: in my case, the more unreliable drives tend to be the ones with excessive head park counts.

Unfortunately, this happens because the Intellipark feature does not work correctly with Linux-based systems. Essentially, the drive is never really given a chance to rest, and it never stays parked like it should. To fix this, the wdidle3 utility (or similar) should be run on all Green drives that are used in Linux or Linux-based NAS systems...

In my opinion, you should RMA the drive, but make sure you run the utility on the replacement to either turn off Intellipark or make it less aggressive. You should also do the same for any Greens you have left in your array. Running this fix does not affect the data on your drive, so it's safe to do...
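Editor's sketch: the 26.7 figure above is simply Load_Cycle_Count divided by Power_On_Hours. The snippet reproduces that arithmetic with the two raw values from the October 7 report; the function name is made up for the example, and the result is a lifetime average, so it says nothing about how the parks were distributed over time.

```python
def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Average number of head load/unload cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


if __name__ == "__main__":
    # Raw values of attributes 193 and 9 from the Oct 7 report.
    rate = load_cycles_per_hour(181522, 6802)
    print(f"{rate:.1f} head parks per hour")
    print(f"about one park every {3600 / rate:.0f} seconds on average")
```

The same two attributes can be read off any of the other drives in the array to compare their average park rates.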