WD Datacenter Gold 12TB issues


dchamb


Hello,

I am trying to ready a Western Digital Datacenter Gold 12TB drive to replace my current parity drive. I ran the preclear for over 58 hours making it through 4 of 5 steps before failing on step 5. Here are my questions:

 

1. Am I dealing with a bad drive?

2. Is there a way to start the preclear without going through steps 1 - 4?

3. Could there be a BIOS issue here?

 

Here is the preclear report:

 

############################################################################################################################
#                                                                                                                          #
#                                         unRAID Server Preclear of disk 8DG3KEVD                                          #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 5 - Pre-read verification:                                                  [17:14:50 @ 193 MB/s] SUCCESS    #
#   Step 2 of 5 - Zeroing the disk:                                                        [41:09:23 @ 80 MB/s] SUCCESS    #
#   Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 4 of 5 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#   Step 5 of 5 - Post-Read verification:                                                                          FAIL    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                              Cycle elapsed time: 58:33:47 | Total elapsed time: 58:33:47                                 #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  STATUS                                                                           #
#   5-Reallocated_Sector_Ct      0        -                                                                                #
#   9-Power_On_Hours             0        -                                                                                #
#   194-Temperature_Celsius      34       -                                                                                #
#   196-Reallocated_Event_Count  0        -                                                                                #
#   197-Current_Pending_Sector   0        -                                                                                #
#   198-Offline_Uncorrectable    0        -                                                                                #
#   199-UDMA_CRC_Error_Count     131      -                                                                                #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################

--> FAIL: Post-Read verification failed. Your drive is not zeroed.


root@Tower:/usr/local/emhttp#

Thanks!

 

Dale

 

Link to comment

I replaced the cable and started preclear again. This time it was going much faster, and the CRC error count did not increase from where it was before the cable was replaced. I got through steps 1 through 4, but it has been hung up on step 5 for several hours at 18%. I really need to determine whether the drive is faulty even though SMART shows it is fine, whether there is a problem with preclear, or whether it is something else.
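For anyone who wants to track that CRC counter between runs, here is a minimal Python sketch that pulls the attribute values out of the S.M.A.R.T. table of a preclear report and compares attribute 199 across two reports. The names `parse_smart_table` and `crc_delta` are made up for illustration, and the row layout is assumed from the report in the first post. The "after" sample is hypothetical: it only restates the observation that the count stayed at 131.

```python
import re

# Matches report rows such as "#   199-UDMA_CRC_Error_Count     131      -"
ROW = re.compile(r"^#\s+(\d+)-(\S+)\s+(\d+)\s")


def parse_smart_table(report):
    """Map attribute ID -> (name, initial value) for each table row found."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def crc_delta(before, after, attr_id=199):
    """Return how much the given attribute grew between two reports."""
    return parse_smart_table(after)[attr_id][1] - parse_smart_table(before)[attr_id][1]


# Rows copied from the report in the first post.
BEFORE = """\
#   5-Reallocated_Sector_Ct      0        -
#   197-Current_Pending_Sector   0        -
#   199-UDMA_CRC_Error_Count     131      -
"""

# Hypothetical second report, assuming the counter did not move after the cable swap.
AFTER = BEFORE

print(crc_delta(BEFORE, AFTER))  # 0 means no new CRC errors since the first run
```

A delta of zero after a cable swap would point at the old cable rather than the drive, since this counter never resets and only grows when new transfer errors occur.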

 

Help!

 

Thanks


Dale

Link to comment

Not a good sign. If the cabling is good and the drive locks up, seems like a bad drive.

 

Using fancy names like "gold" and "datacenter" may give you a warm fuzzy feeling in your psyche that this drive is going to have a long and problem-free life, but the truth is drives are drives, and commercial and enterprise drives have similar failure rates. The "bathtub curve" phenomenon is real, meaning that early drive fatality is more common than fatality after a break-in period.

 

12TB drives are relatively new and nowhere near the sweet spot on price. We old timers tend to be very price conscious, because the premium drives we bought in the 2T and 3T days are long gone or in backup servers, and we realize that it is costly to do the refresh cycles. 8T drives have been the way to go for the past 6-8 months or so at ~$20/T. 12T drives are about $35/T.

 

So you are one of the few I've seen with 12s. They keep saying they are pushing the laws of physics to make higher capacity drives, but somehow they keep doing it anyway. I guess HAMR is coming and maybe we'll see a jump in sizes. But these 12s may be eking out the last bit of capacity from the current tech, and it could be that you are a bit out there on the bleeding edge, and more failures are going to be normal. Or it could just be that this is a bad drive, and a replacement will work just fine.

 

BTW, a failure in the 3rd pass (the post-read) is a bad thing. That literally means that the drive read something other than a zero somewhere on the disk. There is a file that gets generated that tells you where. Probably just a couple of bytes. It could be that cable crosstalk induced a non-zero signal AFTER the read, or that the marginal cable connection did something similar. But I will say that this is extremely rare. I've only seen it a small handful of times. The drive's ECC will usually not let a bad read escape the drive. I call it spewing garbage when a drive returns data that is different from what was written to the disk. There are those who would argue that it is impossible, but it does happen, as you've proven. Bit rot is sometimes blamed when it happens in the real world, but you've got some very fast rotting happening if this problem develops between the 2nd and 3rd stage of a preclear!
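To make "read something other than a zero" concrete, here is a rough Python sketch of that kind of check: read the device in chunks and report the offset of the first non-zero byte. This is not the preclear script's own code. The function name `first_nonzero_offset`, the 1 MiB chunk size, and the `start` parameter (for skipping whatever area holds the preclear signature) are all assumptions for illustration.

```python
import io

CHUNK = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def first_nonzero_offset(stream, start=0):
    """Return the offset of the first non-zero byte at or after `start`, or None."""
    stream.seek(start)
    offset = start
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            return None  # reached the end without finding a non-zero byte
        stripped = chunk.lstrip(b"\x00")  # drop the leading zero bytes
        if stripped:
            return offset + len(chunk) - len(stripped)
        offset += len(chunk)


# Small in-memory demo: 3,000,000 zero bytes, one stray byte, then more zeros.
demo = io.BytesIO(bytes(3_000_000) + b"\x01" + bytes(10))
print(first_nonzero_offset(demo))  # 3000000
```

On a real system the stream would be the block device opened read-only in binary mode, for example `open("/dev/sdX", "rb")` where `/dev/sdX` is a placeholder. Running the scan twice and getting different offsets would suggest a transfer problem (cable, controller) rather than bad data actually stored on the platters.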

 

You might rule out cabling problems, but I'd be pretty quick to pull the trigger on a replacement. If you're within the return window of wherever you bought it, you'd be assured of getting a brand new drive, which is better than a possible refurb from WD.

Link to comment
 
Seems the fault lies with preclear. It crashed with a segmentation fault. I'm going to reboot and try it from a command line.

Btw, I'm not hung up on the fancy names either. That's just what they call it. I have a 10TB WD Gold that runs like a top, so when they came out with a 12TB for the same price as the 10TB I grabbed it up. Being 62, I think I'm an old timer myself lol!


Link to comment

Did you use the preclear plugin?
It tends to die, i.e. it stops progressing and the CPU and memory utilization of the preclear script skyrockets to 100%.


Link to comment
23 hours ago, AndroidCat said:

Did you use the preclear plugin?
It tends to die, i.e. it stops progressing and the CPU and memory utilization of the preclear script skyrockets to 100%.
 

I used the preclear plugin. But when I try to run the script, it keeps telling me the drive is busy! Why can't I get this thing to preclear? I'm thinking of just putting the drive in the array and forgetting preclear.

Screenshot (5).png

Link to comment
I had to kill it from the CLI and start over. Luckily it saves progress periodically and resumes where it left off.

Link to comment

 

2 hours ago, AndroidCat said:

I had to kill it from the CLI and start over. Luckily it saves progress periodically and resumes where it left off.
 

Not sure what there is to kill. I rebooted the unRAID machine and it still says the device is busy. It looks like an error in the script to me.
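One way to see whether anything is actually holding the device open is to walk the per-process file descriptor links under `/proc` on Linux. The Python sketch below does that. The function name `find_holders` and its `proc_root` parameter are made up for illustration, and it only catches processes with the device node itself open. It will not notice a mounted partition or a stale state file left behind by the script, either of which could also produce a "busy" message.

```python
import os


def find_holders(device, proc_root="/proc"):
    """Return the sorted PIDs that have `device` open, judged by /proc/<pid>/fd links."""
    target = os.path.realpath(device)
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to look
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(entry))
                    break
            except OSError:
                continue  # descriptor closed while we were looking
    return sorted(holders)
```

Run as root, something like `find_holders("/dev/sdX")` (with `/dev/sdX` a placeholder for the real device) would list the PIDs to look at. An empty list right after a reboot would support the idea that the "busy" message comes from the script rather than from a live process.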

Link to comment
  • 2 months later...

It was a couple of days, but that was going through the first 4 phases of preclear. Preclear crashed because of the script problem in phase 5, so I never completed it. I assigned the drive to unRAID and everything has been working fine.

My array consists of the 12TB WD Gold as the parity drive, a 10TB Gold data drive, and three 6TB Red drives, and it takes 1 day 40 minutes to do a parity check at 135.1 MB/s.
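Those two figures line up with each other. As a quick sanity check, here is a small Python calculation, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and assuming the check has to cover the full 12TB of the parity drive. The function name `parity_check_duration` is just for illustration.

```python
def parity_check_duration(capacity_bytes, avg_mb_per_s):
    """Return (days, hours, minutes) needed to read capacity_bytes at avg_mb_per_s."""
    seconds = int(capacity_bytes / (avg_mb_per_s * 1_000_000))
    days, remainder = divmod(seconds, 86_400)
    hours, remainder = divmod(remainder, 3_600)
    return days, hours, remainder // 60


# 12 TB parity drive at the reported average of 135.1 MB/s
print(parity_check_duration(12 * 10**12, 135.1))  # (1, 0, 40)
```

So 12TB at 135.1 MB/s works out to roughly 88,800 seconds, which is the 1 day 40 minutes reported.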

 

Link to comment
