Parity issues are back again



I've had this issue with disks getting disabled during parity checks for a while now. To date I've replaced the power supply, controller card and SATA cables.

 

Every time, without fail, I'll create a new config and run a parity check, and it will say everything is good. Then during the next 1-2 parity checks one of the drives will get redballed. I really don't know what else to do at this point.

 

syslog attached.

tower-diagnostics-20170524-1102.zip

Link to comment
26 minutes ago, coldhammer said:

I've had this issue with disks getting disabled during parity checks for a while now. To date I've replaced the power supply, controller card and SATA cables.

 

Every time, without fail, I'll create a new config and run a parity check, and it will say everything is good. Then during the next 1-2 parity checks one of the drives will get redballed. I really don't know what else to do at this point.

 

syslog attached.

tower-diagnostics-20170524-1102.zip

 

You do have some drive issues (drives identified by the last three chars of their Serial Number)

 

550

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       322

Although not one of the more familiar attributes to monitor, I've seen some of the flakiest drive issues with Multi-Zone-Error-Rate above 0. But I am not that familiar with these Samsung drives, so maybe this is common for them. This drive also has 1 runtime bad block, which I am not raising as an issue, but is not great either. Not sure I would immediately replace this one, but I'd be watching it to see if the value gets larger.

 

197

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       60

This is the same attribute but on a Seagate drive. Same comment as above. I am familiar with Seagates, though, and nonzero values are not common on them. Not sure I would immediately replace this one, but I'd be watching it to see if the value gets larger.

 

90J

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   092   092   036    Pre-fail  Always       -       5352
195 Hardware_ECC_Recovered  0x001a   024   005   000    Old_age   Always       -       20599848

Way too many reallocated sectors. And the normalized Hardware ECC "Value" is down to 24, and has been as low as 5. These attributes should be close to 100. I would replace this drive without question.

 

G1B

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       1

A UDMA CRC error is indicative of a cabling problem, not a drive problem. You only have one, but it could point to a loose cable. The value never decreases, so there is no way to tell if this was a problem that was corrected already or an active issue. You said you replaced the SATA cables, so this may have been fixed.

 

551 / BSB / 5RV / 021 / 050 / JHN

Didn't report. Probably related to controller issue Johnnie referenced.
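
For anyone who wants to repeat the per-drive checks above on their own reports, here is a rough Python sketch. It parses attribute rows in the layout shown in this thread and flags the same things called out above. The sample rows are copied from this thread; the function names, the list of attributes to watch, and the cutoff of 50 for a "low" normalized value are my own assumptions for illustration, not an official rule.

```python
import re

# Rows copied from the SMART tables quoted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   092   092   036    Pre-fail  Always       -       5352
195 Hardware_ECC_Recovered  0x001a   024   005   000    Old_age   Always       -       20599848
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       322
"""

# Assumption: attributes whose raw count we would like to see at 0.
WATCH_RAW_NONZERO = {
    "Reallocated_Sector_Ct",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
}

# Assumption: arbitrary illustrative cutoff for "far below 100".
LOW_NORMALIZED = 50


def parse_attributes(text):
    """Turn attribute table rows into dicts; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        raw = re.match(r"\d+", parts[9])  # raw column may carry extra text
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": int(raw.group()) if raw else None,
        })
    return rows


def flag(rows):
    """Return human-readable notes for rows that look worth a second look."""
    notes = []
    for r in rows:
        if r["name"] in WATCH_RAW_NONZERO and r["raw"]:
            notes.append(f'{r["name"]}: raw value {r["raw"]} is above 0')
        if r["value"] < LOW_NORMALIZED:
            notes.append(
                f'{r["name"]}: normalized value {r["value"]} (worst {r["worst"]}) is low'
            )
    return notes


if __name__ == "__main__":
    for note in flag(parse_attributes(SAMPLE)):
        print(note)
```

This only sorts rows into "look at this" and "probably fine". Whether a flagged drive should be replaced is still a judgment call, as with the drives above.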

Link to comment
15 hours ago, coldhammer said:

Are SMART reports stored on the drives themselves or generated on the fly? I.e., if the loose cable was recorded before I replaced the cables, would that error always be shown in the future? The G1B drive is a cache drive.

 

When the condition occurs, the attribute is incremented. It is never reduced. The report just provides the list of the attributes and their current values.
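
Since the counter only ever goes up, the practical way to tell an old error from an active one is to note the raw value now and compare it with a later reading. A minimal sketch of that comparison in Python, where the baseline of 1 is the G1B value from this thread and the later reading is a hypothetical number for illustration:

```python
def crc_delta(before_raw, after_raw):
    """Number of new UDMA CRC errors logged between two readings of one drive."""
    if after_raw < before_raw:
        # The attribute never decreases, so this suggests mismatched readings.
        raise ValueError("raw count went down; are both readings from the same drive?")
    return after_raw - before_raw


def describe(before_raw, after_raw):
    """Summarize whether the cabling problem looks historical or still active."""
    delta = crc_delta(before_raw, after_raw)
    if delta == 0:
        return "no new CRC errors since the baseline; likely an old, fixed problem"
    return f"{delta} new CRC error(s) since the baseline; check the cable again"


if __name__ == "__main__":
    baseline = 1  # raw value reported for drive G1B in this thread
    later = 1     # hypothetical reading taken after some time with the new cables
    print(describe(baseline, later))
```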

Link to comment

Put in a new WD Red drive and precleared it with no issues. Pulled the old drive out, reassigned the slot to the new drive, and ran a data rebuild. Halfway through, the new drive got disabled with write errors again, so it's not a faulty drive issue. Nothing has changed on the machine; it wasn't even rebooted between the preclear and the data rebuild.

 

An additional note: prior to my first error occurrence I had formatted the original drive as XFS and copied over 3TB of data from another drive. The copy process went smooth as silk.

 

In every case over the last year I have only had disk issues when a parity check is trying to write to a drive.

 

new log attached.

tower-diagnostics-20170602-1140.zip

Link to comment
12 minutes ago, coldhammer said:

Halfway through, the new drive got disabled with write errors again, so it's not a faulty drive issue.

 

On 24/05/2017 at 11:37 PM, johnnie.black said:

You have both a SASLP and a SAS2LP. These are known to have those issues in some configs. Disable VT-d if you don't need it and look for a BIOS update, but your best chance of getting that resolved is replacing both with LSI controllers.

 

Problem was again caused by one of the SAS controllers, this time it was the SASLP

Link to comment

Parity disk dropped offline, check cables and reboot to get a SMART report.

 

There are also some MCE errors. These are caused by a hardware issue, possibly RAM: the parity check on June 3rd corrected a lot of sync errors, and the next check on June 5th again found and corrected a lot of errors before the disk dropped out. You should run memtest for a few hours.
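
To show why repeated sync errors with no writes in between point at hardware rather than at the data on the disks, here is a toy Python model of single XOR parity. It is a simplified sketch and not unRAID's actual code: the block contents, function names, and the simulated bit flip are all made up for illustration.

```python
from functools import reduce


def parity(blocks):
    """XOR the same byte position across all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_sync_errors(blocks_as_read, stored_parity):
    """Count positions where parity computed from the reads disagrees with stored parity."""
    computed = parity(blocks_as_read)
    return sum(1 for c, s in zip(computed, stored_parity) if c != s)


if __name__ == "__main__":
    # Hypothetical contents of three data disks.
    data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
    stored = parity(data)

    # Clean reads: the check finds nothing.
    print(count_sync_errors(data, stored))

    # Same disks, nothing written, but one byte gets a bit flipped on its way
    # through RAM or a controller: the check now reports an error.
    flaky = [data[0], bytes([5, 6, 7 ^ 0x10, 8]), data[2]]
    print(count_sync_errors(flaky, stored))
```

In this model the stored data never changes between the two checks; only what the system sees when reading it does. That is the pattern a memtest run is meant to confirm or rule out.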

Link to comment

I don't have any virtual machines set up.

 

Yes, it dropped off in the middle of the SMART report. Not sure about it being a bad disk, though. It's a pretty new drive, and I ran multiple preclears on it with no issues.

 

At this point I have no parity disk assigned to the array, which means parity needs to be rebuilt. I have a complete new board/CPU setup for this machine. I've been waiting until I got everything up and running correctly with the array before upgrading the parts. I seem to have multiple hardware failures going on with the current setup. Since I no longer have any parity, would I be alright to just go ahead and put in the new hardware, then create a new config and let the parity rebuild? This approach would hopefully at least remove a lot of the hardware-related issues.

Link to comment
Just now, coldhammer said:

Since I no longer have any parity, would I be alright to just go ahead and put in the new hardware, then create a new config and let the parity rebuild?

 

Yes, and you don't need to do a new config if all the data disks will remain the same. Just unassign the parity disk, start the array, stop the array, re-assign parity, and start the array to begin a parity sync.

 

Still, that disk failing twice and one of the times alone during the SMART test doesn't bode well, but try again in the new build.

Link to comment

Rebuilt the disk and ran a parity check: 0 errors. Immediately ran a second parity check (no new activity on the array was conducted before, during, or in between the parity checks). The second parity check turned up 5572 errors.

 

If the first check had 0 errors and there was no new disk activity, how could the second check have 5572 errors?

Link to comment

There are MCE errors in the log. These are hardware errors:
 

Jun  7 04:51:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
...
Jun  7 08:52:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
...
Jun  7 08:59:10 Tower kernel: mce: [Hardware Error]: Machine check events logged

 

Install the Nerd Pack and use MCELOG to see if there's more info, but running memtest for 24 hours would be a good start.
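
If you want to check your own syslog for the same entries, a small Python sketch like the following would do it. The sample lines are the three quoted above; the function name and the idea of printing just the timestamps are my own additions, and with a real log you would read the text from the syslog file in your diagnostics zip instead of the sample string.

```python
import re

# Matches the kernel's machine-check marker as it appears in the lines above.
MCE_PATTERN = re.compile(r"mce: \[Hardware Error\]")

# The three log lines quoted in this thread.
SAMPLE_LOG = """\
Jun  7 04:51:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun  7 08:52:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun  7 08:59:10 Tower kernel: mce: [Hardware Error]: Machine check events logged
"""


def find_mce_lines(log_text):
    """Return every log line that contains a machine-check marker."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]


if __name__ == "__main__":
    hits = find_mce_lines(SAMPLE_LOG)
    print(f"{len(hits)} MCE line(s) found")
    for line in hits:
        # Classic syslog timestamps occupy the first 15 characters.
        print(line[:15])
```

This only tells you when the events were logged. The details of what the hardware reported still have to come from mcelog.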

Link to comment
