coldhammer Posted May 24, 2017
I've had this issue with disks getting disabled during parity checks for a while now. To date I've replaced the power supply, controller card and SATA cables. Every time, without fail, I'll create a new config, run a parity check, and it will say everything is good. Then during the next 1-2 parity checks one of the drives will get redballed. I really don't know what else to do at this point. Syslog attached. tower-diagnostics-20170524-1102.zip
JorgeB Posted May 24, 2017
You have both a SASLP and a SAS2LP. These are known to have those issues in some configs. Disable VT-d if you don't need it and look for a BIOS update, but your best chance of getting this resolved is replacing both with LSI controllers.
SSD Posted May 24, 2017
26 minutes ago, coldhammer said:
I've had this issue with disks getting disabled during parity checks for a while now. To date I've replaced the power supply, controller card and SATA cables. Every time, without fail, I'll create a new config, run a parity check, and it will say everything is good. Then during the next 1-2 parity checks one of the drives will get redballed. I really don't know what else to do at this point. Syslog attached. tower-diagnostics-20170524-1102.zip

You do have some drive issues (drives are identified by the last three characters of their serial number).

550
ID#  ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
200  Multi_Zone_Error_Rate   0x000a  099   099   000    Old_age   Always   -            322
Although it is not one of the more familiar attributes to monitor, I've seen some of the flakiest drive issues with Multi_Zone_Error_Rate above 0. But I am not that familiar with these Samsung drives, so maybe this is common for them. This drive also has 1 runtime bad block, which I am not raising as an issue, but it is not great either. Not sure I would immediately replace this one, but I'd be watching it to see if the value gets larger.

197
ID#  ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
200  Multi_Zone_Error_Rate   0x000a  099   099   000    Old_age   Always   -            60
This is the same attribute but on a Seagate drive. Same comment as above, except that I am familiar with Seagates, and on those a non-zero value is not common. Not sure I would immediately replace this one, but I'd be watching it to see if the value gets larger.

90J
ID#  ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
  5  Reallocated_Sector_Ct   0x0033  092   092   036    Pre-fail  Always   -            5352
195  Hardware_ECC_Recovered  0x001a  024   005   000    Old_age   Always   -            20599848
Way too many reallocated sectors. And the normalized Hardware ECC "Value" is down to 24, and has been as low as 5. These attributes should be close to 100. I would replace this drive without question.

G1B
ID#  ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
199  UDMA_CRC_Error_Count    0x003e  200   199   000    Old_age   Always   -            1
A UDMA CRC error is indicative of a cabling problem - not a drive problem. You only have one, but it could point to a loose cable. The value never decreases, so there is no way to tell whether this was a problem that has already been corrected or an active issue. You said you replaced the SATA cables, so this may have been fixed.

551 / BSB / 5RV / 021 / 050 / JHN
Didn't report. Probably related to the controller issue Johnnie referenced.
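The manual review above can be roughly automated. The following is a minimal sketch, assuming the attribute table has been saved as plain text in the column layout shown in the post; the function names and the set of "watch if non-zero" attributes are illustrative choices based on the poster's rules of thumb, not vendor thresholds or anything Unraid ships.

```python
import re

# Attributes the post treats as suspicious when the raw value is above 0.
# This list is an assumption drawn from the post, not an official standard.
WATCH_IF_NONZERO = {
    "Multi_Zone_Error_Rate",
    "UDMA_CRC_Error_Count",
    "Reallocated_Sector_Ct",
}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only the leading integer of the raw field is read).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(report):
    """Return one dict per attribute row found in the report text."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            for key in ("id", "value", "worst", "thresh", "raw"):
                row[key] = int(row[key])
            rows.append(row)
    return rows


def flag_attributes(rows):
    """Return (attribute name, reason) pairs for rows worth a second look."""
    findings = []
    for row in rows:
        if row["name"] in WATCH_IF_NONZERO and row["raw"] > 0:
            findings.append((row["name"], f"raw value {row['raw']} is above 0"))
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            findings.append((row["name"], "normalized value at or below threshold"))
    return findings


# Two rows copied from the post above (drives 90J and G1B).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   092   092   036    Pre-fail  Always       -       5352
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       1
"""

for name, reason in flag_attributes(parse_attributes(SAMPLE)):
    print(f"{name}: {reason}")
```

A flagged attribute is only a prompt to look closer. As the post notes, a single CRC error may be left over from a cable that has since been replaced.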
coldhammer Posted May 25, 2017
Are SMART reports stored on the drives themselves or generated on the fly? I.e., if the loose cable was recorded before I replaced the cables, would that error always be shown in the future? The G1B drive is a cache drive.
SSD Posted May 26, 2017
15 hours ago, coldhammer said:
Are SMART reports stored on the drives themselves or generated on the fly? I.e., if the loose cable was recorded before I replaced the cables, would that error always be shown in the future? The G1B drive is a cache drive.

When the condition occurs, the attribute is incremented. It is never reduced. The report just provides the list of the attributes and their current values.
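Because a counter like this only ever goes up, a single report cannot distinguish an old problem from an ongoing one, but two readings taken at different times can. A minimal sketch of that comparison follows; `counter_status` is a hypothetical helper, and the second reading of 4 is an invented value for illustration (the 1 is the UDMA_CRC_Error_Count from the G1B drive).

```python
def counter_status(previous_raw, current_raw):
    """Classify a cumulative SMART counter from two readings of the same drive."""
    if current_raw < previous_raw:
        # A cumulative counter should never go down, so the inputs are suspect.
        raise ValueError("counter decreased; are both readings from the same drive?")
    if current_raw > previous_raw:
        return "active"  # new errors were logged between the two readings
    # Unchanged: either nothing ever happened or the errors are old.
    return "historical" if current_raw > 0 else "clean"


print(counter_status(1, 1))  # same count before and after the cable swap
print(counter_status(1, 4))  # count grew after the cable swap
```

If the count is unchanged after a few parity checks on the new cables, the recorded error most likely predates the swap.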
coldhammer Posted June 2, 2017
Put in a new WD Red drive, precleared with no issues. Pulled the old drive out, reassigned the slot to the new drive and ran a data rebuild. Halfway through, the new drive got disabled with write errors again. So it's not a faulty drive issue. And nothing has changed on the machine; it wasn't even rebooted between the preclear and the data rebuild. An additional note is that prior to my first error occurrence I had formatted the original drive as XFS and copied over 3TB of data from another drive. The copy process went smooth as silk. In every case over the last year I have only had disk issues when a parity check is trying to write to a drive. New log attached. tower-diagnostics-20170602-1140.zip
Edited June 2, 2017 by coldhammer
JorgeB Posted June 2, 2017
12 minutes ago, coldhammer said:
Halfway through, the new drive got disabled with write errors again. So it's not a faulty drive issue.

On 24/05/2017 at 11:37 PM, johnnie.black said:
You have both a SASLP and a SAS2LP. These are known to have those issues in some configs. Disable VT-d if you don't need it and look for a BIOS update, but your best chance of getting this resolved is replacing both with LSI controllers.

The problem was again caused by one of the SAS controllers; this time it was the SASLP.
JorgeB Posted June 2, 2017
I was checking your original diags since I didn't remember which controller was the problem then, and it was the SASLP also. You have 6 disks using it and your 6 onboard SATA ports are unused. If you have the cables, move the 6 disks there and try another rebuild.
coldhammer Posted June 5, 2017
Moved all the connections from the SASLP controller to the mobo. Ran parity twice in a row, no issues. Ran it a third time and now the parity disk is disabled. New log attached. tower-diagnostics-20170605-1103.zip
JorgeB Posted June 5, 2017
The parity disk dropped offline; check cables and reboot to get a SMART report. There are also some MCE errors. These are caused by a hardware issue, possibly RAM. Since the parity check on June 3rd corrected a lot of sync errors, and the next check on June 5th again found and corrected a lot of errors before the disk dropped out, you should run memtest for a few hours.
coldhammer Posted June 6, 2017
Swapped in a new cable to the parity drive. Rebooted; parity was still disabled. Started an extended SMART test and after a few minutes the parity drive just completely disappeared. It now shows no device. Log attached. tower-diagnostics-20170605-2115.zip
JorgeB Posted June 6, 2017
Probably a bad disk; I still can't see the SMART report since it dropped offline again.
coldhammer Posted June 6, 2017
Don't have any virtual machines set up. Yes, it dropped off in the middle of the SMART report. Not sure about it being a bad disk though. It's a pretty new drive, and I ran multiple preclears on it with no issues. At this point I have no parity disk assigned to the array, which means parity needs to be rebuilt. I have a complete new board/CPU setup for this machine. I've been waiting until I got everything up and running correctly with the array before upgrading the parts. I seem to have multiple hardware failures going on with the current setup. Since I no longer have any parity, would I be all right to just go ahead and put in the new hardware, then create a new config and let the parity rebuild? This approach would hopefully at least remove a lot of the hardware-related issues.
JorgeB Posted June 6, 2017
Just now, coldhammer said:
Since I no longer have any parity, would I be all right to just go ahead and put in the new hardware, then create a new config and let the parity rebuild?

Yes, and you don't need to do a new config if all the data disks will remain the same. Just unassign the parity disk, start the array, stop the array, re-assign parity and start the array to begin a parity sync. Still, that disk failing twice, one of those times on its own during the SMART test, doesn't bode well, but try again in the new build.
coldhammer Posted June 7, 2017
Swapped out another cable and was able to complete an extended SMART test on the parity drive. It says the test completed with no errors. New log attached. tower-diagnostics-20170607-0855.zip
JorgeB Posted June 7, 2017
It can still be a flaky disk, but try syncing parity again.
coldhammer Posted June 9, 2017
Rebuilt the disk, ran parity: 0 errors. Immediately ran a second parity check (there was no new activity on the array before, during or in between the parity checks). The second parity check turned up 5572 errors. If the first check had 0 errors and there was no new disk activity, how could the second check have 5572 errors?
JorgeB Posted June 9, 2017
It means there's a hardware problem somewhere.
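One way to see why that answer points at hardware: a parity check is a deterministic computation over the bytes read back from the disks, so with no writes in between, two checks can only disagree if some read returned different bytes. The sketch below assumes single parity computed as a bytewise XOR across the data disks, which is a simplified model and not something stated in the thread; the disk contents and the flipped bit are made up for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Bytewise XOR across equal-length blocks, one block per data disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_mismatches(blocks, stored_parity):
    """Count byte positions where stored parity differs from recomputed parity."""
    recomputed = xor_parity(blocks)
    return sum(1 for new, old in zip(recomputed, stored_parity) if new != old)


# Three tiny hypothetical data disks and the parity built from them.
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = xor_parity(disks)

# First check: every read returns what was written, so nothing mismatches.
first_check = count_mismatches(disks, parity)

# Second check: nothing was written, but one bit arrives flipped from the
# second disk (standing in for a bad controller, cable, RAM or board).
bad_read = [disks[0], bytes([5, 6, 7 ^ 0x10, 8]), disks[2]]
second_check = count_mismatches(bad_read, parity)

print(first_check, second_check)
```

In this model the data on disk never changed between the two checks. Only the bytes delivered to the CPU did, which is why the error count alone cannot say which component is at fault.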
coldhammer Posted June 10, 2017
Hardware as in a bad parity disk, or hardware as in SAS controllers, mobo, memory or other hard drives? Is it possible to figure out where the hardware issue lies?
JorgeB Posted June 10, 2017
It can be difficult to say with the diagnostics, impossible without them. If you haven't rebooted since these last errors, grab and post the diags.
coldhammer Posted June 12, 2017
Attached. tower-diagnostics-20170612-0805.zip
JorgeB Posted June 12, 2017
There are MCE errors in the log; these are hardware errors:

Jun 7 04:51:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
...
Jun 7 08:52:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
...
Jun 7 08:59:10 Tower kernel: mce: [Hardware Error]: Machine check events logged

Install the Nerd Pack and use mcelog to see if there's more info, but running memtest for 24 hours would be a good start.
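Lines like these are easy to miss in a long syslog. As a rough sketch, the snippet below pulls out every kernel MCE line and counts them per day; it assumes the classic syslog timestamp format shown in the excerpt, and the function names are made up for this example. It only locates the events and does not decode them, which is what mcelog is for.

```python
import re
from collections import Counter

# Matches e.g. "Jun 7 04:51:13 Tower kernel: mce: [Hardware Error]: ..."
MCE_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel: mce: \[Hardware Error\]"
)


def mce_events(log_text):
    """Return the timestamp string of every kernel MCE line in the log text."""
    stamps = []
    for line in log_text.splitlines():
        match = MCE_LINE.match(line)
        if match:
            stamps.append(match.group("stamp"))
    return stamps


def events_per_day(stamps):
    """Count events by the month and day part of each timestamp."""
    return Counter(" ".join(stamp.split()[:2]) for stamp in stamps)


# The three lines quoted in the post above.
SAMPLE = """\
Jun 7 04:51:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun 7 08:52:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun 7 08:59:10 Tower kernel: mce: [Hardware Error]: Machine check events logged
"""

stamps = mce_events(SAMPLE)
print(len(stamps), dict(events_per_day(stamps)))
```

Comparing the timestamps with the start and end times of parity checks can hint at whether the machine checks show up under load.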
coldhammer Posted June 14, 2017
Ran a memtest. 20 passes, no errors. What should I look at next?
Edited June 14, 2017 by coldhammer
JorgeB Posted June 14, 2017
The board would be my next suspect.
coldhammer Posted June 16, 2017
Hah. That brings the testing to a screeching halt. Not much I can do about the board. I'll just rebuild with the new setup and then try to sync parity again.