ashman70 Posted February 26, 2017
Came into my office this morning to see all the lights blinking on my Supermicro server and I thought, 'here we go again, it must have rebooted'. It hadn't, but it was running a parity check. Further investigation shows that, for some reason, multiple disk errors were reported, which caused the server to start a parity check. Right now I've got 8 disks showing exactly 128 errors each, and the parity check is almost at 28%. No drives are offline, no red x's. Very odd. Interestingly enough, the parity check has found and corrected 128 sync errors so far. I've attached diagnostics: tower-diagnostics-20170226-1229.zip
ashman70 (Author) Posted February 26, 2017
Actually, it turns out this is a regularly scheduled parity check. However, it's still very odd that multiple disks are reporting errors, and that they are all reporting the same number of errors as well.
ashman70 (Author) Posted February 26, 2017
Anyone?
trurl Posted February 26, 2017
Lots of lines similar to this in syslog:
Feb 26 00:03:44 Tower kernel: md: disk12 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk13 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk14 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk17 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk18 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk20 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk21 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk22 read error, sector=1000
Are these on the same controller?
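
The pattern in the syslog excerpt above (several disks reporting a read error in the same second, at the same sector) can be checked mechanically. Here is a minimal Python sketch, assuming syslog lines in exactly the format quoted in the post. The function name `group_read_errors` and the `SAMPLE` constant are illustrative only and are not part of unRAID or any tool mentioned in the thread.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt quoted in the post above.
SAMPLE = """\
Feb 26 00:03:44 Tower kernel: md: disk12 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk13 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk14 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk17 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk18 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk20 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk21 read error, sector=1000
Feb 26 00:03:44 Tower kernel: md: disk22 read error, sector=1000
"""

# Matches "<Mon> <day> <hh:mm:ss> <host> kernel: md: diskN read error, sector=S".
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: md: (disk\d+) read error, sector=(\d+)"
)

def group_read_errors(text):
    """Group md read errors by (timestamp, sector) and collect the affected disks."""
    groups = defaultdict(set)
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            stamp, disk, sector = match.groups()
            groups[(stamp, int(sector))].add(disk)
    return dict(groups)

groups = group_read_errors(SAMPLE)
for (stamp, sector), disks in sorted(groups.items()):
    print(f"{stamp} sector={sector}: {len(disks)} disks -> {sorted(disks)}")
```

Many disks sharing one timestamp and one sector is more suggestive of a shared component (controller, cable, backplane, power) than of eight independent media failures, which is why the follow-up question about the controller matters.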
ashman70 (Author) Posted February 26, 2017
All my disks are on the same Perc H310 controller, and I believe these are even on different backplanes. I don't believe they are real errors, though. Eight disks reporting exactly 128 errors each, and then the parity check also reporting exactly 128 sync errors so far? Sounds weird to me.
ashman70 (Author) Posted February 26, 2017
I am a little confused by 'raw read error rate'. Should the value for every disk be 0, and if it's not, how does the value reflect on the overall health of the disk?
Squid Posted February 27, 2017
3 minutes ago, ashman70 said: "I am a little confused by 'raw read error rate' should the value for every disk be 0 and if its not how does the value reflect on the overall health of the disk?"
Every disk sector contains a ton of ECC information so that, in case of a misread bit or two (which happens all the time), the sector can be properly reconstructed. It's when this process fails that you get the reported uncorrectable sectors, reallocated sectors, etc. Seagate is one of the rare companies that actually reports the value. You only need to worry about it when the value begins to approach the threshold.
ashman70 (Author) Posted February 27, 2017
I just read an article that basically said the raw read error rate number is nonsense and should be disregarded. It said you only need to worry about the reallocated sector count, pending sectors, and offline uncorrectable sectors.
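
The rule of thumb from that article can be written down as a small check. This is a rough Python sketch, not a definitive health test: the attribute names follow the labels that `smartctl -A` commonly prints, while the disk names (`diskA`, `diskB`), the raw counts, and the function name `disks_needing_attention` are all hypothetical and do not come from the diagnostics attached to this thread.

```python
# SMART attributes the article (as summarized above) says are worth watching.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def disks_needing_attention(smart_raw):
    """Return {disk: {attribute: raw_count}} for watched attributes with a non-zero raw count.

    Raw_Read_Error_Rate is deliberately ignored, following the advice in the post.
    """
    flagged = {}
    for disk, attrs in smart_raw.items():
        nonzero = {
            name: attrs.get(name, 0)
            for name in WATCHED
            if attrs.get(name, 0) > 0
        }
        if nonzero:
            flagged[disk] = nonzero
    return flagged

# Hypothetical raw values, made up purely for illustration.
example = {
    "diskA": {
        "Raw_Read_Error_Rate": 113402216,  # large raw value, ignored by this check
        "Reallocated_Sector_Ct": 0,
        "Current_Pending_Sector": 0,
        "Offline_Uncorrectable": 0,
    },
    "diskB": {
        "Raw_Read_Error_Rate": 0,
        "Reallocated_Sector_Ct": 8,
        "Current_Pending_Sector": 0,
        "Offline_Uncorrectable": 0,
    },
}

flagged = disks_needing_attention(example)
print(flagged)
```

In this made-up example only `diskB` is flagged, even though `diskA` has a far larger raw read error rate.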
ashman70 (Author) Posted February 27, 2017 (edited)
Still looking for some advice on what to do about this. The parity check is taking way longer than normal for some reason. It's reading at anywhere from 7 MB/s to 4 MB/s, and at this rate it says it won't finish for four days or more. Something is not right.
ashman70 (Author) Posted February 27, 2017
Parity check speed is back to normal. No change in the number of errors it has 'allegedly' found so far: still 128.
JorgeB Posted February 27, 2017
Looks like there was a timeout error with the controller. Hopefully it was a one-time thing. Just in case, check that the controller is well seated and run another parity check once this one finishes.
ashman70 (Author) Posted February 27, 2017 (edited)
What about the errors being reported on the disks? Will they just go away after a reboot? Ughh, parity checks take like a day and a half. Are these reported errors real? I don't think they are.
JorgeB Posted February 27, 2017
8 minutes ago, ashman70 said: "Will they just go away after a reboot?"
Yes
ashman70 (Author) Posted February 27, 2017
Where do you see in the log that it was a controller timeout, or is that just what it looks like? How does that even happen?
JorgeB Posted February 27, 2017
Feb 26 00:03:05 Tower kernel: sd 1:0:21:0: timing out command, waited 180s
...
Feb 26 00:03:12 Tower kernel: sd 1:0:22:0: timing out command, waited 180s
...
It's the same on all disks, so it makes sense that it was the controller that timed out.
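
The reasoning above (the same timeout on every disk points at the shared controller) can be sketched in Python. The kernel prints SCSI device addresses as host:channel:target:lun, so grouping timeouts by the host number shows how many devices behind one host adapter were affected. This assumes the log format quoted in the post and that one host number corresponds to one controller. The function name `timeouts_by_host` is illustrative, and the sample contains only the two lines actually quoted.

```python
import re
from collections import defaultdict

# The two timeout lines quoted in the post above (the elided lines are not reproduced).
SAMPLE = """\
Feb 26 00:03:05 Tower kernel: sd 1:0:21:0: timing out command, waited 180s
Feb 26 00:03:12 Tower kernel: sd 1:0:22:0: timing out command, waited 180s
"""

# SCSI address is host:channel:target:lun, followed by the timeout message.
TIMEOUT = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+): timing out command, waited (\d+)s"
)

def timeouts_by_host(text):
    """Map each SCSI host number to the sorted list of targets that timed out."""
    hosts = defaultdict(set)
    for line in text.splitlines():
        match = TIMEOUT.search(line)
        if match:
            host, _channel, target, _lun, _waited = map(int, match.groups())
            hosts[host].add(target)
    return {host: sorted(targets) for host, targets in hosts.items()}

result = timeouts_by_host(SAMPLE)
print(result)  # {1: [21, 22]}
```

If every timed-out target sits under a single host number, as here, the shared adapter (or something it depends on) is a more likely culprit than the individual drives.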
ashman70 (Author) Posted February 27, 2017
So would you recommend letting this parity check finish, even though the errors it's reporting are probably false, and then running another one after a reboot?
JorgeB Posted February 27, 2017
If it's close to finishing, yes. If not, reboot and start another one.