Recurring Parity Errors



I often get exactly 5 parity errors during my parity checks and it is past time to diagnose the problem. I have had my share of unclean power shutdowns that I have mitigated but I want to be prepared for the next time it happens.

 

Are there any steps I should take before executing my next parity check that will help me narrow down what could be causing the issue? Obviously, I will include the diagnostic zip file but are there any specific logging settings that should be enabled to catch my specific problem?

 

Note: in my system log, there are recurring power failure warnings. I have a portable room a/c that triggers my UPS to kick on for 2 seconds before returning to normal operation. I am guessing there is a drop in voltage but the lights in my room don't dim and all other appliances operate normally. Hopefully, this isn't a problem and the UPS is just doing its job protecting my NAS.

Screen Shot 2017-08-02 at 9.07.05 AM.png

syslog.txt

Link to comment
4 hours ago, TheRefugee said:

I often get exactly 5 parity errors during my parity checks and it is past time to diagnose the problem. I have had my share of unclean power shutdowns that I have mitigated but I want to be prepared for the next time it happens.

 

Can you post your diagnostics? If I'm right about why the 5 errors occur, the diagnostics will confirm it (or force me to rethink my theory).

 

And, can you confirm or deny this:

 

  • Between March 3 and April 10th, you reset the server (or any type of shutdown)
  • Between April 10th and May 5th, you did not reset the server
  • Between May 5th and May 22nd you reset the server
  • You reset the Server sometime on May 22/23 after the parity check occurred on the 22nd, but before the parity check on the 23rd
  • After May 23rd, you reset the server before June 25th.
  • On the 25th you reset the server and ran another parity check on the 26th.
  • And sometime between June 26th and Aug 2, the server was reset.

 

Link to comment

My server isn't up all the time, so it was almost certainly rebooted between those dates. It's rare for my server to be up for more than a month at a time. The longest I have gone is 45 days, IIRC.

 

Edited by TheRefugee
Link to comment

If at all possible, move all the 8TB drives (and the Samsung 850) off the SAS2LP and onto the motherboard ports. If you only have enough ports for the 8TBs, then keep them off the SAS2LP and put the Samsung 850 on the SAS2LP.

 

Then reboot, and try another parity check.  If my theory holds true, you shouldn't have 5 parity check errors.

 

If you do, post another set of diagnostics, as I need to compare the affected sectors. In the last check they were:

Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151176
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151184
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151192
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151200
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151208
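
For anyone who wants to do the same comparison themselves, here is a minimal Python sketch that pulls the corrected sector numbers out of a syslog and checks whether the same sectors show up in two different parity checks. It assumes the log lines look exactly like the five quoted above; the function names (`corrected_sectors`, `compare_checks`) are made up for illustration, and other message variants are not handled.

```python
import re

# Sample taken from the syslog lines quoted in this post.
SAMPLE = """\
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151176
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151184
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151192
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151200
Aug  1 10:47:02 Tower kernel: md: recovery thread: PQ corrected, sector=2743151208
"""

# Assumption: only the "PQ corrected" wording seen above is matched.
SECTOR_RE = re.compile(r"md: recovery thread: PQ corrected, sector=(\d+)")


def corrected_sectors(log_text):
    """Return the sorted, de-duplicated sector numbers reported as corrected."""
    return sorted({int(m.group(1)) for m in SECTOR_RE.finditer(log_text)})


def compare_checks(previous_log, current_log):
    """Split sectors into those seen in both checks, only the new one, or only the old one."""
    prev = set(corrected_sectors(previous_log))
    curr = set(corrected_sectors(current_log))
    return {
        "repeated": sorted(prev & curr),
        "new": sorted(curr - prev),
        "gone": sorted(prev - curr),
    }


if __name__ == "__main__":
    sectors = corrected_sectors(SAMPLE)
    # Gaps between neighbouring sectors; a constant gap means one contiguous run.
    gaps = [b - a for a, b in zip(sectors, sectors[1:])]
    print(sectors)
    print(gaps)
    print(compare_checks(SAMPLE, SAMPLE))
```

To use it, read the syslog from the previous diagnostics and the one from the new check into two strings and pass them to `compare_checks`. Identical sectors in the "repeated" list would suggest the same spot is being hit each time, while a different set each time would point somewhere else.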
Edited by Squid
Link to comment

Currently, 6 data drives (4 x 4TB and 2 x 8TB) and 2 parity drives (2 x 8TB) are on the SAS2LP with the Samsung SSD being the only drive connected to my motherboard.

 

I have 6 total motherboard ports I can connect drives to so moving the 2 parity 8TB drives, 2 data 8TB drives and the SSD to the motherboard will work. Is this the preferred option?

 

Do I need to write down which cables are connected to which drives beforehand? Does switching which drives are connected to which port interfere with my array being recognized or is it irrelevant?

Link to comment
3 minutes ago, TheRefugee said:

I have 6 total motherboard ports I can connect drives to so moving the 2 parity 8TB drives, 2 data 8TB drives and the SSD to the motherboard will work. Is this the preferred option?

That would be my choice.

 

3 minutes ago, TheRefugee said:

Does switching which drives are connected to which port interfere with my array being recognized or is it irrelevant?

All irrelevant.  unRaid keeps track of drive assignments by serial number, so you shouldn't notice any change at all.  But, it is always prudent to make a note of the original cabling just in case.

 

Link to comment
On 8/2/2017 at 2:59 PM, Squid said:

That would be my choice.

 

All irrelevant.  unRaid keeps track of drive assignments by serial number, so you shouldn't notice any change at all.  But, it is always prudent to make a note of the original cabling just in case.

 

 

I will be able to shut down and move my 8TB disks onto the motherboard tonight and I will start the parity check tonight as well.

 

To clarify the order of operations:

 

1. Shut down

2. Move 8TB disks and SSD onto motherboard, leaving 4TB disks on the SAS2LP

3. Power On, start array

4. Parity check

 

Does that look correct? No other restart is necessary after getting the array to start up after moving the disks around? Just wanted to clarify because I wasn't sure if you were assuming I can hot swap and then do a restart.

 

edit: My server has been up since the last parity check. Powering down to switch the disks will be the first time the server has been offline since the last 5 error parity check.

Edited by TheRefugee
Link to comment
