Sync errors on parity check [SOLVED]



So, after resolving my previous issue and getting my server back to running smoothly, my monthly parity check started and flagged 347 (I think it was) sync errors corrected. Most unusual, as I never get sync errors. So I saved the diagnostics and ran the parity check again: this time, 166 sync errors corrected. I saved the diagnostics and ran it again: 0 sync errors. Even with a few years of experience running my unRAID server, I am by no means an expert, so I would love some assistance (again). I think I can pin the issue down to an old HDD.

 

Some background: I upgraded my server with a new Supermicro motherboard to house my Xeon CPU. The case my server lives in is saturated with 16 HDDs, so at the same time as upgrading the motherboard I repurposed an old case and motherboard and added several additional HDDs in the second case. They are powered from the PSU in that case but connected to one of the expansion cards in my server (Supermicro AOC-SAS2LP-MV8). This is an interim step due to current space and money restrictions, but the long-term plan (don't tell the wife!) is to get a rack-mounted case. As I have several old HDDs that are not used anymore, I want to expand the array and use them for a dedicated share for TV programmes that I'm not really going to view anytime soon but still want to have access to. So far I have used three old HDDs: after preclearing them as unassigned disks, I added them one by one to the array and copied some data onto them. It was after this process that the issues arose, and I believe they are to do with some errors on Disk 18.

 

I have attached the three diagnostics files in order...

tower-diagnostics-20170514-1505.zip

tower-diagnostics-20170515-1628.zip

tower-diagnostics-20170516-1856.zip

Edited by aspdend
Link to comment

All logs are full of this:

 

May 12 15:23:07 Tower kernel: XFS (md2): First 64 bytes of corrupted metadata buffer:
May 12 15:23:07 Tower kernel: ffff8803e12fc000: 47 22 7e fd df be 50 b2 76 26 c8 c3 9e db e1 00  G"~...P.v&......
May 12 15:23:07 Tower kernel: ffff8803e12fc010: 2e 4d 27 91 2c 32 f4 ef a4 09 4a 71 07 5c 65 b9  .M'.,2....Jq.\e.
May 12 15:23:07 Tower kernel: ffff8803e12fc020: bc 3a 69 72 5a f5 40 dc e4 6a 3e 78 0d 5a 5c 08  .:irZ.@..j>x.Z\.
May 12 15:23:07 Tower kernel: ffff8803e12fc030: 08 74 f8 c8 cd 5b 2d 1d 00 94 8b 5a a4 39 6f dc  .t...[-....Z.9o.
May 12 15:23:07 Tower kernel: XFS (md2): Metadata CRC error detected at xfs_dir3_block_read_verify+0xa1/0xa9, xfs_dir3_block block 0x5f7591d0
May 12 15:23:07 Tower kernel: XFS (md2): Unmount and run xfs_repair

You need to check filesystem on disk2 (md2).
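
For anyone wanting to do the same triage on their own syslog, here is a minimal sketch in Python. It counts the XFS metadata CRC errors per md device and sets aside every non-XFS line so the rest of the log becomes readable. The function names are made up for illustration, and the mdN to diskN mapping simply follows the "disk2 (md2)" naming used above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... Tower kernel: XFS (md2): Metadata CRC error detected at ..."
XFS_LINE = re.compile(r"kernel: XFS \((?P<dev>md\d+)\): (?P<msg>.*)")


def summarize_xfs_errors(lines):
    """Return (CRC error count per md device, list of all non-XFS lines)."""
    crc_errors = Counter()
    other_lines = []
    for line in lines:
        match = XFS_LINE.search(line)
        if match is None:
            other_lines.append(line)  # keep everything that is not XFS spam
        elif match.group("msg").startswith("Metadata CRC error"):
            crc_errors[match.group("dev")] += 1
    return crc_errors, other_lines


def md_to_disk(dev):
    """Map an md device name to its array slot, e.g. 'md2' -> 'disk2'."""
    return "disk" + dev[2:]


# Two lines taken from the log excerpt above.
sample = [
    "May 12 15:23:07 Tower kernel: XFS (md2): Metadata CRC error detected at "
    "xfs_dir3_block_read_verify+0xa1/0xa9, xfs_dir3_block block 0x5f7591d0",
    "May 12 15:23:07 Tower kernel: XFS (md2): Unmount and run xfs_repair",
]

errors, rest = summarize_xfs_errors(sample)
for dev, count in errors.items():
    print(md_to_disk(dev), count)
```

Feeding it the whole syslog from a diagnostics zip (one line per list entry) should show which disks need a filesystem check, and `rest` holds whatever else was buried under the XFS messages.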

 

Also, although it is difficult to see anything else since the logs are spammed with the previous error, there are ATA errors on disk18. It has a lot of UDMA_CRC errors, so the first step would be to replace that SATA cable.
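
A rough way to confirm the cable swap helped is to compare SMART attribute 199 before and after. The sketch below parses the attribute table printed by `smartctl -A`. The sample line and its raw value are invented for illustration and are not taken from the attached diagnostics; the helper names are hypothetical too.

```python
def udma_crc_count(smartctl_output):
    """Return the raw value of SMART attribute 199 (UDMA CRC errors), or None.

    Expects the text of the attribute table from `smartctl -A`, where each
    row has ten columns and the raw value is the last one.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199" and "CRC" in fields[1]:
            return int(fields[9])
    return None


def crc_errors_grew(before, after):
    """The counter is cumulative, so only an increase points to a live problem."""
    return after > before


# Hypothetical sample row; the value 1523 is made up.
sample = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1523\n"
)

print(udma_crc_count(sample))
```

Since the count does not reset when the cable is replaced, a high number on its own is only history. What matters is whether it keeps climbing during the next parity check.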

 

After fixing those, post new diagnostics if more sync errors are found on the next check.

Link to comment
That's a pain - Disk 2 was the one I was having an issue with on my last post - which was resolved (I thought) by switching the SATA connection to a different set of cables on the expander card. Disk 18 is very old so I was expecting that to have an issue if anything was. I may have to get some new forward breakout cables then...and whilst I wait I will run a filesystem check on Disk 2...

Link to comment
OK, I have checked and repaired the filesystem on Disk 2 and also replaced the forward breakout cables that were serving Disk 18. Everything looks OK to me. Logs are attached below. I will start a parity check again now and see what happens...

 

 

tower-diagnostics-20170518-1844.zip

Link to comment
Ok, 

 

The parity check has been running for 15 hours and it's at 5.5% with thousands of errors found! Normally it takes 21 hours with no errors! Any help greatly appreciated! Will the logs I posted above this morning, before I started the check, be of any use?
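
To put those numbers in perspective, here is a back-of-the-envelope projection. It assumes the check keeps running at the same average rate, which is only a rough approximation.

```python
def projected_total_hours(elapsed_hours, percent_done):
    """Linear projection of the total run time from progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done


total = projected_total_hours(15, 5.5)  # 15 hours elapsed at 5.5%
slowdown = total / 21                   # compared with the usual 21-hour check

print(f"projected total: {total:.0f} hours ({total / 24:.1f} days)")
print(f"slowdown versus normal: {slowdown:.0f}x")
```

That works out to roughly 273 hours, or more than 11 days, which is about 13 times slower than the normal check.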

Link to comment
42 minutes ago, johnnie.black said:

 

Correct. Are you sure you ran xfs_repair without the -n flag? With -n it only checks and won't fix anything.

I think so, though I've never used it before. I first used the webGUI with -nv and then followed the instructions in the wiki: I telnetted in and ran xfs_repair -v /dev/md2.
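
To make the difference between the two runs explicit, here is a small sketch that only builds the two command lines and does not execute anything. The helper name is made up; `-n` (no-modify mode) and `-v` (verbose) are the same xfs_repair flags mentioned above.

```python
import shlex


def xfs_repair_command(device, check_only=True, verbose=True):
    """Build an xfs_repair argument list without running it."""
    args = ["xfs_repair"]
    if check_only:
        args.append("-n")  # no-modify mode: report problems, change nothing
    if verbose:
        args.append("-v")
    args.append(device)
    return args


# First pass: check only, nothing is written to the disk.
print(shlex.join(xfs_repair_command("/dev/md2")))
# Second pass: actual repair, same command without -n.
print(shlex.join(xfs_repair_command("/dev/md2", check_only=False)))
```

The check-only pass corresponds to the -nv run from the webGUI, and the second to the telnet run. As far as I know the array needs to be started in maintenance mode for either of them, so treat that as something to confirm against the wiki.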

 

Attached is what was on the telnet session screen when it completed. 

 

It produced a lost and found folder and dumped what amounted to a season's worth of TV programmes from one particular series into the folder. I used mc to move them into a different folder and then off the server onto my Windows machine.

xfs_repair md2 2017-05-18.txt

Link to comment

Ok, so you think it's just a corrupted file system on that disk, rather than anything more serious? I will look up the process for backing up a disk prior to formatting it!

 

Or am I being a bit stupid and is it just a case of moving the files off using Windows?

 

Actually, the wiki seems a bit light on this, unless I am missing something. I move the data off/around the array. Then stop the server. Then what do I do? I don't want a new config as I want to reformat then reuse the drive. Do I unassign the drive, start the array with a drive missing, format the drive, stop, then reassign it? The drive will be rebuilt with effectively no data on it. Is that right?

Edited by aspdend
Being thick
Link to comment
2 minutes ago, johnnie.black said:

Rebuild won't help with file system corruption, you need to copy or move all files off that disk, to another computer or to another disk in the array, and then after all data is backed up format that disk.

So do I remove the disk from the array, then run parity, and then assign the disk back again, i.e. shrink the array then expand it?

Link to comment
14 minutes ago, johnnie.black said:

No, copy everything from that disk, stop the array, click on disk2 and change filesystem to reiserfs, start array, format disk2, stop array, change fs back to xfs, start array and format it one more time, restore data back to that disk.
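
Since the format wipes the disk, it may be worth verifying the copy first. Below is a minimal sketch that compares SHA-256 checksums of every file under the source with its counterpart in the backup. The helper names are illustrative and the backup path in the usage note is hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of one file, read in chunks so large videos fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def missing_or_changed(source_root, backup_root):
    """List files from the source that are absent or different in the backup."""
    source = tree_digests(source_root)
    backup = tree_digests(backup_root)
    return sorted(name for name, digest in source.items()
                  if backup.get(name) != digest)
```

For example, `missing_or_changed("/mnt/disk2", "/mnt/disk5/disk2_backup")` should return an empty list before the format goes ahead. On a disk with filesystem corruption some reads may fail outright, in which case the exception itself points at the files that need attention.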

Oh yeah, now I get it! Thanks for the neverending patience in answering such questions! I will do that and update

Link to comment
OK,

 

Took some time, but I managed to copy all the data from Disk 2 back onto the array. I then stopped the array, changed disk 2 to ReiserFS, started, formatted, stopped, changed it to XFS, formatted, changed it back to ReiserFS and formatted, then a final time back to XFS and formatted it again. All seems to be going fine. I took a diagnostics file at this point and all looked fine to me.

 

So I started a parity check and, after about an hour or so, it terminated the check with the message that disk 2 was suffering from read errors again. See later diagnostics file attached as well...

 

Any more assistance available will be much appreciated.

tower-diagnostics-20170521-1958.zip

tower-diagnostics-20170522-0607.zip

Link to comment
2 hours ago, johnnie.black said:

The problem now is the SAS2LP: there are several timeout errors, disk5 now also has filesystem corruption, and it ends with the disk2 read errors.

Try disabling VT-d if not needed, look for a BIOS update, or replace them with LSI controllers.

Yeah,

 

I was afraid you were going to say that! I think my best option is to can the SAS2LP cards (at least one of them anyway) and get me some Dell H310 or similar that seem to be well regarded.

Out of interest, how do you know if the card supports SAS2 (drives over 2TB)? For example, this looks nice and cheap on eBay:

 

Dell H310

 

But there's nothing there that explicitly states it can handle drives over 2TB... Is it because it is the 6Gb/s version? Or something else I am not aware of?

 

 

Edited by aspdend
more infor
Link to comment
Yes it does. You'll need to crossflash it to LSI IT mode to work with unRAID, but it's not difficult, and the procedure is on the wiki.

Link to comment
That's fair enough - just wanted to make sure I got a more unRAID friendly card this time!

 

Thanks yet again for the continued assistance...

Link to comment
Just now, johnnie.black said:

 

That controller (and any other based on the LSI SAS2008 chip) is the current recommended 8 port controller for unRAID.

Good to know! Thanks again. I will order 2/3 of these Dell cards when I get paid later this week and give them a whirl...fingers crossed.

Link to comment

Around the birth of the 3T drives, I did a test of a number of controllers to determine which ones supported >2T and which ones did not.

 

I only found 1 controller / chipset that did not.

 

https://www.servethehome.com/ibm-serveraid-br10i-lsi-sas3082e-r-pciexpress-sas-raid-controller/

 

The card uses the LSI 1068e chipset.

 

That one I know does not support >2T drives, so other cards using this chipset would have the same issue. Initially it was thought that a firmware update would be possible to support larger drives, but either such an update was impossible or it was not profitable, and none has ever emerged.

 

I am not aware of others. >2T support has nothing to do with PCIe level, SATA level, etc.

 

Both the SASLP and SAS2LP controllers support >2T drives.

Link to comment

Here are the controllers I tested back in 2011.

 

SuperMicro C2SEE-O MB

IBM BR10i controller

Hong Kong cheap PCIe x1 controller

Adaptec 1430sa controller

SuperMicro AOC-SASLP-MV8

SuperMicro AOC-SAS2LP-MV8

Promise TX2 controller (2 eSata ports)

ASUS P5B VM DO MB

JMicron JMB383 (on the ASUS MB)

 

As mentioned above, only the BR10i failed to recognize >2T drives (>2.2T, to be more precise).
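
The 2.2T figure matches what you get from a 32-bit sector address with 512-byte sectors. That this is the mechanism behind the BR10i's limit is an assumption on my part and not something the test above established, but the arithmetic is easy to check:

```python
SECTOR_BYTES = 512  # traditional sector size
LBA_BITS = 32       # assumed width of the sector address

max_bytes = (2 ** LBA_BITS) * SECTOR_BYTES

print(max_bytes)             # 2199023255552
print(max_bytes / 10 ** 12)  # 2.199023255552 (decimal TB, as drives are sold)
print(max_bytes / 2 ** 40)   # 2.0 (binary TiB)
```

So the same limit reads as 2T in binary units and 2.2T in the decimal units printed on drive labels.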

Link to comment
