SOLVED: BTRFS Errors on SSD Cache Drive



My Plex docker was not working correctly, so I tried to restart it.  It would not restart, and the Plex log showed "read only file system".  So I went into my UnRaid logs to see what's going on, and I see a ton of BTRFS errors on my cache drive.  SMART looks okay to me.

 

I have no idea what is going on and how I should move forward with this.  Please help.

 

Brad

 

tower-syslog-20170712-1842.zip


Okay, I have recreated the docker image file.  Everything appears to be working immediately after, but I will monitor it for a few days.

 

Thank you Jonnie.black!

 

Before I mark this as Solved I would like to make this a learning experience. What would cause this corruption to happen?

 

Also, I'm looking for information on the Scrub command in both the docker and drive menus.  What is this function used for?  Since I have dual parity drives, is it like a raid rebuild command?  Confused.

 

Thanks again!

5 minutes ago, BradJ said:

What would cause this corruption to happen?

 

In your case it doesn't appear to be hardware related.  This kind of corruption is quite common, but the cause isn't always clear when hardware isn't at fault.

 

6 minutes ago, BradJ said:

Also, I'm looking for information on the Scrub command in both the docker and drive menus.  What is this function used for?  Since I have dual parity drives, is it like a raid rebuild command?  Confused.

 

Scrub is used to check the integrity (checksums) of a btrfs filesystem.  If a checksum error is found and a good copy is available, e.g., on the other device of a cache pool, the error is fixed.  It has nothing to do with the parity drives, which protect the array, not the cache pool.
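For reference, this is how a scrub is typically driven from the shell (a sketch, assuming the pool is mounted at /mnt/cache as in this thread): `btrfs scrub start /mnt/cache` begins a correcting scrub, `btrfs scrub start -r /mnt/cache` a read-only one, and `btrfs scrub status /mnt/cache` prints the summary.  That summary can also be checked from a script; the sample below parses the status text rather than running btrfs itself:

```shell
#!/bin/sh
# Parse a `btrfs scrub status` summary and flag uncorrectable errors.
# The here-doc is a sample summary; on a live system you would feed in
# the output of `btrfs scrub status /mnt/cache` instead.
summary=$(cat <<'EOF'
scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:00:35 2017 and finished after 00:03:44
	total bytes scrubbed: 192.10GiB with 225529 errors
	error details: csum=225529
	corrected errors: 225529, uncorrectable errors: 0, unverified errors: 0
EOF
)

# Pull the counters out of the last line of the summary.
corrected=$(printf '%s\n' "$summary" | sed -n 's/.*corrected errors: \([0-9]*\),.*/\1/p')
uncorrectable=$(printf '%s\n' "$summary" | sed -n 's/.*uncorrectable errors: \([0-9]*\),.*/\1/p')

echo "corrected=$corrected uncorrectable=$uncorrectable"
if [ "$uncorrectable" -gt 0 ]; then
    # Uncorrectable errors mean data was lost; restore from backup.
    echo "WARNING: uncorrectable errors found"
fi
```

A "corrected" count with zero "uncorrectable" errors, as above, means the other copy in the pool was good and the bad blocks were rewritten.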

  • BradJ changed the title to [Solved] BTRFS Errors on SSD Cache Drive
  • BradJ changed the title to BTRFS Errors on SSD Cache Drive

Help again please!

 

I deleted the docker image and restored all the dockers.  Everything seemed to be working fine so I marked this thread as solved.

 

However, here we are almost ten days later, and I decided to check the logs to see if everything had been cleared up.  Unfortunately, I'm still getting tons of BTRFS errors.

 

Attached are the latest log files; please take a look.  To me it looks like a bunch of checksum errors, but they are being corrected.  About three weeks ago I moved my SSD cache drives from onboard SATA to the LSI controller card.  Maybe try switching back to see what happens?

 

Thanks for reading this!

 

tower-syslog-20170722-1406.zip


root@Tower:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 213416
[/dev/sde1].generation_errs 0
root@Tower:~# btrfs dev stats /dev/loop0
[/dev/loop0].write_io_errs   0
[/dev/loop0].read_io_errs    0
[/dev/loop0].flush_io_errs   0
[/dev/loop0].corruption_errs 0
[/dev/loop0].generation_errs 0
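The counters above can also be checked mechanically (and note that `btrfs dev stats -z /mnt/cache` prints them and then resets them to zero, which is handy for seeing whether errors are still accumulating).  A small sketch that flags non-zero counters, using the pasted output as sample input rather than a live pool:

```shell
#!/bin/sh
# Flag any counter in `btrfs dev stats` output that is non-zero.
# The here-doc is the pool's stats as pasted above; on a live system
# you would feed in `btrfs dev stats /mnt/cache` instead.
stats=$(cat <<'EOF'
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 213416
[/dev/sde1].generation_errs 0
EOF
)

# Keep only lines whose counter (second field) is non-zero.
bad=$(printf '%s\n' "$stats" | awk '$2 != 0 { print $1, $2 }')
echo "$bad"
```

Here the only non-zero counter is corruption_errs on /dev/sde1, which points at the Intel SSD rather than the Samsung.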
 


Looks like that corrected all the errors.  What do I do from here, just monitor for a few days?  Could one of those cache drives be going bad, or was it (hopefully) a one-time fluke?

 

First run:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:00:35 2017 and finished after 00:03:44
	total bytes scrubbed: 192.10GiB with 225529 errors
	error details: csum=225529
	corrected errors: 225529, uncorrectable errors: 0, unverified errors: 0

Second run:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:05:33 2017 and finished after 00:03:18
	total bytes scrubbed: 192.10GiB with 0 errors

Ugh.  Errors again.  A non-correcting scrub shows 34896 errors:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sun Jul 23 21:26:41 2017 and finished after 00:03:21
	total bytes scrubbed: 193.21GiB with 34896 errors
	error details: csum=34896
	corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

 

Interesting that the total bytes scrubbed has increased from 192.10GiB to 193.21GiB.  My SSD cache drives are 256GB and 240GB.  I've also noticed that viewing the system logs doesn't always work: the web GUI just kind of freezes and I have to close the browser tab.  After a few tries I eventually got the log and uploaded it.

 

Any other ideas?  

tower-syslog-20170723-2123.zip


Several timeout errors on the affected SSD:

 

Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#2 CDB: opcode=0x2a 2a 00 19 8e 69 60 00 00 60 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#4 CDB: opcode=0x2a 2a 00 19 8e 69 c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#6 CDB: opcode=0x2a 2a 00 19 8e 6a 40 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#8 CDB: opcode=0x2a 2a 00 19 8e 6a c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#10 CDB: opcode=0x2a 2a 00 19 8e 6b 40 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#13 CDB: opcode=0x2a 2a 00 03 09 48 80 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235380)

I suggest connecting both SSDs to the onboard ports (the LSI SAS2008 doesn't support TRIM anyway), running a scrub, and seeing if the issue persists.  If it does, replace the Intel SSD.


Ok.  I'll do exactly as you say: onboard ports, then replace the Intel SSD if the issue persists.

 

I see instructions for replacing a cache drive here: https://wiki.lime-technology.com/Replace_A_Cache_Drive 

 

Those instructions don't really apply since I have dual cache drives providing redundancy.  

 

Is it as simple as stopping the array, assigning the new drive to slot 2 of the cache, and starting the array?  If not, can you either tell me the steps or point me in the right direction?

 

 


Looks like it's detecting the Intel SSD as new.  Start like that; if it mounts, a balance should start.  If it's unmountable, power off, disconnect the Intel SSD, start the array with only the Samsung in the pool, then stop the array, power down, reconnect the Intel, and add it to the pool.


Started the pool.  Got a notification about too many profiles in cache.  I ran the balance command and noticed no activity on the Intel (second) cache drive during the balance.  Using the commands from a previous post of yours, here is the info from before and after the balance:

 

root@Tower:~# btrfs fi df /mnt/cache
Data, RAID1: total=150.00GiB, used=65.81GiB
Data, DUP: total=41.50GiB, used=30.77GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=175.36MiB
Metadata, DUP: total=512.00MiB, used=126.86MiB
GlobalReserve, single: total=109.80MiB, used=0.00B
root@Tower:~# btrfs fi df /mnt/cache
Data, DUP: total=116.29GiB, used=96.25GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=286.69MiB
GlobalReserve, single: total=94.27MiB, used=0.00B

 

I don't fully understand the above output, but it looks to me like the RAID1 profile is gone, which is not what I want.  I just wrote a file to the server and saw zero activity on the Intel drive, so I'm not sure how to get back to a redundant cache pool.  What procedure do you recommend?
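For what it's worth, the usual way to convert a pool back to redundant profiles (per the btrfs documentation; only run this once both healthy devices are in the pool) is a convert balance, e.g. `btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache`, then confirm with `btrfs fi df /mnt/cache` that Data and Metadata report RAID1.  A sketch that checks the `fi df` output for non-redundant block groups, using the post-balance output above as sample input:

```shell
#!/bin/sh
# Check `btrfs fi df` output for block groups that are not RAID1.
# The here-doc is the post-balance output pasted above; on a live pool
# you would feed in `btrfs fi df /mnt/cache` instead.
df_out=$(cat <<'EOF'
Data, DUP: total=116.29GiB, used=96.25GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=286.69MiB
GlobalReserve, single: total=94.27MiB, used=0.00B
EOF
)

# GlobalReserve is always "single" and holds no data, so skip it.
# For every other line, report the block-group type and its profile
# whenever the profile is anything other than RAID1.
nonredundant=$(printf '%s\n' "$df_out" \
    | awk -F', ' '$1 != "GlobalReserve" { split($2, a, ":"); if (a[1] != "RAID1") print $1, a[1] }')

if [ -n "$nonredundant" ]; then
    echo "not redundant:"
    echo "$nonredundant"
fi
```

With the output above, every data and metadata block group is DUP (both copies on one drive) or single, which matches the zero activity seen on the Intel drive: the pool is currently not redundant.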

 

I apologize that my issue is taking so long to rectify.  I really appreciate you helping me out!
 

  • BradJ changed the title to SOLVED: BTRFS Errors on SSD Cache Drive
