SOLVED: BTRFS Errors on SSD Cache Drive



My Plex docker was not working correctly, so I tried to restart it.  It would not restart, and the Plex log showed "read only file system".  So I went into my UnRaid logs to see what's going on, and I see a ton of BTRFS errors on my cache drive.  SMART looks okay to me.

 

I have no idea what is going on and how I should move forward with this.  Please help.

 

Brad

 

tower-syslog-20170712-1842.zip


Okay, I have recreated the docker image file.  Everything appears to be working immediately after, but I will monitor it for a few days.

 

Thank you Jonnie.black!

 

Before I mark this as Solved I would like to make this a learning experience. What would cause this corruption to happen?

 

Also, I'm looking for information on the Scrub command in both the docker and drive menus.  What is this function used for?  Since I have dual parity drives, is it like a raid rebuild command?  Confused.

 

Thanks again!

5 minutes ago, BradJ said:

What would cause this corruption to happen?

 

In your case it doesn't appear to be hardware related.  This kind of corruption is quite common, but the cause isn't always clear when hardware isn't at fault.

 

6 minutes ago, BradJ said:

Also, I'm looking for information on the Scrub command in both the docker and drive menus.  What is this function used for?  Since I have dual parity drives, is it like a raid rebuild command?  Confused.

 

Scrub is used to check the integrity (checksums) of a btrfs filesystem.  If a checksum error is found and a good copy is available, e.g., on the other device of a cache pool, the error is fixed.  It has nothing to do with the parity drives, which protect the array, not the cache pool.
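For reference, this is how a scrub is typically driven from the shell (a sketch, assuming the pool is mounted at /mnt/cache as in this thread): `btrfs scrub start /mnt/cache` begins a correcting scrub, `btrfs scrub start -r /mnt/cache` a read-only one, and `btrfs scrub status /mnt/cache` prints the summary.  That summary can also be checked from a script; the sample below parses the status text rather than running btrfs itself:

```shell
#!/bin/sh
# Parse a `btrfs scrub status` summary and flag uncorrectable errors.
# The here-doc is a sample summary; on a live system you would feed in
# the output of `btrfs scrub status /mnt/cache` instead.
summary=$(cat <<'EOF'
scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:00:35 2017 and finished after 00:03:44
	total bytes scrubbed: 192.10GiB with 225529 errors
	error details: csum=225529
	corrected errors: 225529, uncorrectable errors: 0, unverified errors: 0
EOF
)

# Pull the counters out of the last line of the summary.
corrected=$(printf '%s\n' "$summary" | sed -n 's/.*corrected errors: \([0-9]*\),.*/\1/p')
uncorrectable=$(printf '%s\n' "$summary" | sed -n 's/.*uncorrectable errors: \([0-9]*\),.*/\1/p')

echo "corrected=$corrected uncorrectable=$uncorrectable"
if [ "$uncorrectable" -gt 0 ]; then
    # Uncorrectable errors mean data was lost; restore from backup.
    echo "WARNING: uncorrectable errors found"
fi
```

A "corrected" count with zero "uncorrectable" errors, as above, means the other copy in the pool was good and the bad blocks were rewritten.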

  • BradJ changed the title to [Solved] BTRFS Errors on SSD Cache Drive
  • BradJ changed the title to BTRFS Errors on SSD Cache Drive

Help again please!

 

I deleted the docker image and restored all the dockers.  Everything seemed to be working fine so I marked this thread as solved.

 

However, here we are almost ten days later, and I decided to check the logs to see if everything had been cleared up.  Unfortunately, I'm still getting tons of BTRFS errors.

 

Attached are the latest log files; please take a look.  To me it looks like a bunch of checksum errors, but they are being corrected.  About three weeks ago I moved my SSD cache drives from onboard SATA to the LSI controller card.  Maybe try switching back to see what happens?

 

Thanks for reading this!

 

tower-syslog-20170722-1406.zip


root@Tower:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 213416
[/dev/sde1].generation_errs 0
root@Tower:~# btrfs dev stats /dev/loop0
[/dev/loop0].write_io_errs   0
[/dev/loop0].read_io_errs    0
[/dev/loop0].flush_io_errs   0
[/dev/loop0].corruption_errs 0
[/dev/loop0].generation_errs 0
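The counters above can also be checked mechanically (and note that `btrfs dev stats -z /mnt/cache` prints them and then resets them to zero, which is handy for seeing whether errors are still accumulating).  A small sketch that flags non-zero counters, using the pasted output as sample input rather than a live pool:

```shell
#!/bin/sh
# Flag any counter in `btrfs dev stats` output that is non-zero.
# The here-doc is the pool's stats as pasted above; on a live system
# you would feed in `btrfs dev stats /mnt/cache` instead.
stats=$(cat <<'EOF'
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sde1].write_io_errs   0
[/dev/sde1].read_io_errs    0
[/dev/sde1].flush_io_errs   0
[/dev/sde1].corruption_errs 213416
[/dev/sde1].generation_errs 0
EOF
)

# Keep only lines whose counter (second field) is non-zero.
bad=$(printf '%s\n' "$stats" | awk '$2 != 0 { print $1, $2 }')
echo "$bad"
```

Here the only non-zero counter is corruption_errs on /dev/sde1, which points at the Intel SSD rather than the Samsung.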
 


Looks like that corrected all the errors.  What do I do from here, just monitor for a few days?  Could one of those cache drives be going bad, or was it (hopefully) a one-time fluke?

 

First run:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:00:35 2017 and finished after 00:03:44
	total bytes scrubbed: 192.10GiB with 225529 errors
	error details: csum=225529
	corrected errors: 225529, uncorrectable errors: 0, unverified errors: 0

Second run:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sat Jul 22 15:05:33 2017 and finished after 00:03:18
	total bytes scrubbed: 192.10GiB with 0 errors

Ugh.  Errors again.  A non-correcting scrub shows 34896 errors:

scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
	scrub started at Sun Jul 23 21:26:41 2017 and finished after 00:03:21
	total bytes scrubbed: 193.21GiB with 34896 errors
	error details: csum=34896
	corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

 

Interesting that the total bytes scrubbed has increased from 192.10GiB to 193.21GiB.  My SSD cache drives are 256GB and 240GB.  I've also noticed that viewing the system logs doesn't always work: the web GUI just kind of freezes and I have to close the browser tab.  After a few tries I eventually got the log and uploaded it.

 

Any other ideas?  

tower-syslog-20170723-2123.zip


Several timeout errors on the affected SSD:

 

Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#2 CDB: opcode=0x2a 2a 00 19 8e 69 60 00 00 60 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#4 CDB: opcode=0x2a 2a 00 19 8e 69 c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#6 CDB: opcode=0x2a 2a 00 19 8e 6a 40 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#8 CDB: opcode=0x2a 2a 00 19 8e 6a c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#10 CDB: opcode=0x2a 2a 00 19 8e 6b 40 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#13 CDB: opcode=0x2a 2a 00 03 09 48 80 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235380)

I suggest connecting both SSDs to the onboard ports (the LSI SAS2008 doesn't support TRIM anyway), running a scrub, and seeing if the issue persists.  If it does, replace the Intel SSD.


Ok.  I'll do exactly as you say: onboard ports, then replace the Intel SSD if the issue persists.

 

I see instructions for replacing a cache drive here: https://wiki.lime-technology.com/Replace_A_Cache_Drive 

 

Those instructions don't really apply since I have dual cache drives providing redundancy.  

 

Is it as simple as stopping the array, assigning the new drive to slot 2 of the cache, and starting the array?  If not, can you either tell me the steps or point me in the right direction?

 

 


Looks like it's detecting the Intel SSD as new.  Start like that; if it mounts, a balance should start.  If it's unmountable, power off, disconnect the Intel SSD, start the array with only the Samsung in the pool, then stop the array, power down, reconnect the Intel, and add it to the pool.


Started the pool.  Got a notification about too many profiles in cache.  I ran the balance command and noticed no activity on the Intel (second) cache drive during the balance.  Using the commands from a previous post of yours, here is the info from before and after the balance:

 

root@Tower:~# btrfs fi df /mnt/cache
Data, RAID1: total=150.00GiB, used=65.81GiB
Data, DUP: total=41.50GiB, used=30.77GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=175.36MiB
Metadata, DUP: total=512.00MiB, used=126.86MiB
GlobalReserve, single: total=109.80MiB, used=0.00B
root@Tower:~# btrfs fi df /mnt/cache
Data, DUP: total=116.29GiB, used=96.25GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=286.69MiB
GlobalReserve, single: total=94.27MiB, used=0.00B

 

I don't fully understand the above output, but it looks to me like the RAID1 profile is gone, which is not what I want.  I just wrote a file to the server and saw zero activity on the Intel drive, so I'm not sure how to get back to a redundant cache pool.  What procedure do you recommend?
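For what it's worth, the usual way to convert a pool back to redundant profiles (per the btrfs documentation; only run this once both healthy devices are in the pool) is a convert balance, e.g. `btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache`, then confirm with `btrfs fi df /mnt/cache` that Data and Metadata report RAID1.  A sketch that checks the `fi df` output for non-redundant block groups, using the post-balance output above as sample input:

```shell
#!/bin/sh
# Check `btrfs fi df` output for block groups that are not RAID1.
# The here-doc is the post-balance output pasted above; on a live pool
# you would feed in `btrfs fi df /mnt/cache` instead.
df_out=$(cat <<'EOF'
Data, DUP: total=116.29GiB, used=96.25GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=286.69MiB
GlobalReserve, single: total=94.27MiB, used=0.00B
EOF
)

# GlobalReserve is always "single" and holds no data, so skip it.
# For every other line, report the block-group type and its profile
# whenever the profile is anything other than RAID1.
nonredundant=$(printf '%s\n' "$df_out" \
    | awk -F', ' '$1 != "GlobalReserve" { split($2, a, ":"); if (a[1] != "RAID1") print $1, a[1] }')

if [ -n "$nonredundant" ]; then
    echo "not redundant:"
    echo "$nonredundant"
fi
```

With the output above, every data and metadata block group is DUP (both copies on one drive) or single, which matches the zero activity seen on the Intel drive: the pool is currently not redundant.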

 

I apologize that my issue is taking so long to rectify.  I really appreciate you helping me out!
 

  • BradJ changed the title to SOLVED: BTRFS Errors on SSD Cache Drive
