BradJ Posted July 12, 2017
My Plex docker was not working correctly, so I tried to restart it. It would not restart, and the Plex log showed "read only file system". So I went into my unRAID logs to see what's going on, and I see a ton of BTRFS errors on my cache drive. SMART looks okay to me. I have no idea what is going on or how I should move forward with this. Please help. Brad
tower-syslog-20170712-1842.zip
BradJ Posted July 13, 2017 (Author)
Looking through the forums, I see that maybe I should run the scrub command on the docker file. I see the option to "correct file system errors". Should that be checked or not? Am I going in the right direction here?
BradJ Posted July 13, 2017 (Author)
If I am in fact supposed to run the scrub command, do I run it from the Docker menu or from the Cache1 drive menu? Please help; I haven't run into this before.
JorgeB Posted July 13, 2017
The docker image is corrupt; delete and recreate it.
BradJ Posted July 13, 2017 (Author)
Okay, I have recreated the docker image file. Everything appears to be working immediately after, but I will monitor it for a few days. Thank you, johnnie.black! Before I mark this as solved, I would like to make this a learning experience. What would cause this corruption to happen? Also, I'm looking for information on the scrub command in both the docker and drive menus. What is this function used for? Since I have dual parity drives, is it like a RAID rebuild command? I'm confused. Thanks again!
JorgeB Posted July 13, 2017
5 minutes ago, BradJ said: What would cause this corruption to happen?
In your case it doesn't appear to be hardware related. Docker image corruption is quite common, but the cause isn't really clear when it isn't hardware related.
6 minutes ago, BradJ said: Also, I'm looking for information on the scrub command in both the docker and drive menus. What is this function used for? Since I have dual parity drives, is it like a RAID rebuild command? I'm confused.
Scrub checks the integrity (checksums) of a btrfs filesystem. If a checksum error is found and a good copy is available, e.g., on the other device of a cache pool, the error is fixed. It's unrelated to the array's parity drives.
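For reference, a minimal command-line sketch of the same check; this assumes the pool is mounted at /mnt/cache, as on a stock unRAID system:

btrfs scrub start -B /mnt/cache      # run a scrub and wait for it to finish (-B)
btrfs scrub status /mnt/cache        # show results: corrected vs. uncorrectable errors
btrfs scrub start -B -r /mnt/cache   # read-only pass: report errors without repairing

On a read-write mount, scrub repairs any checksum error that has a good mirror copy; the -r flag makes it report only.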
BradJ Posted July 13, 2017 (Author)
Thank you for your explanations!
BradJ Posted July 22, 2017 (Author)
Help again, please! I deleted the docker image and restored all the dockers. Everything seemed to be working fine, so I marked this thread as solved. However, here we are almost ten days later, and when I checked the logs to see if everything had cleared up, I found I'm unfortunately still getting tons of BTRFS errors. Attached is the latest log file; please take a look. To me it looks like a bunch of checksum errors, but they are being corrected. About three weeks ago I moved my SSD cache drives from the onboard SATA ports to the LSI controller card. Maybe try switching back to see what happens? Thanks for reading this!
tower-syslog-20170722-1406.zip
JorgeB Posted July 22, 2017
The problem seems limited to the docker image, but that's still not normal. Can you post the output of:
btrfs dev stats /mnt/cache
JorgeB Posted July 22, 2017
And also:
btrfs dev stats /dev/loop0
BradJ Posted July 22, 2017 Author Share Posted July 22, 2017 root@Tower:~# btrfs dev stats /mnt/cache [/dev/sdb1].write_io_errs 0 [/dev/sdb1].read_io_errs 0 [/dev/sdb1].flush_io_errs 0 [/dev/sdb1].corruption_errs 0 [/dev/sdb1].generation_errs 0 [/dev/sde1].write_io_errs 0 [/dev/sde1].read_io_errs 0 [/dev/sde1].flush_io_errs 0 [/dev/sde1].corruption_errs 213416 [/dev/sde1].generation_errs 0 root@Tower:~# btrfs dev stats /dev/loop0 [/dev/loop0].write_io_errs 0 [/dev/loop0].read_io_errs 0 [/dev/loop0].flush_io_errs 0 [/dev/loop0].corruption_errs 0 [/dev/loop0].generation_errs 0 Quote Link to comment
JorgeB Posted July 22, 2017
Run a correcting scrub on the cache pool. If all errors are repaired, run another one after that to check for any more.
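For the console-inclined, the same two-pass procedure sketched as commands; I'm assuming here that the GUI's "correct file system errors" scrub corresponds to a plain read-write scrub:

btrfs scrub start -B /mnt/cache   # pass 1: repair checksum errors from the good copy
btrfs scrub status /mnt/cache     # note the corrected/uncorrectable counts
btrfs scrub start -B /mnt/cache   # pass 2: should finish with 0 errors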
BradJ Posted July 22, 2017 (Author)
Looks like that corrected all the errors. What do I do from here, just monitor for a few days? Could one of those cache drives be going bad, or is it (hopefully) a one-time fluke?
First run:
scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
scrub started at Sat Jul 22 15:00:35 2017 and finished after 00:03:44
total bytes scrubbed: 192.10GiB with 225529 errors
error details: csum=225529
corrected errors: 225529, uncorrectable errors: 0, unverified errors: 0
Second run:
scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
scrub started at Sat Jul 22 15:05:33 2017 and finished after 00:03:18
total bytes scrubbed: 192.10GiB with 0 errors
JorgeB Posted July 22, 2017
It's hard to say; there are no read or write errors. Monitor for a few days, and if there are more errors, see whether they are on the same SSD.
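A simple way to do that monitoring from the console, using the device stats shown above (the -z flag is btrfs-progs' "print and reset counters" option):

btrfs dev stats -z /mnt/cache   # record the current counters, then zero them
# ...wait a few days, then:
btrfs dev stats /mnt/cache      # any nonzero corruption_errs count is new
# check whether new errors stay on /dev/sde1 or also appear on /dev/sdb1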
BradJ Posted July 22, 2017 (Author)
Okay, I'll monitor and report back. Thanks again; I would be lost without you.
BradJ Posted July 24, 2017 (Author)
Ugh. Errors again. A non-correcting scrub shows 34896 errors again:
scrub status for 6992016b-30ad-4d9f-8ffb-3b34070daa11
scrub started at Sun Jul 23 21:26:41 2017 and finished after 00:03:21
total bytes scrubbed: 193.21GiB with 34896 errors
error details: csum=34896
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
Interesting that the total bytes scrubbed has increased from 192.10GiB to 193.21GiB. My SSD cache drives are 256GB and 240GB. I've also noticed that viewing the system log sometimes doesn't work: the web GUI just freezes and I have to close the browser tab. After a few tries I eventually got the log and uploaded it. Any other ideas?
tower-syslog-20170723-2123.zip
JorgeB Posted July 24, 2017
Only the used space is scrubbed, so that number grows with the data. Post the complete diagnostics instead.
BradJ Posted July 24, 2017 (Author)
Here ya go...
tower-diagnostics-20170724-0949.zip
JorgeB Posted July 24, 2017
Several timeout errors on the affected SSD:
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:06:52 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:06:52 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 00 02 00 c0 00 14 2c 48 00 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff8801093bfc80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#2 CDB: opcode=0x2a 2a 00 19 8e 69 60 00 00 60 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528c780)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#4 CDB: opcode=0x2a 2a 00 19 8e 69 c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528db00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#6 CDB: opcode=0x2a 2a 00 19 8e 6a 40 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528ca80)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#8 CDB: opcode=0x2a 2a 00 19 8e 6a c0 00 00 80 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d500)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#10 CDB: opcode=0x2a 2a 00 19 8e 6b 40 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff88029528d800)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: [sde] tag#13 CDB: opcode=0x2a 2a 00 03 09 48 80 00 00 20 00
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jul 23 05:07:26 Tower kernel: scsi target1:0:3: enclosure_logical_id(0x500605b006623b00), slot(1)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: task abort: SUCCESS scmd(ffff880037235b00)
Jul 23 05:07:26 Tower kernel: sd 1:0:3:0: attempting task abort! scmd(ffff880037235380)
I suggest connecting both SSDs to the onboard ports (the LSI 2008 doesn't support TRIM anyway), running a scrub, and seeing if the issue persists. If it does, replace the Intel SSD.
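As an aside on the TRIM point: you can verify whether discard actually reaches the SSDs on a given controller with fstrim (part of util-linux; the exact wording of the failure message may vary by version):

fstrim -v /mnt/cache
# On a controller that passes TRIM through, this prints how many bytes were trimmed.
# Behind an LSI 2008 it should instead fail with
# "the discard operation is not supported".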
BradJ Posted July 24, 2017 (Author)
Ok, I'll do exactly as you say: onboard ports, then replace the Intel SSD if the issue persists. I see instructions for replacing a cache drive here: https://wiki.lime-technology.com/Replace_A_Cache_Drive
Those instructions don't really apply, since I have dual cache drives providing redundancy. Is it as simple as stopping the array, assigning the new drive to slot 2 of the cache, and starting the array? If not, can you either tell me the steps or point me in the right direction?
JorgeB Posted July 24, 2017
You can use the FAQ instructions.
BradJ Posted July 24, 2017 (Author)
I transferred the cache drives to the onboard SATA ports. Upon reboot I'm getting conflicting information from the dashboard: it shows my Intel SSD as unassigned, yet gives me a notification that my Samsung SSD is missing. Can I start the array as you see in the attached picture?
FireShot Capture 1 - Tower_Main - http___192.168.1.100_Main.pdf
JorgeB Posted July 24, 2017
Looks like it's detecting the Intel SSD as new. Start like that: if it mounts, a balance should start. If it's unmountable, power off, disconnect the Intel SSD, start the array with only the Samsung in the pool, then stop the array, power down, reconnect the Intel, and add it back to the pool.
BradJ Posted July 24, 2017 (Author)
Started the pool. Got a notification about too many profiles in the cache. I ran the balance command and noticed no activity on the Intel 2nd cache drive during the balance. From a previous post of yours, I captured this info from before and after the balance:
Before:
root@Tower:~# btrfs fi df /mnt/cache
Data, RAID1: total=150.00GiB, used=65.81GiB
Data, DUP: total=41.50GiB, used=30.77GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=175.36MiB
Metadata, DUP: total=512.00MiB, used=126.86MiB
GlobalReserve, single: total=109.80MiB, used=0.00B
After:
root@Tower:~# btrfs fi df /mnt/cache
Data, DUP: total=116.29GiB, used=96.25GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=286.69MiB
GlobalReserve, single: total=94.27MiB, used=0.00B
I don't fully understand the above output, but it looks to me like the RAID1 profile is gone, which at least clears up the "too many profiles" warning. However, I just wrote a file to the server and saw zero activity on the Intel drive, so I'm not sure how to get back to a redundant cache pool. What procedure do you recommend? I apologize that my issue is taking so long to rectify. I really appreciate you helping me out!
JorgeB Posted July 24, 2017
Stop the array. Wipe the Intel SSD with this (confirm that it's still sdh):
blkdiscard /dev/sdh
Unassign it from the pool, start the array, wait for any cache activity to stop, then stop the array. Re-assign the Intel SSD and restart the array; a new balance should start.
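Once that balance finishes, it's worth confirming the pool is redundant again. A sketch of the check, with the expected shape of the output based on the earlier posts:

btrfs fi df /mnt/cache
# Redundant again when the DUP lines are gone and both Data and Metadata show RAID1:
#   Data, RAID1: total=..., used=...
#   Metadata, RAID1: total=..., used=...

If the automatic balance doesn't convert the profiles on its own (an assumption on my part that this step is needed), the generic btrfs equivalent, run with the array started and both devices in the pool, would be:

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache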