Cache pool I/O errors and BTRFS check filesystem status



A couple of days ago I noticed the mover was taking forever to transfer movies from the cache pool to the data drives, and checking the log I found a ton of BTRFS errors.  Having experienced something like this before, I followed the same steps to redo the cache pool (stopping Docker, deleting docker.img, moving the cache shares to the array, wiping the file systems on both cache drives, then moving the cache shares back).  After starting the array again, I ran a BTRFS scrub, which found and corrected 9 read errors; a second scrub corrected 4 more, and the third finally came back with no errors.  After that I ran a BTRFS balance, then restarted the array in maintenance mode and ran the BTRFS check under Check Filesystem Status, which returned this:

 

checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/sdc1
UUID: 10fca33f-5aca-4c9b-8a8e-99c373eb0fe4
found 13622984704 bytes used err is 0
total csum bytes: 13082368
total tree bytes: 222707712
total fs tree bytes: 186515456
total extent tree bytes: 20742144
btree space waste bytes: 45097406
file data blocks allocated: 13400276992
 referenced 13400276992
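
For reference, these are roughly the command-line equivalents of the scrub/balance/check steps above (a sketch from memory; the GUI may invoke them with slightly different options, and /mnt/cache and /dev/sdc1 are the paths on my system):

# Scrub the mounted pool; -B waits for completion and prints the error summary (repeat until clean)
btrfs scrub start -B /mnt/cache

# Rebalance the pool across both devices
btrfs balance start /mnt/cache

# With the array in maintenance mode, run a read-only filesystem check
btrfs check --readonly /dev/sdc1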

 

I assume "err is 0" means that no filesystem errors were found; however, checking the attached log I still see blk_update_request I/O errors associated with sdc, which is cache drive 1.  For example:

 

Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#29 CDB: opcode=0x28 28 00 01 e8 d1 80 00 00 20 00
Jun 25 13:22:33 JBOX kernel: blk_update_request: I/O error, dev sdc, sector 32035200
Jun 25 13:22:33 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 25 13:22:33 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#30 CDB: opcode=0x28 28 00 01 e8 d1 60 00 00 20 00
Jun 25 13:22:33 JBOX kernel: blk_update_request: I/O error, dev sdc, sector 32035168
Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#31 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Jun 25 13:22:33 JBOX kernel: sd 1:0:1:0: [sdc] tag#31 CDB: opcode=0x28 28 00 01 e8 d0 a0 00 00 20 00
Jun 25 13:22:33 JBOX kernel: blk_update_request: I/O error, dev sdc, sector 32034976
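
For what it's worth, here's roughly how I've been checking drive health against the filesystem errors (a sketch; the -d sat flag is what I need for drives behind the LSI HBA and may differ on other controllers):

# Count the I/O errors logged against sdc
grep -c 'blk_update_request: I/O error, dev sdc' /var/log/syslog

# Full SMART report for the drive (SAT passthrough for a drive behind a SAS HBA)
smartctl -a -d sat /dev/sdc

# BTRFS's own per-device error counters for the pool
btrfs device stats /mnt/cache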

 

So apparently there's still a problem with the cache pool even though neither SSD in the pool is reporting any SMART errors.  Any ideas what's going on and what I should do next?  When this happened before I chalked it up to issues with the SAS2LP-MV8 controller I was using at the time (data drives on that controller would sometimes drop off the array too, including once during a parity check, which wasn't fun).  Since I replaced it with the LSI 9211-8i, though, everything had been working fine.  So do the continued I/O errors after replacing the cache pool suggest some new hardware issue (maybe the card has become slightly unseated?), or is it more likely an issue with BTRFS?  At this point I'm reluctant to even try restarting Docker and recreating the docker.img again if there's still some kind of corruption, and I'm about fed up with BTRFS anyway.  That being the case, is it worth trying the cache replacement procedure again, this time reformatting one of the cache SSDs to XFS, in the hope that copying the cache shares back to a single cache drive might somehow get rid of the I/O errors?

 

If the cache data is just corrupted at this point and there's no real fix except to delete all of it and start over, then I'm not recreating my Plex server from template again on top of BTRFS.  It's a lot of work, and the extra protection offered by the cache pool is only a nice theory if you're repeatedly having to wipe your cache due to software corruption when the drives themselves are perfectly healthy.  Or that's the way I'm leaning at the moment, anyway.  I'd still appreciate any feedback anyone has to offer on the best way forward from here.  Thanks.

syslog.txt

18 hours ago, johnnie.black said:

Those are hardware errors, possibly a bad cable.

Thanks Johnnie.  I'll check the cables when I get the chance, but since it's a 2+ hour drive to where I run that server remotely, and since I wanted to switch to a single XFS cache drive anyway, for now I simply unassigned the Samsung SSD that had been cache 1 (and was reporting all the I/O errors), changed the available cache slots to 1, formatted the Crucial SSD as XFS when I assigned it as the cache, and restarted the array.  I'm now copying the cache shares back to it from the array, and so far the log looks clean, so assuming the move completes without errors I'll try restarting Docker once that's done.

 

What I'd really like to do going forward (after I check the connections to the Samsung and make sure there's no longer any HW issue) is to use that drive as a sort of hot backup to the cache -- i.e. format it as XFS also and set up weekly or even daily backups to it from the cache, so that if/when the Crucial ever shoots craps I can just quickly assign the Samsung as the replacement.  If I'm doing that, though, do I have to assign the Samsung to the array, or can I keep it outside and still run regular cache backups to it?  Outside the array seems like the better option to me if I can just get some instruction on how to do that.

4 minutes ago, ElJimador said:

If I'm doing that, though, do I have to assign the Samsung to the array, or can I keep it outside and still run regular cache backups to it?  Outside the array seems like the better option to me if I can just get some instruction on how to do that.

 

Since XFS doesn't support pools, the other SSD must remain outside the array.  Use the Unassigned Devices plugin together with a script to make the backups, and note that you need to stop the Docker/VM services to back them up successfully.
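
Something like this as a user script, for example (a minimal sketch -- /mnt/disks/samsung_backup is just an assumed name for the Unassigned Devices mount point, so adjust the paths to your setup):

#!/bin/bash
# Rough cache backup sketch: stop the Docker and VM services so docker.img
# and vdisks aren't copied mid-write, mirror the cache to the backup SSD,
# then restart the services.
# Assumes Unassigned Devices mounts the Samsung at /mnt/disks/samsung_backup.

/etc/rc.d/rc.docker stop
/etc/rc.d/rc.libvirt stop

# -a preserves permissions/timestamps; --delete keeps the backup an exact mirror
rsync -a --delete /mnt/cache/ /mnt/disks/samsung_backup/

/etc/rc.d/rc.libvirt start
/etc/rc.d/rc.docker start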

Thanks again Johnnie.  I'll look into that.
