Mover running trying to cache 20TB on 250GB cache

fatpipe · August 4, 2017

Hi Support,

New weird behaviour. Was trying to work out why the cache was not being used for one of the shares yesterday when it is enabled (setting Yes) and set it to preferred to try it. Today I notice the array is accessing and some drive are up when it should be idle. After a brief freak out thinking a virus is wiping my data, it turns out it is the mover.

It is copying data off the share set to preferred and putting it on the cache drive. But the share is near 20TB and the cache is a pair of 250GB SSDs and they are now full, nearly 200GB or so is missing from the reported usage for the drives in the share and the SSD is full (760MB free).

I can't stop the array because mover disables that and it is still doing something, reading one of the data disks still.

I don't know if it is actually damaging array contents or not at this stage. It could be.

JorgeB · August 4, 2017

That's how cache prefer is supposed to work, on the console type:

mover stop

Then change back to cache yes and run the mover again.

fatpipe · August 4, 2017

Thanks JB,

Didn't mean to leave it that way but got distracted last night after I set it and forgot.

Also I imagined that setting "Prefer" would not be distructive to data already stored on the array. I hope it hasn't been.

Presumably it only moves what it can until the cache is full?

This was only done because the cache was not being utilised for this share despite being set to "Yes".

Will check that out again after the mover moves back (I hope).

JorgeB · August 4, 2017

Data should be fine, mover will error out once cache is full.

fatpipe · August 4, 2017

Rediculous.

Moving data back has resulted in an error reported on disk1 again. The disk1 I replaced 3 days ago (new) for one that failed 2 months ago.

Been online successfully for a day only.

What is going on? Seeing log errors against allmost all drives. The only one taken offline is the disk1.

Is it power supply or controllers? How can I tell? I can't believe it is disk drives. There are 2 controllers in the machine. What the chance of both being faulty?

These are on the cache but no errors in the main page.

Aug 4 17:43:00 zStore2 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1
Aug 4 17:43:00 zStore2 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1

Aug 4 17:43:23 zStore2 kernel: scsi_io_completion: 42025 callbacks suppressed
Aug 4 17:43:23 zStore2 kernel: sd 4:0:2:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 4 17:43:23 zStore2 kernel: sd 4:0:2:0: [sdh] tag#0 CDB: opcode=0x28 28 00 0f 78 63 48 00 04 00 00
Aug 4 17:43:23 zStore2 kernel: blk_update_request: 42049 callbacks suppressed
Aug 4 17:43:23 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 259547976
Aug 4 17:43:23 zStore2 kernel: btrfs_dev_stat_print_on_error: 72461 callbacks suppressed
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917312, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917313, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917314, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917315, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917316, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917317, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917318, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917319, flush 165963, corrupt 0, gen 0
Aug 4 17:43:23 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8119658, rd 5917320, flush 165963, corrupt 0, gen 0

What is wrong with this rig? Should be drawing less power than the 4TB drives it had before. 8TB WD Reds I am having nothing but trouble with. What is this?

If I haven't lost data now it will be a miracle.

JorgeB · August 4, 2017

Diagnostics may shed some light:

Tools -> Diagnostics

fatpipe · August 4, 2017

OK. Ran that. Here it is.

Can I ask what hours the support forum is manned (Personed?).

Seems likely I might be struggling with this all weekend again.

Also, can I run a preclear on a previously failed drive? Just to prove it is faulty? What about a drive that has failed and is still part of the array?

zstore2-diagnostics-20170804-1831.zip

Also if you could look at the one from the previous drive fail to compare?

zstore2-diagnostics-20170722-0130.zip

JorgeB · August 4, 2017

19 minutes ago, fatpipe said:

Can I ask what hours the support forum is manned (Personed?).

This is a community support forum.

As for your problems, there are know issues with the SAS2LP and unRAID v6, including dropping disks for no reason, you're also virtualizing unRAID, so it may make things worse, but your best bet would be to replace the SAS2LP controllers with LSI, 9201-8i, 9211-8i or similar.

fatpipe · August 4, 2017

Oh dear. That is bad news. Good news on the disk side perhaps. I have 3 of these controllers. One as spare. So that wouldn't have helped my sanity if I tried replacing one.

Do the 9201-8i or 9211-8i work out of the box or do the need to be firmware brained for JBOD first? That was a thing back when I was building the v5 unRAID with other controllers I didn't like.

So community means it is whenever someone who thinks they can answer something asked is there then? Kind of 24/7 and kind of whenever...?

JorgeB · August 4, 2017

4 minutes ago, fatpipe said:

Do the 9201-8i or 9211-8i work out of the box or do the need to be firmware brained for JBOD first?

9201 works out of the box, 9211-8i needs to be in IT mode, a simple flash like a bios update is needed if it's in IR mode.

5 minutes ago, fatpipe said:

So community means it is whenever someone who thinks they can answer something asked is there then? Kind of 24/7 and kind of whenever...?

Yes, you'll usually get fast answers even during weekends, as long as someone can help, you can also contact LT by email if needed.

itimpi · August 4, 2017

3 minutes ago, fatpipe said:

Oh dear. That is bad news. Good news on the disk side perhaps. I have 3 of these controllers. One as spare. So that wouldn't have helped my sanity if I tried replacing one.

Do the 9201-8i or 9211-8i work out of the box or do the need to be firmware brained for JBOD first? That was a thing back when I was building the v5 unRAID with other controllers I didn't like.

So community means it is whenever someone who thinks they can answer something asked is there then? Kind of 24/7 and kind of whenever...?

I have just replaced two SASLP controllers with a LSI 9201-16i and it was just a case of moving the SAS connectors to the new card. No flashing or reconfiguring of unRAID was necessary. As a useful side-effect I got a significant boost in performance during parity checks (probably because the SASLP was bandwidth limited). I will keep the SASLP controllers around for my test unRAID server as most of the time the performed fine.

Regarding the support issue - yes it depends who is online but there tend to be enough that you normally get a response relatively quickly any time of the day. On the whole the community support tends to be quite good although it is always a good idea to query anything you are not sure of. Limetech do drop in on the forum but not on a predictable schedule. If you really want to contact them then it is best done by email. However as there are only a few people there you may well have to wait if they are asleep (bearing in mind they are USA based).

fatpipe · August 4, 2017

Thanks itimpi. Will probably look into a 9201-16i as that solves a slot problem in the combined host.

The SAS2LP was a bit faster than the SASLP I believe. I get 140MB for the parity check but it noticably slows at the top of the 4TB drives half way through and the rebuild ran 109MB the other day.

How does the 9201-16i compare with that?

itimpi · August 4, 2017

3 hours ago, fatpipe said:

Thanks itimpi. Will probably look into a 9201-16i as that solves a slot problem in the combined host.

The SAS2LP was a bit faster than the SASLP I believe. I get 140MB for the parity check but it noticably slows at the top of the 4TB drives half way through and the rebuild ran 109MB the other day.

How does the 9201-16i compare with that?

I think the 9201 runs about the same speed. That is reasonable if the SAS2LP was not bandwidth limited (it has twice the bandwidth of the SASLP).

You do pay a significant premium for the LSI 9201-16I compared to buying 2 x LSI 9201-8i so you have to decide if using one less PCIe slot is worth the additional cost.

fatpipe · August 4, 2017

This seems to be getting worse.

I tried continuing to use the array tonight and it is now in a real mess. No more drive failures but here is an excerpt form the log. Implies filesystem corruption and indeed a major node in the tree on the largest share is now "missing". It can be seen in the file browsers of the individual disks alond with content but in SMB sharing it comes up empty.

The below says to unmount and run xfs_repair. Is that literal and at the command line? What parameters? Which drive? All?

Aug 5 01:56:55 zStore2 kernel: ffff8802fe061020: e4 37 61 77 f3 b4 ce 8f 11 d9 f9 a2 6b 7d 7b 59 .7aw........k}{Y
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061030: 70 31 65 15 7c fd 44 91 12 49 a3 da 5e 3b 94 65 p1e.|.D..I..^;.e
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0x92/0xb8, xfs_inode block 0x180d3c048
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Unmount and run xfs_repair
Aug 5 01:56:55 zStore2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061000: d7 ee a6 36 87 ca 33 a6 33 48 e2 8f fd ee 03 f6 ...6..3.3H......
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061010: 88 32 01 0c d8 22 3c 4b 27 73 49 7e 4e b9 ce dd .2..."<K'sI~N...
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061020: e4 37 61 77 f3 b4 ce 8f 11 d9 f9 a2 6b 7d 7b 59 .7aw........k}{Y
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061030: 70 31 65 15 7c fd 44 91 12 49 a3 da 5e 3b 94 65 p1e.|.D..I..^;.e
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0x92/0xb8, xfs_inode block 0x180d3c048
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Unmount and run xfs_repair
Aug 5 01:56:55 zStore2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061000: d7 ee a6 36 87 ca 33 a6 33 48 e2 8f fd ee 03 f6 ...6..3.3H......
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061010: 88 32 01 0c d8 22 3c 4b 27 73 49 7e 4e b9 ce dd .2..."<K'sI~N...
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061020: e4 37 61 77 f3 b4 ce 8f 11 d9 f9 a2 6b 7d 7b 59 .7aw........k}{Y
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061030: 70 31 65 15 7c fd 44 91 12 49 a3 da 5e 3b 94 65 p1e.|.D..I..^;.e
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0x92/0xb8, xfs_inode block 0x180d3c048
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Unmount and run xfs_repair
Aug 5 01:56:55 zStore2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061000: d7 ee a6 36 87 ca 33 a6 33 48 e2 8f fd ee 03 f6 ...6..3.3H......
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061010: 88 32 01 0c d8 22 3c 4b 27 73 49 7e 4e b9 ce dd .2..."<K'sI~N...
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061020: e4 37 61 77 f3 b4 ce 8f 11 d9 f9 a2 6b 7d 7b 59 .7aw........k}{Y
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061030: 70 31 65 15 7c fd 44 91 12 49 a3 da 5e 3b 94 65 p1e.|.D..I..^;.e
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0x92/0xb8, xfs_inode block 0x180d3c048
Aug 5 01:56:55 zStore2 kernel: XFS (md1): Unmount and run xfs_repair
Aug 5 01:56:55 zStore2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061000: d7 ee a6 36 87 ca 33 a6 33 48 e2 8f fd ee 03 f6 ...6..3.3H......
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061010: 88 32 01 0c d8 22 3c 4b 27 73 49 7e 4e b9 ce dd .2..."<K'sI~N...
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061020: e4 37 61 77 f3 b4 ce 8f 11 d9 f9 a2 6b 7d 7b 59 .7aw........k}{Y
Aug 5 01:56:55 zStore2 shfs/user: err: shfs_readdir: fstatat: dirtymasseur (117) Structure needs cleaning
Aug 5 01:56:55 zStore2 kernel: ffff8802fe061030: 70 31 65 15 7c fd 44 91 12 49 a3 da 5e 3b 94 65 p1e.|.D..I..^;.e
Aug 5 01:56:55 zStore2 kernel: XFS (md1): metadata I/O error: block 0x180d3c048 ("xfs_trans_read_buf_map") error 117 numblks 32
Aug 5 01:56:55 zStore2 kernel: XFS (md1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -117.
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 01:56:56 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Errors keep being reported against the cache also. Even though the array should be idle at the moment.

Aug 5 02:03:09 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 02:03:09 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 02:03:09 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 02:03:18 zStore2 kernel: scsi_io_completion: 24 callbacks suppressed
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#0 CDB: opcode=0x2a 2a 00 00 5f b3 a8 00 00 18 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: 25 callbacks suppressed
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6271912
Aug 5 02:03:18 zStore2 kernel: btrfs_dev_stat_print_on_error: 28 callbacks suppressed
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125109, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#1 CDB: opcode=0x2a 2a 00 00 5f b3 c0 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6271936
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125110, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#2 CDB: opcode=0x2a 2a 00 00 5f b4 40 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272064
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125111, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#3 CDB: opcode=0x2a 2a 00 00 5f b4 c0 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272192
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125112, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#4 CDB: opcode=0x2a 2a 00 00 5f b5 40 00 00 68 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272320
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125113, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#1 CDB: opcode=0x2a 2a 00 00 5f b3 a8 00 00 18 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6271912
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125114, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#2 CDB: opcode=0x2a 2a 00 00 5f b3 c0 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6271936
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125115, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#3 CDB: opcode=0x2a 2a 00 00 5f b4 40 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272064
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125116, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#4 CDB: opcode=0x2a 2a 00 00 5f b4 c0 00 00 80 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272192
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125117, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Aug 5 02:03:18 zStore2 kernel: sd 4:0:2:0: [sdh] tag#5 CDB: opcode=0x2a 2a 00 00 5f b5 40 00 00 68 00
Aug 5 02:03:18 zStore2 kernel: blk_update_request: I/O error, dev sdh, sector 6272320
Aug 5 02:03:18 zStore2 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 8125118, rd 5920439, flush 166032, corrupt 0, gen 0
Aug 5 02:03:18 zStore2 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1
Aug 5 02:03:18 zStore2 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1
Aug 5 02:03:39 zStore2 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 5 02:03:39 zStore2 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 5 02:03:39 zStore2 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 5 02:03:40 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 02:03:40 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Aug 5 02:03:40 zStore2 rc.diskinfo[31453]: PHP Warning: strpos(): Empty needle in /etc/rc.d/rc.diskinfo on line 339
Am I screwed?

JorgeB · August 4, 2017

md1 is disk1, but I would use the server as little as possible until you replace the SAS2LP controllers, mvsas driver used by them was constantly crashing on your logs.

fatpipe · August 5, 2017

Controller purchased and on the whay but looks like 2 weeks minimum.

Miraculously, that folder has returned after the array went to sleep and woke up again having left it for a day almost.

How do I run the xfs_repair if the volume it needs to run on is synthetic but has to be dismounted to run the command? Would I have to resync the contents in the corrupted form first back onto a drive and then run it or can I address the synthesised content with the array in maintenance mode and repair it like that?

I have set the mover to 1 month schedule to stop it making more churn of this until I can get the replacement controller in place. Is there another way of stoping it from running?

Thanks for the assists all so far. Would never have worked out the controllers easily although I was suspicious from some of the problems experienced since moving to v6.

itimpi · August 5, 2017

You can put the array into Maintenance mode and run the repair from there. In fact Maintenance mode is always the recommended mode for running repairs.

JorgeB · August 5, 2017

5 minutes ago, fatpipe said:

How do I run the xfs_repair if the volume it needs to run on is synthetic but has to be dismounted to run the command?

https://wiki.lime-technology.com/Check_Disk_Filesystems#Drives_formatted_with_XFS

fatpipe · August 8, 2017

Hi Folks,

Just so I am clear when the controller arrives, I was poking around with the current situation and have a few further questions.

I ran xfs_repair -nv against the various disks in the array set.

Disk7 OK

Disk6 OK

Disk5 OK

Disk4 OK

Disk3 repairs advised including lost+found approx 20 items and a few other things.

Disk2 OK

How can I run a Disk1 repair on the emulated contents? I presume I can't and have to wait for the controller, "replace" the drive and wait for it to rebuild the (presumed) corrupted contents onto it and then run the repair. Is that correct?

Also tried the repair on Disk3 and got the below dialog.

xfs_repair -v disk3

Phase 1 - find and verify superblock...
        - block cache size set to 1028432 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 235851 tail block 235847
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Tried mounting the array and during that process, Disk4 reported some errors and the array failed to mount. Now I have missing data that was stored on that drive also.

I have powered down the unRAID VM entirely and resigned to leave it till the LSI controller arrives before doing anything more but in anticipation of that, what is the best way to untangle that?

Only some of the array data is backed up elsewhere as I hadn't got around to building a second unRAID host (physically buiilt but not working) because of the continuous issues with the main array.

JorgeB · August 8, 2017

11 minutes ago, fatpipe said:

How can I run a Disk1 repair on the emulated contents?

Same, start in maintenance mode and use md1, but since it will read from all other disks, do it when the new controller arrives.

13 minutes ago, fatpipe said:

Also tried the repair on Disk3 and got the below dialog.

You'll need to use -L, usually there isn't data loss, also recheck disk4 again.

fatpipe · August 8, 2017

I ran those using the GUI with command options and there was no selection for Disk1 since it is faulted. (fwiw, the original 8TB disk has passed a preclear cycle so won't need to return that it seems). Did not try using CLI but given how things have gone with it I dare not touch it till the controllers can be replaced.

44 minutes ago, johnnie.black said:

Same, start in maintenance mode and use md1, but since it will read from all other disks, do it when the new controller arrives.

Should this be tried before or after rebuilding the contents of the disk1 onto a real drive? Will rebuild work with potentially corrupted content? I don't see why it couldn't.

35 minutes ago, johnnie.black said:

You'll need to use -L, usually there isn't data loss, also recheck disk4 again.

Are you saying that -L is not as distructive as it sounds?

-L

Force Log Zeroing. Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data.

===================

Rant on the controllers.

The log reports a version of mvsas of 0.8.16.

I can't seem to find out if that is old, new or even valid. Not my area of expertise but surely someone would care enough to diagnose and fix it if nothing else? Seems somewhat recent activity in code by some parties. Surely a problem with virtualisation is not unique to unRAID. Marvel sould care to some degree that their product is being ejected and labeled unsuitable which has got to affect its popularity.

[1/1] mvsas: add SGPIO support to Marvell 94xx - Patchwork

https://patchwork.kernel.org/patch/7924021/

Dec 27, 2015 - add SGPIO support to Marvell 94xx Signed-off-by: Wilfried Weissmann ... c b/drivers/scsi/mvsas/mv_94xx.c index 9270d15..f6fc4a7 100644 ...

LIBSAS driver for Marvell 88SE63xx/64xx/68xx/94xx SAS controller ...

https://sourceforge.net/p/scst/svn/5321/tree/trunk/mvsas_tgt/README?format...

LIBSAS driver for Marvell 88SE63xx/64xx/68xx/94xx SAS controller ... To install the target driver, type 'make install' in mvsas/ subdirectory. The target driver will ...

JorgeB · August 8, 2017

47 minutes ago, fatpipe said:

Should this be tried before or after rebuilding the contents of the disk1 onto a real drive? Will rebuild work with potentially corrupted content? I don't see why it couldn't.

Either way.

48 minutes ago, fatpipe said:

Are you saying that -L is not as distructive as it sounds?

It's quite common and u usually there's no data loss.

trurl · August 8, 2017

On 8/5/2017 at 10:16 AM, fatpipe said:

I have set the mover to 1 month schedule to stop it making more churn of this until I can get the replacement controller in place. Is there another way of stoping it from running?

Been a couple of days since this was asked. Mover only moves files for user shares that are set to cache:prefer or cache:yes, so if all your shares are set to cache:no or cache:only, nothing for mover to move. More detailed explanation here:

Joseph · August 8, 2017

On August 4, 2017 at 4:38 AM, fatpipe said:

Thanks itimpi. Will probably look into a 9201-16i as that solves a slot problem in the combined host.

The SAS2LP was a bit faster than the SASLP I believe. I get 140MB for the parity check but it noticably slows at the top of the 4TB drives half way through and the rebuild ran 109MB the other day.

How does the 9201-16i compare with that?

fwiw, here's a link to an issue regarding the SAS2LP that was resolved by replacing the controller.

The Dell H310 is inexpensive and easy to flash. I now have 2 of them installed in my unRAID rig... and barring unforeseen faulty cable issues, its been running like a champ!

http://lime-technology.com/wiki/index.php/Crossflashing_Controllers#LSI_SAS2008_chipset

Edited August 8, 2017 by Joseph

fatpipe · August 18, 2017

Hi Again Forum users.

Wish I was coming back to report all was well with the new card and for a while it looked like it was as of about a couple of hours ago at least.

Replaced two AOC-SAS2LP-MV8 cards with one LSI 9201-16i and thing were going fairly well with only one anomally up to then.

Got the card booting under VMware passthru and seeing all the drives except Disk4 which continued to play up (and of course the Disk1 that was marked failed).

Re-slotted the Disk4 into another drive bay and it seemed to be happy. Backed up some of the more vulnerable data, ran xfs_repair -nv on Disk4 and it came up OK.

Ran it on Disk2 and Disks 5-7 all ok. So that left just disk1 and disk3 still showing corruptions (ignoring the cache pool corruptions for now).

The Disk1 repair check came back straight away with errors for the superblock and I was convinced it was going to be toast. Disk3 came back from the check saying the message about buffered updates or words to that effect to playback and to mount the volume so I did. After doing that I then put the array back into maintenance mode and checked disk3 again and it came up with mostly the same errors but no complaints about the buffers. Disk1 (md1) came back as clean (miracle!) and so I thought all was going well and put the replacement disk (original "failure" 8TB drive) into the Disk1 slot which had passed a preclear (1x 3 stage full pass) and left it to rebuild.

Well a few hours later, I come back to this disaster. What the hell is this? It did not do that many reads and writes. The worry though (appart from the obvious failed rebuild) is the errors column and for all the 8TB drives.

Disk 2 and Disk3 8TB drives are currently showing up as unassigned devices also.

Is this drivers? Firmware or some problem with large disks? Surely can't be all 3 drives.

Diagnostic output attached.zstore2-diagnostics-20170818-2249.zip

Mover running trying to cache 20TB on 250GB cache

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation