Dynamix File Integrity plugin


bonienl


Just now, bonienl said:

FIP has some logic to start concurrent "hashing" sessions depending on the number of available cores in the system.

 

Perhaps I should make that a variable which can be adjusted through the GUI; it would allow people to experiment and see what works best on their system.

 

 

That would be a great addition and I think I could definitely get around the issue by just setting it to one or two.

 

It would also be great to be able to limit concurrent hashing sessions per physical disk, though, as that is the real bottleneck in my system. My disk just couldn't keep up with all of the concurrent read/writes, whereas if the files were on multiple disks I think I would have been fine.
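Just to illustrate the idea, a per-disk lock around each hash call would be enough; the sketch below is purely hypothetical (my own wrapper, not how FIP actually works):

#!/bin/bash
# hash_one.sh - illustrative sketch only: serialize hashing per disk,
# so at most one hashing process touches a given spindle at a time.
disk="$1"                               # e.g. disk11
file="$2"                               # file on that disk to hash
lock="/var/lock/fip-hash-$disk.lock"    # one lock file per disk

# flock waits until no other hasher holds this disk's lock, then runs the hash
flock "$lock" b2sum "$file"

Wired into the plugin, something like that would let any number of workers run while still keeping each disk down to one reader at a time.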

 

Anyway, thanks for the response! :)

Link to comment
4 minutes ago, ksarnelli said:

My disk just couldn't keep up with all of the concurrent read/writes, whereas if the files were on multiple disks I think I would have been fine.

 

IMO this is more important, i.e., even with multiple cores, hashing multiple files on the same disk concurrently will always be slower than one at a time.

Link to comment
54 minutes ago, johnnie.black said:

even with multiple cores, hashing multiple files on the same disk concurrently will always be slower than one at a time.

Depends on the hashing algorithm. If disk I/O and access time is only 1/20 of the total time required to hash and record the results, it would be faster to assign multiple cores to the same disk. It's all about processor math vs I/O speed.

 

Granted, the current balance of hashing algorithms, disk I/O and processor speed is suited to one disk per thread, but that's not an absolute.

Link to comment

I'd disagree with that. A single seek on a modern disk typically takes 10-15 ms (seek plus rotational latency); that's 10,000-15,000 usec, which is a LONG time at modern CPU speeds. Thrashing between multiple files on the same disk will result in a LOT of those seek delays. The problem is that even if disk I/O and access time is 1/20th of the total time, unless the entire file is read in a single access, that fraction expands dramatically as the heads have to re-seek to read the next piece of each file.

Granted, if the code were modified to read an entire file, THEN hash it, it might be advantageous to have multiple threads operating on the same disk (as long as there's enough memory to buffer the files being processed), but I think it's simply best to limit any given disk to a single thread.
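To put rough numbers on that (the figures below are hypothetical, just to show the ratio): assume the hasher reads in 1 MiB chunks from a drive that streams at 150 MB/s and averages 12 ms per repositioning.

# Back-of-the-envelope seek overhead; chunk in bytes, stream in MB/s, seek in ms (illustrative values only)
awk -v chunk=1048576 -v stream=150 -v seek=12 'BEGIN {
    transfer_ms = chunk / (stream * 1000000) * 1000           # time to read one chunk sequentially
    effective   = chunk / ((transfer_ms + seek) / 1000) / 1000000
    printf "%.1f ms transfer + %.0f ms seek per chunk -> ~%.0f MB/s effective\n", transfer_ms, seek, effective
}'
# prints: 7.0 ms transfer + 12 ms seek per chunk -> ~55 MB/s effective

In other words, if two hashing threads interleave 1 MiB reads on the same spindle, each read pays a seek and the drive drops from ~150 MB/s sequential to roughly a third of that, which is exactly the thrashing described above.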

 

Link to comment
  • 3 weeks later...
On 1/9/2017 at 1:25 PM, bonienl said:

Doing a manual Build will add any files that were missed previously.

 

It is recommended to exclude folders/files which change frequently; for those, the normal error detection of the disk will take care of things. Detection of bitrot is more meaningful for folders/files which are not accessed or modified frequently.

 

 

So I should exclude folders containing working documents, and only use this for integrity checking of largely static data such as photos, home videos, media collections, etc.?

Link to comment
  • 2 weeks later...
On 3/8/2017 at 7:59 PM, bonienl said:

FIP has some logic to start concurrent "hashing" sessions depending on the number of available cores in the system.

 

Perhaps I should make that a variable which can be adjusted through the GUI; it would allow people to experiment and see what works best on their system.

 

Any news on this? It is especially painful when a mover operation is running: not only does FIP try to calculate checksums for multiple files at a time on one disk, the mover is also trying to copy more files to that same disk. This increases the move time significantly and basically makes the array unusable while it is running.

Link to comment
  • 1 month later...

I am getting notifications of hash key mismatches from directories that are in my exclude custom folder list.

 

I want to include my /mnt/user/Backups share because all my machines at home back up to it, but I want to exclude a subset of directories under that share because my server is also a backup location for several family members' machines.

 

My structure is:

/mnt/user/Backups/...

  + Local machine 1 backup

  + Local machine 2 backup

  + Local machine 3 backup

  + CommunityApplicationAppdataBackup

  + Crashplan machine 1 backup

  + Crashplan machine 2 backup

  + Crashplan machine n backup

 

In the excluded files and folders Custom Folders box, I have this:

 

619375375455617289,619380445513515330,619452463559020549,622829716619395332,622831926866608387,682140704451330318,712875340537760924,CommunityApplicationsAppdataBackup

The numbered directories are the ones CrashPlan creates for incoming backups. I recall that when I first excluded the folders there was some sort of drop-down picker and that the format above was created by it, but now I'm not so certain of my recollection and I may have just made that up.

 

2 part question, then:

1) why am I still getting notifications of file mismatches in these folders?

2) if the answer to 1) is "because the exclude folder format is wrong", what is the correct format for this box? Do I need the full path, and if so from root (/)?

 

 

Link to comment
  • 2 weeks later...

I've been rebuilding all my hashes one disk at a time to try to eliminate the constant warnings I was getting, because I think I improperly excluded some paths. I came home this evening to a very unusual display.

 

It's telling me that my Disk11 is 100% complete with an ETA of 00:00:00; however, it's reporting that it's working on file 664,413 of 2,248,611, and the current file number is still incrementing.

 

Is there possibly an issue with the size of the counter variable overflowing and causing the percentage calculation to think it's done?
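I have no idea how bunker actually derives that percentage, but if it is plain shell integer arithmetic along the lines of the sketch below (hypothetical, not taken from bunker), overflow seems unlikely; bash arithmetic is 64-bit on this platform and these counts are nowhere near even a 32-bit limit:

# Hypothetical progress calculation with the numbers from the screenshot
current=664413; total=2248611
echo $(( current * 100 / total ))       # prints 29, nowhere near an overflow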

 

 

File Integrity oddity.PNG

Link to comment
  • 4 weeks later...

I have 17 disks and I set up the schedule with one task for each disk, daily at 00:00. Yesterday disk 5 ran fine. Today disk 6 should have run, but I did not get a notification that it started and I do not see the bunker script running. Is there any way to find out why it didn't run this morning?

Link to comment

I looked on the console and saw these messages; not sure what they mean :)

 

grep: /boot/config/plugins/dynamix.file.integrity/disks.ini no such file or directory

grep: /boot/config/plugins/dynamix.file.integrity/disks.ini no such file or directory

/usr/local/emhttp/plugins/dynamix.file.integrity/scripts/bunker: line 347: 24875 Terminated   $exec $argv "$file" > $tmpfile.0
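For reference, here is what I plan to check next. These are generic commands; I am only guessing that the schedule ends up in the usual cron locations, based on where other Dynamix plugins keep their cron files:

# Is there a cron entry anywhere that references the bunker script?
grep -rs bunker /etc/cron* /boot/config/plugins/dynamix/ 2>/dev/null

# Does the file the script complained about actually exist?
ls -l /boot/config/plugins/dynamix.file.integrity/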

 

 

 

Link to comment
  • 3 weeks later...

Can someone explain what the "Task" checkboxes are supposed to be for?

 

I'm trying to learn more about how to use this plugin after I found out some of my media files act up at random sections. I suspect some of the files got messed up. I've had a few bad disks over the years and no way to tell whether any files got damaged.

 

 

Also, I just have ONE ReiserFS disk remaining that I am planning to replace. Should I replace it before getting started with this plugin? I did notice the warning on the first page, but was curious whether I could just exclude the disk that is ReiserFS format. I'm guessing that the answer is no.

Edited by DazedAndConfused
Link to comment
16 hours ago, DazedAndConfused said:

just exclude the disk that is ReiserFS format.

 

That is exactly what I did. 

 

When I got my first disk converted from RFS to XFS, I started FIP running against that one XFS drive. Each time I got a drive converted, I built & exported that disk and added it to the check schedule. Eventually I got the whole server converted to XFS, and now all drives are being tested on a regular basis.

Link to comment

I've noticed that one disk takes an extreme amount of time to be checked by FIP. By extreme, I mean 30+ hours.

 

I've got a dozen data disks:

 

root@NAS:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md1        932G  914G   18G  99% /mnt/disk1
/dev/md2        1.9T  1.8T   22G  99% /mnt/disk2
/dev/md3        932G  929G  2.9G 100% /mnt/disk3
/dev/md4        1.9T  1.9T  9.5G 100% /mnt/disk4
/dev/md5        3.7T  3.7T   33G 100% /mnt/disk5
/dev/md6        1.9T  1.9T   16G 100% /mnt/disk6
/dev/md7        1.9T  1.3T  571G  70% /mnt/disk7
/dev/md8        1.9T  1.9T  151M 100% /mnt/disk8
/dev/md9        932G  932G   55M 100% /mnt/disk9
/dev/md10       2.8T  2.8T   25G 100% /mnt/disk10
/dev/md11       3.7T  2.1T  1.7T  56% /mnt/disk11
/dev/md12       3.7T  1.9T  1.9T  51% /mnt/disk12

Of the 4TB drives (md5, 11 & 12), 5 & 12 run to completion in 6-10 hours (I don't recall exactly off the top of my head), but disk 11 regularly takes 30+ hours.

 

Using jbartlett's excellent drive performance test, all 3 of those disks are performing at right about the same speeds. (FIP is currently processing disks 11 & 12, so there were some files open that probably slowed them down a bit.)

Disk 5: 	HGST HDN724040ALE640 PK1338P4GT2D9B  	4 TB	  133 MB/sec avg
Disk 11: 	HGST HMS5C4040ALE640 PL2331LAG6W5WJ  	4 TB	  100 MB/sec avg
Disk 12: 	HGST HMS5C4040ALE640 PL1331LAHE2R0H  	4 TB	  102 MB/sec avg

So I chalk this up to the very large number of files on disk11:

root@NAS:/mnt# ls disk5 -ARl|egrep -c '^-'
74328
root@NAS:/mnt# ls disk11 -ARl|egrep -c '^-'
2311532
root@NAS:/mnt# ls disk12 -ARl|egrep -c '^-'
380921
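(As an aside, find gives the same counts a bit quicker than a recursive long listing, in case anyone wants to repeat this on their own disks; disk11 here is just my example.)

# Count regular files on one disk without producing a full long listing
find /mnt/disk11 -type f | wc -l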

 

Which brings me back to the question I asked here - how do I properly exclude directories that I don't want checked? Since asking that question, I changed my directory exclude settings to use full paths from root, then executed 'Clear', 'Remove', 'Build', 'Export' for each and every disk in turn in an effort to update FIP's understanding of what it's supposed to do, but I'm still getting bunker reports of hash key mismatches on directories that should be excluded. I've set the "Exclude" paths from /mnt/users; do I need to exclude /mnt/diskx instead? I would think doing that would be a major pain, since I'm writing to user shares that can easily span multiple drives - to begin with I'd need to exclude the paths on every existing disk, and then I would need to remember to update my FIP settings every time I add a new disk. (Granted, I don't do that often, but it's still a royal pain.)
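If it does turn out that disk-level paths are required, at least it is quick to see which disks currently hold a given share folder; the path below is just my CA backup folder as an example:

# List the physical disks that actually contain the folder I'm trying to exclude
ls -d /mnt/disk*/Backups/CommunityApplicationsAppdataBackup 2>/dev/null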

 

I've confirmed that disk11 does contain a large portion of the files I'd like to exclude from FIP scanning.

 

Is this an issue with how FIP is skipping the paths in the "exclude" setting or how I'm defining them, or is there something else I'm missing completely?

Link to comment
  • 1 month later...

Just building my first unRAID and copying TBs of data to it. Would it be best to wait until all data is copied over and then run the Build? Or should I just enable it right off the bat and let it compute checksums as stuff is copied to the array? Any major perf hit by just turning it on now and having it start doing its thing while I copy data?

 

Also, I'm copying the files in an SSH screen session, using Unassigned Devices mounts and cp directly from my previous NTFS drives to the /mnt/diskN/share locations. Does it still compute the checksums when moving files around this way, rather than via SMB shares and such?

Edited by deusxanime
more info/questions
Link to comment
1 hour ago, deusxanime said:

Just building my first unRAID and copying TBs of data to it. Would it be best to wait until all data is copied over and then run the Build? Or should I just enable it right off the bat and let it compute checksums as stuff is copied to the array? Any major perf hit by just turning it on now and having it start doing its thing while I copy data?

 

The issue I brought up about 6 months ago (concurrent checksum calculations on a single disk) still hasn't been corrected, so I would recommend not having the automatic option enabled during your initial copy.  I actually had to leave it disabled permanently because every time my mover ran the checksum calculations would crush one of my drives.

Link to comment
57 minutes ago, ksarnelli said:

The issue I brought up about 6 months ago (concurrent checksum calculations on a single disk) still hasn't been corrected, so I would recommend not having the automatic option enabled during your initial copy.  I actually had to leave it disabled permanently because every time my mover ran the checksum calculations would crush one of my drives.

 

Ouch, thanks for the heads up. I guess I'll leave it off for now and turn it on later once things have settled in.

Link to comment
  • 1 month later...
On 7/23/2017 at 10:30 AM, FreeMan said:

Which brings me back to the question I asked here - how do I properly exclude directories that I don't want checked? Since asking that question, I changed my directory exclude settings to use full paths from root, then executed 'Clear', 'Remove', 'Build', 'Export' for each and every disk in turn in an effort to update FIP's understanding of what it's supposed to do, but I'm still getting bunker reports of hash key mismatches on directories that should be excluded. I've set the "Exclude" paths from /mnt/users; do I need to exclude /mnt/diskx instead? I would think doing that would be a major pain, since I'm writing to user shares that can easily span multiple drives - to begin with I'd need to exclude the paths on every existing disk, and then I would need to remember to update my FIP settings every time I add a new disk. (Granted, I don't do that often, but it's still a royal pain.)

 

Sadly, I'm still getting notifications of errors on files that should be excluded, so either the exclude logic is broken or I don't understand how to use it. It seems that the "how to use it" part is a big secret, since I've asked about it twice and nobody has felt it was appropriate to share.

 

I'm also getting 2 types of notifications from FIP:

Quote

Event: unRAID file corruption
Subject: Notice [NAS] - bunker verify command
Description: Found 14 files with BLAKE2 hash key mismatch
Importance: warning

 

 and

Quote

Event: unRAID file corruption
Subject: Notice [NAS] - bunker verify command
Description: Found 6 files with BLAKE2 hash key corruption
Importance: alert

 

English is difficult (even though it's my native language) and the semantics can be particular. I'm not sure whether a "warning" is more or less important than an "alert", though I would think that "corruption" is worse than a "mismatch". However, I'm not sure what the difference between the two of those really is, either. In what way can the hash key be mismatched if not due to corruption?

Link to comment

Try the following steps:

  1.  Settings -> FIP -> Automatically protect = disabled
  2.  Settings -> FIP -> Disk verification schedule = disabled
  3.  Settings -> FIP -> Fill in all folders + files which need to be excluded
  4.  Apply
  5.  Tools -> FIP -> Select all disks
  6.  Tools -> FIP -> Clear
  7.  Settings -> FIP -> Automatically protect = enabled
  8.  Settings -> FIP -> Disk verification schedule = enabled (select schedule)
  9. Apply

Files with a "key mismatch" have been updated (their modification time changed), but their stored key was not updated to match. This usually happens when applications make changes to files without proper open/close notifications. Best practice: exclude these folders or files.

 

Files with a "key corruption" have not been modified, yet the key signifies the file has corrupted content. Sometimes this can be a false positive, but in general these files need to be checked against a backup copy to verify their content (manual action).
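For that manual check, hashing the flagged file and its backup copy and comparing the output is usually enough, for example (both paths are placeholders only):

# Hash both copies with BLAKE2; identical sums mean identical content
b2sum /mnt/disk1/share/somefile /mnt/backups/share/somefile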

 

Link to comment

Thanks, @bonienl. I've followed your instructions and I've got my verifications scheduled again. I'll continue to monitor to see if I get any additional issues.

 

If anything, I would guess that it's CA Backup that's not properly opening/closing files, because all the issues are in my Backups share under the CA backups path. I had it excluded with "/mnt/user/Backups/CommunityApplicationsAppdataBackup", and that doesn't seem to have excluded these files from the verification, so I just changed it to "/mnt/user/Backups/CommunityApplicationsAppdataBackup/*" (with the /* at the end) in the hope that will do it. I've excluded other directories as well and comma-separated them - is that the proper way of doing it? I believe I also had spaces after the commas between the paths; maybe that was breaking things - I've eliminated those now, too.

Link to comment
