Finding all file changes - new, deleted, modified and renamed files


shEiD


I am not a real programmer, just a self-taught hobbyist. Pretty much the only thing I know is C#. I am also a Windows guy - no Linux at all. I've been using Unraid for a couple of months now - that's all my Linux experience so far. So please excuse my possibly silly questions :)

 

I want to write a simple console app to track all the file changes on my media library shares. Basically, I would run it periodically (once a day, or once a week) on my main Windows rig. It would scan the user shares on my Unraid server over LAN, detect changes in the file system, calculate hashes for all the new and modified files, and write a log of all the changes - new, deleted, renamed (moved) and modified files.

I am gonna use hashes to identify the files. Using the hashes, I will be able to track all the changes to the files. Like I said, the files are media (movies, TV shows). There is nothing fancy happening with them, so the logic to work out what happened using hashes is pretty simple.
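
To make the hashing step concrete, here is a minimal sketch of what I have in mind (SHA-256 is just an example choice, any stable hash would do):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    // Hash a single file by streaming it, so large media files
    // are not loaded into memory all at once.
    static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }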

 

The part I need some help with, and want to ask some questions about, is making the scanning process as fast as possible. Basically, I want to skip all the unchanged files as smartly and quickly as possible. I am not even an expert on Windows file and folder time-stamps, let alone Linux. Hence the questions.

  1. What is the best way to check that a file has not been changed in any way? I was thinking - match the full path and the modified time-stamp. Is that good enough?
  2. Maybe there is a better way to see if a file has been changed in any way?
  3. Is there any faster way to do this than to check every single file individually - by somehow skipping whole folders if no files inside have changed?
  4. I tried to read up on how folder modified times change in Linux. As I understand it, a folder's modified time is updated only when files are added, deleted or renamed inside it. If a file is merely modified (as in, a text file is edited), the change does not bubble up even to the parent folder's modified time - is that correct? If so, I cannot rely on folder mtimes at all.

 

Basically, what's the fastest way to traverse the file system tree, skipping the unchanged files/folders as quickly as possible?
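
To illustrate what I mean in question 1, here is a rough sketch of the path + mtime check (the snapshot dictionary and the share path are made up; the real snapshot would be loaded from the previous run's log):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // previous: path -> last known mtime (UTC), persisted from the last scan
    var previous = new Dictionary<string, DateTime>();
    var toRehash = new List<string>();

    foreach (var file in new DirectoryInfo(@"\\tower\movies")
                 .EnumerateFiles("*", SearchOption.AllDirectories))
    {
        // Same full path and same mtime -> assume unchanged, skip hashing.
        if (previous.TryGetValue(file.FullName, out var knownMtime) &&
            knownMtime == file.LastWriteTimeUtc)
            continue;

        toRehash.Add(file.FullName); // new or modified -> hash it
    }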

If you want to write this in C# on Windows ... change the following as needed, but this should be a good starting point.
 
You likely want to enumerate the directory entries, skipping anything older than a specified timespan. You can use the CreationTime or LastWriteTime attributes, depending on whether you are checking for files being created or being modified.
 
The only way to scan for deleted files, after the fact, is to keep an index of the files that exist and then check whether they're still there - but that means you have to SCAN every single file, so there's no quick way.
 
 
                using System;
                using System.Collections.Generic;
                using System.IO;
                using System.Linq;

                string OutputFolder = @"\\server\share\rootdir";
                int MaximumFileAge  = 7 * 24 * 60 * 60 * 1000; // 7 days, 24 hours per day, 60 minutes per hour, 60 seconds per minute, 1000 milliseconds per second

                // Get list of documents created within the last MaximumFileAge
                IList<string> documentNames =
                    EnumerateFiles(OutputFolder, "*.*", true)
                        .Where(f => f.CreationTimeUtc > CurrentTimeUtc().Subtract(TimeSpan.FromMilliseconds(MaximumFileAge)))
                        .Select(f => f.Name)
                        .ToList();
 
        // extracted for testability
        [System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage]
        protected internal virtual DateTime CurrentTimeUtc()
        {
            return DateTime.UtcNow;
        }

        // extracted for testability
        [System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage]
        protected internal virtual bool DirectoryExists(string folderName)
        {
            return Directory.Exists(folderName);
        }
 
        /// <summary>
        /// Enumerate list of files in a folder.  Optionally recurse sub folders.  Hidden, system, temporary files
        /// will be ignored along with files that don't match the file pattern.
        /// </summary>
        /// <param name="folderName">Name of the folder to retrieve files from</param>
        /// <param name="filePattern">File pattern to retrieve.  ex: "*.*", "*.pdf", "apk?.*" </param>
        /// <param name="includeSubFolders">Recurse sub folders or not</param>
        /// <returns>Enumeration of file information objects</returns>
        public virtual IEnumerable<FileInfo> EnumerateFiles(string folderName, string filePattern, bool includeSubFolders)
        {
            if (!DirectoryExists(folderName))
                throw new IOException(string.Format("Folder doesn't exist or is otherwise unreachable. folder name = {0}", folderName));
 
            SearchOption searchOption = includeSubFolders ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
 
            IEnumerable<FileInfo> files = new DirectoryInfo(folderName)
                .EnumerateFiles(filePattern, searchOption)
                .Where(f => !f.Attributes.HasFlag(FileAttributes.Hidden) &&
                            !f.Attributes.HasFlag(FileAttributes.System) &&
                            !f.Attributes.HasFlag(FileAttributes.Temporary));

            return files;
        }
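
For the deleted-files part, the index diff itself is cheap once you have the hashes - a hypothetical sketch (LoadIndex and BuildIndex are made-up stand-ins for however you persist the path -> hash map between runs):

        using System;
        using System.Collections.Generic;
        using System.Linq;

        // oldIndex: path -> hash saved by the previous run (LoadIndex is hypothetical)
        Dictionary<string, string> oldIndex = LoadIndex();
        // newIndex: path -> hash from the current scan (BuildIndex is hypothetical)
        Dictionary<string, string> newIndex = BuildIndex();

        var missing = oldIndex.Keys.Except(newIndex.Keys).ToList(); // gone from their old paths
        var added   = newIndex.Keys.Except(oldIndex.Keys).ToList(); // paths not seen before

        // A missing path whose hash reappears under a new path is a rename/move, not a delete.
        var addedByHash = added.ToLookup(path => newIndex[path]);
        foreach (var oldPath in missing)
        {
            var newPath = addedByHash[oldIndex[oldPath]].FirstOrDefault();
            Console.WriteLine(newPath != null
                ? $"renamed: {oldPath} -> {newPath}"
                : $"deleted: {oldPath}");
        }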

@BRiT thanks for the reply.

 

I already have an app for scanning and hashing all the files. I made it years ago, when I was using (or trying to use) the ill-fated Bitcasa unlimited cloud drive. Bitcasa never worked reliably - it constantly corrupted or outright lost files. So I wrote that simple console app to save the hashes before syncing files to Bitcasa, and to verify them later.

 

Anyway. What I was asking and hoping for was maybe some Linux-specific or XFS-specific way to speed up the file scanning part. I was hoping maybe I had misunderstood something when googling, and that all file modifications do bubble up at least to the parent folder's mtime, or something along those lines. Anything that would help to reliably skip traversing and checking the files themselves as much as possible.

 

I guess, as it stands, there is no way to skip any of the files without checking every single file individually for its full path and mtime. I just wanted to be sure I am not missing something useful.


There is a way of using inotify to listen for live events, though you'll have to raise the inotify watch limits. That live listener would be useful after you've created a full baseline. Some of this is buried in the handful of plugin threads that provide checksum/verification/integrity or antimalware protection (bait files that should never be changed). Have you looked at the Dynamix File Integrity / bunker plugins? Not sure if they're still supported or have been superseded by something better.
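
For reference, .NET's FileSystemWatcher is backed by inotify on Linux, so a live listener in C# could be as small as this sketch - running on the server itself, since inotify only sees local filesystem events (the share path is just an example, and the same watch limits apply):

    using System;
    using System.IO;

    var watcher = new FileSystemWatcher("/mnt/user/movies")
    {
        IncludeSubdirectories = true,
        NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite | NotifyFilters.Size
    };
    watcher.Created += (s, e) => Console.WriteLine($"created: {e.FullPath}");
    watcher.Deleted += (s, e) => Console.WriteLine($"deleted: {e.FullPath}");
    watcher.Changed += (s, e) => Console.WriteLine($"changed: {e.FullPath}");
    watcher.Renamed += (s, e) => Console.WriteLine($"renamed: {e.OldFullPath} -> {e.FullPath}");
    watcher.EnableRaisingEvents = true;

    Console.ReadLine(); // keep the process alive while listening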


Nah, I do not want an active event listener. It's just media files, which do not "change" that much or that often. It is way easier and far less complicated to have a console app and run it as needed. It's completely fine to do it once a week or so.

 

This is just a stepping stone for another thing I want to make for myself in the future. But for that I will need to find out:

  • how to create an ASP.NET Core web app to run in Docker on Unraid
  • how to create file hard links using .NET Core (see the sketch after this list)

Using hard links on user shares in Unraid is freaking awesome :)
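
From what I've found so far, .NET Core has no managed API for creating hard links, so it looks like a small P/Invoke wrapper is needed. A minimal sketch, untested, assuming link() from libc on Linux and CreateHardLink from kernel32 on Windows:

    using System;
    using System.Runtime.InteropServices;

    static class HardLink
    {
        // Linux: int link(const char *oldpath, const char *newpath)
        [DllImport("libc", SetLastError = true)]
        private static extern int link(string oldPath, string newPath);

        // Windows: BOOL CreateHardLinkW(newFileName, existingFileName, securityAttributes)
        [DllImport("kernel32", CharSet = CharSet.Unicode, SetLastError = true)]
        private static extern bool CreateHardLink(string newFileName, string existingFileName, IntPtr securityAttributes);

        public static void Create(string existingFile, string newLink)
        {
            bool ok = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
                ? CreateHardLink(newLink, existingFile, IntPtr.Zero)
                : link(existingFile, newLink) == 0;

            if (!ok)
                throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
        }
    }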

