
Project: Duplicate file handling tool


Forked from a thread that was going OT.

 

I predict Joe L will post a clever SED script any time now :)


Ok, copied from the other thread...

 

I split it across two lines for readability.  You can put it all on one line if you prefer:

grep "duplicate object" /var/log/syslog | cut -d" " -f8- | 
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/" | sort -u | sh -

 

This should list the duplicate files as found by user-shares in parallel folders in the /mnt/disk?? shares.

 

If your syslog is HUGE, perhaps you need to just take the tail end of the syslog like this:

tail -10000 /var/log/syslog | grep "duplicate object" | cut -d" " -f8- | 
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/" | sort -u | sh -

 

The trick to regular expressions is all in knowing where to put the backslashes.
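To see what the sed stage does on its own, it can be run against a single sample path with the final "| sh -" left off, so the generated command is printed rather than executed. This is only a sketch: the sample path and the function name rewrite_path are made up for illustration, and it assumes the text reaching sed is a bare /mnt/diskN/... path, as the cut stage is meant to produce.

```shell
# Hypothetical sample path, in the form expected after cut has
# stripped the syslog prefix (an assumption, not a real log entry).
sample='/mnt/disk3/Movies/Action/Terminator.mpg'

# Same sed expression as in the one-liner above. It drops the first two
# path components (/mnt/diskN/) and wraps the rest in an "ls -l" command
# whose leading /*/*/ wildcards match the parallel folder on every disk.
rewrite_path() {
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/"
}

printf '%s\n' "$sample" | rewrite_path
# prints: ls -l /*/*/'Movies/Action/Terminator.mpg'
```

Because the remainder of the path is wrapped in single quotes, names containing spaces survive, but a file name that itself contains a single quote would break the generated command.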

 

Joe L.


Added this to the UnRAID Add Ons wiki page, here.  Feel free to edit.

 

This needs more instruction I think, and an example, for new users.  Could use a link to the original thread too.


This topic originally began here, in the "Spin down timers - are they in HDD firmware or stored in slackware" thread.

 

The easiest way to identify duplicates is to install the UnMENU addon, and use its Dupe files plugin.

 

If you don't install UnMENU, the syslog gives you *some* information, enough to figure out where the duplicates are.  To locate them manually, find a file listed in the syslog as a "duplicate object", note its drive and path, then search the syslog for additional copies on other drives.  That gives you a list of every copy except the first.  You can assume the first copy has the same path, but is on one of the drives numbered LOWER than the lowest drive you found listed in the syslog.  An example:

 

  /mnt/disk2/Movies/Action/Terminator.mpg  (first one is never a duplicate, will not be in syslog)

  /mnt/disk3/Movies/Action/Terminator.mpg  (found in syslog as "duplicate object")

  /mnt/disk6/Movies/Action/Terminator.mpg  (found in syslog as "duplicate object")

 

The syslog will indicate that Terminator is duplicated twice, with copies on Disk 3 and Disk 6, and you can conclude that there is a third copy, and that it is on either Disk 1 or Disk 2, with the same path as the others.
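The manual search for the unlogged first copy can be sketched as a small shell function. This is an illustration only, not part of unRAID or UnMENU: the name find_first_copy and its arguments are hypothetical, and it assumes the data disks are mounted as disk1, disk2, ... under a common root such as /mnt.

```shell
# find_first_copy ROOT LOWEST_DISK RELATIVE_PATH
# Prints every copy of RELATIVE_PATH found on disks numbered below
# LOWEST_DISK, where LOWEST_DISK is the lowest disk number the syslog
# reported for that file. (Hypothetical helper, for illustration.)
find_first_copy() {
    root=$1
    lowest=$2
    rel=$3
    n=1
    while [ "$n" -lt "$lowest" ]; do
        candidate="$root/disk$n/$rel"
        # Only report disks where the parallel path really exists.
        [ -e "$candidate" ] && printf '%s\n' "$candidate"
        n=$((n + 1))
    done
    return 0
}

# Example from above: duplicates were logged on disk3 and disk6,
# so the first copy has to be on disk1 or disk2.
find_first_copy /mnt 3 'Movies/Action/Terminator.mpg'
```

If the function prints more than one path, there are more copies on the lower-numbered disks than the syslog alone would suggest, so each one is worth checking.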


How does the duplicate handling work?  Just checking file names?  or some kind of hash check?

It just checks the names.  If files with the same name are in parallel folders on different disks, only the one on the lowest-numbered disk is accessible in the user-share.  The others are logged to the syslog, but the log entry does not tell you where the first one is located, only the subsequent ones.  The script above in this thread finds the file with the same name in the parallel path on each of the disks.

 

The files can be completely different, or identical... It is up to you to figure out what to do with them.
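Since only the names are checked, one way to decide what to do with a pair is to compare the contents yourself. A minimal sketch using the standard cmp utility follows; the function name same_content and the two paths are hypothetical, reusing the example from earlier in the thread.

```shell
# same_content FILE_A FILE_B
# Succeeds (exit status 0) only when both files are byte-for-byte
# identical. cmp -s prints nothing and reports through its exit status.
same_content() {
    cmp -s "$1" "$2"
}

# Hypothetical paths taken from the example earlier in the thread.
a='/mnt/disk2/Movies/Action/Terminator.mpg'
b='/mnt/disk3/Movies/Action/Terminator.mpg'

if same_content "$a" "$b"; then
    echo "identical: one copy can be removed"
else
    echo "different or unreadable: inspect both before deleting anything"
fi
```

A non-zero status covers both "contents differ" and "a file could not be read", so the else branch deliberately avoids suggesting a delete.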




Copyright © 2005-2017 Lime Technology, Inc. unRAID® is a registered trademark of Lime Technology, Inc.