unRAIDFindDuplicates.sh



Is it possible the size being compared is the actual size on disk? If so, a sparse file will be incorrectly reported as failing the size comparison, even though the content of the file, if read out or checksummed, is identical.

Possible.  The size is being obtained using a command of the form

du -s filename

so could you check whether the duplicates are reported as the same size on your system?  If they are not, an alternative command could be used.

Try
du -sb filename

Link to comment

Is it possible the size being compared is the actual size on disk? If so, a sparse file will be incorrectly reported as failing the size comparison, even though the content of the file, if read out or checksummed, is identical.

Possible.  The size is being obtained using a command of the form

du -s filename

so could you check whether the duplicates are reported as the same size on your system?  If they are not, an alternative command could be used.

Try
du -sb filename

If you change the script on your system (it should be easy enough to find the du -s command in the script), does it fix the issue for you?  As I currently do not have an example of it going wrong, it is a bit harder to test here.  Actually, looking at the du options, I am not sure why I used the -s option - it looks as if simply using the -b option instead would work?
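
For anyone who wants to check the behaviour themselves, a quick test along the following lines (the file name is just an example) shows why the on-disk size trips up the comparison for sparse files:

truncate -s 1G sparse_test_file    # creates a sparse 1GB file with no blocks actually allocated
du -s sparse_test_file             # reports blocks used on disk, so a sparse file looks tiny
du -sb sparse_test_file            # -b reports the apparent size in bytes (1073741824 here)

If the two du forms disagree for your duplicates but the checksums match, the sparse-file explanation fits.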

 

I find it intriguing that the script has been available for over a year and this is the first time this issue has come up.  It just shows how difficult it can be to allow for all the edge cases.

Link to comment

du -b

 

That seems to solve the issue I was having with this script. Thank you!

Thanks for confirming that works.

 

I will do a bit of testing at my end and assuming nothing shows up I will upload an updated version of the script to the first post in this thread.

Link to comment
  • 2 months later...

First, thanks for this script :)

 

I found this because I stumbled upon an empty directory in one of my shares and was looking for something to locate those.

 

The good news: I found about 250GB of duplicate files I had no idea existed. So again, thank you.

 

The bad news: I cannot for the life of me get the -f/-F args to produce anything.  I created a duplicate folder on the chance that the one I stumbled on was the only one, but it is not being reported by the script.
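
(In case it helps anyone reading along, a manual check for empty directories across the disks - independent of the script - can be done with something like:

find /mnt/disk*/ -type d -empty

but I would still like the -f/-F options themselves to work.)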

 

Any help appreciated,

 

Thanks!

Link to comment
  • 7 months later...

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

 

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

Link to comment

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

 

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

Not sure if the script is still being maintained, but the Fix Common Problems plugin also checks for dupes as part of its extended tests.

 

Sent from my LG-D852 using Tapatalk

 

 

Link to comment

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

 

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

I'll look to see if this can be fixed simply and, if so, I will apply an update.  However, if it is going to be hard to do without major changes to the script, I probably will not bother.
Link to comment

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

 

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

I'll look to see if this can be fixed simply and, if so, I will apply an update.  However, if it is going to be hard to do without major changes to the script, I probably will not bother.

No worries. I didn't expect anything; I just thought it was worth mentioning in case others were having issues now or in the future. Thanks itimpi, and great script!

 

Sent from my SM-G930F using Tapatalk

 

 

Link to comment

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

I'll look to see if this can be fixed simply and, if so, I will apply an update.  However, if it is going to be hard to do without major changes to the script, I probably will not bother.

No worries. I didn't expect anything; I just thought it was worth mentioning in case others were having issues now or in the future. Thanks itimpi, and great script!

 

Sent from my SM-G930F using Tapatalk

I've worked out which line the message comes from.  It appears to be a quirk of the way bash handles wildcard expansion.  I will have to brush up on my bash special-character handling to see if I can find a way of avoiding the issue.
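
Without promising what the fix inside the script will look like, the general bash pattern that avoids this sort of breakage is to quote the variable part of the path while leaving the wildcard unquoted, roughly (using the path from the report above; the variable name is just for the example):

dir="STORAGE/laptop backup/Dropbox/EMPLOYEES/example's"
ls -d /mnt/disk*/"$dir"

The quoted "$dir" keeps the space and the apostrophe intact, while the unquoted /mnt/disk*/ part still expands across the disks.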

 

Link to comment

Sorry for reviving an old thread but just tried this out today and realised the script doesn't like apostrophes in file names.

 

Not sure if it's worth fixing or not?

 

ls: cannot access '/mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example': No such file or directory

The full directory is /mnt/disk*/STORAGE/laptop backup/Dropbox/EMPLOYEES/example's/

I'll look to see if this can be fixed simply and, if so, I will apply an update.  However, if it is going to be hard to do without major changes to the script, I probably will not bother.

No worries. I didn't expect anything; I just thought it was worth mentioning in case others were having issues now or in the future. Thanks itimpi, and great script!

 

Sent from my SM-G930F using Tapatalk

I've worked out which line the message comes from.  It appears to be a quirk of the way bash handles wildcard expansion.  I will have to brush up on my bash special-character handling to see if I can find a way of avoiding the issue.

Thanks itimpi. Please don't waste your time if it's too much work, though. It's only my OCD that makes me want a clean, error-free log file. Cheers

 

Sent from my SM-G930F using Tapatalk

 

 

Link to comment
  • 7 months later...

Does anyone know a good way to automatically delete the duplicate files that this script finds?

 

I had a 6TB drive become unmountable and recovered over 5TB of files from the emulated drive, then realized I hadn't tried repairing the filesystem. Repairing it got my drive back, so now I've got 5+TB of duplicate files spread all across my array.

Link to comment
  • 3 weeks later...

Have you given the script execute permission?  If you downloaded it to the flash drive this should be automatic (because it is FAT32 format), but that will not be the case if it was put elsewhere.  Alternatively, run it using the ‘sh’ command, which does not require the script to have ‘execute’ permission.
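
In other words, something along these lines (the path is only a placeholder - use wherever you actually saved the script):

chmod +x /path/to/unRAIDFindDuplicates.sh   # only needed if the copy is not on the FAT32 flash drive
/path/to/unRAIDFindDuplicates.sh
# or, without touching permissions at all:
sh /path/to/unRAIDFindDuplicates.sh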

Edited by itimpi
  • Like 1
Link to comment
  • 3 months later...
On 9/16/2014 at 3:03 PM, itimpi said:

I had thought to add an option to automatically delete duplicates, but certainly did not want it in an initial iteration of the tool.

 

+1 to automatically remove duplicates.

 

Hi, I just discovered your tool and it turns out I have numerous duplicates because one of my disks went offline and Emby rebuilt the metadata elsewhere. Because it's metadata, the filenames match, but the files may not match at the binary level. I know which disk contains the duplicate data, so it would be great if you could add a switch to your script to specify which disk to remove the duplicates from, if/when you ever add a delete option to it.  Thanks.

Link to comment

Ha, funnily enough, I'm working on finding duplicates right now.  I noticed the File Integrity plugin has an option to do so, and I didn't realize how many of my files have gotten duplicated (not sure how).

 

I checked Community Applications for "duplicate" and found the dupeguru docker, which just finished installing.  I'm running it for the first time right now, but so far it doesn't seem to be doing much of anything.

 

I'd prefer if this tool would do it for me instead; less to install/maintain.

 

**it only looks for duplicate files in the shares, not duplicates on multiple disks, so not a solution anyway

Edited by JustinChase
add info
Link to comment
19 minutes ago, JustinChase said:

**it only looks for duplicate files in the shares, not duplicates on multiple disks, so not a solution anyway

Not quite sure what you mean by this?  It looks for duplicates with the same relative path (i.e. the same name) that exist on more than one disk.  This can happen if you have been copying files between disks on unRAID and have not deleted the source files.  That can be a bit confusing on unRAID, as when you look at the User Share level you will only see one instance.  It does NOT look for files with different paths/names that have the same contents, if that is what you are looking for.
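
If it helps to picture it, the kind of duplicate the script reports is what a manual check along these lines would turn up (an illustration of the idea only, not how the script itself is written):

# list every file on every data disk with the /mnt/diskN/ prefix stripped,
# then print any relative path that appears on more than one disk
find /mnt/disk*/ -type f | sed 's|^/mnt/disk[0-9]*/||' | sort | uniq -d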

Link to comment
36 minutes ago, Joseph said:

 

+1 to automatically remove duplicates.

 

Hi, I just discovered your tool and it turns out I have numerous duplicates because one of my disks went offline and Emby rebuilt the metadata elsewhere. Because it's metadata, the filenames match, but the files may not match at the binary level. I know which disk contains the duplicate data, so it would be great if you could add a switch to your script to specify which disk to remove the duplicates from, if/when you ever add a delete option to it.  Thanks.

I am afraid that I am unlikely to add a delete switch, as that is too dangerous and could easily result in data loss if you are not very careful.  I thought about it when developing the script but then decided against it, as I could easily delete the wrong copy if the files are not actually identical at the binary level.

 

The easiest way to handle this would be to output the results to a file and then edit that file into a shell script with ‘rm’ commands for the files that are not wanted.  Depending on how your duplicates are located, you might be able to use an ‘rm -r’ type command on the containing directories.
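
As a very rough sketch of that workflow (the output file name and the rm line are made-up examples - check every line carefully before running anything like this):

./unRAIDFindDuplicates.sh > /boot/duplicates.txt    # capture the report to a file on the flash drive
# edit duplicates.txt so each unwanted copy becomes a line such as:
#   rm "/mnt/disk3/Movies/SomeFilm/SomeFilm.mkv"
sh /boot/duplicates.txt                             # then run the edited file as a script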

Link to comment
11 minutes ago, itimpi said:

Not quite sure what you mean by this?  It looks for duplicates with the same relative path (i.e. the same name) that exist on more than one disk.  This can happen if you have been copying files between disks on unRAID and have not deleted the source files.  That can be a bit confusing on unRAID, as when you look at the User Share level you will only see one instance.  It does NOT look for files with different paths/names that have the same contents, if that is what you are looking for.

 

It looks like it only showed duplicate files with different paths/names, and has no reference to the disk they are on.  The File Integrity plugin shows dups on different disks, like the 2nd screenshot.  This is what I was looking for.  It just seems this tool doesn't update in real time, as some of the dups it reported have been corrected, but it's still showing them.

 

Oh well.

dups.jpg

dups2.jpg

Link to comment
44 minutes ago, JustinChase said:

**it only looks for duplicate files in the shares, not duplicates on multiple disks, so not a solution anyway

FYI, FCP (Fix Common Problems) running the extended tests looks for duplicated files (the same file name existing on multiple disks in the same folder).

Link to comment
12 minutes ago, JustinChase said:

It looks like it only showed duplicate files with different paths/names, and has no reference to the disk they are on.  The File Integrity plugin shows dups on different disks, like the 2nd screenshot.  This is what I was looking for.  It just seems this tool doesn't update in real time, as some of the dups it reported have been corrected, but it's still showing them.

Not sure what makes you think this - it definitely shows which disks duplicated files are on!   I developed it and used it to identify where I had duplicate files on my own disks.

 

It was developed a long time ago (on unRAID v5, although it also works on v6), before plugins like the File Integrity plugin, so feel free to use that if you prefer.

Link to comment
24 minutes ago, itimpi said:

Not sure what makes you think this - it definitely shows which disks duplicated files are on!   I developed it and used it to identify where I had duplicate files on my own disks.

 

It was developed a long time ago (on unRAID v5, although it also works on v6), before plugins like the File Integrity plugin, so feel free to use that if you prefer.

 

My bad, I didn't realize which thread this was in; I thought I was in a different one.  My mistake, sorry.

Link to comment
2 hours ago, itimpi said:

I am afraid that I am unlikely to add a delete switch, as that is too dangerous and could easily result in data loss if you are not very careful.  I thought about it when developing the script but then decided against it, as I could easily delete the wrong copy if the files are not actually identical at the binary level.

 

The easiest way to handle this would be to output the results to a file and then edit that file into a shell script with ‘rm’ commands for the files that are not wanted.  Depending on how your duplicates are located, you might be able to use an ‘rm -r’ type command on the containing directories.

Not sure how to create a shell script, but I'll look around to figure it out and try that... thanks for the tip!

Link to comment
