get the md5sum of all files in the first dir, i.e. dir1
find /dir1/ -type f -exec md5sum {} + | sort -k 1 > dir1.txt
Output will contain lines like this (md5sum prints a 32-character hex hash, then the path):
01ac660edad41658b5d6ba67f371aa7e  /dir1/Deepak/Programm/ocv2/star.jpg
033098d600668b69a8e607899687a9ab  /dir1/Deepak/MIT/Seminar/report.log
sort usage
- the argument -k is used to sort by column
- in our case we sort by the md5sum field present in column 1 - that's why we use -k 1
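For example, the same listing could be re-sorted by the path column instead by pointing -k at column 2:

sort -k 2 dir1.txt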
repeat the same for the other dir, i.e. dir2, to compare with
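Assuming the second tree lives at /dir2/, the mirrored command is:

find /dir2/ -type f -exec md5sum {} + | sort -k 1 > dir2.txt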
merge both files and sort
cat dir1.txt dir2.txt | sort -k 1 > all_files_md5.txt
run this awk script to get all the files that have duplicates
delete_duplicate_md5_but_keep_line.awk
awk '{if ( $1==old ) { if (cnt == 0 ) {cnt=cnt+1; print "\n\n" oldline "\n" $0 } else { print $0 } } else { cnt=0; }; old=$1; oldline=$0;}' all_files_md5.txt > duplicates.txt
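The same program, written out as the .awk file named above with comments (equivalent to the one-liner):

# delete_duplicate_md5_but_keep_line.awk
# input is sorted by md5, so identical hashes sit on adjacent lines
{
    if ($1 == old) {
        if (cnt == 0) {
            # first repeat of a hash: print the held previous line too,
            # with blank lines to separate this group from the last one
            cnt = cnt + 1
            print "\n\n" oldline "\n" $0
        } else {
            # further repeats: the group is already open, print only this line
            print $0
        }
    } else {
        # a new hash appeared: reset the per-group counter
        cnt = 0
    }
    old = $1
    oldline = $0
}

Run it with: awk -f delete_duplicate_md5_but_keep_line.awk all_files_md5.txt > duplicates.txt. On GNU coreutils, uniq -w 32 --all-repeated=separate all_files_md5.txt gives roughly the same duplicate groups in one step.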
diff the duplicates with the original md5 file list
diff -u duplicates.txt all_files_md5.txt | grep -E '^\+[^+]' | sort -k 2
explanation
- we find the diff between the duplicates and the original
- lines marked as added (prefix +) are our unique files; grep -E '^\+[^+]' keeps them while skipping the +++ file header (a bare grep + would also match any path containing a literal +)
- we sort them by folder path, so we can address the conflicts
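Alternatively, assuming GNU coreutils, uniq can produce the unique list directly, since the md5 hash occupies the first 32 characters of every line:

uniq -w 32 -u all_files_md5.txt | sort -k 2

Here -w 32 compares only the hash prefix and -u keeps lines whose hash occurs exactly once; the trailing sort again orders the result by path.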