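Find duplicate files and delete them
I am writing a Python program to find and remove duplicate files from a folder.
I have multiple copies of MP3 files and of some other files. I am using the SHA-1 algorithm.
How can I find these duplicate files and remove them?
Recursive folders version:
This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it scans all of them recursively and reports every duplicate found.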
    import sys
    import os
    import hashlib

    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes = {}  # (digest, size) -> first path seen with that content
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    hashobj = hash()
                    # "with" closes the file even if reading fails
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    if duplicate:
                        print("Duplicate found: %s and %s" % (full_path, duplicate))
                    else:
                        hashes[file_id] = full_path

    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
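    import hashlib
    import os

    def remove_duplicates(dir):
        unique = set()
        for filename in os.listdir(dir):
            # os.listdir() returns bare names, so build the full path first
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                # read in binary mode so the hash covers the exact bytes on disk
                with open(filepath, 'rb') as fobj:
                    filehash = hashlib.md5(fobj.read()).hexdigest()
                if filehash not in unique:
                    unique.add(filehash)
                else:
                    os.remove(filepath)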
Edit:
For MP3s, you may also be interested in this topic: Detect duplicate MP3 files with different bitrates and/or different ID3 tags?
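Fastest algorithm - 100x performance increase compared to the accepted answer (really :))
The approaches in the other solutions are very cool, but they forget about an important property of duplicate files: they have the same file size. Calculating the expensive hash only on files with the same size saves a tremendous amount of CPU; the performance comparison is at the end, and here is the explanation.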
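To make the size argument concrete, here is a minimal sketch of the filtering step on its own. The helper names `group_by_size` and `size_candidates` are made up for this illustration and do not come from any of the answers; only files that share their size with at least one other file are kept as candidates for hashing.

```python
import os
from collections import defaultdict


def group_by_size(root):
    """Map file size -> list of paths, for every file below root."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
    return by_size


def size_candidates(root):
    """Keep only the size groups that could contain duplicates."""
    return [paths for paths in group_by_size(root).values() if len(paths) > 1]
```

A file whose size is unique never appears in the result, so it is never opened or hashed at all.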
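Iterating on the solid answer given by @nosklo, and borrowing @Raffi's idea of taking a fast hash of just the beginning of each file and calculating the full hash only on collisions in the fast hash, here are the steps:
- Build a hash table of the files, where the file size is the key.
- For files with the same size, create a hash table keyed by the hash of their first 1024 bytes; non-colliding elements are unique.
- For files with the same hash on the first 1k bytes, calculate the hash of the full contents; files with matching hashes are NOT unique.
Code: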
    #!/usr/bin/env python
    import sys
    import os
    import hashlib

    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
        hashobj = hash()
        with open(filename, 'rb') as file_object:
            if first_chunk_only:
                hashobj.update(file_object.read(1024))
            else:
                for chunk in chunk_reader(file_object):
                    hashobj.update(chunk)
        return hashobj.digest()

    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes_by_size = {}  # file size -> list of paths
        hashes_on_1k = {}    # hash of the first 1024 bytes -> list of paths
        hashes_full = {}     # hash of the full content -> first path seen

        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    try:
                        file_size = os.path.getsize(full_path)
                    except OSError:
                        # not accessible (permissions, etc) - skip this file
                        continue
                    hashes_by_size.setdefault(file_size, []).append(full_path)

        # For all files with the same file size, get their hash on the 1st 1024 bytes
        for __, files in hashes_by_size.items():
            if len(files) < 2:
                continue  # this file size is unique, no need to spend cpu cycles on it
            for filename in files:
                small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
                hashes_on_1k.setdefault(small_hash, []).append(filename)

        # For all files with the same hash on the 1st 1024 bytes, get their hash
        # on the full file - collisions will be duplicates
        for __, files in hashes_on_1k.items():
            if len(files) < 2:
                continue  # this hash of the first 1k bytes is unique, no need to spend cpu cycles on it
            for filename in files:
                full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: %s and %s" % (filename, duplicate))
                else:
                    hashes_full[full_hash] = filename

    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
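The script above only reports duplicates, while the question also asks how to delete them. One possible way to add that step is sketched below. The function `delete_duplicates` and its `dry_run` flag are hypothetical and not part of the answer; the sketch assumes the caller has already grouped paths whose contents are known to be identical, and it defaults to deleting nothing.

```python
import os


def delete_duplicates(groups, dry_run=True):
    """Keep the first path of each group and remove the others.

    `groups` is an iterable of lists of paths whose contents are already
    known to be identical. With dry_run=True nothing is deleted; the
    function only returns what it would remove.
    """
    removed = []
    for paths in groups:
        keeper, extras = paths[0], paths[1:]
        for path in extras:
            if os.path.samefile(keeper, path):
                # same underlying file (hard link, or one path listed twice):
                # removing it would not free a real copy, so leave it alone
                continue
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Running it once with the default `dry_run=True` and reviewing the returned list before passing `dry_run=False` keeps the deletion step under manual control.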
And here's the fun part: the performance comparison.
Baseline -
- a directory with 1047 files (32 mp4, 1015 jpg), total size 5445.998 MiB, i.e. my phone's camera auto-upload directory :)
- a small (but fully functional) processor: 1600 BogoMIPS, 1.2 GHz, 32 KB L1 + 256 KB L2 cache, /proc/cpuinfo:
    Processor: Feroceon 88FR131 rev 1 (v5l) BogoMIPS: 1599.07
(i.e. my low-end NAS :), running Python 2.7.11.
So, the output of @nosklo's very handy solution:
    root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

    real 5m44.198s
    user 4m44.550s
    sys 0m33.530s
And here is the version with the size filter first, then the small hashes, and finally the full hash if collisions are found:
    root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

    real 0m1.398s
    user 0m1.200s
    sys 0m0.080s
Both versions were run 3 times each, to get the average of the time needed.
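The figures here come from the shell's `time` command. For anyone who prefers to repeat the measurement from inside Python, a rough sketch follows; `average_cpu_seconds` is a name invented for this example. It relies on `time.process_time()`, which reports the sum of user and system CPU time of the current process, so it is comparable to user + sys rather than to the wall-clock `real` line.

```python
import time


def average_cpu_seconds(func, *args, runs=3):
    """Call func(*args) `runs` times and return the mean CPU seconds used."""
    timings = []
    for _ in range(runs):
        start = time.process_time()
        func(*args)
        timings.append(time.process_time() - start)
    return sum(timings) / len(timings)
```

Keep in mind that runs after the first one may benefit from the operating system's disk cache, so the runs are not fully independent.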
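So v1 takes (user + sys) 284s, the other 2s; quite a difference, huh :) With this speed-up, one could move to SHA512 or something even fancier: the performance penalty is mitigated by the smaller number of calculations needed.
Downsides:
- More disk access than the other versions: every file is accessed once for its size stat (that's cheap, but still disk IO), and every duplicate is opened twice (once for the small hash of the first 1k bytes, and once for the hash of the full contents).
- Consumes more memory, because the hash tables are kept in memory at runtime.
I wrote one in Python some time ago - you're welcome to use it.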
    import sys
    import os
    import hashlib

    def check_path(filepath, hashes, p=sys.stdout.write):
        """Report filepath if its SHA-1 was already seen, else remember it."""
        with open(filepath, 'rb') as fobj:
            hash = hashlib.sha1(fobj.read()).hexdigest()
        if hash in hashes:
            p('DUPLICATE FILE\n'
              ' %s\n'
              'of %s\n' % (filepath, hashes[hash]))
        else:
            hashes.setdefault(hash, filepath)

    def scan(dirpath, hashes=None):
        """Walk dirpath recursively and check every file in it."""
        if hashes is None:
            hashes = {}
        for root, dirs, files in os.walk(dirpath):
            for filename in files:
                check_path(os.path.join(root, filename), hashes)

    if len(sys.argv) > 1:
        scan(sys.argv[1])
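Faster algorithm
If many files of "big size" (images, mp3, pdf documents) have to be analyzed, it is more interesting/faster to use the following comparison algorithm:
- A first fast hash is performed on the first N bytes of the file (say 1KB). This hash tells whether files are different without doubt, but it cannot tell whether two files are exactly the same (accuracy of the hash, limited amount of data read from disk).
- A second, slower hash, which is more accurate and is performed on the whole content of the file, if a collision occurs in the first stage.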
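A small self-contained demonstration of why the first stage can only rule files out: two files that share their first 1024 bytes but differ afterwards get the same quick hash and different full hashes. The helper `sha1_of` and the function `demo` are illustrative names, not part of the answer, and the 1024-byte prefix is simply the "say 1KB" example from the list above.

```python
import hashlib
import os
import tempfile


def sha1_of(path, limit=None):
    """SHA-1 of the whole file, or of only its first `limit` bytes."""
    digest = hashlib.sha1()
    with open(path, "rb") as fobj:
        if limit is not None:
            digest.update(fobj.read(limit))
        else:
            for chunk in iter(lambda: fobj.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Return (quick hashes equal?, full hashes equal?) for two files that
    share their first 1024 bytes but differ afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "a.bin")
        second = os.path.join(tmp, "b.bin")
        with open(first, "wb") as fobj:
            fobj.write(b"x" * 1024 + b"tail A")
        with open(second, "wb") as fobj:
            fobj.write(b"x" * 1024 + b"tail B")
        quick_equal = sha1_of(first, 1024) == sha1_of(second, 1024)
        full_equal = sha1_of(first) == sha1_of(second)
        return quick_equal, full_equal


print(demo())  # (True, False): the quick hash collides, the full hash does not
```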
Here is an implementation of this algorithm:
    import hashlib

    def Checksum(current_file_name, check_type='sha512', first_block=False):
        """Computes the hash for the given file. If first_block is True,
        only the first block of size size_block is hashed."""
        # size of one block, and therefore of the "first block": 1 MiB here
        # (the 1KB mentioned in the text is only an example value)
        size_block = 1024 * 1024
        d = {'sha1': hashlib.sha1, 'md5': hashlib.md5, 'sha512': hashlib.sha512}
        if check_type not in d:
            raise Exception("Unknown checksum method")
        with open(current_file_name, 'rb') as f:
            key = d[check_type]()
            while True:
                s = f.read(size_block)
                key.update(s)
                if len(s) < size_block or first_block:
                    break
        return key.hexdigest().upper()

    def find_duplicates(files):
        """Find duplicates among a set of files.

        The implementation uses two types of hashes:
        - a small and fast one on the first block of the file,
        - and, in case of collision, a complete hash on the file.
          The complete hash is not computed twice.
        It prints the files that seem to have the same content
        (according to the hash method) at the end.
        """
        print('Analyzing', len(files), 'files')

        # this dictionary will receive small hashes
        d = {}
        # this dictionary will receive full hashes. It is filled
        # only in case of collision on the small hash (contains at least two
        # elements)
        duplicates = {}

        for f in files:
            # small hash to be fast
            check = Checksum(f, first_block=True, check_type='sha1')
            if check not in d:
                # d[check] is a list of files that have the same small hash
                d[check] = [(f, None)]
            else:
                l = d[check]
                l.append((f, None))
                for index, (ff, checkfull) in enumerate(l):
                    if checkfull is None:
                        # computes the full hash in case of collision
                        checkfull = Checksum(ff, first_block=False)
                        l[index] = (ff, checkfull)
                        # for each new full hash computed, check if there is
                        # a collision in the duplicate dictionary.
                        duplicates.setdefault(checkfull, []).append(ff)

        # prints the detected duplicates
        if len(duplicates) != 0:
            print()
            print("The following files have the same sha512 hash")
            for h, lf in duplicates.items():
                if len(lf) == 1:
                    continue
                print('Hash value', h)
                for f in lf:
                    print('\t', f)
        return duplicates
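The find_duplicates function takes a list of files. This way, it is also possible to compare two directories (for instance, to better synchronize their content). An example of a function that creates a list of files with specified extensions and avoids entering some directories is below: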
    import os

    def getFiles(_path, extensions=['.png'], subdirs=False, avoid_directories=None):
        """Returns the list of files in the path '_path', of extension in
        'extensions'. 'subdirs' indicates if the search should also be
        performed in the subdirectories. If extensions = [] or None, all
        files are returned.
        avoid_directories: if set, do not parse subdirectories that match
        any element of avoid_directories."""
        l = []
        extensions = [p.lower() for p in extensions] if extensions is not None else None
        for root, dirs, files in os.walk(_path, topdown=True):
            for name in files:
                if (extensions is None or len(extensions) == 0 or
                        os.path.splitext(name)[1].lower() in extensions):
                    l.append(os.path.join(root, name))
            if not subdirs:
                # emptying dirs in place stops os.walk from descending
                del dirs[:]
            elif avoid_directories is not None:
                for d in avoid_directories:
                    if d in dirs:
                        dirs.remove(d)
        return l
This method is convenient for not parsing .svn paths, for instance, which would surely produce colliding files in find_duplicates.
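Skipping directories this way works because `os.walk` with `topdown=True` lets the caller edit the list of subdirectories in place before it descends. The following is a compact sketch of just that mechanism; `walk_files` and its `skip_dirs` parameter are names chosen for this example only.

```python
import os


def walk_files(root, skip_dirs=(".svn",)):
    """Yield file paths below root, never descending into skip_dirs."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Assigning to the slice edits the list os.walk itself iterates over,
        # so the pruned directories are not visited at all.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Rebinding the name (`dirnames = [...]`) would not work here, since `os.walk` would keep using its original list.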
Feedback is welcome.
    import hashlib
    import os
    import sys

    def read_chunk(fobj, chunk_size=2048):
        """Files can be huge so read them in chunks of bytes."""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def remove_duplicates(dir, hashfun=hashlib.sha512):
        unique = set()
        for filename in os.listdir(dir):
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                hashobj = hashfun()
                with open(filepath, 'rb') as fobj:
                    for chunk in read_chunk(fobj):
                        hashobj.update(chunk)
                # the size of the hashobj is constant
                hashfile = hashobj.hexdigest()
                if hashfile not in unique:
                    unique.add(hashfile)
                else:
                    os.remove(filepath)

    try:
        hashfun = hashlib.sha256
        remove_duplicates(sys.argv[1], hashfun)
    except IndexError:
        print("Please pass a path to a directory with duplicate files "
              "as a parameter to the script.")
@IanLee1521 has a nice solution here. It is very efficient because it first checks for duplicates based on file size.
    #! /usr/bin/env python
    # Originally taken from:
    # http://www.pythoncentral.io/finding-duplicate-files-with-python/
    # Original Author: Andres Torres
    # Adapted to only compute the md5sum of files with the same size
    import argparse
    import os
    import sys
    import hashlib

    def find_duplicates(folders):
        """
        Takes in an iterable of folders and prints & returns the duplicate files
        """
        dup_size = {}
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to dup_size
                join_dicts(dup_size, find_duplicate_size(i))
            else:
                print('%s is not a valid path, please verify' % i)
                return {}

        print('Comparing files with the same size...')
        dups = {}
        for dup_list in dup_size.values():
            if len(dup_list) > 1:
                join_dicts(dups, find_duplicate_hash(dup_list))
        print_results(dups)
        return dups

    def find_duplicate_size(parent_dir):
        # Dups in format {size: [names]}
        dups = {}
        for dirName, subdirs, fileList in os.walk(parent_dir):
            print('Scanning %s...' % dirName)
            for filename in fileList:
                # Get the path to the file
                path = os.path.join(dirName, filename)
                # Check to make sure the path is valid.
                if not os.path.exists(path):
                    continue
                # Calculate sizes
                file_size = os.path.getsize(path)
                # Add or append the file path
                if file_size in dups:
                    dups[file_size].append(path)
                else:
                    dups[file_size] = [path]
        return dups

    def find_duplicate_hash(file_list):
        print('Comparing: ')
        for filename in file_list:
            print('    {}'.format(filename))
        dups = {}
        for path in file_list:
            file_hash = hashfile(path)
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
        return dups

    # Joins two dictionaries
    def join_dicts(dict1, dict2):
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]

    def hashfile(path, blocksize=65536):
        hasher = hashlib.md5()
        with open(path, 'rb') as afile:
            buf = afile.read(blocksize)
            while len(buf) > 0:
                hasher.update(buf)
                buf = afile.read(blocksize)
        return hasher.hexdigest()

    def print_results(dict1):
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates Found:')
            print(
                'The following files are identical. The name could differ, but the'
                ' content is identical'
            )
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')
        else:
            print('No duplicate files found.')

    def main():
        parser = argparse.ArgumentParser(description='Find duplicate files')
        parser.add_argument(
            'folders', metavar='dir', type=str, nargs='+',
            help='A directory to parse for duplicates',
        )
        args = parser.parse_args()
        find_duplicates(args.folders)

    if __name__ == '__main__':
        sys.exit(main())
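To be on the safe side (removing them automatically can be dangerous if something goes wrong!), here is what I use, based on @zalew's answer.
Please note that the MD5 sum code differs slightly from @zalew's original (it reads the files in binary mode), because his code produced too many false duplicates (that's why I said removing them automatically is dangerous!).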
    import hashlib
    import os

    unique = dict()
    for filename in os.listdir('.'):
        if os.path.isfile(filename):
            with open(filename, 'rb') as fobj:
                filehash = hashlib.md5(fobj.read()).hexdigest()
            if filehash not in unique:
                unique[filehash] = filename
            else:
                print(filename + ' is a duplicate of ' + unique[filehash])
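In the same cautious spirit, a reported pair can be double-checked byte for byte before anything is removed. This is only a sketch of one way to do it, using the standard library's `filecmp` module; the function name `is_exact_copy` is made up for this example and is not part of the answer.

```python
import filecmp


def is_exact_copy(candidate, original):
    """True only if both files have byte-for-byte identical contents."""
    # shallow=False forces a content comparison instead of trusting os.stat()
    return filecmp.cmp(candidate, original, shallow=False)
```

Calling it on each pair the script prints, and deleting only when it returns True, means a hash-related mistake cannot cause the loss of a file that is not really a copy.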