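Python: Comparing two CSV files and searching for similar items
I have two CSV files that I'm trying to compare to find the items they have in common. The first file, hosts.csv, looks like this: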
    Path  Filename  Size  Signature
    C:\   a.txt     14kb  012345
    D:\   b.txt     99kb  678910
    C:\   c.txt     44kb  111213
The second file, masterlist.csv, looks like this:
    Filename  Signature
    b.txt     678910
    x.txt     111213
    b.txt     777777
    c.txt     999999
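As you can see, the rows don't line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the Signature column. I know the comparison would look something like this: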
hosts[3] == masterlist[1]
I'm looking for a solution that gives me something like the following (basically the hosts.csv file with a new RESULTS column):
    Path  Filename  Size  Signature  RESULTS
    C:\   a.txt     14kb  012345     NOT FOUND in masterlist
    D:\   b.txt     99kb  678910     FOUND in masterlist (row 1)
    C:\   c.txt     44kb  111213     FOUND in masterlist (row 2)
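I've searched through the posts and found something similar here, but I don't quite understand it, as I'm still learning Python.
Edit: I'm using Python 2.6.
Edit: While my solution works correctly, check Martijn's answer below for a more efficient solution.
You can find the documentation for the Python csv module here.
What you're looking for is something like this: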
    import csv

    f1 = file('hosts.csv', 'r')
    f2 = file('masterlist.csv', 'r')
    f3 = file('results.csv', 'w')

    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)

    # Read the master list into memory so it can be scanned repeatedly.
    masterlist = list(c2)

    for hosts_row in c1:
        row = 1
        found = False
        # Bound before the inner loop so it exists even if masterlist is empty.
        results_row = hosts_row
        for master_row in masterlist:
            if hosts_row[3] == master_row[1]:
                results_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
            row = row + 1
        if not found:
            results_row.append('NOT FOUND in master list')
        c3.writerow(results_row)

    f1.close()
    f2.close()
    f3.close()
srgerg's answer is very inefficient, because it runs in quadratic time. Here is a linear-time solution, using Python 2.6-compatible syntax:
    import csv

    # Map each signature to its row index in masterlist.csv.
    with open('masterlist.csv', 'rb') as master:
        master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

    with open('hosts.csv', 'rb') as hosts:
        with open('results.csv', 'wb') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            # Copy the header row and add the new column name.
            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[3])
                if index is not None:
                    # '{0}' rather than '{}': auto-numbered fields need Python 2.7+.
                    message = 'FOUND in master list (row {0})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])
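This first builds a dictionary that maps each signature in masterlist.csv to its row number. A dictionary lookup takes constant time on average, so the second loop, over the rows of hosts.csv, does not depend on the number of rows in masterlist.csv. Not to mention that the code is a lot simpler.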
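To make the lookup idea concrete, here is a minimal sketch of just the indexing step. It is written for Python 3 and reads from an in-memory string, so the comma-separated sample data and the helper name `build_signature_index` are assumptions for illustration, not part of the answer above. Unlike the answer's `dict(...)` construction, where the last row wins when a signature repeats, this sketch keeps the first row.

```python
import csv
import io

# Hypothetical sample data mirroring masterlist.csv from the question,
# written comma-separated here (an assumption about the real file).
MASTER_CSV = """Filename,Signature
b.txt,678910
x.txt,111213
b.txt,777777
c.txt,999999
"""


def build_signature_index(lines):
    """Map each signature to the 1-based data row where it first appears."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    index = {}
    for row_number, row in enumerate(reader, 1):
        # setdefault keeps the first row number if a signature repeats
        index.setdefault(row[1], row_number)
    return index


index = build_signature_index(io.StringIO(MASTER_CSV))
print(index.get("678910"))  # row number, found with a single dict lookup
print(index.get("012345"))  # None, meaning the signature is absent
```

Each host row then costs one `index.get(...)` call, however long the master list is.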
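Python's csv and collections modules, OrderedDict in particular, are really helpful here. You want an OrderedDict so that the keys keep their original order. You don't have to use one, but it is useful! (Note that OrderedDict was added in Python 2.7.)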
    import csv
    from collections import OrderedDict

    signature_row_map = OrderedDict()

    # Index the host rows by signature, preserving file order.
    with open('hosts.csv') as file_object:
        for line in csv.DictReader(file_object, delimiter='\t'):
            signature_row_map[line['Signature']] = {'line': line, 'found_at': None}

    # Record the master list row number for every signature we know about.
    with open('masterlist.csv') as file_object:
        for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
            if line['Signature'] in signature_row_map:
                signature_row_map[line['Signature']]['found_at'] = i

    with open('newhosts.csv', 'w') as file_object:
        fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
        writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
        writer.writer.writerow(fieldnames)

        for signature_info in signature_row_map.itervalues():
            result = '{0} FOUND in masterlist {1}'
            # explicit check for sentinel
            if signature_info['found_at'] is not None:
                result = result.format('', '(row %s)' % signature_info['found_at'])
            else:
                result = result.format('NOT', '')
            payload = signature_info['line']
            # strip() removes the space left over from the empty placeholder
            payload['RESULTS'] = result.strip()
            writer.writerow(payload)
Here is the output using the test CSV files:
    Path	Filename	Size	Signature	RESULTS
    C:\	a.txt	14kb	012345	NOT FOUND in masterlist
    D:\	b.txt	99kb	678910	FOUND in masterlist (row 1)
    C:\	c.txt	44kb	111213	FOUND in masterlist (row 2)
Please excuse the misalignment; the columns are tab-separated :)
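For anyone who wants to try the tab-separated approach without creating files, the following is a rough Python 3 sketch of the same idea using in-memory strings. The sample data is copied from the question, while the function name `annotate` and the use of `io.StringIO` are assumptions made for illustration. It also uses `DictWriter.writeheader()`, which exists in Python 2.7 and 3.2 onward, in place of `writer.writer.writerow(fieldnames)`.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical tab-separated sample data based on the question's tables.
HOSTS_TSV = (
    "Path\tFilename\tSize\tSignature\n"
    "C:\\\ta.txt\t14kb\t012345\n"
    "D:\\\tb.txt\t99kb\t678910\n"
    "C:\\\tc.txt\t44kb\t111213\n"
)
MASTER_TSV = (
    "Filename\tSignature\n"
    "b.txt\t678910\n"
    "x.txt\t111213\n"
    "b.txt\t777777\n"
    "c.txt\t999999\n"
)


def annotate(hosts_text, master_text):
    """Return hosts_text with a RESULTS column appended, tab-separated."""
    rows = OrderedDict()  # signature -> host row, in file order
    for row in csv.DictReader(io.StringIO(hosts_text), delimiter="\t"):
        rows[row["Signature"]] = row

    found_at = {}  # signature -> 1-based data row in the master list
    master = csv.DictReader(io.StringIO(master_text), delimiter="\t")
    for number, row in enumerate(master, 1):
        if row["Signature"] in rows:
            found_at[row["Signature"]] = number

    out = io.StringIO()
    fieldnames = ["Path", "Filename", "Size", "Signature", "RESULTS"]
    writer = csv.DictWriter(out, fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for signature, row in rows.items():
        record = dict(row)  # copy so the parsed row is left untouched
        if signature in found_at:
            record["RESULTS"] = "FOUND in masterlist (row %s)" % found_at[signature]
        else:
            record["RESULTS"] = "NOT FOUND in masterlist"
        writer.writerow(record)
    return out.getvalue()


print(annotate(HOSTS_TSV, MASTER_TSV))
```

Keeping the row numbers in a separate `found_at` dict, rather than nesting them inside each entry, is only a stylistic choice in this sketch.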
The csv module is very handy for parsing CSV files. But just for fun, I'm simply splitting the input on whitespace to get at the data.
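Just parse the data and build a dict from masterlist.csv, with the signature as the key and the row number as the value. Then, for each row of hosts.csv, we can query the dict to find out whether a corresponding entry exists in masterlist.csv and, if so, on which row.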
    #! /usr/bin/env python

    def read_data(filename):
        # 'with' closes the file even if reading fails.
        with open(filename, 'r') as input_source:
            input_source.readline()  # skip the header line
            return [line.split() for line in input_source]

    if __name__ == '__main__':
        hosts = read_data('hosts.csv')
        masterlist = read_data('masterlist.csv')

        # Map each signature (the last column) to its 1-based row number.
        master = dict()
        for index, data in enumerate(masterlist):
            master[data[-1]] = index + 1

        for row in hosts:
            try:
                found = "FOUND in masterlist (row %s)" % master[row[-1]]
            except KeyError:
                found = "NOT FOUND in masterlist"
            line = row + [found]
            print "%s %s %s %s %s" % tuple(line)
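One thing worth knowing about splitting on whitespace: it works for the sample data, but a field containing a space changes the number of columns. The short Python 3 sketch below illustrates this with a hypothetical file name, `my file.txt`, that is not in the question's data. The helper `parse_line` is likewise only an illustration.

```python
def parse_line(line):
    """Split one data line on runs of whitespace, as str.split() does."""
    return line.split()


simple = parse_line("D:\\ b.txt 99kb 678910")
spaced = parse_line("D:\\ my file.txt 99kb 678910")  # hypothetical name with a space

# The signature is still the last field, but the column count differs.
print(len(simple), simple[-1])
print(len(spaced), spaced[-1])
```

Indexing the signature with `[-1]`, as the script above does, keeps the lookup working in that case. The fixed five-placeholder output line would not fit the extra column, though, so the csv module is the safer choice for real data.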