在列表中查找并列出重复项?
如何在Python列表中find重复项并创build重复项的另一个列表? 清单只是整数。
要删除重复使用set(a)
,打印重复 – 类似的东西
a = [1,2,3,2,1,5,6,5,5,5] import collections print [item for item, count in collections.Counter(a).items() if count > 1] ## [1, 2, 5]
请注意, Counter
不是特别有效( 时间 ),可能在这里矫枉过正, set
将performance更好:
seen = set() uniq = [] for x in a: if x not in seen: uniq.append(x) seen.add(x)
或者更简洁:
seen = set() uniq = [x for x in a if x not in seen and not seen.add(x)]
我不推荐后者的样式,因为它不是什么not seen.add(x)
在做(set add()
方法总是返回None
,所以不需要)。
如果列表元素不可散列,则不能使用set / dicts,而必须求助于二次时间解(比较每一个),例如:
a = [ [1], [2], [3], [1], [5], [3] ] no_dupes = [x for n, x in enumerate(a) if x not in a[:n]] print no_dupes # [[1], [2], [3], [5]] dupes = [x for n, x in enumerate(a) if x in a[:n]] print dupes # [[1], [3]]
>>> l = [1,2,3,4,4,5,5,6,1] >>> set([x for x in l if l.count(x) > 1]) set([1, 4, 5])
你不需要点数,只是这个物品是否被看过。 修改了这个问题的答案 :
def list_duplicates(seq): seen = set() seen_add = seen.add # adds all elements it doesn't know yet to seen and all other to seen_twice seen_twice = set( x for x in seq if x in seen or seen_add(x) ) # turn the set into a list (as requested) return list( seen_twice ) a = [1,2,3,2,1,5,6,5,5,5] list_duplicates(a) # yields [1, 2, 5]
以防万一,速度至关重要,下面是一些时机:
# file: test.py import collections def thg435(l): return [x for x, y in collections.Counter(l).items() if y > 1] def moooeeeep(l): seen = set() seen_add = seen.add # adds all elements it doesn't know yet to seen and all other to seen_twice seen_twice = set( x for x in l if x in seen or seen_add(x) ) # turn the set into a list (as requested) return list( seen_twice ) def RiteshKumar(l): return list(set([x for x in l if l.count(x) > 1])) def JohnLaRooy(L): seen = set() seen2 = set() seen_add = seen.add seen2_add = seen2.add for item in L: if item in seen: seen2_add(item) else: seen_add(item) return list(seen2) l = [1,2,3,2,1,5,6,5,5,5]*100
下面是结果:(做得好@JohnLaRooy!)
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)' 10000 loops, best of 3: 74.6 usec per loop $ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)' 10000 loops, best of 3: 91.3 usec per loop $ python -mtimeit -s 'import test' 'test.thg435(test.l)' 1000 loops, best of 3: 266 usec per loop $ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)' 100 loops, best of 3: 8.35 msec per loop
有趣的是,除了时间本身,使用pypy时,排名也会略有变化。 最有趣的是,基于Counter的方法从pypy的优化中获益匪浅,而我提出的方法caching方法似乎几乎没有效果。
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)' 100000 loops, best of 3: 17.8 usec per loop $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)' 10000 loops, best of 3: 23 usec per loop $ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)' 10000 loops, best of 3: 39.3 usec per loop
可以看出,这种效应与input数据的“重复性”有关。 我l = [random.randrange(1000000) for i in xrange(10000)]
设置了l = [random.randrange(1000000) for i in xrange(10000)]
并得到了以下结果:
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)' 1000 loops, best of 3: 495 usec per loop $ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)' 1000 loops, best of 3: 499 usec per loop $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)' 1000 loops, best of 3: 1.68 msec per loop
我遇到这个问题,同时寻找相关的东西 – 并想知道为什么没有人提供了一个基于生成器的解决scheme? 解决这个问题将是:
>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5])) [1, 2, 5]
我关心的是可扩展性,所以testing了几种方法,包括那些在小列表上运行良好的天真项目,但是随着列表变得更大(注意 – 使用timeit会更好,但这是说明性的)。
我包括@moooeeeep比较(这是令人印象深刻的快速:如果input列表是最快的,是完全随机的)和itertools的方法,对于大多数sorting的列表再次更快…现在包括pandas的方法@firelynx – 慢,但不可怕如此,简单。 注意 – sorting/ tee / zip方法是我的机器上一贯最快的大型大多数有序列表,moooeeeep是最快的洗牌清单,但你的里程可能会有所不同。
优点
- 使用相同的代码非常简单地testing“任何”重复
假设
- 重复只应报告一次
- 重复的顺序不需要保存
- 重复可能在列表中的任何地方
最快的解决scheme,1米条目:
def getDupes(c): '''sort/tee/izip''' a, b = itertools.tee(sorted(c)) next(b, None) r = None for k, g in itertools.izip(a, b): if k != g: continue if k != r: yield k r = k
方法testing
import itertools import time import random def getDupes_1(c): '''naive''' for i in xrange(0, len(c)): if c[i] in c[:i]: yield c[i] def getDupes_2(c): '''set len change''' s = set() for i in c: l = len(s) s.add(i) if len(s) == l: yield i def getDupes_3(c): '''in dict''' d = {} for i in c: if i in d: if d[i]: yield i d[i] = False else: d[i] = True def getDupes_4(c): '''in set''' s,r = set(),set() for i in c: if i not in s: s.add(i) elif i not in r: r.add(i) yield i def getDupes_5(c): '''sort/adjacent''' c = sorted(c) r = None for i in xrange(1, len(c)): if c[i] == c[i - 1]: if c[i] != r: yield c[i] r = c[i] def getDupes_6(c): '''sort/groupby''' def multiple(x): try: x.next() x.next() return True except: return False for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))): yield k def getDupes_7(c): '''sort/zip''' c = sorted(c) r = None for k, g in zip(c[:-1],c[1:]): if k == g: if k != r: yield k r = k def getDupes_8(c): '''sort/izip''' c = sorted(c) r = None for k, g in itertools.izip(c[:-1],c[1:]): if k == g: if k != r: yield k r = k def getDupes_9(c): '''sort/tee/izip''' a, b = itertools.tee(sorted(c)) next(b, None) r = None for k, g in itertools.izip(a, b): if k != g: continue if k != r: yield k r = k def getDupes_a(l): '''moooeeeep''' seen = set() seen_add = seen.add # adds all elements it doesn't know yet to seen and all other to seen_twice for x in l: if x in seen or seen_add(x): yield x def getDupes_b(x): '''iter*/sorted''' x = sorted(x) def _matches(): for k,g in itertools.izip(x[:-1],x[1:]): if k == g: yield k for k, n in itertools.groupby(_matches()): yield k def getDupes_c(a): '''pandas''' import pandas as pd vc = pd.Series(a).value_counts() i = vc[vc > 1].index for _ in i: yield _ def hasDupes(fn,c): try: if fn(c).next(): return True # Found a dupe except StopIteration: pass return False def getDupes(fn,c): return list(fn(c)) STABLE = True if STABLE: print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array' else: print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array' for location in (50,250000,500000,750000,999999): for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6, getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c): print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location), deltas = [] for FIRST in (True,False): for i in xrange(0, 5): c = range(0,1000000) if STABLE: c[0] = location else: c.append(location) random.shuffle(c) start = time.time() if FIRST: print '.' if location == test(c).next() else '!', else: print '.' if [location] == list(test(c)) else '!', deltas.append(time.time()-start) print ' -- %0.3f '%(sum(deltas)/len(deltas)), print print
“所有模式”testing的结果是一致的,在这个数组中find“所有”重复的“第一”重复:
Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array Test set len change : 500000 - . . . . . -- 0.264 . . . . . -- 0.402 Test in dict : 500000 - . . . . . -- 0.163 . . . . . -- 0.250 Test in set : 500000 - . . . . . -- 0.163 . . . . . -- 0.249 Test sort/adjacent : 500000 - . . . . . -- 0.159 . . . . . -- 0.229 Test sort/groupby : 500000 - . . . . . -- 0.860 . . . . . -- 1.286 Test sort/izip : 500000 - . . . . . -- 0.165 . . . . . -- 0.229 Test sort/tee/izip : 500000 - . . . . . -- 0.145 . . . . . -- 0.206 * Test moooeeeep : 500000 - . . . . . -- 0.149 . . . . . -- 0.232 Test iter*/sorted : 500000 - . . . . . -- 0.160 . . . . . -- 0.221 Test pandas : 500000 - . . . . . -- 0.493 . . . . . -- 0.499
当清单首先被洗牌时,价格会变得明显 – 效率明显下降,@moooeeeep方法占主导地位,set&dict方法类似,但出租人performance不佳:
Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array Test set len change : 500000 - . . . . . -- 0.321 . . . . . -- 0.473 Test in dict : 500000 - . . . . . -- 0.285 . . . . . -- 0.360 Test in set : 500000 - . . . . . -- 0.309 . . . . . -- 0.365 Test sort/adjacent : 500000 - . . . . . -- 0.756 . . . . . -- 0.823 Test sort/groupby : 500000 - . . . . . -- 1.459 . . . . . -- 1.896 Test sort/izip : 500000 - . . . . . -- 0.786 . . . . . -- 0.845 Test sort/tee/izip : 500000 - . . . . . -- 0.743 . . . . . -- 0.804 Test moooeeeep : 500000 - . . . . . -- 0.234 . . . . . -- 0.311 * Test iter*/sorted : 500000 - . . . . . -- 0.776 . . . . . -- 0.840 Test pandas : 500000 - . . . . . -- 0.539 . . . . . -- 0.540
计数器是python 2.7中的新function:
Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2 a = [1,2,3,2,1,5,6,5,5,5] import collections print [x for x, y in collections.Counter(a).items() if y > 1] Type "help", "copyright", "credits" or "license" for more information. File "", line 1, in AttributeError: 'module' object has no attribute 'Counter' >>>
在较早的版本中,您可以使用传统的字典:
a = [1,2,3,2,1,5,6,5,5,5] d = {} for elem in a: if elem in d: d[elem] += 1 else: d[elem] = 1 print [x for x, y in d.items() if y > 1]
我会用pandas做这个,因为我用pandas很多
import pandas as pd a = [1,2,3,3,3,4,5,6,6,7] vc = pd.Series(a).value_counts() vc[vc > 1].index.tolist()
给
[3,6]
可能效率不高,但肯定比其他许多答案less,所以我认为我会做出贡献
这是一个简洁明了的解决scheme –
for x in set(li): li.remove(x) li = list(set(li))
使用pandas:
>>> import pandas as pd >>> a = [1, 2, 1, 3, 3, 3, 0] >>> pd.Series(a)[pd.Series(a).duplicated()].values array([1, 3, 3])
有点晚了,但也许对一些有帮助。 对于一个大的列表,我发现这对我有用。
l=[1,2,3,5,4,1,3,1] s=set(l) d=[] for x in l: if x in s: s.remove(x) else: d.append(x) d [1,3,1]
只显示所有重复项并保留顺序。
在Python中进行一次迭代的简单而快速的方法是:
testList = ['red', 'blue', 'red', 'green', 'blue', 'blue'] testListDict = {} for item in testList: try: testListDict[item] += 1 except: testListDict[item] = 1 print testListDict
输出将如下所示:
>>> print testListDict {'blue': 3, 'green': 1, 'red': 2}
被接受的答案的第三个例子给出了一个错误的答案,而不是试图给出重复。 这是正确的版本:
number_lst = [1, 1, 2, 3, 5, ...] seen_set = set() duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x)) unique_set = seen_set - duplicate_set
如何简单地遍历列表中的每个元素,通过检查出现次数,然后将它们添加到一个集合,然后将打印重复。 希望这可以帮助那里的人。
myList = [2 ,4 , 6, 8, 4, 6, 12]; newList = set() for i in myList: if myList.count(i) >= 2: newList.add(i) print(list(newList)) ## [4 , 6]
我们可以使用itertools.groupby
为了find所有有dups的项目:
from itertools import groupby myList = [2, 4, 6, 8, 4, 6, 12] # when the list is sorted, groupby groups by consecutive elements which are similar for x, y in groupby(sorted(myList)): # list(y) returns all the occurences of item x if len(list(y)) > 1: print x
输出将是:
4 6
list2 = [1, 2, 3, 4, 1, 2, 3] lset = set() [(lset.add(item), list2.append(item)) for item in list2 if item not in lset] print list(lset)
一行解决scheme:
set([i for i in list if sum([1 for a in list if a == i]) > 1])
在这里有很多答案,但我认为这是一个相当可读和易于理解的方法:
def get_duplicates(sorted_list): duplicates = [] last = sorted_list[0] for x in sorted_list[1:]: if x == last: duplicates.append(x) last = x return set(duplicates)
笔记:
- 如果你想保留重复计数,摆脱剧组在底部“设置”,以获得完整列表
- 如果您更喜欢使用生成器,请用yield xreplaceduplicateates.append(x) ,并使用底部的return语句(以后可以强制转换)
这是一个快速的发生器,它使用一个字典来存储每个元素作为一个布尔值的关键,检查是否已经产生重复的项目。
对于所有可散列元素的列表:
def gen_dupes(array): unique = {} for value in array: if value in unique and unique[value]: unique[value] = False yield value else: unique[value] = True array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6] print(list(gen_dupes(array))) # => [2, 1, 6]
对于可能包含列表的列表:
def gen_dupes(array): unique = {} for value in array: is_list = False if type(value) is list: value = tuple(value) is_list = True if value in unique and unique[value]: unique[value] = False if is_list: value = list(value) yield value else: unique[value] = True array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6] print(list(gen_dupes(array))) # => [2, [1, 2], 6]
你可以使用iteration_utilities.duplicates
:
>>> from iteration_utilities import duplicates >>> list(duplicates([1,1,2,1,2,3,4,2])) [1, 1, 2, 2]
或者如果您只希望每个重复项中的一项可以与iteration_utilities.unique_everseen
结合使用:
>>> from iteration_utilities import unique_everseen >>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2]))) [1, 2]
1这是来自我写的第三方库: iteration_utilities
。
def removeduplicates(a): seen = set() for i in a: if i not in seen: seen.add(i) return seen print(removeduplicates([1,1,2,2]))
这是我不得不这样做的方式,因为我挑战自己不要使用其他方法:
def dupList(oldlist): if type(oldlist)==type((2,2)): oldlist=[x for x in oldlist] newList=[] newList=newList+oldlist oldlist=oldlist forbidden=[] checkPoint=0 for i in range(len(oldlist)): #print 'start i', i if i in forbidden: continue else: for j in range(len(oldlist)): #print 'start j', j if j in forbidden: continue else: #print 'after Else' if i!=j: #print 'i,j', i,j #print oldlist #print newList if oldlist[j]==oldlist[i]: #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j] forbidden.append(j) #print 'forbidden', forbidden del newList[j-checkPoint] #print newList checkPoint=checkPoint+1 return newList
所以你的示例工作如下:
>>>a = [1,2,3,3,3,4,5,6,6,7] >>>dupList(a) [1, 2, 3, 4, 5, 6, 7]
使用sort()
函数。 可以通过循环来检查重复,并检查l1[i] == l1[i+1]
。