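Find and list duplicates in a list?

How can I find the duplicates in a Python list and create another list of the duplicates? The list contains only integers.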

To remove duplicates, use set(a). To print the duplicates, something like this:

    a = [1,2,3,2,1,5,6,5,5,5]

    import collections
    print([item for item, count in collections.Counter(a).items() if count > 1])

    ## [1, 2, 5]

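Note that Counter is not particularly efficient (see the timings) and is probably overkill here; set will perform better. This code computes a list of the unique elements in source order:

    seen = set()
    uniq = []
    for x in a:
        if x not in seen:
            uniq.append(x)
            seen.add(x)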

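or, more concisely:

    seen = set()
    uniq = [x for x in a if x not in seen and not seen.add(x)]

I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).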
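To see why the one-liner works, it may help to look at what set.add() returns. A minimal sketch, reusing the names seen and uniq from the snippet above; the variable result is only there for illustration:

```python
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

seen = set()
# set.add() mutates the set in place and returns None ...
result = seen.add(1)
seen.clear()

# ... so `not seen.add(x)` is always True. It never filters anything out;
# it is only there to record x in `seen` as a side effect.
uniq = [x for x in a if x not in seen and not seen.add(x)]
```

The explicit loop does the same work without relying on a side effect inside a condition, which is why it is easier to read.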

If the list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic-time solution (compare each element with every earlier one), for example:

    a = [[1], [2], [3], [1], [5], [3]]

    no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
    print(no_dupes) # [[1], [2], [3], [5]]

    dupes = [x for n, x in enumerate(a) if x in a[:n]]
    print(dupes) # [[1], [3]]

    >>> l = [1,2,3,4,4,5,5,6,1]
    >>> set([x for x in l if l.count(x) > 1])
    set([1, 4, 5])

You don't need the count, just whether or not the item was seen before. I adapted that answer to this problem:

    def list_duplicates(seq):
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        seen_twice = set( x for x in seq if x in seen or seen_add(x) )
        # turn the set into a list (as requested)
        return list( seen_twice )

    a = [1,2,3,2,1,5,6,5,5,5]
    list_duplicates(a) # yields [1, 2, 5]

Just in case speed matters, here are some timings:

    # file: test.py
    import collections

    def thg435(l):
        return [x for x, y in collections.Counter(l).items() if y > 1]

    def moooeeeep(l):
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        seen_twice = set( x for x in l if x in seen or seen_add(x) )
        # turn the set into a list (as requested)
        return list( seen_twice )

    def RiteshKumar(l):
        return list(set([x for x in l if l.count(x) > 1]))

    def JohnLaRooy(L):
        seen = set()
        seen2 = set()
        seen_add = seen.add
        seen2_add = seen2.add
        for item in L:
            if item in seen:
                seen2_add(item)
            else:
                seen_add(item)
        return list(seen2)

    l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results (well done @JohnLaRooy!):

    $ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    10000 loops, best of 3: 74.6 usec per loop
    $ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    10000 loops, best of 3: 91.3 usec per loop
    $ python -mtimeit -s 'import test' 'test.thg435(test.l)'
    1000 loops, best of 3: 266 usec per loop
    $ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
    100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings themselves, the ranking also changes slightly when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy's optimizations, whereas the method-caching approach I suggested seems to have almost no effect.

    $ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    100000 loops, best of 3: 17.8 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
    10000 loops, best of 3: 23 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    10000 loops, best of 3: 39.3 usec per loop

Apparently this effect is related to the "duplicatedness" of the input data. I set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

    $ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    1000 loops, best of 3: 495 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    1000 loops, best of 3: 499 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
    1000 loops, best of 3: 1.68 msec per loop

I came across this question while looking into something related, and wonder why no one has offered a generator-based solution. Solving this problem would be:

    >>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
    [1, 2, 5]

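I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as the lists get larger (note: it would have been better to use timeit, but this is illustrative).

I included @moooeeeep's approach for comparison (it is impressively fast: the fastest when the input list is completely random) and an itertools approach that is faster again for mostly sorted lists. It now also includes the pandas approach from @firelynx: slow, but not horribly so, and simple. Note: the sort/tee/zip approach is consistently the fastest on my machine for large, mostly ordered lists, and moooeeeep is the fastest for shuffled lists, but your mileage may vary.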
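Since timeit is mentioned as the better tool, here is a rough sketch of how such a comparison could be set up with it. The two functions, the data sizes and the names dupes_seen_set and dupes_sorted are assumptions made for illustration, not the code that produced the figures in this answer, and no particular ranking is implied:

```python
import random
import timeit


def dupes_seen_set(items):
    """Set-based approach: collect every value that is seen a second time."""
    seen = set()
    seen_twice = set()
    for item in items:
        if item in seen:
            seen_twice.add(item)
        else:
            seen.add(item)
    return list(seen_twice)


def dupes_sorted(items):
    """Sort-based approach: equal neighbours in the sorted list are duplicates."""
    ordered = sorted(items)
    return list({a for a, b in zip(ordered, ordered[1:]) if a == b})


# Hypothetical test data: 10,000 values with one duplicate, ordered vs. shuffled.
ordered_data = list(range(10000)) + [5000]
shuffled_data = ordered_data[:]
random.shuffle(shuffled_data)

timings = {}
for label, data in (("ordered", ordered_data), ("shuffled", shuffled_data)):
    for func in (dupes_seen_set, dupes_sorted):
        # repeat() returns one total per run; the minimum is the least noisy figure.
        best = min(timeit.repeat(lambda: func(data), number=10, repeat=3))
        timings[(label, func.__name__)] = best
        print("%-8s %-15s %.6f s per 10 calls" % (label, func.__name__, best))
```

Taking the minimum of several repeats reduces the influence of other processes on the machine, which is the main weakness of timing with time.time() around a single call.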

Advantages

Very quick and simple to test for "any" duplicates using the same code
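The "any duplicates" check can be sketched as follows: because the duplicates come from a generator, asking for only the first value stops the scan early. This is a minimal illustration using a set-based generator; the names iter_dupes and has_dupes are hypothetical and chosen for this example:

```python
def iter_dupes(items):
    """Yield each duplicated value once, the first time it repeats."""
    seen, reported = set(), set()
    for item in items:
        if item not in seen:
            seen.add(item)
        elif item not in reported:
            reported.add(item)
            yield item


def has_dupes(items):
    """Return True as soon as the first duplicate is found."""
    # next() pulls at most one value from the generator, so the scan
    # stops at the first repeat instead of walking the whole input.
    sentinel = object()
    return next(iter_dupes(items), sentinel) is not sentinel
```

A sentinel object is used instead of a truthiness test so that a duplicated falsy value such as 0 is still detected.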

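Assumptions

Duplicates should be reported only once
Duplicate order does not need to be preserved
Duplicates might be anywhere in the list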

Fastest solution, 1m entries:

    def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

Approaches tested

    # Python 2 code (print statement, xrange, itertools.izip)
    import itertools
    import time
    import random

    def getDupes_1(c):
        '''naive'''
        for i in xrange(0, len(c)):
            if c[i] in c[:i]:
                yield c[i]

    def getDupes_2(c):
        '''set len change'''
        s = set()
        for i in c:
            l = len(s)
            s.add(i)
            if len(s) == l:
                yield i

    def getDupes_3(c):
        '''in dict'''
        d = {}
        for i in c:
            if i in d:
                if d[i]:
                    yield i
                    d[i] = False
            else:
                d[i] = True

    def getDupes_4(c):
        '''in set'''
        s,r = set(),set()
        for i in c:
            if i not in s:
                s.add(i)
            elif i not in r:
                r.add(i)
                yield i

    def getDupes_5(c):
        '''sort/adjacent'''
        c = sorted(c)
        r = None
        for i in xrange(1, len(c)):
            if c[i] == c[i - 1]:
                if c[i] != r:
                    yield c[i]
                    r = c[i]

    def getDupes_6(c):
        '''sort/groupby'''
        def multiple(x):
            try:
                x.next()
                x.next()
                return True
            except:
                return False
        for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
            yield k

    def getDupes_7(c):
        '''sort/zip'''
        c = sorted(c)
        r = None
        for k, g in zip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k

    def getDupes_8(c):
        '''sort/izip'''
        c = sorted(c)
        r = None
        for k, g in itertools.izip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k

    def getDupes_9(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

    def getDupes_a(l):
        '''moooeeeep'''
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        for x in l:
            if x in seen or seen_add(x):
                yield x

    def getDupes_b(x):
        '''iter*/sorted'''
        x = sorted(x)
        def _matches():
            for k,g in itertools.izip(x[:-1],x[1:]):
                if k == g:
                    yield k
        for k, n in itertools.groupby(_matches()):
            yield k

    def getDupes_c(a):
        '''pandas'''
        import pandas as pd
        vc = pd.Series(a).value_counts()
        i = vc[vc > 1].index
        for _ in i:
            yield _

    def hasDupes(fn,c):
        try:
            if fn(c).next(): return True    # Found a dupe
        except StopIteration:
            pass
        return False

    def getDupes(fn,c):
        return list(fn(c))

    STABLE = True
    if STABLE:
        print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
    else:
        print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'

    for location in (50,250000,500000,750000,999999):
        for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                     getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
            print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
            deltas = []
            for FIRST in (True,False):
                for i in xrange(0, 5):
                    c = range(0,1000000)
                    if STABLE:
                        c[0] = location
                    else:
                        c.append(location)
                        random.shuffle(c)
                    start = time.time()
                    if FIRST:
                        print '.' if location == test(c).next() else '!',
                    else:
                        print '.' if [location] == list(test(c)) else '!',
                    deltas.append(time.time()-start)
                print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
            print
        print

The results of the "all dupes" test were consistent, finding the "first" duplicate and then "all" duplicates in this array (the fastest approach is marked with *):

    Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402
    Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250
    Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229
    Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286
    Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206 *
    Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221
    Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499

When the lists are shuffled first, the price of the sort becomes apparent: efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:

    Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473
    Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360
    Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823
    Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896
    Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804
    Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311 *
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840
    Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540

Counter is new in Python 2.7:

    Python 2.5.4 (r254:67916, May 31 2010, 15:03:39)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
    a = [1,2,3,2,1,5,6,5,5,5]
    import collections
    print [x for x, y in collections.Counter(a).items() if y > 1]
    Type "help", "copyright", "credits" or "license" for more information.
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'Counter'
    >>>

In earlier versions you can use a conventional dict instead:

    a = [1,2,3,2,1,5,6,5,5,5]
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1

    print([x for x, y in d.items() if y > 1])

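I would do this with pandas, because I use pandas a lot:

    import pandas as pd
    a = [1,2,3,3,3,4,5,6,6,7]
    vc = pd.Series(a).value_counts()
    vc[vc > 1].index.tolist()

Gives

    [3,6]

It probably isn't very efficient, but it sure is less code than a lot of the other answers, so I thought I would contribute.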

Here's a neat and concise solution:

    # remove one occurrence of every distinct value; what is left are the repeats
    for x in set(li):
        li.remove(x)

    li = list(set(li))

Using pandas:

    >>> import pandas as pd
    >>> a = [1, 2, 1, 3, 3, 3, 0]
    >>> pd.Series(a)[pd.Series(a).duplicated()].values
    array([1, 3, 3])

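A bit late, but maybe helpful for some. For a largish list, I found this worked for me.

    l=[1,2,3,5,4,1,3,1]
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    d
    [1,3,1]

It shows all of the duplicates, and only the duplicates, and preserves their order.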

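A simple and quick way of counting occurrences with one iteration in Python is:

    testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

    testListDict = {}

    for item in testList:
        try:
            testListDict[item] += 1
        except KeyError:  # first time this item is seen
            testListDict[item] = 1

    print(testListDict)

The output will look like this; any item with a count greater than 1 is a duplicate:

    >>> print testListDict
    {'blue': 3, 'green': 1, 'red': 2}

This and more on my blog: http://www.howtoprogramwithpython.com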

The third example in the accepted answer gives an erroneous answer and does not attempt to return the duplicates. Here is the correct version:

    number_lst = [1, 1, 2, 3, 5, ...]

    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    unique_set = seen_set - duplicate_set

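How about simply looping through each element in the list, checking its number of occurrences, and adding it to a set, which is then printed as the duplicates? Hope this helps someone out there.

    myList = [2, 4, 6, 8, 4, 6, 12]
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    print(list(newList))
    ## [4 , 6]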

We can use itertools.groupby to find all the items that have dups:

    from itertools import groupby

    myList = [2, 4, 6, 8, 4, 6, 12]
    # when the list is sorted, groupby groups consecutive elements that are equal
    for x, y in groupby(sorted(myList)):
        # list(y) returns all the occurrences of item x
        if len(list(y)) > 1:
            print(x)

The output will be:

    4
    6

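    list2 = [1, 2, 3, 4, 1, 2, 3]
    lset = set()
    [(lset.add(item), list2.append(item))
     for item in list2 if item not in lset]
    # note: lset ends up holding every distinct value of list2,
    # not only the values that occur more than once
    print(list(lset))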

A one-line solution:

    set([i for i in list if sum([1 for a in list if a == i]) > 1])

There are a lot of answers here, but I think this is a fairly readable and easy-to-understand approach:

    def get_duplicates(sorted_list):
        duplicates = []
        last = sorted_list[0]
        for x in sorted_list[1:]:
            if x == last:
                duplicates.append(x)
            last = x
        return set(duplicates)

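Notes:

If you wish to preserve the duplication count, get rid of the cast to set at the bottom to get the full list
If you prefer to use generators, replace duplicates.append(x) with yield x and drop the return statement at the bottom (you can cast to set later)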
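The generator variant described in the second note might look like the sketch below. The name iter_duplicates is hypothetical, and the guard for an empty list is an addition that the function above does not have:

```python
def iter_duplicates(sorted_list):
    """Yield an item each time it equals its predecessor in a sorted list."""
    if not sorted_list:
        return  # nothing to compare in an empty list
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            yield x
        last = x


data = sorted([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])
all_repeats = list(iter_duplicates(data))  # keeps the duplication count
distinct = set(iter_duplicates(data))      # cast to a set afterwards
```

The caller decides whether to keep every repeat or collapse them into a set, so one function covers both notes.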

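Here's a fast generator that uses a dict to store each element as a key with a boolean value, which records whether the duplicate item has not yet been yielded.

For lists in which all elements are hashable:

    def gen_dupes(array):
        unique = {}
        for value in array:
            if value in unique:
                # a value already reported stays marked False, so it is yielded only once
                if unique[value]:
                    unique[value] = False
                    yield value
            else:
                unique[value] = True

    array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
    print(list(gen_dupes(array)))
    # => [2, 1, 6]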

For lists that might contain lists:

    def gen_dupes(array):
        unique = {}
        for value in array:
            is_list = False
            if type(value) is list:
                # lists are not hashable, so use a tuple as the dict key
                value = tuple(value)
                is_list = True

            if value in unique:
                if unique[value]:
                    unique[value] = False
                    if is_list:
                        value = list(value)
                    yield value
            else:
                unique[value] = True

    array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
    print(list(gen_dupes(array)))
    # => [2, [1, 2], 6]

You can use iteration_utilities.duplicates:

    >>> from iteration_utilities import duplicates

    >>> list(duplicates([1,1,2,1,2,3,4,2]))
    [1, 1, 2, 2]

or, if you only want one of each duplicate, it can be combined with iteration_utilities.unique_everseen:

    >>> from iteration_utilities import unique_everseen

    >>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
    [1, 2]

Note: this is from a third-party library I have written, iteration_utilities.

    def removeduplicates(a):
        seen = set()

        for i in a:
            if i not in seen:
                seen.add(i)
        return seen

    print(removeduplicates([1,1,2,2]))

This is the way I had to do it, because I challenged myself not to use other methods:

    def dupList(oldlist):
        if type(oldlist)==type((2,2)):
            oldlist=[x for x in oldlist]
        newList=[]
        newList=newList+oldlist
        oldlist=oldlist
        forbidden=[]
        checkPoint=0
        for i in range(len(oldlist)):
            #print 'start i', i
            if i in forbidden:
                continue
            else:
                for j in range(len(oldlist)):
                    #print 'start j', j
                    if j in forbidden:
                        continue
                    else:
                        #print 'after Else'
                        if i!=j:
                            #print 'i,j', i,j
                            #print oldlist
                            #print newList
                            if oldlist[j]==oldlist[i]:
                                #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                                forbidden.append(j)
                                #print 'forbidden', forbidden
                                del newList[j-checkPoint]
                                #print newList
                                checkPoint=checkPoint+1
        return newList

So your sample works as:

    >>>a = [1,2,3,3,3,4,5,6,6,7]
    >>>dupList(a)
    [1, 2, 3, 4, 5, 6, 7]

Use the sort() function. Duplicates can then be identified by looping over the list and checking whether l1[i] == l1[i+1].
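A minimal sketch of that idea, assuming the list is named l1 as in the sentence above; the sample data and the dupes list are illustrative only:

```python
l1 = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
l1.sort()  # sorts in place, so equal values end up next to each other

dupes = []
for i in range(len(l1) - 1):
    # Record a value when it equals its right-hand neighbour,
    # unless it is already the most recent duplicate recorded.
    if l1[i] == l1[i + 1] and (not dupes or dupes[-1] != l1[i]):
        dupes.append(l1[i])
```

The loop stops at len(l1) - 1 so that l1[i + 1] never runs past the end of the list. Sorting costs O(n log n), and the elements only need to be comparable, not hashable.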
