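How can I remove duplicate rows?
What is the best way to remove duplicate rows from a fairly large table (i.e. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the RowID identity field.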
MyTable
-----------
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then delete everything whose RowId is not among the rows to keep:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
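To try the keep-the-lowest-RowId idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption made purely to keep the demo self-contained: it has no DELETE ... JOIN, so the anti-join is expressed with NOT IN, and the sample rows and helper names are made up.

```python
import sqlite3

def make_table(conn):
    """Create a small MyTable with deliberate duplicates (made-up sample data)."""
    conn.execute(
        "CREATE TABLE MyTable ("
        " RowID INTEGER PRIMARY KEY AUTOINCREMENT,"
        " Col1 TEXT NOT NULL, Col2 TEXT NOT NULL, Col3 INTEGER NOT NULL)"
    )
    rows = [("a", "x", 1), ("a", "x", 1), ("b", "y", 2),
            ("a", "x", 1), ("b", "y", 2), ("c", "z", 3)]
    conn.executemany("INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)", rows)

def delete_duplicates_keep_min(conn):
    """Delete every row whose RowID is not the smallest one in its group."""
    cur = conn.execute(
        "DELETE FROM MyTable "
        "WHERE RowID NOT IN ("
        "    SELECT MIN(RowID) FROM MyTable GROUP BY Col1, Col2, Col3"
        ")"
    )
    return cur.rowcount  # number of rows removed

conn = sqlite3.connect(":memory:")
make_table(conn)
deleted = delete_duplicates_keep_min(conn)
remaining = conn.execute(
    "SELECT RowID, Col1, Col2, Col3 FROM MyTable ORDER BY RowID"
).fetchall()
print(deleted, remaining)
```

Swapping MIN for MAX keeps the most recently inserted copy of each group instead.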
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY (SELECT 0)) RN
    FROM #MyTable
)
DELETE FROM cte
WHERE RN > 1;
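I am using ORDER BY (SELECT 0) above because it is arbitrary which row is preserved within a group of duplicates.
To preserve the latest row in RowID order, for example, you could use ORDER BY RowID DESC.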
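The effect of the ORDER BY inside ROW_NUMBER can be checked with a small sketch, assuming SQLite (3.25 or later, for window functions) through Python's sqlite3 module instead of SQL Server. SQLite cannot delete through a CTE, so here the numbered rows feed a RowID IN (...) filter; the helper names and data are hypothetical.

```python
import sqlite3

# Fixed whitelist of orderings, so no caller-supplied text reaches the SQL.
ORDERINGS = {"oldest": "RowID ASC", "newest": "RowID DESC"}

def fresh_connection():
    """In-memory table with three copies of one row and one unique row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MyTable ("
        " RowID INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)",
        [("a", "x", 1), ("a", "x", 1), ("b", "y", 2), ("a", "x", 1)],
    )
    return conn

def dedupe_keep(conn, keep):
    """Delete rows numbered 2 and up within each group; return surviving RowIDs."""
    conn.execute(
        f"""
        DELETE FROM MyTable
        WHERE RowID IN (
            SELECT RowID FROM (
                SELECT RowID,
                       ROW_NUMBER() OVER (
                           PARTITION BY Col1, Col2, Col3
                           ORDER BY {ORDERINGS[keep]}
                       ) AS RN
                FROM MyTable
            ) AS numbered
            WHERE RN > 1
        )
        """
    )
    return [row[0] for row in conn.execute("SELECT RowID FROM MyTable ORDER BY RowID")]

kept_oldest = dedupe_keep(fresh_connection(), "oldest")
kept_newest = dedupe_keep(fresh_connection(), "newest")
print(kept_oldest, kept_newest)
```

Row number 1 in each partition is the survivor, so the sort direction alone decides whether the oldest or the newest copy is kept.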
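Execution Plans
The execution plan for this is often simpler and more efficient than that of the accepted answer, as it does not require the self join.
This is not always the case, however. One place where the GROUP BY solution might be preferred is a situation where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.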
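Factors which might favour the hash aggregate approach would be
- No useful index on the partitioning columns
- Relatively few groups with relatively many duplicates in each group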
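In extreme versions of this second case (very few groups with many duplicates in each), you could also consider simply inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back, to minimise logging compared with deleting a very high proportion of the rows.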
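The copy-out, empty, copy-back sequence can be sketched as follows, assuming SQLite via Python's sqlite3 module. SQLite has no TRUNCATE TABLE, so an unfiltered DELETE stands in for it, and the logging benefit described above is specific to SQL Server and is not reproduced here. In SQL Server, copying RowID values back into an identity column would additionally need SET IDENTITY_INSERT. The KeepRows table name is hypothetical.

```python
import sqlite3

def rebuild_without_duplicates(conn):
    """Copy one row per group aside, empty the table, then copy the keepers back."""
    conn.execute(
        "CREATE TEMP TABLE KeepRows AS "
        "SELECT MIN(RowID) AS RowID, Col1, Col2, Col3 "
        "FROM MyTable GROUP BY Col1, Col2, Col3"
    )
    conn.execute("DELETE FROM MyTable")  # SQLite stand-in for TRUNCATE TABLE
    conn.execute(
        "INSERT INTO MyTable (RowID, Col1, Col2, Col3) "
        "SELECT RowID, Col1, Col2, Col3 FROM KeepRows"
    )
    conn.execute("DROP TABLE KeepRows")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    " RowID INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 INTEGER)"
)
# Two groups with 500 copies each: the "few groups, many duplicates" extreme.
conn.executemany(
    "INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [("a", "x", 1), ("b", "y", 2)] * 500,
)
conn.commit()

rebuild_without_duplicates(conn)
survivors = conn.execute(
    "SELECT RowID, Col1, Col2, Col3 FROM MyTable ORDER BY RowID"
).fetchall()
print(survivors)
```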
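There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative (they have you do everything in separate steps), but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause: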
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
  AND dupes.secondDupField = fullTable.secondDupField
  AND dupes.uniqueField > fullTable.uniqueField
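The following query is useful for deleting duplicate rows. The table in this example has ID as an identity column, and the columns which hold duplicate data are Column1, Column2 and Column3.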
DELETE FROM TableName
WHERE ID NOT IN (
    SELECT MAX(ID)
    FROM TableName
    GROUP BY Column1, Column2, Column3
    /* Even if ID is not null-able SQL Server treats MAX(ID) as potentially
       nullable. Because of semantics of NOT IN (NULL) including the clause
       below can simplify the plan */
    HAVING MAX(ID) IS NOT NULL
)
The following script shows the use of GROUP BY, HAVING and ORDER BY in one query, and returns each duplicated column value together with its count.
SELECT YourColumnName, COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid
Postgres:
delete from table t1
using table t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                ORDER BY rowid DESC) AS [Row]
      FROM mytable) LU
WHERE [Row] > 1
This deletes the duplicate rows, keeping only the first row of each group:
DELETE FROM Mytable
WHERE RowID NOT IN (
    SELECT MIN(RowID)
    FROM Mytable
    GROUP BY Col1, Col2, Col3
)
Refer to http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server
Quick and dirty way to delete exact duplicate rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://dotnetmob.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY col1, col2, col3) AS RN
    FROM MyTable
)
DELETE FROM CTE
WHERE RN <> 1
Without keeping the original:
WITH CTE AS (
    SELECT *, R = RANK() OVER (ORDER BY col1, col2, col3)
    FROM MyTable
)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*) > 1)
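I prefer the subquery with HAVING COUNT(*) > 1 to the inner join solution, because I find it easier to read and it is very easy to turn into a SELECT statement to verify what will be deleted before you run it.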
--DELETE FROM table1
--WHERE id IN (
    SELECT MIN(id)
    FROM table1
    GROUP BY col1, col2, col3
    -- could add a WHERE clause here to further filter
    HAVING count(*) > 1
--)
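The preview-then-delete workflow can be sketched like this, assuming SQLite through Python's sqlite3 module and a made-up table1. The same inner SELECT is first run on its own and then wrapped in the DELETE. The loop and helper names are illustrative additions: the subquery returns only one id per duplicated group, so groups with three or more copies need more than one pass.

```python
import sqlite3

# The inner SELECT from the answer: the smallest id of each duplicated group.
CANDIDATES_SQL = (
    "SELECT MIN(id) FROM table1 "
    "GROUP BY col1, col2, col3 "
    "HAVING COUNT(*) > 1"
)

def preview(conn):
    """Ids that the next DELETE pass would remove."""
    return sorted(row[0] for row in conn.execute(CANDIDATES_SQL))

def delete_previewed(conn):
    """Run the DELETE built from exactly the same subquery."""
    conn.execute(f"DELETE FROM table1 WHERE id IN ({CANDIDATES_SQL})")

def dedupe_with_preview(conn):
    """Repeat preview + delete until no duplicated group is left."""
    passes = []
    while True:
        ids = preview(conn)
        if not ids:
            return passes
        passes.append(ids)
        delete_previewed(conn)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 INTEGER)"
)
conn.executemany(
    "INSERT INTO table1 (col1, col2, col3) VALUES (?, ?, ?)",
    [("a", "x", 1)] * 3 + [("b", "y", 2)] * 2 + [("c", "z", 3)],
)
passes = dedupe_with_preview(conn)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
print(passes, remaining_ids)
```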
SELECT DISTINCT * INTO tempdb.dbo.tmpTable FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
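Another easy solution can be found on the linked page. It is easy to grasp and seems to be effective for most similar problems. It is written for SQL Server, but the concept applies more generally.
Here are the relevant portions of the linked page:
Consider this data: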
EMPLOYEE_ID  ATTENDANCE_DATE
A001         2011-01-01
A001         2011-01-01
A002         2011-01-01
A002         2011-01-01
A002         2011-01-01
A003         2011-01-01
So how can we delete this duplicate data?
First, add an identity column to the table using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Then use the following code to remove the duplicates:
DELETE FROM dbo.ATTENDANCE
WHERE AUTOID NOT IN (
    SELECT MIN(AUTOID)
    FROM dbo.ATTENDANCE
    GROUP BY EMPLOYEE_ID, ATTENDANCE_DATE
)
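I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another database).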
begin transaction

-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2

-- insert distinct values into temp
insert into #temp
select distinct * from tableName

-- delete from source
delete from tableName

-- insert into source from temp
insert into tableName
select * from #temp

rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
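PS: when working on things like this I always use a transaction. This not only ensures that everything is executed as a whole, it also lets me test without risking anything. But of course you should take a backup anyway, just to be sure...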
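The rollback-first habit can be rehearsed in a few lines. This is a minimal sketch assuming SQLite through Python's sqlite3 module (which opens a transaction implicitly before a DELETE), not SQL Server; the name and email columns and the helper names are made up for the demo.

```python
import sqlite3

DEDUPE_SQL = (
    "DELETE FROM tableName "
    "WHERE rowid NOT IN (SELECT MIN(rowid) FROM tableName GROUP BY name, email)"
)

def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM tableName").fetchone()[0]

def run_dedupe(conn, keep_changes=False):
    """Return row counts (before, inside the transaction, after it ends)."""
    before = count_rows(conn)
    conn.execute(DEDUPE_SQL)          # implicit BEGIN happens before the DELETE
    inside = count_rows(conn)
    if keep_changes:
        conn.commit()
    else:
        conn.rollback()               # dry run: undo the delete
    return before, inside, count_rows(conn)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableName (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO tableName (name, email) VALUES (?, ?)",
    [("ann", "ann@example.com"), ("ann", "ann@example.com"),
     ("bob", "bob@example.com"), ("bob", "bob@example.com"),
     ("cy", "cy@example.com")],
)
conn.commit()

dry_run = run_dedupe(conn)                      # rolled back
real_run = run_dedupe(conn, keep_changes=True)  # committed
print(dry_run, real_run)
```

The dry run shows the row count the delete would leave behind while the table itself ends up unchanged; only the second call makes the change permanent.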
Using a CTE:
;WITH cte AS (
    SELECT MIN(PrimaryKey) AS PrimaryKey, UniqueColumn1, UniqueColumn2
    FROM dbo.DuplicatesTable
    GROUP BY UniqueColumn1, UniqueColumn2
    HAVING COUNT(*) > 1
)
DELETE d
FROM dbo.DuplicatesTable d
INNER JOIN cte
    ON d.PrimaryKey > cte.PrimaryKey
   AND d.UniqueColumn1 = cte.UniqueColumn1
   AND d.UniqueColumn2 = cte.UniqueColumn2;
This query showed very good performance for me:
DELETE tbl
FROM MyTable tbl
WHERE EXISTS (
    SELECT *
    FROM MyTable tbl2
    WHERE tbl2.SameValue = tbl.SameValue
      AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
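It deleted 1M rows from a table of 2M (50% duplicates) in a little more than 30 seconds.
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution and two MySQL examples.
In the future, are you going to prevent duplicates at the database level or from the application side? I would suggest the database level, because your database should be responsible for maintaining data integrity; developers will just cause problems ;)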
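As an illustration of prevention at the database level, the following sketch puts a unique index on the three data columns so that a repeated row is rejected at insert time. It assumes SQLite through Python's sqlite3 module rather than SQL Server; the index name and helper function are hypothetical.

```python
import sqlite3

def try_insert(conn, row):
    """Return True if the row was accepted, False if the unique index rejected it."""
    try:
        conn.execute("INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    " RowID INTEGER PRIMARY KEY,"
    " Col1 TEXT NOT NULL, Col2 TEXT NOT NULL, Col3 INTEGER NOT NULL)"
)
# The database, not the application, now guarantees uniqueness.
conn.execute("CREATE UNIQUE INDEX ux_MyTable_cols ON MyTable (Col1, Col2, Col3)")

first = try_insert(conn, ("a", "x", 1))
repeat = try_insert(conn, ("a", "x", 1))      # same values again
different = try_insert(conn, ("a", "x", 2))   # differs in Col3
row_count = conn.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
print(first, repeat, different, row_count)
```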
Oh sure. Use a temp table. If you want a single, not very performant statement that "works", you can go with:
DELETE FROM MyTable
WHERE NOT RowID IN (
    SELECT (SELECT TOP 1 RowID
            FROM MyTable mt2
            WHERE mt2.Col1 = mt.Col1
              AND mt2.Col2 = mt.Col2
              AND mt2.Col3 = mt.Col3)
    FROM MyTable mt
)
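Basically, for each row in the table, the sub-select finds the top RowID of all rows that look exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
I had a table where I needed to preserve the non-duplicate rows. I'm not sure about the speed or efficiency.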
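DELETE FROM myTable
WHERE RowID IN (
    SELECT MIN(RowID) AS IDNo
    FROM myTable
    GROUP BY Col1, Col2, Col3
    -- matches groups of exactly two rows; for larger groups use > 1 and rerun until no rows are deleted
    HAVING COUNT(*) = 2
)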
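Another way is to create a new table with the same fields and a unique index, then move all data from the old table to the new table. SQL Server can be told what to do with duplicate values: with the index option IGNORE_DUP_KEY = ON the duplicate rows are skipped with a warning, otherwise the insert fails. So we end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you can use DTS (an SSIS package to import/export data) to transfer all data quickly into the new uniquely indexed table. For 7 million rows it takes just a few minutes.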
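The new-table-with-unique-index approach can be sketched as below. It assumes SQLite through Python's sqlite3 module, where INSERT OR IGNORE plays the role that the IGNORE_DUP_KEY index option plays in SQL Server; the table and index names are made up, and no SSIS/DTS package is involved.

```python
import sqlite3

def copy_unique(conn):
    """Copy OldTable into a uniquely indexed NewTable, silently skipping duplicates."""
    conn.execute(
        "CREATE TABLE NewTable ("
        " Col1 TEXT NOT NULL, Col2 TEXT NOT NULL, Col3 INTEGER NOT NULL)"
    )
    conn.execute("CREATE UNIQUE INDEX ux_NewTable ON NewTable (Col1, Col2, Col3)")
    # Rows that would violate the unique index are dropped instead of raising.
    conn.execute(
        "INSERT OR IGNORE INTO NewTable (Col1, Col2, Col3) "
        "SELECT Col1, Col2, Col3 FROM OldTable"
    )
    conn.execute("DROP INDEX ux_NewTable")  # optional, once the data is clean
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OldTable (Col1 TEXT, Col2 TEXT, Col3 INTEGER)")
conn.executemany(
    "INSERT INTO OldTable VALUES (?, ?, ?)",
    [("a", "x", 1), ("a", "x", 1), ("b", "y", 2), ("b", "y", 2), ("c", "z", 3)],
)
conn.commit()

copy_unique(conn)
new_rows = conn.execute(
    "SELECT Col1, Col2, Col3 FROM NewTable ORDER BY Col1"
).fetchall()
print(new_rows)
```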
Use this:
WITH tblTemp AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Name, Department ORDER BY Name) AS RowNumber, *
    FROM <table_name>
)
DELETE FROM tblTemp
WHERE RowNumber > 1
To fetch the duplicate rows:
SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1
To delete the duplicate rows:
DELETE FROM users
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM users
    GROUP BY name, email
);
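Using the query below, we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.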
DELETE a
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY empno, empname
                                ORDER BY empno) AS [ItemNumber]
      FROM testing) a
WHERE a.ItemNumber > 1
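I would mention this approach as well, since it can be helpful and works in all SQL Server versions: quite often there are only one or two duplicates, and their Ids and the number of duplicates are known. In this case: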
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
- Create a new blank table with the same structure
- Execute a query like this
  INSERT INTO tc_category1
  SELECT * FROM tc_category
  GROUP BY category_id, application_id
  HAVING count(*) > 1
- Then execute this query
  INSERT INTO tc_category1
  SELECT * FROM tc_category
  GROUP BY category_id, application_id
  HAVING count(*) = 1
DELETE FROM table_name T1
WHERE rowid > (
    SELECT MIN(rowid)
    FROM table_name T2
    WHERE T1.column_name = T2.column_name
);
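From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through a unique index, but in SQL Server 2005 an index key is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I don't know how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like: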
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
AFTER INSERT, UPDATE
AS
    DECLARE @cnt AS INT

    SELECT @cnt = COUNT(*)
    FROM stories
    INNER JOIN inserted
        ON (stories.story = inserted.story
            AND stories.story_id != inserted.story_id)

    IF @cnt > 0
    BEGIN
        RAISERROR('plagiarism detected', 16, 1)
        ROLLBACK TRANSACTION
    END
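To experiment with the trigger idea without SQL Server, here is a rough SQLite analogue driven from Python's sqlite3 module. It is a sketch under assumptions, not a port: SQLite has no inserted pseudo-table, so the check uses NEW.story in a BEFORE INSERT trigger, and only inserts are covered (updates would need a second trigger). The add_story helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stories (story_id INTEGER PRIMARY KEY, story TEXT NOT NULL)"
)
# Abort any insert whose story text already exists in the table.
conn.execute(
    """
    CREATE TRIGGER prevent_plagiarism
    BEFORE INSERT ON stories
    WHEN EXISTS (SELECT 1 FROM stories WHERE story = NEW.story)
    BEGIN
        SELECT RAISE(ABORT, 'plagiarism detected');
    END
    """
)

def add_story(conn, text):
    """Return True if the story was stored, False if the trigger aborted the insert."""
    try:
        conn.execute("INSERT INTO stories (story) VALUES (?)", (text,))
        return True
    except sqlite3.DatabaseError:
        return False

first = add_story(conn, "Once upon a time")
copy = add_story(conn, "Once upon a time")   # identical text, rejected
other = add_story(conn, "A different tale")
stored = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(first, copy, other, stored)
```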
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon). Should it really not be varchar(max)?
DELETE FROM MyTable
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT MIN(RowID) AS RowID
          FROM MyTable
          GROUP BY Col1, Col2, Col3) AS KeepRows
    WHERE KeepRows.RowID = MyTable.RowID
);
Another way of doing this:
DELETE A
FROM TABLE A, TABLE B
WHERE A.COL1 = B.COL1
  AND A.COL2 = B.COL2
  AND A.UNIQUEFIELD > B.UNIQUEFIELD
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)

INSERT INTO car(PersonId, CarId)
VALUES (1,2), (1,3), (1,2), (2,4)

--SELECT * FROM car

;WITH CTE AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY PersonId, CarId
                              ORDER BY PersonId, CarId) AS rn,
           Id, PersonId, CarId
    FROM car
)
DELETE FROM car
WHERE Id IN (SELECT Id FROM CTE WHERE rn > 1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows is kept, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
WITH MYCTE AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY DuplicateKey1
                    ,DuplicateKey2 -- optional
        ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
    ) RN
    FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1