Fuzzy matching using T-SQL
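I have a table Persons with personal data and so on. There are lots of columns, but the ones of interest here are addressindex, lastname and firstname, where addressindex is a unique address drilled down to the door of the apartment. So if, as in the example below, two persons share the same addressindex and lastname, and one of their first names matches, they are most likely duplicates.
I need a way to list these duplicates.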
tabledata:

    personid  firstname     lastname    addressindex
    1         "Carl"        "Anderson"  1
    2         "Carl Peter"  "Anderson"  1
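I know how to do this if I were matching exactly on all columns, but I need a fuzzy match to do the trick, with (from the example above) a result like: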
    Row  personid  addressindex  lastname  firstname
    1    2         1             Anderson  Carl Peter
    2    1         1             Anderson  Carl
    .....
Any hints on how to solve this in a good way?
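I've found that the tools SQL Server gives you for fuzzy matching are pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0, the better the match. Then query with a threshold of >= 0.8 or so to get the most likely matches. Something like this: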
    if object_id('tempdb..#similar') is not null
        drop table #similar

    -- for each row, find the id of the most similar other row
    select a.id,
           (select top 1 x.id
            from MyTable x
            where x.id <> a.id
            order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc) as MostSimilarId
    into #similar
    from MyTable a

    -- show each row next to its best match, together with the score
    select *, dbo.GetSimilarityScore(a.MyField, c.MyField) as SimilarityScore
    from MyTable a
    join #similar b on a.id = b.id
    join MyTable c on b.MostSimilarId = c.id
Just don't do it with really large tables. It's a slow process.
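Applied to the table from the question, the threshold idea might look like the sketch below. It assumes the CLR function dbo.GetSimilarityScore from this answer is already deployed, and it uses a table variable @person with the question's two sample rows purely for illustration (both names are my assumptions, not part of the original answer). Restricting the self-join to rows with the same addressindex and lastname keeps the number of scored pairs small, which matters given how slow the scoring is.

```sql
-- Illustrative sample data; @person stands in for the real table.
DECLARE @person TABLE (
    personid     int PRIMARY KEY,
    firstname    varchar(20),
    lastname     varchar(20),
    addressindex int
);

INSERT INTO @person (personid, firstname, lastname, addressindex)
VALUES (1, 'Carl',       'Anderson', 1),
       (2, 'Carl Peter', 'Anderson', 1);

-- Threshold suggested in the answer; treat it as a starting point to tune.
DECLARE @threshold float = 0.8;

SELECT a.personid,
       b.personid  AS candidateid,
       a.addressindex,
       a.lastname,
       a.firstname,
       b.firstname AS candidatefirstname,
       s.score
FROM @person AS a
JOIN @person AS b
  ON  b.addressindex = a.addressindex
  AND b.lastname     = a.lastname
  AND b.personid     > a.personid      -- score each pair only once
CROSS APPLY (SELECT dbo.GetSimilarityScore(a.firstname, b.firstname) AS score) AS s
WHERE s.score >= @threshold
ORDER BY s.score DESC;
```

One caveat: working through the scoring formula by hand, 'Carl' against 'Carl Peter' comes out at roughly 0.44, well below 0.8, so for short names with an extra middle name you would probably need a lower threshold or a comparison on the first word only.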
Here are the CLR UDFs:
    Imports System.Collections.Generic
    Imports System.Data.SqlTypes
    Imports System.Text
    Imports Microsoft.SqlServer.Server

    ' The wrapper class name is an assumption; SQL CLR functions must be Shared members of a public class.
    Partial Public Class UserDefinedFunctions

        ''' <summary>
        ''' Compute the distance between two strings.
        ''' </summary>
        ''' <param name="s1">The first of the two strings.</param>
        ''' <param name="s2">The second of the two strings.</param>
        ''' <returns>The Levenshtein cost.</returns>
        <Microsoft.SqlServer.Server.SqlFunction()> _
        Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
            If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null

            Dim s1 As String = string1.Value
            Dim s2 As String = string2.Value
            Dim n As Integer = s1.Length
            Dim m As Integer = s2.Length
            Dim d As Integer(,) = New Integer(n, m) {}

            ' Step 1: an empty string is as far away as the other string is long
            If n = 0 Then Return m
            If m = 0 Then Return n

            ' Step 2: initialise the first column and first row
            For i As Integer = 0 To n
                d(i, 0) = i
            Next
            For j As Integer = 0 To m
                d(0, j) = j
            Next

            ' Step 3
            For i As Integer = 1 To n
                ' Step 4
                For j As Integer = 1 To m
                    ' Step 5: substitution is free when the characters match
                    Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)
                    ' Step 6: cheapest of deletion, insertion, substitution
                    d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
                Next
            Next

            ' Step 7
            Return d(n, m)
        End Function

        ''' <summary>
        ''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100%
        ''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
        ''' </summary>
        <Microsoft.SqlServer.Server.SqlFunction()> _
        Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
            ' Return the NULL of the declared return type
            If string1.IsNull OrElse string2.IsNull Then Return SqlDouble.Null

            Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
            Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
            If s1 = s2 Then Return 1.0 ' At this point, T-SQL would consider them the same, so I will too

            Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)

            Dim letterS1 As String = GetLetterSimilarityString(s1)
            Dim letterS2 As String = GetLetterSimilarityString(s2)
            Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)

            'Dim wordS1 As String = GetWordSimilarityString(s1)
            'Dim wordS2 As String = GetWordSimilarityString(s2)
            'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)

            If flatLevScore = 1.0 AndAlso letterScore = 1.0 Then Return 1.0
            If flatLevScore = 0.0 AndAlso letterScore = 0.0 Then Return 0.0

            ' Return weighted result
            Return (flatLevScore * 0.2) + (letterScore * 0.8)
        End Function

        Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
            Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
            Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
            If maxLen = 0 Then Return 1.0
            Return 1.0 - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
        End Function

        ''' <summary>
        ''' Sorts all the alpha numeric characters in the string in alphabetical order
        ''' and removes everything else.
        ''' </summary>
        Private Shared Function GetLetterSimilarityString(s1 As String) As String
            Dim allChars = If(s1, "").ToUpper().ToCharArray()
            Array.Sort(allChars)
            Dim result As New StringBuilder()
            For Each ch As Char In allChars
                If Char.IsLetterOrDigit(ch) Then
                    result.Append(ch)
                End If
            Next
            Return result.ToString()
        End Function

        ''' <summary>
        ''' Removes all non-alpha numeric characters and then sorts
        ''' the words in alphabetical order.
        ''' </summary>
        Private Shared Function GetWordSimilarityString(s1 As String) As String
            Dim words As New List(Of String)()
            Dim curWord As StringBuilder = Nothing
            For Each ch As Char In If(s1, "").ToUpper()
                If Char.IsLetterOrDigit(ch) Then
                    If curWord Is Nothing Then
                        curWord = New StringBuilder()
                    End If
                    curWord.Append(ch)
                Else
                    If curWord IsNot Nothing Then
                        words.Add(curWord.ToString())
                        curWord = Nothing
                    End If
                End If
            Next
            If curWord IsNot Nothing Then
                words.Add(curWord.ToString())
            End If

            words.Sort(StringComparer.OrdinalIgnoreCase)
            Return String.Join(" ", words.ToArray())
        End Function

    End Class
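In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm, which is generally considered better than SOUNDEX. There is a Transact-SQL version (link to the code here).
That will help match names with slight misspellings, e.g. Carl vs. Karl.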
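I would use SQL Server Full-Text Indexing, which lets you run searches that return rows containing not only the exact word but also close variants of it, and possibly misspellings.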
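As a rough illustration of what a full-text query could look like for the names in the question, here is a minimal sketch using a prefix term. It assumes a table dbo.person that already has a full-text index covering firstname; the table name and the index are assumptions on my part, and creating the catalog and index is not shown.

```sql
-- Assumes dbo.person exists and already has a full-text index covering firstname.
-- The prefix term "Carl*" matches any word in firstname that starts with "Carl",
-- so 'Carl' and 'Carl Peter' both qualify (as would 'Carla' or 'Carlos').
SELECT personid, firstname, lastname, addressindex
FROM dbo.person
WHERE CONTAINS(firstname, N'"Carl*"');
```

Note that this is word-based matching rather than edit-distance matching, so a variant such as 'Karl' would not be found this way.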
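Since the first release of Master Data Services, you have had access to more advanced fuzzy logic algorithms than what SOUNDEX implements. So, provided you have MDS installed, you will find a function called Similarity() in the mdq schema (in the MDS database).
More info on how it works: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/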
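A call might look like the sketch below. The argument order (input1, input2, method, containmentBias, minScoreHint) and the method codes (0 for Levenshtein, 2 for a Jaro-Winkler variant) are given as I understand the MDS documentation, so treat them as assumptions and check them against your installed version.

```sql
-- Run in the MDS database. Assumed arguments:
-- (input1, input2, method, containmentBias, minScoreHint)
SELECT mdq.Similarity(N'Carl', N'Carl Peter', 0, 0.0, 0.0) AS levenshtein_score,
       mdq.Similarity(N'Carl', N'Carl Peter', 2, 0.0, 0.0) AS jaro_winkler_score;
```

Comparing the two scores on a handful of known duplicates is a quick way to decide which method suits your names better.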
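I personally use a CLR implementation of the Jaro-Winkler algorithm, which seems to work pretty well. It struggles a bit with strings longer than about 15 characters and doesn't like matching email addresses, but otherwise it is quite good. A full implementation guide can be found here.
If you can't use CLR functions for whatever reason, maybe you could try running the data through an SSIS package (using the Fuzzy Lookup transformation), detailed here.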
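With regard to de-duplication, your string split and match is a great first cut. If there are known facts about the data that can be leveraged to reduce the workload and/or produce better results, it is always good to take advantage of them. Bear in mind that for de-duplication it is usually impossible to eliminate manual work entirely, although you can make it much easier by catching as much as you can automatically and then generating reports of your "uncertainty cases".
Regarding name matching: SOUNDEX gives very poor match quality and is especially bad for the type of work you are trying to do, because it will match things that are too far from the target. It is better to use a combination of Double Metaphone results and the Levenshtein distance to perform name matching. With appropriate biasing this works really well, and it could probably be used as a second pass after doing a cleanup on your knowns.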
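You may also want to consider using an SSIS package and looking into the Fuzzy Lookup and Fuzzy Grouping transformations (http://msdn.microsoft.com/en-us/library/ms345128(SQL.90).aspx).
Using SQL Full-Text Search (http://msdn.microsoft.com/en-us/library/cc879300.aspx) is a possibility as well, but is likely not appropriate for your specific problem domain.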
You can use SOUNDEX and the related DIFFERENCE function in SQL Server to find similar names. The reference on MSDN is here.
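For the table in the question, a DIFFERENCE-based check could be sketched as follows. The table variable @person and its sample rows are only there to make the snippet self-contained, and the cut-off of 3 is an assumption to tune (DIFFERENCE returns 0 to 4, with 4 meaning the SOUNDEX codes are most alike).

```sql
-- Illustrative sample data; @person stands in for the real table.
DECLARE @person TABLE (
    personid     int PRIMARY KEY,
    firstname    varchar(20),
    lastname     varchar(20),
    addressindex int
);

INSERT INTO @person (personid, firstname, lastname, addressindex)
VALUES (1, 'Carl',       'Anderson', 1),
       (2, 'Carl Peter', 'Anderson', 1);

SELECT a.personid,
       a.firstname,
       b.personid  AS candidateid,
       b.firstname AS candidatefirstname,
       SOUNDEX(a.firstname) AS soundex_a,
       SOUNDEX(b.firstname) AS soundex_b,
       DIFFERENCE(a.firstname, b.firstname) AS similarity   -- 0 (weak) to 4 (strong)
FROM @person AS a
JOIN @person AS b
  ON  b.addressindex = a.addressindex
  AND b.lastname     = a.lastname
  AND b.personid     > a.personid       -- list each pair only once
WHERE DIFFERENCE(a.firstname, b.firstname) >= 3;
```

Joining on addressindex and lastname first means the phonetic comparison only has to separate people who already live behind the same door.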
Do it this way:
    create table person(
        personid     int identity(1,1) primary key,
        firstname    varchar(20),
        lastname     varchar(20),
        addressindex int,
        sound        varchar(10)
    )
and later create a trigger:
    create trigger trigoninsert on dbo.person
    for insert
    as
    begin
        set nocount on;
        -- set-based update, so multi-row inserts are handled as well
        update p
        set    sound = soundex(p.firstname)
        from   dbo.person as p
        join   inserted as i on i.personid = p.personid;
    end
Now I can create a procedure that looks something like this:
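    create procedure getfuzzi(@personid int)
    as
    begin
        declare @sound varchar(10);
        -- look up the stored SOUNDEX code of the given person
        set @sound = (select sound from person where personid = @personid);
        -- return everyone whose first name has the same code
        select personid, firstname, lastname, addressindex
        from person
        where sound = @sound;
    end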
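This will return all rows whose first name has the same SOUNDEX code as, and so sounds roughly like, the first name of the person with the given personid.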