Fuzzy string comparison
What I am trying to accomplish is a program that reads in a file and compares each sentence against the original sentence. Sentences that are a perfect match to the original get a score of 1, and sentences that are the total opposite get a 0. All other fuzzy sentences receive a score between 1 and 0.
I am unsure which operation to use to allow me to complete this in Python 3.
I have included sample text in which Text 1 is the original and the other preceding strings are the comparisons.
Text: sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // score high but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // score lower than text 20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // score lower than text 21 but not 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // should score 0!
There is a package called fuzzywuzzy. Install it via pip:
pip install fuzzywuzzy
Simple usage:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96
This package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (like token order insensitivity, partial string matching) that make it more powerful in practice. The process.extract functions are especially useful: they find the best matching strings and ratios from a collection. From their readme:
Partial ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
Token sort ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
Token set ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
Process
>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
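Applied to the sentences from the question, a minimal sketch could look like this (the candidate list is copied from the question; dividing the 0-100 ratio by 100 to get the 0-1 score the question asks for is my own assumption, not something fuzzywuzzy does for you):

# Sketch: score each candidate against the original and rescale to 0-1.
from fuzzywuzzy import fuzz

original = ("It was a dark and stormy night. I was all alone sitting on a red chair. "
            "I was not completely alone as I had three cats.")
candidates = [
    "It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines",
    "It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.",
]

for sentence in candidates:
    score = fuzz.token_sort_ratio(original, sentence) / 100.0  # 1.0 = perfect match
    print(round(score, 2), sentence)

Note that a surface-similarity score like this will still rate Text 24 fairly high even though its meaning is reversed; see the paraphrase identification answer further down for semantic similarity.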
There is a module in the standard library (called difflib) that can compare strings and return a score according to their similarity. The SequenceMatcher class should do what you are after.
EDIT: A small example from the Python prompt:
>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
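For the file-reading part of the question, a rough sketch might look like this (the file name sentences.txt and the one-sentence-per-line layout are assumptions for illustration; SequenceMatcher.ratio() already returns a value between 0 and 1):

# Sketch: read candidate sentences from a file and score each against the original.
from difflib import SequenceMatcher

original = ("It was a dark and stormy night. I was all alone sitting on a red chair. "
            "I was not completely alone as I had three cats.")

with open("sentences.txt", encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if sentence:
            score = SequenceMatcher(None, original, sentence).ratio()
            print(f"{score:.3f}  {sentence}")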
HTH!
fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.
from fuzzyset import FuzzySet

corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
            It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
            I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
            It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]
Warning: be careful not to mix unicode and bytes in your fuzzyset.
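For example, decoding everything to str before adding it avoids the mix (the utf-8 encoding here is an assumption):

# Sketch: make sure only str values go into the set.
raw_lines = [b"It was a dark and stormy night.", "I had three cats."]
fs = FuzzySet([s.decode("utf-8") if isinstance(s, bytes) else s for s in raw_lines])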
This task is called paraphrase identification and it is an active area of natural language processing research. I have linked several state-of-the-art papers, and for many of them you can find open source code on GitHub.
Note that all of the answers assume there is some string/surface similarity between the two sentences, while in reality two sentences with little string similarity can be semantically similar.
If you're interested in this kind of similarity you can use Skip-Thoughts. Install the software according to the GitHub guide and go to the paraphrase detection section in the readme:
import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)
This converts your sentences (X_sentences) into vectors. Later you can find the similarity of two vectors by:
import scipy.spatial.distance

similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])
Here we assume vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the sentences whose score you want to find.
There are other models that convert a sentence into a vector, which you can find here.
Once you convert your sentences into vectors, the similarity is just a matter of finding the cosine similarity between those vectors.
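As a sketch of that last step (the stand-in vectors below are made up for illustration; in practice they would come from an encoder such as skipthoughts.encode):

# Sketch: cosine similarity between the original's vector and each candidate's vector.
import numpy as np
from scipy.spatial.distance import cosine

sentence_vectors = np.array([
    [0.9, 0.1, 0.3],   # original sentence
    [0.8, 0.2, 0.3],   # near-paraphrase -> similarity close to 1
    [0.1, 0.9, 0.1],   # unrelated sentence -> lower similarity
])

original_vec = sentence_vectors[0]
for i, vec in enumerate(sentence_vectors[1:], start=1):
    similarity = 1 - cosine(original_vec, vec)
    print(i, round(float(similarity), 3))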