Splitting text into sentences in Python
I have a text file and I need to get a list of sentences out of it.
How can this be done? There are a lot of subtleties, such as periods being used in abbreviations.
My old regular expression works badly:
re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
The Natural Language Toolkit (nltk.org) has what you need. This posting indicates that this does it:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))
(I haven't tried it!)
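If loading that pickle fails because the model is not present, the Punkt data usually has to be downloaded once first (a minimal sketch, assuming NLTK itself is already installed):

import nltk

# one-time download of the Punkt sentence tokenizer models
nltk.download('punkt')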
This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."
# -*- coding: utf-8 -*-
import re

caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + " "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
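For instance, running it over the example sentence quoted above (a quick usage sketch, assuming the function defined above is in scope):

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel "
        "before joining Nike Inc. as an engineer. He also worked at craigslist.org "
        "as a business analyst.")

# should yield two sentences, with the abbreviations and the domain name left intact
print(split_into_sentences(text))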
Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress',
                 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'ie': 'for example', 'eg': 'for example',
                 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
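A quick usage sketch (my own illustration, assuming the definitions above are in scope):

paragraph = "Hello there. How are you? I am fine."
print(find_sentences(paragraph))
# ['Hello there.', 'How are you?', 'I am fine.']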
For simple cases (where sentences are terminated normally), this should work:
import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The regular expression used there is

 *\. +

which matches a period with zero or more spaces to its left and one or more to its right (so that a period like the one in re.split itself is not counted as a sentence break).
Obviously this is not the most robust solution, but it will do fine in most cases. The only thing it won't cover is abbreviations (perhaps go through the resulting list and check whether each string starts with a capital letter?).
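One way to flesh out that capital-letter check, as a rough sketch only (the merging heuristic is my own illustration, and note that re.split has already consumed the terminating punctuation):

import re

text = ''.join(open('somefile.txt').readlines())
parts = re.split(r' *[\.\?!][\'"\)\]]* *', text)

# heuristic: a fragment that does not start with an uppercase letter is probably
# the continuation of the previous sentence (e.g. text following an abbreviation),
# so glue it back on; punctuation dropped by the split is not restored
sentences = []
for part in parts:
    if sentences and part and not part[0].isupper():
        sentences[-1] += ' ' + part
    else:
        sentences.append(part)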
@Artyom,
Hi! You could make a new tokenizer for Russian (and some other languages) using this function:
def russianTokenizer(text):
    # pad punctuation with spaces so each mark becomes a separate token
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ')
    # collapse the runs of spaces introduced above
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result
and then call it like this:
text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
Good luck, Marilena.
No doubt NLTK is the most suitable for this purpose. But getting started with NLTK is quite painful (although once you have it installed, you just reap the rewards).
So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html
# split up a paragraph into sentences
# using regular expressions
def splitParagraphIntoSentences(paragraph):
    '''break a paragraph into sentences and return a list'''
    import re
    # to split by multiple characters
    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = """This is a sentence. This is an excited sentence! And do you think this is a question?"""
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
Instead of using a regular expression to split the text into sentences, you can also use the nltk library.
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
ref: https://stackoverflow.com/a/9474645/2877052