How to get rid of punctuation using the NLTK tokenizer?

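I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: periods get attached to the last word.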

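Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else: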

    from nltk.tokenize import RegexpTokenizer

    # \w+ matches runs of letters, digits and underscores; everything else is a separator.
    tokenizer = RegexpTokenizer(r'\w+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Output:

    ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
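If NLTK is not at hand, roughly the same behaviour can be sketched with the standard library alone. This is only an illustration of the `\w+` idea, not NLTK's implementation, and `tokenize_words` is a hypothetical helper name:

```python
import re

def tokenize_words(text):
    """Return every run of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)

print(tokenize_words("Eighty-seven miles to go, yet. Onward!"))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

# \w also covers digits and the underscore, so these survive as tokens.
print(tokenize_words("user_id 42 ok!"))
# ['user_id', '42', 'ok']
```

Note that the hyphen is treated as a separator, so "Eighty-seven" comes back as two tokens.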

You don't really need NLTK to remove punctuation. You can remove it with plain Python. For a Python 2 byte string (str):

    import string

    # Python 2 only: str.translate(None, chars) deletes every character in chars.
    s = '... some string with punctuation ...'
    s = s.translate(None, string.punctuation)

Or for unicode (this dict-based form also works on a Python 3 str):

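    import string

    # Map each punctuation code point to None so that translate() deletes it.
    # s must be a unicode string here (any str on Python 3).
    translate_table = dict((ord(char), None) for char in string.punctuation)
    s = s.translate(translate_table)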

Then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed (like digits).
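On Python 3 the two-argument `translate(None, ...)` call is gone, and the usual replacement is `str.maketrans` with a third argument listing the characters to delete. A minimal sketch, with a made-up sample sentence, that also shows the digits variant mentioned in the P.S.:

```python
import string

# Translation tables that map each unwanted character to None (i.e. delete it).
strip_punct = str.maketrans("", "", string.punctuation)
strip_punct_digits = str.maketrans("", "", string.punctuation + string.digits)

s = "Hello, world! It's 2024."
print(s.translate(strip_punct))                 # Hello world Its 2024
print(repr(s.translate(strip_punct_digits)))    # 'Hello world Its '
```

Keep in mind that this deletes apostrophes too, so "It's" turns into "Its" before the tokenizer ever sees it.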

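As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). If you have unicode strings, make sure each one is a unicode object (not a 'str' encoded with some encoding such as 'utf-8').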

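    from nltk.tokenize import word_tokenize, sent_tokenize

    text = '''It is a blue, small, and extraordinary ball. Like no other'''
    # Split into sentences first, then tokenize each sentence.
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    # Drops only ',' and '-' tokens (substring test against ',-').
    print(list(filter(lambda word: word not in ',-', tokens)))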
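One thing to watch in that filter: `word not in ',-'` is a substring test against the two-character string `',-'`, so it drops `','` and `'-'` but leaves `'.'` in place. A sketch of a stricter filter using set membership against `string.punctuation` follows. The token list is typed in by hand to stand in for tokenizer output, and `drop_punctuation` is a hypothetical helper name:

```python
import string

PUNCTUATION = set(string.punctuation)

def drop_punctuation(tokens):
    """Keep only tokens that are not a single punctuation character."""
    return [tok for tok in tokens if tok not in PUNCTUATION]

# Hand-written stand-in for tokenizer output (an assumption, not produced by NLTK here).
tokens = ['It', 'is', 'a', 'blue', ',', 'small', ',', 'and',
          'extraordinary', 'ball', '.', 'Like', 'no', 'other']

print([tok for tok in tokens if tok not in ',-'])  # substring test: '.' survives
print(drop_punctuation(tokens))                    # set membership: '.' is dropped
```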

I just use the following code, which removes all the punctuation:

    import nltk

    # raw is the input text
    tokens = nltk.wordpunct_tokenize(raw)
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    # Keep purely alphabetic tokens, lowercased.
    words = [w.lower() for w in text if w.isalpha()]
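To see what this does to contractions, here is a standard-library sketch. It assumes that wordpunct_tokenize is driven by the pattern `\w+|[^\w\s]+` (my understanding of NLTK's WordPunctTokenizer; treat it as an assumption), and the sample sentence is made up:

```python
import re

# Pattern assumed to mirror NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

raw = "I can't do this now, because I'm so tired."
tokens = WORDPUNCT.findall(raw)
# Keep purely alphabetic tokens, lowercased.
words = [w.lower() for w in tokens if w.isalpha()]

print(tokens)
# ['I', 'can', "'", 't', 'do', 'this', 'now', ',', 'because', 'I', "'", 'm', 'so', 'tired', '.']
print(words)
# ['i', 'can', 't', 'do', 'this', 'now', 'because', 'i', 'm', 'so', 'tired']
```

The punctuation is gone, but "can't" and "I'm" leave stray "t" and "m" tokens behind, which may or may not matter for your use case.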

I think you need some kind of regular expression matching (the following code is in Python 3):

    import string
    import re
    import nltk

    s = "I can't do this now, because I'm so tired. Please give me some time."
    l = nltk.word_tokenize(s)
    # Drop tokens made up entirely of punctuation characters.
    ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
    print(l)
    print(ll)

Output:

    ['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
    ['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize.
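The punctuation test can be pulled out into a small predicate so it is easy to try on individual tokens. This is a sketch: `is_punctuation` is a hypothetical helper name, and `re.escape` is added here so the character class does not depend on how the punctuation characters happen to be ordered:

```python
import re
import string

# One or more punctuation characters and nothing else.
PUNCT_ONLY = re.compile("[" + re.escape(string.punctuation) + "]+")

def is_punctuation(token):
    """True when the whole token consists of punctuation characters."""
    return PUNCT_ONLY.fullmatch(token) is not None

for token in ["n't", "'m", ",", "...", "``", "time"]:
    print(repr(token), is_punctuation(token))
```

Tokens such as "n't" and "'m" contain a letter, so they are kept, while multi-character tokens like "..." are treated as punctuation.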

The code below removes all punctuation marks as well as non-alphabetic characters. Copied from the NLTK book.

http://www.nltk.org/book/ch01.html

    import nltk

    s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"
    words = nltk.word_tokenize(s)
    # Keep purely alphabetic tokens, lowercased.
    words = [word.lower() for word in words if word.isalpha()]
    print(words)

Output:

    ['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

I use this code to remove punctuation:

    import nltk

    def getTerms(sentences):
        tokens = nltk.word_tokenize(sentences)
        # Keep tokens made up only of letters and digits, lowercased.
        words = [w.lower() for w in tokens if w.isalnum()]
        print(tokens)
        print(words)

    getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
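The difference between this answer and the isalpha() one is only in what counts as a word: isalnum() keeps numbers and mixed tokens, isalpha() does not. A small sketch on a hand-written token list (an assumption standing in for tokenizer output, not produced by NLTK here):

```python
# Hand-written tokens for illustration only.
tokens = ["hh", ",", "hh3h", ".", "wo", "shi", "2", "4", "A", "&", "B"]

alpha_only = [t.lower() for t in tokens if t.isalpha()]
alnum_only = [t.lower() for t in tokens if t.isalnum()]

print(alpha_only)  # ['hh', 'wo', 'shi', 'a', 'b']
print(alnum_only)  # ['hh', 'hh3h', 'wo', 'shi', '2', '4', 'a', 'b']
```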

If you also want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

    import enchant

    d = enchant.Dict("en_US")
    d.check("Hello")
    d.check("Helo")
    d.suggest("Helo")
