WordNet lemmatization and POS tagging in Python
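I want to use the WordNet lemmatizer in Python. I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
My question is: what is the best way to perform this lemmatization accurately?
I did the POS tagging with nltk.pos_tag, and I am stuck on converting the treebank POS tags into WordNet-compatible POS tags. Please help.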
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
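The output tags I get are of the form NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?
Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data to evaluate?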
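First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. In the NLTK version this answer was written against, you can see the file name with nltk.tag._POS_TAGGER (newer releases use a different default tagger and may not expose this attribute):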
    >>> nltk.tag._POS_TAGGER
    'taggers/maxent_treebank_pos_tagger/english.pickle'
Because it was trained on the Treebank corpus, it also uses the Treebank tag set.
The following function maps the treebank tags to WordNet part-of-speech names:
    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''
You can then pass the return value to the lemmatizer:
    >>> from nltk.stem.wordnet import WordNetLemmatizer
    >>> lemmatizer = WordNetLemmatizer()
    >>> lemmatizer.lemmatize('going', wordnet.VERB)
    'go'
As can be seen in the source code of nltk.corpus.reader.wordnet ( http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html ):
    #{ Part-of-speech constants
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    #}

    POS_LIST = [NOUN, VERB, ADJ, ADV]
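Since these constants are plain one-letter strings, the treebank-to-WordNet mapping can also be written as a lookup table. The following is only a minimal sketch: `TREEBANK_PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical names made up for illustration, not part of NLTK, and the letters are copied from the constants above.

```python
# Hypothetical helper, not part of NLTK: maps the first letter of a
# Penn Treebank tag to the one-letter WordNet POS constants listed above.
TREEBANK_PREFIX_TO_WORDNET = {
    "J": "a",  # adjective tags (JJ, JJR, JJS) -> wordnet.ADJ
    "V": "v",  # verb tags (VB, VBD, VBG, ...) -> wordnet.VERB
    "N": "n",  # noun tags (NN, NNS, NNP, ...) -> wordnet.NOUN
    "R": "r",  # tags starting with R (RB, ...) -> wordnet.ADV
}


def treebank_to_wordnet(treebank_tag, default=None):
    """Return the WordNet POS letter for a treebank tag, or `default`."""
    # Only the first character of the tag decides the WordNet category.
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)
```

With `default=None` the caller can tell "no WordNet category" apart from a real tag; passing `default="n"` would instead fall back to the noun category.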
@Suzana_K's answer works. But I have some cases that result in a KeyError, as @Clock Slave mentioned.
Convert treebank tags to WordNet tags:
    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None  # for easy if-statement
Now we pass pos to the lemmatize function only when we have a WordNet tag:
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
    for word, tag in tagged:
        wntag = get_wordnet_pos(tag)
        if wntag is None:  # do not supply a tag when there is none
            lemma = lemmatizer.lemmatize(word)
        else:
            lemma = lemmatizer.lemmatize(word, pos=wntag)
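The loop above can be wrapped in a small function so that the "pass pos only when it is known" rule lives in one place. This is just a sketch under some assumptions: `lemmatize_tagged` is a hypothetical name, `lemmatize` is assumed to behave like `WordNetLemmatizer().lemmatize`, and `to_wordnet_pos` is assumed to behave like `get_wordnet_pos` above (returning None for unmapped tags).

```python
def lemmatize_tagged(tagged, lemmatize, to_wordnet_pos):
    """Lemmatize (word, treebank_tag) pairs.

    Hypothetical helper: `lemmatize` is assumed to accept (word) or
    (word, pos=...), and `to_wordnet_pos` is assumed to return None
    when a treebank tag has no WordNet counterpart.
    """
    lemmas = []
    for word, tag in tagged:
        wntag = to_wordnet_pos(tag)
        if wntag is None:
            # No WordNet tag: rely on the lemmatizer's own default.
            lemmas.append(lemmatize(word))
        else:
            lemmas.append(lemmatize(word, pos=wntag))
    return lemmas
```

Under those assumptions it could be called as `lemmatize_tagged(nltk.pos_tag(tokens), lemmatizer.lemmatize, get_wordnet_pos)`. Taking the two callables as parameters also makes it possible to try the function with stand-ins when the NLTK data files are not installed.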
Steps to convert: Document -> Sentences -> Tokens -> POS -> Lemmas
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    # example text
    text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

    class Splitter(object):
        """
        split the document into sentences and tokenize each sentence
        """
        def __init__(self):
            self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
            self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

        def split(self, text):
            """
            returns one token list per sentence, e.g.
            ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
            """
            # split into single sentences
            sentences = self.splitter.tokenize(text)
            # tokenize each sentence
            tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
            return tokens

    class LemmatizationWithPOSTagger(object):
        def __init__(self):
            pass

        def get_wordnet_pos(self, treebank_tag):
            """
            return the WordNet POS tag (a, n, r, v) expected by the WordNet lemmatizer
            """
            if treebank_tag.startswith('J'):
                return wordnet.ADJ
            elif treebank_tag.startswith('V'):
                return wordnet.VERB
            elif treebank_tag.startswith('N'):
                return wordnet.NOUN
            elif treebank_tag.startswith('R'):
                return wordnet.ADV
            else:
                # the default pos in lemmatization is noun
                return wordnet.NOUN

        def pos_tag(self, tokens):
            # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ....
            pos_tokens = [nltk.pos_tag(token) for token in tokens]

            # lemmatization using the POS tag
            # convert into a feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...
            # i.e. [original word, lemmatized word, POS tag]
            pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag])
                           for (word, pos_tag) in pos]
                          for pos in pos_tokens]
            return pos_tokens

    lemmatizer = WordNetLemmatizer()
    splitter = Splitter()
    lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

    # step 1: split the document into sentences, followed by tokenization
    tokens = splitter.split(text)

    # step 2: lemmatization using the POS tagger
    lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)

    print(lemma_pos_token)