Lemmatization in Java
I am looking for a lemmatisation implementation for English in Java. I have already found a few, but I need one that does not take too much memory to run (1 GB tops). Thanks. I do not need a stemmer.
The Stanford CoreNLP Java library contains a lemmatizer that is a bit resource-intensive, but I have run it on my laptop with less than 512 MB of RAM.
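For reference, you can hold the pipeline to a fixed memory budget by capping the JVM heap when you launch it; a one-line example (the classpath and class name here are just placeholders):

java -Xmx512m -cp "*" StanfordLemmatizer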
To use it:
- Download the jar files;
- Create a new project in your editor of choice / make an ant script that includes all of the jar files contained in the archive you just downloaded;
- Create a new Java class like the one below (based on the snippet from Stanford's website):
import java.util.Properties;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
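A minimal usage sketch (not part of the original snippet; it assumes the class above compiles with the CoreNLP jars and the imports mentioned in the next answer on the classpath):

StanfordLemmatizer lemmatizer = new StanfordLemmatizer();
// should print something like [how, could, you, be, see, into, my, eye, like, open, door, ?]
System.out.println(lemmatizer.lemmatize("How could you be seeing into my eyes like open doors?"));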
Chris's answer about the Stanford Lemmatizer is great! Simply beautiful. He even included a pointer to the jar files, so I did not have to google for them.
But one line of his code had a syntax error (he somehow swapped the trailing closing parenthesis and semicolon in the line beginning with "lemmas.add..."), and he forgot to include the imports.
As for the NoSuchMethodError, it is usually caused by the method not being made public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h ) that is not the problem. I suspect the problem is somewhere in the build path (I am using Eclipse Kepler, so I had no trouble configuring the 33 jar files I use in the project).
Below is my small modification of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed
         * linguistic forms. The String is tokenized via a tokenizer (such as
         * PTBTokenizerAnnotator), and then other sequence model style annotation
         * can be used to add things like lemmas, POS tags, and named entities.
         * These are returned as a list of CoreLabels. Other analysis components
         * build and store parse trees, dependency graphs, etc.
         *
         * This class is designed to apply multiple Annotators to an Annotation.
         * The idea is that you first build up the pipeline by adding Annotators,
         * and then you take the objects you wish to annotate and pass them in and
         * get in return a fully annotated object.
         *
         * StanfordCoreNLP loads a lot of models, so you probably only want to do
         * this once per execution.
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new LinkedList<String>();

        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"
                + "You led me down into my core where I've became so numb \n"
                + "Without a soul my spirit's sleeping somewhere cold \n"
                + "Until you find it there and led it back home \n"
                + "You woke me up inside \n"
                + "Called my name and saved me from the dark \n"
                + "You have bidden my blood and it ran \n"
                + "Before I would become undone \n"
                + "You saved me from the nothing I've almost become \n"
                + "You were bringing me to life \n"
                + "Now that I knew what I'm without \n"
                + "You can've just left me \n"
                + "You breathed into me and made me real \n"
                + "Frozen inside without your touch \n"
                + "Without your love, darling \n"
                + "Only you are the life among the dead \n"
                + "I've been living a lie, there's nothing inside \n"
                + "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }
}
Here are my results (I was very impressed; it even picked up the "'ve" contractions as "have", and almost everything else came out perfectly):
Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, be, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]
There is a JNI binding for Hunspell, the spell checker used in OpenOffice and Firefox: http://hunspell.sourceforge.net/
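The Java binding's API depends on which wrapper you pick, so as a rough illustration here is a sketch that shells out to the hunspell command-line tool instead. It assumes hunspell and the en_US dictionary are installed locally and uses the -s (stem) mode, which is dictionary-based stemming rather than true lemmatization:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class HunspellStemDemo {
    public static void main(String[] args) throws IOException {
        // -d en_US picks the dictionary; -s prints each input word with its stem(s)
        Process p = new ProcessBuilder("hunspell", "-d", "en_US", "-s").start();
        try (Writer in = new OutputStreamWriter(p.getOutputStream())) {
            in.write("walked plants\n");
        } // closing stdin signals EOF so hunspell processes the input and exits
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line); // e.g. "walked walk" and "plants plant"
            }
        }
    }
}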
You can try the free Lemmatizer API here: http://twinword.com/lemmatizer.php
Scroll down to find the Lemmatizer endpoint.
It lets you turn "dogs" into "dog" and "abilities" into "ability".
If you pass a POST or GET parameter called "text" with a string like "walked plants":
// These code snippets use an open-source library. http://unirest.io/java
HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
        .header("X-Mashape-Key", "[API KEY]")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Accept", "application/json")
        .field("text", "walked plants")
        .asJson();
You get a response like this:
{ "lemma": { "plant": 1, "walk": 1 }, "result_code": "200", "result_msg": "Success" }
Check out Lucene Snowball.
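Note that Snowball is a stemmer rather than the lemmatizer the question asked for, but it is very light on memory. A minimal sketch using Lucene's analysis API (package locations vary between Lucene versions; this follows the Lucene 5/6 layout):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SnowballDemo {
    public static void main(String[] args) throws IOException {
        // Tokenize, lowercase, then apply the English Snowball stemmer
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("walked plants"));
        TokenStream stream = new SnowballFilter(new LowerCaseFilter(tokenizer), "English");
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // "walk", then "plant"
        }
        stream.end();
        stream.close();
    }
}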