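Classifying text into multiple categories with scikit-learn
I am trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I have tried returns just one match.
For example, I have a piece of text: "Theaters in New York compared to those in London". I have trained the algorithm to pick a place for every text snippet I feed it.
In the example above I would want it to return New York and London, but it only returns New York.
Is it possible to use scikit-learn to return multiple results? Or even to return the labels with the highest probabilities?
Thanks for your help.
Update:
I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using: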
    y_train = ('New York','London')
    train_set = ("new york nyc big apple", "london uk great britain")
    vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
    count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
    test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')
    X_vectorized = count.transform(train_set).todense()
    smatrix2 = count.transform(test_set).todense()
    base_clf = MultinomialNB(alpha=1)
    clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
    Y_pred = clf.predict(smatrix2)
    print Y_pred
Result: ['New York' 'London' 'London']
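What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.
I'm not sure what is going wrong in your example; my version of sklearn apparently does not have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels.
The following works for me: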
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    classifier = Pipeline([
        ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    for item, labels in zip(X_test, predicted):
        print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))
For me, this produces the output:
    nice day in nyc => New York
    welcome to london => London
    hello welcome to new york. enjoy it here and london too => New York, London
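Hope this helps.
EDIT: Updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.
I have been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as input rather than binary labels and encodes them using MultiLabelBinarizer.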
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","london"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y_train_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    all_labels = mlb.inverse_transform(predicted)

    for item, labels in zip(X_test, all_labels):
        print('{0} => {1}'.format(item, ', '.join(labels)))
This gives me the following output:
    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => london, new york
    hello welcome to new york. enjoy it here and london too => london, new york
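The question also asks about returning labels by probability. A minimal sketch of one way to get per-label scores, assuming LinearSVC is swapped for LogisticRegression (my substitution, not part of the answers here) so that `predict_proba` is available; the name `scorer` is hypothetical and the training data is the same as in the example above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a hell of a town", "new york was originally dutch",
           "the big apple is great", "new york is also called the big apple",
           "nyc is nice", "people abbreviate new york city as nyc",
           "the capital of great britain is london", "london is in the uk",
           "london is in england", "london is in great britain",
           "it rains a lot in london", "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = [["new york"]] * 6 + [["london"]] * 6 + [["new york", "london"]] * 2

# Encode the label lists as a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

# Assumption: LogisticRegression replaces LinearSVC so probabilities exist.
scorer = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
scorer.fit(X_train, Y)

X_test = ["nice day in nyc", "welcome to london"]
proba = scorer.predict_proba(X_test)  # shape: (n_samples, n_labels)

# Print every label for each text, highest probability first.
for text, row in zip(X_test, proba):
    ranked = sorted(zip(mlb.classes_, row), key=lambda pair: pair[1], reverse=True)
    print(text, "=>", ", ".join("{0} ({1:.2f})".format(l, p) for l, p in ranked))
```

In the multi-label case each column is scored by its own binary classifier, so the values in a row need not sum to 1. With LinearSVC, `decision_function` gives comparable per-label scores, though they are margins rather than probabilities.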
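I ran into this as well, and the problem for me was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to use multi-class or multi-label based on the format of the input labels. So change: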
    y_train = ('New York','London')
to
    y_train = (['New York'],['London'])
Apparently this will go away in the future, since it breaks when all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
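One way to see how scikit-learn interprets a given target format is the `type_of_target` utility. The sketch below is only an illustration of the label-format distinction described above; it does not claim to show what OneVsRestClassifier does internally. As far as I know, recent releases no longer accept the sequence-of-sequences format and expect a binary indicator matrix instead, so that is what the second case uses.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One label per sample: treated as a single-label problem
# (reported as 'binary' here because there are only two distinct labels).
single_labels = ["New York", "London"]
kind_single = type_of_target(single_labels)

# One row per sample, one column per label: treated as multi-label.
# The third row carries both labels at once.
indicator = np.array([[1, 0],
                      [0, 1],
                      [1, 1]])
kind_multi = type_of_target(indicator)

print(kind_single, kind_multi)
```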
Change this line to make it work in newer versions of Python:
    # lb = preprocessing.LabelBinarizer()
    lb = preprocessing.MultiLabelBinarizer()
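A small sketch of the difference between the two binarizers, to show why the swap matters. The extra "Paris" label is my own addition for illustration (with only two classes LabelBinarizer returns a single column, which makes the contrast harder to see).

```python
from sklearn import preprocessing

# LabelBinarizer: exactly one label per sample, one-hot encoded.
lb = preprocessing.LabelBinarizer()
single = lb.fit_transform(["New York", "London", "Paris"])  # "Paris" is illustrative

# MultiLabelBinarizer: a collection of labels per sample,
# so a row may contain more than one 1.
mlb = preprocessing.MultiLabelBinarizer()
multi = mlb.fit_transform([["New York"], ["London"], ["New York", "London"]])

print(lb.classes_)   # columns of `single`, in sorted order
print(single)
print(mlb.classes_)  # columns of `multi`, in sorted order
print(multi)
```

`mlb.inverse_transform` maps rows of the indicator matrix back to tuples of label names, which is what the earlier answer uses to print its predictions.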