How to one-hot encode variable-length features?

Given a list of variable-length features:

features = [ ['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2'] ]

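Each sample has a different number of features; the feature dtype is str and the features are already one-hot.

In order to use sklearn's feature selection utilities, I have to convert features into a 2D array that looks like this:

        f1  f2  f3  f4  f5  f6
    s1   1   1   1   0   0   0
    s2   0   1   0   1   1   1
    s3   1   1   0   0   0   0

How can I achieve this with sklearn or numpy?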

You can use MultiLabelBinarizer from scikit-learn, which is designed specifically for this.

Code for your example:

    features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()
    new_features = mlb.fit_transform(features)

Output:

    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 0, 1, 1, 1],
           [1, 1, 0, 0, 0, 0]])
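The array above has no row or column labels. As a minimal sketch, assuming pandas is available, the fitted binarizer's classes_ attribute (the sorted unique labels, one per column) can be used to rebuild the labelled table from the question. The s1, s2, s3 row names are only the question's illustrative sample labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ lists the sorted unique labels, one per output column
index = ['s%d' % (i + 1) for i in range(len(features))]  # illustrative row names
df = pd.DataFrame(new_features, columns=mlb.classes_, index=index)
print(df)
```

For a large vocabulary, passing sparse_output=True to MultiLabelBinarizer returns a SciPy sparse matrix instead of a dense array.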

The binarized output can then be passed on to a pipeline, along with other feature_selection utilities.
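A rough sketch of combining the binarized matrix with a feature selector follows. The labels y, the choice of chi2, and k=2 are assumptions made purely for illustration and are not part of the question. As far as I know, MultiLabelBinarizer's fit_transform accepts a single argument, so the sketch binarizes as a separate step before the selector rather than placing the binarizer inside a Pipeline object.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2

features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]
y = [0, 1, 0]  # hypothetical class labels, one per sample

# Step 1: binarize the variable-length feature lists
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(features)

# Step 2: keep the k columns with the highest chi-squared score
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Map the boolean support mask back to the original feature names
selected_names = mlb.classes_[selector.get_support()]
print(X_selected.shape, selected_names)
```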

Here is an approach that uses NumPy functions and outputs a pandas DataFrame:

    import numpy as np
    import pandas as pd

    lens = list(map(len, features))   # number of features per sample
    N = len(lens)

    # unq: sorted unique names; col: column index of every flattened entry
    unq, col = np.unique(np.concatenate(features), return_inverse=1)
    row = np.repeat(np.arange(N), lens)   # row index of every flattened entry

    out = np.zeros((N, len(unq)), dtype=int)
    out[row, col] = 1

    indx = ['s' + str(i + 1) for i in range(N)]
    df_out = pd.DataFrame(out, columns=unq, index=indx)

Sample input and output:

    In [80]: features
    Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

    In [81]: df_out
    Out[81]:
        f1  f2  f3  f4  f5  f6
    s1   1   1   1   0   0   0
    s2   0   1   0   1   1   1
    s3   1   1   0   0   0   0
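For comparison, the same encoding can be sketched in plain Python with no third-party dependencies, which may help when checking what the two answers compute. The names vocab, column and encoded are illustrative only.

```python
features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

# Sorted vocabulary of every distinct feature name
vocab = sorted({name for sample in features for name in sample})
column = {name: j for j, name in enumerate(vocab)}

# One row per sample, with a 1 in the column of each feature it contains
encoded = []
for sample in features:
    row = [0] * len(vocab)
    for name in sample:
        row[column[name]] = 1
    encoded.append(row)

print(vocab)
print(encoded)
```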
