Python pandas: DataFrame.groupby().agg(), referencing columns inside agg()

To make the question concrete, say I have a DataFrame DF:

      word tag  count
    0    a   S     30
    1  the   S     20
    2    a   T     60
    3   an   T      5
    4  the   T     10

For every "word", I want to find the "tag" that has the highest "count". So the result would look something like:

      word tag  count
    1  the   S     20
    2    a   T     60
    3   an   T      5

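I don't care about the count column, or whether the order/index is original or scrambled. Returning a dictionary {'the': 'S', ...} would be just fine.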

I was hoping I could do

    DF.groupby(['word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )

but it doesn't work. I can't access the column information inside the function.

More abstractly, what does the function in agg(function) actually receive as its argument?

By the way, is .agg() the same as .aggregate()?

Many thanks.

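agg is the same as aggregate. Its callable is passed the columns (Series objects) of the DataFrame, one at a time. That is why x['tag'] fails inside the lambda: x is a single column, not a frame.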
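To check this for yourself, here is a minimal sketch that records the type of every object agg hands to the callable. The helper name first_value and the seen_types set are made up for illustration; the data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'word': 'a the a an the'.split(),
    'tag': list('SSTTT'),
    'count': [30, 20, 60, 5, 10],
})

seen_types = set()  # illustrative bookkeeping, not part of pandas

def first_value(col):
    # Record what agg passes in, then reduce the column to a scalar.
    seen_types.add(type(col).__name__)
    return col.iloc[0]

result = df.groupby('word').agg(first_value)
print(seen_types)  # the kinds of objects the callable received
print(result)      # one row per word, one column per non-grouping column
```

Because each call only sees one column of one group, a callable given to agg has no way to look at count while choosing a tag.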

You can use idxmax to collect the index labels of the rows with the maximum count:

    idx = df.groupby('word')['count'].idxmax()
    print(idx)

yields

    word
    a      2
    an     3
    the    1
    Name: count

Then use loc to select those rows, keeping only the word and tag columns:

    print(df.loc[idx, ['word', 'tag']])

yields

      word tag
    2    a   T
    3   an   T
    1  the   S

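Note that idxmax returns index labels, and df.loc selects rows by label. If the index is not unique, that is, if several rows share the same index label, then df.loc will select all rows carrying the labels listed in idx. So if you use idxmax together with df.loc, make sure that df.index.is_unique is True.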
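A small sketch of that pitfall, assuming a deliberately non-unique index that is not in the original question. Resetting the index first with reset_index(drop=True) is one way to get back to one row per word.

```python
import pandas as pd

df = pd.DataFrame(
    {'word': 'a the a an the'.split(),
     'tag': list('SSTTT'),
     'count': [30, 20, 60, 5, 10]},
    index=[0, 1, 0, 1, 2],  # made-up, deliberately duplicated labels
)
print(df.index.is_unique)  # False

# Labels from idxmax are ambiguous here, so loc returns extra rows.
idx = df.groupby('word')['count'].idxmax()
naive = df.loc[idx, ['word', 'tag']]
print(len(naive))  # more rows than there are distinct words

# Give every row its own label, then repeat the same two steps.
clean = df.reset_index(drop=True)
idx_clean = clean.groupby('word')['count'].idxmax()
fixed = clean.loc[idx_clean, ['word', 'tag']]
print(fixed)
```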

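Alternatively, you can use apply. The callable given to apply is passed a sub-DataFrame, which gives you access to all of the columns:

    import pandas as pd

    df = pd.DataFrame({'word': 'a the a an the'.split(),
                       'tag': list('SSTTT'),
                       'count': [30, 20, 60, 5, 10]})

    # For each group, look up the tag in the row with the largest count.
    print(df.groupby('word').apply(
        lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

    word
    a      T
    an     T
    the    S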

Using idxmax and loc is typically faster than apply, especially for large DataFrames. Using IPython's %timeit:

    N = 10000
    df = pd.DataFrame({'word': 'a the a an the'.split()*N,
                       'tag': list('SSTTT')*N,
                       'count': [30, 20, 60, 5, 10]*N})

    def using_apply(df):
        return (df.groupby('word').apply(
            lambda subf: subf['tag'][subf['count'].idxmax()]))

    def using_idxmax_loc(df):
        idx = df.groupby('word')['count'].idxmax()
        return df.loc[idx, ['word', 'tag']]

    In [22]: %timeit using_apply(df)
    100 loops, best of 3: 7.68 ms per loop

    In [23]: %timeit using_idxmax_loc(df)
    100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, you can use set_index and to_dict like this:

    In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')

    In [37]: df2
    Out[37]:
         tag
    word
    a      T
    an     T
    the    S

    In [38]: df2.to_dict()['tag']
    Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}
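The same steps can be wrapped into one function for use outside an interactive session. This is only a sketch: the name most_common_tag is hypothetical, and it assumes the frame has a unique index and the column names word, tag and count from the question.

```python
import pandas as pd

def most_common_tag(df):
    """Hypothetical helper: map each word to the tag with the highest count.

    Assumes df.index is unique (see the note on idxmax and loc).
    """
    idx = df.groupby('word')['count'].idxmax()
    return df.loc[idx].set_index('word')['tag'].to_dict()

df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

mapping = most_common_tag(df)
print(mapping)
```

Selecting the tag column before calling to_dict returns the flat word-to-tag mapping directly, without the extra ['tag'] lookup on a nested dictionary.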

Here is a simple way to find out what is being passed to your function; unutbu's solution then "applies"!

    In [33]: def f(x):
       ....:     print(type(x))
       ....:     print(x)
       ....:

    In [34]: df.groupby('word').apply(f)
    <class 'pandas.core.frame.DataFrame'>
      word tag  count
    0    a   S     30
    2    a   T     60
    <class 'pandas.core.frame.DataFrame'>
      word tag  count
    0    a   S     30
    2    a   T     60
    <class 'pandas.core.frame.DataFrame'>
      word tag  count
    3   an   T      5
    <class 'pandas.core.frame.DataFrame'>
      word tag  count
    1  the   S     20
    4  the   T     10

(The first group shows up twice because older pandas versions call the function twice on the first group.)

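Your function simply operates on a sub-section of the frame in which the grouping variable has the same value throughout (in this case, 'word'). If you pass your own function, you have to deal with aggregating columns that may not be numeric, such as string columns. Standard functions like 'sum' handle this for you.

In the pandas version used here, sum automatically does NOT aggregate the string columns (since pandas 2.0 you need to pass numeric_only=True to get the same result):

    In [41]: df.groupby('word').sum()
    Out[41]:
          count
    word
    a        90
    an        5
    the      30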
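Since the default handling of string columns depends on the pandas version, a hedged sketch of two ways to state the intent explicitly: ask sum for numeric columns only, or select the column before aggregating. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

# Option 1: aggregate only the numeric columns.
totals = df.groupby('word').sum(numeric_only=True)
print(totals)

# Option 2: pick the column first, which yields a Series.
count_totals = df.groupby('word')['count'].sum()
print(count_totals)
```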

Here you are aggregating over ALL columns, so the strings get concatenated:

    In [42]: df.groupby('word').apply(lambda x: x.sum())
    Out[42]:
            word tag  count
    word
    a         aa  ST     90
    an        an   T      5
    the   thethe  ST     30

You can do pretty much anything inside the function:

    In [43]: df.groupby('word').apply(lambda x: x['count'].sum())
    Out[43]:
    word
    a       90
    an       5
    the     30