pandas占总数的百分比
这显然是简单的,但作为一个numpy纽伯我卡住了。
我有一个CSV文件,其中包含3列,国家,Office ID和该办事处的销售。
我想计算给定州内每个办公室的销售额百分比(各州百分比总和为100%)。
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': range(1, 7) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
这返回:
sales state office_id AZ 2 839507 4 373917 6 347225 CA 1 798585 3 890850 5 454423 CO 1 819975 3 202969 5 614011 WA 2 163942 4 369858 6 959285
我似乎无法弄清楚如何“达到” groupby
的state
水平,以总计全state
的sales
来计算分数。
Paul H的答案是正确的,你将不得不第二个groupby
对象,但你可以用一个更简单的方式计算百分比 – 只是state_office
,并将sales
除以其总和。 复制Paul H的答案的开头:
# From Paul H import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'}) # Change: groupby state_office and divide by sum state_pcts = state_office.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
返回:
sales state office_id AZ 2 16.981365 4 19.250033 6 63.768601 CA 1 19.331879 3 33.858747 5 46.809373 CO 1 36.851857 3 19.874290 5 43.273852 WA 2 34.707233 4 35.511259 6 29.781508
您需要按照状态进行分组,然后使用div
方法创build第二个分组:
import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'}) state = df.groupby(['state']).agg({'sales': 'sum'}) state_office.div(state, level='state') * 100 sales state office_id AZ 2 16.981365 4 19.250033 6 63.768601 CA 1 19.331879 3 33.858747 5 46.809373 CO 1 36.851857 3 19.874290 5 43.273852 WA 2 34.707233 4 35.511259 6 29.781508
div
的level='state'
state'kwarg告诉大pandas根据索引state
级别的值广播/joindataframe。
我知道这是一个古老的问题,但exp1orer的答案是非常缓慢的数据集具有大量独特的组(可能是因为拉姆达)。 我build立了他们的答案,把它变成一个数组计算,所以现在它超快! 以下是示例代码:
创build具有50,000个唯一组的testing数据框
import random import string import pandas as pd import numpy as np np.random.seed(0) # This is the total number of groups to be created NumberOfGroups = 50000 # Create a lot of groups (random strings of 4 letters) Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10 Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2 FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)] # Make the numbers NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)] # Make the dataframe df = pd.DataFrame({'Group 1': Group1, 'Group 2': Group2, 'Final Group': FinalGroup, 'Numbers I want as percents': NumbersForPercents})
分组时,它看起来像:
Numbers I want as percents Group 1 Group 2 Final Group AAAH AQYR RMCH 847 XDCL 182 DQGO ALVF 132 AVPH 894 OVGH NVOO 650 VKQP 857 VNLY HYFW 884 MOYH 469 XOOC GIDS 168 HTOY 544 AACE HNXU RAXK 243 YZNK 750 NOYI NYGC 399 ZYCI 614 QKGK CRLF 520 UXNA 970 TXAR MLNB 356 NMFJ 904 VQYG NPON 504 QPKQ 948 ... [50000 rows x 1 columns]
数组查找百分比的方法:
# Initial grouping (basically a sorted version of df) PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index() # Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index) SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index() # Merge the two dataframes Percents_df = pd.merge(PreGroupby_df, SumGroup_df) # Divide the two columns Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100 # Drop the extra _Sum column Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
这种方法大约需要0.15秒
最佳答案方法(使用lambda函数):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'}) state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
这种方法需要大约21秒才能产生相同的结果。
结果:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group 0 AAAH AQYR RMCH 847 82.312925 1 AAAH AQYR XDCL 182 17.687075 2 AAAH DQGO ALVF 132 12.865497 3 AAAH DQGO AVPH 894 87.134503 4 AAAH OVGH NVOO 650 43.132050 5 AAAH OVGH VKQP 857 56.867950 6 AAAH VNLY HYFW 884 65.336290 7 AAAH VNLY MOYH 469 34.663710 8 AAAH XOOC GIDS 168 23.595506 9 AAAH XOOC HTOY 544 76.404494
您可以将整个DataFrame
sum
除以state
总和:
# Copying setup from Paul H answer import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) # Add a column with the sales divided by state total sales. df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales'] df
返回
office_id sales state sales_ratio 0 1 405711 CA 0.193319 1 2 535829 WA 0.347072 2 3 217952 CO 0.198743 3 4 252315 AZ 0.192500 4 5 982371 CA 0.468094 5 6 459783 WA 0.297815 6 1 404137 CO 0.368519 7 2 222579 AZ 0.169814 8 3 710581 CA 0.338587 9 4 548242 WA 0.355113 10 5 474564 CO 0.432739 11 6 835831 AZ 0.637686
但请注意,这只适用于因为state
以外的所有列都是数字,所以可以对整个DataFrame进行求和。 例如,如果office_id
是字符,则会出现错误:
df.office_id = df.office_id.astype(str) df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError:不支持的操作数types为/:'str'和'str'
为了简明起见,我会使用SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count") In [12]: c Out[12]: state office_id AZ 2 925105 4 592852 6 362198 CA 1 819164 3 743055 5 292885 CO 1 525994 3 338378 5 490335 WA 2 623380 4 441560 6 451428 Name: count, dtype: int64 In [13]: c / c.groupby(level=0).sum() Out[13]: state office_id AZ 2 0.492037 4 0.315321 6 0.192643 CA 1 0.441573 3 0.400546 5 0.157881 CO 1 0.388271 3 0.249779 5 0.361949 WA 2 0.411101 4 0.291196 6 0.297703 Name: count, dtype: float64
对于多个组,你必须使用transform(使用Radical的df ):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count") In [22]: c / c.groupby(level=[0, 1]).transform("sum") Out[22]: Group 1 Group 2 Final Group AAHQ BOSC OWON 0.331006 TLAM 0.668994 MQVF BWSI 0.288961 FXZM 0.711039 ODWV NFCH 0.262395 ... Name: count, dtype: float64
这似乎比其他答案稍微高性能(只是不到Radical答案速度的两倍,对我来说就是0.08秒)。