How to sum a variable by group?
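Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", and so on. The second contains numbers representing how many times I saw the corresponding category.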
For example:
    Category  Frequency
    First     10
    First     15
    First     5
    Second    2
    Third     14
    Third     20
    Second    3
I want to sort the data by Category and sum the frequencies:
    Category  Frequency
    First     30
    Second    5
    Third     34
How would I do this in R?
Using aggregate:
    aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
      Category  x
    1    First 30
    2   Second  5
    3    Third 34
(Incorporating @thelatemail's comment) aggregate also has a formula interface:
    aggregate(Frequency ~ Category, x, sum)
Or, if you want to aggregate multiple columns, you can use the . notation (it works for a single column too):
    aggregate(. ~ Category, x, sum)
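To make the effect of the . notation concrete, here is a minimal sketch with two numeric columns. The `Weight` column and its values are an assumption added purely for illustration; they are not part of the question's data.

```r
# Sample data from the question plus a hypothetical second numeric column.
# `Weight` is an illustrative assumption, not part of the original data.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)
)

# `.` on the left-hand side stands for every column not named on the
# right-hand side, so Frequency and Weight are both summed per Category.
totals <- aggregate(. ~ Category, data = x, FUN = sum)
print(totals)
```

The result should have one row per category and one summed column for each of `Frequency` and `Weight`.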
Or tapply:
    tapply(x$Frequency, x$Category, FUN=sum)
     First Second  Third
        30      5     34
Using this data:
    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))
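Since tapply returns a named array rather than a data frame, a small conversion step may be handy. The sketch below rebuilds a two-column data frame from the tapply result and checks it against aggregate; the names `sums`, `by_tapply`, and `by_aggregate` are illustrative only.

```r
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# tapply() returns a named one-dimensional array of group sums.
sums <- tapply(x$Frequency, x$Category, FUN = sum)

# Rebuild a data frame from the group names and the summed values.
by_tapply <- data.frame(Category = names(sums), Frequency = as.vector(sums))

# The formula interface of aggregate() gives the same totals directly.
by_aggregate <- aggregate(Frequency ~ Category, data = x, FUN = sum)

print(by_tapply)
print(by_aggregate)
```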
More recently, you can also use the dplyr package for this purpose:
    library(dplyr)
    x %>%
      group_by(Category) %>%
      summarise(Frequency = sum(Frequency))
    #Source: local data frame [3 x 2]
    #
    #  Category Frequency
    #1    First        30
    #2   Second         5
    #3    Third        34
Or, for multiple summary columns (it works with a single column too):
    x %>%
      group_by(Category) %>%
      summarise_each(funs(sum))
Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
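As a rough illustration of the replacement functions, the sketch below uses summarise_all and, as an alternative, across(). The across() form assumes dplyr >= 1.0.0, which is newer than the version this answer was written for; the result names are illustrative.

```r
library(dplyr)

x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# summarise_all() applies the function to every non-grouping column.
res_all <- x %>%
  group_by(Category) %>%
  summarise_all(sum)

# Assumption: dplyr >= 1.0.0, where across() expresses the same idea.
res_across <- x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))

print(res_all)
print(res_across)
```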
Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas:
    mtcars %>%
      group_by(cyl, gear) %>%                            # multiple group columns
      summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns
For more information, including the %>% operator, see the introduction to dplyr.
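For comparison, the two-key grouping shown with group_by above can also be sketched in base R with the formula interface of aggregate. This is only one possible way to do it; the names `max_hp`, `mean_mpg`, and `summary_tbl` are illustrative.

```r
# Maximum horsepower for each (cyl, gear) combination in the built-in mtcars data.
max_hp <- aggregate(hp ~ cyl + gear, data = mtcars, FUN = max)

# Mean miles per gallon for the same groups.
mean_mpg <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# Join the two summaries on the grouping columns and give the
# summary columns descriptive names.
summary_tbl <- merge(max_hp, mean_mpg, by = c("cyl", "gear"))
names(summary_tbl) <- c("cyl", "gear", "max_hp", "mean_mpg")
print(summary_tbl)
```

Because each aggregate call applies a single function, a different function per column needs one call per summary, followed by a merge.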
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need better performance, there is a faster alternative:
    library(data.table)
    data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    data[, sum(Frequency), by = Category]
    #    Category V1
    # 1:    First 30
    # 2:   Second  5
    # 3:    Third 34
    system.time(data[, sum(Frequency), by = Category] )
    # user    system   elapsed
    # 0.008     0.001     0.009
Let's compare that with the same operation using a data.frame and aggregate, as above:
    data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
    # user    system   elapsed
    # 0.008     0.000     0.015
And if you want to keep the column name, this is the syntax:
    data[,list(Frequency=sum(Frequency)),by=Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34
The difference becomes more noticeable with larger datasets, as the code below demonstrates:
    data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( data[,sum(Frequency),by=Category] )
    # user    system   elapsed
    # 0.055     0.004     0.059
    data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
    # user    system   elapsed
    # 0.287     0.010     0.296
For multiple aggregations, you can combine lapply and .SD as follows:
    data[, lapply(.SD, sum), by = Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34
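To show the lapply and .SD combination on more than one column, here is a minimal sketch. The `Weight` column is a hypothetical addition for illustration, and .SDcols is used to state explicitly which columns are aggregated.

```r
library(data.table)

# Question data plus a hypothetical `Weight` column (an assumption
# added only to illustrate aggregating several columns at once).
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)
)

# .SD holds the columns named in .SDcols for the current group;
# lapply() sums each of them, producing one row per Category.
totals <- dt[, lapply(.SD, sum), by = Category,
             .SDcols = c("Frequency", "Weight")]
print(totals)
```

Leaving out .SDcols would aggregate every non-grouping column, which fails if any of them is non-numeric.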
This is somewhat related to this question.
You can also use the by() function:
    x2 <- by(x$Frequency, x$Category, sum)
    do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.
    library(plyr)
    ddply(tbl, .(Category), summarise, sum = sum(Frequency))
Just to add a third option:
    require(doBy)
    summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.
If x is a data frame containing your data, then the following will do what you want:
    require(reshape)
    recast(x, Category ~ ., fun.aggregate=sum)
Several years later, just to add another simple base R solution that for some reason isn't present here, xtabs:
    xtabs(Frequency ~ Category, df)
    # Category
    #  First Second  Third
    #     30      5     34
Or, if you want a data.frame back:
    as.data.frame(xtabs(Frequency ~ Category, df))
    #   Category Freq
    # 1    First   30
    # 2   Second    5
    # 3    Third   34
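The snippets above refer to a data frame called df that is not defined in this answer. A self-contained sketch using the question's sample data might look like this; it also uses the responseName argument of as.data.frame so that the summed column is called Frequency instead of the default Freq.

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# xtabs() sums the left-hand side within each level of Category.
tab <- xtabs(Frequency ~ Category, data = x)

# Convert the table to a data frame, naming the value column explicitly.
totals <- as.data.frame(tab, responseName = "Frequency")
print(totals)
```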
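While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.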
Here is an example of how this question can be answered with sqldf:
    library(sqldf)
    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))
    sqldf("select
              Category
              ,sum(Frequency) as Frequency
           from x
           group by
              Category")
    ##   Category Frequency
    ## 1    First        30
    ## 2   Second         5
    ## 3    Third        34