用dplyr拟合几个回归模型
我想使用dplyr为每个小时(因子variables)拟合一个模型,我得到一个错误,并且我不太确定有什么问题。
df.h <- data.frame( hour = factor(rep(1:24, each = 21)), price = runif(504, min = -10, max = 125), wind = runif(504, min = 0, max = 2500), temp = runif(504, min = - 10, max = 25) ) df.h <- tbl_df(df.h) df.h <- group_by(df.h, hour) group_size(df.h) # checks out, 21 obs. for each factor variable # different attempts: reg.models <- do(df.h, formula = price ~ wind + temp) reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))
我尝试了各种变化,但我无法得到它的工作。
大约在2015年5月,最简单的方法就是使用broom
。 broom
包含三个函数,它们处理来自统计操作的复杂的返回对象: tidy
(处理来自统计操作的系数向量), glance
(处理来自统计操作的汇总统计),以及augment
(处理观察水平来自统计操作的结果)。
下面是一个示例,它用于将各种线性回归的结果提取到整齐的data_frame
。
-
tidy
:library(dplyr) library(broom) df.h = data.frame( hour = factor(rep(1:24, each = 21)), price = runif(504, min = -10, max = 125), wind = runif(504, min = 0, max = 2500), temp = runif(504, min = - 10, max = 25) ) dfHour = df.h %>% group_by(hour) %>% do(fitHour = lm(price ~ wind + temp, data = .)) # get the coefficients by group in a tidy data_frame dfHourCoef = tidy(dfHour, fitHour) dfHourCoef
这使,
Source: local data frame [72 x 6] Groups: hour hour term estimate std.error statistic p.value 1 1 (Intercept) 53.336069324 21.33190104 2.5002961 0.022294293 2 1 wind -0.008475175 0.01338668 -0.6331053 0.534626575 3 1 temp 1.180019541 0.79178607 1.4903262 0.153453756 4 2 (Intercept) 77.737788772 23.52048754 3.3051096 0.003936651 5 2 wind -0.008437212 0.01432521 -0.5889765 0.563196358 6 2 temp -0.731265113 1.00109489 -0.7304653 0.474506855 7 3 (Intercept) 38.292039924 17.55361626 2.1814331 0.042655670 8 3 wind 0.005422492 0.01407478 0.3852630 0.704557388 9 3 temp 0.426765270 0.83672863 0.5100402 0.616220435 10 4 (Intercept) 30.603119492 21.05059583 1.4537888 0.163219027 .. ... ... ... ... ... ...
-
augment
:# get the predictions by group in a tidy data_frame dfHourPred = augment(dfHour, fitHour) dfHourPred
这使,
Source: local data frame [504 x 11] Groups: hour hour price wind temp .fitted .se.fit .resid .hat .sigma .cooksd .std.resid 1 1 83.8414055 67.3780 -6.199231 45.44982 22.42649 38.391590 0.27955950 42.24400 0.1470891067 1.0663820 2 1 0.3061628 2073.7540 15.134085 53.61916 14.10041 -53.312993 0.11051343 41.43590 0.0735584714 -1.3327207 3 1 80.3790032 520.5949 24.711938 78.08451 20.03558 2.294497 0.22312869 43.64059 0.0003606305 0.0613746 4 1 121.9023855 1618.0864 12.382588 54.23420 10.31293 67.668187 0.05911743 40.23212 0.0566557575 1.6447224 5 1 -0.4039594 1542.8150 -5.544927 33.71732 14.53349 -34.121278 0.11740628 42.74697 0.0325125137 -0.8562896 6 1 29.8269832 396.6951 6.134694 57.21307 16.04995 -27.386085 0.14318542 43.05124 0.0271028701 -0.6975290 7 1 -7.1865483 2009.9552 -5.657871 29.62495 16.93769 -36.811497 0.15946292 42.54487 0.0566686969 -0.9466312 8 1 -7.8548693 2447.7092 22.043029 58.60251 19.94686 -66.457379 0.22115706 39.63999 0.2983443034 -1.7753911 9 1 94.8736726 1525.3144 24.484066 69.30044 15.93352 25.573234 0.14111563 43.12898 0.0231796755 0.6505701 10 1 54.4643001 2473.2234 -7.656520 23.34022 21.83043 31.124076 0.26489650 42.74790 0.0879837510 0.8558507 .. ... ... ... ... ... ... ... ... ... ... ...
-
glance
:# get the summary statistics by group in a tidy data_frame dfHourSumm = glance(dfHour, fitHour) dfHourSumm
这使,
Source: local data frame [24 x 12] Groups: hour hour r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual 1 1 0.12364561 0.02627290 42.41546 1.2698179 0.30487225 3 -106.8769 221.7538 225.9319 32383.29 18 2 2 0.03506944 -0.07214506 36.79189 0.3270961 0.72521125 3 -103.8900 215.7799 219.9580 24365.58 18 3 3 0.02805424 -0.07993974 39.33621 0.2597760 0.77406651 3 -105.2942 218.5884 222.7665 27852.07 18 4 4 0.17640603 0.08489559 41.37115 1.9277147 0.17434859 3 -106.3534 220.7068 224.8849 30808.30 18 5 5 0.12575453 0.02861615 42.27865 1.2945915 0.29833246 3 -106.8091 221.6181 225.7962 32174.72 18 6 6 0.08114417 -0.02095092 35.80062 0.7947901 0.46690268 3 -103.3164 214.6328 218.8109 23070.31 18 7 7 0.21339168 0.12599076 32.77309 2.4415266 0.11529934 3 -101.4609 210.9218 215.0999 19333.36 18 8 8 0.21655629 0.12950699 40.92788 2.4877430 0.11119114 3 -106.1272 220.2543 224.4324 30151.65 18 9 9 0.23388711 0.14876346 35.48431 2.7476160 0.09091487 3 -103.1300 214.2601 218.4381 22664.45 18 10 10 0.18326177 0.09251307 40.77241 2.0194425 0.16171339 3 -106.0472 220.0945 224.2726 29923.01 18 .. ... ... ... ... ... ... .. ... ... ... ... ...
在dplyr 0.4中,你可以这样做:
df.h %>% do(model = lm(price ~ wind + temp, data = .))
从do
的文档:
.f
:适用于每件作品的function。 提供给.f的第一个未命名参数将是一个数据框。
所以:
reg.models <- do(df.h, .f=function(data){ lm(price ~ wind + temp, data=data) })
也许有用也保存模型适合哪个小时:
reg.models <- do(df.h, .f=function(data){ m <- lm(price ~ wind + temp, data=data) m$hour <- unique(data$hour) m })
我想你可以用更合适的方式使用dplyr
,而不需要像@fabians anwser那样定义函数。
results<-df.h %.% group_by(hour) %.% do(failwith(NULL, lm), formula = price ~ wind + temp)
要么
results<-do(group_by(tbl_df(df.h), hour), failwith(NULL, lm), formula = price ~ wind + temp)
编辑:当然它也没有failwith
results<-df.h %.% group_by(hour) %.% do(lm, formula = price ~ wind + temp) results<-do(group_by(tbl_df(df.h), hour), lm, formula = price ~ wind + temp)