All levels of a factor in a model matrix in R

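I have a data.frame consisting of numeric and factor variables, as shown below.

 testFrame <- data.frame(
   First  = sample(1:10, 20, replace = TRUE),
   Second = sample(1:20, 20, replace = TRUE),
   Third  = sample(1:10, 20, replace = TRUE),
   Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
   Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
   # needed on R >= 4.0.0, where strings are no longer converted to factors by default
   stringsAsFactors = TRUE
 )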

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

 model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

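As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create a dummy for every level of a factor?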

You need to reset the contrasts for the factor variables:

 model.matrix(~ Fourth + Fifth, data = testFrame,
   contrasts.arg = list(
     Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
     Fifth  = contrasts(testFrame$Fifth, contrasts = FALSE)
   ))

Or, with a little less typing and without the proper column names:

 model.matrix(~ Fourth + Fifth, data = testFrame,
   contrasts.arg = list(
     Fourth = diag(nlevels(testFrame$Fourth)),
     Fifth  = diag(nlevels(testFrame$Fifth))
   ))

(Trying to redeem myself...) In response to Jared's comment on @Fabians' answer about automating this, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrast matrix from it. So we can use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:

 > lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
 $Fourth
         Alice Bob Charlie David
 Alice       1   0       0     0
 Bob         0   1       0     0
 Charlie     0   0       1     0
 David       0   0       0     1

 $Fifth
         Edward Frank Georgia Hank Isaac
 Edward       1     0       0    0     0
 Frank        0     1       0    0     0
 Georgia      0     0       1    0     0
 Hank         0     0       0    1     0
 Isaac        0     0       0    0     1

Which slots nicely into @fabians' answer:

 model.matrix(~ ., data=testFrame, contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))

dummyVars from caret can also be used. http://caret.r-forge.r-project.org/preprocess.html

caret implements a nice function, dummyVars, that does this in two lines:

 library(caret)
 dmy <- dummyVars(" ~ .", data = testFrame)
 testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

 colnames(testFrame2)
 "First" "Second" "Third" "Fourth.Alice" "Fourth.Bob" "Fourth.Charlie" "Fourth.David"
 "Fifth.Edward" "Fifth.Frank" "Fifth.Georgia" "Fifth.Hank" "Fifth.Isaac"

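The nicest point here is that you get the original data frame back, with the dummy variables added and the original factor columns used for the transformation removed.

More info: http://amunategui.github.io/dummyVar-Walkthrough/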

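OK. This just reads the answers above and puts them together. Suppose you want the matrix, e.g. 'X.factors', that gets multiplied by your coefficient vector to give your linear predictor. There are still a couple of extra steps:

 X.factors = model.matrix(
   ~ ., data = X,
   contrasts.arg = lapply(
     data.frame(X[, sapply(data.frame(X), is.factor)]),
     contrasts, contrasts = FALSE
   )
 )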

(Note that you need to turn X[*] back into a data frame in case there is only one factor column.)

Then say you get something like this:

 attr(X.factors,"assign")
 [1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added

We want to get rid of the reference levels of each factor, marked with ** above:

 att = attr(X.factors,"assign")
 factor.columns = unique(att[duplicated(att)])
 unwanted.columns = match(factor.columns,att)
 X.factors = X.factors[,-unwanted.columns]
 X.factors = (data.matrix(X.factors))

Using the R package 'CatEncoders':

 library(CatEncoders)
 testFrame <- data.frame(
   First  = sample(1:10, 20, replace = TRUE),
   Second = sample(1:20, 20, replace = TRUE),
   Third  = sample(1:10, 20, replace = TRUE),
   Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
   Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4)
 )
 fit <- OneHotEncoder.fit(testFrame)
 z <- transform(fit, testFrame, sparse = TRUE)  # give the sparse output
 z <- transform(fit, testFrame, sparse = FALSE) # give the dense output

 model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

 model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward.
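One caveat is worth checking before relying on this. As far as I can tell, dropping the intercept makes model.matrix expand only the first factor in the formula to all of its levels, while later factors still lose a reference level. A minimal sketch on the question's testFrame (set.seed and stringsAsFactors = TRUE are my additions, not part of the original post):

```r
set.seed(42)  # assumption: fixed seed, only so the numeric columns are reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: make the factor columns explicit
)

# Intercept-free model matrix, as in the answer above
mm <- model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data = testFrame)

fourth_cols <- grep("^Fourth", colnames(mm), value = TRUE)
fifth_cols  <- grep("^Fifth",  colnames(mm), value = TRUE)

length(fourth_cols)  # 4: every level of Fourth gets a column
length(fifth_cols)   # 4, not 5: the first level of Fifth is still dropped
```

If every level of every factor is needed, the contrasts.arg approach from the earlier answers covers that case.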

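I am currently learning the Lasso model and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will kill our time, as the glmnet author suggests).

I am just sharing a tidy bit of code that gives the same result as @fabians' and @Gavin's answers. Meanwhile, @asdf123 also introduced another package, library('CatEncoders').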
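For the sparse route, here is a rough sketch of how the full-contrasts trick from the earlier answers might carry over to Matrix::sparse.model.matrix(), which also accepts a contrasts.arg argument. The names factor_cols, full_contrasts and X_sparse are mine, and set.seed and stringsAsFactors = TRUE are assumptions added for reproducibility:

```r
library(Matrix)  # recommended package that ships with R

set.seed(42)  # assumption: fixed seed for reproducible numeric columns
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: make the factor columns explicit
)

# Build a named list of full (identity) contrast matrices, one per factor column
factor_cols    <- Filter(is.factor, testFrame)
full_contrasts <- lapply(factor_cols, contrasts, contrasts = FALSE)

# Same formula as the dense version, but the result is stored sparsely
X_sparse <- sparse.model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)

dim(X_sparse)    # rows x columns of the design matrix
class(X_sparse)  # a sparse matrix class from the Matrix package
```

The result can be passed to glmnet as x. Note that the intercept column is still present here, so you may want to drop it first, since glmnet fits its own intercept by default.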

 > require('useful')
 > # always use all levels
 > build.x(First ~ Second + Fourth + Fifth, testFrame, contrasts = FALSE)
 >
 > # just use all levels for Fourth
 > build.x(First ~ Second + Fourth + Fifth, testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)
