将逗号分隔的列拆分成单独的行
我有一个数据框,像这样:
data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
正如您所看到的, director
列中的一些条目是用逗号分隔的多个名称。 我想将这些条目分成不同的行,同时保持另一列的值。 例如,上面的数据框中的第一行应该被分成两行,在director
列中有一个名字,在AB
列中有'A'。
这个老问题经常被用作欺骗目标(用r-faq
标记)。 截至今天为止,已经有三次提供了6种不同的方法,但缺乏一个基准作为指导,其中的方法是最快的1 。
基准解决scheme包括
- Matthew Lundberg的基本R方法,但根据Rich Scriven的评论进行修改,
- Jaap的两个
data.table
方法和两个dplyr
/tidyr
方法, - Ananda的
splitstackshape
解决scheme , - 还有另外两个Jaap的
data.table
方法。
总的来说,使用microbenchmark
软件包,在6种不同大小的dataframe上对8种不同的方法进行了基准testing(见下面的代码)。
由OP给出的样本数据仅包含20行。 为了创build更大的dataframe,这20行简单地重复1,10,100,1000,10000和100000次,这给出高达2百万行的问题大小。
基准testing结果
基准testing结果显示,对于足够大的dataframe,所有data.table
方法都比其他任何方法都快。 对于超过5000行的dataframe,Jaap的data.table
方法2和变体DT3
是最快的,比最慢的方法更快。
值得注意的是,两种tidyverse
方法和splistackshape
解决scheme的时序非常相似,以至于很难对图表中的曲线进行splistackshape
。 它们是所有dataframe大小中基准testing方法中速度最慢的。
对于较小的dataframe,Matt的基本R解决scheme和data.table
方法4的开销似乎比其他方法less。
码
director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", "Anne Fontaine", "Anthony Harvey") AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B", "B", "B", "B", "A") library(data.table) library(magrittr)
为问题规模n
基准运行定义函数
run_mb <- function(n) { # compute number of benchmark runs depending on problem size `n` mb_times <- scales::squish(10000L / n , c(3L, 100L)) cat(n, " ", mb_times, "\n") # create data DF <- data.frame(director = rep(director, n), AB = rep(AB, n)) DT <- as.data.table(DF) # start benchmarks microbenchmark::microbenchmark( matt_mod = { s <- strsplit(as.character(DF$director), ',') data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))}, jaap_DT1 = { DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB ][!is.na(director)]}, jaap_DT2 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]}, jaap_dplyr = { DF %>% dplyr::mutate(director = strsplit(as.character(director), ",")) %>% tidyr::unnest(director)}, jaap_tidyr = { tidyr::separate_rows(DF, director, sep = ",")}, cSplit = { splitstackshape::cSplit(DF, "director", ",", direction = "long")}, DT3 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][, director := NULL][ , setnames(.SD, "V1", "director")]}, DT4 = { DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), by = .(AB)]}, times = mb_times ) }
针对不同的问题大小运行基准
# define vector of problem sizes n_rep <- 10L^(0:5) # run benchmark for different problem sizes mb <- lapply(n_rep, run_mb)
准备绘图数据
mbl <- rbindlist(mb, idcol = "N") mbl[, n_row := NROW(director) * n_rep[N]] mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)] mba[, expr := forcats::fct_reorder(expr, -median_time)]
创build图表
library(ggplot2) ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + geom_point() + geom_smooth(se = FALSE) + scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + xlab("number of rows") + ylab("median of execution time [ms]") + ggtitle("microbenchmark results") + theme_bw()
会话信息和包版本(摘录)
devtools::session_info() #Session info # version R version 3.3.2 (2016-10-31) # system x86_64, mingw32 #Packages # data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2) # dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1) # forcats 0.2.0 2017-01-23 CRAN (R 3.3.2) # ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2) # magrittr * 1.5 2014-11-22 CRAN (R 3.3.0) # microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3) # scales 0.4.1 2016-11-09 CRAN (R 3.3.2) # splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3) # tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)
1 我的好奇心被这个充满活力的评论 激怒了! 数量级更快! 对这个问题的一个重复的问题作了一个简单的回答。
几个select:
1)用data.table
两种方式:
library(data.table) # method 1 (preferred) setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB ][!is.na(director)] # method 2 setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director) ][,.(director = V1, AB)]
2) dplyr
/ tidyr
组合:或者,您也可以使用dplyr
/ tidyr
组合:
library(dplyr) library(tidyr) v %>% mutate(director = strsplit(as.character(director), ",")) %>% unnest(director)
3)只用tidyr
:用tidyr 0.5.0
(及以后),你也可以使用separate_rows
:
separate_rows(v, director, sep = ",")
4)与基地R:
# if 'director' is a character-column: stack(setNames(strsplit(df$director,','), df$AB)) # if 'director' is a factor-column: stack(setNames(strsplit(as.character(df$director),','), df$AB))
命名您的原始data.frame v
,我们有这样的:
> s <- strsplit(as.character(v$director), ',') > data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length))) director AB 1 Aaron Blaise A 2 Bob Walker A 3 Akira Kurosawa B 4 Alan J. Pakula A 5 Alan Parker A 6 Alejandro Amenabar B 7 Alejandro Gonzalez Inarritu B 8 Alejandro Gonzalez Inarritu B 9 Benicio Del Toro B 10 Alejandro González Iñárritu A 11 Alex Proyas B 12 Alexander Hall A 13 Alfonso Cuaron B 14 Alfred Hitchcock A 15 Anatole Litvak A 16 Andrew Adamson B 17 Marilyn Fox B 18 Andrew Dominik B 19 Andrew Stanton B 20 Andrew Stanton B 21 Lee Unkrich B 22 Angelina Jolie B 23 John Stevenson B 24 Anne Fontaine B 25 Anthony Harvey A
注意使用rep
来构build新的AB列。 这里, sapply
返回每个原始行中的名称数量。
在晚会之后,另一个广义的select是使用我的“splitstackshape”包中有一个direction
参数的cSplit。 将其设置为"long"
以获得您指定的结果:
library(splitstackshape) head(cSplit(mydf, "director", ",", direction = "long")) # director AB # 1: Aaron Blaise A # 2: Bob Walker A # 3: Akira Kurosawa B # 4: Alan J. Pakula A # 5: Alan Parker A # 6: Alejandro Amenabar B