Speeding up the performance of write.table
I have a data.frame that I want to write out. The dimensions of my data.frame are 256 rows by 65,536 columns. Are there faster alternatives to write.csv?
If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speedup. Alternatively, you can use write.matrix() from the package MASS, although it did not prove faster in this example. Maybe I didn't set something up correctly:
#Fake data
m <- matrix(runif(256*65536), nrow = 256)

#As a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
#   user  system elapsed
# 319.53   13.65  333.76

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
#   user  system elapsed
#  52.43    0.88   53.59

#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
#   user  system elapsed
# 113.58   59.12  172.75
EDIT
To address the concern raised that the results above are unfair to the data.frame, here are some more results and timings showing that the overall message is still "convert your data object to a matrix if possible; if you can't, deal with it, or reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":
#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))

#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
#   user  system elapsed
# 317.85   13.95  332.44

#This even includes the overhead in converting to as.matrix in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
#   user  system elapsed
#  53.67    0.92   54.67
So, nothing changes. To confirm this is reasonable, consider the relative time cost of as.data.frame():
m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
#   user  system elapsed
#   0.77    0.00    0.77
So, not a big deal, and the message is not being distorted as the comment below implies. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the manual under Note:
write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead.
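In other words, the matrix trick only helps when no column would be coerced by the conversion. As a quick illustration (the write_fast_csv helper below is hypothetical, not part of the original answer), here is a minimal base R sketch that checks the column classes before deciding which path to take:

# Hypothetical helper: use the fast matrix path only when every column
# already shares a single class, otherwise fall back to the regular path.
write_fast_csv <- function(df, file) {
  classes <- vapply(df, function(col) class(col)[1], character(1))
  if (length(unique(classes)) == 1L) {
    write.csv(as.matrix(df), file)  # single-class data: no coercion happens here
  } else {
    write.csv(df, file)             # mixed classes: as.matrix would coerce everything
  }
}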
Finally, if you're still losing sleep over saving things faster, consider moving to a native RData object:
system.time(save(m2, file = "thisisfast.RData"))
#   user  system elapsed
#  21.67    0.12   21.81
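For completeness, reading such an object back in is just a call to base R's load(), which restores m2 into the workspace under its original name (the original answer gives no timing for this step):

system.time(load("thisisfast.RData"))  # restores m2 into the current environment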
data.table::fwrite() was contributed by Otto Seiskari and is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelization) and wrote an article about it. Please report any issues on the tracker.
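For reference, the basic call is straightforward. A minimal usage sketch (assuming data.table >= 1.9.8 is installed; the file names are illustrative):

library(data.table)
DT <- as.data.table(matrix(runif(256*65536), nrow = 256))
fwrite(DT, "fwrite.csv")                             # defaults: sep = ",", quote = "auto"
fwrite(DT, "fwrite.tsv", sep = "\t", quote = FALSE)  # tab-separated, unquoted variant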
First, here's a comparison on the same dimensions used by @chase above (i.e., a very large number of columns: 65,000 columns (!) x 256 rows), together with fwrite and write_feather, so that we have some consistency across machines. Note the huge difference that compress=FALSE makes in base R:
# -----------------------------------------------------------------------------
# function  | object type | output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save      | matrix      | binary      | FALSE     | 0.3s    | 134MB     |
# save      | data.frame  | binary      | FALSE     | 0.4s    | 135MB     |
# feather   | data.frame  | binary      | FALSE     | 0.4s    | 139MB     |
# fwrite    | data.table  | csv         | FALSE     | 1.0s    | 302MB     |
# save      | matrix      | binary      | TRUE      | 17.9s   |  89MB     |
# save      | data.frame  | binary      | TRUE      | 18.1s   |  89MB     |
# write.csv | matrix      | csv         | FALSE     | 21.7s   | 302MB     |
# write.csv | data.frame  | csv         | FALSE     | 121.3s  | 302MB     |
Note that fwrite() runs in parallel. The timings shown here are on a 13-inch MacBook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), a 512GB SSD, 256KB/core L2 cache and 4MB L4 cache. Depending on your system specs, YMMV.
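If you want to control how much parallelism fwrite() uses (for instance to leave cores free for other work), here is a hedged sketch, assuming a data.table version recent enough to expose thread control via setDTthreads()/getDTthreads() and the nThread argument of fwrite():

library(data.table)
getDTthreads()                                   # threads data.table will use by default
setDTthreads(2)                                  # cap the number of threads globally
fwrite(DT, "out.csv", nThread = getDTthreads())  # DT is the table being written; nThread defaults to getDTthreads()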
I have also re-run the benchmarks on data that is more realistic (and larger):
library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
  str1 = sample(sprintf("%010d",1:NN)), #ID field 1
  str2 = sample(sprintf("%09d",1:NN)),  #ID field 2
  # varying length string field--think names/addresses, etc.
  str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
  # factor-like string field with 50 "levels"
  str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
  # factor-like string field with 17 levels, varying length
  str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),NN,T),
  # lognormally distributed numeric
  num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
  # 3 binary strings
  str6 = sample(c("Y","N"),NN,T),
  str7 = sample(c("M","F"),NN,T),
  str8 = sample(c("B","W"),NN,T),
  # right-skewed (integer type)
  int1 = as.integer(ceiling(rexp(NN))),
  num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
  # lognormal numeric that can be positive or negative
  num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))

# -------------------------------------------------------------------------------
# function  | object     | out | other args        | Runtime | File size |
# -------------------------------------------------------------------------------
# fwrite    | data.table | csv | quote = FALSE     | 1.7s    | 523.2MB   |
# fwrite    | data.frame | csv | quote = FALSE     | 1.7s    | 523.2MB   |
# feather   | data.frame | bin | no compression    | 3.3s    | 635.3MB   |
# save      | data.frame | bin | compress = FALSE  | 12.0s   | 795.3MB   |
# write.csv | data.frame | csv | row.names = FALSE | 28.7s   | 493.7MB   |
# save      | data.frame | bin | compress = TRUE   | 48.1s   | 190.3MB   |
# -------------------------------------------------------------------------------
So fwrite is about 2x faster than feather in this test. This was run on the same machine mentioned above, with fwrite running in parallel on 2 cores.
feather also seems to be quite a fast binary format, but it has no compression yet.
Here is an attempt at showing how fwrite compares at scale:
NB: the benchmark has been updated by running base R's save() with compress = FALSE (since feather also isn't compressed).
So fwrite is the fastest of all of them on this data (while running on 2 cores), plus it creates a .csv which can easily be viewed, checked, and passed to grep, sed, etc.
Code for reproduction:
require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))

DTn <- function(nn)
  data.table(
    str1 = sample(sprintf("%010d",1:nn)),
    str2 = sample(sprintf("%09d",1:nn)),
    str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
    str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
    str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
    num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
    str6 = sample(c("Y","N"),nn,T),
    str7 = sample(c("M","F"),nn,T),
    str8 = sample(c("B","W"),nn,T),
    int1 = as.integer(ceiling(rexp(nn))),
    num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
    num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

count <- data.table(n = ns,
                    c = c(rep(1000, 12), rep(100, 6), rep(10, 7)))

mbs <- lapply(ns, function(nn){
  print(nn)
  set.seed(51423)
  DT <- DTn(nn)
  microbenchmark(times = count[n==nn,c],
                 write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
                 save=save(DT, file = "save.RData", compress=FALSE),
                 fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
                 feather=write_feather(DT, "feather.feather"))})

png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
          y <- summary(x)[,"median"]
          y/y[3]})),
        main = "Relative Speed of fwrite (turbo) vs. rest",
        xlab = "", ylab = "Time Relative to fwrite (turbo)",
        type = "l", lty = 1, lwd = 2,
        col = c("red", "blue", "black", "magenta"), xaxt = "n",
        ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3,
       legend = c("write.csv", "save", "feather"),
       col = c("red", "blue", "magenta"))
dev.off()
Another option is to use the feather file format.
df <- as.data.frame(matrix(runif(256*65536), nrow = 256))
system.time(feather::write_feather(df, "df.feather"))
#>   user  system elapsed
#>  0.237   0.355   0.617
Feather is a binary file format designed to be very efficient to read and write. It is designed to be used from multiple languages: there are currently R and Python clients, and a Julia client is in the works.
For comparison, here's how long saveRDS takes:
system.time(saveRDS(df, "df.rds"))
#>   user  system elapsed
#> 17.363   0.307  17.856
Now, this is a somewhat unfair comparison, because the default for saveRDS is to compress the data, and here the data is incompressible because it is completely random. Turning compression off makes saveRDS significantly faster:
system.time(saveRDS(df, "df.rds", compress = FALSE))
#>   user  system elapsed
#>  0.181   0.247   0.473
And indeed it is now slightly faster than feather. So why use feather? Well, it is typically faster than readRDS(), and you usually write the data relatively few times compared to the number of times that you read it.
system.time(readRDS("df.rds"))
#>   user  system elapsed
#>  0.198   0.090   0.287

system.time(feather::read_feather("df.feather"))
#>   user  system elapsed
#>  0.125   0.060   0.185
You could also try the 'readr' package's read_rds (compare data.table::fread) and write_rds (compare data.table::fwrite).
Here is a simple example with my dataset (1133 rows and 429499 columns):
Write the dataset:
fwrite(rankp2,file="rankp2_429499.txt",col.names=T,row.names=F,quote = F,sep="\t")
write_rds(rankp2,"rankp2_429499.rds")
Read the dataset (1133 rows and 429499 columns):
system.time(fread("rankp2_429499.txt", sep="\t", header=T, fill=TRUE))
   user  system elapsed
 42.391   0.526  42.949
system.time(read_rds("rankp2_429499.rds"))
   user  system elapsed
  2.157   0.388   2.547
Hope it helps.