Why is as.Date slow on a character vector?

I have started using the data.table package in R to improve the performance of my code. I am using the following code:

    library(data.table)

    sp500 <- read.csv('../rawdata/GMTSP.csv')
    days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

    # Using data.table to get the things much much faster
    sp500 <- data.table(sp500, key="Date")
    sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
    sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=TRUE)]
    sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
    sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]

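I noticed that the conversion done by as.Date is very slow compared with the other steps, such as the one that creates the weekdays. Why is that? Is there a better or faster way to convert to the Date format? (If you ask whether I really need the Date format: probably yes, because I then use ggplot2 to make plots, which works like a charm with this type of data.)

To be more precise: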

    > system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
       user  system elapsed
     92.603   0.289  93.014
    > system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
       user  system elapsed
      1.938   0.062   2.001
    > system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
       user  system elapsed
      0.304   0.001   0.305

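This is on a MacBook Air i5 with slightly fewer than 3,000,000 observations.

Thanks

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.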

    > as.Date
    function (x, ...)
    UseMethod("as.Date")
    <bytecode: 0x2cf4b20>
    <environment: namespace:base>
    > methods(as.Date)
    [1] as.Date.character as.Date.date      as.Date.dates     as.Date.default
    [5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct
    [9] as.Date.POSIXlt
       Non-visible functions are asterisked
    > as.Date.character
    function (x, format = "", ...)
    {
        charToDate <- function(x) {
            xx <- x[1L]
            if (is.na(xx)) {
                j <- 1L
                while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
                if (is.na(xx))
                    f <- "%Y-%m-%d"
            }
            if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", tz = "GMT")) ||
                !is.na(strptime(xx, f <- "%Y/%m/%d", tz = "GMT")))
                return(strptime(x, f))
            stop("character string is not in a standard unambiguous format")
        }
        res <- if (missing(format))
            charToDate(x)
        else strptime(x, format, tz = "GMT")     #### slow part, I think ####
        as.Date(res)
    }
    <bytecode: 0x2cf6da0>
    <environment: namespace:base>
    >

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:

    > as.POSIXct
    function (x, tz = "", ...)
    UseMethod("as.POSIXct")
    <bytecode: 0x2936de8>
    <environment: namespace:base>
    > methods(as.POSIXct)
    [1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
    [5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
       Non-visible functions are asterisked
    > as.POSIXlt.Date
    function (x, ...)
    {
        y <- .Internal(Date2POSIXlt(x))
        names(y$year) <- names(x)
        y
    }
    <bytecode: 0x395e328>
    <environment: namespace:base>
    >

Intrigued, let's dig into Date2POSIXlt. For this we need to grep src/main to find out which .c file to look at.

    ~/R/Rtrunk/src/main$ grep Date2POSIXlt *
    names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,      11,     1,      {PP_FUNCALL, PREC_FN,   0}},
    $

Now we know we need to look for D2POSIXlt:

    ~/R/Rtrunk/src/main$ grep D2POSIXlt *
    datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
    names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,      11,     1,      {PP_FUNCALL, PREC_FN,   0}},
    $

Oh, we could have guessed datetime.c. Anyway, looking at the latest live copy:

datetime.c

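Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see that POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes per date!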
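To see this layout for yourself, a small base R sketch can unpack a POSIXlt value and compare its memory footprint with the plain Date it came from. The variable names are illustrative, and newer R versions may carry extra components (such as a time zone name and offset) beyond the fields described above, so exact sizes will vary.

```r
# A Date is a single double per element: days since 1970-01-01.
d <- as.Date("2012-01-01") + 0:99999
stopifnot(is.double(unclass(d)))

# POSIXlt is a list of per-field vectors (sec, min, hour, mday, ...).
lt <- as.POSIXlt(d)
fields <- unclass(lt)
print(names(fields))
print(sapply(fields, typeof))

# Rough memory comparison; exact numbers depend on the R version.
size_date <- as.numeric(object.size(d))
size_lt <- as.numeric(object.size(lt))
cat("Date:", size_date, "bytes; POSIXlt:", size_lt, "bytes\n")
```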

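So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.

Here's a reproducible example using the number of items stated in the question (3,000,000):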

    > Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
    > Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
    > system.time(as.Date(Date, "%m/%d/%Y"))
       user  system elapsed
     21.681   0.060  21.760
    > system.time(strptime(Date, "%m/%d/%Y"))
       user  system elapsed
     29.594   8.633  38.270
    > system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
       user  system elapsed
     19.785   0.000  19.802

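Passing tz appears to speed up strptime, which is what as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see whether it takes 90 seconds on your machine?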
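One way to avoid strptime and POSIXlt entirely is to cut the fields out of the string and compute the day count with integer arithmetic. The sketch below is not from the thread: it uses the well-known days-from-civil calculation, the helper name `mdy_to_date` is hypothetical, and it assumes every string is zero-padded, fixed-width "mm/dd/yyyy" with no missing or malformed values.

```r
# Convert "mm/dd/yyyy" strings to Date without going through strptime/POSIXlt.
# Assumes fixed-width, zero-padded input (an assumption, not checked here).
mdy_to_date <- function(x) {
  m <- as.integer(substr(x, 1, 2))
  d <- as.integer(substr(x, 4, 5))
  y <- as.integer(substr(x, 7, 10))
  y <- y - (m <= 2L)                         # treat Jan/Feb as the end of the previous year
  era <- y %/% 400L                          # 400-year Gregorian cycle
  yoe <- y - era * 400L                      # year within the cycle, 0..399
  mp <- (m + 9L) %% 12L                      # March = 0, ..., February = 11
  doy <- (153L * mp + 2L) %/% 5L + d - 1L    # day of the March-based year
  doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy
  # 719468 shifts the count so that 1970-01-01 is day 0
  structure(as.numeric(era * 146097L + doe - 719468L), class = "Date")
}

# Compare against as.Date on every day in the example's range.
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date <- format(Range, "%m/%d/%Y")
stopifnot(all(mdy_to_date(Date) == as.Date(Date, "%m/%d/%Y")))
```

Timing it with system.time on your own data is the only way to know whether the gain is worth the lost input validation.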

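As others mentioned, strptime (converting from character to POSIXlt) is the bottleneck here. Another simple solution is to use the lubridate package and its fast_strptime function instead.

Here's what it looks like on my data: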

    > tables()
         NAME      NROW  MB COLS
    [1,] pp   3,718,339 126 session_id,date,user_id,path,num_sessions
         KEY
    [1,] user_id,date
    Total: 126MB
    > pp[, 2, with = F]
                   date
          1: 2013-09-25
          2: 2013-09-25
          3: 2013-09-25
          4: 2013-09-25
          5: 2013-09-25
         ---
    3718335: 2013-09-25
    3718336: 2013-09-25
    3718337: 2013-09-25
    3718338: 2013-10-11
    3718339: 2013-10-11
    > system.time(pp[, date := as.Date(fast_strptime(date, "%Y-%m-%d"))])
       user  system elapsed
      0.315   0.026   0.344

Compared to:

    > system.time(pp[, date := as.Date(date, "%Y-%m-%d")])
       user  system elapsed
    108.193   0.399 108.844

That's about 316 times faster!

Thanks for the suggestions. I solved it by writing the Gaussian weekday algorithm myself and got far better results, see below.

    library(data.table)

    getWeekDay <- function(year, month, day) {
      # Implementation of the Gaussian algorithm to get weekday 0 - Sunday, ... , 6 - Saturday
      Y <- year
      Y[month<3] <- (Y[month<3] - 1)   # January and February count as part of the previous year
      d <- day
      m <- ((month + 9)%%12) + 1       # March = 1, ..., February = 12
      c <- floor(Y/100)                # century
      y <- Y - c*100                   # year within the century
      dayofweek <- (d + floor(2.6*m - 0.2) + y + floor(y/4) + floor(c/4) - 2*c) %% 7
      return(dayofweek)
    }

    sp500 <- read.csv('../rawdata/GMTSP.csv')
    days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")

    # Using data.table to get the things much much faster
    sp500 <- data.table(sp500, key="Date")
    sp500 <- sp500[,Month:=as.integer(substr(Date,1,2))]
    sp500 <- sp500[,Day:=as.integer(substr(Date,4,5))]
    sp500 <- sp500[,Year:=as.integer(substr(Date,7,10))]
    #sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
    #sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
    # Explicit levels keep the code-to-name mapping right even if some weekday never occurs
    sp500 <- sp500[,Weekday:=factor(getWeekDay(Year, Month, Day), levels=0:6, labels=days)]

Running the whole block above (including reading the data from the csv) gives the timing below. data.table is really impressive.

       user  system elapsed
     19.074   0.803  20.284

The conversion itself took 3.49 seconds elapsed.
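As a sanity check on the weekday formula, the sketch below restates the function with the year-within-century step written out as `Y - c*100` (my reading of that line) and compares it against base R's `as.POSIXlt(...)$wday`, which also uses 0 for Sunday. The date range is an arbitrary choice.

```r
# Gauss-style weekday: 0 = Sunday, ..., 6 = Saturday (vectorised).
getWeekDay <- function(year, month, day) {
  Y <- year
  Y[month < 3] <- Y[month < 3] - 1      # Jan/Feb count as months of the previous year
  m <- ((month + 9) %% 12) + 1          # March = 1, ..., February = 12
  c <- floor(Y / 100)                   # century
  y <- Y - c * 100                      # year within the century
  (day + floor(2.6 * m - 0.2) + y + floor(y / 4) + floor(c / 4) - 2 * c) %% 7
}

# Compare with base R on every day in an arbitrary 31-year range.
dates <- seq(as.Date("1990-01-01"), as.Date("2020-12-31"), by = "day")
lt <- as.POSIXlt(dates)
gauss <- getWeekDay(lt$year + 1900, lt$mon + 1, lt$mday)
stopifnot(all(gauss == lt$wday))
```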

This is an old question, but I think this small trick might be useful. If you have multiple rows with the same date, you can do

data[, date := as.Date(date[1]), by = date]

It's much faster because it only processes each date once (on my dataset of 40 million rows it goes from 25 seconds to 0.5 seconds).
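The same idea works without data.table: parse each distinct string once and map the results back with `match`. The sketch below is a base R variant of the trick, with made-up sample data and a hypothetical helper name `as_date_unique`; how much it helps depends on how many repeats the column has.

```r
# Parse each distinct date string once, then expand back to full length.
as_date_unique <- function(x, format) {
  u <- unique(x)
  as.Date(u, format = format)[match(x, u)]
}

# Made-up data: many rows, comparatively few distinct dates.
set.seed(1)
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date <- format(sample(Range, 200000, replace = TRUE), "%m/%d/%Y")

fast <- as_date_unique(Date, "%m/%d/%Y")
slow <- as.Date(Date, "%m/%d/%Y")
stopifnot(all(fast == slow))
```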

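I originally thought: "The argument to as.Date above does not have a format specified."

I now think: I had assumed that the Date value you were keying on was in a standard format. I guess not. So you are doing two things. You are converting from character to Date format, and you are re-sorting based on the new values, which have a completely different collation sequence.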
