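Select first and last row from grouped data

Question

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data and example

Given a data frame:

    df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                     stopId = c("a","b","c","a","b","c","a","b","c"),
                     stopSequence = c(1,2,3,3,1,4,3,1,2))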

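I can get the top and bottom observations of each group using slice, but only with two separate statements:

    library(dplyr)

    # first stop of each id, ordered by stopSequence
    firstStop <- df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      slice(1) %>%
      ungroup()

    # last stop of each id, ordered by stopSequence
    lastStop <- df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      slice(n()) %>%
      ungroup()

Can I combine these two statements into one that selects both the top and bottom observations?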

There is probably a faster way:

    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      filter(row_number() == 1 | row_number() == n())

Just for completeness: you can pass slice a vector of indices:

    df %>%
      arrange(stopSequence) %>%
      group_by(id) %>%
      slice(c(1, n()))

which gives:

      id stopId stopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      b            1
    6  3      a            3
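One thing to watch with the index-vector form: if a group has only one row, 1 and n() are the same index, so that row comes back twice. The sketch below uses a made-up data frame (df_single is hypothetical, not from the question) and wraps the indices in unique() as one possible way around it.

```r
library(dplyr)

# Hypothetical data: id 2 has a single row
df_single <- data.frame(
  id = c(1, 1, 2),
  stopId = c("a", "b", "a"),
  stopSequence = c(2, 1, 5)
)

# c(1, n()) is c(1, 1) for a one-row group, so that row is returned twice
dup <- df_single %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()

# unique() collapses the repeated index before slicing
dedup <- df_single %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()
```

The filter(row_number() == 1 | row_number() == n()) form does not have this issue, since a row either passes the condition or it does not.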

Not dplyr, but it's much more direct using data.table:

    library(data.table)
    setDT(df)
    df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
    #    id stopId stopSequence
    # 1:  1      a            1
    # 2:  1      c            3
    # 3:  2      b            1
    # 4:  2      c            4
    # 5:  3      b            1
    # 6:  3      a            3

A more detailed explanation:

    # 1) get row numbers of first/last observations from each group
    #    * basically, we sort the table by id/stopSequence, then,
    #      grouping by id, name the row numbers of the first/last
    #      observations for each id; since this operation produces
    #      a data.table, we then pull the row numbers out of it as a column
    #    * .I is data.table shorthand for the row number
    #    * here, to be maximally explicit, I've named the variable V1
    #      as row_num to give other readers of my code a clearer
    #      understanding of what operation is producing what variable
    first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
    idx = first_last$row_num

    # 2) extract rows by number
    df[idx]

Be sure to check out the Getting Started wiki for the data.table basics.

Something like:

    library(dplyr)

    df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                     stopId = c("a","b","c","a","b","c","a","b","c"),
                     stopSequence = c(1,2,3,3,1,4,3,1,2))

    # stack the first and last row of a data frame
    first_last <- function(x) {
      bind_rows(slice(x, 1), slice(x, n()))
    }

    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      do(first_last(.)) %>%
      ungroup()

    ## Source: local data frame [6 x 3]
    ##
    ##   id stopId stopSequence
    ## 1  1      a            1
    ## 2  1      c            3
    ## 3  2      b            1
    ## 4  2      c            4
    ## 5  3      b            1
    ## 6  3      a            3

With do you can perform pretty much any operation on each group, but @jeremycg's answer is more appropriate for this particular task.

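I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

Base package:

    # sort so the first/last row of each id is its lowest/highest stopSequence
    df <- df[with(df, order(id, stopSequence, stopId)), ]
    # first row per id, merged with last row per id
    merge(df[!duplicated(df$id), ],
          df[!duplicated(df$id, fromLast = TRUE), ],
          all = TRUE)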
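To see why the base approach works, here is a minimal sketch on a made-up id vector (ids, first_rows and last_rows are illustrative names, not part of the answer). duplicated() flags every occurrence after the first, so negating it keeps the first row of each id; with fromLast = TRUE it scans from the end, so negating it keeps the last row.

```r
# Hypothetical, already-sorted id column
ids <- c(1, 1, 1, 2, 2, 2)

# TRUE only at the first occurrence of each id
first_rows <- !duplicated(ids)

# TRUE only at the last occurrence of each id
last_rows <- !duplicated(ids, fromLast = TRUE)

which(first_rows)  # positions of the first row per id
which(last_rows)   # positions of the last row per id
```

This only picks the right rows if the data is sorted within each id beforehand, which is what the order() call in the answer takes care of.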

data.table:

    library(data.table)
    df <- setDT(df)
    # sort, then take the first and last row of each id's subset (.SD)
    df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

sqldf:

    library(sqldf)
    # By default sqldf runs on SQLite, which takes the bare stopId column
    # from the row that holds the min/max of stopSequence
    min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
                  FROM df GROUP BY id
                  ORDER BY id, StopSequence, stopId")
    max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
                  FROM df GROUP BY id
                  ORDER BY id, StopSequence, stopId")
    sqldf("SELECT * FROM min
           UNION
           SELECT * FROM max")

In one query:

    sqldf("SELECT * FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
                          FROM df GROUP BY id
                          ORDER BY id, StopSequence, stopId)
           UNION
           SELECT * FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
                          FROM df GROUP BY id
                          ORDER BY id, StopSequence, stopId)")

Output:

      id stopId StopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      a            3
    6  3      b            1
