Remove all punctuation except apostrophes in R

I'd like to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.

Example:

 x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
 gsub("[[:punct:]]", "", as.character(x))

Current output (the apostrophe is removed):

 [1] "I like to chew gum but dont like bubble gum" 

Desired output (I want the apostrophe to stay):

 [1] "I like to chew gum but don't like bubble gum" 
 x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
 gsub("[^[:alnum:][:space:]']", "", x)
 [1] "I like to chew gum but don't like bubble gum"

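The regex above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe with an empty string (note the caret, ^, which negates the character class).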
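
To make the negated class a bit more concrete, here is a minimal sketch that wraps it in a function and applies it to a character vector. The function name strip_punct_keep_apostrophe and the sample strings are made up for illustration; they are not part of the original answer.

```r
# Hypothetical helper (name chosen for illustration only).
strip_punct_keep_apostrophe <- function(text) {
  # Delete every character that is NOT alphanumeric, whitespace, or an apostrophe.
  gsub("[^[:alnum:][:space:]']", "", text)
}

# gsub() is vectorised, so the helper works on a whole character vector.
samples <- c("don't stop!", "rock & roll, y'all...", "no-punct here")
strip_punct_keep_apostrophe(samples)
# [1] "don't stop"       "rock  roll y'all" "nopunct here"
```

Note the double space in the second result: a punctuation mark that sat between two spaces is deleted, but both spaces remain. If that matters, a follow-up gsub("[[:space:]]+", " ", ...) would collapse them.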

Here is an example:

 > gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
 [1] "I like to chew gum but don't like bubble gum"

Mostly for variety, here is a solution using gsubfn() from the terrific package of the same name. In this application, I just like how nicely expressive the solution it allows is:

 library(gsubfn)
 gsubfn(pattern = "[[:punct:]]", engine = "R",
        replacement = function(x) ifelse(x == "'", "'", ""),
        x)
 [1] "I like to chew gum but don't like bubble gum"

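(The option engine = "R" is needed here; otherwise the default tcl engine is used, and its rules for matching regular expressions are slightly different. To process the string above with it, for instance, one would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)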

You can exclude the apostrophe from the POSIX class punct by using a double negative:

 [^'[:^punct:]] 

Code:

 x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
 gsub("[^'[:^punct:]]", "", x, perl = TRUE)
 #[1] "I like to chew gum but don't like bubble gum"

ideone demo
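
The double negative can be read as set logic: the class matches any character that is neither an apostrophe nor a non-punctuation character, which leaves punctuation other than the apostrophe. As a rough check, this sketch compares it with the [^[:alnum:][:space:]'] pattern from the earlier answer on the question's sample string. The variable names a and b are only for illustration, and the comparison says nothing about inputs beyond this one string.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Double negative: the negated POSIX class [:^punct:] is PCRE syntax,
# so perl = TRUE is required.
a <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)

# Whitelist approach: keep alphanumerics, whitespace, and the apostrophe.
b <- gsub("[^[:alnum:][:space:]']", "", x)

identical(a, b)
# [1] TRUE
```

The two patterns are not equivalent in general: the double negative deletes only punctuation, while the whitelist pattern deletes anything outside its three allowed groups, such as control characters.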