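Remove all punctuation except apostrophes in R
I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.
Example: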
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))
Current output (the apostrophe is gone)
[1] "I like to chew gum but dont like bubble gum"
Desired output (I want the apostrophe to stay)
[1] "I like to chew gum but don't like bubble gum"
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)
[1] "I like to chew gum but don't like bubble gum"
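The regex above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (note the caret, which negates the class) with an empty string.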
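To make the whitelist idea concrete, here is a minimal sketch that wraps the same negated character class in a function and applies it to a character vector. The function name keep_apostrophes and the sample strings are made up for illustration; only the pattern comes from the answer.

```r
# Hypothetical helper (not from the answer): wraps the negated character class.
keep_apostrophes <- function(text) {
  # Delete every character that is NOT alphanumeric, whitespace, or an apostrophe.
  gsub("[^[:alnum:][:space:]']", "", text)
}

# gsub is vectorised, so a character vector works as well as a single string.
samples <- c("don't stop!", "rock-n-roll, baby", "100% sure?")
cleaned <- keep_apostrophes(samples)
print(cleaned)
# [1] "don't stop"     "rocknroll baby" "100 sure"
```

Because this is a whitelist, anything not named in the class is dropped, hyphens included. To keep another character, it would have to be added inside the brackets next to the apostrophe.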
Here is an example:
> gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
Mostly for variety, here is a solution using gsubfn() from the terrific package of the same name. In this application, I just like how nicely expressive the solution it allows is:
library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R", replacement = function(x) ifelse(x == "'", "'", ""), x)
[1] "I like to chew gum but don't like bubble gum"
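(The engine = "R" argument is needed here, because otherwise the default tcl engine is used, and its rules for matching regular expressions are slightly different: to process the string above with it, the pattern would instead need to be, for example, pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)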
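For comparison, the same "decide per match" idea can be sketched in base R with gregexpr() and the regmatches<- replacement function, which avoids the package dependency. This is an alternative technique, not what the gsubfn answer does internally, and the names keep_apostrophe and cleaned are made up for illustration.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Hypothetical helper: map each matched punctuation mark to its replacement,
# keeping apostrophes and turning everything else into an empty string.
keep_apostrophe <- function(p) ifelse(p == "'", "'", "")

# Locate every punctuation character, then overwrite the matches in a copy of x.
m <- gregexpr("[[:punct:]]", x)
cleaned <- x
regmatches(cleaned, m) <- lapply(regmatches(cleaned, m), keep_apostrophe)
print(cleaned)
```

The replacement function receives a vector of all matches in a string, so any vectorised rule can be used in place of the ifelse() call.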
You can use a double negation to exclude the apostrophe from the POSIX class punct:
[^'[:^punct:]]
Code:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)
#[1] "I like to chew gum but don't like bubble gum"
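The double negation reads more easily once it is spelled out: [:^punct:] means "not punctuation", so a negated class containing it matches only punctuation, minus whatever else is listed inside the brackets. The sketch below protects a second character, the underscore, in the same way. The function name strip_punct_except and the sample string are made up for illustration, and the negated POSIX class is a Perl-style extension, so perl = TRUE is assumed to be required.

```r
# Hypothetical helper: delete punctuation except the characters listed in the class.
strip_punct_except <- function(text) {
  # Matches any character that is not ', not _, and not a non-punctuation
  # character, i.e. every punctuation mark other than ' and _.
  gsub("[^'_[:^punct:]]", "", text, perl = TRUE)
}

result <- strip_punct_except("snake_case isn't bad, really!")
print(result)
# [1] "snake_case isn't bad really"
```

Unlike the alphanumeric whitelist, this blacklist leaves every non-punctuation character untouched and only spares the punctuation marks named in the class.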