Removing punctuation from Unicode-formatted strings

I have a function that removes punctuation from a list of strings:

    import re

    def strip_punctuation(input):
        x = 0
        for word in input:
            input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
            x += 1
        return input

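I recently changed my script to use Unicode strings so that I can handle non-Western characters. This function breaks when it meets those characters and returns empty Unicode strings. How can I reliably remove punctuation from Unicode-formatted strings?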
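To make the symptom concrete, here is a minimal Python 3 sketch of what the character class in the question does to non-ASCII input. The sample strings are made up for illustration; the point is only that `[^A-Za-z0-9 ]` deletes every character outside ASCII letters, digits and the space, not just punctuation.

```python
import re

# The character class from the question: anything that is not an ASCII
# letter, an ASCII digit or a space is removed, so non-ASCII letters go too.
ASCII_ONLY = re.compile(r"[^A-Za-z0-9 ]")

# Illustrative inputs (not from the original question).
samples = ["hello, world!", "héllo, мир!", "日本語。"]
stripped = [ASCII_ONLY.sub("", s) for s in samples]

print(stripped)  # ['hello world', 'hllo ', '']
```

The last input comes back as an empty string, which matches the behaviour described above.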

You can use the unicode.translate() method:

    import unicodedata
    import sys

    # Python 2: map every punctuation code point (category 'P*') to None
    tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                        if unicodedata.category(unichr(i)).startswith('P'))

    def remove_punctuation(text):
        return text.translate(tbl)

You can also use r'\p{P}', which is supported by the regex module:

    import regex as re

    # the ur"" prefix is Python 2 syntax; it is not valid in Python 3
    def remove_punctuation(text):
        return re.sub(ur"\p{P}+", "", text)
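The regex module is a third-party package, and the standard library `re` does not understand `\p{P}`. If installing it is not an option, one possible workaround is to build the equivalent character class by hand from `unicodedata`. This is only a sketch for Python 3; the names `PUNCT_RE` and `remove_punctuation_stdlib` are made up for this example.

```python
import re
import sys
import unicodedata

# Collect every character whose general category starts with 'P'.
_punct_chars = "".join(
    chr(i)
    for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith("P")
)

# re.escape makes characters such as ']' and '-' safe inside the class.
PUNCT_RE = re.compile("[" + re.escape(_punct_chars) + "]+")


def remove_punctuation_stdlib(text):
    """Hypothetical helper: delete runs of Unicode punctuation."""
    return PUNCT_RE.sub("", text)


print(remove_punctuation_stdlib("Привет, мир! «quoted» text…"))
# Привет мир quoted text
```

Building the pattern scans the whole code point range once, so it is worth doing at import time rather than on every call.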

If you want to use J.F. Sebastian's solution in Python 3:

    import unicodedata
    import sys

    # Python 3: range/chr replace xrange/unichr
    tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

    def remove_punctuation(text):
        return text.translate(tbl)
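For what it's worth, here is how the Python 3 table might be applied to a list of strings like the one in the question. The table is rebuilt so the snippet runs on its own, and the sample words are invented for illustration.

```python
import sys
import unicodedata

# Same table as above: each punctuation code point maps to None,
# which str.translate interprets as "delete this character".
tbl = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith("P")
)

# Illustrative inputs (not from the original thread).
words = ["naïve,", "¿qué?", "日本語。", "plain"]
cleaned = [w.translate(tbl) for w in words]

print(cleaned)  # ['naïve', 'qué', '日本語', 'plain']
```

Because the table is built once and `translate` does the per-character work in C, this tends to suit cases where many strings are processed with the same table.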

A shorter version based on Daenyth's answer:

    import unicodedata

    def strip_punctuation(text):
        """
        >>> strip_punctuation(u'something')
        u'something'

        >>> strip_punctuation(u'something.,:else really')
        u'somethingelse really'
        """
        punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
        return ''.join(x for x in text
                       if unicodedata.category(x) not in punctuation_cats)

    input_data = [u'somehting', u'something, else', u'nothing.']
    without_punctuation = map(strip_punctuation, input_data)

You can iterate through the string and use the unicodedata module's category function to determine whether each character is punctuation.

For the possible outputs of category, see unicode.org's documentation on General Category Values.

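    from unicodedata import category as cat

    def strip_punctuation(word):
        # keep only the characters that are NOT punctuation
        return "".join(char for char in word if not cat(char).startswith('P'))

    # `input` is the list of strings from the question
    filtered = [strip_punctuation(word) for word in input]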
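As a quick illustration of what `category` returns, the following Python 3 sketch prints the general category of a few hand-picked characters. The sample list is an assumption chosen to cover each of the punctuation categories plus two non-punctuation cases.

```python
import unicodedata

# A few sample characters and the two-letter general category of each.
samples = [",", "-", "(", ")", "_", "«", "。", "$", "a"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
# ',' Po   '-' Pd   '(' Ps   ')' Pe   '_' Pc   '«' Pi   '。' Po   '$' Sc   'a' Ll
```

Note that `$` is a currency symbol (`Sc`), not punctuation, so a filter based on categories starting with `P` leaves it in place. If symbols should go as well, the `S` categories would need to be checked too.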

Also, make sure you are handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
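One common way to keep encodings and types straight is to decode bytes once on input, work with text internally, and encode once on output. The Python 3 sketch below assumes the data is UTF-8 (an assumption; use whatever encoding your input actually has), and the byte string is a stand-in for data read from a file or socket.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose general category starts with 'P'."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


raw = "¡Hola, señor!".encode("utf-8")  # bytes, standing in for file/socket data
text = raw.decode("utf-8")             # decode once, at the input boundary
cleaned = strip_punctuation(text)      # work on str (Unicode) internally
out = cleaned.encode("utf-8")          # encode once, at the output boundary

print(cleaned)  # Hola señor
```

Running the punctuation filter on the undecoded bytes instead would operate on individual byte values rather than characters, which is one way to end up with mangled or empty results.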
