How can I remove non-ASCII characters but leave periods and spaces using Python?

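I'm working with a .txt file. I want a string of the file's text with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

    def onlyascii(char):
        if ord(char) < 48 or ord(char) > 127:
            return ''
        else:
            return char

    def get_my_string(file_path):
        f = open(file_path, 'r')
        data = f.read()
        f.close()
        filtered_data = filter(onlyascii, data)
        filtered_data = filtered_data.lower()
        return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated, but I can't figure it out.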

You can use string.printable to filter out every character in the string that is not printable, like this:

    >>> s = "some\x00string. with\x15 funny characters"
    >>> import string
    >>> printable = set(string.printable)
    >>> filter(lambda x: x in printable, s)
    'somestring. with funny characters'

string.printable on my machine contains:

 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
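The session above relies on Python 2, where filter() applied to a str returns a str. A minimal sketch of the same idea for Python 3 follows; the names PRINTABLE and keep_printable are illustrative and do not come from the original answer.

```python
import string

# Build the set once; membership tests on a set are fast.
PRINTABLE = set(string.printable)


def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    # In Python 3, filter() returns an iterator rather than a str,
    # so the kept characters are joined back together explicitly.
    return "".join(ch for ch in text if ch in PRINTABLE)


s = "some\x00string. with\x15 funny characters"
print(keep_printable(s))  # somestring. with funny characters
```

Spaces and periods are both in string.printable, so they survive this filter.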

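An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore every symbol that is not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d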

Edit:

Python 3: str -> bytes -> str

    >>> "Hej då".encode("ascii", errors="ignore").decode()
    'Hej d'

Python 2: unicode -> str -> unicode

    >>> u"hej då".encode("ascii", errors="ignore").decode()
    u'hej d'

Python 2: str -> unicode -> str (decode and encode in the reverse order)

    >>> "hej d\xe5".decode("ascii", errors="ignore").encode()
    'hej d'
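As a rough sketch of how the Python 3 round trip could be wired into the question's get_my_string, assuming the file is UTF-8 encoded (the question does not say which encoding is used, so the encoding parameter here is an assumption, as is the helper name strip_non_ascii):

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def get_my_string(file_path, encoding="utf-8"):
    """Read a text file, drop non-ASCII characters, and lowercase the rest.

    The default encoding is an assumption; pass the file's real encoding.
    """
    with open(file_path, "r", encoding=encoding) as f:
        data = f.read()
    return strip_non_ascii(data).lower()


sample = "Hej d\u00e5. Caf\u00e9 au lait."
print(strip_non_ascii(sample))  # Hej d. Caf au lait.
```

Spaces, periods, and newlines are plain ASCII, so this approach leaves them untouched.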

According to @artfulrobot, this should be faster than filter and lambda:

    import re
    re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

For more examples, see: Replace non-ASCII characters with a single space
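A small sketch of the regular-expression approach with a precompiled pattern, covering both removal and replacement by a single space. The function names and the sample string are made up for illustration.

```python
import re

# Matches any single character outside the ASCII range 0x00-0x7f.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def remove_non_ascii(text):
    """Delete every non-ASCII character."""
    return NON_ASCII.sub("", text)


def replace_non_ascii(text, replacement=" "):
    """Substitute every non-ASCII character with the replacement string."""
    return NON_ASCII.sub(replacement, text)


sample = "Hej d\u00e5. Na\u00efve caf\u00e9."
print(remove_non_ascii(sample))   # Hej d. Nave caf.
print(replace_non_ascii(sample))  # Hej d . Na ve caf .
```

Compiling the pattern once avoids looking it up again on every call, which may matter when many strings are processed.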

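Your question is ambiguous: the first two sentences taken together imply that you believe spaces and "periods" are non-ASCII characters. That is incorrect. All characters such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters: !"#$%&'()*+,-./ but includes several others, such as []{}.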
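To see exactly which printable ASCII characters the function from the question drops, one could run a quick check like the sketch below. The onlyascii predicate is copied from the question; the variable names are illustrative.

```python
def onlyascii(char):
    # Predicate from the question, logic unchanged.
    if ord(char) < 48 or ord(char) > 127:
        return ''
    return char


# Printable ASCII runs from space (32) through tilde (126).
printable_ascii = [chr(code) for code in range(32, 127)]

# Characters the predicate rejects (it returns '' for them).
dropped = "".join(ch for ch in printable_ascii if not onlyascii(ch))

# Non-alphanumeric characters the predicate still lets through.
kept_punct = "".join(
    ch for ch in printable_ascii if onlyascii(ch) and not ch.isalnum()
)

print(repr(dropped))
print(repr(kept_punct))
```

The space (32) and the period (46) both fall below the cutoff of 48, which is why they are stripped.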

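Please step back, think a bit, and edit your question to tell us what you are trying to do without mentioning the word ASCII, and why you think that characters such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") implies that you don't care about newlines in your data. If your file contains two lines like this:

    this is line 1
    this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?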

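A better solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        # Keep newlines and the printable ASCII range (space through tilde).
        return char == '\n' or 32 <= ord(char) <= 126

    # and later (Python 2, where filter() on a str returns a str):
    filtered_data = filter(filter_func, data).lower()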
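The two points above can be sketched for Python 3 as follows. The names is_wanted and clean are hypothetical; the predicate itself is the one suggested in the answer.

```python
def is_wanted(char):
    """Return True for newlines and printable ASCII (space through tilde)."""
    return char == "\n" or 32 <= ord(char) <= 126


def clean(data):
    """Keep only wanted characters, then lowercase the result."""
    # filter() yields the characters for which is_wanted() is truthy;
    # in Python 3 they have to be joined back into a str.
    return "".join(filter(is_wanted, data)).lower()


text = "This is line 1\nTh\u00efs is l\x00ine 2.\n"
print(repr(clean(text)))  # 'this is line 1\nths is line 2.\n'
```

Because the predicate returns a boolean rather than the character or an empty string, it reads as a plain yes/no test and keeps the line breaks intact.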

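If you want the printable ASCII characters, you should probably correct your code to:

    if ord(char) < 32 or ord(char) > 126:
        return ''

This is equivalent to string.printable except that it leaves out the tab and line-break characters ('\t', '\n', '\x0b', '\x0c' and '\r'), but it does not correspond to the range in your question.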
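The relationship between the 32..126 range and string.printable can be checked directly. This is only a quick verification sketch; the variable names are illustrative.

```python
import string

# Characters accepted by the corrected range check (32 through 126 inclusive).
in_range = {chr(code) for code in range(32, 127)}

# Characters that string.printable has but the range check rejects.
extra = set(string.printable) - in_range

print(sorted(extra))  # ['\t', '\n', '\x0b', '\x0c', '\r']
```

So the range check keeps spaces and periods, while tabs and line breaks need to be allowed separately if they matter.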

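My approach comes from working through Fluent Python (Ramalho), which I highly recommend. These list-comprehension one-liners are inspired by Chapter 2:

    # Keep every ASCII character (code points 0 through 127).
    onlyascii = ''.join([s for s in data if ord(s) < 128])
    # Keep only characters found in an explicit allow-list.
    onlymatch = ''.join([s for s in data if s in 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])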
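The allow-list variant extends naturally to what the question asks for: letters and digits plus the space and the period. The following is a sketch under that reading of the question; ALLOWED and keep_allowed are made-up names.

```python
import string

# ASCII letters and digits, plus the two characters the question wants kept.
ALLOWED = set(string.ascii_letters + string.digits + " .")


def keep_allowed(text):
    """Return only the characters of text that appear in ALLOWED."""
    return "".join(ch for ch in text if ch in ALLOWED)


print(keep_allowed("Hej d\u00e5, v\u00e4rlden. 123!"))  # Hej d vrlden. 123
```

An explicit allow-list makes the kept characters visible at a glance, at the cost of also dropping other punctuation such as commas.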
