Replace non-ASCII characters with a single space

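I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: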

    def remove_non_ascii_1(text):
        return ''.join(i for i in text if ord(i) < 128)

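And this one replaces each non-ASCII character with as many spaces as there are bytes in its encoding (for example, the – character is replaced with 3 spaces):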

    import re

    def remove_non_ascii_2(text):
        return re.sub(r'[^\x00-\x7F]', ' ', text)

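How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none covers all non-ASCII characters rather than one specific character.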

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

 return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one, so each replaced character still gets its own space.

Your regular expression should instead replace each run of consecutive non-ASCII characters with a single space:

 re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
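
To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same input. It assumes Python 3, where str holds code points; the function names and the sample string are made up for illustration.

```python
import re

def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    """Replace every run of consecutive non-ASCII characters with one space."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC马克def–ghi'
print(repr(replace_each(sample)))  # 'ABC  def ghi' (two spaces for the two CJK characters)
print(repr(replace_runs(sample)))  # 'ABC def ghi'  (the run collapses to one space)
```

Which one is wanted depends on whether column positions should be preserved (one space per character) or whether the text should simply stay readable (one space per run).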

To get the closest ASCII representation of your original string, I recommend unidecode:

    from unidecode import unidecode

    def remove_non_ascii(text):
        # Python 2: decode the UTF-8 byte string to unicode before transliterating.
        return unidecode(unicode(text, encoding="utf-8"))

Then you can use it on a string:

    remove_non_ascii("Ceñía")
    Cenia

For character processing, use Unicode strings:

    PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
    >>> s='ABC马克def'
    >>> import re
    >>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
    'ABC  def'
    >>> b = s.encode('utf8')
    >>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
    b'ABC      def'

But note that you will still have a problem if your string contains decomposed Unicode characters (for example, a base character followed by a separate combining accent mark):

    >>> s = 'mañana'
    >>> len(s)
    6
    >>> import unicodedata as ud
    >>> n=ud.normalize('NFD',s)
    >>> n
    'mañana'
    >>> len(n)
    7
    >>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
    'ma ana'
    >>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
    'man ana'

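One possible way around the decomposed-character problem is to treat a base character plus its combining marks as a single unit. The sketch below is not part of the original answer: the function name is hypothetical, and it only recognizes marks with a non-zero canonical combining class (as reported by unicodedata.combining), so other kinds of marks are still replaced one by one.

```python
import unicodedata as ud

def replace_non_ascii_clusters(text):
    """Replace each non-ASCII character, or ASCII base plus combining marks,
    with a single space."""
    out = []
    for ch in text:
        if ud.combining(ch) and out:
            # The mark modifies the previous character: blank out the whole unit.
            out[-1] = ' '
        elif ord(ch) < 128:
            out.append(ch)
        else:
            out.append(' ')
    return ''.join(out)

decomposed = ud.normalize('NFD', 'mañana')      # 'n' followed by U+0303
print(repr(replace_non_ascii_clusters(decomposed)))  # 'ma ana'
```

With this approach the composed and decomposed spellings of the same word give the same result.
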
What about this one?

    def replace_trash(unicode_string):
        chars = []
        for ch in unicode_string:
            try:
                ch.encode("ascii")
                chars.append(ch)
            except UnicodeEncodeError:
                # Non-ASCII: replace it with a single space.
                chars.append(" ")
        return "".join(chars)

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

    """Test the performance of different non-ASCII replacement methods."""

    import re
    from timeit import timeit

    # 10_000 is typical in the project that I'm working on and most of the text
    # is going to be non-ASCII.
    text = 'Æ' * 10_000

    print(timeit(
        """
    result = ''.join([c if ord(c) < 128 else '?' for c in text])
        """,
        number=1000,
        globals=globals(),
    ))

    print(timeit(
        """
    result = text.encode('ascii', 'replace').decode()
        """,
        number=1000,
        globals=globals(),
    ))

Results:

    0.7208260721400134
    0.009975979187503592
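
If a space is required rather than '?', the same encode/decode idea can be kept by registering a custom codec error handler. This is only a sketch: the handler name 'space_replace' and the helper functions are made up for illustration, and it has not been benchmarked against the timings above.

```python
import codecs

def _space_replace(exc):
    """Error handler: substitute one space per unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start and exc.end delimit the unencodable slice of the input.
        return (' ' * (exc.end - exc.start), exc.end)
    raise exc

# 'space_replace' is a hypothetical name chosen for this example.
codecs.register_error('space_replace', _space_replace)

def replace_non_ascii(text):
    return text.encode('ascii', 'space_replace').decode('ascii')

print(repr(replace_non_ascii('ABC马克def')))  # 'ABC  def'
```

The handler has to be registered once per process before the first call to encode() that uses it.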
