What is the best way to remove accents from a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found an elegant way to do this in Java on the web:

convert the Unicode string to its long normalized form (with separate characters for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
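To see what the first step produces, here is a minimal sketch using only the standard `unicodedata` module. The helper name `describe_decomposition` is made up for illustration; it lists the code points that result from NFD normalization together with their Unicode names and general categories.

```python
import unicodedata

def describe_decomposition(text):
    """Return (character, Unicode name, general category) for each code point of text after NFD."""
    decomposed = unicodedata.normalize('NFD', text)
    return [(ch, unicodedata.name(ch, 'UNNAMED'), unicodedata.category(ch))
            for ch in decomposed]

# '\u00e9' is 'é' stored as a single precomposed code point.
for ch, name, category in describe_decomposition('\u00e9'):
    print(hex(ord(ch)), name, category)
```

The accent ends up as its own code point, which is what makes the second step (filtering out the diacritics) possible.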

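Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.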

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ASCII text.

Example:

    accented_string = u'Málaga'
    # accented_string is of type 'unicode'
    import unidecode
    unaccented_string = unidecode.unidecode(accented_string)
    # unaccented_string contains 'Malaga' and is of type 'str'

How about this:

    import unicodedata

    def strip_accents(s):
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

    >>> strip_accents(u"A \u00c0 \u0394 \u038E")
    u'A A \u0394 \u03a5'
    >>>

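The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer. (I didn't think of unicodedata.combining, but it is probably the better solution, because it is more explicit.)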
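As a rough comparison of the two tests, the sketch below applies both filters to the same NFD-normalized text. The function names `strip_by_category` and `strip_by_combining` are illustrative only. The two filters agree on common Latin accents, but that is an observation about this sample, not a guarantee for every script.

```python
import unicodedata

def strip_by_category(s):
    """Drop code points whose general category is Mn (Nonspacing_Mark)."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def strip_by_combining(s):
    """Drop code points whose canonical combining class is nonzero."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

acute = '\u0301'  # COMBINING ACUTE ACCENT
print(unicodedata.category(acute))   # general category of the accent
print(unicodedata.combining(acute))  # canonical combining class (an integer)

sample = 'M\u00e1laga, Fran\u00e7oise, no\u00ebl'
print(strip_by_category(sample) == strip_by_combining(sample))
```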

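And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts etc. are not "decoration".

I just found this answer on the web: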

    import unicodedata

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        only_ascii = nfkd_form.encode('ASCII', 'ignore')
        return only_ascii

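It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping all non-ASCII characters, because that fails for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as diacritics.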
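To make the Greek failure concrete, here is a small sketch of the ASCII-only approach applied to a French and a Greek word. The name `remove_accents_ascii` is illustrative, and the sample words are my own choice.

```python
import unicodedata

def remove_accents_ascii(input_str):
    """NFKD-normalize, then silently drop every non-ASCII code point."""
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return nfkd_form.encode('ASCII', 'ignore')

french = 'Fran\u00e7oise'       # 'Françoise'
greek = '\u038f\u03c1\u03b1'    # Greek word with an accented first letter

print(remove_accents_ascii(french))  # base letters survive because they are ASCII
print(remove_accents_ascii(greek))   # base letters are non-ASCII, so they are dropped too
```

The Greek base letters are lost along with the accent, because none of them is representable in ASCII.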

Edit: this does the trick:

    import unicodedata

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

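unicodedata.combining(c) returns the canonical combining class of c as an integer. The value is nonzero (and therefore truthy) if the character c can be combined with the preceding character, which mainly means that it is a diacritic.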
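One thing worth knowing about NFKD, as opposed to the NFD form used in some of the other answers, is that it also expands compatibility characters such as ligatures and superscripts. The sketch below compares the two forms on a few sample characters chosen for illustration; the helper `code_points` is hypothetical.

```python
import unicodedata

def code_points(text, form):
    """Return the code points of text after normalizing it to the given form."""
    return [hex(ord(c)) for c in unicodedata.normalize(form, text)]

# 'fi' ligature, superscript two, precomposed 'é'
samples = ['\ufb01', '\u00b2', '\u00e9']
for ch in samples:
    print(unicodedata.name(ch), code_points(ch, 'NFD'), code_points(ch, 'NFKD'))
```

Whether rewriting the ligature and the superscript is desirable depends on the use case; for search keys it is often fine, for display text it may not be.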

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

    encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
    byte_string = b"caf\xc3\xa9"  # UTF-8 bytes of "café" (Python 3 bytes literals must be ASCII); or simply "café" before python 3.
    unicode_string = byte_string.decode(encoding)

I'm surprised no one has made this simple suggestion:

    from unidecode import unidecode

    s = "Montréal, über, 12.89, Mère, Françoise, noël, 889"
    # s.encode("ascii")  # doesn't work - traceback
    t = unidecode(s)
    t.encode("ascii")  # works fine, because all non-ASCII from s are replaced with their equivalents
    print(t)  # gives: 'Montreal, uber, 12.89, Mere, Francoise, noel, 889'

The unidecode library can be installed from PyPI (package name Unidecode).

This handles not only accents, but also "strokes" (as in ø etc.):

    import unicodedata as ud

    def rmdiacritics(char):
        '''
        Return the base character of char, by "removing" any
        diacritics like accents or curls and strokes and the like.
        '''
        desc = ud.name(char)  # on Python 2, pass a unicode object: ud.name(unicode(char))
        cutoff = desc.find(' WITH ')
        if cutoff != -1:
            desc = desc[:cutoff]
        return ud.lookup(desc)

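This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant at all.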

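There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode names do not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping to achieve dictionary sort order.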
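A possible way to apply the name-based trick to a whole string on Python 3 is sketched below. The function `rmdiacritics_str` is my own illustrative wrapper, not part of the answer above; it leaves a character unchanged when it has no Unicode name or when no character carries the shortened name.

```python
import unicodedata as ud

def rmdiacritics_str(text):
    """Apply the ' WITH ' name trick to every character of text (Python 3)."""
    out = []
    for ch in text:
        name = ud.name(ch, '')  # '' for characters without a name, e.g. '\n'
        base_name, sep, _ = name.partition(' WITH ')
        if sep:
            try:
                ch = ud.lookup(base_name)
            except KeyError:    # no character carries the shortened name
                pass
        out.append(ch)
    return ''.join(out)

print(rmdiacritics_str('\u0141\u00f3d\u017a'))  # Polish city name with a stroked L
print(rmdiacritics_str('\u00f8'))               # 'ø', which NFD does not decompose
print(rmdiacritics_str('\u0250'))               # a turned letter: name has no ' WITH '
```

The stroked letters are the interesting case: they have no canonical decomposition, so the normalization-based answers leave them untouched, while the name lookup maps them to their base letter.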

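Actually, I work on a project that is compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function, which works wonders.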

    import re
    import unicodedata

    def strip_accents(text):
        """
        Strip accents from input String.

        :param text: The input string.
        :type text: String.

        :returns: The processed String.
        :rtype: String.
        """
        try:
            text = unicode(text, 'utf-8')
        except (TypeError, NameError):
            # NameError: unicode is a default on python 3
            # TypeError: text is already a unicode object on python 2
            pass
        text = unicodedata.normalize('NFD', text)
        text = text.encode('ascii', 'ignore')
        text = text.decode("utf-8")
        return str(text)

    def text_to_id(text):
        """
        Convert input text to id.

        :param text: The input string.
        :type text: String.

        :returns: The processed String.
        :rtype: String.
        """
        text = strip_accents(text.lower())
        text = re.sub('[ ]+', '_', text)
        text = re.sub('[^0-9a-zA-Z_-]', '', text)
        return text

Result:

    >>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
    'montreal_uber_1289_mere_francoise_noel_889'

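In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and that also had some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this: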

    Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (I found them in a python ticket), and I also incorporated @Jabba's comment:

    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")
    import csv
    import unicodedata

    def remove_accents(input_str):
        nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
        return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

    with open('test.txt') as f:
        read = csv.reader(f)
        for row in read:
            for element in row:
                print remove_accents(element)

Result:

    Montreal
    uber
    12.89
    Mere
    Francoise
    noel
    889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
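For comparison, a rough Python 3 sketch of the same idea follows. It is not the code from the answer above: `sys.setdefaultencoding` no longer exists in Python 3, so the encoding is given when the file is opened instead. The helper `read_unaccented` and the use of `skipinitialspace=True` (to drop the space after each comma) are my own assumptions, and an in-memory `io.StringIO` stands in for test.txt to keep the example self-contained.

```python
import csv
import io
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

def read_unaccented(fileobj):
    """Return every CSV field read from fileobj, with accents removed."""
    reader = csv.reader(fileobj, skipinitialspace=True)
    return [remove_accents(element) for row in reader for element in row]

# With a real file: open('test.txt', encoding='utf-8', newline='')
sample = io.StringIO('Montr\u00e9al, \u00fcber, 12.89, M\u00e8re, Fran\u00e7oise, no\u00ebl, 889\n')
for element in read_unaccented(sample):
    print(element)
```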

    import unicodedata

    s = 'Émission'
    search_string = ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

For Python 3.X

    print(search_string)

For Python 2.X

    print search_string

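In some languages, combining diacritics are part of the letters of the alphabet, while accent diacritics only mark stress.

I think it is safer to specify explicitly which diacritics you want to strip: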

    import unicodedata

    def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
        accents = set(map(unicodedata.lookup, accents))
        chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
        return unicodedata.normalize('NFC', ''.join(chars))
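A short usage sketch of this selective approach follows. The function is repeated so the example runs on its own, and the sample strings are mine. With the default set the German umlaut is kept while the acute accent is removed; passing a narrower set keeps the Spanish tilde as well.

```python
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT',
                                   'COMBINING GRAVE ACCENT',
                                   'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    # Recompose whatever marks were kept, so that 'u' + diaeresis becomes one code point again.
    return unicodedata.normalize('NFC', ''.join(chars))

print(strip_accents('\u00fcber caf\u00e9'))  # umlaut is not in the set, so it stays
print(strip_accents('ma\u00f1ana est\u00e1', accents=('COMBINING ACUTE ACCENT',)))  # tilde stays
```

The final NFC step matters here: without it, the kept diacritics would remain as separate combining code points.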
