UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

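I'm having problems dealing with Unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it fails by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

    agent_telno = agent.find('div', 'agent_contact_number')
    agent_telno = '' if agent_telno is None else agent_telno.contents[0]
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on some strings when the snippet above is run:

    Traceback (most recent call last):
      File "foobar.py", line 792, in <module>
        p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can consistently fix this problem?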

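You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

    p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

Or work entirely in unicode.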
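One way to read this advice is to keep every value as text and encode only once, at the point where bytes are actually needed. The sketch below assumes both values are already text; the function name build_agent_info and the sample strings are made up for illustration, and the code is written so that it also runs on Python 3 (which accepts the u'' prefix).

```python
# Hypothetical helper: join text values as text, encode once at the boundary.
def build_agent_info(agent_contact, agent_telno, encoding='utf-8'):
    """Join two text values and return (text, encoded bytes)."""
    text = u' '.join((agent_contact, agent_telno)).strip()
    return text, text.encode(encoding)


# Sample values are invented; u'\xa0' is the no-break space from the traceback.
text, data = build_agent_info(u'Jane Doe', u'020\xa07946 0000')
print(repr(text))
print(repr(data))
```

Returning both forms is only a convenience for the demonstration; in real code the text form would normally be passed around and the encoding step left to whatever writes the output.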

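This is a classic Python unicode pain point! Consider the following:

    a = u'bats\u00E0'
    print a
    => batsà

All good so far, but if we call str(a), let's see what happens:

    str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell Python what codec to use:

    a.encode('utf-8')
    => 'bats\xc3\xa0'
    print a.encode('utf-8')
    => batsà

Voil\u00E0!

The issue is that when you call str(), Python uses the default character encoding to try to encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell Python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk: http://nedbatchelder.com/text/unipain.html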
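The failure can be reproduced without str() by asking for the ASCII codec explicitly, which is effectively what Python 2 does behind the scenes when str() is handed a unicode object. A minimal sketch, written so that it also runs on Python 3; the variable names are only illustrative:

```python
a = u'bats\u00e0'

# Encoding with the ASCII codec fails for any code point above 127.
try:
    a.encode('ascii')
except UnicodeEncodeError as exc:
    failure = (exc.encoding, exc.start, exc.reason)
    print(failure)

# Naming a codec that can represent the character succeeds.
encoded = a.encode('utf-8')
print(repr(encoded))

# Decoding with the same codec gives the original text back.
assert encoded.decode('utf-8') == a
```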

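I found an elegant workaround that removes the symbols and keeps the string as a string, as follows:

    yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

Note that using the ignore option is dangerous, because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here:

    >>> 'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
    'City: Malm'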
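If some characters really cannot be kept, the other standard error handlers at least leave a visible trace instead of dropping data silently. The sketch below compares four of them on the same sample string; it is an illustration of the codec error handlers, not a recommendation for any particular one.

```python
text = u'City: Malm\u00f6'
handlers = ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')

# Each handler decides what happens to characters ASCII cannot represent.
results = {handler: text.encode('ascii', handler) for handler in handlers}

for handler in handlers:
    print(handler, repr(results[handler]))
```

Only 'ignore' loses the character without a trace; 'xmlcharrefreplace' may be a reasonable fit when the output is HTML, since a browser renders the numeric reference as the original character.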

A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it: Debian wiki on Locale

    $ echo $LANG
    en_US.utf8
    $ echo $LC_ALL
    C
    $ python -c "print (u'voil\u00e0')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
    $ export LC_ALL='en_US.utf8'
    $ python -c "print (u'voil\u00e0')"
    voilà
    $ unset LC_ALL
    $ python -c "print (u'voil\u00e0')"
    voilà
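To see what Python itself has picked up from the environment, it can help to print the relevant settings from inside the interpreter. This is a small diagnostic sketch; the function name describe_text_environment is hypothetical, and the values it reports depend entirely on the machine it runs on.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect the settings that decide how print() encodes text."""
    return {
        # May be None when stdout is not a real terminal or file.
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        'preferred_encoding': locale.getpreferredencoding(False),
        'LC_ALL': os.environ.get('LC_ALL'),
        'LC_CTYPE': os.environ.get('LC_CTYPE'),
        'LANG': os.environ.get('LANG'),
    }


for name, value in sorted(describe_text_environment().items()):
    print('%s = %r' % (name, value))
```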

Well, I tried everything and nothing helped; after googling around I found the following, and it helped. Python 2.7 is in use.

    # encoding=utf8
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

I've actually found that in most of my cases, just stripping out those characters is much simpler:

    s = mystring.decode('ascii', 'ignore')

What worked for me was:

    BeautifulSoup(html_text, from_encoding="utf-8")

Hope this helps someone.

Add the line below at the beginning of your script (or as the second line):

    # -*- coding: utf-8 -*-

That's the definition of Python source code encoding. More info in PEP 263.

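The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing the language-pack-en package to fix that:

    sudo apt-get install language-pack-en

It provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).

On some Linux distributions it is required in order to make sure that the default English locales are set up properly (so that unicode characters can be handled by the shell/terminal). Sometimes it is easier to install it than to configure it manually.

Then, when writing the code, make sure you use the right encoding in your code.

For example:

    open(foo, encoding='utf-8')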
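As a fuller illustration of naming the encoding when opening files, the sketch below writes some text to a temporary file and reads it back. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3 (the built-in open only does so on Python 3); the temporary file and sample text are just for the demonstration.

```python
import io
import os
import tempfile

text = u'voil\u00e0 \u2122'

# A throwaway file; the path is only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Text goes in and comes out as text; the codec is named explicitly.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    with io.open(path, 'r', encoding='utf-8') as handle:
        round_tripped = handle.read()
    # Reading in binary mode shows the bytes that were actually stored.
    with io.open(path, 'rb') as handle:
        raw = handle.read()
finally:
    os.remove(path)

print(round_tripped == text)
print(repr(raw))
```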

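If you still have a problem, double check your system configuration, such as:

Your locale file (/etc/default/locale), which should have e.g.
```
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
```

The value of LANG / LC_CTYPE in the shell.

Check which locales your shell supports with:
```
locale -a | grep "UTF-8"
```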

Demonstrating the problem and the solution in a fresh VM.

Initialize and provision the VM (e.g. using vagrant):
```
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
```
See: available Ubuntu boxes.

Printing unicode characters (such as the trade mark sign ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)

Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.

And now the problem is solved:
```
$ python -c 'print(u"\u2122");'
™
```

Simple helper functions found here.

    def safe_unicode(obj, *args):
        """ return the unicode representation of obj """
        try:
            return unicode(obj, *args)
        except UnicodeDecodeError:
            # obj is byte string
            ascii_text = str(obj).encode('string_escape')
            return unicode(ascii_text)

    def safe_str(obj):
        """ return the byte string representation of obj """
        try:
            return str(obj)
        except UnicodeEncodeError:
            # obj is unicode
            return unicode(obj).encode('unicode_escape')
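The helpers above rely on the Python 2 unicode type and the string_escape codec, neither of which exists on Python 3. A rough Python 3 counterpart might look like the sketch below; the names safe_text and safe_bytes are hypothetical, and the 'backslashreplace' handler is assumed to be acceptable as the fallback (it works for decoding from Python 3.5 onwards).

```python
def safe_text(obj, encoding='utf-8'):
    """Return text for obj without raising on undecodable bytes."""
    if isinstance(obj, bytes):
        # Undecodable bytes are kept visible as \xNN escapes (Python 3.5+).
        return obj.decode(encoding, 'backslashreplace')
    return str(obj)


def safe_bytes(obj, encoding='ascii'):
    """Return bytes for obj, escaping characters the codec cannot encode."""
    if isinstance(obj, bytes):
        return obj
    return str(obj).encode(encoding, 'backslashreplace')


print(repr(safe_text(b'caf\xe9')))
print(repr(safe_bytes(u'98\xb0')))
```

As with the original helpers, the escaped output is meant for logging and debugging rather than for data that has to round-trip.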

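I just used the following:

    import unicodedata
    message = unicodedata.normalize("NFKD", message)

Check what the documentation says about it:

unicodedata.normalize(form, unistr): Return the normal form form for the Unicode string unistr. Valid values for form are "NFC", "NFKC", "NFD", and "NFKD".

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn't, they may not compare equal.

Solved it for me. Simple and easy.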
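The normalization forms quoted above can be checked directly with a few short strings. The sketch below also suggests a likely reason NFKD helps with the error in the question: the no-break space u'\xa0' has a compatibility decomposition to an ordinary space. Accented letters, by contrast, still contain a non-ASCII combining mark after NFKD, so normalization alone does not make arbitrary text ASCII-safe.

```python
import unicodedata

# U+00C7 versus "C" followed by U+0327 COMBINING CEDILLA.
composed = u'\u00c7'
decomposed = u'C\u0327'
print(composed == decomposed)
print(unicodedata.normalize('NFD', composed) == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)

# Compatibility decomposition maps the no-break space to a plain space,
# and ROMAN NUMERAL ONE to the letter I.
nbsp = unicodedata.normalize('NFKD', u'\u00a0')
roman_one = unicodedata.normalize('NFKD', u'\u2160')
print(repr(nbsp), repr(roman_one))

# An accented letter becomes a base letter plus a combining mark.
a_grave = unicodedata.normalize('NFKD', u'\u00e0')
print(repr(a_grave))
```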

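I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

    # 'value' contains the problematic data
    unic = u''
    unic += value
    value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.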

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

    def safeStr(obj):
        try:
            return str(obj)
        except UnicodeEncodeError:
            return obj.encode('ascii', 'ignore').decode('ascii')
        return ""

Testing it:

    if __name__ == '__main__':
        print safeStr( 1 )
        print safeStr( "test" )
        print u'98\xb0'
        print safeStr( u'98\xb0' )

Results:

    1
    test
    98°
    98

The solution below worked for me. I just added u"String" (representing the string as unicode) before my string.

    result_html = result.to_html(col_space=1, index=False, justify={'right'})

    text = u"""
    <html>
    <body>
    <p>
    Hello all, <br>
    <br>
    Here's weekly enterprise enrollment summary report. Let me know if you have any questions. <br>
    <br>
    7 Day Summary <br>
    <br>
    <br>
    {0}
    </p>
    <p>Thanks,</p>
    <p>Lookout Data Team</p>
    </body></html>
    """.format(result_html)
