Python and BeautifulSoup encoding issues

I'm writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site:

http://www.elnorte.ec/

I'm fetching the contents with the requests library:

    import requests

    r = requests.get('http://www.elnorte.ec/')
    content = r.content

If I print the content variable at that point, all the Spanish special characters seem to be fine. However, once I feed the content variable to BeautifulSoup, it all gets messed up:

    soup = BeautifulSoup(content)
    print(soup)
    ...
    <a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&amp;month=08&amp;day=27&amp;modid=203" title="1009 artÃculos en este dÃa">
    ...

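It is apparently garbling all the Spanish special characters (accents and whatnot). I've tried content.decode('utf-8') and content.decode('latin-1'), and I've also tried messing with the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.

Any pointers would be much appreciated.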

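You could try:

    import urllib
    import BeautifulSoup

    r = urllib.urlopen('http://www.elnorte.ec/')
    x = BeautifulSoup.BeautifulSoup(r.read())  # read() must be called; Python 2 with BeautifulSoup 3
    r.close()

    print x.prettify('latin-1')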

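I get the correct output. Oh, in this special case you could also use x.__str__(encoding='latin1').

I guess this is because the content is in ISO-8859-1(5) and the meta http-equiv content-type incorrectly says "UTF-8".

Could you confirm?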
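One way to check a guess like this is to reproduce the symptom offline. The sketch below (Python 3, standard library only) takes a sample word, assumed here from the title attribute in the question's output, encodes it as UTF-8, and then decodes those bytes as Latin-1. The result has the same "Ã" pattern as the garbled output, which suggests that somewhere along the way UTF-8 bytes are being read with a single-byte codec.

```python
# Reproduce the mojibake from the question without any network access.
word = "artículos"                      # sample text, assumed from the question's output

utf8_bytes = word.encode("utf-8")       # 'í' becomes the two bytes 0xC3 0xAD
garbled = utf8_bytes.decode("latin-1")  # each byte is read as one character

# 0xC3 turns into 'Ã'; 0xAD turns into a soft hyphen (U+00AD), which is
# invisible when printed and shows up as \xad in the repr.
print(repr(garbled))

# Undoing the wrong decoding step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The invisible soft hyphen would explain why the question shows "artÃculos" with nothing visible after the "Ã".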

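In your case the page contains invalid UTF-8 data, which confuses BeautifulSoup and makes it think the page uses Windows-1252. You can use this trick:

    soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8', 'ignore'))

By doing this you discard any invalid symbols from the page source, and BeautifulSoup will guess the encoding correctly.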

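You can substitute 'replace' for 'ignore' and then search the text for the replacement character (U+FFFD, often displayed as '?') to see what was discarded.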
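To make the difference between the error handlers concrete, here is a small Python 3 sketch. The byte string is made up for illustration: it is valid UTF-8 except for one stray 0xFF byte.

```python
# A byte string that is mostly UTF-8 but contains one invalid byte (0xFF).
raw = b"art\xc3\xadculos \xff en este d\xc3\xada"

try:
    raw.decode("utf-8")                    # strict mode raises on the bad byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", "ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", "replace")  # the bad byte becomes U+FFFD

print(strict_ok)
print(ignored)
print(replaced.count("\ufffd"), "byte(s) could not be decoded")
```

Counting U+FFFD in the 'replace' result gives a rough idea of how much of the page was not valid UTF-8 before deciding that 'ignore' is acceptable.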

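Actually, it is very hard to write a crawler that guesses the page encoding correctly every time (browsers are very good at this nowadays). You can use modules like 'chardet', but in your case it guesses the encoding as ISO-8859-2, which is not correct either.

If you really need to get the encoding right for any page a user might supply, you should either build a multi-level detection function (try utf-8, try latin1, try others, and so on), like we did in our project, or use the detection code from Firefox or Chromium as a C module.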
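A multi-level detection function of that kind could look roughly like the Python 3 sketch below. The helper name `decode_with_fallbacks` and the order of codecs are assumptions for illustration, not the code from the answerer's project. Each codec is tried strictly, and Latin-1 is the last resort because it maps every byte to a character and so never fails.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for a byte string.

    Hypothetical helper: tries each codec strictly, in order, and falls
    back to Latin-1, which accepts any byte sequence.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


print(decode_with_fallbacks(b"art\xc3\xadculos"))  # valid UTF-8
print(decode_with_fallbacks(b"art\xedculos"))      # not UTF-8, but valid cp1252
```

Putting UTF-8 first is a deliberate choice: it is strict enough that text in another encoding rarely decodes as valid UTF-8 by accident, whereas the single-byte codecs accept almost anything.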

The first answer is right; these functions are sometimes useful as well.

    # Python 2 helpers: "unicode" is the Python 2 text type.
    def __if_number_get_string(number):
        converted_str = number
        if isinstance(number, int) or \
                isinstance(number, float):
            converted_str = str(number)
        return converted_str


    def get_unicode(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode
        # undecodable bytes are dropped silently
        return unicode(strOrUnicode, encoding, errors='ignore')


    def get_string(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode.encode(encoding)
        return strOrUnicode

I'd suggest taking a more methodical approach.

    import urllib

    import chardet
    # assumes BeautifulSoup 3 on Python 2; with bs4 use "from bs4 import BeautifulSoup"
    from BeautifulSoup import BeautifulSoup


    def toUnicode(s):
        if type(s) is unicode:
            return s
        elif type(s) is str:
            d = chardet.detect(s)
            (cs, conf) = (d['encoding'], d['confidence'])
            if conf > 0.80:
                try:
                    return s.decode(cs, errors='replace')
                except Exception as ex:
                    pass
        # force and return only ascii subset
        return unicode(''.join([i if ord(i) < 128 else ' ' for i in s]))


    # 1. get the raw data
    raw = urllib.urlopen('http://www.elnorte.ec/').read()

    # 2. detect the encoding and convert to unicode
    content = toUnicode(raw)  # toUnicode is the rough caricature defined above

    # 3. pass unicode to beautiful soup.
    soup = BeautifulSoup(content)

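No matter what you throw at this, you can reason that it will always send valid unicode to bs.

As a result, your parse tree will behave much better and will not fail in new and more interesting ways every time you have new data.

Trial and error doesn't work in code: there are just too many combinations :-)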
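For anyone on Python 3, the same idea can be sketched as follows. This is a rough port under stated assumptions, not the answerer's code: the name `to_unicode` and the `min_confidence` parameter are made up here, and `chardet` is treated as optional so the function still returns text (the ASCII subset only) when the package is not installed.

```python
try:
    import chardet          # third-party package; optional in this sketch
except ImportError:
    chardet = None


def to_unicode(data, min_confidence=0.80):
    """Always return str, whatever comes in (Python 3 sketch)."""
    if isinstance(data, str):
        return data
    if chardet is not None:
        guess = chardet.detect(data)
        encoding, confidence = guess["encoding"], guess["confidence"]
        # Only trust the detector when it is reasonably sure.
        if encoding and confidence > min_confidence:
            try:
                return data.decode(encoding, errors="replace")
            except LookupError:
                pass        # chardet named a codec Python does not know
    # Last resort: keep the ASCII subset and blank out everything else.
    return "".join(chr(b) if b < 128 else " " for b in data)


print(to_unicode(b"hello"))
```

The property that matters is the one the answer describes: every code path returns text, so the parser never receives raw bytes of unknown encoding.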

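You can try this; it should handle any page that declares its encoding, either in the HTML itself or in the HTTP headers:

    import requests
    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector

    # url and USERAGENT are placeholders to be supplied by the caller
    headers = {"User-Agent": USERAGENT}
    resp = requests.get(url, headers=headers)

    # trust the HTTP charset only if the server actually sent one
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    # encoding declared inside the document itself
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)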
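The precedence rule in that snippet (declared-in-document first, HTTP header second) can be illustrated with a standard-library-only sketch. The function `pick_encoding` and its regular expressions are hypothetical simplifications written for this example; they are not how bs4 implements `find_declared_encoding`, and they will miss declarations that a real HTML parser would find.

```python
import re

# Simplified patterns, assumed good enough for illustration only.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.I)
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)


def pick_encoding(content_type, body):
    """Prefer the charset declared in the HTML, then the HTTP header."""
    meta = _META_CHARSET.search(body[:4096])   # declarations sit near the top
    if meta:
        return meta.group(1).decode("ascii").lower()
    header = _HEADER_CHARSET.search(content_type or "")
    if header:
        return header.group(1).lower()
    return None                                # let the parser guess


page = b'<html><head><meta charset="utf-8"></head><body></body></html>'
print(pick_encoding("text/html; charset=ISO-8859-1", page))
print(pick_encoding("text/html; charset=ISO-8859-1", b"<html></html>"))
print(pick_encoding("text/html", b"<html></html>"))
```

Note that this approach trusts whatever the page declares, so it does not help when the declaration itself is wrong.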
