Decoding HTML entities in a Python string?

I'm parsing some HTML with Beautiful Soup 3, but it contains an HTML entity that Beautiful Soup 3 doesn't automatically decode for me:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<p>&pound;682m</p>")
    >>> text = soup.find("p").string
    >>> print text
    &pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Python 3.4+

HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake (it was eventually removed in Python 3.9). Use html.unescape() instead:

    import html
    print(html.unescape('&pound;682m'))

See https://docs.python.org/3/library/html.html#html.unescape
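As a small illustration of the call above, html.unescape() decodes named, decimal, and hexadecimal character references alike. The sample strings below are made up for this sketch; only the first one comes from the question.

```python
import html

# Three spellings of the same pound sign: named, decimal, and hexadecimal.
# The second and third samples are illustrative, not from the question.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m"]

decoded = [html.unescape(s) for s in samples]
print(decoded)
```

Each element of decoded should be the same string, "£682m".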

Python 2.6-3.3

You can use the HTML parser from the standard library:

    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m

See http://docs.python.org/2/library/htmlparser.html

You can also use the six compatibility library to simplify the import:

    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
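If the same code has to run on both old and new interpreters, one possible approach is to pick the best available function at import time. This is only a sketch: the name decode_entities is hypothetical and not part of any library, and the fallback branches only run on interpreters older than Python 3.4.

```python
try:
    # Python 3.4+: the supported function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bound method of a parser instance, used only on old interpreters
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: decode HTML entities with whatever is available."""
    return unescape(text)


print(decode_entities('&pound;682m'))
```

Trying html.unescape first means the deprecated method is never touched on interpreters where it is deprecated or gone.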

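Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to specify the convertEntities argument to the BeautifulSoup constructor (see the "Entity Conversion" section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.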

Beautiful Soup 3

    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>

Beautiful Soup 4

    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>

You can use replace_entities from the w3lib.html library:

    In [202]: from w3lib.html import replace_entities

    In [203]: replace_entities("&pound;682m")
    Out[203]: u'\xa3682m'

    In [204]: print replace_entities("&pound;682m")
    £682m

Beautiful Soup 4 lets you set a formatter for your output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

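    from bs4 import BeautifulSoup

    # Assumption: soup is built as in the Beautiful Soup documentation example
    soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")

    print(soup.prettify(formatter=None))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit <<Sacré bleu!>>
    #   </p>
    #  </body>
    # </html>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    print(link_soup.a.encode(formatter=None))
    # <a href="http://example.com/?foo=val1&bar=val2">A link</a>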

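This may not be relevant here, but to remove these HTML entities from an entire document, you can do something like the following. (It assumes the document is in page. Please forgive the sloppy code, but if you have ideas on how to make it better, I'm all ears; I'm new to this.)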

    import re
    import HTMLParser  # Python 2 module name

    regexp = "&.+?;"
    list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
    h = HTMLParser.HTMLParser()  # one parser instance is enough for the whole loop
    for e in list_of_html:
        unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
        page = page.replace(e, unescaped)  # replaces the HTML entity with its unescaped value
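On Python 3.4+, the loop above is probably unnecessary: html.unescape() accepts an arbitrary string, so the whole document can be passed in one call, with no regex and no per-entity replace. A minimal sketch, using a made-up page string for illustration:

```python
import html

# Illustrative document; any string containing entities works the same way.
page = "<p>Price: &pound;682m &amp; rising, &#163;1 = &#xA3;1</p>"

# One pass over the whole document decodes every entity it contains.
decoded = html.unescape(page)
print(decoded)
```

Note that this decodes markup-significant entities such as &amp; and &lt; as well, so the result is meant for display or text processing, not for re-parsing as HTML.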
