Convert HTML entities to Unicode and vice versa

Possible duplicates:

Convert XML/HTML entities into a Unicode string in Python

HTML entity codes to text

How do you convert HTML entities to Unicode and vice versa in Python?

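As for the "vice versa" (which I needed myself; that is how I found this question, which did not help, and later another site that had the answer):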

u'some string'.encode('ascii', 'xmlcharrefreplace')

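will return a plain string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.

You need to have BeautifulSoup.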
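A minimal sketch of the same call under Python 3, where `encode` returns `bytes` rather than a plain string. The sample text is an assumption chosen for illustration. Note that `xmlcharrefreplace` only touches non-ASCII characters, so `&`, `<` and `>` pass through unchanged.

```python
# Python 3: str.encode() returns bytes, so decode back to str if text is needed.
text = 'Curé costs 5 € & more'

encoded = text.encode('ascii', 'xmlcharrefreplace')  # bytes
as_str = encoded.decode('ascii')                     # str again

print(as_str)  # Cur&#233; costs 5 &#8364; & more
```

The bare `&` survives because it is already ASCII; escaping markup characters is a separate step.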

    from BeautifulSoup import BeautifulStoneSoup
    import cgi

    def HTMLEntitiesToUnicode(text):
        """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
        text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
        return text

    def unicodeToHTMLEntities(text):
        """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
        text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
        return text

    text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

    uni = HTMLEntitiesToUnicode(text)
    htmlent = unicodeToHTMLEntities(uni)

    print uni
    print htmlent
    # &, ®, <, >, ¢, £, ¥, €, §, ©
    # &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Update for Python 2.7 and BeautifulSoup4

Unescape: Unicode HTML to unicode with HTMLParser (Python 2.7 standard library):

    >>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
    >>> from HTMLParser import HTMLParser
    >>> htmlparser = HTMLParser()
    >>> unescaped = htmlparser.unescape(escaped)
    >>> unescaped
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print unescaped
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape: Unicode HTML to unicode with bs4 (BeautifulSoup4):

    >>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html)
    >>> soup.text
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print soup.text
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):

    >>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
    >>> from bs4.dammit import EntitySubstitution
    >>> escaper = EntitySubstitution()
    >>> escaped = escaper.substitute_html(unescaped)
    >>> escaped
    u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
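If bs4 is not available, named entities can be approximated with the Python 3 standard library. The sketch below relies on the `html.entities.codepoint2name` table. The helper name `to_named_entities` is hypothetical (it belongs neither to bs4 nor to the answers here), and its output is not guaranteed to match `EntitySubstitution` in every case.

```python
from html.entities import codepoint2name

def to_named_entities(text):
    """Replace characters that have an HTML 4 entity name with &name;.

    Hypothetical helper: other non-ASCII characters fall back to numeric
    references, and plain ASCII is copied through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # Covers & < > " as well as accented letters and symbols.
            out.append('&%s;' % codepoint2name[cp])
        elif cp > 127:
            out.append('&#%d;' % cp)
        else:
            out.append(ch)
    return ''.join(out)

print(to_named_entities('Curé of the «Notre-Dame-de-Grâce»'))
# Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo;
```

Because the table also contains `&`, `<`, `>` and `"`, this helper escapes markup characters too, so it should not be applied to text that has already been escaped.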

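As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings. Note, however, that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function does not escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)

It has been suggested to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)

html.unescape(s) was also introduced, in version 3.4.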
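To make the quoting difference concrete, here is a small Python 3 sketch of `html.escape` with and without quote escaping. The sample string is an assumption chosen to contain both kinds of quotes.

```python
import html

s = '<a href="x">Tom & Jerry\'s</a>'

# Default (quote=True): both double and single quotes are escaped.
with_quotes = html.escape(s)

# quote=False: only &, < and > are escaped.
without_quotes = html.escape(s, quote=False)

print(with_quotes)     # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;
print(without_quotes)  # &lt;a href="x"&gt;Tom &amp; Jerry's&lt;/a&gt;
```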

So in Python 3.4 you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
Use html.unescape(text) to convert HTML entities back to their plain-text representation.
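The two calls above can be sketched as a round trip in Python 3.4 or later. The wrapper names `unicode_to_entities` and `entities_to_unicode` are hypothetical, chosen only for this illustration.

```python
import html

def unicode_to_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    return html.escape(text).encode('ascii', 'xmlcharrefreplace').decode('ascii')

def entities_to_unicode(text):
    """Resolve named and numeric character references back to characters."""
    return html.unescape(text)

original = '&, ®, <, >, ¢, £, ¥, €, §, ©'
escaped = unicode_to_entities(original)
restored = entities_to_unicode(escaped)

print(escaped)
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
print(restored == original)  # True
```

`html.unescape` also accepts named entities such as `&reg;` or `&euro;`, so it handles input that was not produced by `unicode_to_entities`.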

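I use the following function to write unicode text taken from an xls file into an html file while preserving the special characters found in the xls file: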

    import html

    def html_wr(f, dat):
        '''Write dat to file f as HTML.

        The file is assumed to be opened in binary mode.
        If dat is empty it is replaced with a non-breaking space.
        Non-ASCII characters are translated to XML character references.
        '''
        if not dat:
            dat = '&nbsp;'
        try:
            f.write(dat.encode('ascii'))
        except UnicodeEncodeError:
            # Fall back to escaped markup with numeric character references.
            f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

Hope this is useful to somebody.
