How do I perform HTML decoding/encoding using Python/Django?

I have a string that is HTML-encoded:

&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;

I want to change that to:

 <img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />

I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

I've found how to do this in C#, but not in Python. Can someone help me out?

Thanks.

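Edit: Someone asked why my string is stored like that. It's because I am using a web-scraping tool that "scans" a web page and gets certain content from it. The tool (BeautifulSoup) returns the string in that format.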

Related

Convert XML/HTML entities into a Unicode string in Python

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

    def escape(html):
        """Returns the given HTML with ampersands, quotes and carets encoded."""
        return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

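To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems: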

    def html_decode(s):
        """
        Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>.
        """
        htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;'),
        )
        for code in htmlCodes:
            s = s.replace(code[1], code[0])
        return s

    unescaped = html_decode(my_string)

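This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: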

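    # Python 2.x:
    import HTMLParser
    html_parser = HTMLParser.HTMLParser()
    unescaped = html_parser.unescape(my_string)

    # Python 3.x (before 3.4):
    import html.parser
    html_parser = html.parser.HTMLParser()
    unescaped = html_parser.unescape(my_string)

    # Python 3.4+ (HTMLParser.unescape was removed in Python 3.9):
    import html
    unescaped = html.unescape(my_string)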
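To illustrate why the hand-rolled table is not a general solution, here is a minimal sketch for Python 3.4+ that compares a five-entity replacement table with html.unescape. The helper name decode_five and the sample strings are made up for this illustration; only html.unescape comes from the standard library.

```python
import html

# The five replacements that reverse django.utils.html.escape;
# '&amp;' is handled last on purpose.
FIVE = (("'", "&#39;"), ('"', "&quot;"), (">", "&gt;"), ("<", "&lt;"), ("&", "&amp;"))


def decode_five(s):
    """Hypothetical helper: undo only the five entities listed in FIVE."""
    for char, entity in FIVE:
        s = s.replace(entity, char)
    return s


encoded = "&lt;img alt=&quot;&quot; width=&quot;300&quot; /&gt;"
other = "Sacr&eacute; bl&#101;u!"

print(decode_five(encoded))    # a tag like the one in the question decodes fine
print(html.unescape(encoded))  # same result from the standard library
print(decode_five(other))      # unchanged: &eacute; and &#101; are not in the table
print(html.unescape(other))    # named and numeric references are both resolved
```

For input that only ever passed through django.utils.html.escape the two approaches should agree; they diverge as soon as the input contains any other named or numeric reference.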

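As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.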

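With Django, escaping only occurs during template rendering; so to prevent escaping, you just tell the templating engine not to escape your string. To do that, use one of these options in your template: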

    {{ context_var|safe }}

    {% autoescape off %}
        {{ context_var }}
    {% endautoescape %}

With the standard library:

HTML Escape

    try:
        from html import escape  # python 3.x
    except ImportError:
        from cgi import escape  # python 2.x

    print(escape("<"))

HTML Unescape

    try:
        from html import unescape  # python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # python 3.x (<3.4)
        except ImportError:
            from HTMLParser import HTMLParser  # python 2.x
        unescape = HTMLParser().unescape

    print(unescape("&gt;"))

For HTML encoding, there's cgi.escape from the standard library:

    >>> help(cgi.escape)
    cgi.escape = escape(s, quote=None)
        Replace special characters "&", "<" and ">" to HTML-safe sequences.
        If the optional flag quote is true, the quotation mark character (")
        is also translated.
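Note that cgi.escape was removed in Python 3.8; html.escape is its replacement on Python 3. The sketch below shows the one behavioral difference worth knowing: in html.escape the quote flag defaults to True and covers single quotes as well. The sample string is made up for illustration.

```python
from html import escape

text = """<a href="x">Tom & 'Jerry'</a>"""

# quote=True (the default) also escapes double and single quotes.
with_quotes = escape(text)
# quote=False leaves quotes alone, like cgi.escape(s) without the quote flag.
without_quotes = escape(text, quote=False)

print(with_quotes)
print(without_quotes)
```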

For HTML decoding, I use the following:

    import re
    from htmlentitydefs import name2codepoint

    # for some reason, python 2.5.2 doesn't have this one (apostrophe)
    name2codepoint['#39'] = 39

    def unescape(s):
        "unescape HTML code refs; cf http://wiki.python.org/moin/EscapingHtml"
        return re.sub('&(%s);' % '|'.join(name2codepoint),
                      lambda m: unichr(name2codepoint[m.group(1)]), s)
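The snippet above targets Python 2 (htmlentitydefs, unichr). A rough Python 3 port of the same regex idea might look like the sketch below; the names _NAMED and unescape_named are invented here, and the port deliberately handles named entities only, so numeric references such as &#39; are left untouched.

```python
import re
from html.entities import name2codepoint

# One alternation of every known HTML 4 entity name, e.g. "&(quot|amp|lt|...);"
_NAMED = re.compile("&(%s);" % "|".join(name2codepoint))


def unescape_named(s):
    """Replace named entity references such as &lt; with their characters."""
    return _NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), s)


print(unescape_named("&lt;p&gt;caf&eacute; &amp;amp; &bogus;"))
```

Because re.sub makes a single pass, the text produced by one replacement is never scanned again, so "&amp;amp;" comes out as "&amp;" rather than "&". Unknown names such as "&bogus;" are left as they are.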

For anything more complicated, I use BeautifulSoup.

If the set of encoded characters is relatively restricted, use Daniel's solution. Otherwise, use one of the many HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation:

    from BeautifulSoup import BeautifulStoneSoup
    BeautifulStoneSoup("Sacr&eacute; bl&#101;u!",
                       convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
    # u'Sacr\xe9 bleu!'

See the bottom of this page on the Python wiki; there are at least two options for "unescaping" HTML.

In Python 3.4+:

    import html
    html.unescape(your_string)

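To quote Daniel's answer:

"Escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape; you just tell the templating engine not to escape, with either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}."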

I found a fine function at http://snippets.dzone.com/posts/show/4569:

    def decodeHtmlentities(string):
        import re
        entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

        def substitute_entity(match):
            from htmlentitydefs import name2codepoint as n2cp
            ent = match.group(2)
            if match.group(1) == "#":
                return unichr(int(ent))
            else:
                cp = n2cp.get(ent)
                if cp:
                    return unichr(cp)
                else:
                    return match.group()

        return entity_re.subn(substitute_entity, string)[0]

If anyone is looking for a simple way to do this via Django templates, you can always use a filter like this:

    <html>
    {{ node.description|safe }}
    </html>

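I had some data coming from a vendor, and everything I posted had the HTML tags actually written out on the rendered page, as if you were looking at the source. The above code helped me greatly. Hope this helps others.

Cheers!!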

Even though this is a really old question, this may work.

Django 1.5.5

    In [1]: from django.utils.text import unescape_entities

    In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
    Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

I found this in the Cheetah source code:

    htmlCodes = [
        ['&', '&amp;'],
        ['<', '&lt;'],
        ['>', '&gt;'],
        ['"', '&quot;'],
    ]
    htmlCodesReversed = htmlCodes[:]
    htmlCodesReversed.reverse()

    def htmlDecode(s, codes=htmlCodesReversed):
        """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
        for code in codes:
            s = s.replace(code[1], code[0])
        return s

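I'm not sure why they reverse the list. I think it has to do with the way they encode, so you may not need to reverse it. Also, if I were you I would change htmlCodes to a list of tuples rather than a list of lists... this is going in my library though :)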
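One likely reason for the reversal: when encoding, '&' has to be replaced first so that the ampersands introduced by the other entities are not escaped again, and decoding then has to undo '&amp;' last. The sketch below demonstrates this with generic encode/decode helpers written for this illustration (they mirror, but are not, the Cheetah functions).

```python
PAIRS = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;")]


def encode(s, pairs=PAIRS):
    """Replace each character with its entity, '&' first."""
    for char, entity in pairs:
        s = s.replace(char, entity)
    return s


def decode(s, pairs):
    """Replace each entity with its character, in the order given."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


original = "literal &lt; text"  # the author really typed the five characters "&lt;"
encoded = encode(original)      # the leading '&' becomes '&amp;'

print(encoded)
print(decode(encoded, PAIRS[::-1]))  # '&amp;' undone last: round trip is exact
print(decode(encoded, PAIRS))        # '&amp;' undone first: decoded one level too far
```

With the forward order, "&amp;lt;" first becomes "&lt;" and is then decoded a second time into "<", so the reversed order is needed for a faithful round trip.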

I noticed your title asks for encoding too, so here is Cheetah's encode function.

    def htmlEncode(s, codes=htmlCodes):
        """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
        for code in codes:
            s = s.replace(code[0], code[1])
        return s

You can also use django.utils.html.escape:

    from django.utils.html import escape
    something_nice = escape(request.POST['something_naughty'])

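Below is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and it assumes that every entity decodes to a single code point, which is wrong for entities like &NotEqualTilde;: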

http://www.w3.org/TR/html5/named-character-references.html

 NotEqualTilde; U+02242 U+00338 ≂̸

With those caveats, here's the code.

    import re
    import htmlentitydefs  # Python 2 module

    def decodeHtmlText(html):
        """
        Given a string of HTML that would parse to a single text node,
        return the text value of that node.
        """
        # Fast path for common case.
        if html.find("&") < 0:
            return html
        return re.sub(
            '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
            _decode_html_entity,
            html)

    def _decode_html_entity(match):
        """
        Regex replacer that expects hex digits in group 1, or
        decimal digits in group 2, or a named entity in group 3.
        """
        hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
        if hex_digits:
            return unichr(int(hex_digits, 16))
        decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
        if decimal_digits:
            return unichr(int(decimal_digits, 10))
        name = match.group(3)  # name is 'lt' when '&lt;' was matched.
        if name:
            decoding = (htmlentitydefs.name2codepoint.get(name)
                # Treat &GT; like &gt;.
                # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
                # If htmlentitydefs included mappings for those entities,
                # then this code will magically work.
                or htmlentitydefs.name2codepoint.get(name.lower()))
            if decoding is not None:
                return unichr(decoding)
        return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
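For comparison, on Python 3.4+ the standard library's html.unescape uses the HTML5 entity table (html.entities.html5), which does cover multi-code-point entities. The sketch below only inspects what the standard library returns; the variable names are illustrative.

```python
import html
from html.entities import html5, name2codepoint

decoded = html.unescape("&NotEqualTilde;")

# One entity, two code points (U+2242 followed by U+0338).
print([hex(ord(c)) for c in decoded])

# The older name2codepoint table (HTML 4 names only) does not know this entity,
# while the HTML5 table does. Note the trailing ';' in the html5 key.
print("NotEqualTilde" in name2codepoint)
print(html5["NotEqualTilde;"] == decoded)

# Case matters in HTML5: &GT; and &gt; are the same character, &Gt; is not.
print(html.unescape("&GT;"), html.unescape("&gt;"), html.unescape("&Gt;"))
```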

This is the easiest solution to this problem:

    {% autoescape off %}
        {{ body }}
    {% endautoescape %}

From this page.
