Get all text inside a tag in lxml

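I'd like to write a code snippet, using lxml, that grabs all of the text inside the <content> tag in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?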

    <!--1-->
    <content>
    <div>Text inside tag</div>
    </content>
    #should return "<div>Text inside tag</div>"

    <!--2-->
    <content>
    Text with no tag
    </content>
    #should return "Text with no tag"

    <!--3-->
    <content>
    Text outside tag <div>Text inside tag</div>
    </content>
    #should return "Text outside tag <div>Text inside tag</div>"

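Try:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        # Note: tostring(c) already serializes c.text and, by default, c.tail,
        # so listing them separately repeats that text in the result.
        # On Python 3, tostring returns bytes unless encoding='unicode' is passed.
        parts = ([node.text] +
                 list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))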

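Example:

    from lxml import etree
    node = etree.fromstring("""<content>
    Text outside tag <div>Text <em>inside</em> tag</div>
    </content>""")
    stringify_children(node)

The intended output is: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'. As written, however, the function repeats each child's text and tail, because tostring(c) already includes them.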

Does text_content() do what you need?

Just use the node.itertext() method, as in:

  "".join([x for x in node.itertext()])

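A version of albertov's stringify_children that fixes the bug reported by hoju:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        # with_tail=False keeps each child's tail out of tostring(c), so the tail
        # is added exactly once; encoding='unicode' returns str rather than bytes.
        parts = ([node.text] +
                 list(chain(*([tostring(c, with_tail=False, encoding='unicode'), c.tail]
                              for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))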

The following snippet relies on a Python generator; it worked well for me and is efficient:

    ''.join(node.itertext()).strip()
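As a small illustration of what itertext() returns, the sketch below joins the text of a sample node. The markup and the name text_only are assumptions made up for this example. Note that the result holds the text only, without the inner tags, so it answers a slightly different question from the one asked.

```python
from lxml import etree

# Hypothetical sample node, similar to the third case in the question.
node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# itertext() yields node.text, then the text and tail of each descendant,
# in document order; the tags themselves are not part of the output.
text_only = "".join(node.itertext())

print(text_only)  # Text outside tag Text inside tag
```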

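    import urllib2  # Python 2; on Python 3 use urllib.request instead
    from lxml import etree
    url = 'some_url'

Fetch the URL:

    test = urllib2.urlopen(url)
    page = test.read()

Parse the page, so that all of the HTML code, including the table tag, is available:

    tree = etree.HTML(page)

XPath selector:

    # xpath() returns a list of matches, so take the first element
    table = tree.xpath("xpath_here")[0]
    res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.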

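So you can extract the text of a tag with an XPath text() query, and a tag together with its content with tostring():

    # xpath() returns a list of matches, so take the first element
    div = tree.xpath("//div")[0]
    div_res = etree.tostring(div)

    # text nodes that are direct children of <content>
    text = tree.xpath("//content/text()")

    div_3 = tree.xpath("//content")[0]
    # strip() removes any of the given characters from both ends, not the
    # literal tag, so it can also eat characters that belong to the content
    div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

This last line, which uses the strip method, is not nice, but it worked for me.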
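For comparison, the sketch below shows what three XPath expressions return on a small sample. The sample document and the variable names are assumptions for illustration only; the point is that /text() selects only the direct text nodes, while //text() and string() also reach into nested tags (and none of them keeps the tags).

```python
from lxml import etree

# Hypothetical sample document.
tree = etree.fromstring(
    "<root><content>Text outside tag <div>Text inside tag</div> trailing</content></root>"
)

# Direct text children of <content> only; the <div> text is skipped.
direct = tree.xpath("//content/text()")

# All text nodes below <content>, in document order.
all_text = tree.xpath("//content//text()")

# The XPath string value: all descendant text concatenated into one string.
string_value = tree.xpath("string(//content)")

print(direct)        # ['Text outside tag ', ' trailing']
print(all_text)      # ['Text outside tag ', 'Text inside tag', ' trailing']
print(string_value)  # Text outside tag Text inside tag trailing
```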

In response to @Richard's comment, if you patch stringify_children to read:

    parts = ([node.text] +
    --       list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
    ++       list(chain(*([tostring(c)] for c in node.getchildren()))) +
             [node.tail])

it seems to avoid the duplication he refers to.

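Defining stringify_children this way may be less complicated:

    from lxml import etree

    def stringify_children(node):
        s = node.text
        if s is None:
            s = ''
        for child in node:
            # tostring() includes each child's tail by default
            s += etree.tostring(child, encoding='unicode')
        return s

or in one line:

        return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is the same as in the earlier answers: leave the serialization of the child nodes to lxml. The tail of node itself is not of interest here, since it comes "behind" the end tag. Note that the encoding argument can be changed according to your needs.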

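Another possible solution is to serialize the node itself and then strip the start and end tags:

    def stringify_children(node):
        s = etree.tostring(node, encoding='unicode', with_tail=False)
        return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible. The code is correct only if node has no attributes, and I don't think anyone would want to use it even then.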

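I know this is an old question, but it is a common problem, and I have a solution that seems simpler than the ones suggested so far:

    import lxml.html

    def stringify_children(node):
        """Given an lxml tag, return its contents as a string

           >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
           >>> node = lxml.html.fragment_fromstring(html)
           >>> stringify_children(node)
           '<strong>Sample sentence</strong> with tags.'
        """
        if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
            return ""
        # Note: this removes the attributes from the node itself
        node.attrib.clear()
        opening_tag = len(node.tag) + 2
        closing_tag = -(len(node.tag) + 3)
        # encoding='unicode' returns str rather than bytes
        return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.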

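One of the simplest snippets, which actually worked for me and follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is:

    etree.tostring(html, method="text")

Here html is the node/tag whose complete text you are trying to read. Note, however, that it does not get rid of the contents of script and style tags.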
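A minimal sketch of this approach, assuming a small made-up HTML page parsed with etree.HTML. The encoding="unicode" argument is an addition so that the result is a str instead of bytes. The output contains the script body as well, which illustrates the caveat above.

```python
from lxml import etree

# Hypothetical page with a paragraph and a script element.
html = etree.HTML(
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)

# method="text" drops all markup and keeps only the text nodes.
text = etree.tostring(html, method="text", encoding="unicode")

print(text)
```

If script or style content is unwanted, those elements would have to be removed from the tree before serializing.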

Here is a working solution. We can get the content together with the parent tag and then cut the parent tag out of the output.

    import re
    from lxml import etree

    def _tostr_with_tags(parent_element, html_entities=False):
        # matches a parent tag without attributes wrapped around the content
        RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
        # tostring returns ASCII bytes, with other characters as &#NNN; entities
        content_with_parent = etree.tostring(parent_element).decode('ascii')

        def _replace_html_entities(s):
            RE_ENTITY = r'&#(\d+);'

            def repl(m):
                return chr(int(m.group(1)))  # unichr on Python 2

            replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
            return replaced

        if not html_entities:
            content_with_parent = _replace_html_entities(content_with_parent)

        content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

        start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

        if start_tag != end_tag:
            raise Exception('Start tag does not match to end tag while getting content with tags.')

        return content_without_parent

parent_element must be of type Element.

Caution: if you want the text content (rather than HTML entities in the text), leave the html_entities parameter as False.

lxml has a method for that (on elements parsed with lxml.html):

    node.text_content()
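A short sketch of text_content(), assuming a fragment parsed with lxml.html (the sample markup is made up). As with itertext(), the inner tags are dropped and only the text is returned.

```python
import lxml.html

# Hypothetical fragment; lxml.html elements provide text_content().
node = lxml.html.fragment_fromstring(
    "<p><strong>Sample sentence</strong> with tags.</p>"
)

# All text inside the element, including text of nested tags.
text = node.text_content()

print(text)  # Sample sentence with tags.
```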

If it is a tag, you can also try the following, which returns the values of its attributes:

    node.values()
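To show what values() gives back, here is a minimal sketch on a made-up link element. It returns the attribute values, not the text inside the tag, so it only helps when the data of interest is stored in attributes.

```python
from lxml import etree

# Hypothetical element with two attributes and some text.
link = etree.fromstring('<a href="http://lxml.de/" title="lxml">docs</a>')

# values() lists the attribute values in document order.
attribute_values = link.values()

print(attribute_values)  # ['http://lxml.de/', 'lxml']
print(link.text)         # docs
```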

    import re
    from lxml import etree

    node = etree.fromstring("""
    <content>Text before inner tag
        <div>Text
            <em>inside</em>
            tag
        </div>
        Text after inner tag
    </content>""")

    # capture everything between the outermost opening and closing tags
    print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))
