Remove a tag using BeautifulSoup but keep its contents

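Currently I have code that does something like this:

    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.extract()
    soup.renderContents()

The problem is that I don't want to throw away the contents inside the invalid tags. How do I get rid of the tag but keep its contents when calling soup.renderContents()?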

The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString as well. Try this:

    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup, NavigableString

    def strip_tags(html, invalid_tags):
        soup = BeautifulSoup(html)

        for tag in soup.findAll(True):
            if tag.name in invalid_tags:
                s = ""

                for c in tag.contents:
                    # Recurse into child tags so nested invalid tags are stripped too
                    if not isinstance(c, NavigableString):
                        c = strip_tags(unicode(c), invalid_tags)
                    s += unicode(c)

                tag.replaceWith(s)

        return soup

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    print strip_tags(html, invalid_tags)

The result is:

    <p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

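Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So you could do something like this:

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    soup = BeautifulSoup(html)
    for tag in invalid_tags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()
    print soup

This looks like it behaves the way you want and is fairly straightforward code (although it makes one pass through the DOM per tag name, which could easily be optimized).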
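The multiple passes mentioned above can be collapsed into one. The following is a minimal sketch, assuming Python 3 and BeautifulSoup 4 (`bs4`) rather than the BeautifulSoup 3 used in this answer. It relies on bs4's documented `unwrap()` method, which does the same job as `replaceWithChildren()`, and on `find_all()` accepting a list of tag names. The choice of the `html.parser` backend is my own assumption, made so that no `<html>`/`<body>` wrapper is added to the fragment.

```python
from bs4 import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ["b", "i", "u"]

# html.parser keeps the fragment as-is (no <html>/<body> wrapper)
soup = BeautifulSoup(html, "html.parser")

# One pass: find_all() with a list matches any of the listed tag names
for match in soup.find_all(invalid_tags):
    match.unwrap()  # remove the tag, keep its children in place

result = str(soup)
print(result)  # <p>Good, bad, and ugly</p>
```

Because `find_all()` returns the full list of matches before the loop starts, unwrapping an outer tag does not disturb the iteration over the tags nested inside it.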

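Although this has already been mentioned by others in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

    import bleach

    html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
    # tags=[] allows no tags at all; strip=True removes them instead of escaping them
    clean = bleach.clean(html, tags=[], strip=True)
    print clean  # Should print: "Bad Ugly Evil()"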

I have a simpler solution, but I don't know if there's a drawback to it.

UPDATE: there is a drawback, see Jesse Dhillon's comment. Also, another solution is to use Mozilla's Bleach instead of BeautifulSoup.

    from BeautifulSoup import BeautifulSoup

    VALID_TAGS = ['div', 'p']

    value = '<div><p>Hello <b>there</b> my friend!</p></div>'

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.replaceWith(tag.renderContents())

    print soup.renderContents()

This also prints <div><p>Hello there my friend!</p></div> as desired.

You can use soup.text.

.text removes all tags and concatenates all of the text.
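Note that this drops every tag, including the ones you may have wanted to keep, and that text from adjacent tags is joined with nothing in between. A small sketch of that behaviour, assuming Python 3 and BeautifulSoup 4 (`bs4`), where `.text` corresponds to `get_text()`; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

html = "<div><h2>test</h2><p>Hello <b>there</b> my friend!</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Plain concatenation: the text of <h2> and <p> runs together
plain = soup.get_text()
print(plain)   # testHello there my friend!

# A separator plus strip=True puts a space between the text fragments
spaced = soup.get_text(" ", strip=True)
print(spaced)  # test Hello there my friend!
```

So this fits when you want plain text only, not when some tags must survive.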

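You'll presumably have to move the tag's children to be children of the tag's parent before you remove the tag. Is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work:

    from BeautifulSoup import BeautifulSoup

    VALID_TAGS = 'div', 'p'

    value = '<div><p>Hello <b>there</b> my friend!</p></div>'

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            # Find the position of the tag among its parent's children
            for i, x in enumerate(tag.parent.contents):
                if x == tag:
                    break
            else:
                print "Can't find", tag, "in", tag.parent
                continue
            # Insert the children in reverse so they end up in the original order
            for r in reversed(tag.contents):
                tag.parent.insert(i, r)
            tag.extract()

    print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.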
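For readers on BeautifulSoup 4, the same "move the children up, then drop the tag" idea can be sketched as below. This is an illustration under my own assumptions (Python 3, `bs4`, the `html.parser` backend, and a hypothetical helper name `hoist_children`), not the code from this answer. It uses `Tag.index()`, which locates a child by identity, so two sibling tags with identical markup are not confused with each other.

```python
from bs4 import BeautifulSoup

VALID_TAGS = ("div", "p")

def hoist_children(html):
    """Remove tags not in VALID_TAGS, keeping their children in place."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name in VALID_TAGS:
            continue
        parent = tag.parent
        i = parent.index(tag)          # position found by identity, not equality
        children = list(tag.contents)  # copy, since insert() moves the children
        tag.extract()
        # Insert in reverse so the children keep their original order
        for child in reversed(children):
            parent.insert(i, child)
    return str(soup)

print(hoist_children("<div><p>Hello <b>there</b> my friend!</p></div>"))
# <div><p>Hello there my friend!</p></div>
```

In bs4 this is essentially what `unwrap()` does for you, so the manual version is mainly useful for seeing the mechanics.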

None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1 and also inserts a space when joining content from different tags, instead of running the words together.

    import re
    from BeautifulSoup import BeautifulSoup

    def strip_tags(html, whitelist=[]):
        """
        Strip all HTML tags except for a list of whitelisted tags.
        """
        soup = BeautifulSoup(html)

        for tag in soup.findAll(True):
            if tag.name not in whitelist:
                tag.append(' ')
                tag.replaceWithChildren()

        result = unicode(soup)

        # Clean up any repeated spaces and spaces like this: '<a>test </a> '
        result = re.sub(' +', ' ', result)
        result = re.sub(r' (<[^>]*> )', r'\1', result)
        return result.strip()

Example:

    strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
    # result: u'<a>test</a> testing again'
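Since the version above targets Python 2 and BeautifulSoup 3, here is a rough port for comparison. It is a sketch under my own assumptions: Python 3, BeautifulSoup 4 (`bs4`), the `html.parser` backend, `unwrap()` standing in for `replaceWithChildren()`, and a tuple default in place of the mutable list default. The logic is otherwise unchanged.

```python
import re
from bs4 import BeautifulSoup

def strip_tags(html, whitelist=()):
    """Strip all HTML tags except the whitelisted ones, keeping their text."""
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.append(" ")  # keep words from adjacent tags apart
            tag.unwrap()     # drop the tag itself, keep its children

    result = str(soup)

    # Collapse repeated spaces, then fix spaces like this: '<a>test </a> '
    result = re.sub(" +", " ", result)
    result = re.sub(r" (<[^>]*> )", r"\1", result)
    return result.strip()

cleaned = strip_tags("<h2><a><span>test</span></a> testing</h2><p>again</p>", ["a"])
print(cleaned)  # <a>test</a> testing again
```

The regex cleanup is a heuristic carried over from the original; it works on the serialized markup, so it may also touch spaces inside attribute values of whitelisted tags.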

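Here is a better solution, without the hassle or boilerplate code of filtering out tags while keeping their content. Say you want to remove any child tags within a parent tag and keep only the contents/text; then you can simply do:

    # div_tags is assumed to be a tag (or soup) you have already located
    for p_tags in div_tags.find_all("p"):
        print(p_tags.get_text())

That's it. You are rid of all the br, i, or b tags inside the parent tags and get the clean text.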

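Use unwrap.

unwrap() removes a single occurrence of a tag per call (here the first nobr tag, even if there are several) and keeps its contents in place. It returns the tag that was removed.

Example:

    >>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
    >>> soup
    <html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
    >>> soup.nobr.unwrap()
    <nobr></nobr>
    >>> soup
    <html><body><p>Hi. This is a  nobr </p></body></html>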

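This is an old question, but it's worth mentioning a better way to do this. First of all, BeautifulSoup 3.* is no longer being developed, so you should use BeautifulSoup 4.* instead, the so-called bs4.

Also, lxml has just the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags that should be removed while their content is pulled up into the parent tag.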
