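BeautifulSoup returns an empty list when searching by compound class names
When I search for a compound class name using a regular expression, BeautifulSoup returns an empty list.
Example: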
    import re

    from bs4 import BeautifulSoup

    bs = """
    <a class="name-single name692" href="www.example.com"">Example Text</a>
    """
    bsObj = BeautifulSoup(bs, "html.parser")

    # this returns the class
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single.*)$"))

    # this returns an empty list
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single name\d*)$"))
I need the class selection to be very precise. Any ideas?
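Unfortunately, when you try to match a regular expression against a class attribute value that contains multiple classes, BeautifulSoup applies the regular expression to each class separately. Here are the relevant threads about this problem:
- Python regular expression for Beautiful Soup
- Multiple CSS class search is unhandy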
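This is all because class is a very special multi-valued attribute. Every time you parse HTML, one of BeautifulSoup's tree builders (which one depends on the parser choice) internally splits a class string value into a list of classes (quote from the HTMLTreeBuilder's docstring):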
    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'. When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
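The effect of this splitting on the two patterns from the question can be imitated with the standard library alone. This is only a rough sketch, not BeautifulSoup's actual code: `matches_any_class` is a hypothetical helper that assumes the per-class matching described above, where the pattern is searched in each class on its own.

```python
import re

# The class attribute value from the question's <a> tag.
class_value = "name-single name692"

# Assumption: mimic the tree builder by splitting the value on whitespace.
classes = class_value.split()


def matches_any_class(pattern, class_list):
    """Hypothetical helper: True if the pattern is found in any single class."""
    return any(pattern.search(cls) is not None for cls in class_list)


loose = re.compile(r"^(name-single.*)$")
strict = re.compile(r"^(name-single name\d*)$")

# The class "name-single" on its own satisfies the loose pattern.
print(matches_any_class(loose, classes))

# No single class contains a space, so the strict pattern cannot match any of them.
print(matches_any_class(strict, classes))

# Searched in the unsplit string, the strict pattern does match.
print(strict.search(class_value) is not None)
```

Under those assumptions, the strict pattern can only succeed if it is given the whole attribute value, which is what the workarounds in this answer aim for.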
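There are multiple workarounds, but here is a hack-ish one: we are going to ask BeautifulSoup not to handle class as a multi-valued attribute by using our own simple custom tree builder: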
    import re

    from bs4 import BeautifulSoup
    from bs4.builder._htmlparser import HTMLParserTreeBuilder


    class MyBuilder(HTMLParserTreeBuilder):
        def __init__(self):
            super(MyBuilder, self).__init__()
            # BeautifulSoup, please don't treat "class" specially
            self.cdata_list_attributes["*"].remove("class")


    bs = """<a class="name-single name692" href="www.example.com"">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser", builder=MyBuilder())
    found_elements = bsObj.find_all("a", class_=re.compile(r"^name\-single name\d+$"))

    print(found_elements)
In this case, the regular expression is applied to the class attribute value as a whole.
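A similar effect may be achievable without a custom builder. The sketch below is not part of the original answer; it assumes a Beautiful Soup release whose constructor accepts the `multi_valued_attributes` keyword argument (newer versions of the library documentation describe it), and the markup is a tidied copy of the sample from the question.

```python
import re

from bs4 import BeautifulSoup

markup = '<a class="name-single name692" href="www.example.com">Example Text</a>'

# Assumption: the installed bs4 supports multi_valued_attributes.
# Passing None asks the tree builder to keep every attribute as one string.
soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)

link = soup.find("a")
# The class value is no longer split into a list of classes.
print(isinstance(link["class"], str))

# The pattern is now matched against the whole class string.
pattern = re.compile(r"^name-single name\d+$")
found_elements = soup.find_all("a", class_=pattern)
print(len(found_elements))
```

If the installed version rejects that keyword, the custom builder above remains the fallback.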
Alternatively, you can parse the HTML with the xml features enabled (if this is applicable):
    soup = BeautifulSoup(data, "xml")
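You can also use CSS selectors to match all elements that have the name-single class or whose class attribute value starts with "name":
    soup.select("a.name-single,a[class^=name]")
Then you can apply the regular expression manually if needed: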
    pattern = re.compile(r"^name-single name\d+$")
    for elm in bsObj.select("a.name-single,a[class^=name]"):
        # elm["class"] is a list of classes here; join it back into one string
        match = pattern.match(" ".join(elm["class"]))
        if match:
            print(elm)
For this use case I would simply use a custom filter, like so:
    import re

    from bs4 import BeautifulSoup

    # Compile once; the pattern is matched against the full class string
    CLASS_PATTERN = re.compile(r"^name\-single name\d+$")


    def myclassfilter(tag):
        # tag.get avoids a KeyError on tags that have no class attribute
        return CLASS_PATTERN.search(" ".join(tag.get("class", [])))


    bs = """<a class="name-single name692" href="www.example.com"">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser")
    found_elements = bsObj.find_all(myclassfilter)

    print(found_elements)