How to use a DOM parser
I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
    <div id="interestingbox">
      <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
      </div>
    </div>
    <div id="interestingbox">
    ......
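I'm trying to get the contents of these multiple div boxes using PHP. How can I use a DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox":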
    $html = '
    <html>
    <head></head>
    <body>
      <div id="interestingbox">
        <div id="interestingdetails" class="txtnormal">
          <div>Content1</div>
          <div>Content2</div>
        </div>
      </div>
      <div id="interestingbox2"><a href="#">a link</a></div>
    </body>
    </html>';

    $dom_document = new DOMDocument();
    $dom_document->loadHTML($html);

    // use DOMXPath to navigate the HTML through the DOM
    $dom_xpath = new DOMXPath($dom_document);

    // get the div with id=interestingbox; "//div" matches at any depth,
    // whereas "*/div" would only match divs directly under <html>
    $elements = $dom_xpath->query("//div[@id='interestingbox']");

    // query() returns false if the expression is invalid
    if ($elements !== false) {
        foreach ($elements as $element) {
            echo "\n[" . $element->nodeName . "]";
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                echo $node->nodeValue . "\n";
            }
        }
    }

    // OUTPUT (extra whitespace omitted):
    // [div]
    // Content1 Content2
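If the markup can't be changed and several divs really do share one id, as in the question, an XPath query still returns every matching node (unlike `getElementById`, which returns a single element). Here is a minimal sketch under that assumption. The helper name `collectBoxContents` is made up for illustration, and the second box's `Content3`/`Content4` are placeholders, since the question elides that part of the file:

```php
<?php
// Hypothetical helper: for every div carrying the given id, return the
// trimmed text of each content div nested two levels below it.
// Assumes $id contains no quote characters (it is inserted into the XPath).
function collectBoxContents(string $html, string $id): array
{
    // Duplicate ids make libxml emit warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $boxes = [];
    foreach ($xpath->query("//div[@id='$id']") as $box) {
        $contents = [];
        // Relative query: box -> details div -> content divs
        foreach ($xpath->query('div/div', $box) as $inner) {
            $contents[] = trim($inner->textContent);
        }
        $boxes[] = $contents;
    }
    return $boxes;
}

// Sample input modelled on the question; the second box is invented.
$html = '<html><body>
<div id="interestingbox"><div id="interestingdetails" class="txtnormal"><div>Content1</div><div>Content2</div></div></div>
<div id="interestingbox"><div id="interestingdetails" class="txtnormal"><div>Content3</div><div>Content4</div></div></div>
</body></html>';

print_r(collectBoxContents($html, 'interestingbox'));
```

Each entry of the returned array corresponds to one box, so the contents stay grouped per box rather than being flattened into one string.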
The same example using classes:
    $html = '
    <html>
    <head></head>
    <body>
      <div class="interestingbox">
        <div id="interestingdetails" class="txtnormal">
          <div>Content1</div>
          <div>Content2</div>
        </div>
      </div>
      <div class="interestingbox"><a href="#">a link</a></div>
    </body>
    </html>';

    // the same as before, just change the XPath
    [...]
    $elements = $dom_xpath->query("//div[@class='interestingbox']");
    [...]

    // OUTPUT (extra whitespace omitted):
    // [div]
    // Content1 Content2
    // [div]
    // a link
For more details, see the DOMXPath page of the PHP manual.
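One caveat with `@class='interestingbox'`: it compares the whole attribute string, so a div with `class="interestingbox highlighted"` would not match. A common XPath 1.0 idiom for matching a single class token is sketched below. The helper name `queryByClass` and the extra `highlighted` and `notinterestingbox` classes are assumptions added for illustration:

```php
<?php
// Hypothetical helper: select divs whose class attribute contains $class
// as a whole, space-separated token. Assumes $class contains no quotes.
function queryByClass(DOMXPath $xpath, string $class): DOMNodeList
{
    // Pad the normalized class list with spaces so " name " only matches
    // complete tokens, not substrings such as "notinterestingbox".
    $expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    return $xpath->query($expr);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body>
<div class="interestingbox highlighted"><div>Content1</div></div>
<div class="interestingbox"><a href="#">a link</a></div>
<div class="notinterestingbox">skipped</div>
</body></html>');
$xpath = new DOMXPath($doc);

$texts = [];
foreach (queryByClass($xpath, 'interestingbox') as $div) {
    $texts[] = trim($div->textContent);
}
print_r($texts);
```

The third div is left out because its class only contains `interestingbox` as a substring, not as a separate token.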
I did this with simplehtmldom as a starting point:
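    // simplehtmldom ships as a single file; adjust the path as needed
    include 'simple_html_dom.php';

    // a full URL is required; a bare 'example.com' would be read as a local file
    $html = file_get_html('http://example.com/');
    foreach ($html->find('div[id=interestingbox]') as $result) {
        echo $result->innertext;
    }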
Here is a very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
    // Return the serialized children of $node (an innerHTML equivalent for XML)
    function innerXML($node) {
        $doc = $node->ownerDocument;
        $frag = $doc->createDocumentFragment();
        foreach ($node->childNodes as $child) {
            $frag->appendChild($child->cloneNode(TRUE));
        }
        return $doc->saveXML($frag);
    }

    $dom = new DOMDocument();
    // loadXML() needs well-formed input, so the closing tags must be correct
    $dom->loadXML('
    <html>
      <body>
        <table>
          <tr>
            <td id="foo">
              The first bit of Data I want <br />The second bit of Data I want <br />The third bit of Data I want
            </td>
          </tr>
        </table>
      </body>
    </html>
    ');

    $xpath = new DOMXPath($dom);
    $node = $xpath->evaluate("/html/body//td[@id='foo']");
    $dataString = innerXML($node->item(0));

    // saveXML() writes empty elements as <br/>, so split on any spelling of the tag
    $dataArr = preg_split('#<br\s*/?>#', $dataString);
    $dataUno = trim($dataArr[0]);
    $dataDos = trim($dataArr[1]);
    $dataTres = trim($dataArr[2]);

    echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
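For HTML input loaded with `loadHTML()`, a similar helper can be built on `DOMDocument::saveHTML()`, which accepts a node argument since PHP 5.3.6. The following is only a sketch of that variant, not the function from the linked thread. The name `innerHTML` and the shortened sample cell text are chosen here for illustration:

```php
<?php
// Hypothetical helper: serialize the children of $node as HTML.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that child node
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tr><td id="foo">first<br>second<br>third</td></tr></table></body></html>');
$xpath = new DOMXPath($doc);
$td = $xpath->query("//td[@id='foo']")->item(0);

// The HTML serializer writes the line break as <br>, so split on that.
$parts = array_map('trim', explode('<br>', innerHTML($td)));
print_r($parts);
```

Compared with `saveXML()`, this keeps HTML-style void tags in the output, which matters when the result is split on `<br>` or echoed back into a page.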
WebExtractor: https://github.com/knyga/webextractor. It can parse pages with CSS, regular expression, and XPath selectors.
See the package and its tests for examples:
    use WebExtractor\DataExtractor\DataExtractorFactory;
    use WebExtractor\DataExtractor\DataExtractorTypes;
    use WebExtractor\Client\Client;

    $factory = DataExtractorFactory::getFactory();
    $extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
    $client = new Client;
    $content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
    $extractor->setContent($content);
    $h1 = $extractor->setSelector('h1')->extract();