Scraping web page content

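I am developing a project for which I want to scrape the contents of a website in the background and get some limited content from that website. For example, my page has "userid" and "password" fields; using those, I will access my mail, scrape my inbox contents, and display them in my page. Please help me solve this problem. Thanks in advance.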

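I did the above using JavaScript alone. But when I click the login button, the URL of my page ( http://localhost/web/Login.html ) changes to the URL ( http://mail.in.com/mails/inbox.php?nomail= ….) that I am scraping. I want to scrape the details without my URL changing. Please help me find a solution. Thanks in advance.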

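Definitely go with PHP Simple HTML DOM Parser. It's fast, easy, and super flexible. It basically sticks an entire HTML page into an object, and then you can access any element from that object.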

As in the example on the official site, to get all images and links on the main Google page:

    // Create DOM from URL or file
    $html = file_get_html('http://www.google.com/');

    // Find all images
    foreach ($html->find('img') as $element)
        echo $element->src . '<br>';

    // Find all links
    foreach ($html->find('a') as $element)
        echo $element->href . '<br>';

HTTP requests

First, you make an HTTP request to get the content of the page. There are several ways to do that.

fopen

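The most basic way to send an HTTP request is to use fopen. A main advantage is that you can set how many characters are read at a time, which is useful when reading very large files. It is not the easiest approach, however, and it is not recommended unless you are reading very large files and are worried about running into memory problems.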

    $fp = fopen("http://www.4wtech.com/csp/web/Employee/Login.csp", "rb");
    if (FALSE === $fp) {
        exit("Failed to open stream to URL");
    }

    $result = '';

    // Read the response in 8 KB chunks until the stream ends
    while (!feof($fp)) {
        $result .= fread($fp, 8192);
    }
    fclose($fp);
    echo $result;

file_get_contents

The easiest way is to use file_get_contents. It works more or less like fopen, but you have fewer options to choose from. The main advantage here is that it takes only one line of code.

    $result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
    echo $result;

Sockets

If you need more control over which headers are sent to the server, you can use sockets through fsockopen.

    $host = "www.4wtech.com";
    $path = "/csp/web/Employee/Login.csp";

    // fsockopen takes a host name only; the path belongs in the request line
    $fp = fsockopen($host, 80, $errno, $errstr, 30);
    if (!$fp) {
        $result = "$errstr ($errno)<br />\n";
    } else {
        $result = '';
        $out  = "GET $path HTTP/1.1\r\n";
        $out .= "Host: $host\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        // The raw response (status line, headers and body) is collected here
        while (!feof($fp)) {
            $result .= fgets($fp, 128);
        }
        fclose($fp);
    }
    echo $result;
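A socket hands back the raw response, with the status line and headers in front of the page content, so the two usually have to be separated before parsing. The following is a minimal sketch of that step; splitHttpResponse is a hypothetical helper name, and the sample response is written inline so the sketch runs without a network connection.

```php
<?php
// Hypothetical helper (not part of PHP): split a raw HTTP response
// into its status line, headers and body.
function splitHttpResponse(string $raw): array
{
    // The header section ends at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        // Each header line has the form "Name: value".
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}

// Inline sample response, an assumption made for illustration only.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = splitHttpResponse($raw);
echo $response['status'], "\n";
echo $response['headers']['content-type'], "\n";
echo $response['body'], "\n";
```

This sketch does not decode chunked transfer encoding, which an HTTP/1.1 server may use; that is one reason the higher-level options are usually less work than raw sockets.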

Streams

Alternatively, you can use streams. Streams are similar to sockets and can be used in combination with both fopen and file_get_contents.

    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "Cookie: foo=bar\r\n"
        )
    );

    // The context carries the request method and headers
    $context = stream_context_create($opts);
    $result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
    echo $result;
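The same mechanism can carry a POST body, which is what submitting a login form from the server side would need. Here is a rough sketch, assuming the form uses the "userid" and "password" field names from the question; buildPostContext is a hypothetical helper name, and the sample credentials are placeholders.

```php
<?php
// Hypothetical helper: build a stream context that sends the given
// form fields as a URL-encoded POST body.
function buildPostContext(array $fields)
{
    $body = http_build_query($fields);
    $opts = [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "Content-Length: " . strlen($body) . "\r\n",
            'content' => $body,
            'timeout' => 10, // seconds
        ],
    ];
    return stream_context_create($opts);
}

// Placeholder credentials; the field names are taken from the question.
$context = buildPostContext(['userid' => 'alice', 'password' => 'secret']);

// Inspect what would be sent, without touching the network.
$options = stream_context_get_options($context);
echo $options['http']['method'], ' ', $options['http']['content'], "\n";

// To actually send it (URL is whatever the login form posts to):
// $result = file_get_contents($loginUrl, false, $context);
```

Building the context in a function keeps the request details in one place, so the same fields can be reused with either fopen or file_get_contents.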

cURL

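If your server supports cURL (it usually does), it is recommended to use cURL. A key advantage of cURL is that it relies on a popular C library that is commonly used from other programming languages as well. It also provides a convenient way to build request headers, and it parses response headers automatically, with a simple interface for handling errors.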

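    $defaults = array(
        CURLOPT_URL            => "http://www.4wtech.com/csp/web/Employee/Login.csp",
        CURLOPT_HEADER         => 0,
        // Return the response as a string instead of printing it directly
        CURLOPT_RETURNTRANSFER => true
    );

    // Per-request overrides; empty here, so the defaults apply
    $options = array();

    $ch = curl_init();
    curl_setopt_array($ch, ($options + $defaults));
    if (!$result = curl_exec($ch)) {
        trigger_error(curl_error($ch));
    }
    curl_close($ch);
    echo $result;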
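For the login case in the question, the request has to run on the server so that the browser never leaves the original page. A possible sketch with cURL follows, assuming the mail site accepts a plain form POST with "userid" and "password" fields and tracks the session with cookies. buildLoginOptions and fetchWithOptions are hypothetical helper names, and the URLs in the usage comment are placeholders, not the real endpoints.

```php
<?php
// Hypothetical helper: cURL options for posting a login form and
// keeping the session cookie for later requests.
function buildLoginOptions(string $loginUrl, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,        // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow the post-login redirect
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here afterwards
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies from here
        CURLOPT_TIMEOUT        => 15,
    ];
}

// Hypothetical helper: run one request and return the response body.
function fetchWithOptions(array $options): string
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $result; // the handle is released when $ch goes out of scope
}

// Usage sketch (placeholder URLs, assumed field names):
// $jar  = sys_get_temp_dir() . '/mail_cookies.txt';
// $opts = buildLoginOptions('http://example.com/login',
//                           ['userid' => $userid, 'password' => $password], $jar);
// fetchWithOptions($opts);
// $inbox = fetchWithOptions([
//     CURLOPT_URL            => 'http://example.com/inbox',
//     CURLOPT_RETURNTRANSFER => true,
//     CURLOPT_COOKIEFILE     => $jar,
// ]);
```

The page at http://localhost/web/Login.html would post to a PHP script like this one, and the script would print only the parts of the inbox it extracts, so the URL shown in the browser stays the same.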

Libraries

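Alternatively, you can use one of the many PHP libraries. I wouldn't recommend using a library, though, as it is likely to be overkill. In most cases you are better off writing your own HTTP class that uses cURL.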

HTML parsing

PHP has a convenient way to load any HTML into a DOMDocument.

    $pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
    $doc = new DOMDocument();
    $doc->loadHTML($pagecontent);
    echo $doc->saveHTML();
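Once the HTML is in a DOMDocument, DOMXPath can pull out just the elements of interest. As a small sketch, the following collects every link that has an href attribute; extractLinks is a hypothetical helper name, and the HTML string is an inline sample so that no request is needed.

```php
<?php
// Hypothetical helper: return the href and text of every <a href="..."> in the HTML.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Inline sample markup, an assumption made for illustration only.
$html = '<html><body><a href="/inbox">Inbox</a> <a name="x">no href</a>'
      . ' <a href="/sent"> Sent </a></body></html>';
foreach (extractLinks($html) as $link) {
    echo $link['href'], ' => ', $link['text'], "\n";
}
```

The same pattern works for other targets by changing the XPath expression, for example to select table rows in an inbox listing.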

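Unfortunately, PHP's support for HTML5 is limited. If you run into errors when trying to parse your page content, consider using a third-party library. For that, I can recommend Masterminds/html5-php. Parsing an HTML file with this library is very similar to parsing an HTML file with DOMDocument.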

    use Masterminds\HTML5;

    $pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
    $html5 = new HTML5();
    // Parse the downloaded page content into a DOM document
    $dom = $html5->loadHTML($pagecontent);
    echo $html5->saveHTML($dom);

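Alternatively, you can use, for example, my library PHPPowertools/DOM-Query. It uses a customized version of Masterminds/html5-php to parse an HTML5 string into a DomDocument, and symfony/DomCrawler to convert CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.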

    namespace PowerTools;

    // Get file content
    $pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );

    // Define your DOMCrawler based on file string
    $H = new DOM_Query( $pagecontent );

    // Define your DOMCrawler based on an existing DOM_Query instance
    $H = new DOM_Query( $H->select('body') );

    // Passing a string (CSS selector)
    $s = $H->select( 'div.foo' );

    // Passing an element object (DOM Element)
    $s = $H->select( $documentBody );

    // Passing a DOM Query object
    $s = $H->select( $H->select('p + p') );

    // Select the body tag
    $body = $H->select('body');

    // Combine different classes as one selector to get all site blocks
    $siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

    // Nest your methods just like you would with jQuery
    $siteblocks->select('button')->add('span')->addClass('icon icon-printer');

    // Use a lambda function to set the text of all site blocks
    $siteblocks->text(function( $i, $val) {
        return $i . " - " . $val->attr('class');
    });

    // Append the following HTML to all site blocks
    $siteblocks->append('<div class="site-center"></div>');

    // Use a descendant selector to select the site's footer
    $sitefooter = $body->select('.site-footer > .site-center');

    // Set some attributes for the site's footer
    $sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

    // Use a lambda function to set the attributes of all site blocks
    $siteblocks->attr('data-val', function( $i, $val) {
        return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
    });

    // Select the parent of the site's footer
    $sitefooterparent = $sitefooter->parent();

    // Remove the class of all i-tags within the site's footer's parent
    $sitefooterparent->select('i')->removeAttr('class');

    // Wrap the site's footer within two new selectors
    $sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

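You can use PHP's cURL extension to make HTTP requests to another website from within your PHP page script. See the cURL documentation in the PHP manual.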

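Of course, the downside here is that your site will respond slowly, because you have to fetch the complete page from the external site before you can output anything to your user.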
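One common way to soften that slowdown is to cache the scraped page for a short time, so that only the first visitor in each window waits for the external site. A minimal sketch, assuming a file-based cache is acceptable; cachedFetch is a hypothetical helper name, and the closure stands in for the real scraping call.

```php
<?php
// Hypothetical helper: return the cached copy if it is fresh enough,
// otherwise call $fetch, store its result and return it.
function cachedFetch(string $cacheFile, int $ttlSeconds, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }
    $content = $fetch();
    file_put_contents($cacheFile, $content, LOCK_EX);
    return $content;
}

// Demonstration with a stand-in fetcher that counts how often it runs.
$cacheFile = sys_get_temp_dir() . '/scrape_cache_' . uniqid() . '.html';
$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    return '<html>inbox</html>'; // in practice: the cURL request
};

$first  = cachedFetch($cacheFile, 300, $fetch); // runs the fetcher
$second = cachedFetch($cacheFile, 300, $fetch); // served from the cache file
echo "fetcher ran $calls time(s)\n";
unlink($cacheFile);
```

For per-user data such as an inbox, the cache file name would need to include something that identifies the user, and the lifetime should stay short.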

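Have you tried OutWit Hub? It's a whole scraping environment. You can let it try to guess the structure or develop your own scrapers. I really suggest you have a look at it. It made my life much simpler. ZR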

I have used PHP Simple HTML DOM Parser and it works well. I used it for my Stack Overflow favorites plugin.

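PHP Simple DOM Parser has a lot of bugs and is no longer updated. I rewrote PHP Simple DOM Parser on top of the PHP DOM extension, and I am maintaining it; you can check it out here.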

You should also take a look at Apache Nutch, which its website describes as a "highly extensible, highly scalable Web crawler".

http://nutch.apache.org/

