Is there a PHP equivalent of Perl's WWW::Mechanize?

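I'm looking for a library with functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should let me submit HTTP GET and POST requests with a simple syntax, then parse the resulting page and return, in a simple format, all the forms and their fields, along with all the links on the page.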

I know about CURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).

Clarification:

I want something more high-level than the answers so far. For example, in Perl, you could do something like:

    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new();

    # navigate to the main page
    $mech->get( 'http://www.somesite.com/' );

    # follow a link that contains the text 'download this'
    $mech->follow_link( text_regex => qr/download this/i );

    # submit a POST form, to log into the site
    $mech->submit_form(
        with_fields => {
            username => 'mungo',
            password => 'lost-and-alone',
        }
    );

    # save the results as a file
    $mech->save_content('somefile.zip');

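Doing the same thing with HTTP_Client, wget, or CURL would be a lot of work: I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl. I could probably build what I need with a lot of work, but it would be much quicker if I could do the above in PHP.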

SimpleTest's ScriptableBrowser can be used independently of the testing framework. I've used it for numerous automation jobs.

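I feel compelled to answer this, even though it's an old post... I've been working with PHP curl a lot, and it is nowhere near comparable to something like WWW::Mechanize, which I am switching to (I think I'm going to go with the Ruby implementation). Curl is outdated in that it requires too much "grunt work" to automate anything. The SimpleTest scriptable browser looked promising, but in testing it wouldn't work on most of the web forms I tried it on. Honestly, I think PHP is lacking in this category of scraping and web automation, so it's best to look at a different language. I just wanted to post this since I've spent countless hours on this topic, and maybe it will save someone else some time in the future.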

Try looking in the PEAR library. If all else fails, create an object wrapper for curl.

You can do something simple like this:

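    class curl
    {
        // The underlying curl handle
        private $resource;

        public function __construct($url)
        {
            $this->resource = curl_init($url);
        }

        // Forwards $obj->foo(...) to curl_foo($handle, ...),
        // e.g. $obj->setopt(CURLOPT_RETURNTRANSFER, true)
        public function __call($function, array $params)
        {
            array_unshift($params, $this->resource);
            return call_user_func_array("curl_$function", $params);
        }
    }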

Try one of the following:

PEAR's HTTP_Request
Zend_Http_Client

(Yes, it's Zend Framework code, but using it doesn't make your class slower, since it only loads the required libraries.)

Take a look at Snoopy: http://sourceforge.net/projects/snoopy/

Curl is the way to go for simple requests. It runs cross-platform, has a PHP extension, and is widely adopted and tested.

I created a nice class that can GET and POST an array of data (INCLUDING FILES!) to a URL by just calling CurlHandler::Get($url, $data) or CurlHandler::Post($url, $data). There's an optional HTTP user authentication option too :)

    /**
     * CURLHandler handles simple HTTP GETs and POSTs via Curl
     *
     * @package Pork
     * @author SchizoDuckie
     * @copyright SchizoDuckie 2008
     * @version 1.0
     * @access public
     */
    class CURLHandler
    {
        /**
         * CURLHandler::Get()
         *
         * Executes a standard GET request via Curl.
         * Static function, so that you can use: CurlHandler::Get('http://www.google.com');
         *
         * @param string $url url to get
         * @return string HTML output
         */
        public static function Get($url)
        {
            return self::doRequest('GET', $url);
        }

        /**
         * CURLHandler::Post()
         *
         * Executes a standard POST request via Curl.
         * Static function, so you can use CurlHandler::Post('http://www.google.com', array('q'=>'StackOverFlow'));
         * If you want to send a File via post (to eg PHP's $_FILES), prefix the value of an item with an @ !
         * (Note: doRequest() encodes the fields with http_build_query(), which sends every value as a plain string.)
         *
         * @param string $url url to post data to
         * @param Array $vars Array with key=>value pairs to post.
         * @param mixed $auth Optional array with 'username' and 'password' keys, or false
         * @return string HTML output
         */
        public static function Post($url, $vars, $auth = false)
        {
            return self::doRequest('POST', $url, $vars, $auth);
        }

        /**
         * CURLHandler::doRequest()
         * This is what actually does the request
         * <pre>
         * - Create Curl handle with curl_init
         * - Set options like CURLOPT_URL, CURLOPT_RETURNTRANSFER and CURLOPT_HEADER
         * - Set eventual optional options (like CURLOPT_POST and CURLOPT_POSTFIELDS)
         * - Call curl_exec on the interface
         * - Close the connection
         * - Return the result or throw an exception.
         * </pre>
         * @param mixed $method Request Method (Get/ Post)
         * @param mixed $url URI to get or post to
         * @param mixed $vars Array of variables (only mandatory in POST requests)
         * @param mixed $auth Optional array with 'username' and 'password' keys, or false
         * @return string HTML output
         */
        public static function doRequest($method, $url, $vars = array(), $auth = false)
        {
            $curlInterface = curl_init();
            curl_setopt_array($curlInterface, array(
                CURLOPT_URL => $url,
                CURLOPT_RETURNTRANSFER => 1,
                CURLOPT_FOLLOWLOCATION => 1,
                CURLOPT_HEADER => 0,
            ));

            if (strtoupper($method) == 'POST') {
                curl_setopt_array($curlInterface, array(
                    CURLOPT_POST => 1,
                    CURLOPT_POSTFIELDS => http_build_query($vars),
                ));
            }

            if ($auth !== false) {
                curl_setopt($curlInterface, CURLOPT_USERPWD, $auth['username'] . ":" . $auth['password']);
            }

            $result = curl_exec($curlInterface);

            // curl_exec() returns false on failure; read the error before closing the handle
            if ($result === false) {
                $error = curl_errno($curlInterface) . " - " . curl_error($curlInterface);
                curl_close($curlInterface);
                throw new Exception('Curl Request Error: ' . $error);
            }

            curl_close($curlInterface);
            return $result;
        }
    }

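I just read the clarification now... In that case you probably want to use one of the automation tools mentioned above. You could also decide to use a client-side Firefox extension like ChickenFoot for more flexibility. I'll leave the example class above here for future searches.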

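If you're using CakePHP in your project, or if you're inclined to extract the relevant library, you can use their curl wrapper, HttpSocket. It has the simple page-fetching syntax you describe, e.g.: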

    # This is the sugar for importing the library within CakePHP
    App::import('Core', 'HttpSocket');
    $HttpSocket = new HttpSocket();

    $result = $HttpSocket->post($login_url, array(
        "username" => "username",
        "password" => "password"
    ));

...though it has no way to parse the response page. For that I would use simplehtmldom: http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/ which describes itself as having a jQuery-like syntax.
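As a rough illustration of the parsing step, here is a minimal sketch that pulls links and forms out of an HTML string. It does not use simplehtmldom; it swaps in PHP's bundled DOM extension (DOMDocument) instead. The function name extractLinksAndForms() and the sample markup are made up for this example, and only input elements are collected (select and textarea fields are ignored).

```php
<?php
// Hypothetical helper (not part of any library mentioned in this thread):
// collects links and forms from an HTML string using PHP's DOM extension.
function extractLinksAndForms(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Every <a> element that carries an href attribute.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }

    // Every <form>, with its named <input> fields (hidden ones included).
    $forms = [];
    foreach ($doc->getElementsByTagName('form') as $form) {
        $fields = [];
        foreach ($form->getElementsByTagName('input') as $input) {
            if ($input->hasAttribute('name')) {
                $fields[$input->getAttribute('name')] = $input->getAttribute('value');
            }
        }
        $forms[] = [
            'action' => $form->getAttribute('action'),
            'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
            'fields' => $fields,
        ];
    }

    return ['links' => $links, 'forms' => $forms];
}

// Sample markup, invented for the example.
$html = '<html><body><a href="/files/a.zip">Download this</a>'
      . '<form action="/login" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="username">'
      . '</form></body></html>';

print_r(extractLinksAndForms($html));
```

The returned field array could be merged with your own values and passed to whichever HTTP client you use for the POST, which is roughly the manual work that WWW::Mechanize's submit_form hides.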

I tend to agree that the bottom line is that PHP doesn't have the awesome scraping/automation libraries that Perl and Ruby have.

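It's 2016 now, and there's Mink. It supports several different engines, ranging from a headless pure-PHP "browser" (without JavaScript), through Selenium (which needs a browser such as Firefox or Chrome), to a headless "browser.js" from NPM, which does support JavaScript.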

If you're on a *nix system, you could use shell_exec() with wget, which has a lot of nice options.
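A minimal sketch of that approach, assuming wget is installed and on the PATH. The helper names buildWgetCommand() and fetchWithWget() are invented for this example; the wget options used are -q (quiet), -O - (write the document to stdout), and --post-data.

```php
<?php
// Hypothetical helper: builds a wget command line that prints the
// response body to stdout. Arguments are escaped for the shell.
function buildWgetCommand(string $url, array $postFields = []): string
{
    $command = 'wget -q -O -';
    if ($postFields !== []) {
        // Sending --post-data makes wget issue a POST instead of a GET.
        $command .= ' --post-data=' . escapeshellarg(http_build_query($postFields));
    }
    return $command . ' ' . escapeshellarg($url);
}

// Hypothetical helper: runs the command and returns the body,
// or null if wget produced no output or could not be run.
function fetchWithWget(string $url, array $postFields = []): ?string
{
    $output = shell_exec(buildWgetCommand($url, $postFields));
    return is_string($output) ? $output : null;
}

// Show the command that would be executed (no network access needed).
echo buildWgetCommand('http://www.somesite.com/', ['username' => 'mungo']), "\n";
```

This only fetches the page; finding links and forms in the result still has to be done separately.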
