Save and render a webpage with PhantomJS and node.js

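I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be an obvious use case for PhantomJS, but I can't find a decent example; the documentation seems to be all about command-line use.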

Judging from your comments, I'd guess you have two options:

1. Try to find a phantomjs node module – https://github.com/sgentle/phantomjs-node
2. Run phantomjs as a child process inside node – http://nodejs.org/api/child_process.html

Edit:

It seems that running it as a child process is what phantomjs itself suggests as a way of interacting with node; see the FAQ – http://code.google.com/p/phantomjs/wiki/FAQ
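As a rough illustration of the child-process option, the sketch below runs an external program synchronously and returns whatever it printed to stdout. The helper name `captureStdout`, the script name `render.js`, the timeout and the buffer size are all assumptions made for this example; only `execFileSync` itself comes from Node's `child_process` module.

```javascript
var execFileSync = require('child_process').execFileSync;

// Runs `binary` with `args` and returns everything it wrote to stdout.
// Throws if the process exits with a non-zero code or exceeds the timeout.
function captureStdout(binary, args, timeoutMs) {
  return execFileSync(binary, args, {
    encoding: 'utf8',              // return a string instead of a Buffer
    timeout: timeoutMs || 30000,   // assumed default of 30 seconds
    maxBuffer: 16 * 1024 * 1024    // rendered pages can be large
  });
}

// Hypothetical usage: render.js would be a PhantomJS script that prints
// the page HTML to stdout.
// var html = captureStdout('phantomjs', ['render.js', 'http://example.com']);
```

A synchronous call blocks the Node event loop while PhantomJS runs, so this only suits one-off scripts; a server would more likely use the asynchronous `spawn` or `execFile`.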

Edit:

Example Phantomjs script for getting the page's HTML markup:

    var page = require('webpage').create();

    page.open('http://www.google.com', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // Runs inside the page context and returns the markup of the <html> element.
            var p = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            console.log(p);
        }
        phantom.exit();
    });
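The script above reads the DOM as soon as the load callback fires. If the page's own JavaScript keeps modifying the DOM after that point, one possible approach is to poll for a condition before grabbing the HTML. The helper below is only a sketch: `waitFor`, its parameters, and the `result` element id in the usage comment are hypothetical names, not part of the PhantomJS API.

```javascript
// Polls `condition` every `intervalMs` until it returns a truthy value, then
// calls `onReady`. Calls `onTimeout` if `timeoutMs` elapses first.
// All names here are hypothetical; nothing in this helper is PhantomJS API.
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (condition()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Hypothetical use inside the page.open callback of a PhantomJS script,
// assuming the page adds an element with id "result" when it is done:
//
//   waitFor(function () {
//     return page.evaluate(function () {
//       return document.getElementById('result') !== null;
//     });
//   }, function () {
//     console.log(page.content);
//     phantom.exit();
//   }, function () {
//     console.log('Timed out waiting for the page');
//     phantom.exit(1);
//   }, 5000, 100);
```

The helper is written in ES5 style so that it could run inside PhantomJS as well as in Node. Which condition to poll for depends entirely on the page being scraped.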

With v2 of phantomjs-node it's pretty easy to print the HTML after it has been processed.

    var phantom = require('phantom');

    phantom.create().then(function (ph) {
        ph.createPage().then(function (page) {
            page.open('https://stackoverflow.com/').then(function (status) {
                console.log(status);
                page.property('content').then(function (content) {
                    console.log(content);
                    page.close();
                    ph.exit();
                });
            });
        });
    });

This prints the markup as the browser would render it, that is, after the page's scripts have run.

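I've used two different approaches in the past, including the page.evaluate() method for querying the DOM that Declan mentioned. The other way I've passed information out of the web page is to write it to console.log() from there and, in the phantomjs script, use: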

    page.onConsoleMessage = function (msg, line, source) {
        console.log('console [' + source + ':' + line + ']> ' + msg);
    };

I might also capture the msg variable in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then, in the Nodejs script, you would have to scan the output of the Phantomjs script:

    var spawn = require('child_process').spawn;

    var yourfunc = function (/* ...params... */) {
        var phantom = spawn('phantomjs', [/* ...args */]);
        phantom.stdout.setEncoding('utf8');
        phantom.stdout.on('data', function (data) {
            // parse or echo data
            var str_phantom_output = data.toString();
            // This handler gets triggered one or more times, so you'll need to
            // add code to parse for whatever info you're expecting from the browser
        });
        phantom.stderr.on('data', function (data) {
            // do something with error data
        });
        phantom.on('exit', function (code) {
            if (code !== 0) {
                // console.log('phantomjs exited with code ' + code);
            } else {
                // clean exit: do something else such as a passed-in callback
            }
        });
    };
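Because the stdout handler can fire several times, a payload may arrive split across chunks. One possible way to "search for encapsulated data" is to have the page-side code wrap its payload in marker strings and to buffer the output until both markers have been seen. The markers `<<<DATA` and `DATA>>>` and the `createCollector` helper below are assumptions made for this sketch.

```javascript
// Hypothetical markers that the PhantomJS script would print around its payload.
var START = '<<<DATA';
var END = 'DATA>>>';

// Returns an object that buffers stdout chunks and extracts the text
// between START and END once both have arrived.
function createCollector() {
  var buffer = '';
  return {
    // Feed one chunk; returns the payload string, or null if incomplete.
    push: function (chunk) {
      buffer += chunk;
      var from = buffer.indexOf(START);
      if (from === -1) { return null; }
      var to = buffer.indexOf(END, from + START.length);
      if (to === -1) { return null; }
      return buffer.slice(from + START.length, to);
    }
  };
}

// Simulated chunks, split mid-marker the way a stream might deliver them.
var collector = createCollector();
var first = collector.push('console> <<<DA');
var second = collector.push('TA{"title":"Example"}DATA');
var third = collector.push('>>>\n');
```

In the `data` handler you would call `collector.push(data)` and pass the result on to a callback once it is no longer `null`. This sketch only handles a single payload per run.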

Hope that helps.

Why not just use this?

    var page = require('webpage').create();

    page.open("http://example.com", function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            console.log('Success in fetching the page');
            console.log(page.content);
        }
        phantom.exit();
    });

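In case anyone stumbles across this question: a colleague of mine has developed a project on GitHub that aims to help you do exactly this: https://github.com/vmeurisse/phantomCrawl

It's still a bit young and certainly lacks some documentation, but the example provided should help with basic crawling.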

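Here's an old version of something I use, running node, express and phantomjs, which saves the page as a .png. You could tweak it fairly quickly to get the HTML instead.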

https://github.com/wehrhaus/sitescrape.git
