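DOM parser that allows HTML5-style </ in <script> tag

Update: html5lib (at the bottom of the question) seems to come close; I just need to improve my understanding of how to use it.

I'm trying to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA inside a script tag: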

    <script type="text/x-jquery-tmpl" id="foo">
        <table><tr><td>${name}</td></tr></table>
    </script>

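Most parsers end the script element prematurely, because HTML 4.01 ends script-tag parsing as soon as it finds ETAGO (</) inside a <script> tag. HTML5, however, allows </ before </script>. Every parser I have tried so far has either failed or is so poorly documented that I couldn't figure out whether it works.

My requirements:

A real parser, not regex hacks.
Able to load full pages or HTML fragments.
Able to pull the script contents back out, selecting by the tag's id attribute.

Input: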

 <script id="foo"><td>bar</td></script>

Example of failing output (no closing </td>):

 <script id="foo"><td>bar</script>

Some parsers and their results:

DOMDocument (fails)

Source:

    <?php

    header('Content-type: text/plain');
    $d = new DOMDocument;
    $d->loadHTML('<script id="foo"><td>bar</td></script>');
    echo $d->saveHTML();

Output:

    Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><head><script id="foo"><td>bar</script></head></html>

FluentDOM (fails)

Source:

    <?php

    header('Content-type: text/plain');
    require_once 'FluentDOM/src/FluentDOM.php';
    $html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
    echo FluentDOM($html, 'text/html');

Output:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><head></head><body><script id="foo"><td></script></body></html>

phpQuery (fails)

Source:

    <?php

    header('Content-type: text/plain');
    require_once 'phpQuery.php';

    phpQuery::newDocumentHTML(<<<EOF
    <script type="text/x-jquery-tmpl" id="foo">
    <td>test</td>
    </script>
    EOF
    );

    echo (string) pq('#foo');

Output:

    <script type="text/x-jquery-tmpl" id="foo">
    <td>test
    </script>

html5lib (passes)

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

    <?php

    header('Content-type: text/plain');
    include 'HTML5/Parser.php';
    $html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
    $d = HTML5_Parser::parse($html);
    echo $d->saveHTML();

Output:

 <html><head></head><body><script id="foo"><td></td></script></body></html>

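I had the same problem, and apparently you can hack your way around it by loading the document as XML and saving it as HTML :)

    $d = new DOMDocument;
    $d->loadXML('<script id="foo"><td>bar</td></script>');
    echo $d->saveHTML();

But of course the markup must be error-free (well-formed XML) for loadXML to work.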
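If you go the loadXML route, the template markup ends up as child elements of the script node rather than as text, so reading it back means serializing those children. A minimal sketch of that, assuming well-formed input; the variable names are only illustrative:

```php
<?php
$d = new DOMDocument();
$d->loadXML('<script id="foo"><td>bar</td></script>');

// Select the script element by its id attribute.
$xpath  = new DOMXPath($d);
$script = $xpath->query('//script[@id="foo"]')->item(0);

// Serialize each child node of the script element and join the pieces.
$inner = '';
foreach ($script->childNodes as $child) {
    $inner .= $d->saveXML($child);
}

echo $inner, "\n";
```

Note that saveXML re-serializes the nodes, so the result may not be byte-for-byte identical to the input (empty elements, for instance, may come back in self-closing form).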

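Re: html5lib

You click the Downloads tab and download the PHP version of the parser.

You untar the archive in a local folder

    tar -zxvf html5lib-php-0.1.tar.gz
    x html5lib-php-0.1/
    x html5lib-php-0.1/VERSION
    x html5lib-php-0.1/docs/
    ... etc

You change into the directory and create a file named hello.php

    cd html5lib-php-0.1
    touch hello.php

You put the following PHP code in hello.php

    <?php
    // Assumed location of the parser inside the unpacked archive;
    // adjust the path if your copy is laid out differently.
    require_once 'library/HTML5/Parser.php';

    $html = '<html><head></head><body>
    <script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
    </script>
    </body></html>';
    $dom = HTML5_Parser::parse($html);
    var_dump($dom->saveXml());
    echo "\nDone\n";

You run hello.php from the command line

    php hello.php

The parser builds the document tree and returns a DOMDocument object, which can be manipulated like any other DOMDocument object.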
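Since the result is an ordinary DOMDocument, pulling out the script contents by id can be sketched with DOMXPath. The example below does not call html5lib; it builds a small stand-in document by hand in which the script body is a single text node, which is how an HTML5 parser is expected to treat it. The local-name() test is there on the assumption that the parser might place elements in a namespace, in which case a plain //script query would not match.

```php
<?php
// Stand-in for the DOMDocument returned by HTML5_Parser::parse():
// the script body is one text node, not parsed child elements.
$dom  = new DOMDocument();
$body = $dom->appendChild($dom->createElement('html'))
            ->appendChild($dom->createElement('body'));
$script = $body->appendChild($dom->createElement('script'));
$script->setAttribute('id', 'foo');
$script->appendChild($dom->createTextNode('<td>bar</td>'));

// Select the script by its id attribute. getElementById() is avoided because
// it only works when the id attribute is declared as an ID type.
$xpath = new DOMXPath($dom);
$node  = $xpath->query('//*[local-name()="script" and @id="foo"]')->item(0);

// textContent returns the raw, unescaped text of the script element.
$content = $node === null ? null : $node->textContent;

echo $content, "\n";
```

With a real html5lib document, only the first block would change: $dom would come from HTML5_Parser::parse($html) instead of being built by hand.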

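FluentDOM uses DOMDocument but suppresses notices and warnings while loading. It does not have a parser of its own. You can add your own loader (for example, one that uses html5lib).

I wrapped my jQuery template blocks in comment markers (<!-- ... -->); CDATA blocks failed as well. With the comment markers in place, DOMDocument did not touch the inner HTML.

Then, before I use the jQuery templates, I run a script that strips the comment markers.

    $(function() {
        $('script[type="text/x-jquery-tmpl"]').text(function() {
            // The comment node in this context is actually a text node.
            return $.trim($(this).text()).replace(/^<!--([\s\S]*)-->$/, '$1');
        });
    });

Not ideal, but I'm not sure of a better workaround.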

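I ran into this exact problem.

PHP's DOMDocument parses the HTML inside script tags, which can actually result in a completely different DOM.

Since I didn't want to use any library other than DOMDocument, I wrote a few lines that remove all script content. You then do whatever you need to do with the DOM document and put the script content back afterwards.

Obviously, the script content is not available through your DOM object while you work with it, because the script elements are empty.

With the following lines of PHP you can "fix" this problem. Be warned that a script tag inside a script tag will cause errors.

    $scripts = array();

    // Match every script element non-greedily. A script tag nested inside
    // another script tag will cause problems, and so will several scripts
    // that share an identical opening tag.
    preg_match_all("/((<script.*>)(.*))<\/script>/sU", $html, $scripts);

    // Empty the content of each script
    $html = str_replace($scripts[3], '', $html);

    // Do DOMDocument stuff here

    // Put the script contents back
    $html = str_replace($scripts[2], $scripts[1], $html);

I hope this helps someone :-).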
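A variation on the same idea, offered only as a sketch: rather than matching on the opening tag to restore the content, each script body can be swapped for a unique placeholder and put back with strtr(). This is a different technique from the snippet above, and the function names and placeholder format are hypothetical. It still relies on a regular expression, so a literal </script> inside a script body would still confuse it.

```php
<?php
// Hypothetical helpers; only preg_replace_callback() and strtr() are PHP built-ins.

/**
 * Replace the body of every <script> element with a unique placeholder.
 * Returns the modified HTML; the removed bodies are collected in $stash.
 */
function stashScriptBodies($html, array &$stash)
{
    $stash = array();
    return preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script>)~is',
        function ($m) use (&$stash) {
            // The placeholder looks like a JS comment, so it survives parsing as text.
            $key = '/*__SCRIPT_BODY_' . count($stash) . '__*/';
            $stash[$key] = $m[2];
            return $m[1] . $key . $m[3];
        },
        $html
    );
}

/** Put the stashed bodies back in place of their placeholders. */
function restoreScriptBodies($html, array $stash)
{
    return strtr($html, $stash);
}

$html  = '<div><script id="a"><td>x</td></script><script id="b"><p>y</p></script></div>';
$stash = array();
$safe  = stashScriptBodies($html, $stash);

// ... load $safe into DOMDocument, modify it, and serialize it back to a string ...

$restored = restoreScriptBodies($safe, $stash);
```

Because every placeholder is unique, two scripts with the same opening tag or the same body no longer interfere with each other when the content is restored.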
