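How to extract img src, title and alt from HTML using PHP?

I would like to create a page that lists every image on my website together with its title and alt text.

I have already written a small program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from HTML like this: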

    <img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

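I guess this should be done with a regular expression, but since the order of the attributes can vary and I need all of them, I don't know how to parse this elegantly (I could do it the hard way, character by character, but that is painful).

EDIT: now that I know better

Using regular expressions to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. It is better to use an HTML parser.

Solution with regular expressions

In that case, it is better to split the process into two parts:

get all the img tags
extract their metadata

I will assume your document is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code: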

    /* preg_match_all match the regexp in all the $html string and output
       everything as an array in $result. "i" option is used to make it
       case insensitive */
    preg_match_all('/<img[^>]+>/i', $html, $result);

    print_r($result);

    Array
    (
        [0] => Array
            (
                [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
                [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
                [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
                [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
                [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
                [...]
            )
    )

Then we get all the img tag attributes with a loop:

    $img = array();
    // $result[0] holds the full matches ($result itself is an array of arrays)
    foreach ($result[0] as $img_tag) {
        preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
    }

    print_r($img);

    Array
    (
        [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/Content/Img/stackoverflow-logo-250.png"
                        [1] => alt="logo link to homepage"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
                [2] => Array
                    (
                        [0] => "/Content/Img/stackoverflow-logo-250.png"
                        [1] => "logo link to homepage"
                    )
            )
        [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-up.png"
                        [1] => alt="vote up"
                        [2] => title="This was helpful (click again to undo)"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-up.png"
                        [1] => "vote up"
                        [2] => "This was helpful (click again to undo)"
                    )
            )
        [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-down.png"
                        [1] => alt="vote down"
                        [2] => title="This was not helpful (click again to undo)"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-down.png"
                        [1] => "vote down"
                        [2] => "This was not helpful (click again to undo)"
                    )
            )
        [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => alt="gravatar image"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
                [2] => Array
                    (
                        [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => "gravatar image"
                    )
            )
        [..]
    )

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can put together your own by using ob_start and loading/saving from a text file.
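The ob_start plus text file idea might look roughly like the sketch below. It is only an illustration under assumptions: the helper name cached_render, its parameters and the choice of a lifetime in seconds are made up here and are not part of the original answer.

```php
<?php
/**
 * Return the output of $render, reusing a copy cached on disk while it is
 * still fresh. Hypothetical helper; name and parameters are illustrative.
 */
function cached_render(string $cacheFile, int $maxAge, callable $render): string
{
    // Serve the cached copy if it exists and is younger than $maxAge seconds.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    // Otherwise capture everything the callback echoes...
    ob_start();
    $render();
    $output = ob_get_clean();

    // ...and store it in a plain text file for the next request.
    file_put_contents($cacheFile, $output, LOCK_EX);

    return $output;
}
```

A page could then wrap its regex work in a closure and call something like echo cached_render('/tmp/image-list.cache', 3600, $renderImageList); where the path, the one-hour lifetime and $renderImageList are again placeholders.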

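How does this stuff work?

First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.

The regexps:

    <img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains characters other than ">", and ends with a ">".

    (alt|title|src)=("[^"]*")

We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of characters that are not '"', and a closing '"'. The parentheses () isolate the sub-strings between them.

Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It is true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " with '.

If you mix both, first you should slap yourself :-), then try using ("|') in place of ", and [^ø] in place of [^"].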
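One way to accept both quote styles in the attribute pattern is a backreference, so that a value must be closed by the same quote that opened it. The sketch below is a swapped-in variation on the pattern above, not the answer's own code; the function name img_attributes is made up for illustration.

```php
<?php
/**
 * Extract alt, title and src from one <img> tag, accepting either quote style.
 * Hypothetical helper for illustration.
 */
function img_attributes(string $imgTag): array
{
    // (["\']) captures the opening quote; \2 requires the same quote to close.
    $pattern = '/\b(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';
    preg_match_all($pattern, $imgTag, $matches, PREG_SET_ORDER);

    $attributes = [];
    foreach ($matches as $match) {
        // $match[1] is the attribute name, $match[3] the unquoted value.
        $attributes[strtolower($match[1])] = $match[3];
    }
    return $attributes;
}
```

Because the closing quote has to equal the opening one, a value such as alt="it's fine" keeps its apostrophe. Unquoted values are still not handled.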

    $url = "http://example.com";
    $html = file_get_contents($url);

    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $tags = $doc->getElementsByTagName('img');

    foreach ($tags as $tag) {
        echo $tag->getAttribute('src');
    }

Just to give a small example of using PHP's XML functionality for the task:

    $doc = new DOMDocument();
    $doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
    $xml = simplexml_import_dom($doc); // just to make xpath more simple
    $images = $xml->xpath('//img');
    foreach ($images as $img) {
        echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
    }

I did use the DOMDocument::loadHTML() method, because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and the xpath results simpler.
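For comparison, the same lookup can be done without the SimpleXML conversion by querying the DOMDocument through DOMXPath. This is a minimal sketch; the function name list_images and the returned array shape are assumptions made for the example.

```php
<?php
/**
 * List src, title and alt of every <img> using DOMDocument and DOMXPath only.
 * Hypothetical helper; the function name is illustrative.
 */
function list_images(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $images = [];
    foreach ($xpath->query('//img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $images[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $images;
}
```

Using libxml_use_internal_errors() here is a design choice: it keeps malformed-markup warnings out of the output without the @ operator.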

Use xpath.

For PHP you can use simplexml or domxml.

Also see this question.

If it is XHTML, as your example is, you need only SimpleXML.

    <?php
    $input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
    $sx = simplexml_load_string($input);
    var_dump($sx);
    ?>

Output:

    object(SimpleXMLElement)#1 (1) {
      ["@attributes"]=>
      array(3) {
        ["src"]=>
        string(22) "/image/fluffybunny.jpg"
        ["title"]=>
        string(16) "Harvey the bunny"
        ["alt"]=>
        string(26) "a cute little fluffy bunny"
      }
    }

In the regular-expression solution, the loop has to iterate over $result[0]:

    foreach ($result[0] as $img_tag)

because preg_match_all returns an array of arrays.

RE the solution:

    $url = "http://example.com";
    $html = file_get_contents($url);

    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $tags = $doc->getElementsByTagName('img');

    foreach ($tags as $tag) {
        echo $tag->getAttribute('src');
    }

How do you get the tags and attributes from multiple files/URLs?

Doing this did not work for me:

    foreach (glob("path/to/files/*.html") as $html) {
        $doc = new DOMDocument();
        $doc->loadHTML($html);

        $tags = $doc->getElementsByTagName('img');

        foreach ($tags as $tag) {
            echo $tag->getAttribute('src');
        }
    }
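A likely reason the loop above fails is that glob() returns file paths, while loadHTML() expects the markup itself, so each path string gets parsed as if it were HTML. A minimal sketch of a fix, assuming local files, is to let DOMDocument::loadHTMLFile() read each file; the function name image_sources_in_files is invented for this example.

```php
<?php
/**
 * Collect the src of every <img> in all files matching a glob pattern.
 * Hypothetical helper; the pattern passed in is up to the caller.
 */
function image_sources_in_files(string $pattern): array
{
    $sources = [];
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob($pattern) ?: [] as $file) {
        $doc = new DOMDocument();
        // loadHTMLFile() reads the file itself; loadHTML() would treat the
        // path string as if it were the markup.
        @$doc->loadHTMLFile($file);
        foreach ($doc->getElementsByTagName('img') as $tag) {
            $sources[$file][] = $tag->getAttribute('src');
        }
    }
    return $sources;
}
```

For URLs, fetching each page with file_get_contents() and passing the result to loadHTML(), as in the quoted solution, should fit into the same kind of loop.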

You can use simplehtmldom. It supports most jQuery selectors. An example is given below:

    // Create DOM from URL or file
    $html = file_get_html('http://www.google.com/');

    // Find all images
    foreach ($html->find('img') as $element)
        echo $element->src . '<br>';

    // Find all links
    foreach ($html->find('a') as $element)
        echo $element->href . '<br>';

Here is a PHP function I cobbled together from all of the information above for a similar purpose, namely adjusting the width and height attributes of image tags on the fly. It is a bit clunky, perhaps, but it seems to work dependably:

    function ReSizeImagesInHTML($HTMLContent, $MaximumWidth, $MaximumHeight) {

        // find image tags
        preg_match_all('/<img[^>]+>/i', $HTMLContent, $rawimagearray, PREG_SET_ORDER);

        // put image tags in a simpler array
        $imagearray = array();
        for ($i = 0; $i < count($rawimagearray); $i++) {
            array_push($imagearray, $rawimagearray[$i][0]);
        }

        // put image attributes in another array
        // (the indexes used below assume each tag lists src, width, height in that order)
        $imageinfo = array();
        foreach ($imagearray as $img_tag) {
            preg_match_all('/(src|width|height)=("[^"]*")/i', $img_tag, $imageinfo[$img_tag]);
        }

        // combine everything into one array
        $AllImageInfo = array();
        foreach ($imagearray as $img_tag) {
            $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
            $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
            $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);

            $NewWidth = $OrignialWidth;
            $NewHeight = $OrignialHeight;
            $AdjustDimensions = "F";

            if ($OrignialWidth > $MaximumWidth) {
                // too wide: cut the excess over the maximum width, scale the height to match
                $diff = $OrignialWidth - $MaximumWidth;
                $percnt_reduced = (($diff / $OrignialWidth) * 100);
                $NewHeight = floor($OrignialHeight - (($percnt_reduced * $OrignialHeight) / 100));
                $NewWidth = floor($OrignialWidth - $diff);
                $AdjustDimensions = "T";
            }

            if ($OrignialHeight > $MaximumHeight) {
                // too tall: cut the excess over the maximum height, scale the width to match
                $diff = $OrignialHeight - $MaximumHeight;
                $percnt_reduced = (($diff / $OrignialHeight) * 100);
                $NewWidth = floor($OrignialWidth - (($percnt_reduced * $OrignialWidth) / 100));
                $NewHeight = floor($OrignialHeight - $diff);
                $AdjustDimensions = "T";
            }

            $thisImageInfo = array(
                'OriginalImageTag' => $img_tag,
                'ImageSource' => $ImageSource,
                'OrignialWidth' => $OrignialWidth,
                'OrignialHeight' => $OrignialHeight,
                'NewWidth' => $NewWidth,
                'NewHeight' => $NewHeight,
                'AdjustDimensions' => $AdjustDimensions
            );
            array_push($AllImageInfo, $thisImageInfo);
        }

        // build array of before and after tags
        $ImageBeforeAndAfter = array();
        for ($i = 0; $i < count($AllImageInfo); $i++) {
            if ($AllImageInfo[$i]['AdjustDimensions'] == "T") {
                $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
                $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);

                $thisImageBeforeAndAfter = array(
                    'OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'],
                    'NewImageTag' => $NewImageTag
                );
                array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
            }
        }

        // execute search and replace
        for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
            $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'], $ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
        }

        return $HTMLContent;
    }

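Here's THE solution, in PHP:

Just download QueryPath, and then do as follows:

    $doc = qp($myHtmlDoc);
    foreach ($doc->xpath('//img') as $img) {
        $src = $img->attr('src');
        $title = $img->attr('title');
        $alt = $img->attr('alt');
    }

That's it, you're done!

If the HTML is guaranteed to be XHTML, you could also try SimpleXML; it will parse the markup for you, and you will be able to access attributes by their name. (There are DOM libraries as well, if it is just HTML and you can't depend on XML syntax.)

You could write a regexp to get all img tags (<img[^>]*>), and then use a simple explode: $res = explode("\"", $tags). The output will be something like:

    $res[0] = "<img src=";
    $res[1] = "/image/fluffybunny.jpg";
    $res[2] = "title=";
    $res[3] = "Harvey the bunny";
    $res[4] = "alt=";
    $res[5] = "a cute little fluffy bunny";
    $res[6] = "/>";

If you delete the <img tag before the explode, you will get an array in the form of

    property= value

so the order of the attributes is irrelevant; you only use the ones you want.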
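The explode idea could be turned into a name => value array roughly as follows. This is a sketch under stated assumptions: every attribute value is double-quoted and contains no double quote itself, and the function name img_attributes_by_explode is invented here.

```php
<?php
/**
 * Turn one double-quoted <img> tag into a name => value array via explode().
 * Hypothetical helper; assumes every value is double-quoted and contains no
 * double quote itself.
 */
function img_attributes_by_explode(string $tag): array
{
    // Drop the leading "<img" so the first piece is an attribute name.
    $tag = preg_replace('/^\s*<img\s*/i', '', $tag);

    $parts = explode('"', $tag);
    $attributes = [];
    // Pieces alternate: "name=" at even indexes, the value at odd indexes.
    for ($i = 0; $i + 1 < count($parts); $i += 2) {
        $name = strtolower(trim($parts[$i], " \t\r\n="));
        if ($name !== '') {
            $attributes[$name] = $parts[$i + 1];
        }
    }
    return $attributes;
}
```

Unquoted attributes such as height=32 would end up glued onto the next attribute name, so this only suits markup you control.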

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from WordPress, and I was trying to get the src attribute so I could run it through timthumb.

    // get the featured image
    $image = get_the_post_thumbnail($photos[$i]->ID);

    // get the src for that image
    $pattern = '/src="([^"]*)"/';
    preg_match($pattern, $image, $matches);
    $src = $matches[1];
    unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex skills are not good enough to grab all three (alt / title / src) in one pass.
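Grabbing all three in one pass is possible with preg_match_all and an alternation on the attribute name. The following is a minimal sketch, assuming double-quoted values; the sample tag stands in for whatever get_the_post_thumbnail() returns.

```php
<?php
// A single <img> tag (sample value used for illustration only).
$image = '<img title="Harvey the bunny" src="/image/fluffybunny.jpg" alt="a cute little fluffy bunny" />';

// One pass: group 1 collects the attribute names, group 2 the matching values.
preg_match_all('/\b(src|title|alt)="([^"]*)"/i', $image, $matches);

// Pair each name with its value, whatever order they appeared in.
$attributes = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

// Attributes missing from the tag simply come back as null.
$src   = $attributes['src']   ?? null;
$title = $attributes['title'] ?? null;
$alt   = $attributes['alt']   ?? null;
```

Since the names are captured alongside the values, the order of the attributes in the tag does not matter.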

The code below worked for me in WordPress...

It extracts all image sources from the code.

    // $xml and $title are expected to be set by the surrounding code
    $search = "any html code with image tags";

    preg_match_all('/src="([^"]*)"/', $search, $matches);

    // $matches[1] holds the captured src value of every match
    foreach ($matches[1] as $src) {
        $image = parse_url($src, PHP_URL_PATH);
        $xml .= " <image:image>\n";
        $xml .= " <image:loc>" . home_url() . $image . "</image:loc>\n";
        $xml .= " <image:caption>" . htmlentities($title) . "</image:caption>\n";
        $xml .= " <image:license>" . home_url() . "</image:license>\n";
        $xml .= " </image:image>\n";
    }

Cheers!

    $content = "<img src='http://google.com/2af5e6ae749d523216f296193ab0b146.jpg' width='40' height='40'>";
    $image = preg_match_all('~<img rel="imgbot" remote="(.*?)" width="(.*?)" height="(.*?)" linktext="(.*?)" linkhref="(.*?)" src="(.*?)" />~is', $content, $matches);

If you want to use regEx, why not something as simple as this:

    preg_match_all('% (.*)=\"(.*)\"%Uis', $code, $matches, PREG_SET_ORDER);

This will return something like:

    array(2) {
      [0]=>
      array(3) {
        [0]=>
        string(10) " src="abc""
        [1]=>
        string(3) "src"
        [2]=>
        string(3) "abc"
      }
      [1]=>
      array(3) {
        [0]=>
        string(10) " bla="123""
        [1]=>
        string(3) "bla"
        [2]=>
        string(3) "123"
      }
    }

Here is my solution for retrieving only the images from the content of any WordPress post or from any HTML content.

    $content = get_the_content();
    $count = substr_count($content, '<img');
    $start = 0;

    for ($i = 0; $i < $count; $i++) {
        if ($i == 0) {
            $imgBeg = strpos($content, '<img', $start);
            $post = substr($content, $imgBeg);
        } else {
            $imgBeg = strpos($post, '<img', $start);
            $post = substr($post, $imgBeg - 2);
        }
        $imgEnd = strpos($post, '>');
        $postOutput = substr($post, 0, $imgEnd + 1);
        $postOutput = preg_replace('/width="([0-9]*)" height="([0-9]*)"/', '', $postOutput);
        $image[$i] = $postOutput;
        $start = $imgEnd + 1;
    }

    print_r($image);

  “] +>] +> /）？>”

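This will extract the anchor tag nested with the image tag.

For one element, you can use this minified solution using DOMDocument. It handles both ' and " quotes and also validates the HTML. Best practice is to use existing libraries rather than your own solution built on regular expressions.

    $html = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';
    $attribute = 'src';

    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $attributeValue = @$doc->documentElement->firstChild->firstChild->attributes->getNamedItem($attribute)->value;

    echo $attributeValue;

How about using a regular expression to find the img tags (something like "<img[^>]*>"), and then, for each img tag, using another regular expression to find each attribute?

Maybe something like " ([a-zA-Z]+)=\"([^"]*)\"" to find the attributes, though you might want to allow for the quotes not being there if you are dealing with tag soup... If you went with that, you could get the parameter name and value from the groups within each match.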
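Allowing for missing quotes could be sketched as an alternation between double-quoted, single-quoted and bare values. The pattern below is one possible variation and not the answer's own; the function name soup_attributes is hypothetical.

```php
<?php
/**
 * Read attributes from one tag, allowing double-quoted, single-quoted or
 * unquoted values. Hypothetical helper for illustration.
 */
function soup_attributes(string $tag): array
{
    $pattern = '/\s([a-zA-Z][a-zA-Z0-9-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';
    preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);

    $attributes = [];
    foreach ($matches as $m) {
        // Only one of groups 2-4 takes part in a match; the others are empty
        // strings or missing, so concatenating them yields the value.
        $attributes[strtolower($m[1])] = ($m[2] ?? '') . ($m[3] ?? '') . ($m[4] ?? '');
    }
    return $attributes;
}
```

Each match yields the attribute name in group 1 and the value in whichever of the remaining groups matched. A bare value directly followed by "/>" would keep the slash, which is one reason a real HTML parser is the safer choice for tag soup.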

Maybe this will give you the right answers:

    <img.*?(?:(?:\s+(src)="([^"]+)")|(?:\s+(alt)="([^"]+)")|(?:\s+(title)="([^"]+)")|(?:\s+[^\s]+))+.*/>
