Re: Is there a good way to extract the <embed>/<object> content in HTML with/without closing tag?

Chian Hsieh wrote:
Hi,

I want to extract every <embed> and <object> element, whether or not it
has a closing tag.
My current solution uses regular expressions, but there is a case I
cannot handle.

The regexes I use are:

// With closing tag
if (preg_match_all("#(<(object|embed)[^>]+>.*?</\\2>)#is", $str, $matchObjs)) {
  // blahblah

// Without closing tag
} else if (preg_match_all("#(<(?:object|embed)[^>]+>)#",$str,$matchObjs)){
  // blahblah
}

But this fails when $str mixes elements with and without closing tags:

$str = '<div><div><object type="application/x-shockwave-flash"><param
name="zz" value="xx"></object></div><div><embed src="http://sample.com"
/></div>';

In this case it returns only
<object type="application/x-shockwave-flash"><param name="zz"
value="xx"></object>

but I want both results:
<object type="application/x-shockwave-flash"><param name="zz"
value="xx"></object>
<embed src="http://sample.com" />


So, is there a good way to handle this with a single regex?

If you're open to methods other than regex, one way to get pretty good results is to run the document through HTML Tidy, parse it into a DOM, and query it with XPath/XQuery. That basically mimics the way browsers do it (and the way the HTML specs recommend).
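
As a rough illustration of the DOM-plus-XPath idea, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes. It rests on a few assumptions: extractEmbeds() is a made-up helper name, the Tidy step is left out and libxml's own lenient HTML parser is relied on instead (if the tidy extension is installed, passing the input through tidy_repair_string() first would be closer to the suggestion above), and the XPath expression is just one reasonable choice. The serialized output may differ slightly from the input markup, because the parser normalizes it.

```php
<?php
// Sketch only: extractEmbeds() is a hypothetical helper, not part of any library.
function extractEmbeds(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect libxml warnings
    // internally instead of letting them be emitted as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select every <object> and <embed>, skipping those nested inside an
    // <object> that is already returned as a whole.
    $query = '//object[not(ancestor::object)] | //embed[not(ancestor::object)]';

    $results = [];
    foreach ($xpath->query($query) as $node) {
        // Serialize the element (and its children) back to an HTML string.
        $results[] = $doc->saveHTML($node);
    }
    return $results;
}

// Usage with the sample string from the question.
$str = '<div><div><object type="application/x-shockwave-flash"><param '
     . 'name="zz" value="xx"></object></div><div><embed src="http://sample.com" '
     . '/></div>';

foreach (extractEmbeds($str) as $element) {
    echo $element, "\n";
}
```

Because the parser decides where each element ends, the same query covers elements with and without closing tags, so no second code path is needed.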

Best,

Nathan

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

