On 19/02/2011 0:23, Tommy Pham wrote:
@Simon,
Thanks for explaining [^href]. I need to read up more on
greediness. I thought I understood it, but I guess not.
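For reference, here is a minimal sketch of the two points above: greedy versus lazy quantifiers, and what a negated class like [^href] really matches. The sample strings are made up for illustration and are not from the test cases.

```php
<?php
$subject = '<a class=x href=/one>1</a><a href=/two>2</a>';

// Greedy: .* grabs as much as it can, then backs off to the LAST </a>.
preg_match('%<a.*</a>%', $subject, $greedy);

// Lazy: .*? grabs as little as it can, so it stops at the FIRST </a>.
preg_match('%<a.*?</a>%', $subject, $lazy);

// [^href] is a character class: ONE character that is not h, r, e or f.
// It does not mean "anything except the word href", so the run below
// stops at the 'e' in "title", long before "href" shows up.
preg_match('%[^href]+%', 'title=x href=/one', $class);

echo $greedy[0], "\n"; // the whole subject, both links
echo $lazy[0], "\n";   // <a class=x href=/one>1</a>
echo $class[0], "\n";  // titl
```

This is why the posted pattern uses `[^>]*?` and `(.*?)`: the lazy forms keep each match inside one tag and one link.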
@Peter,
I tried your pattern, but it didn't capture all of my new test cases.
It also captures the single/double quotes and the fragments inside
the href. I couldn't figure out how to modify your pattern to exclude
the ', the ", and the URL fragment from the group 1 matches.
Below are the new sample test cases and the new pattern that I got
working. The new pattern fails on only one of the non-compliant cases.
$html =<<<HTML
<a href=/sample/link>content</a>
<a class=link href=/sample/link_extra_attribs title=sample
link>content link_extra_attribs</a>
<a href='/sample/link_single_quote'>content link_single_quote</a>
<a class='link' href='/sample/link_single_quote_pre_attribs'>content
link_single_quote_pre_attribs</a>
<a class='link' href='/sample/link_single_quote_extra_attribs'
title='sample link'>content link_single_quote_extra_attribs</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_frag#fragment'
title='sample link'>content
link_single_quote_extra_attribs_frag#fragment</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment'
title='sample link'>content
link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
<a href="/sample/link_double_quote">content link_double_quote</a>
<a class="link" href="/sample/link_double_quote_pre_attribs">content
link_double_quote_pre_attribs</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_frag#fragment"
title="sample link">content
link_double_quote_extra_attribs_frag#fragment</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_nested_tag"
title="sample link"><img class="image" src="/images/content.jpg"
alt="content" title="content">
link_double_quote_extra_attribs_nested_tag</a>
<a href="#fragment">content fragment</a>
<a class="link" href="#fragment" title="sample link">content fragment</a>
<li class="small tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small tab "><a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small tab "><a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small tab "><a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html"><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;
$pattern = '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';
preg_match_all($pattern, $html, $matches);
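To make the result easier to read, here is a small sketch that runs the same pattern on three shortened links. The `$sample` string is made up for illustration. With the default flags, `$matches[1]` holds the href values and `$matches[2]` the link contents. The unquoted link with extra attributes is presumably the one failing case: without quotes, group 1 runs on until the closing `>`.

```php
<?php
$sample = <<<'HTML'
<a href='/a'>A</a>
<a class=link href=/b title=t>B</a>
<a class="link" href="/c#frag" title="x">C</a>
HTML;

$pattern = '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';
$count = preg_match_all($pattern, $sample, $matches);

// $matches[0] = full matches, $matches[1] = group 1, $matches[2] = group 2
print_r($matches[1]); // '/a', '/b title=t', '/c'
print_r($matches[2]); // 'A', 'B', 'C'
```

Quotes and the #fragment are excluded from group 1, but the unquoted href swallows ` title=t` because a space is allowed inside `[^"'#>]*`.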
Thanks for your time,
Tommy
Hi Tommy,
This is why you shouldn't mix regexes and HTML/XML, especially when you
are not sure that you are working with valid/consistent HTML.
A great/fun answer about this has been posted on StackOverflow at
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
You could easily break any regular-expression solution by adding some
valid comments; see the example here:
http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393
You really should consider using an XML parser instead for this kind of job.
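To illustrate the comment problem, here is a minimal sketch. The markup and the simplified pattern are made up for the example and are not taken from your test cases. The regex reports the commented-out link as if it were real, while the DOM parser only sees the live one.

```php
<?php
$html = '<p><!-- <a href="/old">old</a> --><a href="/new">new</a></p>';

// A regex has no notion of comments, so it finds both anchors.
preg_match_all('%<a\s[^>]*href="([^"]*)"%i', $html, $m);

// The DOM parser turns the comment into a comment node, so only the
// second anchor exists as an element.
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($m[1]);  // '/old', '/new'
print_r($hrefs); // '/new'
```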
Here is a simple sample that handles your example:
<?php
// Clean up the markup first ($html is the heredoc string from your message)
$oTidy = new tidy();
$html = $oTidy->repairString($html, array(
    "clean" => true,
    "drop-proprietary-attributes" => true,
));
unset($oTidy);

$matches = get_links($html);

function get_links($html) {
    // Create a new DOMDocument to hold the page structure
    $xml = new DOMDocument();
    // Load the HTML string into the DOM
    $xml->loadHTML($html);
    // Empty array to hold all links to return
    $links = array();
    // Loop through each <a> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $links[] = array(
            'url'  => $link->getAttribute('href'),
            'text' => $link->nodeValue,
        );
    }
    // Return the links
    return $links;
}
?>
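Since you also wanted the URL fragment left out, here is one possible variation as a rough sketch. The function name `get_links_without_fragment` and the sample markup are hypothetical, and it skips the tidy step so that it runs with only the DOM extension. Quotes are never part of the attribute value the parser returns, so only the fragment needs trimming.

```php
<?php
// Hypothetical helper (not from the code above): collect links with
// DOMDocument and drop the #fragment part of each href.
function get_links_without_fragment($html) {
    $doc = new DOMDocument();
    // Keep libxml from emitting warnings on sloppy markup
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Everything before the first '#'; the whole href if there is none
        $parts = explode('#', $a->getAttribute('href'), 2);
        $links[] = array('url' => $parts[0], 'text' => trim($a->nodeValue));
    }
    return $links;
}

$sample = '<a class="link" href="/x?par=val#frag" title="t">X</a>'
        . "<a href='#top'>Top</a>";
$links = get_links_without_fragment($sample);
print_r($links); // urls: '/x?par=val' and '' (fragment-only link)
```

A fragment-only href such as `#top` ends up as an empty string, which is the same thing group 1 of the regex gives for that case.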
Regards,
Yann
--
PHP General Mailing List (http://www.php.net/)