Re: Regex pattern for preg_match_all

Yann Milin <yann@xxxxxxxxxx> · Tue, 22 Feb 2011 16:20:23 +0100

On 19/02/2011 0:23, Tommy Pham wrote:
@Simon,

Thanks for explaining about the [^href].  I need to read up more about
greediness.  I thought I understood it but guess not.

@Peter,

I tried your pattern but it didn't capture all of my new test cases.
Also, it captures the single/double quotes in addition to the
fragments inside the href.  I couldn't figure out how to modify your
pattern to exclude the ', ", and URL fragment from group 1 matches.

Below is the new pattern, along with the new sample test cases I got it
to work on.  The new pattern fails only one of the non-compliant cases.

$html =<<<HTML
<a href=/sample/link>content</a>
<a class=link href=/sample/link_extra_attribs title=sample
link>content link_extra_attribs</a>
<a href='/sample/link_single_quote'>content link_single_quote</a>
<a class='link' href='/sample/link_single_quote_pre_attribs'>content
link_single_quote_pre_attribs</a>
<a class='link' href='/sample/link_single_quote_extra_attribs'
title='sample link'>content link_single_quote_extra_attribs</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_frag#fragment'
title='sample link'>content
link_single_quote_extra_attribs_frag#fragment</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment'
title='sample link'>content
link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
<a href="/sample/link_double_quote">content link_double_quote</a>
<a class="link" href="/sample/link_double_quote_pre_attribs">content
link_double_quote_pre_attribs</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_frag#fragment"
title="sample link">content
link_double_quote_extra_attribs_frag#fragment</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_nested_tag"
title="sample link"><img class="image" src="/images/content.jpg"
alt="content" title="content">
link_double_quote_extra_attribs_nested_tag</a>
<a href="#fragment">content fragment</a>
<a class="link" href="#fragment" title="sample link">content fragment</a>
<li class="small  tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small  tab "><a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small  tab "><a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small  tab "><a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html"><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;

$pattern = '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';

preg_match_all($pattern, $html, $matches);

Thanks for your time,
Tommy

Hi Tommy,

This is why you shouldn't mix regexes and HTML/XML, especially when you
are not sure that you are working with valid, consistent HTML.
A great (and fun) answer about this has been posted on StackOverflow:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

You could easily break any regular-expression solution by adding some
valid comments; see the example here:
http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393

You really should consider using an XML parser for this kind of job instead.
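
To make the point about comments concrete, here is a minimal sketch. The two-line $sample string is a made-up input (it is not one of your test cases), and the variable names are only for illustration. It runs your pattern and a DOM parser over the same markup, where one of the two links is commented out:

```php
<?php
// Hypothetical input: one real link and one link inside an HTML comment.
$sample = <<<HTML
<p><a href="/real/link">real</a></p>
<!-- <a href="/commented/out">not a link</a> -->
HTML;

// The pattern from the quoted message, unchanged.
$pattern = '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';
preg_match_all($pattern, $sample, $regexMatches);

// The DOM parser sees the comment as a comment node, not as an <a> element.
$dom = new DOMDocument();
$dom->loadHTML($sample);
$domLinks = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $domLinks[] = $a->getAttribute('href');
}

print_r($regexMatches[1]); // hrefs found by the regex
print_r($domLinks);        // hrefs found by the DOM parser
```

The regex reports both /real/link and /commented/out, because it has no notion of comments, while the DOM parser reports only /real/link.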

Here is a simple sample that handles your example:

<?php
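// $html is the heredoc string from your message.
// Let tidy repair the non-compliant markup before parsing it.
$oTidy = new tidy();
$html = $oTidy->repairString($html, array(
    "clean" => true,
    "drop-proprietary-attributes" => true,
));
unset($oTidy);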

$matches = get_links($html);

function get_links($html) {

    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the HTML string into the DOM
    $xml->loadHTML($html);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the DOM and add it to the links array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $links[] = array(
            'url'  => $link->getAttribute('href'),
            'text' => $link->nodeValue,
        );
    }
    }

    //Return the links
    return $links;
}
?>
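
Your pattern also kept the #fragment out of group 1, whereas getAttribute('href') returns the attribute value unchanged. If you still want that, a small post-processing step should do. This is only a sketch: strip_fragment is a hypothetical helper name I made up, and the sample hrefs are copied from your test cases.

```php
<?php
// Hypothetical helper: return the URL up to, but not including, the first '#'.
function strip_fragment($url) {
    $parts = explode('#', $url, 2);
    return $parts[0];
}

// Sample hrefs taken from the test cases in the quoted message.
$hrefs = array(
    '/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment',
    '/sample/link_double_quote',
    '#fragment',
);

foreach ($hrefs as $href) {
    var_dump(strip_fragment($href));
}
```

A fragment-only href such as "#fragment" becomes an empty string, which is also what group 1 of your pattern captures for it.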

Regards,
Yann

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php