Re: Regex pattern for preg_match_all

Tommy Pham <tommyhp2@xxxxxxxxx> · Tue, 22 Feb 2011 15:30:54 -0800

On Tue, Feb 22, 2011 at 7:20 AM, Yann Milin <yann@xxxxxxxxxx> wrote:
> On 19/02/2011 0:23, Tommy Pham wrote:
>>
>> @Simon,
>>
>> Thanks for explaining about the [^href].  I need to read up more about
>> greediness.  I thought I understood it but guess not.
>>
>> @Peter,
>>
>> I tried your pattern but it didn't capture all of my new test cases.
>> Also, it captures the single/double quotes in addition to the
>> fragments inside the href.  I couldn't figure out how to modify your
>> pattern to exclude the ', ", and URL fragment from group 1 matches.
>>
>> Below is the new pattern, along with the new sample test cases, that I
>> got to work.  The new pattern failed on only 1 of the non-compliant cases.
>>
>> $html =<<<HTML
>> <a href=/sample/link>content</a>
>> <a class=link href=/sample/link_extra_attribs title=sample
>> link>content link_extra_attribs</a>
>> <a href='/sample/link_single_quote'>content link_single_quote</a>
>> <a class='link' href='/sample/link_single_quote_pre_attribs'>content
>> link_single_quote_pre_attribs</a>
>> <a class='link' href='/sample/link_single_quote_extra_attribs'
>> title='sample link'>content link_single_quote_extra_attribs</a>
>> <a class='link'
>> href='/sample/link_single_quote_extra_attribs_frag#fragment'
>> title='sample link'>content
>> link_single_quote_extra_attribs_frag#fragment</a>
>> <a class='link'
>> href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment'
>> title='sample link'>content
>> link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
>> <a href="/sample/link_double_quote">content link_double_quote</a>
>> <a class="link" href="/sample/link_double_quote_pre_attribs">content
>> link_double_quote_pre_attribs</a>
>> <a class="link"
>> href="/sample/link_double_quote_extra_attribs_frag#fragment"
>> title="sample link">content
>> link_double_quote_extra_attribs_frag#fragment</a>
>> <a class="link"
>> href="/sample/link_double_quote_extra_attribs_nested_tag"
>> title="sample link"><img class="image" src="/images/content.jpg"
>> alt="content" title="content">
>> link_double_quote_extra_attribs_nested_tag</a>
>> <a href="#fragment">content fragment</a>
>> <a class="link" href="#fragment" title="sample link">content fragment</a>
>> <li class="small tab "><a class="y-mast-link images"
>> href="http://images.search.yahoo.com/images"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Images</span></a></li>
>> <li class="small tab "><a class="y-mast-link video"
>> href="http://video.search.yahoo.com/video"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Video</span></a></li>
>> <li class="small tab "><a class="y-mast-link local"
>> href="http://local.yahoo.com/results"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Local</span></a></li>
>> <li class="small tab "><a class="y-mast-link shopping"
>> href="http://shopping.yahoo.com/search"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
>> <li class="small lasttab more-tab "><a class="y-mast-link more"
>> href="http://tools.search.yahoo.com/about/forsearchers.html"><span
>> class="tab-cover y-mast-bg-hide">More</span><span
>> class="y-fp-pg-controls arrow"></span></a></li>
>> HTML;
>>
>> $pattern =
>> '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';
>>
>> preg_match_all($pattern, $html, $matches);
>>
>> Thanks for your time,
>> Tommy
>
> Hi Tommy,
>
> This is why you shouldn't mix regexes and HTML/XML, especially when you are
> not sure that you are working with valid/consistent html.
> A great/fun answer has been posted on StackOverflow about this at
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
> You could easily break any regular expression solution by adding some valid
> comments; see the example here:
> http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393
>
> You really should consider using an XML parser instead for this kind of job.
>
> Here is a simple sample that matches your example :
>
> <?php
> $oTidy = new tidy();
> $html = $oTidy->repairString($html,array("clean" => true,
> "drop-proprietary-attributes" => true));
> unset($oTidy);
>
> $matches = get_links($html);
>
> function get_links($html) {
>
>     // Create a new DOM Document to hold our webpage structure
>     $xml = new DOMDocument();
>
>     // Load the url's contents into the DOM
>     $xml->loadHTML($html);
>
>     // Empty array to hold all links to return
>     $links = array();
>
>     // Loop through each <a> tag in the dom and add it to the link array
>     foreach($xml->getElementsByTagName('a') as $link) {
>         $links[] = array('url' => $link->getAttribute('href'),
>                          'text' => $link->nodeValue);
>     }
>
>     // Return the links
>     return $links;
> }
> ?>
>
> Regards,
> Yann
>

Hi Yann,

I already have working code based on DOMDocument+XPath.  But I
wanted to filter out the fragments too in one swoop.  Thus,
preg_match_all came to mind.  With DOMDocument, I'd have to add a
check condition.  I've thought about using Tidy to clean the
non-compliant pages prior to extraction, but I haven't tested Tidy's
cleaning process.
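
For reference, here is a minimal sketch of what the DOMDocument+XPath
route with the extra check might look like.  This is not the code I
actually have; the function name get_links_without_fragments() is made
up for illustration, and it assumes that "filtering out fragments"
means skipping links that are only a "#fragment" and cutting the
"#..." part off the rest.

```php
<?php
// Hypothetical helper (not from this thread): collect <a> links,
// skipping fragment-only hrefs and stripping fragments from the rest.
function get_links_without_fragments($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since non-compliant markup would otherwise produce many of them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();

    // Select <a> elements that have an href not starting with "#".
    $query = '//a[@href and not(starts-with(@href, "#"))]';
    foreach ($xpath->query($query) as $a) {
        // Keep only the part of the URL before the first "#".
        $parts = explode('#', $a->getAttribute('href'), 2);
        $links[] = array(
            'url'  => $parts[0],
            'text' => trim($a->nodeValue),
        );
    }

    return $links;
}
```

Calling get_links_without_fragments($html) on the sample above should
return one array per link with 'url' and 'text' keys, similar in shape
to what get_links() returns.  The check ends up being one XPath
predicate plus one explode() call, so it is still a single pass over
the document.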

Thanks,
Tommy

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php