On Tue, Feb 22, 2011 at 7:20 AM, Yann Milin <yann@xxxxxxxxxx> wrote:
> On 19/02/2011 0:23, Tommy Pham wrote:
>>
>> @Simon,
>>
>> Thanks for explaining about the [^href]. I need to read up more about
>> greediness. I thought I understood it but guess not.
>>
>> @Peter,
>>
>> I tried your pattern but it didn't capture all of my new test cases.
>> Also, it captures the single/double quotes in addition to the
>> fragments inside the href. I couldn't figure out how to modify your
>> pattern to exclude the ', ", and URL fragment from group 1 matches.
>>
>> Below is the new pattern with the new sample test cases that I got
>> to work. The new pattern failed only 1 of the non-compliant cases.
>>
>> $html = <<<HTML
>> <a href=/sample/link>content</a>
>> <a class=link href=/sample/link_extra_attribs title=sample
>> link>content link_extra_attribs</a>
>> <a href='/sample/link_single_quote'>content link_single_quote</a>
>> <a class='link' href='/sample/link_single_quote_pre_attribs'>content
>> link_single_quote_pre_attribs</a>
>> <a class='link' href='/sample/link_single_quote_extra_attribs'
>> title='sample link'>content link_single_quote_extra_attribs</a>
>> <a class='link'
>> href='/sample/link_single_quote_extra_attribs_frag#fragment'
>> title='sample link'>content
>> link_single_quote_extra_attribs_frag#fragment</a>
>> <a class='link'
>> href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment'
>> title='sample link'>content
>> link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
>> <a href="/sample/link_double_quote">content link_double_quote</a>
>> <a class="link" href="/sample/link_double_quote_pre_attribs">content
>> link_double_quote_pre_attribs</a>
>> <a class="link"
>> href="/sample/link_double_quote_extra_attribs_frag#fragment"
>> title="sample link">content
>> link_double_quote_extra_attribs_frag#fragment</a>
>> <a class="link"
>> href="/sample/link_double_quote_extra_attribs_nested_tag"
>> title="sample link"><img class="image"
>> src="/images/content.jpg"
>> alt="content" title="content">
>> link_double_quote_extra_attribs_nested_tag</a>
>> <a href="#fragment">content fragment</a>
>> <a class="link" href="#fragment" title="sample link">content fragment</a>
>> <li class="small tab "><a class="y-mast-link images"
>> href="http://images.search.yahoo.com/images"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Images</span></a></li>
>> <li class="small tab "><a class="y-mast-link video"
>> href="http://video.search.yahoo.com/video"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Video</span></a></li>
>> <li class="small tab "><a class="y-mast-link local"
>> href="http://local.yahoo.com/results"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Local</span></a></li>
>> <li class="small tab "><a class="y-mast-link shopping"
>> href="http://shopping.yahoo.com/search"
>> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
>> style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
>> <li class="small lasttab more-tab "><a class="y-mast-link more"
>> href="http://tools.search.yahoo.com/about/forsearchers.html"><span
>> class="tab-cover y-mast-bg-hide">More</span><span
>> class="y-fp-pg-controls arrow"></span></a></li>
>> HTML;
>>
>> $pattern =
>> '%<a[\s]+[^>]*?href\s*=\s*["\']?([^"\'#>]*)["\']?\s?[^>]*>(.*?)</a>%ims';
>>
>> preg_match_all($pattern, $html, $matches);
>>
>> Thanks for your time,
>> Tommy
>
> Hi Tommy,
>
> This is why you shouldn't mix regexes and HTML/XML, especially when you are
> not sure that you are working with valid/consistent HTML.
> A great/fun answer has been posted on StackOverflow about this at
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
> You could easily break any regular-expression solution by adding some valid
> comments; see the example here:
> http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393
>
> You really should consider using an XML parser instead for this kind of job.
>
> Here is a simple sample that matches your example:
>
> <?php
> // Clean up the (possibly non-compliant) markup with Tidy first
> $oTidy = new tidy();
> $html = $oTidy->repairString($html, array("clean" => true,
>     "drop-proprietary-attributes" => true));
> unset($oTidy);
>
> $matches = get_links($html);
>
> function get_links($html) {
>
>     // Create a new DOM Document to hold our webpage structure
>     $xml = new DOMDocument();
>
>     // Load the HTML string into the DOM
>     $xml->loadHTML($html);
>
>     // Empty array to hold all links to return
>     $links = array();
>
>     // Loop through each <a> tag in the DOM and add it to the link array
>     foreach ($xml->getElementsByTagName('a') as $link) {
>         $links[] = array(
>             'url'  => $link->getAttribute('href'),
>             'text' => $link->nodeValue,
>         );
>     }
>
>     // Return the links
>     return $links;
> }
> ?>
>
> Regards,
> Yann
>

Hi Yann,

I already have working code based on DOMDocument+XPath. But I wanted to
filter out the fragments too in one swoop, which is why preg_match_all
came to mind. With DOMDocument, I'd have to add a check condition. I've
thought about using Tidy to clean up the non-compliant pages prior to
extraction, but I haven't tested Tidy's cleaning process yet.

Thanks,
Tommy

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
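
For reference, the "check condition" Tommy mentions for the DOMDocument+XPath route could look roughly like the sketch below. This is not code from the thread: the function name get_links_without_fragments is hypothetical, and it assumes the goal is to drop the "#fragment" part of each href and to skip links that consist of a fragment only. The three sample anchors are taken from the test cases quoted above.

```php
<?php
// Hypothetical helper (not from the thread): collect links with
// DOMDocument + DOMXPath, strip the "#fragment" part of each href,
// and skip links whose href is nothing but a fragment.
function get_links_without_fragments($html)
{
    $doc = new DOMDocument();

    // Non-compliant HTML triggers libxml warnings; collect them quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();

    // Select only <a> elements that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        // Keep everything before the first '#'.
        $parts = explode('#', trim($a->getAttribute('href')), 2);
        $url = $parts[0];

        // The extra check condition: a fragment-only href leaves nothing.
        if ($url === '') {
            continue;
        }

        $links[] = array('url' => $url, 'text' => trim($a->nodeValue));
    }

    return $links;
}

$html = <<<HTML
<a href=/sample/link>content</a>
<a class='link' href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment' title='sample link'>content</a>
<a href="#fragment">content fragment</a>
HTML;

print_r(get_links_without_fragments($html));
```

Splitting on the first '#' keeps the query string (?par=val) intact, which matches what group 1 of the regex above is meant to capture. Whether running Tidy beforehand is still needed would have to be tested against the real pages.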