On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, 2009-10-23 at 13:45 -0400, Brad Fuller wrote:
>
> > On Fri, Oct 23, 2009 at 1:28 PM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:
> > >
> > > I'm looking for a regular expression to accomplish a specific task.
> > >
> > > I'm hoping someone who's really good at regex patterns can lend a quick hand.
> > >
> > > I need a regex pattern that will grab URLs out of HTML that have a
> > > certain link text (i.e. the word "Continue").
> > >
> > > This is what I have so far, but it does not work properly. (If there are
> > > other attributes in the <a> tag, it returns them as part of the URL.)
> > >
> > > preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)>Continue</a>#i',
> > >     $html, $matches);
> > >
> > > It needs to be able to extract the URL and disregard arbitrary
> > > attributes in the HTML tag.
> > >
> > > Test it with the following examples:
> > >
> > > <a href=/path/to/url.html>Continue</a>
> > > <a href='/path/to/url.html'>Continue</a>
> > > <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> > > <a style="font-size: 12px" href="http://example.com/path/to/url.html" onclick="someFunction('foo','bar')">Continue</a>
> > >
> > > Please reply
> > >
> > > Your help is much appreciated.
> > >
> > > Thanks in advance,
> > > Brad F.
> > >
> > > preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue</a>#i',
> > >     $html, $matches);
> > >
> > > I just changed your regex a bit. What your regex was previously doing was
> > > matching everything from the first quote after the href= right up until the
> > > first > it found, which would usually be the one that closes the opening
> > > tag.
> > > You could make it a bit more intelligent if you wished with
> > > backreferencing, to make sure it matches against the same type of quotation
> > > character it matched at the start of the href's value.
> > >
> > > Thanks,
> > > Ash
> > > http://www.ashleysheridan.co.uk
> >
> > I appreciate the help. However, when I try this I only get the first
> > character of the URL. Can you double-check it, please?
> >
> > Thanks again
>
> I think it's probably the first ? in ([^\"\']+?)
>
> Remove that and it should do the trick.
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk

Hi Brad,

I agree with Jim. Take a look at this. It might help.

<?php

// Sample document. SimpleXML needs well-formed XML, so every attribute
// value here is quoted.
$xml_string = <<<TEXT_BOUNDARY
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <a href="http://example.com/path/to/urlA.html">Continue</a>
      <a href="http://example.com/path/to/url2.html">Brad Fuller</a>
      <a href="http://example.com/path/to/urlB.html">Continue</a>
      <a href="http://example.com/path/to/url4.html">PHP.net</a>
      <a href="http://example.com/path/to/urlC.html" class="link">Continue</a>
      <a style="font-size: 12px" href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
    </div>
  </body>
</html>
TEXT_BOUNDARY;

$xml = simplexml_load_string($xml_string);

// Select the href attribute of every <a> whose text is exactly "Continue".
$continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");

print_r($continue_hrefs);

?>

--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
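
For reference, here is a rough sketch of the two ideas raised in this thread, run against Brad's four test links. Nobody in the thread posted this code; both function names (continue_urls_regex and continue_urls_dom) are hypothetical. The first function tries the backreference suggestion: it captures the opening quote, if any, and requires the same character to close the href value. The second swaps SimpleXML for DOMDocument::loadHTML(), on the assumption that real-world HTML is not always well-formed XML (the unquoted href in the first test link would make simplexml_load_string() fail).

```php
<?php
// Sketch only: both function names are hypothetical, not from the thread.

// Regex variant: (["\']?) captures the opening quote (or nothing for an
// unquoted value) and \1 requires the same character to close the value.
function continue_urls_regex(string $html): array
{
    $pattern = '#<a\s(?:[^>]*?\s)?href\s*=\s*(["\']?)([^"\'\s>]+)\1[^>]*>\s*Continue\s*</a>#i';
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // group 2 holds the URL without its quotes
}

// Parser variant: DOMDocument::loadHTML() tolerates unquoted attributes.
function continue_urls_dom(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = [];
    // Match <a> elements whose trimmed text content is exactly "Continue".
    foreach ($xpath->query("//a[normalize-space(.) = 'Continue']/@href") as $attr) {
        $urls[] = trim($attr->value);
    }
    return $urls;
}

// Brad's four test links, plus one link that should be ignored.
$html = <<<'HTML'
<a href=/path/to/url.html>Continue</a>
<a href='/path/to/url.html'>Continue</a>
<a href="http://example.com/path/to/url.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html" onclick="someFunction('foo','bar')">Continue</a>
<a href="http://example.com/other.html">PHP.net</a>
HTML;

print_r(continue_urls_regex($html));
print_r(continue_urls_dom($html));
```

For this input both functions should return the same four URLs and skip the PHP.net link. They are not equivalent in general: the regex is case-insensitive on the link text and rejects href values containing spaces or quotes, while the XPath comparison is case-sensitive. The parser variant is probably the safer choice once the markup gets messier than these examples.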