Re: regex pattern for extracting URLs

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Fri, 23 Oct 2009 18:28:52 +0100

On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:

> I'm looking for a regular expression to accomplish a specific task.
> 
> I'm hoping someone who's really good at regex patterns can lend a quick hand.
> 
> I need a regex pattern that will grab URLs out of HTML that have a
> certain link text. (i.e. the word "Continue")
> 
> This is what I have so far but it does not work properly (If there are
> other attributes in the <a> tag it returns them as part of the URL.)
> 
>     preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)>Continue</a>#i',
> $html, $matches);
> 
> It needs to be able to extract the URL and disregard arbitrary
> attributes in the HTML tag
> 
> Test it with the following examples:
> 
> <a href=/path/to/url.html>Continue</a>
> <a href='/path/to/url.html'>Continue</a>
> <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> <a style="font-size: 12px" href="http://example.com/path/to/url.html"
> onlick="someFunction('foo','bar')">Continue</a>
> 
> Please reply
> 
> Your help is much appreciated.
> 
> Thanks in advance,
> Brad F.
> 

preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']?([^ \"\'>]+)[^>]*>Continue</a>#i', $html, $matches);

I just changed your regex a bit. Previously it matched everything from
the first quote after the href= right up to the first > it found, which
is usually the one that closes the opening tag, so any other attributes
ended up as part of the URL. You could make it a bit more intelligent
with a backreference, so that the closing quote has to be the same
quotation character that opened the href's value.
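
To illustrate the backreference idea, here is a minimal sketch rather than a
definitive pattern. The $pattern variable and the sample $html are my own
assumptions for the example; the markup is just your four test links. Group 1
captures the opening quote (or nothing for an unquoted value), \1 demands the
same character again at the end of the value, and the URL lands in group 2.

```php
<?php
// Sample input: the four test links from the original question.
$html = <<<'HTML'
<a href=/path/to/url.html>Continue</a>
<a href='/path/to/url.html'>Continue</a>
<a href="http://example.com/path/to/url.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html"
onlick="someFunction('foo','bar')">Continue</a>
HTML;

// Group 1: the opening quote, if there is one (it may be empty).
// Group 2: the URL itself, which may not contain quotes, whitespace or ">".
// \1: the closing quote must be the same character that group 1 matched.
// [^>]*: skip any remaining attributes without leaving the opening tag.
$pattern = '#<a\s[^>]*?\bhref\s*=\s*(["\']?)([^"\'\s>]+)\1[^>]*>Continue</a>#i';

preg_match_all($pattern, $html, $matches);

// The URLs are in $matches[2]; $matches[1] holds the quote characters.
print_r($matches[2]);
```

Because the quote is now its own group, the URLs move from $matches[1] to
$matches[2]. Note that this still assumes the link text is exactly "Continue"
with no nested tags or surrounding whitespace.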

Thanks,
Ash
http://www.ashleysheridan.co.uk