Re: regex pattern for extracting URLs

Brad Fuller <bfuller21@xxxxxxxxx> · Fri, 23 Oct 2009 13:54:40 -0400

On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan
<ash@xxxxxxxxxxxxxxxxxxxx> wrote:

>  On Fri, 2009-10-23 at 13:45 -0400, Brad Fuller wrote:
>
> On Fri, Oct 23, 2009 at 1:28 PM, Ashley Sheridan
> <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> >  On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:
> >
> > I'm looking for a regular expression to accomplish a specific task.
> >
> > I'm hoping someone who's really good at regex patterns can lend a quick hand.
> >
> > I need a regex pattern that will grab URLs out of HTML that have a
> > certain link text. (i.e. the word "Continue")
> >
> > This is what I have so far but it does not work properly (If there are
> > other attributes in the <a> tag it returns them as part of the URL.)
> >
> >     preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)>Continue</a>#i',
> > $html, $matches);
> >
> > It needs to be able to extract the URL and disregard arbitrary
> > attributes in the HTML tag
> >
> > Test it with the following examples:
> >
> > <a href=/path/to/url.html>Continue</a>
> > <a href='/path/to/url.html'>Continue</a>
> > <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> > <a style="font-size: 12px" href="http://example.com/path/to/url.html"
> > onclick="someFunction('foo','bar')">Continue</a>
> >
> > Please reply
> >
> > Your help is much appreciated.
> >
> > Thanks in advance,
> > Brad F.
> >
> >
> >
> > preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue</a>#i',
> > $html, $matches);
> >
> > I just changed your regex a bit. What your regex was previously doing was
> > matching everything from the first quote after the href= right up until the
> > first > it found, which would usually be the one that closes the opening
> > tag. You could make it a bit more intelligent if you wished with
> > backreferencing to make sure it matches against the same type of quotation
> > character it matched as the start of the href's value.
> >
> >   Thanks,
> > Ash
> > http://www.ashleysheridan.co.uk
> >
> >
> >
>
> I appreciate the help.  However, when I try this I only get the first
> character of the URL.  Can you double-check it, please?
>
> Thanks again
>
>
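> I think it's probably the first ? in ([^\"\']+?). It makes the + lazy, so
> the group stops after one character and the .+? that follows takes the rest.
>
> Remove that and it should do the trick.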
>
>   Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>
>
>
That did the trick.  Thanks Ash, you are awesome!
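
For the archive, here is a minimal sketch of the difference between the two
versions, run against the test links from the original post (each tag kept on
one line, since . does not match newlines without the s modifier). The
variable names and the third, stricter pattern are my own assumptions and not
something posted in the thread; the stricter one is only one possible take on
the backreference idea Ash mentioned.

```php
<?php
// Test links from the original post, one tag per line.
$html = <<<'HTML'
<a href=/path/to/url.html>Continue</a>
<a href='/path/to/url.html'>Continue</a>
<a href="http://example.com/path/to/url.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html" onclick="someFunction('foo','bar')">Continue</a>
HTML;

// First suggestion: the lazy +? lets the group stop after one character,
// because the following .+? will happily consume the rest of the tag.
$lazy = '#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue</a>#i';

// Corrected version: a greedy + runs up to the closing quote.
$greedy = '#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+).+?>Continue</a>#i';

// Assumed stricter variant: \1 requires the same closing quote, a second
// branch accepts unquoted values, and [^>]* keeps the match inside one tag.
// (?| ... ) is a branch reset, so the URL is group 2 in both branches.
$strict = '#<a\s[^>]*?href\s*=\s*(?|(["\'])([^>]*?)\1|()([^\s>"\']+))[^>]*>Continue</a>#i';

preg_match_all($lazy, $html, $lazyMatches);
preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($strict, $html, $strictMatches);

print_r($lazyMatches[1]);   // one character per quoted link
print_r($greedyMatches[1]); // full URL for the three quoted links
print_r($strictMatches[2]); // full URL for all four links
```

Note that neither of the first two patterns matches the unquoted href, since
[\"\']+ requires at least one quote character. The .+? can also run from one
link into the next when several links share a line, which is why the stricter
sketch uses [^>]* instead.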

Also, thanks Jim for your suggestion.  I may move to SimpleXML if the project
grows much bigger, but for now I was looking for a nice one-liner and this
is it.
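
If the project does outgrow the one-liner, a parser-based version might look
roughly like the sketch below. It uses DOMDocument rather than SimpleXML,
because DOMDocument::loadHTML() tolerates markup that is not well-formed XML;
that swap is my assumption, not what Jim suggested. The function name
continueLinks() and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: return the href of every <a> whose link text
// equals $text. Name and signature are illustrative only.
function continueLinks($html, $text = 'Continue')
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely well-formed; collect libxml warnings
    // internally instead of letting them reach the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Compare the trimmed link text exactly; skip anchors without href.
        if (trim($a->textContent) === $text && $a->hasAttribute('href')) {
            $urls[] = $a->getAttribute('href');
        }
    }
    return $urls;
}
```

Called as continueLinks($html), it returns a plain array of URLs, and the
attribute order or quoting style in the tag no longer matters.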

Cheers,
Brad