Re: regex pattern for extracting URLs

Israel Ekpo <israelekpo@xxxxxxxxx> · Fri, 23 Oct 2009 13:54:34 -0400

On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan
<ash@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, 2009-10-23 at 13:45 -0400, Brad Fuller wrote:
>
> > On Fri, Oct 23, 2009 at 1:28 PM, Ashley Sheridan
> > <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > >  On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:
> > >
> > > I'm looking for a regular expression to accomplish a specific task.
> > >
> > > I'm hoping someone who's really good at regex patterns can lend a quick
> > > hand.
> > >
> > > I need a regex pattern that will grab URLs out of HTML that have a
> > > certain link text. (i.e. the word "Continue")
> > >
> > > This is what I have so far but it does not work properly (If there are
> > > other attributes in the <a> tag it returns them as part of the URL.)
> > >
> > >
> > > preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)>Continue</a>#i',
> > > $html, $matches);
> > >
> > > It needs to be able to extract the URL and disregard arbitrary
> > > attributes in the HTML tag
> > >
> > > Test it with the following examples:
> > >
> > > <a href=/path/to/url.html>Continue</a>
> > > <a href='/path/to/url.html'>Continue</a>
> > > <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> > > <a style="font-size: 12px" href="http://example.com/path/to/url.html"
> > > onclick="someFunction('foo','bar')">Continue</a>
> > >
> > > Please reply
> > >
> > > Your help is much appreciated.
> > >
> > > Thanks in advance,
> > > Brad F.
> > >
> > >
> > >
> > >
> > > preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue</a>#i',
> > > $html, $matches);
> > >
> > > I just changed your regex a bit. What your regex was previously doing was
> > > matching everything from the first quote after the href= right up until the
> > > first > it found, which would usually be the one that closes the opening
> > > tag. You could make it a bit more intelligent if you wished with
> > > backreferencing to make sure it matches against the same type of quotation
> > > character it matched as the start of the href's value.
> > >
> > >   Thanks,
> > > Ash
> > > http://www.ashleysheridan.co.uk
> > >
> > >
> > >
> >
> > I appreciate the help.  However, when I try this I only get the first
> > character of the URL.  Can you double-check it, please?
> >
> > Thanks again
>
>
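> I think it's probably the first ? in ([^\"\']+?). It makes that group lazy,
> so it captures a single character and leaves the rest to the .+? after it.
>
> Remove that ? and it should do the trick.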
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>
>
>
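
For reference, the quote backreference Ash mentions could look roughly like
the sketch below. It is only an illustration: the sample markup, the paths and
the variable names ($html, $pattern, $urls) are made up here, and it assumes
the URL itself contains no quotes or whitespace. The optional quote is captured
in group 1 and must reappear as \1, so unquoted, single-quoted and
double-quoted href values are all accepted.

```php
<?php
// Hypothetical test input modelled on the examples earlier in this thread.
$html = <<<'HTML'
<a href=/path/to/one.html>Continue</a>
<a href='/path/to/two.html'>Continue</a>
<a href="http://example.com/path/to/three.html" class="link">Continue</a>
<a href="http://example.com/path/to/skip.html">PHP.net</a>
<a style="font-size: 12px" href="http://example.com/path/to/four.html" onclick="someFunction('foo','bar')">Continue</a>
HTML;

// Group 1: optional opening quote. Group 2: the URL (no quotes, spaces or >).
// \1 then requires the same quote character again (or nothing if unquoted).
// [^>]* skips any attributes that follow the href inside the opening tag.
$pattern = '#<a\s+[^>]*?\bhref\s*=\s*(["\']?)([^"\'\s>]+)\1[^>]*>\s*Continue\s*</a>#i';

preg_match_all($pattern, $html, $matches);

// $matches[2] holds the captured URLs, one per matching link.
$urls = $matches[2];
print_r($urls);
```

The PHP.net link is skipped because its link text is not "Continue".
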
Hi Brad,

I agree with Jim.

Take a look at this. It might help.

<?php

$xml_string = <<<TEXT_BOUNDARY
<html>
    <head>
        <title></title>
    </head>
    <body>
        <div>
            <a href="http://example.com/path/to/urlA.html">Continue</a>
            <a href="http://example.com/path/to/url2.html">Brad Fuller</a>
            <a href="http://example.com/path/to/urlB.html">Continue</a>
            <a href="http://example.com/path/to/url4.html">PHP.net</a>
            <a href="http://example.com/path/to/urlC.html" class="link">Continue</a>
            <a style="font-size: 12px" href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
        </div>
    </body>
</html>
TEXT_BOUNDARY;

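// simplexml_load_string() needs well-formed XML and returns false otherwise.
$xml = simplexml_load_string($xml_string);

// Select the href attribute of every <a> whose text is exactly "Continue".
$continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");

print_r($continue_hrefs);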

?>
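
One caveat with the SimpleXML version: simplexml_load_string() only accepts
well-formed XML, so an unquoted attribute such as href=/path/to/url.html from
the original examples would make it fail. A rough sketch of the same XPath idea
using DOMDocument::loadHTML(), which tolerates that kind of markup, might look
like this. The sample markup and the variable names ($html, $doc, $urls) are
made up for illustration.

```php
<?php
// Hypothetical input; note the unquoted href, which is not well-formed XML.
$html = <<<'HTML'
<html><body><div>
<a href=/path/to/one.html>Continue</a>
<a href='/path/to/two.html'>Continue</a>
<a href="http://example.com/path/to/skip.html">PHP.net</a>
<a style="font-size: 12px" href="http://example.com/path/to/three.html" onclick="someFunction('foo','bar')">Continue</a>
</div></body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// normalize-space() ignores leading and trailing whitespace in the link text.
$urls = array();
foreach ($xpath->query("//a[normalize-space(.) = 'Continue']/@href") as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

Compared with a regex, this leaves the attribute parsing to the HTML parser,
so the order of attributes and the quoting style no longer matter.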

-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.