RE: Parsing HTML href-Attribute

"Boyd, Todd M." <tmboyd1@xxxxxxxx> · Fri, 16 Jan 2009 14:55:06 -0600

> -----Original Message-----
> From: Shawn McKenzie [mailto:nospam@xxxxxxxxxxxxx]
> Sent: Friday, January 16, 2009 2:37 PM
> To: php-general@xxxxxxxxxxxxx
> Subject: Re:  Parsing HTML href-Attribute
> 
> >>>>> Hey, I want to "parse" a href-attribute in a given String to
> >>>>> check if there is a relative link and then adding an absolute
> >>>>> path. Example:
> >>>>>  $string  = '<a class="sample" [...additional attributes...]
> >>>>>  href="/foo/bar.php" >';
> >>>>>
> >>>>> I tried using regular expressions but my knowledge of RegEx
> >>>>> is very limited. Things to consider:
> >>>>> - $string could be quite long but my concern are only those href
> >>>>>   attributes (so working with explode() would be not very handy)
> >>>>> - Should also work if href= is not using quotes or using single
> >>>>>   quotes
> >>>>> - link could already be an absolute path, so just searching for
> >>>>>   href= and then inserting absolute path could mess up the link
> >>>>>
> >>>>> Any ideas? Or can someone create a RegEx to use?
> >>>> Just spitballing here, but this is probably how I would start:
> >>>>
> >>>> RegEx pattern: /<a.*? href=(.+?)>/ig
> >>>>
> >>>> Then, using the capture group, determine if the href attribute
> >>>> uses quotes (single or double, doesn't matter). If it does, you
> >>>> don't need to worry about splitting the capture group at the first
> >>>> whitespace. If it doesn't, then you must assume the first
> >>>> whitespace is the end of the URL and the beginning of additional
> >>>> attributes, and just grab the URL up to (but not including) the
> >>>> first whitespace. So...
> >>>>
> >>>> <?php
> >>>>
> >>>> # here is where $anchorText (text for the <a> tag) would be assigned
> >>>> # here is where $curDir (text for the current directory) would be assigned
> >>>>
> >>>> # find the href attribute (PHP's PCRE has no 'g' modifier, so only 'i')
> >>>> # note: this assumes the pattern matches; $matches[1] is unset otherwise
> >>>> $matches = Array();
> >>>> preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches);
> >>>> $anchorText = $matches[1];
> >>>>
> >>>> # determine if it has surrounding quotes
> >>>> if($anchorText[0] == '\'' || $anchorText[0] == '"') {
> >>>>   # pull everything between the opening quote and its closing match
> >>>>   $endPos = strpos($anchorText, $anchorText[0], 1);
> >>>>   if($endPos === false) $endPos = strlen($anchorText);
> >>>>   $anchorText = substr($anchorText, 1, $endPos - 1);
> >>>> } else {
> >>>>   # pull up to the first space (if there is one)
> >>>>   $spacePos = strpos($anchorText, ' ');
> >>>>   if($spacePos !== false)
> >>>>     $anchorText = substr($anchorText, 0, $spacePos);
> >>>> }
> >>>>
> >>>> # now, check to see if it is relative or absolute
> >>>> # (regex pattern searches for protocol spec (i.e., http://), which
> >>>> # will be treated as an absolute path for the purpose of this
> >>>> # algorithm)
> >>>> if($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0) {
> >>>>   # add current directory to the beginning of the relative path
> >>>>   # (nothing is done to absolute paths or URLs with protocol spec)
> >>>>   $anchorText = $curDir . '/' . $anchorText;
> >>>> }
> >>>>
> >>>> echo $anchorText;
> >>>>
> >>>> ?>
> >>>>
> >>> Wow, that's a lot!  This should work with or without quotes and
> >>> assumes no spaces in the URL:
> >>>
> >>> $prefix = "http://example.com/";
> >>> $html =
> >>> preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
> >>> "$1$prefix$2$3", $html);
> >>>
> >>>
> >> Might need to keep a preceding slash out of there:
> >>
> >> $html =
> >> preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
> >> "$1$prefix$2$3", $html);
> >
> > I believe the OP wanted to leave already-absolute paths alone (i.e.,
> > only convert relative paths). The regex does not take into account
> > fully-qualified URLs (i.e., http://www.google.com/search?q=php) and
> > it does not determine if a given path is relative or absolute. He was
> >  wanting to take the href attribute of an anchor tag and, **IF** it
> > was a relative path, turn it into an absolute path (meaning to append
> >  the relative path to the absolute path of the current script).
> 
> That's exactly what this regex does :-)  The (?!$prefix) negative
> lookahead assertion fails the match if it's already an absolute URL.

I see that now. I didn't notice the negative look-ahead the first go 'round. However, I still have qualms about it. :) You are only checking for http://, and only for the local server. What I meant by "absolute path" was, for example, "/index.php" (the index in the root directory of the server) as opposed to "somefolder/index.php" (the index in a subfolder of the current directory named 'somefolder').

* http://www.google.com/search?q=php ... absolute path (yes, it's a URL, but treat it as absolute)
* https://www.example.com/index.php ... absolute path (yes, it's a URL, but to the local server)
* /index.php ... absolute path (no protocol given, true absolute path)
* index.php ... relative path (relative to current directory on current server)
* somefolder/index.php ... relative path (same reason)
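
To make the distinction in the list above concrete, here is a minimal sketch of the check in PHP. The function name isRelativeHref() is made up for illustration, and it only knows the two rules named above (leading slash, protocol spec followed by ://); things like mailto: links or an empty href are not part of the list and are handled only loosely.

```php
<?php
// Hypothetical helper (not from the thread): true only for hrefs that are
// neither root-absolute ("/index.php") nor prefixed by a protocol spec.
function isRelativeHref($href)
{
    // Assumption: an empty href is left alone, so treat it as "not relative".
    if ($href === '' || $href[0] === '/') {
        return false;
    }
    // Protocol spec: a letter, then letters/digits/+/./-, then "://".
    return preg_match('#^[a-z][a-z0-9+.-]*://#i', $href) === 0;
}

$samples = array(
    'http://www.google.com/search?q=php',
    'https://www.example.com/index.php',
    '/index.php',
    'index.php',
    'somefolder/index.php',
);
foreach ($samples as $href) {
    printf("%-36s %s\n", $href, isRelativeHref($href) ? 'relative' : 'absolute');
}
```

Only the last two samples come out as relative, which matches the list.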

That is indeed a nifty use of look-ahead, though. That will work for any anchor tag that doesn't reference the server (or any other server) with a protocol spec preceding it. However, if you want to run it through an entire list of anchor tags with any spec (http://, https://, udp://, ftp://, aim://, rss://, etc.)--or lack of spec--and only mess with those that don't have a spec and don't use absolute paths, it needs to get a bit more complex. You've convinced me, however, that it can be done entirely with one regex pattern.
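
For illustration, this is the quoted pattern (the second version, with the optional leading slash) run against a few sample tags. The sample tags, including the ftp:// host, are invented here; the pattern and replacement are copied from the quoted message.

```php
<?php
// Pattern and replacement as quoted above; only the sample tags are made up.
$prefix  = "http://example.com/";
$pattern = "|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|";

$samples = array(
    '<a href="somefolder/index.php">',
    '<a href="http://example.com/index.php">',
    '<a href="/index.php">',
    '<a href="ftp://ftp.example.org/file.zip">',
);

$results = array();
foreach ($samples as $tag) {
    $results[] = preg_replace($pattern, "$1$prefix$2$3", $tag);
}
print_r($results);
```

The first three tags all end up pointing into http://example.com/ (the second is left alone by the look-ahead), but the ftp:// link becomes href="http://example.com/ftp://ftp.example.org/file.zip", because the look-ahead only knows about the one prefix.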

Ooh--one more thing I noticed: If the href attribute is not surrounded by quotes (as the OP said it might not be in certain cases), then everything after the first whitespace should be discarded... but your regex also stops at the first whitespace when the value HAS been enclosed in quotes. (<a href="http://www.google.com/search?q=php is cool"> works fine in browsers, which encode the " " as %20 when the URL is requested.)
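
One possible way to cover both cases in a single pattern is to give double-quoted, single-quoted, and unquoted values their own alternatives and make the relative/absolute decision in a callback. This is only a rough sketch, not a finished answer: absolutizeHrefs() and $baseDir are hypothetical names, it relies on PCRE's branch-reset group (?|...), and it applies just the two rules discussed here (leading slash, protocol spec), so mailto: or "#fragment" links would still get prefixed.

```php
<?php
// Hypothetical sketch: prefix relative hrefs in <a> tags with $baseDir.
// Quoted values are captured whole (spaces included); unquoted values
// end at the first whitespace or '>'.
function absolutizeHrefs($html, $baseDir)
{
    // Group 1: everything up to and including "href=".
    // Branch reset (?|...) makes group 2 the quote character (possibly
    // empty) and group 3 the URL, whichever alternative matched.
    $pattern = '#(<a\b[^>]*?\shref\s*=\s*)(?|(")([^"]*)"|(\')([^\']*)\'|()([^\s>]+))#i';

    return preg_replace_callback($pattern, function ($m) use ($baseDir) {
        list(, $head, $quote, $url) = $m;

        // Leave empty values, root-absolute paths, and anything with a
        // protocol spec untouched.
        if ($url === '' || $url[0] === '/'
            || preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
            return $m[0];
        }
        return $head . $quote . rtrim($baseDir, '/') . '/' . $url . $quote;
    }, $html);
}

echo absolutizeHrefs('<a class="sample" href="docs/my page.php">', '/app'), "\n";
echo absolutizeHrefs('<a href=faq.php title=t>', '/app'), "\n";
```

The first call keeps the space inside the quoted value, and the second stops the unquoted value at the whitespace before title=t.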

I don't really have the time right now to further the regex, but you can bet your shirt I'm going to give it another look when I get home. :)

// Todd