RE: Parsing HTML href-Attribute

> -----Original Message-----
> From: Shawn McKenzie [mailto:nospam@xxxxxxxxxxxxx]
> Sent: Friday, January 16, 2009 1:08 PM
> To: php-general@xxxxxxxxxxxxx
> Subject: Re:  Parsing HTML href-Attribute
> 
> Shawn McKenzie wrote:
> > Boyd, Todd M. wrote:
> >>> -----Original Message-----
> >>> From: farnion@xxxxxxxxxxxxxx [mailto:farnion@xxxxxxxxxxxxxx] On Behalf Of Edmund Hertle
> >>> Sent: Thursday, January 15, 2009 4:13 PM
> >>> To: PHP - General
> >>> Subject:  Parsing HTML href-Attribute
> >>>
> >>> Hey,
> >>> I want to "parse" a href-attribute in a given String to check if there is a
> >>> relative link and then adding an absolute path.
> >>> Example:
> >>> $string  = '<a class="sample" [...additional attributes...]
> >>> href="/foo/bar.php" >';
> >>>
> >>> I tried using regular expressions but my knowledge of RegEx is very
> >>> limited.
> >>> Things to consider:
> >>> - $string could be quite long but my concern are only those href attributes
> >>> (so working with explode() would be not very handy)
> >>> - Should also work if href= is not using quotes or using single quotes
> >>> - link could already be an absolute path, so just searching for href= and
> >>> then inserting absolute path could mess up the link
> >>>
> >>> Any ideas? Or can someone create a RegEx to use?
> >> Just spitballing here, but this is probably how I would start:
> >>
> >> RegEx pattern: /<a.*? href=(.+?)>/i
> >>
> >> Then, using the capture group, determine if the href attribute uses quotes
> >> (single or double, doesn't matter). If it does, you don't need to worry
> >> about splitting the capture group at the first white space. If it doesn't,
> >> then you must assume the first whitespace is the end of the URL and the
> >> beginning of additional attributes, and just grab the URL up to (but not
> >> including) the first whitespace.
> >>
> >> So...
> >>
> >> <?php
> >>
> >> # here is where $anchorText (text for the <a> tag) would be assigned
> >> # here is where $curDir (text for the current directory) would be assigned
> >>
> >> # find the href attribute
> >> # (PCRE has no "g" modifier; preg_match() stops at the first match anyway)
> >> $matches = array();
> >> preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches);
> >>
> >> # work on the captured attribute value, not on the whole tag
> >> $href = trim($matches[1]);
> >>
> >> # determine if it has surrounding quotes
> >> if($href[0] == '\'' || $href[0] == '"')
> >> {
> >> 	# pull everything between the opening quote and its matching close
> >> 	$href = substr($href, 1, strcspn($href, $href[0], 1));
> >> }
> >> else
> >> {
> >> 	# pull up to the first space (if there is one)
> >> 	$spacePos = strpos($href, ' ');
> >> 	if($spacePos !== false)
> >> 		$href = substr($href, 0, $spacePos);
> >> }
> >>
> >> # now, check to see if it is relative or absolute
> >> # (regex pattern searches for protocol spec (i.e., http://), which will be
> >> # treated as an absolute path for the purpose of this algorithm)
> >> if($href[0] != '/' && preg_match('#^\w+://#', $href) == 0)
> >> {
> >> 	# add current directory to the beginning of the relative path
> >> 	# (nothing is done to absolute paths or URLs with protocol spec)
> >> 	$href = $curDir . '/' . $href;
> >> }
> >>
> >> echo $href;
> >>
> >> ?>
> >>
> >> ...UNTESTED.
> >>
> >> HTH,
> >>
> >>
> >> // Todd
> >
> > Wow, that's a lot!  This should work with or without quotes and assumes
> > no spaces in the URL:
> >
> > $prefix = "http://example.com/";
> > $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
> > "$1$prefix$2$3", $html);
> >
> >
> Might need to keep a preceding slash out of there:
> 
> $html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
> "$1$prefix$2$3", $html);

I believe the OP wanted to leave already-absolute paths alone (i.e., only convert relative paths). The regex does not account for fully-qualified URLs (e.g., http://www.google.com/search?q=php), and it does not determine whether a given path is relative or absolute. He wanted to take the href attribute of an anchor tag and, **IF** it was a relative path, turn it into an absolute path (meaning to append the relative path to the absolute path of the current directory).

That was my understanding. Perhaps you saw it differently, but I don't believe your pattern is enough to accomplish what the OP was asking for, which is why there was "a lot" of code in my reply. ;)
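
To make the concern concrete, here is a small check of the second quoted pattern against one fully-qualified URL and one site-relative path. The sample $html string is my own assumption, not something from the OP; the pattern and $prefix are copied from the quoted message as-is.

```php
<?php
// Pattern and $prefix as quoted above; the sample markup is made up for illustration.
$prefix = "http://example.com/";
$html   = '<a href="http://www.google.com/search?q=php">x</a> <a href="/foo/bar.php">y</a>';

$out = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
    "$1$prefix$2$3", $html);

echo $out, "\n";
// The already-absolute Google link gets the prefix glued on as well:
// <a href="http://example.com/http://www.google.com/search?q=php">x</a> <a href="http://example.com/foo/bar.php">y</a>
```

The negative lookahead only protects links that already start with $prefix, so any other absolute URL is rewritten too.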

Believe me, I'm the first guy to hop on the "do it with a regex!" bandwagon... but there are some circumstances where a regex alone can't do what you need (such as more-than-superficial contextual logic).
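
One possible middle ground, as a rough sketch only: let a regex find the href values and let a PHP callback make the relative-vs-absolute decision. The function name absolutizeHrefs() and the sample inputs are hypothetical, the rule for "absolute" is the same one used in my earlier code (leading slash or protocol spec), and the closure needs PHP 5.3 or later. Things like mailto: links or #fragments would need extra checks that are not shown here.

```php
<?php
// Hypothetical helper (not from the thread): the regex only locates href
// values; the callback decides whether each one should be rewritten.
function absolutizeHrefs($html, $curDir)
{
    // Group 1: everything up to and including "href=".
    // Group 2: the raw value, double-quoted, single-quoted, or unquoted.
    $pattern = '#(<a\b[^>]*?\shref\s*=\s*)("[^"]*"|\'[^\']*\'|[^\s>]+)#i';

    return preg_replace_callback($pattern, function ($m) use ($curDir) {
        $value = $m[2];
        // Remember which quote character (if any) wrapped the value.
        $quote = ($value[0] === '"' || $value[0] === "'") ? $value[0] : '';
        $url   = ($quote === '') ? $value : (string) substr($value, 1, -1);

        // Leave empty values, root-absolute paths, and URLs with a protocol alone.
        if ($url === '' || $url[0] === '/' || preg_match('#^\w+://#', $url)) {
            return $m[0];
        }

        // Relative path: prepend the current directory, keeping the quoting style.
        return $m[1] . $quote . rtrim($curDir, '/') . '/' . $url . $quote;
    }, $html);
}

// Assumed sample inputs, for illustration only.
$relative = absolutizeHrefs('<a class="sample" href="bar.php">', '/foo');
$unquoted = absolutizeHrefs('<a href=baz/qux.html>', '/foo/');
$absolute = absolutizeHrefs("<a href='http://www.google.com/search?q=php'>", '/foo');

echo $relative, "\n"; // <a class="sample" href="/foo/bar.php">
echo $unquoted, "\n"; // <a href=/foo/baz/qux.html>
echo $absolute, "\n"; // unchanged
```

The pattern stays simple because it no longer has to encode the decision; it only has to find the attribute.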

HTH,


// Todd

