Boyd, Todd M. wrote:
>> -----Original Message-----
>> From: Shawn McKenzie [mailto:nospam@xxxxxxxxxxxxx]
>> Sent: Friday, January 16, 2009 1:08 PM
>> To: php-general@xxxxxxxxxxxxx
>> Subject: Re: Parsing HTML href-Attribute
>>
>> Shawn McKenzie wrote:
>>> Boyd, Todd M. wrote:
>>>>> -----Original Message-----
>>>>> From: farnion@xxxxxxxxxxxxxx [mailto:farnion@xxxxxxxxxxxxxx]
>>>>> On Behalf Of Edmund Hertle
>>>>> Sent: Thursday, January 15, 2009 4:13 PM
>>>>> To: PHP - General
>>>>> Subject: Parsing HTML href-Attribute
>>>>>
>>>>> Hey,
>>>>> I want to "parse" a href-attribute in a given String to check if
>>>>> there is a relative link and then add an absolute path. Example:
>>>>> $string = '<a class="sample" [...additional attributes...]
>>>>> href="/foo/bar.php" >';
>>>>>
>>>>> I tried using regular expressions but my knowledge of RegEx is
>>>>> very limited. Things to consider:
>>>>> - $string could be quite long but my only concern is those href
>>>>>   attributes (so working with explode() would not be very handy)
>>>>> - Should also work if href= is not using quotes or using single
>>>>>   quotes
>>>>> - link could already be an absolute path, so just searching for
>>>>>   href= and then inserting absolute path could mess up the link
>>>>>
>>>>> Any ideas? Or can someone create a RegEx to use?
>>>> Just spitballing here, but this is probably how I would start:
>>>>
>>>> RegEx pattern: /<a.*? href=(.+?)>/ig
>>>>
>>>> Then, using the capture group, determine if the href attribute
>>>> uses quotes (single or double, doesn't matter). If it does, you
>>>> don't need to worry about splitting the capture group at the
>>>> first whitespace. If it doesn't, then you must assume the first
>>>> whitespace is the end of the URL and the beginning of additional
>>>> attributes, and just grab the URL up to (but not including) the
>>>> first whitespace. So...
>>>>
>>>> <?php
>>>>
>>>> # here is where $anchorText (text for the <a> tag) would be assigned
>>>> # here is where $curDir (text for the current directory) would be
>>>> # assigned
>>>>
>>>> # find the href attribute (PHP has no "g" modifier; preg_match()
>>>> # stops at the first match anyway)
>>>> $matches = array();
>>>> preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches);
>>>> $anchorText = trim($matches[1]);
>>>>
>>>> # determine if it has surrounding quotes
>>>> if ($anchorText[0] == '\'' || $anchorText[0] == '"') {
>>>>     # pull everything between the opening quote and its match
>>>>     $anchorText = substr($anchorText, 1,
>>>>         strpos($anchorText, $anchorText[0], 1) - 1);
>>>> } else {
>>>>     # pull up to the first space (if there is one)
>>>>     $spacePos = strpos($anchorText, ' ');
>>>>     if ($spacePos !== false)
>>>>         $anchorText = substr($anchorText, 0, $spacePos);
>>>> }
>>>>
>>>> # now, check to see if it is relative or absolute
>>>> # (regex pattern searches for protocol spec (e.g., http://), which
>>>> # will be treated as an absolute path for the purpose of this
>>>> # algorithm)
>>>> if ($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0) {
>>>>     # add current directory to the beginning of the relative path
>>>>     # (nothing is done to absolute paths or URLs with protocol spec)
>>>>     $anchorText = $curDir . '/' . $anchorText;
>>>> }
>>>>
>>>> echo $anchorText;
>>>>
>>>> ?>
>>>>
>>>> ...UNTESTED.
>>>>
>>>> HTH,
>>>>
>>>>
>>>> // Todd
>>> Wow, that's a lot! This should work with or without quotes and
>>> assumes no spaces in the URL:
>>>
>>> $prefix = "http://example.com/";
>>> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
>>>     "$1$prefix$2$3", $html);
>>>
>>>
>> Might need to keep a preceding slash out of there:
>>
>> $html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
>>     "$1$prefix$2$3", $html);
>
> I believe the OP wanted to leave already-absolute paths alone (i.e.,
> only convert relative paths).
> The regex does not take into account fully-qualified URLs (e.g.,
> http://www.google.com/search?q=php) and it does not determine if a
> given path is relative or absolute. He wanted to take the href
> attribute of an anchor tag and, **IF** it was a relative path, turn
> it into an absolute path (meaning to append the relative path to the
> absolute path of the current script).

That's exactly what this regex does :-) The (?!$prefix) negative
lookahead assertion fails the match if the href already starts with
$prefix, so a URL that is already absolute on that site is left alone.

> That was my understanding. Perhaps you saw it differently, but I
> don't believe your pattern is enough to accomplish what the OP was
> asking for--hence "a lot" of code was in my reply. ;)
>
> Believe me, I'm the first guy to hop on the "do it with a regex!"
> bandwagon... but there are just some circumstances where regex can't
> do what you need to do (such as more-than-superficial contextual
> logic).
>
> HTH,
>
>
> // Todd

--
Thanks!
-Shawn
http://www.spidean.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
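
For what it's worth, the (?!$prefix) lookahead only skips hrefs that begin with $prefix itself. A fully-qualified URL on another host, like the http://www.google.com/search?q=php example mentioned above, would still get the prefix glued on. Below is a minimal sketch of one way to cover both cases, assuming PHP 7 or later and, as in the one-liner, no spaces inside the URL. The function name absolutize_hrefs() and the sample $html are made up for illustration and are not from the thread.

```php
<?php
// Hypothetical helper (not from the thread): prefix relative hrefs only.
function absolutize_hrefs(string $html, string $prefix): string
{
    // Group 1: "href=" plus an optional opening quote.
    // Group 2: the URL itself (assumes it contains no whitespace).
    $pattern = '~(href\s*=\s*[\'"]?)([^>\'"\s]+)~i';

    return preg_replace_callback($pattern, function (array $m) use ($prefix) {
        $url = $m[2];
        // Leave scheme-qualified (http://, mailto:, ...), protocol-relative
        // (//host/...) and fragment-only (#top) links untouched.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }
        // Join prefix and path with exactly one slash between them.
        return $m[1] . rtrim($prefix, '/') . '/' . ltrim($url, '/');
    }, $html);
}

// Sample input covering double quotes, single quotes, no quotes,
// and an already fully-qualified URL.
$html = '<a class="sample" href="/foo/bar.php">a</a> '
      . "<a href='baz.php'>b</a> "
      . '<a href=qux.php>c</a> '
      . '<a href="http://www.google.com/search?q=php">d</a>';

echo absolutize_hrefs($html, 'http://example.com/'), "\n";
```

Doing the relative/absolute check in a callback rather than in a lookahead keeps the pattern short and puts the decision in plain PHP. Note that this resolves everything against one fixed prefix; it does not handle ../ segments or paths relative to the current script's directory, and for arbitrary real-world HTML a DOM parser is likely the safer route.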