Re: Parsing HTML href-Attribute

Boyd, Todd M. wrote:
>> -----Original Message-----
>> From: Shawn McKenzie [mailto:nospam@xxxxxxxxxxxxx]
>> Sent: Friday, January 16, 2009 1:08 PM
>> To: php-general@xxxxxxxxxxxxx
>> Subject: Re: Parsing HTML href-Attribute
>> 
>> Shawn McKenzie wrote:
>>> Boyd, Todd M. wrote:
>>>>> -----Original Message-----
>>>>> From: farnion@xxxxxxxxxxxxxx [mailto:farnion@xxxxxxxxxxxxxx] On Behalf Of Edmund Hertle
>>>>> Sent: Thursday, January 15, 2009 4:13 PM
>>>>> To: PHP - General
>>>>> Subject: Parsing HTML href-Attribute
>>>>> 
>>>>> Hey,
>>>>> I want to "parse" an href attribute in a given string to check
>>>>> whether it is a relative link and, if so, add an absolute path.
>>>>> Example:
>>>>> $string = '<a class="sample" [...additional attributes...] href="/foo/bar.php" >';
>>>>>
>>>>> I tried using regular expressions, but my knowledge of RegEx is
>>>>> very limited. Things to consider:
>>>>> - $string could be quite long, but my only concern is those href
>>>>>   attributes (so working with explode() would not be very handy)
>>>>> - It should also work if href= is not using quotes or is using
>>>>>   single quotes
>>>>> - The link could already be an absolute path, so just searching for
>>>>>   href= and then inserting the absolute path could mess up the link
>>>>>
>>>>> Any ideas? Or can someone create a RegEx to use?
>>>> Just spitballing here, but this is probably how I would start:
>>>> 
>>>> RegEx pattern: /<a.*? href=(.+?)>/ig
>>>> 
>>>> Then, using the capture group, determine if the href attribute uses
>>>> quotes (single or double, doesn't matter). If it does, you don't
>>>> need to worry about splitting the capture group at the first
>>>> whitespace. If it doesn't, then you must assume the first whitespace
>>>> is the end of the URL and the beginning of additional attributes, and
>>>> just grab the URL up to (but not including) the first whitespace.
>>>> So...
>>>> 
>>>> <?php
>>>>
>>>> # here is where $anchorText (text for the <a> tag) would be assigned
>>>> # here is where $curDir (text for the current directory) would be assigned
>>>>
>>>> # find the href attribute (PHP's preg functions have no "g" modifier)
>>>> $matches = array();
>>>> if (!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches)) {
>>>>     die('no href attribute found');
>>>> }
>>>>
>>>> # work on the captured attribute value, not on the whole tag
>>>> $anchorText = trim($matches[1]);
>>>>
>>>> # determine if it has surrounding quotes
>>>> if ($anchorText[0] == '\'' || $anchorText[0] == '"') {
>>>>     # pull everything between the opening quote and its matching close
>>>>     $anchorText = substr($anchorText, 1, strpos($anchorText, $anchorText[0], 1) - 1);
>>>> } else {
>>>>     # pull up to the first space (if there is one)
>>>>     $spacePos = strpos($anchorText, ' ');
>>>>     if ($spacePos !== false) {
>>>>         $anchorText = substr($anchorText, 0, $spacePos);
>>>>     }
>>>> }
>>>>
>>>> # now, check to see if it is relative or absolute
>>>> # (regex pattern searches for protocol spec (i.e., http://), which will be
>>>> # treated as an absolute path for the purpose of this algorithm)
>>>> if ($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0) {
>>>>     # add current directory to the beginning of the relative path
>>>>     # (nothing is done to absolute paths or URLs with protocol spec)
>>>>     $anchorText = $curDir . '/' . $anchorText;
>>>> }
>>>>
>>>> echo $anchorText;
>>>>
>>>> ?>
>>>> 
>>>> ...UNTESTED.
>>>> 
>>>> HTH,
>>>> 
>>>> 
>>>> // Todd
>>> Wow, that's a lot!  This should work with or without quotes and
>>> assumes no spaces in the URL:
>>>
>>> $prefix = "http://example.com/";
>>> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);
>>> 
>>> 
>> Might need to keep a preceding slash out of there:
>> 
>> $html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);
> 
> I believe the OP wanted to leave already-absolute paths alone (i.e., 
> only convert relative paths). The regex does not take into account 
> fully-qualified URLs (i.e., http://www.google.com/search?q=php) and 
> it does not determine if a given path is relative or absolute. He
> wanted to take the href attribute of an anchor tag and, **IF** it
> was a relative path, turn it into an absolute path (meaning to append
> the relative path to the absolute path of the current script).

That's exactly what this regex does :-)  The (?!$prefix) negative
lookahead assertion fails the match if the href already starts with
$prefix, so those absolute URLs are left alone.
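
To make the lookahead's behavior concrete, here is a rough sketch rather than a finished solution. The function names prefix_hrefs() and prefix_relative_hrefs() are made up for this example and do not come from the thread. The first wraps the preg_replace() call quoted above, with preg_quote() added so the dots in $prefix are matched literally. The second is a variation resting on an assumption of mine: any href that starts with a URL scheme (http:, mailto:, ...) or with "//" is treated as already absolute.

```php
<?php
// Hypothetical wrapper around the preg_replace() call quoted above.
function prefix_hrefs($html, $prefix)
{
    // preg_quote() keeps "." and the "|" delimiter in $prefix literal.
    $skip = preg_quote($prefix, '|');
    return preg_replace(
        "|(href=['\"]?)(?!$skip)[/]?([^>'\"\s]+)(\s)?|",
        '${1}' . $prefix . '${2}${3}',
        $html
    );
}

// Variation (an assumption, not from the thread): skip any href that starts
// with a scheme such as http: or mailto:, or with a protocol-relative "//".
function prefix_relative_hrefs($html, $prefix)
{
    return preg_replace(
        '~(href=[\'"]?)(?![a-z][a-z0-9+.-]*:|//)/?([^>\'"\s]+)(\s)?~i',
        '${1}' . $prefix . '${2}${3}',
        $html
    );
}

$prefix = 'http://example.com/';
echo prefix_hrefs('<a href="/foo/bar.php">', $prefix), "\n";
echo prefix_hrefs('<a href="http://example.com/x">', $prefix), "\n";
echo prefix_hrefs('<a href="http://www.google.com/search?q=php">', $prefix), "\n";
echo prefix_relative_hrefs('<a href="http://www.google.com/search?q=php">', $prefix), "\n";
```

With the (?!$prefix) lookahead, the first link gets the prefix and the second is left untouched, but the google.com link comes out as href="http://example.com/http://www.google.com/search?q=php", because it does not start with $prefix. The scheme-based lookahead in the second function leaves that link as it is. Neither version handles fragment-only links such as #top or paths containing "../".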

> That was my understanding. Perhaps you saw it differently, but I 
> don't believe your pattern is enough to accomplish what the OP was 
> asking for--hence "a lot" of code was in my reply. ;)
> 
> Believe me, I'm the first guy to hop on the "do it with a regex!" 
> bandwagon... but there are just some circumstances where regex can't 
> do what you need to do (such as more-than-superficial contextual 
> logic).
> 
> HTH,
> 
> 
> // Todd


-- 
Thanks!
-Shawn
http://www.spidean.com


