Re: Parsing HTML href-Attribute

Shawn McKenzie wrote:
> Boyd, Todd M. wrote:
>>> -----Original Message-----
>>> From: farnion@xxxxxxxxxxxxxx [mailto:farnion@xxxxxxxxxxxxxx] On Behalf
>>> Of Edmund Hertle
>>> Sent: Thursday, January 15, 2009 4:13 PM
>>> To: PHP - General
>>> Subject:  Parsing HTML href-Attribute
>>>
>>> Hey,
>>> I want to "parse" an href attribute in a given string to check whether it
>>> contains a relative link and, if so, prepend an absolute path.
>>> Example:
>>> $string  = '<a class="sample" [...additional attributes...]
>>> href="/foo/bar.php" >';
>>>
>>> I tried using regular expressions but my knowledge of RegEx is very
>>> limited.
>>> Things to consider:
>>> - $string could be quite long, but my only concern is the href attributes
>>> (so working with explode() would not be very handy)
>>> - It should also work if href= uses single quotes or no quotes at all
>>> - The link could already be an absolute path, so just searching for href=
>>> and then inserting the absolute path could mess up the link
>>>
>>> Any ideas? Or can someone create a RegEx to use?
>> Just spitballing here, but this is probably how I would start:
>>
>> RegEx pattern: /<a.*? href=(.+?)>/i
>>
>> Then, using the capture group, determine if the href attribute uses quotes (single or double, doesn't matter). If it does, you don't need to worry about splitting the capture group at the first white space. If it doesn't, then you must assume the first whitespace is the end of the URL and the beginning of additional attributes, and just grab the URL up to (but not including) the first whitespace.
>>
>> So...
>>
>> <?php
>>
>> # here is where $anchorText (text for the <a> tag) would be assigned
>> # here is where $curDir (text for the current directory) would be assigned
>>
>> # find the href attribute
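>> $matches = Array();
>> # note: PHP's PCRE has no "g" modifier; preg_match() returns the first match
>> if(!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches))
>> 	exit('no href attribute found');
>>
>> # work on the captured attribute text, not on the whole tag
>> $anchorText = $matches[1];
>>
>> # determine if it has surrounding quotes
>> if($anchorText[0] == '\'' || $anchorText[0] == '"')
>> {
>> 	# pull everything between the opening quote and its matching closing quote
>> 	$endPos = strpos($anchorText, $anchorText[0], 1);
>> 	$anchorText = ($endPos === false) ? substr($anchorText, 1) : substr($anchorText, 1, $endPos - 1);
>> }
>> else
>> {
>> 	# pull up to the first space (if there is one)
>> 	$spacePos = strpos($anchorText, ' ');
>> 	if($spacePos !== false)
>> 		$anchorText = substr($anchorText, 0, $spacePos);
>> }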
>>
>> # now, check to see if it is relative or absolute
>> # (regex pattern searches for protocol spec (i.e., http://), which will be
>> # treated as an absolute path for the purpose of this algorithm)
>> if($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0)
>> {
>> 	# add current directory to the beginning of the relative path
>> 	# (nothing is done to absolute paths or URLs with protocol spec)
>> 	$anchorText = $curDir . '/' . $anchorText;
>> }
>>
>> echo $anchorText;
>>
>> ?>
>>
>> ...UNTESTED.
>>
>> HTH,
>>
>>
>> // Todd
> 
> Wow, that's a lot!  This should work with or without quotes and assumes
> no spaces in the URL:
> 
> $prefix = "http://example.com/";
> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
> "$1$prefix$2$3", $html);
> 
> 
Might need to keep a preceding slash out of there:

$html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
"$1$prefix$2$3", $html);
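
Note that the negative lookahead above only skips links that already start
with $prefix, so an absolute link to some other host would still get the
prefix added. A rough sketch of a callback-based variant that leaves any URL
with a scheme (http:, mailto:, ...), protocol-relative URLs (//host/...) and
bare fragments (#top) alone is below. The function name absolutize_hrefs() is
made up for this example, and like the one-liner it resolves every relative
link against the root of $prefix rather than against the current directory.
Only lightly checked, so treat it as a starting point:

```php
<?php
// Hypothetical helper, not part of PHP: prefixes relative href values.
function absolutize_hrefs(string $html, string $prefix): string
{
    // Group 1: "href=" including surrounding whitespace.
    // Branch reset (?|...) makes group 2 the quote character ('' if unquoted)
    // and group 3 the URL, whichever of the three alternatives matched.
    $pattern = '~((?<![\w-])href\s*=\s*)(?|(")([^"]*)"|(\')([^\']*)\'|()([^\s>"\']+))~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($prefix): string {
        $url = $m[3];

        // Leave empty values, scheme URLs, protocol-relative URLs and fragments untouched.
        if ($url === ''
            || preg_match('~^[a-z][a-z0-9+.\-]*:~i', $url)
            || strncmp($url, '//', 2) === 0
            || $url[0] === '#') {
            return $m[0];
        }

        // Join prefix and path with exactly one slash between them.
        $absolute = rtrim($prefix, '/') . '/' . ltrim($url, '/');

        // Re-emit the attribute with its original quoting style.
        return $m[1] . $m[2] . $absolute . $m[2];
    }, $html);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Usage would be something like:

$html = absolutize_hrefs($html, "http://example.com/");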

-- 
Thanks!
-Shawn
http://www.spidean.com

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

