Re: Extract links from strings

Jim Lucas <lists@xxxxxxxxx> · Wed, 23 Sep 2009 09:58:56 -0700

Philip Thompson wrote:
> On Sep 21, 2009, at 6:20 PM, Jim Lucas wrote:
> 
>> Jim Lucas wrote:
>>> Jônatas Zechim wrote:
>>>> Hi there, I have the following strings:
>>>>
>>>> $string1 = 'Lorem ipsum dolor http://site.com sit amet';
>>>> $string2 = 'Lorem ipsum dolor http://www.site.com/ sit amet';
>>>> $string3 = 'Lorem ipsum dolor http://www.site.net sit amet';
>>>>
>>>> How can I extract the URL from these strings?
>>>> They can be [http:// + url] or [www. + url].
>>>>
>>>> Zechim
>>>>
>>>>
>>>
>>> Something like this should work for you.
>>>
>>> <plaintext><?php
>>>
>>> $urls[] = 'Lorem ipsum dolor http://site.com sit amet';
>>> $urls[] = 'Lorem ipsum dolor https://www.site.com/ sit amet';
>>> $urls[] = 'Lorem ipsum dolor www.site1.net sit amet';
>>> $urls[] = 'Lorem ipsum dolor www site2.net sit amet';
>>>
>>> foreach ( $urls AS $url ) {
>>>     if ( preg_match('%((https?://|www\.)[^\s]+)%', $url, $m) ) {
>>>         print_r($m);
>>>     }
>>> }
>>>
>>> ?>
>>>
>>
>> Actually, try this.  It seems to work a little better.
>>
>> <plaintext><?php
>>
>> $urls[] = 'Lorem ipsum dolor http://site.com sit amet';
>> $urls[] = 'Lorem ipsum dolor https://www.site.com/ or http://www.site2.com/';
>> $urls[] = 'Lorem ipsum dolor www.site1.net sit amet';
>> $urls[] = 'Lorem ipsum dolor www site2.net sit amet';
>>
>> foreach ( $urls AS $url ) {
>>     if ( preg_match_all(    '%(https?://[^\s]+|www\.[^\s]+)%',
>>                 $url,
>>                 $m,
>>                 (PREG_SET_ORDER | PREG_OFFSET_CAPTURE)
>>     ) ) {
>>         print_r($m);
>>     }
>> }
>>
>> ?>
> 
> What if the sub domain was not 'www'?
> 
> http://no-www.org/
> 

Well, if it had the http:// at the beginning, it would be found.

But a bare somedomain.no-www.org would not be matched correctly.

And if they only had no-www.org, it would only find www.org.

So I guess the pattern needs to look at the characters before the www\. part
and include them in the URL as well.

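This should work. Note the [^/\s]* placed before the www\. portion. It is *
rather than + so that a plain www.site1.net, with nothing in front of the
www., still matches.

if ( preg_match_all(    '%(https?://[^\s]+|[^/\s]*www\.[^\s]+)%',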

This should catch example.www.org and no-www.org now.
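
For anyone who wants to try it, here is a minimal self-contained sketch of the
same idea. The function name extract_links() and the sample strings are just
placeholders of mine, and the prefix is written as [^/\s]* (zero or more) so
that a bare www.site1.net keeps matching.

```php
<?php
// Hypothetical helper (not from the earlier posts): returns every
// URL-like token found in $text.
function extract_links(string $text): array
{
    // Either an explicit http:// or https:// scheme, or any run of
    // non-space, non-slash characters (possibly empty) followed by "www.".
    $pattern = '%https?://[^\s]+|[^/\s]*www\.[^\s]+%';

    preg_match_all($pattern, $text, $m);

    // $m[0] holds the full matches; it is an empty array when nothing matched.
    return $m[0];
}

$samples = [
    'Lorem ipsum dolor http://site.com sit amet',
    'Lorem ipsum dolor www.site1.net sit amet',
    'Lorem ipsum dolor no-www.org sit amet',
    'Lorem ipsum dolor example.www.org sit amet',
    'Lorem ipsum dolor www site2.net sit amet',
];

foreach ($samples as $sample) {
    print_r(extract_links($sample));
}
```

With these samples, no-www.org and example.www.org should come back whole, and
the "www site2.net" string should give an empty array. Trailing punctuation (a
full stop right after the URL, say) is still swallowed by [^\s]+, so treat this
as a starting point only.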

You could get into the business of trying to match the TLD, but that would be a
PITA to keep updated.
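
To give a feel for what that would involve, here is a rough sketch of the
TLD-based approach. Everything in it is an assumption on my part: the function
name extract_links_by_tld() is made up, and the three-entry $tlds list is a
placeholder that a real version would have to keep in sync with the actual
list of TLDs.

```php
<?php
// Hypothetical, hand-maintained list. This is the part that would be
// painful to keep updated.
$tlds = ['com', 'net', 'org'];

function extract_links_by_tld(string $text, array $tlds): array
{
    // Quote each TLD for use inside a %-delimited pattern, then join
    // them as alternatives.
    $alternatives = implode('|', array_map(function ($tld) {
        return preg_quote($tld, '%');
    }, $tlds));

    // Optional scheme, one or more dot-terminated labels, a known TLD,
    // then an optional path that runs to the next whitespace.
    $pattern = '%\b(?:https?://)?(?:[a-z0-9-]+\.)+(?:' . $alternatives . ')\b(?:/[^\s]*)?%i';

    preg_match_all($pattern, $text, $m);

    return $m[0];
}

print_r(extract_links_by_tld('Lorem ipsum dolor no-www.org sit amet', $tlds));
```

This no longer depends on a www. prefix at all, but it only recognises the TLDs
in the list, and something like foo.com.au would be cut off at foo.com.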

> Cheers,
> ~Philip
> 
> 

-- 
PHP General Mailing List (http://www.php.net/)