Re: Another parse problem

"Daniel P. Brown" <daniel.brown@xxxxxxxxxxxx> · Mon, 14 Jun 2010 10:08:28 -0400

On Mon, Jun 14, 2010 at 09:14, tedd <tedd@xxxxxxxxxxxx> wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider --
> given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net, and
> .org, but the finished result should be an array containing all the
> domain-names found in a text file.

<?php
$text =<<<TXT
	To test example.com and www.php.net and other domain names
such as january.pilotpig.net and ca2.php.parasane.net, we need a
reliable method of checking.  We don't want to match on regular
periods, nor on the 2.2million or 2.2 million or just 2,200,000
other potential matches. And not when we are double-spacing or
single-spacing, just when oidk.net and similar domains are found.
We'll match hyphen domains like l-i-e.com, but not fake_underscored_domain.net.
We also want to match http://-fronted domains like http://php1.net/,
which also contains a number.  If we wanted to match domains plus
paths, but there was no leading http:// to indicate that it should
be a URL, we could extend this to grab things like www.facebook.com/parasane,
so long as we don't ignore the rare one-character SLDs like x.com,
as well as the domains in email addresses like danbrown@xxxxxxx
So if everything works as expected, we should see eleven domains
matched here, because ccTLDs like guthr.ie should be matched as well.

TXT;

/**
 * $fromText is the text to scan; it can come from
 * file_get_contents() or a similar function.  Pass anything
 * other than false as $fullLink to enable link-matching,
 * which returns only link-like domains with paths attached.
 */
function extract_domains($fromText,$fullLink=false) {

	// If we only want to match the domain names.  The "i" modifier makes
	// the match case-insensitive, since domain names are not case-sensitive.
	if ($fullLink === false) {
		preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/i',$fromText,$matches);
		return $matches[1];
	}

	// If we want to match only domain names with trailing paths.
	preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/i',$fromText,$matches);
	return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;

echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));

echo PHP_EOL;

echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));

echo "</pre>".PHP_EOL;
?>
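
A possible variation, in case the result should be limited to the popular
TLDs mentioned in the original question and should not list the same domain
twice.  This is only a sketch: the function name extract_domains_by_tld()
and its $tlds parameter are hypothetical and not part of the script above,
and the pattern is merely modelled on the domain-only one.

```php
<?php
/**
 * Hypothetical helper: return the unique domain names in $fromText
 * that end in one of the given TLDs.  The name and the parameters
 * are illustrative assumptions, not part of the original script.
 */
function extract_domains_by_tld($fromText, array $tlds = array('com', 'net', 'org')) {

	// Escape each TLD for use in the pattern and join them with "|".
	$alternation = implode('|', array_map(function ($tld) {
		return preg_quote($tld, '/');
	}, $tlds));

	// Modelled on the domain-only pattern, but with a fixed TLD list
	// and the "i" modifier so that Example.COM is matched as well.
	$pattern = '/\b([a-z0-9\-\.]+\.(?:'.$alternation.'))\b/i';
	preg_match_all($pattern,$fromText,$matches);

	// Lower-case everything, drop duplicates, and reindex the array.
	return array_values(array_unique(array_map('strtolower',$matches[1])));
}

// Demo
$sample = "Visit Example.COM, www.php.net and example.com, "
        . "but skip guthr.ie and 2.2million.";
print_r(extract_domains_by_tld($sample));
```

With the sample string, the result should contain only example.com and
www.php.net, because the duplicate is folded together after lower-casing
and guthr.ie is outside the default TLD list.  Passing a different array
as the second argument would widen or narrow the list.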

-- 
</Daniel P. Brown>
daniel.brown@xxxxxxxxxxxx || danbrown@xxxxxxx
http://www.parasane.net/ || http://www.pilotpig.net/

-- 
PHP General Mailing List (http://www.php.net/)