On Mon, Jun 14, 2010 at 09:14, tedd <tedd@xxxxxxxxxxxx> wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider --
> given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net, and
> .org, but the finished result should be an array containing all the
> domain-names found in a text file.

<?php

$text =<<<TXT
To test example.com and www.php.net and other domain names
such as january.pilotpig.net and ca2.php.parasane.net, we
need a reliable method of checking. We don't want to match
on regular periods, nor on the 2.2million or 2.2 million or
just 2,200,000 other potential matches. And not when we are
double-spacing or single-spacing, just when oidk.net and
similar domains are found. We'll match hyphen domains like
l-i-e.com, but not fake_underscored_domain.net. We also want
to match http://-fronted domains like http://php1.net/, which
also contains a number. If we wanted to match domains plus
paths, but there was no leading http:// to indicate that it
should be a URL, we could extend this to grab things like
www.facebook.com/parasane, so long as we don't ignore the
rare one-character SLDs like x.com, as well as the domains
in email addresses like danbrown@xxxxxxx
So if everything works as expected, we should see eleven
domains matched here, because ccTLDs like guthr.ie should be
matched as well.
TXT;

/**
 * $fromText can be defined via a file_get_contents() or
 * similar function, while $fullLink should be anything
 * but false to enable link-matching, which will return
 * only link-like domains with paths attached.
 */
function extract_domains($fromText,$fullLink=false) {

    // If we only want to match the domain names.
    if ($fullLink === false) {
        preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/',$fromText,$matches);
        return $matches[1];
    }

    // If we want to match just domain names with trailing paths.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/',$fromText,$matches);
    return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;
echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));
echo PHP_EOL;
echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));
echo "</pre>".PHP_EOL;
?>

-- 
</Daniel P. Brown>
daniel.brown@xxxxxxxxxxxx || danbrown@xxxxxxx
http://www.parasane.net/ || http://www.pilotpig.net/