This is usually a first-year CS programming problem (word frequency
counts), complicated a little by the need to extract the text first.
You've started off fine by stripping tags and converting to lower case.
You'll also want to either convert or strip HTML entities, and decide
what to do with plurals and with words like "you're", "Charlie's" and
"it's". You'll also need to decide whether something like RFC822
(mixed letters and numbers) counts as a word.
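The clean-up steps above might look something like this. It's only a
sketch: the function name normalize_text() is made up for illustration,
and the choices it makes (keeping apostrophes so "you're" stays one
word, keeping digits so RFC822 counts as a word, treating the input as
UTF-8) are assumptions you may well want to change.

```php
<?php
// Hypothetical helper: turn an HTML fragment into lower-case plain text.
function normalize_text(string $html): string
{
    // Drop tags first, then turn entities such as &amp; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Prefer mb_strtolower() (needs the mbstring extension) so accented
    // letters are lower-cased too; fall back to strtolower() otherwise.
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Assumption: keep letters, digits and apostrophes; any run of other
    // characters becomes a single space.
    $text = preg_replace("/[^\\p{L}\\p{N}']+/u", ' ', $text);

    return trim($text);
}

echo normalize_text("<p>You're reading <b>RFC822</b> &amp; Charlie's notes.</p>"), "\n";
// prints: you're reading rfc822 charlie's notes
```

Tags are stripped before entities are decoded so that an encoded
"&lt;b&gt;" in the text is kept as text rather than removed as a tag.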
When you've arranged all that, splitting on white space is trivial:
$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
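and then you just run through the words, building an associative array
keyed by word and incrementing that word's count:
$freq = array();
foreach ($words as $word) {
    // isset() avoids an undefined-index notice the first time a word is seen
    $freq[$word] = isset($freq[$word]) ? $freq[$word] + 1 : 1;
}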
For output, you may want to sort the array alphabetically by word
(ksort() sorts by key):
ksort($freq);
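If you'd rather see the most frequent words first instead of an
alphabetical list, something like the following should work. It is a
sketch that swaps in the built-in array_count_values() for the loop
and arsort() for ksort(); the sample $text is invented for the example
and is assumed to be cleaned up already.

```php
<?php
// Sample input, made up for illustration.
$text = "the cat and the dog and the bird";

// PREG_SPLIT_NO_EMPTY drops empty strings caused by leading or
// trailing white space.
$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// array_count_values() builds the word => count array in one call.
$freq = array_count_values($words);

// arsort() sorts by count, highest first, and keeps the word keys.
arsort($freq);

foreach ($freq as $word => $count) {
    printf("%-10s %d\n", $word, $count);
}
```

Here "the" (3) is listed first and "and" (2) second. The order of
words that share a count is not something to rely on in older PHP
versions.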
That's awesome. Thanks!
Let me start with my first problem:
I want to extract all occurrences of the text after "News Releases" and
before "-30-" on this page:
http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
How do I do that?
Yeah, I am still asking first year questions :)) Every project brings
new challenges.
John
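One possible way to approach the extraction asked about above is a
non-greedy regular expression run with preg_match_all(). This is only a
sketch under assumptions: it takes the two markers literally as
"News Releases" and "-30-", it assumes the page is already in a string
(for example fetched with file_get_contents()), and it uses an invented
sample rather than the real page, whose markup has not been checked
here. The function name extract_releases() is hypothetical.

```php
<?php
// Hypothetical helper: return every stretch of text between the markers.
function extract_releases(string $page): array
{
    // preg_quote() escapes the markers. (.*?) is non-greedy, so each
    // match stops at the nearest "-30-". The s modifier lets . match
    // newlines, so a release can span several lines.
    $pattern = '/' . preg_quote('News Releases', '/')
             . '(.*?)'
             . preg_quote('-30-', '/') . '/s';

    preg_match_all($pattern, $page, $matches);

    // $matches[1] holds the captured text of every occurrence.
    return array_map('trim', $matches[1]);
}

// Invented sample standing in for the real page.
$page = "News Releases\nFirst story.\n-30-\nfiller\n"
      . "News Releases\nSecond story.\n-30-\n";

print_r(extract_releases($page));
```

If the markers sit inside HTML tags on the real page, it is probably
easier to run strip_tags() on the page first and match on the plain
text.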
--
PHP General Mailing List (http://www.php.net/)