On 12-08-21 12:32 AM, John Taylor-Johnston wrote:
This is usually a first-year CS programming problem (word frequency
counts) complicated a little bit by needing to extract the text.
You've started off fine by stripping tags and converting to lower case.
You'll also want to either convert or strip HTML entities, and decide
what to do with plurals and words like "you're", "Charlie's", and
"it's", as well as whether something like RFC822 (mixed letters and
numbers) counts as a word.
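
A rough sketch of that clean-up step, in case it helps. The function
name normalize_text() and the "keep letters, digits and apostrophes"
policy are assumptions made for illustration, not something decided in
this thread:

```php
<?php
// Hypothetical helper: normalize_text() is not from the original thread.
function normalize_text($html)
{
    // Remove markup, then turn entities such as &amp; into characters.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);

    // Assumed policy: keep ASCII letters, digits and apostrophes; any
    // other run of characters becomes a single space. "rfc822" and
    // "you're" therefore survive as single words.
    $text = preg_replace("/[^a-z0-9']+/", ' ', $text);

    return trim($text);
}

echo normalize_text("<p>You're reading <b>RFC822</b> &amp; more.</p>"), "\n";
```

Note that this policy is ASCII-only, so accented letters would be
dropped; for French text you would probably want mb_strtolower() and a
Unicode-aware pattern instead.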
When you've arranged all that, splitting on white space is trivial:
    $words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
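and then you just run through the words, building an associative array
that uses each word as the key and increments its count:
    $freq = array();
    foreach ($words as $word) {
        // isset() avoids an undefined-index notice on a word's first sighting.
        $freq[$word] = isset($freq[$word]) ? $freq[$word] + 1 : 1;
    }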
For output, you may want to sort the array, for example alphabetically
by word with ksort() (arsort() would order by count instead):
    ksort($freq);
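
Putting the split, count and sort steps together might look roughly
like this. It is only a sketch: count_words() is a made-up name, it
assumes the text has already been cleaned up as described above, and it
swaps the foreach loop for the built-in array_count_values(), which
does the same tally:

```php
<?php
// Hypothetical helper, not from the thread: tallies the words in text
// that is assumed to be already lower-cased and stripped of markup.
function count_words($text)
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading or
    // trailing whitespace would otherwise produce.
    $words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Built-in equivalent of the loop that increments $freq[$word].
    $freq = array_count_values($words);

    // Order by count, highest first, keeping the word => count pairs.
    arsort($freq);

    return $freq;
}

$freq = count_words(" the cat and the hat and the bat ");
foreach ($freq as $word => $count) {
    echo $word, ': ', $count, "\n";
}
```

Use ksort() in place of arsort() if you would rather list the words
alphabetically.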
That's awesome. Thanks!
Let me start with my first problem:
I want to extract all occurrences of text after "News Releases" and
before "-30-".
http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
How do I do that?
Yeah, I am still asking first-year questions :)) Every project brings
new challenges.
You can use strpos() to find the location of "News Releases", then use
strpos() again to find the location of "-- 30 --". For the second call,
feed strpos() an offset (specifically, the position found for "News
Releases"). This ensures that you only match "-- 30 --" when it comes
after "News Releases". Once you have your start and end offsets, you
can use substr() to extract the interesting excerpt; note that substr()
takes a length, so pass the end offset minus the start offset. With the
excerpt in hand, you can go back to JTJ's recommendation above.
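
Something along these lines should cover the "all occurrences" part.
Treat it as a sketch: extract_between() is a made-up name, and the
marker strings are passed in as arguments because the exact text on
the page (for instance "-30-" versus "-- 30 --") is an assumption here:

```php
<?php
// Hypothetical helper, not from the thread. Returns every excerpt that
// sits between a start marker and the next end marker after it.
function extract_between($text, $startMarker, $endMarker)
{
    $excerpts = array();
    $offset = 0;

    // Compare with !== false: a match at position 0 is still a match.
    while (($start = strpos($text, $startMarker, $offset)) !== false) {
        // Begin just past the start marker so it is not in the excerpt.
        $start += strlen($startMarker);

        // Only look for the end marker after the start marker.
        $end = strpos($text, $endMarker, $start);
        if ($end === false) {
            break; // start marker without a closing marker: stop here
        }

        // substr() takes a length, not an end position.
        $excerpts[] = trim(substr($text, $start, $end - $start));

        // Resume the search after this end marker.
        $offset = $end + strlen($endMarker);
    }

    return $excerpts;
}

// Made-up sample text standing in for the real page.
$page = "News Releases First story. -30- filler News Releases Second story. -30-";
$found = extract_between($page, 'News Releases', '-30-');
foreach ($found as $excerpt) {
    echo $excerpt, "\n";
}
```

Run it on the text after strip_tags(), or on the raw HTML, depending on
where the markers are easier to match.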
Cheers,
Rob.
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php