Re: How do I count the occurrence of each word?

tamouse mailing lists <tamouse.lists@xxxxxxxxx> · Sun, 19 Aug 2012 01:09:52 -0500

On Sun, Aug 19, 2012 at 12:38 AM, Marco Behnke <marco@xxxxxxxxxx> wrote:
> On 19.08.12 06:59, tamouse mailing lists wrote:
>> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
>> <jt.johnston@xxxxxxxxxxxxxx> wrote:
>>> I want to parse this text and count the occurrence of each word:
>>>
>>> $text = file_get_contents("http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html");
>>> #Can I do this?
>>> $stripping = strip_tags($text); #get rid of html
>>> $stripping = strtolower($stripping); #put in lowercase
>>>
>>> ----------------
>>> First of all I want to start AFTER the expression "News Releases" and stop
>>> BEFORE the next occurrence of "-30-"
>>>
>>> #This may occur an undetermined number of times on
>>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>>
>>>
>>> ----------------
>>> Second, do I put $stripping into an array to separate each word by each
>>> space " "?
>>>
>>> $stripping = implode(" ", $stripping);
>>>
>>> ----------------
>>> Third how do I count the number of occurrences of each word?
>>>
>>> Sample Output:
>>>
>>> determined = 4
>>> fire = 7
>>> patrol = 3
>>> theft = 6
>>> witness = 1
>>> witnessed = 1
>>>
>>> ----------------
>>> <?php
>>> $text = file_get_contents("http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html"); # needs allow_url_fopen
>>> #echo strip_tags($text);
>>> #echo "\n";
>>> $stripping = strip_tags($text);
>>>
>>> #Get text between "News Releases" and stop before the next occurrence of "-30-"
>>>
>>> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
>>> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
>>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the occurrences of double spaces
>>>
>>> #$stripping = strtolower($stripping);
>>>
>>> #Where do I go now?
>>> ?>
>>>
>>>
>>> --
>>> PHP General Mailing List (http://www.php.net/)
>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>
>> This is usually a first-year CS programming problem (word frequency
>> counts), complicated a little by needing to extract the text first.
>> You've started off fine by stripping tags and converting to lower case.
>> You'll also want to convert or strip HTML entities, and decide what to
>> do with plurals and with words like "you're", "Charlie's", and "it's",
>> as well as whether something like RFC822 (mixed letters and numbers)
>> counts as a word.
>>
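One way the clean-up steps above could look, as a rough sketch only. The
helper name normalize_text() and the choices it makes (keep apostrophes,
treat letter/digit mixes like rfc822 as words, ASCII only) are my
assumptions for illustration; adjust them to whatever you decide.

```php
<?php
// Hypothetical helper, not from the original code in this thread.
// Assumption: ASCII-only words; accented letters are dropped by the
// character class below, so French text would need a different pattern.
function normalize_text($html)
{
    $text = strip_tags($html);                               // drop markup
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // &amp; -> &, &#039; -> '
    $text = strtolower($text);
    // Keep letters, digits and apostrophes; any other run becomes one space.
    $text = preg_replace("/[^a-z0-9']+/", ' ', $text);
    return trim($text);
}

echo normalize_text("<p>It's Charlie&#039;s RFC822 &amp; more.</p>"), "\n";
```

With these choices "it's" and "charlie's" stay single words; stripping
the apostrophes instead would be an equally defensible decision.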
>> When you've arranged all that, splitting on white space is trivial:
>>
>> $words = preg_split('/[[:space:]]+/',$text);
>>
>> and then you just run through the words, building an associative array
>> keyed by word and incrementing each word's count:
>>
>> foreach ($words as $word) {
>>     $freq[$word]++;
>> }
>
> Please add an existence check to avoid incrementing array keys that are not set:
>
> foreach ($words as $word) {
>   if (!array_key_exists($word, $freq)) {
>     $freq[$word] = 1;
>   } else {
>     $freq[$word]++;
>   }
> }

Ah, yes, good point -- as written, my code will raise two notices. To
ensure notice-free operation, also "declare" the $freq array before the
foreach loop:

$freq=array();
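
Putting the pieces together, a minimal sketch of the whole counting
step might look like this. The sample text is made up, and I've used
isset() for the existence check here rather than array_key_exists();
either should work for this purpose.

```php
<?php
// Made-up sample input standing in for the cleaned-up page text.
$text  = "fire theft fire patrol theft fire";
$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

$freq = array();                  // declare first: no "undefined variable" notice
foreach ($words as $word) {
    if (!isset($freq[$word])) {   // first sighting: create the key at zero
        $freq[$word] = 0;
    }
    $freq[$word]++;
}

print_r($freq);
```

PHP's built-in array_count_values($words) should give the same counts
in one call, if you'd rather not write the loop yourself.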

>
>
>>
>> For output, you may want to sort the array:
>>
>> ksort($freq);
>>
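
For the output step, here is a small sketch using a few of the counts
from the sample output in the original question. The variable names
$byWord and $byCount are just for illustration. ksort() gives the
alphabetical listing shown in that sample; arsort() is an alternative
if you'd rather see the most frequent words first.

```php
<?php
// A subset of the counts from the sample output in the question.
$freq = array('theft' => 6, 'fire' => 7, 'witness' => 1, 'patrol' => 3);

$byWord = $freq;
ksort($byWord);                     // alphabetical by word (the array key)
foreach ($byWord as $word => $count) {
    echo "$word = $count\n";
}

$byCount = $freq;
arsort($byCount);                   // alternative: highest count first, keys kept
```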
>
>
> --
> Marco Behnke
> Dipl. Informatiker (FH), SAE Audio Engineer Diploma
> Zend Certified Engineer PHP 5.3
>
> Tel.: 0174 / 9722336
> e-Mail: marco@xxxxxxxxxx
>
> Softwaretechnik Behnke
> Heinrich-Heine-Str. 7D
> 21218 Seevetal
>
> http://www.behnke.biz
>
>
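
On the first part of the original question (taking only the text after
"News Releases" and before the next "-30-", possibly several times per
page), one possible approach is a non-greedy regular expression. This
is only a sketch: the sample string is made up, and it assumes the
text has already been stripped of tags and lowercased so the markers
appear exactly as "news releases" and "-30-".

```php
<?php
// Made-up sample standing in for the stripped, lowercased page text.
$text = "menu news releases fire on main street -30- footer "
      . "news releases theft at the mall -30- end";

// Non-greedy (.*?) stops at the first "-30-" after each "news releases";
// the s modifier lets . match newlines as well.
preg_match_all('/news releases(.*?)-30-/s', $text, $matches);

// $matches[1] holds one captured section per occurrence.
$sections = array_map('trim', $matches[1]);
$body     = implode(' ', $sections);

echo $body, "\n";
```

$body can then go through the splitting and counting steps as usual.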
