Re: How do I do count the occurrence of each word?

Marco Behnke <marco@xxxxxxxxxx> · Sun, 19 Aug 2012 07:38:53 +0200



Am 19.08.12 06:59, schrieb tamouse mailing lists:
> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
> <jt.johnston@xxxxxxxxxxxxxx> wrote:
>> I want to parse this text and count the occurrence of each word:
>>
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html;
>> #Can I do this?
>> $stripping = strip_tags($text); #get rid of html
>> $stripping = strtolower($stripping); #put in lowercase
>>
>> ----------------
>> First of all I want to start AFTER the expression "News Releases" and stop
>> BEFORE the next occurrence of "-30-"
>>
>> #This may occur an undetermined number of times on
>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>
>>
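
One way to pull out every section between the two markers is a non-greedy
regular expression. This is only a sketch: extract_sections() is a
hypothetical helper name and the sample string is made up. It also assumes
$text already holds the page contents (for example fetched with
file_get_contents()), since strip_tags() on the bare URL would only see the
URL itself.

```php
<?php
// Hypothetical helper: returns the text of every section that starts after
// $start and ends before the next occurrence of $end.
function extract_sections($text, $start, $end)
{
    // preg_quote() escapes the markers for use inside the pattern.
    // The non-greedy (.*?) stops at the first $end after each $start,
    // and the "s" flag lets "." match newlines as well.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

// Made-up sample standing in for the tag-stripped page text.
$sample = "Menu News Releases fire on Main Street -30- footer "
        . "News Releases theft reported -30- end";

$sections = extract_sections($sample, 'News Releases', '-30-');
print_r($sections);
```

Because preg_match_all() collects every match, this copes with the markers
occurring any number of times; the sections can then be joined with
implode(" ", $sections) before counting.
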
>> ----------------
>> Second, do I put $stripping into an array to separate each word by each
>> space " "?
>>
>> $stripping = implode(" ", $stripping);
>>
>> ----------------
>> Third how do I count the number of occurrences of each word?
>>
>> Sample Output:
>>
>> determined = 4
>> fire = 7
>> patrol = 3
>> theft = 6
>> witness = 1
>> witnessed = 1
>>
>> ----------------
>> <?php
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>> #echo strip_tags($text);
>> #echo "\n";
>> $stripping = strip_tags($text);
>>
>> #Get text between "News Releases" and stop before the next occurrence of
>> "-30-"
>>
>> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
>> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the
>> occurrences of double spaces
>>
>> #$stripping = strtolower($stripping);
>>
>> #Where do I go now?
>> ?>
>>
>>
>> --
>> PHP General Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
> This is usually a first-year CS programming problem (word frequency
> counts) complicated a little bit by needing to extract the text.
> You've started off fine, stripping tags, converting to lower case,
> you'll want to either convert or strip HTML entities as well, deciding
> what you want to do with plurals and words like "you're", "Charlie's",
> "it's", etc, also whether something like RFC822 is a word or not
> (mixed letters and numbers).
>
> When you've arranged all that, splitting on white space is trivial:
>
> $words = preg_split('/[[:space:]]+/',$text);
>
> and then you just run through the words building an associative array
> by incrementing the count of each word as the key to the array:
>
> foreach ($words as $word) {
>     $freq[$word]++;
> }

Please add an existence check to avoid incrementing array keys that are not set:

$freq = array();
foreach ($words as $word) {
  if (array_key_exists($word, $freq)) {
    $freq[$word]++;    // seen before: increment the count
  } else {
    $freq[$word] = 1;  // first occurrence: initialise the count
  }
}
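
As an alternative to the loop, PHP's array_count_values() builds the same
word => count table in one call. A minimal sketch with a made-up sample
string; note that it splits on anything that is not a letter, digit or
apostrophe, which is my assumption about what should count as a word and
differs from the whitespace split quoted above:

```php
<?php
// Made-up sample text; in the real script this would be the stripped page text.
$text = "Fire patrol witnessed the fire; theft, fire & theft.";

// Lowercase first, then split on runs of characters that are not a letter,
// digit or apostrophe. PREG_SPLIT_NO_EMPTY drops empty strings at the edges.
$words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// array_count_values() returns an array of word => number of occurrences.
$freq = array_count_values($words);
print_r($freq);
```
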


>
> For output, you may want to sort the array:
>
> ksort($freq);
>
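
ksort() gives an alphabetical listing like the sample output in the original
question. If the most frequent words should come first instead, arsort()
sorts by count and keeps the keys. A small sketch with made-up counts:

```php
<?php
// Made-up counts, using words from the sample output in the question.
$freq = array('theft' => 6, 'fire' => 7, 'witness' => 1, 'patrol' => 3);

// ksort() orders by key, i.e. alphabetically by word.
ksort($freq);
$lines = array();
foreach ($freq as $word => $count) {
    $lines[] = "$word = $count";
}
echo implode("\n", $lines), "\n";

// arsort() orders by value, highest count first, preserving the keys.
arsort($freq);
$ranked = array_keys($freq);
echo "Most frequent: ", $ranked[0], "\n";
```
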


-- 
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3

Tel.: 0174 / 9722336
e-Mail: marco@xxxxxxxxxx

Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal

http://www.behnke.biz

