On Sun, Aug 19, 2012 at 12:38 AM, Marco Behnke <marco@xxxxxxxxxx> wrote:
> On 19.08.12 06:59, tamouse mailing lists wrote:
>> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
>> <jt.johnston@xxxxxxxxxxxxxx> wrote:
>>> I want to parse this text and count the occurrence of each word:
>>>
>>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html;
>>> #Can I do this?
>>> $stripping = strip_tags($text); #get rid of html
>>> $stripping = strtolower($stripping); #put in lowercase
>>>
>>> ----------------
>>> First of all I want to start AFTER the expression "News Releases" and stop
>>> BEFORE the next occurrence of "-30-"
>>>
>>> #This may occur an undetermined number of times on
>>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>>
>>> ----------------
>>> Second, do I put $stripping into an array to separate each word by each
>>> space " "?
>>>
>>> $stripping = implode(" ", $stripping);
>>>
>>> ----------------
>>> Third how do I count the number of occurrences of each word?
>>>
>>> Sample Output:
>>>
>>> determined = 4
>>> fire = 7
>>> patrol = 3
>>> theft = 6
>>> witness = 1
>>> witnessed = 1
>>>
>>> ----------------
>>> <?php
>>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>> #echo strip_tags($text);
>>> #echo "\n";
>>> $stripping = strip_tags($text);
>>>
>>> #Get text between "News Releases" and stop before the next occurrence of
>>> "-30-"
>>>
>>> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
>>> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
>>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the
>>> occurrences of double spaces
>>>
>>> #$stripping = strtolower($stripping);
>>>
>>> #Where do I go now?
>>> ?>
>>>
>>> --
>>> PHP General Mailing List (http://www.php.net/)
>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>
>> This is usually a first-year CS programming problem (word frequency
>> counts), complicated a little by the need to extract the text.
>> You've started off fine by stripping tags and converting to lower case.
>> You'll also want to either convert or strip HTML entities, and decide
>> what to do with plurals and words like "you're", "Charlie's",
>> "it's", etc., as well as whether something like RFC822 is a word or not
>> (mixed letters and numbers).
>>
>> When you've arranged all that, splitting on white space is trivial:
>>
>> $words = preg_split('/[[:space:]]+/',$text);
>>
>> and then you just run through the words, building an associative array
>> by incrementing the count of each word as the key to the array:
>>
>> foreach ($words as $word) {
>>     $freq[$word]++;
>> }
>
> Please add an existence check to avoid incrementing array keys that are
> not set:
>
> foreach ($words as $word) {
>     if (!array_key_exists($word, $freq)) {
>         $freq[$word] = 1;
>     } else {
>         $freq[$word]++;
>     }
> }

Ah, yes, good point -- as written, my code will raise two notices.
In addition, "declare" the $freq array before the foreach loop to
ensure notice-free operation:

$freq = array();

>>
>> For output, you may want to sort the array:
>>
>> ksort($freq);
>>
>
> --
> Marco Behnke
> Dipl. Informatiker (FH), SAE Audio Engineer Diploma
> Zend Certified Engineer PHP 5.3
>
> Tel.: 0174 / 9722336
> e-Mail: marco@xxxxxxxxxx
>
> Softwaretechnik Behnke
> Heinrich-Heine-Str. 7D
> 21218 Seevetal
>
> http://www.behnke.biz
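
Putting the steps from this thread together, a minimal sketch might look like the following. The function names extract_releases() and word_frequencies() are hypothetical, made up for illustration. The sketch assumes that each block of interest starts at an occurrence of "News Releases" and ends at the nearest following "-30-", which is only one reading of the original question. Punctuation stays attached to words, since that decision is left open above.

```php
<?php
// Hypothetical helper (not from the thread): return every stretch of text
// that follows "News Releases" and ends before the nearest "-30-".
function extract_releases($html)
{
    // Drop the markup, then convert entities such as &amp; to characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Non-greedy (.*?) stops at the nearest "-30-"; the "s" modifier
    // lets "." match newlines so a block may span several lines.
    preg_match_all('/News Releases(.*?)-30-/s', $text, $matches);

    return $matches[1];
}

// Hypothetical helper (not from the thread): count each word's occurrences.
function word_frequencies($text)
{
    // Split on runs of whitespace and discard empty pieces.
    $words = preg_split('/[[:space:]]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $freq = array(); // "declare" the array first to stay notice-free
    foreach ($words as $word) {
        if (isset($freq[$word])) {
            $freq[$word]++;
        } else {
            $freq[$word] = 1;
        }
    }

    ksort($freq); // alphabetical order for output
    return $freq;
}

// Made-up sample input; for the real page the HTML would have to be
// fetched first, for example with file_get_contents().
$html = "<h1>News Releases</h1>\n<p>Fire patrol saw the fire</p>\n-30-";

foreach (extract_releases($html) as $release) {
    foreach (word_frequencies($release) as $word => $count) {
        echo "$word = $count\n";
    }
}
```

Here isset() plays the same role as the array_key_exists() check suggested above; either one avoids incrementing a key that does not exist yet.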