On 19.08.12 06:59, tamouse mailing lists wrote:
> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
> <jt.johnston@xxxxxxxxxxxxxx> wrote:
>> I want to parse this text and count the occurrence of each word:
>>
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html;
>> #Can I do this?
>> $stripping = strip_tags($text); #get rid of html
>> $stripping = strtolower($stripping); #put in lowercase
>>
>> ----------------
>> First of all I want to start AFTER the expression "News Releases" and stop
>> BEFORE the next occurrence of "-30-"
>>
>> #This may occur an undetermined number of times on
>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>
>>
>> ----------------
>> Second, do I put $stripping into an array to separate each word by each
>> space " "?
>>
>> $stripping = implode(" ", $stripping);
>>
>> ----------------
>> Third how do I count the number of occurrences of each word?
>>
>> Sample Output:
>>
>> determined = 4
>> fire = 7
>> patrol = 3
>> theft = 6
>> witness = 1
>> witnessed = 1
>>
>> ----------------
>> <?php
>> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>> #echo strip_tags($text);
>> #echo "\n";
>> $stripping = strip_tags($text);
>>
>> #Get text between "News Releases" and stop before the next occurrence of
>> "-30-"
>>
>> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
>> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the
>> occurrences of double spaces
>>
>> #$stripping = strtolower($stripping);
>>
>> #Where do I go now?
>> ?>
>>
>>
>> --
>> PHP General Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
> This is usually a first-year CS programming problem (word frequency
> counts) complicated a little bit by needing to extract the text.
> You've started off fine, stripping tags, converting to lower case,
> you'll want to either convert or strip HTML entities as well, deciding
> what you want to do with plurals and words like "you're", "Charlie's",
> "it's", etc, also whether something like RFC822 is a word or not
> (mixed letters and numbers).
>
> When you've arranged all that, splitting on white space is trivial:
>
> $words = preg_split('/[[:space:]]+/',$text);
>
> and then you just run through the words building an associative array
> by incrementing the count of each word as the key to the array:
>
> foreach ($words as $word) {
>     $freq[$word]++;
> }

Please add an existence check to avoid incrementing array keys that are
not set yet:

$freq = array(); // start with an empty array so the first lookup is defined
foreach ($words as $word) {
    if (array_key_exists($word, $freq)) {
        // word seen before: bump its count
        $freq[$word]++;
    } else {
        // first occurrence: start counting at 1
        $freq[$word] = 1;
    }
}

>
> For output, you may want to sort the array:
>
> ksort($freq);
>

-- 
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3

Tel.: 0174 / 9722336
e-Mail: marco@xxxxxxxxxx

Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal

http://www.behnke.biz
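
For reference, the steps discussed in this thread (strip the tags, decode entities, take only the text between "News Releases" and the next "-30-", split into words, count with an existence check, sort) could be combined roughly as below. This is only a sketch under assumptions: count_words_in_releases() is a hypothetical helper name, the sample HTML is made up rather than taken from the page in the question, and the split pattern treats only ASCII letters, digits and apostrophes as word characters, which is one possible answer to the "what is a word" question raised above.

```php
<?php
// Sketch only: count_words_in_releases() is a hypothetical helper,
// and $sample below is invented test data, not the real page.

function count_words_in_releases($html)
{
    // Put a space in front of every tag so that words in adjacent
    // elements do not run together once the tags are stripped.
    $html = str_replace('<', ' <', $html);

    // Remove the markup, then turn entities such as &amp; into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Capture every span that starts after "News Releases" and ends
    // before the next "-30-". The /s modifier lets "." match newlines,
    // and the lazy ".*?" stops at the first "-30-" it reaches.
    preg_match_all('/News Releases(.*?)-30-/s', $text, $matches);

    $freq = array();
    foreach ($matches[1] as $section) {
        // Lowercase, then split on runs of anything that is not an
        // ASCII letter, digit or apostrophe (an assumption about what
        // counts as a word); PREG_SPLIT_NO_EMPTY drops empty pieces.
        $words = preg_split(
            "/[^a-z0-9']+/",
            strtolower($section),
            -1,
            PREG_SPLIT_NO_EMPTY
        );

        foreach ($words as $word) {
            if (array_key_exists($word, $freq)) {
                $freq[$word]++;
            } else {
                $freq[$word] = 1;
            }
        }
    }

    // Sort alphabetically by word for output.
    ksort($freq);

    return $freq;
}

// Made-up sample with two releases and some text in between.
$sample = '<h1>News Releases</h1><p>Fire patrol saw a fire.</p><p>-30-</p>'
        . '<p>ignored</p>'
        . '<h1>News Releases</h1><p>Theft &amp; fire.</p><p>-30-</p>';

$freq = count_words_in_releases($sample);

foreach ($freq as $word => $count) {
    echo $word . ' = ' . $count . "\n";
}
```

With this sample, "fire" is counted three times across the two releases, and the word "ignored" is skipped because it sits between a "-30-" and the next "News Releases". If no special handling per word is needed, PHP's built-in array_count_values($words) does the same counting as the inner loop in one call.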