Re: what would a c extension buy me

joseph <kmh496@xxxxxxxxxx> · Mon, 13 Mar 2006 20:34:33 +0900

fellow php programmers, 

sorry to you two who were kind enough to take the time to attempt to
answer my question.  i thought it was enough to say "i had 550 lines
with a bunch of sql calls and loops and big data structures"

to help you make a better assessment i will summarize in more detail
now.

for my biggest file  (  

http://www.myowndictionary.com/index.php?old_cfile=rss&cfile=rss&rss_id=198&cid=242&from_lang=korean

) i get this:
root@www mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out

[root@www mysql]# cat out | wc -l
15910    <<< ----------- that's line count
[root@www mysql]# head out
                   9795 Query       select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '운전할 %'  order by wordsize desc
                   9795 Query       select
word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
like '운전할'  order by wordsize desc
                   9795 Query       select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '  %'  order by wordsize desc
                   9795 Query       select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '수 %'  order by wordsize desc
                   9795 Query       select
word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
like '수'  order by wordsize desc

takes 6 minutes

[joe@www current]$ date
2006. 03. 13. (월) 20:25:16 KST
[joe@www current]$ date
2006. 03. 13. (월) 20:31:37 KST
[joe@www current]$

in general, you will see what is involved in processing each token of
text and sending that token through a match against databases .... and
processing results:

a summary of the code is:
mainLoop starts, it breaks some text into tokens using my_mb_preg_split.
(line 3)  
the reg exp is like 
$split_pattern =   "/((?:[\.!\s\?,:-]|\\\"는)+)/";
mb_preg_split i'm sure is broken, but haven't prepared a legal case at
this time  ;)

then, it sends each token to CallmatchThis (line 14) which calls
matchThis (line 27 - 47 below)  
matchThis may be called twice (2 sql queeries)
(line 51) select * where word = '$token' and another  (take that, but if
it's not there .... issue the next sql )
(line 55) select * where word like '$token%';

those can result sometimes in arrays from mysql of ~ what 100 words and
6 fields each word .... for those queeries.

then all that array is narrowed down by finding the longest word (yes, a
field in the DB is pre-set with that value)  .... there is an algo
involved, i cannot simply select the longest match, because i am also
selecting against not just that single token but against the following
tokens as well, to stop getting a match for "hair" if "hair of the dog
that bit you" is in the databse, i want the longer match.  that was a
necessary embellishment.

the array of arrays returned from matchThis are processed again by
MainLoop  at line 16 to produce a javascript overlib for every word.

as for indexes, i only search against word, so i have an index for that.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------+
| korean_english | CREATE TABLE `korean_english` (
  `wordid` int(11) NOT NULL auto_increment,
  `word` varchar(130) default NULL,
  `syn` varchar(190) default NULL,
  `def` blob,
  `posn` int(2) default '1',
  `pos` varchar(13) default '1',
  `submitter` varchar(25) default NULL,
  `doe` datetime NOT NULL default '0000-00-00 00:00:00',
  `wordsize` tinyint(3) unsigned default NULL,
  PRIMARY KEY  (`wordid`),
  KEY `word_idx` (`word`),
  KEY `wordid_idx` (`wordid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------

     1 function MainLoop()
      2
      3 -> $text_wordArr = my_mb_preg_split($subject,
$utf8_split_pattern,$debug_my_mb);
      4     while (!empty($text_word) ) {
      5         // somehow this allows matches for 'in the bag' (idiom)
by some algorithm or other i have
      6         if (preg_match($split_pattern,$text_word)
      7                 &&  (!(preg_match("/^$one_word_matches$/",
$text_wordArr[$i-2][0]) && ($text_wordArr[$i-1][0]==' '))) // gobble up
the space follow        ing those in such list, for previuos one
      8                 && (!(preg_match("/^$one_word_matches$/",
$text_wordArr[$i-3][0])) && ($text_wordArr[$i-2][0]==' '))) {   // try
to limit it to jus        t blanks and quotes, other such garbage for
previous TWO
      9         } else if ( preg_match("/^$one_word_matches$/i",
$text_word)) { // case insensitive.  this eliminates searching for the%
instead of "the One        %"
     10         } else {  // else do matching and processing as normal
     11             // section to match against this word and its array
of
     12             // all possible grammatical variations on the form
of the word
     13             // start with the longest first
     14             $matchArr = callMatchThis($db, $text_word,
$match_count,$subject,$debug_matchThis_detail,$debug_matchThis_general,
$debug_matchThis_dbArr        ay, $debug_array_of_possibles,
$debug_matchThis_great_detail);
     15             // make the overlib function
     16             $sideviewStr =
dict_get_sideview($sideview_word_to_replace, $matchArr,
$debug_sideview);   // overlib stuff, all the href tag
     17 }}}}
     18
     19 // end MainLoop function
     20
     21
     22 // find possible matches for submitted query
     23 // run query, get db resultArray and match for longest match
length
     24 // against $subject
     25 // return best match in array form as $matchArr
     26 // vary text_word as well as db_word trying to match everything
possible (by max length order, of course)
     27 function matchThis(&$db, $query_part1, $query_part2,$text_word,&
$match_count,$subject,$debug_matchThis_detail,$debug_matchThis_general,
$debug_match        This_dbArray, $debug_array_of_possibles,
$debug_matchThis_great_detail ) {
     28     for ($idx_outer =0,$text_word_variation =
$text_word_variationArr[$idx_outer]; $idx_outer <
count($text_word_variationArr) && !$major_break; $i        dx_outer++) {
     29
     30        $text_word_variation =
$text_word_variationArr[$idx_outer];
     31         $query = $query_part1 . $text_word_variation .
$query_part2;
     32         // run a query on each variation
     33         $dbTWVArr = db_getAssoc($db, $query);  // gives an array
of arrays
     34         while ((list($db_word, $valArr) = each($dbTWVArr)) && !
$major_break) {  // key is a word, $val is an array
     35             if ($debug_matchThis_detail) echo "starting search
for db_word***".$db_word."***on
subject***".substr($subject,0,13)."***<br>";
     36             if (preg_match("/^$db_word_match/i",$subject)) {  //
check for match on this possibly 1+ word "db_word" on all remaining text
     37                 for ($index=0; $index < count($valArr); $index
++) {
     38                     // code to select longest db_word out of
those words which do match
     39                     $matchArr[$match_count]['text_word'] =
$text_word;  // from way up on top!! as arg to function,the original
word from the text
     40                     $matchArr[$match_count]['db_word'] =
$db_word; //
     41                     // get other stuff from table as well
     42                     $matchArr[$match_count]['word_id'] =
$valArr[$index][1];
     43                     $matchArr[$match_count]['pos'] =
$valArr[$index][2];
     44                     $matchArr[$match_count]['pos_number'] =
$valArr[$index][3];
     45                 }
     46             }
     47     return $matchArr;
     48 }
     49
     50 function callMatchThis(&$db, $text_word,&$match_count,$subject,
$debug_matchThis_detail,$debug_matchThis_general,
$debug_matchThis_dbArray, $debug_ar        ray_of_possibles,
$debug_matchThis_great_detail) {
     51     $query_part1 = "select word,def,wordid,pos,posn,wordsize,syn
from  $def_table  where word like '";
     52         $query_part2 = " %'  order by wordsize desc";
     53         $matchArr = matchThis($db, $query_part1, $query_part2,
$text_word,$match_count,$subject,$debug_matchThis_detail,
$debug_matchThis_general,$de        bug_matchThis_dbArray,
$debug_array_of_possibles,$debug_matchThis_great_detail);
     54     if (!$match_count) {
     55             $query_part1 = "select
word,def,wordid,pos,posn,wordsize,syn from  $def_table where
     56             $query_part2 = "'  order by wordsize desc ";
     57             $matchArr = matchThis($db, $query_part1,
$query_part2,$text_word,$match_count,$subject,$debug_matchThis_detail,
$debug_matchThis_general        ,$debug_matchThis_dbArray,
$debug_array_of_possibles,$debug_matchThis_great_detail);
     58             if ($debug_matchThis_general) {  echo "after
matchThis#2**match_count***".$match_count."<br>";   }
     59         }
     60
     61 }
     62
     63 // unrelated function which makes the javascript overlib
     64
     65 function dict_get_sideview($word_in, $defArr,$debug_sideview=0)
{
     66 }
     67
     68

2006-03-13 (월), 10:16 +0100, Jochem Maas 쓰시길:
> joseph wrote:
> > hi,
> > my site is www.myowndictionary.com
> > i use javascript to create definitions for words (from open source
> > dictionaries) as pop-ups and hook word-lists for vocabulary study with
> > that.  
> > i can now parse the html from rss feeds and match only text of interest.
> > i just wrote my own multibyte html parser by combining various other
> > code i already had on hand for the parsing of multibyte text anyhow.  
> > but, that's not the slowest part.  that takes maybe 3 seconds to get the
> > page, 3 seconds to parse it, but it sometimes takes 5 minutes for the
> > next step on huge rss files.
> > (I know the time because they are cached)
> > that last step...
> > The slow part that's left is the dictionary searching itself.  It's the
> > parsing of complicated php code -- 550 lines of complex data structure
> > loops, function calls, looping through arrays to make sql queeries --
> > lots of sql queeries.... 
> > the seconds ~ 5 minute it takes it all takes so much time.  
> 
> use the 'EXPLAIN' syntax of MySQL to figure out if there are any bottlenecks
> related to crap or 'missing' indexes with regard to the sql queries you mention
> above. - chances are adding a couple of well placed indexes will probably
> shave massive ammounts from the process time.
> 
> > 
> > i want to ask for opinion about:
> > 1) rewriting the php as C extension.  Would it slow my down to the below
> > 15 second range?
> 
> How Long is a chinaman. How Long is a chinaman?
> yes the C extension would be much faster (assuming you wrote it correctly -
> but it's much harder to write and maintain)
> 
> > 2) does mysql have server-side functions yet?
> 
> there are stored procedures in mysql5.
> 
> > 3) if i moved the php code off mysql to postgresql and wrote server-side
> > functions (which i've been trained to do), how much of a time
> > improvement would we be talking about?
> 
> how on gods' (or whoever runs the place these days) earth can we tell
> whether an unwritten stored procedure in postgres would be faster or slower
> than an unknown (to us) piece of php code?
> 
> that said I doubt postgres will beat mysql in terms of pure speed assuming
> you db schema is sound (indexes are good, etc).
> 
> > 
> > please send me your input!  
> > i am at a growing stage and need direction here.  
> > because any of those steps will require considerable time.  
> 
> it's monday morning here too ;-)
> 
> interesting language tool your building, you might consider repackaging it
> in a modular form so that other people could implement in their own site.
> (you do call it 'open source')
> 
> rgds,
> Jochem
> 
> > 
> > thank you.
> > 
> > joseph.
> > 
> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php