Re: case and accent - insensitive regular expression?

"Andrew Ballard" <aballard@xxxxxxxxx> · Tue, 15 Jul 2008 10:15:32 -0400



On Tue, Jul 15, 2008 at 9:46 AM, Andrew Ballard <aballard@xxxxxxxxx> wrote:
> On Tue, Jul 15, 2008 at 5:38 AM, Yeti <yeti@xxxxxxxxxx> wrote:
>> I dont think using all these regular expressions is a very efficient way to
>> do so. As i previously pointed out there are many users who had a similar
>> problem, which can be viewed at:
>>
>> http://it.php.net/manual/en/function.strtr.php
>>
>> One of my favourites is what derernst at gmx dot ch used.
>>
>> derernst at gmx dot ch
>> wrote on 20-Sep-2005 07:29
>> This works for me to remove accents for some characters of Latin-1, Latin-2
>> and Turkish in a UTF-8 environment, where the htmlentities-based solutions
>> fail:
>>
>> <?php
>> function remove_accents($string, $german=false) {
>>   // Single letters
>>   $single_fr = explode(" ", "� � � � � � &#260; &#258; � &#262; &#268;
>> &#270; &#272; � � � � � &#280; &#282; &#286; � � � � &#304; &#321; &#317;
>> &#313; � &#323; &#327; � � � � � � &#336; &#340; &#344; � &#346; &#350;
>> &#356; &#354; � � � � &#366; &#368; � � &#377; &#379; � � � � � � &#261;
>> &#259; � &#263; &#269; &#271; &#273; � � � � &#281; &#283; &#287; � � � �
>> &#305; &#322; &#318; &#314; � &#324; &#328; � � � � � � � &#337; &#341;
>> &#345; &#347; � &#351; &#357; &#355; � � � � &#367; &#369; � � � &#378;
>> &#380;");
>>   $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I I
>> I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a a
>> a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s s
>> t t u u u u u u y y z z z");
>>   $single = array();
>>   for ($i=0; $i<count($single_fr); $i++) {
>>   $single[$single_fr[$i]] = $single_to[$i];
>>   }
>>   // Ligatures
>>   $ligatures = array("�"=>"Ae", "�"=>"ae", "�"=>"Oe", "�"=>"oe", "�"=>"ss");
>>   // German umlauts
>>   $umlauts = array("�"=>"Ae", "�"=>"ae", "�"=>"Oe", "�"=>"oe", "�"=>"Ue",
>> "�"=>"ue");
>>   // Replace
>>   $replacements = array_merge($single, $ligatures);
>>   if ($german) $replacements = array_merge($replacements, $umlauts);
>>   $string = strtr($string, $replacements);
>>   return $string;
>> }
>> ?>
>>
>> I would change this function a bit ...
>>
>> <?php
>> //echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes; NOTE: One might use UTF-8 as this documents encoding
>> function remove_accents($string) {
>>  $string = rawurlencode($string);
>>  $replacements = array(
>>  '%C3%A1' => 'a',
>>  '%C3%A0' => 'a',
>>  '%C3%A9' => 'e',
>>  '%C3%A8' => 'e',
>>  '%C3%AD' => 'i',
>>  '%C3%AC' => 'i',
>>  '%C3%B3' => 'o',
>>  '%C3%B2' => 'o',
>>  '%C3%BA' => 'u',
>>  '%C3%B9' => 'u',
>>  '%C3%81' => 'A',
>>  '%C3%80' => 'A',
>>  '%C3%89' => 'E',
>>  '%C3%88' => 'E',
>>  '%C3%8D' => 'I',
>>  '%C3%8C' => 'I',
>>  '%C3%93' => 'O',
>>  '%C3%92' => 'O',
>>  '%C3%9A' => 'U',
>>  '%C3%99' => 'U'
>>  );
>>  return strtr($string, $replacements);
>> }
>> //echo remove_accents("CÀfé"); // I know it's not spelled right
>> echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); //OUTPUT (again: i used UTF-8 for document): aaeeiioouuAAEEIIOOUU
>> ?>
>>
>> Ciao
>>
>> Yeti
>>
>> On Mon, Jul 14, 2008 at 8:20 PM, Andrew Ballard <aballard@xxxxxxxxx> wrote:
>>>
>>> On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
>>> <giulio@xxxxxxxxxxxxx> wrote:
>>> >
>>> > Brilliant !!!
>>> >
>>> > so you replace every occurence of every accent variation with all the
>>> > accent variations...
>>> >
>>> > OK, that's it!
>>> >
>>> > only some more doubts ( regex are still an headhache for me... )
>>> >
>>> > preg_replace('/[iìíîïĩīĭįı]/iu',...  -- what's the meaning of iu after
>>> > the match string?
>>>
>>> This page explains them both.
>>> http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
>>>
>>> > preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... whats (?!e)  for? -- every
>>> > occurence of aàáâãäåǻāăą NOT followed by e?
>>>
>>> Yes. It matches any character based on the latin 'a' that is not
>>> followed by an 'e'. It keeps the pattern from matching the 'a' when it
>>> immediately precedes an 'e' for the character 'ae' for words like
>>> these:
>>>
>>> http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
>>> (However, that may cause problems with words that have other variants
>>> of 'ae' in them. I'll leave that to you to resolve.)
>>> http://us.php.net/manual/en/regexp.reference.php
>>>
>>> > Many thanks again for your effort,
>>> >
>>> > I'm definitely on the good way
>>> >
>>> >      Giulio
>>> >
>>> >> I was intrigued by your example, so I played around with it some more
>>> >> this morning. My own quick web search yielded a lot of results for
>>> >> highlighting search terms, but none that I found did what you're
>>> >> after. (I admit I didn't look very deep.) I was up to something like
>>> >> this before your reply came in. It's still by no means complete. It
>>> >> even handles simple English plurals (words ending in 's' or 'es'), but
>>> >> not variations that require changing the word base (like 'daisy' to
>>> >> 'daisies').
>>> >>
>>> >> <?php
>>> >> function highlight_search_terms($phrase, $string) {
>>> >>   $non_letter_chars = '/[^\pL]/iu';
>>> >>   $words = preg_split($non_letter_chars, $phrase);
>>> >>
>>> >>   $search_words = array();
>>> >>   foreach ($words as $word) {
>>> >>       if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
>>> >>           $search_words[] = $word;
>>> >>       }
>>> >>   }
>>> >>
>>> >>   $search_words = array_unique($search_words);
>>> >>
>>> >>   foreach ($search_words as $word) {
>>> >>       $search = preg_quote($word);
>>> >>
>>> >>       /* repeat for each possible accented character */
>>> >>       $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
>>> >>       $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
>>> >>       $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu', '[aàáâãäåǻāăą]', $search);
>>> >>       $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
>>> >>       $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
>>> >>       $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu', '[eèéêëēĕėęě]', $search);
>>> >>       $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
>>> >>       $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
>>> >>       $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]', $search);
>>> >>       $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
>>> >>       $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
>>> >>       $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
>>> >>       $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
>>> >>       $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu', '[oòóôõöōŏőǿơ]', $search);
>>> >>       $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
>>> >>       $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
>>> >>       $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
>>> >>       $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu', '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
>>> >>       $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
>>> >>       $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
>>> >>       $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
>>> >>
>>> >>       $string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<span class="keysearch">$0</span>', $string);
>>> >>   }
>>> >>
>>> >>   return $string;
>>> >> }
>>> >> ?>
>>> >>
>>> >> I still can't help feeling there must be some better way, though.
>>> >>
>>> >>> well, i think I'm on the good way now, unfortunately I have some other
>>> >>> urgent work and can't try it immediately, but I'll let you know    :)
>>> >>>
>>> >>> thank you!
>>> >>>
>>> >>>   Giulio
>>> >>
>>> >> Andrew
>
> I agree it doesn't seem very efficient to me, but I haven't come up
> with anything better. The problem with what you posted is that the OP
> was looking to preserve the accented characters, NOT replace them. All
> he wants to do is wrap some tags around the search terms so that they
> are highlighted. I guess he could use your function to replace all the
> accented characters with regular ones in a copy of the original
> string, and then scan that string using strpos() or similar against
> the copy to find the index of each occurrence that needs replaced in
> the original string. This seems even less efficient than the regular
> expressions, to me.
>
> Andrew

Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace() can accept arrays as parameters. In a couple of
very quick tests, this version is roughly 30% faster than my previous
version:

<?php
function highlight_search_terms2($phrase, $string) {
    // Split the phrase on anything that is not a Unicode letter.
    $non_letter_chars = '/[^\pL]/iu';
    $words = preg_split($non_letter_chars, $phrase);

    // Keep only words longer than two bytes (strlen() counts bytes).
    $search_words = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
            $search_words[] = $word;
        }
    }

    $search_words = array_unique($search_words);

    /* repeat for each possible accented character */
    $patterns = array(
        '/(ae|æ|ǽ)/iu'              => '(ae|æ|ǽ)',
        '/(oe|œ)/iu'                => '(oe|œ)',
        '/[aàáâãäåǻāăą](?!e)/iu'    => '[aàáâãäåǻāăą]',
        '/[cçćĉċč]/iu'              => '[cçćĉċč]',
        '/[dďđ]/iu'                 => '[dďđ]',
        '/(?<![ao])[eèéêëēĕėęě]/iu' => '[eèéêëēĕėęě]',
        '/[gĝğġģ]/iu'               => '[gĝğġģ]',
        '/[hĥħ]/iu'                 => '[hĥħ]',
        '/[iìíîïĩīĭįı]/iu'          => '[iìíîïĩīĭįı]',
        '/[jĵ]/iu'                  => '[jĵ]',
        '/[kķĸ]/iu'                 => '[kķĸ]',
        '/[lĺļľŀł]/iu'              => '[lĺļľŀł]',
        '/[nñńņňŉŋ]/iu'             => '[nñńņňŉŋ]',
        '/[oòóôõöōŏőǿơ](?!e)/iu'    => '[oòóôõöōŏőǿơ]',
        '/[rŕŗř]/iu'                => '[rŕŗř]',
        '/[sśŝşš]/iu'               => '[sśŝşš]',
        '/[tţťŧ]/iu'                => '[tţťŧ]',
        '/[uùúûüũūŭůűųǔǖǘǚǜ]/iu'    => '[uùúûüũūŭůűųǔǖǘǚǜ]',
        '/[wŵ]/iu'                  => '[wŵ]',
        '/[yýÿŷ]/iu'                => '[yýÿŷ]',
        '/[zźżž]/iu'                => '[zźżž]',
    );

    foreach ($search_words as $word) {
        $search = preg_quote($word);

        // One call applies every pattern in order; the array values are
        // used as the replacements, position by position.
        $search = preg_replace(array_keys($patterns), $patterns, $search);

        // Wrap each match (with an optional plural 's' or 'es') in a span.
        $string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<span class="keysearch">$0</span>', $string);
    }

    return $string;
}
?>

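
For anyone unsure how the array form of preg_replace() behaves, here is a minimal sketch. It uses a hypothetical two-entry pattern table instead of the full list, and it assumes the script is saved as UTF-8. The `^...$` anchors are only there to keep the demonstration small; they are not part of the function above.

```php
<?php
// Hypothetical two-entry table (not the full list from the function above).
// Keys are the patterns to look for; values are the classes substituted in.
$patterns = array(
    '/[aàáâãäå]/iu' => '[aàáâãäå]',
    '/[eèéêë]/iu'   => '[eèéêë]',
);

// With two arrays, preg_replace() runs the patterns in order and pairs each
// one with the replacement at the same position. array_values() makes that
// positional pairing explicit.
$search = preg_replace(array_keys($patterns), array_values($patterns), 'cafe');

echo $search, "\n"; // c[aàáâãäå]f[eèéêë]

// The expanded string is then used as a fragment of a larger pattern.
$regex = '/^' . $search . '$/iu';
$matches_accented = preg_match($regex, 'café');  // accented variant
$matches_upper    = preg_match($regex, 'CAFÉ');  // case-insensitive via /i
$matches_other    = preg_match($regex, 'cafes'); // a different word
```

The search word is expanded once, and the text being highlighted is never altered, which is why the accented characters in the original string survive.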
Andrew