Re: case and accent - insensitive regular expression?

Yeti <yeti@xxxxxxxxxx> · Tue, 15 Jul 2008 11:44:24 +0200

Oh, and I forgot about this one ...

jorge at seisbits dot com
wrote on 11-Jul-2008 09:04
If you need to run strtr on unusual characters in a UTF-8
environment, you can do this:

function normaliza ($string){
  $string = utf8_decode($string);
  // strtr() ignores surplus characters, so both strings must be the same length.
  $string = strtr($string, utf8_decode(" ÂÊÎÔÛÀ"), "-AEIOUA");
  $string = strtolower($string);
  return $string;
}
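
A minimal sketch of the same idea without the utf8_decode() round trip,
assuming the script itself is saved as UTF-8. The function name
normaliza_utf8 and the short character table are illustrative only and
are not part of the quoted comment. utf8_decode() is deprecated as of
PHP 8.2, and the array form of strtr() matches whole byte sequences, so
multi-byte keys can be listed directly.

```php
<?php
// Hypothetical variant of normaliza(); name and table are illustrative.
function normaliza_utf8(string $string): string {
    // With an array argument, strtr() replaces whole keys (byte sequences),
    // so multi-byte UTF-8 characters are safe to use as keys.
    $map = array(
        ' ' => '-',
        'Â' => 'A', 'À' => 'A',
        'Ê' => 'E',
        'Î' => 'I',
        'Ô' => 'O',
        'Û' => 'U',
    );
    // After the replacement, only the ASCII letters produced above
    // need to be lowercased.
    return strtolower(strtr($string, $map));
}

echo normaliza_utf8("ÂÊ ÎÔÛÀ"), "\n"; // prints "ae-ioua"
```

Characters missing from the table pass through unchanged, so the table
has to be extended for every accent that should be folded.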

On Tue, Jul 15, 2008 at 11:38 AM, Yeti <yeti@xxxxxxxxxx> wrote:

> I don't think using all these regular expressions is a very efficient way to
> do this. As I pointed out previously, many users have had a similar
> problem; their solutions can be viewed at:
>
> http://it.php.net/manual/en/function.strtr.php
>
> One of my favourites is what derernst at gmx dot ch used.
>
> derernst at gmx dot ch
> wrote on 20-Sep-2005 07:29
> This works for me to remove accents for some characters of Latin-1, Latin-2
> and Turkish in a UTF-8 environment, where the htmlentities-based solutions
> fail:
>
> <?php
>
> function remove_accents($string, $german=false) {
>
>   // Single letters
>
>   $single_fr = explode(" ", "À Á Â Ã Ä Å Ą Ă Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î
> Ï İ Ł Ľ Ĺ Ñ Ń Ň Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Š Ś Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ž Ź Ż à á â ã ä å
> ą ă ç ć č ď đ è é ê ë ę ě ğ ì í î ï ı ł ľ ĺ ñ ń ň ð ò ó ô õ ö ø ő ŕ ř ś š ş
> ť ţ ù ú û ü ů ű ý ÿ ž ź ż");
>
>   $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I
> I I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a
> a a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s
> s t t u u u u u u y y z z z");
>
>   $single = array();
>
>   for ($i=0; $i<count($single_fr); $i++) {
>
>   $single[$single_fr[$i]] = $single_to[$i];
>
>   }
>
>   // Ligatures
>
>   $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe",
> "ß"=>"ss");
>
>   // German umlauts
>
>   $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
> "ü"=>"ue");
>
>   // Replace
>
>   $replacements = array_merge($single, $ligatures);
>
>   if ($german) $replacements = array_merge($replacements, $umlauts);
>
>   $string = strtr($string, $replacements);
>
>   return $string;
>
> }
>
> ?>
>
> I would change this function a bit ...
>
> <?php
> // echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes
> // NOTE: this assumes the document itself is saved as UTF-8
> function remove_accents($string) {
>  $string = rawurlencode($string);
>  $replacements = array(
>  '%C3%A1' => 'a',
>  '%C3%A0' => 'a',
>  '%C3%A9' => 'e',
>  '%C3%A8' => 'e',
>  '%C3%AD' => 'i',
>  '%C3%AC' => 'i',
>  '%C3%B3' => 'o',
>  '%C3%B2' => 'o',
>  '%C3%BA' => 'u',
>  '%C3%B9' => 'u',
>  '%C3%81' => 'A',
>  '%C3%80' => 'A',
>  '%C3%89' => 'E',
>  '%C3%88' => 'E',
>  '%C3%8D' => 'I',
>  '%C3%8C' => 'I',
>  '%C3%93' => 'O',
>  '%C3%92' => 'O',
>  '%C3%9A' => 'U',
>  '%C3%99' => 'U'
>  );
>  // Decode again so unrelated characters (spaces, punctuation) are restored
>  return rawurldecode(strtr($string, $replacements));
> }
> // echo remove_accents("CÀfé"); // I know it's not spelled right
> // OUTPUT (again: I used UTF-8 for the document): aaeeiioouuAAEEIIOOUU
> echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ");
> ?>
>
> Ciao
>
> Yeti
> On Mon, Jul 14, 2008 at 8:20 PM, Andrew Ballard <aballard@xxxxxxxxx>
> wrote:
>
>> On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
>> <giulio@xxxxxxxxxxxxx> wrote:
>> >>
>> >
>> > Brilliant !!!
>> >
>> > so you replace every occurrence of every accent variation with all the
>> > accent variations...
>> >
>> > OK, that's it!
>> >
>> > only some more doubts (regexes are still a headache for me...)
>> >
>> > preg_replace('/[iìíîïĩīĭįı]/iu',...  -- what's the meaning of iu after
>> > the match string?
>>
>> This page explains them both.
>> http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
>>
>> > preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for? -- every
>> > occurrence of aàáâãäåǻāăą NOT followed by e?
>>
>> Yes. It matches any character based on the Latin 'a' that is not
>> followed by an 'e'. That keeps the pattern from matching the 'a' of an
>> 'ae' pair, which the separate (ae|æ|ǽ) rule already handles, in words
>> like these:
>>
>> http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
>> (However, that may cause problems with words that have other variants
>> of 'ae' in them. I'll leave that to you to resolve.)
>> http://us.php.net/manual/en/regexp.reference.php
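
To see the two modifiers and the (?!e) lookahead in isolation, here is a
small sketch. The shortened character class and the sample words are
made up for illustration; they are not taken from the thread's function.

```php
<?php
// 'u' treats pattern and subject as UTF-8; 'i' makes the match
// case-insensitive. (?!e) is a negative lookahead: an a-variant matches
// only when the next character is not an 'e'.
$pattern = '/[aàáâãäå](?!e)/iu';

// Sample words are illustrative only.
$plain   = preg_replace($pattern, '*', 'Banana'); // every 'a' is replaced
$digraph = preg_replace($pattern, '*', 'aeon');   // 'a' precedes 'e': left alone
$accent  = preg_replace($pattern, '*', 'Ángel');  // 'Á' is matched via 'á' and /iu

echo $plain, ' ', $digraph, ' ', $accent, "\n";
```

Without the 'u' modifier the accented characters would be treated as
separate bytes, and the character class would no longer mean what it
appears to mean.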
>>
>>
>>
>> > Many thanks again for your effort,
>> >
>> > I'm definitely on the right track
>> >
>> >      Giulio
>> >
>> >
>> >>
>> >> I was intrigued by your example, so I played around with it some more
>> >> this morning. My own quick web search yielded a lot of results for
>> >> highlighting search terms, but none that I found did what you're
>> >> after. (I admit I didn't look very deep.) I was up to something like
>> >> this before your reply came in. It's still by no means complete. It
>> >> even handles simple English plurals (words ending in 's' or 'es'), but
>> >> not variations that require changing the word base (like 'daisy' to
>> >> 'daisies').
>> >>
>> >> <?php
>> >> function highlight_search_terms($phrase, $string) {
>> >>   $non_letter_chars = '/[^\pL]/iu';
>> >>   $words = preg_split($non_letter_chars, $phrase);
>> >>
>> >>   $search_words = array();
>> >>   foreach ($words as $word) {
>> >>       if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
>> >>           $search_words[] = $word;
>> >>       }
>> >>   }
>> >>
>> >>   $search_words = array_unique($search_words);
>> >>
>> >>   foreach ($search_words as $word) {
>> >>       $search = preg_quote($word);
>> >>
>> >>       /* repeat for each possible accented character */
>> >>       $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
>> >>       $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
>> >>       $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu',
>> >> '[aàáâãäåǻāăą]', $search);
>> >>       $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
>> >>       $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
>> >>       $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu',
>> >> '[eèéêëēĕėęě]', $search);
>> >>       $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
>> >>       $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
>> >>       $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]',
>> >> $search);
>> >>       $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
>> >>       $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
>> >>       $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
>> >>       $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
>> >>       $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu',
>> >> '[oòóôõöōŏőǿơ]', $search);
>> >>       $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
>> >>       $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
>> >>       $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
>> >>       $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu',
>> >> '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
>> >>       $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
>> >>       $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
>> >>       $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
>> >>
>> >>
>> >>       $string = preg_replace('/\b' . $search . '(e?s)?\b/iu',
>> >>           '<span class="keysearch">$0</span>', $string);
>> >>   }
>> >>
>> >>   return $string;
>> >>
>> >> }
>> >> ?>
>> >>
>> >> I still can't help feeling there must be some better way, though.
>> >>
>> >>>
>> >>> well, I think I'm on the right track now. Unfortunately I have some other
>> >>> urgent work and can't try it immediately, but I'll let you know    :)
>> >>>
>> >>> thank you!
>> >>>
>> >>>   Giulio
>> >>
>> >>
>> >> Andrew
>> >>
>> >>
>> >
>> >
>>
>
>
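
On the "there must be some better way" remark about
highlight_search_terms(): one possible tidy-up, sketched here under
assumptions, is to drive the repeated preg_replace() calls from a
table. The helper name accent_insensitive_pattern and the shortened
table are hypothetical, and the ae/oe lookarounds and plural handling
from the quoted function are deliberately left out to keep it short.

```php
<?php
// Hypothetical helper; the name and the shortened table are illustrative.
function accent_insensitive_pattern(string $word): string {
    // One entry per base letter, listing the variants to treat as equal.
    $classes = array('aàáâãäå', 'cç', 'eèéêë', 'iìíîï', 'nñ', 'oòóôõö', 'uùúûü');

    $search = preg_quote($word, '/');
    foreach ($classes as $class) {
        // Replace any variant with a character class matching all variants.
        // The classes are disjoint, so text inserted by an earlier pass
        // is never rewritten by a later one.
        $search = preg_replace('/[' . $class . ']/iu', '[' . $class . ']', $search);
    }
    return '/\b' . $search . '\b/iu';
}

$pattern = accent_insensitive_pattern('senor');
echo $pattern, "\n";
echo preg_match($pattern, 'Buenos días, SEÑOR Pérez'), "\n";
```

Adding a letter then means adding one table entry instead of one more
preg_replace() line. The result is still a regular expression per
search word, so it does not address the efficiency concern raised
earlier in the thread.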