Re: another seemingly simple encoding question

joseph <kmh496@xxxxxxxxxx> · Fri, 24 Mar 2006 23:43:45 +0900

The problem is that my string is in UTF-8, because all input is
first converted in PHP with
       $str_out = mb_convert_encoding($str_in, "UTF-8");
and the query, which looks like
"select wordid from korean_english where word='utf8string'";
is returning wordids for words which are not equal to utf8string.

(In debug mode) the query above is output over the web as UTF-8
(both by PHP and in the browser encoding), and then "exit;" is
called, so I just grab it from the browser by cutting and pasting the
whole query string.
Running the query in PHP and from psql returns the same bad wordids,
which suggests that the encoding is consistent through the
cut-and-paste operation.
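
One way to double-check that assumption is to dump the code points and
UTF-8 bytes of the pasted string and of a word the query returned, and
compare them by eye. A minimal sketch follows, in Python purely for
illustration (the site itself is PHP). The helper name describe() and
the sample word are hypothetical, not taken from the real table.

```python
import unicodedata

def describe(text):
    """Return one (code point, name, UTF-8 bytes as hex) tuple per character."""
    rows = []
    for ch in text:
        rows.append((
            "U+%04X" % ord(ch),                 # code point in the usual notation
            unicodedata.name(ch, "<unnamed>"),  # fall back if the character has no name
            ch.encode("utf-8").hex(),           # the bytes that would be sent as UTF-8
        ))
    return rows

# Hypothetical sample; substitute the string pasted from the browser
# and, separately, a word returned by the query.
pasted = "\uD55C"
for row in describe(pasted):
    print(*row)
```

If the two dumps differ, the strings are not the same even though
they may look the same on screen.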

I don't understand what a "Unicode normalization form" is.  The
PostgreSQL docs at
http://www.postgresql.org/docs/8.0/interactive/multibyte.html
say

Table 20-1. Server Character Sets

Name        Description
UNICODE     Unicode (UTF-8)

So I thought UNICODE and UTF-8 were the same thing, and I don't know
what a "Unicode normalization form" is.

My question is: why isn't the utf8string in the query returning only
the matching, corresponding wordids from the database?

thx.

On Fri, 2006-03-24 at 08:56 -0500, John D. Burger wrote:
> > i have a problem matching a utf8 string with a field in a database 
> > encoded in utf8.
> 
> You seem to give all the details of your configuration, but unless I 
> misread your message, you don't say what the actual problem is.  Can 
> you provide more details?  What exactly doesn't work?
> 
> This may not be the issue, but many people don't realize that there
> are sometimes multiple ways to encode what is conceptually the same
> string in UTF8 (or any of the Unicode encodings).  If you do not
> canonicalize your strings using one of the Unicode normalization
> forms, then seemingly identical strings may not match, because they
> are not byte-for-byte identical.
> 
> - John D. Burger
>    MITRE
>
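
To illustrate the point in the quoted message: the same Korean
syllable can be stored either as one precomposed code point or as a
sequence of conjoining jamo. Both are valid UTF-8, but the bytes
differ. The sketch below uses Python only for illustration. The
sample strings and the helper same_after_nfc() are hypothetical, and
nothing here shows that normalization is the cause of the problem
described above.

```python
import unicodedata

# Two spellings of the same syllable (hypothetical sample data).
composed = "\uD55C"                # precomposed HANGUL SYLLABLE HAN
decomposed = "\u1112\u1161\u11AB"  # the same syllable as three conjoining jamo

def same_after_nfc(a, b):
    """Compare two strings after converting both to normalization form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Raw comparison, code point by code point.
print(composed == decomposed)

# The UTF-8 byte sequences, which is what a byte-for-byte match sees.
print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())

# Comparison after canonicalizing both sides.
print(same_after_nfc(composed, decomposed))
```

The raw comparison fails while the normalized one succeeds, so a
lookup would only be reliable if the stored words and the search
string were both put into the same normalization form first.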