Re: Problem with langage encoding

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Thu, 27 Nov 2008 02:34:17 +0900

Ronald Vyhmeister wrote:
> 
>> The problem is that the database is UTF8, and it won't let me use the
>> convert_to function because it says that the characters don't exist...
> What
>> I need to do is to pull the UTF8 from the database, tell postgres that
> it's
>> 8859-1, and then convert it to WIN1251... How?  
> 
>> If the DB is in UTF8, you ought to be able to issue
> 
>>  set client_encoding to 'WIN1251'
> 
>> after connection, and the conversion happens automatically. See:
> 
>>  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
> 
>> HTH,
> 
>> Ray.
> 
> Wish it would work...  when I do it, I get:
> 
> ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG) which
certainly isn't in WIN1251 or in latin-1 (ISO-8859-1). The byte sequence
for this character in UTF-8 is:

	0xec 0x8e 0xbd

When decoded as latin-1, those three bytes are interpreted as: "ìŽ½" ...
which doesn't seem much more likely to be valid. Are you SURE it's
latin-1, not (say) Koi_r-8?

Python is a useful tool for investigating encoding problems. It draws a
strong distinction between true Unicode strings and byte strings. It
provides quality conversion routines that let you specify the encoding
of the byte string, and that'll throw an exception for nonsensical input.

This lets you take a byte string, decode it according to various
encodings, and see what you land up with. You can also test various
horrible encoding mangling schemes used by apps rather easily. The
python `encodings' module contains a list of all known encodings by all
aliases as `encodings.aliases.aliases'. Here's some code that takes an
input byte string and prints a table of all ways it can be interpreted
according to various encodings:

-------------------
import sys
import encodings

stupid_encodings = ['zlib_codec', 'uu_codec', 'hex_codec', 'bz2_codec']

bs = '\xc3\xbd'
for encoding in set(encodings.aliases.aliases.values()):
  sys.stdout.write(u"%20s: " % encoding)
  sys.stdout.flush()
  if encoding in stupid_encodings:
    continue
  try:
    print u"\"%6s\" (%6s)" % (bs.decode(encoding),
repr(bs.decode(encoding)))
  except UnicodeDecodeError,e:
    print u"         [INVALID]"
  except LookupError,e:
    print u"         [UNSUPPORTED]"

--------------------

... and here's the output for the data you mentioned:

--------------------
           bz2_codec:               cp1140: "    C¨" (u'C\xa8')
              rot_13: "    Ã½" (u'\xc3\xbd')
               cp932: "    ﾃｽ" (u'\uff83\uff7d')
        euc_jisx0213: "     箪" (u'\u7baa')
               cp037: "    C¨" (u'C\xa8')
           hex_codec:                cp500: "    C¨" (u'C\xa8')
            uu_codec:            big5hkscs: "     羸" (u'\u7fb8')
                mbcs:          [UNSUPPORTED]
        euc_jis_2004: "     箪" (u'\u7baa')
        iso2022_jp_3:          [INVALID]
        iso2022_jp_2:          [INVALID]
        iso2022_jp_1:          [INVALID]
                 gbk: "     媒" (u'\u5a92')
     iso2022_jp_2004:          [INVALID]
        quopri_codec:          [INVALID]
               cp424: "    C¨" (u'C\xa8')
          iso2022_jp:          [INVALID]
         mac_iceland: "    √Ω" (u'\u221a\u03a9')
           hp_roman8: "    û§" (u'\xfb\xa7')
          iso2022_kr:          [INVALID]
              euc_kr: "     첵" (u'\uccb5')
              cp1254: "    Ã½" (u'\xc3\xbd')
              gb2312: "     媒" (u'\u5a92')
               cp850: "    ├¢" (u'\u251c\xa2')
           shift_jis: "    ﾃｽ" (u'\uff83\uff7d')
               cp852: "    ├Ż" (u'\u251c\u017b')
               cp855: "    ├й" (u'\u251c\u0439')
           utf_16_le: "     뷃" (u'\ubdc3')
               cp857: "    ├¢" (u'\u251c\xa2')
               cp775: "    ├Į" (u'\u251c\u012e')
              cp1026: "    C¨" (u'C\xa8')
          mac_latin2: "    √Ĺ" (u'\u221a\u0139')
        mac_cyrillic: "    √љ" (u'\u221a\u0459')
        base64_codec: "      " (    '')
             ptcp154: "    ГҪ" (u'\u0413\u04aa')
              euc_jp: "     箪" (u'\u7baa')
                  hz:          [INVALID]
               utf_8: "     ý" (u'\xfd')
           mac_greek: "    ΟΫ" (u'\u039f\u03ab')
               utf_7:          [INVALID]
         mac_turkish: "    √Ω" (u'\u221a\u03a9')
               cp949: "     첵" (u'\uccb5')
          zlib_codec:                 big5: "     羸" (u'\u7fb8')
           iso8859_9: "    Ã½" (u'\xc3\xbd')
           iso8859_8:          [INVALID]
           iso8859_5: "    УН" (u'\u0423\u041d')
           iso8859_4: "    ÃŊ" (u'\xc3\u014a')
           iso8859_7: "    Γ½" (u'\u0393\xbd')
           iso8859_6:          [INVALID]
           iso8859_3:          [INVALID]
           iso8859_2: "    Ă˝" (u'\u0102\u02dd')
             gb18030: "     媒" (u'\u5a92')
      shift_jis_2004: "    ﾃｽ" (u'\uff83\uff7d')
           mac_roman: "    √Ω" (u'\u221a\u03a9')
               cp950: "     羸" (u'\u7fb8')
              utf_16: "     뷃" (u'\ubdc3')
          iso8859_15: "    Ãœ" (u'\xc3\u0153')
          iso8859_14: "    ÃẄ" (u'\xc3\u1e84')
             tis_620: "    รฝ" (u'\u0e23\u0e1d')
          iso8859_16: "    Ăœ" (u'\u0102\u0153')
          iso8859_11: "    รฝ" (u'\u0e23\u0e1d')
          iso8859_10: "    Ã―" (u'\xc3\u2015')
              tactis:          [UNSUPPORTED]
               ascii:          [INVALID]
               cp869: "    ├Ξ" (u'\u251c\u039e')
               cp860: "    ├╜" (u'\u251c\u255c')
               cp861: "    ├╜" (u'\u251c\u255c')
               cp862: "    ├╜" (u'\u251c\u255c')
               cp863: "    ├╜" (u'\u251c\u255c')
               cp864: "    ﺃﺵ" (u'\ufe83\ufeb5')
               cp865: "    ├╜" (u'\u251c\u255c')
               cp866: "    ├╜" (u'\u251c\u255c')
      shift_jisx0213: "    ﾃｽ" (u'\uff83\uff7d')
              cp1255: "    ֳ½" (u'\u05b3\xbd')
             latin_1: "    Ã½" (u'\xc3\xbd')
              cp1257: "    Ć½" (u'\u0106\xbd')
              cp1256: "    أ½" (u'\u0623\xbd')
              cp1251: "    ГЅ" (u'\u0413\u0405')
              cp1250: "    Ă˝" (u'\u0102\u02dd')
              cp1253: "    Γ½" (u'\u0393\xbd')
              cp1252: "    Ã½" (u'\xc3\xbd')
               cp437: "    ├╜" (u'\u251c\u255c')
              cp1258: "    Ă½" (u'\u0102\xbd')
          iso8859_13: "    Ć½" (u'\u0106\xbd')
              koi8_r: "    ц╫" (u'\u0446\u256b')
           utf_16_be: "     쎽" (u'\uc3bd')
               johab: "     칳" (u'\uce73')
      iso2022_jp_ext:          [INVALID]
--------------------

I don't see anything particularly striking there, but (a) this is a tiny
sample not a useful string, (b) I'm relying on my interpretation of what
the original byte string must've been to be correct, (c) Python's
encoding support isn't absolutely comprehensive, and (d) I'm assuming
the original inserting app only made a simple error in populating the
database, rather than doing something really exciting with its text.

Can you provide a more complete sample of the text in the DB? It'd help
if you could show the CAST(stringvar AS bytea) value of the strings,
which will provide the byte sequence the string is stored as in the
server encoding.

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general