Ronald Vyhmeister wrote:
>> The problem is that the database is UTF8, and it won't let me use the
>> convert_to function because it says that the characters don't exist...
>> What I need to do is to pull the UTF8 from the database, tell postgres
>> that it's 8859-1, and then convert it to WIN1251... How?
>>
>> If the DB is in UTF8, you ought to be able to issue
>>
>>     set client_encoding to 'WIN1251';
>>
>> after connecting, and the conversion happens automatically. See:
>>
>> http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
>>
>> HTH,
>> Ray.
>
> Wish it would work... when I do it, I get:
>
> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG), which certainly isn't in WIN1251 or in latin-1 (ISO-8859-1). The byte sequence for this character in UTF-8 is:

  0xec 0x8e 0xbd

When decoded as latin-1, those three bytes are interpreted as:

  "ì½"

... which doesn't seem much more likely to be valid. Are you SURE it's latin-1, not (say) KOI8-R?

Python is a useful tool for investigating encoding problems. It draws a strong distinction between true Unicode strings and byte strings, and it provides quality conversion routines that let you specify the encoding of a byte string and that throw an exception for nonsensical input. This lets you take a byte string, decode it according to various encodings, and see what you land up with. You can also test the various horrible encoding-mangling schemes used by apps rather easily. The Python `encodings' module lists all known encodings, under all their aliases, as `encodings.aliases.aliases'.
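To make that concrete, here's a tiny sketch of the decode behaviour, written with explicit b''/u'' literals so it also runs on newer Pythons; the byte sequences are the ones discussed above:

```python
# The three-byte UTF-8 sequence for U+C3BD, the Hangul syllable above:
hangul = b'\xec\x8e\xbd'
assert hangul.decode('utf-8') == u'\uc3bd'

# The two bytes 0xc3 0xbd from the error message, decoded as latin-1,
# come out as two separate characters (latin-1 maps each byte 1:1):
assert b'\xc3\xbd'.decode('latin-1') == u'\xc3\xbd'

# A decode that makes no sense raises instead of silently guessing:
try:
    hangul.decode('ascii')
except UnicodeDecodeError as exc:
    print('not valid ASCII: %s' % exc)
```

That exception-on-garbage behaviour is exactly what makes it useful for testing guesses about what encoding some mystery bytes are in.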
Here's some code that takes an input byte string and prints a table of all the ways it can be interpreted according to various encodings:

-------------------
import sys
import encodings

# Codecs that transform byte streams rather than decode text; skip them.
stupid_encodings = ['zlib_codec', 'uu_codec', 'hex_codec', 'bz2_codec']

bs = '\xc3\xbd'

for encoding in set(encodings.aliases.aliases.values()):
    sys.stdout.write(u"%20s: " % encoding)
    sys.stdout.flush()
    if encoding in stupid_encodings:
        print
        continue
    try:
        decoded = bs.decode(encoding)
        print u"\"%6s\" (%6s)" % (decoded, repr(decoded))
    except UnicodeDecodeError, e:
        print u" [INVALID]"
    except LookupError, e:
        print u" [UNSUPPORTED]"
--------------------

... and here's the output for the data you mentioned:

--------------------
bz2_codec:
cp1140: " C¨" (u'C\xa8')
rot_13: " ý" (u'\xc3\xbd')
cp932: " テス" (u'\uff83\uff7d')
euc_jisx0213: " 箪" (u'\u7baa')
cp037: " C¨" (u'C\xa8')
hex_codec:
cp500: " C¨" (u'C\xa8')
uu_codec:
big5hkscs: " 羸" (u'\u7fb8')
mbcs: [UNSUPPORTED]
euc_jis_2004: " 箪" (u'\u7baa')
iso2022_jp_3: [INVALID]
iso2022_jp_2: [INVALID]
iso2022_jp_1: [INVALID]
gbk: " 媒" (u'\u5a92')
iso2022_jp_2004: [INVALID]
quopri_codec: [INVALID]
cp424: " C¨" (u'C\xa8')
iso2022_jp: [INVALID]
mac_iceland: " √Ω" (u'\u221a\u03a9')
hp_roman8: " û§" (u'\xfb\xa7')
iso2022_kr: [INVALID]
euc_kr: " 첵" (u'\uccb5')
cp1254: " ý" (u'\xc3\xbd')
gb2312: " 媒" (u'\u5a92')
cp850: " ├¢" (u'\u251c\xa2')
shift_jis: " テス" (u'\uff83\uff7d')
cp852: " ├Ż" (u'\u251c\u017b')
cp855: " ├й" (u'\u251c\u0439')
utf_16_le: " 뷃" (u'\ubdc3')
cp857: " ├¢" (u'\u251c\xa2')
cp775: " ├Į" (u'\u251c\u012e')
cp1026: " C¨" (u'C\xa8')
mac_latin2: " √Ĺ" (u'\u221a\u0139')
mac_cyrillic: " √љ" (u'\u221a\u0459')
base64_codec: " " ( '')
ptcp154: " ГҪ" (u'\u0413\u04aa')
euc_jp: " 箪" (u'\u7baa')
hz: [INVALID]
utf_8: " ý" (u'\xfd')
mac_greek: " ΟΫ" (u'\u039f\u03ab')
utf_7: [INVALID]
mac_turkish: " √Ω" (u'\u221a\u03a9')
cp949: " 첵" (u'\uccb5')
zlib_codec:
big5: " 羸" (u'\u7fb8')
iso8859_9: " ý" (u'\xc3\xbd')
iso8859_8: [INVALID]
iso8859_5: " УН" (u'\u0423\u041d')
iso8859_4: " ÃŊ" (u'\xc3\u014a')
iso8859_7: " Γ½" (u'\u0393\xbd')
iso8859_6: [INVALID]
iso8859_3: [INVALID]
iso8859_2: " Ă˝" (u'\u0102\u02dd')
gb18030: " 媒" (u'\u5a92')
shift_jis_2004: " テス" (u'\uff83\uff7d')
mac_roman: " √Ω" (u'\u221a\u03a9')
cp950: " 羸" (u'\u7fb8')
utf_16: " 뷃" (u'\ubdc3')
iso8859_15: " Ãœ" (u'\xc3\u0153')
iso8859_14: " ÃẄ" (u'\xc3\u1e84')
tis_620: " รฝ" (u'\u0e23\u0e1d')
iso8859_16: " Ăœ" (u'\u0102\u0153')
iso8859_11: " รฝ" (u'\u0e23\u0e1d')
iso8859_10: " Ã―" (u'\xc3\u2015')
tactis: [UNSUPPORTED]
ascii: [INVALID]
cp869: " ├Ξ" (u'\u251c\u039e')
cp860: " ├╜" (u'\u251c\u255c')
cp861: " ├╜" (u'\u251c\u255c')
cp862: " ├╜" (u'\u251c\u255c')
cp863: " ├╜" (u'\u251c\u255c')
cp864: " ﺃﺵ" (u'\ufe83\ufeb5')
cp865: " ├╜" (u'\u251c\u255c')
cp866: " ├╜" (u'\u251c\u255c')
shift_jisx0213: " テス" (u'\uff83\uff7d')
cp1255: " ֳ½" (u'\u05b3\xbd')
latin_1: " ý" (u'\xc3\xbd')
cp1257: " Ć½" (u'\u0106\xbd')
cp1256: " أ½" (u'\u0623\xbd')
cp1251: " ГЅ" (u'\u0413\u0405')
cp1250: " Ă˝" (u'\u0102\u02dd')
cp1253: " Γ½" (u'\u0393\xbd')
cp1252: " ý" (u'\xc3\xbd')
cp437: " ├╜" (u'\u251c\u255c')
cp1258: " Ă½" (u'\u0102\xbd')
iso8859_13: " Ć½" (u'\u0106\xbd')
koi8_r: " ц╫" (u'\u0446\u256b')
utf_16_be: " 쎽" (u'\uc3bd')
johab: " 칳" (u'\uce73')
iso2022_jp_ext: [INVALID]
--------------------

I don't see anything particularly striking there, but (a) this is a tiny sample, not a useful string; (b) I'm relying on my interpretation of what the original byte string must've been being correct; (c) Python's encoding support isn't absolutely comprehensive; and (d) I'm assuming the original inserting app only made a simple error in populating the database, rather than doing something really exciting with its text.

Can you provide a more complete sample of the text in the DB? It'd help if you could show the CAST(stringvar AS bytea) value of the strings, which will provide the byte sequence the string is stored as in the server encoding.
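If you do post the bytea form, note that psql's traditional escape-format output prints non-printable bytes as octal escapes (e.g. \303\275). A quick helper (hypothetical; the function name is mine and it assumes escape format, not hex format) turns that text back into a raw byte string you can feed to the decode loop above:

```python
def bytea_escape_to_bytes(s):
    """Decode PostgreSQL escape-format bytea text, e.g. r'\303\275'."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '\\':
            if s[i + 1] == '\\':          # doubled backslash: literal backslash
                out.append(ord('\\'))
                i += 2
            else:                          # three-digit octal escape
                out.append(int(s[i + 1:i + 4], 8))
                i += 4
        else:                              # printable byte, taken as-is
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

print(repr(bytea_escape_to_bytes(r'\303\275')))  # the 0xc3 0xbd sequence
```

With the raw bytes recovered, you can run the full encoding table over the real stored data instead of my guess at it.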
--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general