Stephane Bortzmeyer wrote: > On Thu, Nov 27, 2008 at 02:34:17AM +0900, > Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> wrote > a message of 188 lines which said: > >>> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251" >> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG) > > No, I don't think so. I think that 0xc3bd is the binary value, and > 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is > less surprising. It is in Latin-1 but not in WIN1251. Yes, that would make a lot more sense. From there it's easy, see below: > In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if > the database was really in Latin-1. OK, so what we end up with is the following sequence (expressed in Python, which is always handy for encoding conversions etc): print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r") to yield: Щ Assuming that's right, what presumably happened is that the database was initialized with UTF-8 and loaded with data it was told was in the latin-1 encoding, but was actually in the koi8_r encoding. Therefore, utf-8 sequences were generated based on the interpretation of the bytes for each koi8_r character as the latin-1 character for the same byte value, so: Input file: 0xfd (latin-1: ý, koi8_r: Щ) | | (input interpreted as latin-1) v Database: 0xc3bd (utf-8: ý) To recover the data you must reverse that process. Thankfully it's going to be 100% reversible, ie no information has been lost. To create a tiny test table for the following explanation and fix code I just ran: create table ss ( x text) ; insert into ss (x) values (E'\xc3\xbd'); Now if I: set client_encoding = "WIN1251"; I get: test=> select * from ss; ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251" just like you. With client_encoding set as utf-8 (which is what my machine is) I get: test=> select * from ss; x --- ý (1 row) PostgreSQL's convert() function: http://www.postgresql.org/docs/current/static/functions-string.html may be used now to transform your data. It doesn't assume any encoding for the input string, unlike convert_from and convert_to, so you can use a statement like this to convert your data:(where 'x' is the string of mangled data to be converted): select ( convert_from( convert(x::bytea, 'utf-8', 'latin-1'), 'koi8_r') ) from ss; In other words: "Decode the utf-8 sequence in the input and map each utf-8 code point to the corresponding latin-1 character, outputting one byte per latin-1 character. Interpret the sequence of bytes just produced as a sequence of characters in the koi8_r encoding, and map them to the same characters in the database's internal encoding." To copy the converted data to a new table: CREATE TABLE converted (y text); INSERT INTO converted(y) SELECT ( convert_from( convert(x::bytea, 'utf-8', 'latin-1'), 'koi8_r') ) from ss; Now if I SELECT from the table of converted data, I get the right(?) output: test=> select * from converted; y --- Щ (1 row) You can easily wrap that up in a simple SQL function: CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$ SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r') $$ LANGUAGE 'SQL' IMMUTABLE; so you can just: test=> select fixstr(x) from ss; fixstr -------- Щ (1 row) -- Craig Ringer -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general