Re: Problem with langage encoding

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Tue, 02 Dec 2008 13:34:44 +1300

Stephane Bortzmeyer wrote:
> On Thu, Nov 27, 2008 at 02:34:17AM +0900,
>  Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> wrote 
>  a message of 188 lines which said:
> 
>>> ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG) 
> 
> No, I don't think so. I think that 0xc3bd is the binary value, and
> 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
> less surprising. It is in Latin-1 but not in WIN1251.

Yes, that would make a lot more sense. From there it's easy, see below:

> In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
> the database was really in Latin-1.

OK, so what we end up with is the following sequence (expressed in
Python, which is always handy for encoding conversions etc):

print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r")

to yield: Щ

Assuming that's right, what presumably happened is that the database was
initialized with UTF-8 and loaded with data it was told was in the
latin-1 encoding, but was actually in the  koi8_r encoding. Therefore,
utf-8 sequences were generated based on the interpretation of the bytes
for each koi8_r character as the latin-1 character for the same byte
value, so:

Input file: 0xfd (latin-1: ý, koi8_r: Щ)
       |
       | (input interpreted as latin-1)
       v
Database:   0xc3bd (utf-8: ý)

To recover the data you must reverse that process. Thankfully it's going
to be 100% reversible, ie no information has been lost.

To create a tiny test table for the following explanation and fix code I
just ran:

create table ss ( x text) ;
insert into ss (x) values (E'\xc3\xbd');

Now if I:

set client_encoding = "WIN1251";

I get:

test=> select * from ss;
ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

just like you.

With client_encoding set as utf-8 (which is what my machine is) I get:

test=> select * from ss;
 x
---
 ý
(1 row)

PostgreSQL's convert() function:
http://www.postgresql.org/docs/current/static/functions-string.html

may be used now to transform your data. It doesn't assume any encoding
for the input string, unlike convert_from and convert_to, so you can use
a statement like this to convert your data:(where 'x' is the string of
mangled data to be converted):

select (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
          'koi8_r')
) from ss;

In other words: "Decode the utf-8 sequence in the input and map each
utf-8 code point to the corresponding latin-1 character, outputting one
byte per latin-1 character. Interpret the sequence of bytes just
produced as a sequence of characters in the koi8_r encoding, and map
them to the same characters in the database's internal encoding."

To copy the converted data to a new table:

CREATE TABLE converted (y text);
INSERT INTO converted(y)
SELECT (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
          'koi8_r')
) from ss;

Now if I SELECT from the table of converted data, I get the right(?) output:

test=> select * from converted;
 y
---
 Щ
(1 row)

You can easily wrap that up in a simple SQL function:

CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r')
$$ LANGUAGE 'SQL' IMMUTABLE;

so you can just:

test=> select fixstr(x) from ss;
 fixstr
--------
 Щ
(1 row)

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general