Search Postgresql Archives

Trouble with Mixed UTF-8 and Latin1 data

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

we have a 8.2.1 database that has a Latin1 encoding.

We managed to write there UTF-8 data (not Latin1) and, as this version
of the database, was actually allowing everything to be written to the
Latin1 database, by now we have a problem of having data in different
encoding on that database. I know it looks like a mess, but we could
not do anything against that by now.

To normalize the strings on the side of our Java frontend I have
written a special function in Java, that takes a string (normally
fetched from the database with JDBC connection), that can be Latin1 or
several times converted into UTF-8 and brings it to the native java
encoding (UTF-16) (trying to  convert back to Latin1 several times
until the conversion fails and taking a valid string).

Now, as we want to migrate the database to 8.3, all that text with
mixed encoding cannon be imported into the new database (as far as I
remember starting from version 8.2.3 or so). and my task is to
efficiently update the database so, that it contains only correctly
UTF-8 encoded texts (though being Latin1 database), and then dump it
and import into the new 8.3 instance.

Unfortunately using iconv on the dump file will not work, as it will
either corrupt my Latin1 data that is not compatible with UTF-8 or
convert texts that are already in UTF-8 twice. None of these
possibilities are good, as Latin1 encoded data is about 20% of total
amount of text data in that database.

One idea is to write the function, that will normalize the data to
UTF-8 in PL/pgSQL (that I could not do from the first try) or to write
such a procedure in C (that I have never done yet for Postgres) and
then to update the fields, that have different values.... in that case
I have another question:

Is it possible to directly access and update TOAST data for all the
tables?

Maybe there is a possibility to switch the checking of the correctness
of the UTF-8 encoding off and enable restoration of the corrupted
data, and deal with it later on the new database already (having
better update performance I hope).

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux