Re: invalid byte sequence for encoding "UTF8"

Gregory Stark <stark@xxxxxxxxxxxxxxxx> · Fri, 30 Nov 2007 10:03:38 +0000

[Generally it's not a good idea to start a new thread by responding to an
existing one; it confuses people and makes it more likely that your question
will be missed.]

"Glyn Astill" <glynastill@xxxxxxxxxxx> writes:

> Hi People,
>
> I've set up a postgres 8.2 server and have a database set up with UTF8
> encoding. I intend to read some of our legacy data into the table;
> this legacy data is in ASCII format, and as far as I know is 8 bit
> ASCII.

ASCII is a 7-bit encoding. If you have bytes with the high bit set then you
have something else. Can you give any examples of characters with the high bit
set and what you think they represent?
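
One way to collect such examples is to scan a file for bytes at or above
0x80. This is only a minimal sketch in Python; the sample bytes and the
function name are made up for illustration and are not taken from your data.

```python
from collections import Counter

def high_bit_bytes(data: bytes) -> Counter:
    """Count every byte value >= 0x80, i.e. outside 7-bit ASCII."""
    return Counter(b for b in data if b >= 0x80)

# Hypothetical sample: the byte 0xE9 is what a Latin-1 file would use
# for an e with an acute accent.
sample = b"caf\xe9 au lait"
counts = high_bit_bytes(sample)

for value, n in sorted(counts.items()):
    print(f"0x{value:02X} appears {n} time(s)")
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass
the result to the function. The byte values that turn up are a strong hint
as to which encoding the data is really in.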

> We have a migration tool from mertechdata.com to convert these files
> that are in a DataFlex format into our postgres tables.
>
> Some files convert over okay, and some come up with the error message
> 'invalid byte sequence for encoding "UTF8"'. The files that come up
> with the error are created correctly and so are their indexes, but as
> soon as we come to insert the data we get this error.

This error indicates that you are trying to import data with client_encoding
set to UTF8, but the data isn't actually UTF8: it contains byte sequences that
are not valid UTF8.
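
You can reproduce the same check outside the database. The following is a
rough Python sketch, assuming the data is available as raw bytes; the
function name and the sample rows are hypothetical.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None

# Hypothetical rows: the same text stored as Latin-1 and as UTF-8.
latin1_row = b"caf\xe9 au lait"        # a lone 0xE9 is not valid UTF-8
utf8_row = b"caf\xc3\xa9 au lait"      # 0xC3 0xA9 is the UTF-8 form

bad = first_invalid_utf8(latin1_row)
ok = first_invalid_utf8(utf8_row)
print("latin1_row:", bad)
print("utf8_row:", ok)
```

A non-None result gives the byte offset where decoding fails, which is
usually enough to find the offending record in the source file.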

If your migration toolkit lets you set the client encoding separately from the
server encoding, then you can set the client encoding to match your data and
the server encoding to the encoding you want the server to use. The server
then converts the data as it comes in.
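
In SQL the setting is SET client_encoding TO 'LATIN1'; libpq-based clients
also honour the PGCLIENTENCODING environment variable. Whether your
migration tool exposes either is something you would have to check. As a
sketch only, here is how a load through psql could be prepared from Python;
the file name, database name and the choice of LATIN1 are all assumptions.

```python
import os

def psql_invocation(sql_file: str, dbname: str, client_encoding: str):
    """Build (argv, env) for running psql with a given client encoding.

    PGCLIENTENCODING is read by libpq and has the same effect as issuing
    SET client_encoding at the start of the session.
    """
    env = dict(os.environ)
    env["PGCLIENTENCODING"] = client_encoding
    argv = ["psql", "-d", dbname, "-f", sql_file]
    return argv, env

# Hypothetical names; replace them with your own file and database.
argv, env = psql_invocation("legacy_dump.sql", "mydb", "LATIN1")

# To actually run the load:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```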

Otherwise you'll have to recode the data to UTF8 or whatever encoding you want
the data to be. There are tools to do this, such as GNU "recode".
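
If a dedicated tool isn't to hand, the same conversion is a few lines of
Python. This is a sketch that assumes you already know the source encoding
(Latin-1 is used here purely as an example); the function names are
hypothetical.

```python
def recode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if raw is not valid in source_encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")

def recode_file(src_path: str, dst_path: str, source_encoding: str) -> None:
    """Copy a text file line by line, converting it to UTF-8.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Hypothetical sample: 0xE9 in Latin-1 becomes 0xC3 0xA9 in UTF-8.
legacy = b"caf\xe9"
converted = recode_to_utf8(legacy, "latin-1")
print(converted)
```

Note that Latin-1 assigns a character to every byte value, so decoding as
Latin-1 never fails; a wrong guess about the source encoding produces wrong
characters rather than an error.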

> Are there any more flexible formats we could use? I noticed we have
> Latin 1-10 and ISO formats. Is there any reason why we shouldn't use
> these?

Well, there are pros and cons. The 1-byte ISO encodings will be more space
efficient and also allow some CPU optimizations, so they perform somewhat
better. But if you ever need to store a character which doesn't fit in the
encoding you'll be stuck. Postgres doesn't support using multiple encodings in
the same database (or effectively even in the same initdb cluster).
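
Both sides of that tradeoff are easy to see outside the database. The sketch
below uses Python's own codecs, not Postgres, and the sample strings are
made up; it only illustrates the size difference and the "stuck" case.

```python
def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

CAFE = "caf\u00e9"   # "cafe" with an acute accent on the e
EURO = "\u20ac"      # the euro sign

# Space: one byte per character in Latin-1, two bytes for the accented
# letter in UTF-8.
latin1_bytes = CAFE.encode("latin-1")
utf8_bytes = CAFE.encode("utf-8")
print(len(latin1_bytes), len(utf8_bytes))

# Coverage: the euro sign is missing from ISO 8859-1 (Latin-1) but is
# present in ISO 8859-15 and, of course, in UTF-8.
print(fits(EURO, "latin-1"), fits(EURO, "iso8859-15"), fits(EURO, "utf-8"))
```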

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
