Re: Hex characters in COPY input

Adam Hooper <adam@xxxxxxxxxxxxxx> · Fri, 27 Feb 2015 13:02:31 -0500

On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@xxxxxxxxx> wrote:
> On 2/26/15, Vick Khera <vivek@xxxxxxxxx> wrote:
>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@xxxxxxxxx>
>> wrote:
>>
>>> I get an error "ERROR:  invalid byte sequence for
>>> encoding "UTF8": 0xe9616c"
>>
>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>> you're not.
>
> Regardless, can you point me to some reading that would have clued
> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
> preceeded with 0x00?

The error message Postgres gave you is a pretty darned good clue :).

But your question has an error, and it's worth expounding a bit. 0xe9
is not a UTF8 character because 0xe9 is not a _character_. 0xe9 is a
_byte_. Characters are not bytes. Characters can be _encoded_ into
bytes, and that's not the same thing.

UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f]
represents the same character as in ASCII encoding (an extremely
popular encoding). Any byte in the range [0x80-0xff] is the "leading
byte" in a sequence of bytes that represents a single character.
"Continuation" bytes are in the range [0x80-0xbf]. (Why not the range
[0x00-0xff]? Because UTF-8 was designed to generate errors when fed
non-UTF8 byte sequences.) The first four bits of the leading byte
describe how many continuation bytes there are. If you care to read up
on the how and why of UTF-8 (a red herring in this discussion), try:
See http://en.wikipedia.org/wiki/UTF-8

Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder
sees that initial '1110' and determines that it needs to inspect three
bytes to read one character. The second byte is 0x61, which is not in
the range [0x80-0xbf], so Postgres reports an invalid byte sequence.
Hooray: it produces exactly the error message it should.

You don't need to concern yourself with the complications of UTF-8.
You only need to know that bytes are not characters; if you don't know
the encoding of a sequence of bytes, you've made a logic error.
Postgres told you the error, though it didn't hint at how to fix it.
(Text editors and web browsers use heuristics to guess file encodings,
and they're usually right, though it's impossible to create a perfect
heuristic. See http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
for further discussion there.)

If you're looking for take-away lessons, the main one is: "read the
error message" :).

Next time you have the "wrong encoding" problem, you have two options:
1) figure out the encoding and tell Postgres; or 2) regenerate the
file in the correct encoding (UTF-8). The "why" is here:
http://www.joelonsoftware.com/articles/Unicode.html

We on this list jumped strait to option 1. We've memorized 0xe9 in
particular, because we've been through your pain before. In the
Americas and Western Europe, if a file contains the byte 0xe9 it
probably contains the character "é" encoded as
windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in
particular is a ghastly Internet denizen, in that it defaults to
ISO-8859-15 in an apparent crusade against globalization and modern
standards.

Enjoy life,
Adam

-- 
Adam Hooper
+1-613-986-3339
http://adamhooper.com

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general