Search Postgresql Archives

Re: Hex characters in COPY input

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Feb 27, 2015 at 12:02 PM, Adam Hooper <adam@xxxxxxxxxxxxxx> wrote:
> On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@xxxxxxxxx> wrote:
>> On 2/26/15, Vick Khera <vivek@xxxxxxxxx> wrote:
>>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@xxxxxxxxx>
>>> wrote:
>>>
>>>> I get an error "ERROR:  invalid byte sequence for
>>>> encoding "UTF8": 0xe9616c"
>>>
>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>>> you're not.
>>
>> Regardless, can you point me to some reading that would have clued
>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
>> preceeded with 0x00?
>
> The error message Postgres gave you is a pretty darned good clue :).
>
> But your question has an error, and it's worth expounding a bit. 0xe9
> is not a UTF8 character because 0xe9 is not a _character_. 0xe9 is a
> _byte_. Characters are not bytes. Characters can be _encoded_ into
> bytes, and that's not the same thing.
>
> UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f]
> represents the same character as in ASCII encoding (an extremely
> popular encoding). Any byte in the range [0x80-0xff] is the "leading
> byte" in a sequence of bytes that represents a single character.
> "Continuation" bytes are in the range [0x80-0xbf]. (Why not the range
> [0x00-0xff]? Because UTF-8 was designed to generate errors when fed
> non-UTF8 byte sequences.) The first four bits of the leading byte
> describe how many continuation bytes there are. If you care to read up
> on the how and why of UTF-8 (a red herring in this discussion), try:
> See http://en.wikipedia.org/wiki/UTF-8
>
> Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder
> sees that initial '1110' and determines that it needs to inspect three
> bytes to read one character. The second byte is 0x61, which is not in
> the range [0x80-0xbf], so Postgres reports an invalid byte sequence.
> Hooray: it produces exactly the error message it should.
>
> You don't need to concern yourself with the complications of UTF-8.
> You only need to know that bytes are not characters; if you don't know
> the encoding of a sequence of bytes, you've made a logic error.
> Postgres told you the error, though it didn't hint at how to fix it.
> (Text editors and web browsers use heuristics to guess file encodings,
> and they're usually right, though it's impossible to create a perfect
> heuristic. See http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
> for further discussion there.)
>
> If you're looking for take-away lessons, the main one is: "read the
> error message" :).
>
> Next time you have the "wrong encoding" problem, you have two options:
> 1) figure out the encoding and tell Postgres; or 2) regenerate the
> file in the correct encoding (UTF-8). The "why" is here:
> http://www.joelonsoftware.com/articles/Unicode.html

Timing is everything: I was just starting to read this in fact.

Thank you for taking the time to provide this information. It is proving to be
most useful. And yes, I did misspeak. At least the one part I have learned
about all of this is that more than one byte is required to represent the
majority of characters out there. So yeah, I did know that e9 was a byte, but I
appreciate you re-iterating that for me. Hoping to represent each character
with a single byte is an artifact of most early computer development being
dominated by English speaking persons way back when, eh?

I must say, at least as far as the MySQL outputting goes, I like option 1. It
was easy enough to find and fix that way once I knew what to look for. The
MySQL documentation is sometimes difficult to wade through. But no
documentation is ever perfect, is it?

One of the things that has endeared me to PostgreSQL is the fact that it seems
to strive for global usefulness while remaining standards compliant. Or are
those one and the same?

Regards

>
> We on this list jumped strait to option 1. We've memorized 0xe9 in
> particular, because we've been through your pain before. In the
> Americas and Western Europe, if a file contains the byte 0xe9 it
> probably contains the character "é" encoded as
> windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in
> particular is a ghastly Internet denizen, in that it defaults to
> ISO-8859-15 in an apparent crusade against globalization and modern
> standards.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-613-986-3339
> http://adamhooper.com


-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux