On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@xxxxxxxxx> wrote: > On 2/26/15, Vick Khera <vivek@xxxxxxxxx> wrote: >> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@xxxxxxxxx> >> wrote: >> >>> I get an error "ERROR: invalid byte sequence for >>> encoding "UTF8": 0xe9616c" >> >> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if >> you're not. > > Regardless, can you point me to some reading that would have clued > me in that e9 is not a UTF8 character? Or is the clue the fact that it was not > preceeded with 0x00? The error message Postgres gave you is a pretty darned good clue :). But your question has an error, and it's worth expounding a bit. 0xe9 is not a UTF8 character because 0xe9 is not a _character_. 0xe9 is a _byte_. Characters are not bytes. Characters can be _encoded_ into bytes, and that's not the same thing. UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f] represents the same character as in ASCII encoding (an extremely popular encoding). Any byte in the range [0x80-0xff] is the "leading byte" in a sequence of bytes that represents a single character. "Continuation" bytes are in the range [0x80-0xbf]. (Why not the range [0x00-0xff]? Because UTF-8 was designed to generate errors when fed non-UTF8 byte sequences.) The first four bits of the leading byte describe how many continuation bytes there are. If you care to read up on the how and why of UTF-8 (a red herring in this discussion), try: See http://en.wikipedia.org/wiki/UTF-8 Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder sees that initial '1110' and determines that it needs to inspect three bytes to read one character. The second byte is 0x61, which is not in the range [0x80-0xbf], so Postgres reports an invalid byte sequence. Hooray: it produces exactly the error message it should. You don't need to concern yourself with the complications of UTF-8. You only need to know that bytes are not characters; if you don't know the encoding of a sequence of bytes, you've made a logic error. Postgres told you the error, though it didn't hint at how to fix it. (Text editors and web browsers use heuristics to guess file encodings, and they're usually right, though it's impossible to create a perfect heuristic. See http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and for further discussion there.) If you're looking for take-away lessons, the main one is: "read the error message" :). Next time you have the "wrong encoding" problem, you have two options: 1) figure out the encoding and tell Postgres; or 2) regenerate the file in the correct encoding (UTF-8). The "why" is here: http://www.joelonsoftware.com/articles/Unicode.html We on this list jumped strait to option 1. We've memorized 0xe9 in particular, because we've been through your pain before. In the Americas and Western Europe, if a file contains the byte 0xe9 it probably contains the character "é" encoded as windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in particular is a ghastly Internet denizen, in that it defaults to ISO-8859-15 in an apparent crusade against globalization and modern standards. Enjoy life, Adam -- Adam Hooper +1-613-986-3339 http://adamhooper.com -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general