Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

Alan Millington <admillington@xxxxxxxxxxx> · Sat, 23 Jan 2010 13:26:22 +0000 (GMT)

>You probably need to ask the mxODBC developers (who AFAIK don't hang out
>on this list) what they are doing with that data.  It sounds fairly
>likely to me that the bytea value is just being sent as a string without
>any special encoding.  That would explain both the null sensitivity you
>mention later in the thread, and the encoding validity complaints ---
>PG 8.1 was much less picky about string encoding validity than recent
>versions are.
>There are basically two ways that you could make this work reliably:
>arrange for the bytea value to be sent as an out-of-line binary
>parameter, or encode it using backslash sequences (eg, '\000' for a
>null).  Whether the former is possible with mxODBC I dunno.  The latter
>might be something that mxODBC will do for you if it knows the value
>is supposed to be bytea, but without that knowledge I don't see how it
>could.  You might end up having to do the encoding yourself.

Preliminary notes:

1. I have now confirmed that at some point I upgraded from mxODBC 3.0 to 3.0.3. The statement in my original posting that my mxODBC installation had not changed was wrong.

2. The Python 'str' datatype is used for any sequence of single bytes, like C's array of char. One cannot tell from the datatype what these bytes are intended to represent: it could be ASCII characters, characters in any single-byte encoding, Unicode in any encoding, or binary data.

I have discovered a workaround, which is to pass the data to mxODBC in a Python buffer object, which clearly identifies the data as binary.

I wrote to eGenix about this as follows:

For some years I have used a Postgres 8.1.4 database (UTF8) with the Postgres Unicode ODBC driver. At first I used mxODBC 2.0 as my interface with Python 2.4, then I upgraded to mxODBC 3.0 for Python 2.5. I am now on Python 2.6.

 With mxODBC 2.0 I had to use the Binary() function to convert binary data to a buffer object if I was to insert it successfully into a bytea column. With mxODBC 3.0 I found that that was no longer necessary, though now I had to check the “bytea as LO” option of the driver.

 At some point I upgraded to mxODBC 3.0.3, and recently I upgraded to Postgres 8.4.1. On trying to insert into a bytea column yesterday I found I was no longer able to do so. Either I received an "invalid byte sequence for encoding UTF8" error from Postgres, or the data was truncated at the first NULL, as though it was a C-style string.

 I find, however, that if I
 use Binary() again, I can insert the data correctly.

 It seems to make no difference whether I use BIND_USING_SQLTYPE or BIND_USING_PYTHONTYPE.

 Possibilities:

 i) Postgres has changed its processing of bytea input.
ii) mxODBC has changed the way it handles such data.
iii) Both have changed.

 Can you shed any light on this?
Marc-Andre Lemburg replied as follows:

mxODBC has not changed in this respect between 3.0.0 and 3.0.3,
so it must be the PostgreSQL code that has changed.

Note that changing the bind type doesn't help in this case, since
the PostgreSQL ODBC driver does not provide the necessary APIs
to do SQL type binding. As a result, mxODBC will always use
Python type binding.

By using the Binary() wrapper for binary data (which really is just
an alias for buffer()), you tell mxODBC to send the data as
binary data to PostgreSQL. Without the wrapper, mxODBC will send
the data as text data and since it is not possible to store
binary data as UTF-8 you get the errors you are seeing.
What puzzles me is hinted at in the last sentence: why does Postgres 8.4.1 (though apparently not 8.1.4) try to interpret the bytes as UTF8 when they are being sent to a column that is typed as bytea?

I apologise if this posting is excessively long, but I like to understand the reasons for things, and others may find the information useful.