Re: bytea encode performance issues

"Merlin Moncure" <mmoncure@xxxxxxxxx> · Thu, 7 Aug 2008 09:38:22 -0400

On Thu, Aug 7, 2008 at 1:16 AM, Sim Zacks <sim@xxxxxxxxxxxxxx> wrote:
>
>> I don't quite follow that...the whole point of utf8 encoded database
>> is so that you can use text functions and operators without the bytea
>> treatment.  As long as your client encoding is set up properly (so
>> that data coming in and out is converted to utf8), then you should be
>> ok.  Dropping to ascii is usually not the solution.  Your data
>> inputting application should set the client encoding properly and
>> coerce data into the unicode text type...it's really the only
>> solution.
>>
> Email does not always follow a specific character set. I have tried
> converting the data that comes in to utf-8 and it does not always work.
> We receive Hebrew emails which come in mostly 2 flavors, UTF-8 and
> windows-1255. Unfortunately, they are not compatible with one another.
> SQL-ASCII and ASCII are different as someone on the list pointed out to
> me. According to the documentation, SQL-ASCII makes no assumption about
> encoding, so you can throw in any encoding you want.

No, you can't! SQL-ASCII means that the database treats everything
like ASCII: it has no idea which encoding the bytes are in, so it
works byte by byte.  This means that any operation that deals with
text could (and in the case of Hebrew, almost certainly will) be
broken.  Even simple things like getting the length of a string will
be wrong, because you get a count of bytes rather than characters.
If you are accepting Unicode input, you absolutely must be using a
Unicode-encoded backend.
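
To make the length problem concrete, here is a small Python sketch. It
is not PostgreSQL itself, and the sample word is my own choice; it only
shows how a byte-oriented view of a string (roughly what a backend with
no encoding knowledge has) disagrees with a character-oriented one.

```python
# Hypothetical sample: the Hebrew word "shalom", written with escapes
# so the snippet does not depend on the source file's own encoding.
word = "\u05e9\u05dc\u05d5\u05dd"

utf8_bytes = word.encode("utf-8")      # two bytes per Hebrew letter
win1255_bytes = word.encode("cp1255")  # one byte per Hebrew letter

char_len = len(word)                   # characters
utf8_byte_len = len(utf8_bytes)        # bytes when stored as UTF-8
win1255_byte_len = len(win1255_bytes)  # bytes when stored as windows-1255

print(char_len, utf8_byte_len, win1255_byte_len)  # prints "4 8 4"
```

The same four-letter word is eight bytes in one flavor and four in the
other, and the two byte strings have nothing in common, so byte-wise
length and comparison give different answers depending on which mail
client sent the text.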

If you are accepting text of different encodings from the client, you
basically have two choices:
a) set client_encoding on the fly to whatever encoding the client's text is in
b) pick an encoding (utf8) and convert all text to that before sending
it to the database (preferred)

you pretty much have to go with option 'b' if you are accepting any
text for which there is no supported client encoding translation.
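
A minimal sketch of option 'b' in Python, under some assumptions: the
function name to_utf8, the "declared" parameter (the charset named in
the mail headers, if any) and the fallback order are all made up for
illustration, and the fallback list covers only the two flavors
mentioned in this thread.

```python
from typing import Optional

# Assumed fallback order: strict UTF-8 first, then windows-1255.
FALLBACK_ENCODINGS = ("utf-8", "cp1255")


def to_utf8(raw: bytes, declared: Optional[str] = None) -> bytes:
    """Return raw re-encoded as UTF-8, trying the declared charset first."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for name in candidates:
        try:
            # Decode with the candidate charset, then re-encode as UTF-8.
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a charset name Python does not know.
            continue
    raise ValueError("could not decode input with any candidate encoding")


# Usage: both flavors of the same word end up as identical UTF-8 bytes.
sample = "\u05e9\u05dc\u05d5\u05dd"
from_win1255 = to_utf8(sample.encode("cp1255"))
from_utf8 = to_utf8(sample.encode("utf-8"))
```

UTF-8 is tried first because its validation is strict, so
windows-1255 text rarely passes it by accident; the reverse is not
true, which makes windows-1255 a poor first guess. The result is what
you would send to a utf8 database with client_encoding set to UTF8.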

merlin