Re: COPY command character set

> There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said "UTF signatures"
aka the "byte order mark". IOW, the "magic number" bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian   FF FE
UTF-16 big-endian      FE FF
UTF-8                  EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.
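
Until COPY handles this itself, the signature can be detected and stripped
outside the server before the file is loaded. The following is only a minimal
Python sketch of that idea, not anything COPY does; the names SIGNATURES and
split_signature are made up for illustration, and UTF-32 signatures are
deliberately not handled.

```python
import codecs

# The three signatures listed above, paired with the Python codec name
# that decodes the bytes following each one. UTF-32 is not covered.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def split_signature(data: bytes):
    """Return (encoding, payload) with any leading signature removed.

    encoding is None when the data does not start with a known signature,
    in which case the payload is returned unchanged.
    """
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

For a UTF-8 file, writing the returned payload back out gives a file without
the 3 signature bytes, which is what COPY expects today.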

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses ICU? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...
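
For comparison, this is roughly what has to be done by hand today when
another tool emits UTF-16: transcode to UTF-8 before the file reaches COPY.
It is a sketch under assumptions, not a proposal for the server code. The
function name utf16_to_utf8 and the chunk size are invented here, and
Python's "utf-16" codec is relied on to read the signature, pick the byte
order, and drop the signature from the output.

```python
import codecs


def utf16_to_utf8(src, dst, chunk_size=65536):
    """Copy a UTF-16 binary stream to a UTF-8 binary stream.

    src and dst are binary file-like objects. The incremental decoder
    keeps any partial code unit between reads, so chunk boundaries may
    fall anywhere in the input.
    """
    decoder = codecs.getincrementaldecoder("utf-16")()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # final=True raises UnicodeDecodeError if the input was truncated
    # in the middle of a code unit or surrogate pair.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

The output can then be loaded with COPY in a session whose client_encoding
is UTF8.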

-- 
Peter Headland
Architect
Actuate Corporation


-----Original Message-----
From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx] 
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  COPY command character set 

"Peter Headland" <pheadland@xxxxxxxxxxx> writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

			regards, tom lane
