Hi all,

I ran into this problem and want to share it and get confirmation. I tried to use the COPY command to bulk-load data. I crafted a UNICODE file myself from an MSSQL database, but I can't load it into PostgreSQL. I always get this error:

CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"

The problem is that both files look exactly the same. I found that pg_dump in fact creates a UTF-8 file (confirm please), which is Unicode but with a variable-length encoding (i.e. some characters use 1 byte, others 2 or more). See for details: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted is a true fixed-width 16-bit Unicode (UCS-2) file (confirm please).

So here is the content of the files:

UTF-8 (PostgreSQL dump):
1	1	1	AC	COLUMNÿACNUMÿACDESCÿACDELPAIÿ

UNICODE (crafted from MSSQL):
1	1	1	AC	COLUMNÿACNUMÿACDESCÿACDELPAIÿ

Hex representation, UTF-8 (PostgreSQL dump):
00000000: 31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3  1.1.1.AC.COLUMNà
00000010: bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf  ¿ACNUMÿACDESCÿ
00000020: 41 43 44 45 4c 50 41 49 c3 bf                    ACDELPAIÿ

Hex representation, UNICODE (crafted from MSSQL):
00000000: ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00  ÿþ1...1...1...A.
00000010: 43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00  C...C.O.L.U.M.N.
00000020: ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00  ÿ.A.C.N.U.M.ÿ.A.
00000030: 43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00  C.D.E.S.C.ÿ.A.C.
00000040: 44 00 45 00 4c 00 50 00 41 00 49 00 ff 00        D.E.L.P.A.I.ÿ.

So PostgreSQL chokes on the FF FE bytes at the start of the UNICODE document. Is it normal that a UNICODE file starts with FF FE?! Note that I tried to delete those characters, but they aren't visible in my editor.

So am I right? Does PostgreSQL use UTF-8 and not really understand UNICODE (UCS-2) files? Is there a way I can make the COPY command work with a UNICODE UCS-2 encoding?

Thanks for your help,
/David
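P.S. In the meantime, the workaround I'm trying is to re-encode the MSSQL export from UTF-16/UCS-2 to UTF-8 before feeding it to COPY. Here is a minimal sketch in Python (the file names are just placeholders); Python's "utf-16" codec reads the FF FE byte-order mark and strips it, so the BOM never reaches PostgreSQL:

```python
def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 (UCS-2) text file as UTF-8 for PostgreSQL COPY.

    The "utf-16" codec consumes the leading FF FE / FE FF byte-order
    mark and picks the right endianness, so the output starts with the
    real data, not the BOM bytes that COPY was choking on.
    """
    with open(src_path, "r", encoding="utf-16") as src:
        text = src.read()
    # newline="" keeps line endings exactly as they were in the source.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

On a Unix box the same conversion can apparently be done with iconv, e.g. `iconv -f UTF-16 -t UTF-8 in.txt > out.txt`.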