On Wednesday, 2005-04-06 at 18:12 -0400, David Gagnon wrote:
> Hi all,
>
> I ran into this problem and want to share it and get confirmation.
>
> I tried to use the COPY command to load bulk data. I crafted a
> UNICODE file from an MSSQL db myself, but I can't load it into
> PostgreSQL. I always get the error:
>
>   CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"
>
> The problem is that both files look exactly the same... I found that
> pg_dump in fact creates a UTF-8 file (confirm please), which is
> Unicode but with a variable-length encoding (i.e. some characters use
> one byte and others two or more). See for details:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
> is a fixed-width Unicode (16-bit, or UCS-2) file (confirm please).
>
> So here is the content of the file:
>
> UTF-8 (PostgreSQL dump):
> 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> UNICODE (crafted from MSSQL):
> 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> Hex representation, UTF-8 (PostgreSQL dump):
>
> 00000000: 31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3   1.1.1.AC.COLUMNÃ
> 00000010: bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf   ¿ACNUMÿACDESCÿ
> 00000020: 41 43 44 45 4c 50 41 49 c3 bf                     ACDELPAIÿ
>
> Hex representation, UNICODE (crafted from MSSQL):
>
> 00000000: ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00   ÿþ1...1...1...A.
> 00000010: 43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00   C...C.O.L.U.M.N.
> 00000020: ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00   ÿ.A.C.N.U.M.ÿ.A.
> 00000030: 43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00   C.D.E.S.C.ÿ.A.C.
> 00000040: 44 00 45 00 4c 00 50 00 41 00 49 00 ff 00         D.E.L.P.A.I.ÿ.
>
> So PostgreSQL chokes on the FF FE that starts the Unicode document.
> Is it normal that a Unicode file starts with this FF FE?! Note that I
> tried to delete those characters, but they aren't visible...
>
> So am I right? Does PostgreSQL use UTF-8 and not really understand a
> UNICODE (UCS-2) file?
> Is there a way I can make the COPY command work with
> a UNICODE UCS-2 encoding?

Yes, Postgres "Unicode" means UTF-8. Windows programs that store
Unicode text usually prepend a BOM (byte order mark) to the file
(just google for Unicode and BOM). So you would need something to
convert the file. For example, I know Python can read the BOM, so it
can be a matter of opening the file, reading it, encoding it into
UTF-8, and writing it out again.

Regards
Tino Wildenhain

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
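[Editor's note] The conversion described in the reply can be sketched in
Python. The helper name and the file names below are hypothetical, not
part of the original thread; Python's "utf-16" codec consumes the FF FE
byte order mark automatically, so it never reaches the data:

```python
def utf16_to_utf8(src_path, dst_path):
    # The "utf-16" codec reads the FF FE BOM, picks the right byte
    # order, and strips the BOM from the decoded text.
    with open(src_path, "r", encoding="utf-16") as src:
        text = src.read()
    # Write the same text back out as UTF-8, which COPY understands
    # when the database encoding is UNICODE (UTF-8).
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Hypothetical file names:
# utf16_to_utf8("mssql_export.txt", "copy_input.txt")
```

Reading in text mode also normalizes Windows CR/LF line endings to LF,
which is what COPY expects in its input.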