
Re: Issue with loading unicode characters with copy command


 



On 1/12/24 07:23, Kiran K V wrote:
Hi,


I have a UTF8 database and a simple table with two columns (integer and varchar). I created a CSV file with some multibyte characters and am trying to load it using the copy command.

The multibyte characters come from what character set?
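
One way to answer that question is to look at the raw bytes of the file. The following is a minimal Python sketch, not something from the original thread; the helper name `describe_encoding` and the sample bytes are made up for illustration.

```python
def describe_encoding(raw: bytes) -> str:
    """Roughly classify raw file bytes (hypothetical helper, for illustration)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes such as E0 E1 with no continuation bytes are not valid UTF-8.
        return "not valid UTF-8 (possibly Latin-1 or Windows-1252)"
    if raw.isascii():
        return "plain ASCII (no multibyte characters present)"
    return "valid UTF-8"


# Sample bytes standing in for the CSV file, encoded two different ways.
as_utf8 = '1|"àá"'.encode("utf-8")
as_cp1252 = '1|"àá"'.encode("cp1252")

print(describe_encoding(as_utf8))
print(describe_encoding(as_cp1252))
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`, where `path` is whatever the CSV file is called.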


Database info:

PostgreSQL database details:

   Name    |  Owner   | Encoding |      Collate       |       Ctype        |   Access privileges
-----------+----------+----------+--------------------+--------------------+-----------------------
 postgres  | postgres | UTF8     | English_India.1252 | English_India.1252 |

(Note: I also tried with collate utf8 and no luck)


postgres=# set client_encoding='UTF8';

SET

Table:

create table public.test ( PKCOL integer not null, STR1 character varying(64) null, primary key( PKCOL ));

CSV contents:

1|"àáâãäåæçèéêëìíîï"

After loading, the stored data becomes:

à áâãäåæçèéêëìÃîï

The hex of this is: c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af

Each pair of bytes in the hex above is a two-byte UTF-8 sequence beginning with `C2`. These are not the UTF-8 encodings of the characters in the expected string, which would begin with `C3`.

In UTF-8, characters from the Latin-1 Supplement range (like `à`, `á`, `â`, etc.) are represented as two bytes. The first byte, `C2` or `C3`, marks a two-byte sequence, and the second byte selects the character. For example:

- `à` is represented as `C3 A0`

- `á` is `C3 A1`

- `â` is `C3 A2`, and so on.

In this case, the lead byte appears to be interpreted as a separate character, which is the likely reason an extra `Ã` (byte `C3`) or `Â` (byte `C2`) shows up before each intended character. It looks like the UTF-8 encoded data is being mistakenly interpreted as Latin-1 (ISO-8859-1) or Windows-1252, where each byte is treated as a separate character.
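
The byte-level reasoning above can be checked with a short Python sketch. It is an illustration only, not a confirmed diagnosis of the load problem: it uses the expected string and the reported hex from this message, and it assumes Latin-1 as the wrong interpretation.

```python
# Expected string, taken from the CSV sample in this message.
expected = "àáâãäåæçèéêëìíîï"

# Correct UTF-8 encoding: every character becomes C3 followed by A0..AF.
utf8_hex = expected.encode("utf-8").hex()
print(utf8_hex)

# Misreading those UTF-8 bytes as Latin-1 turns each byte into its own
# character, so every intended character becomes two characters.
garbled = expected.encode("utf-8").decode("latin-1")
print(len(expected), len(garbled))

# Storing the misread text in a UTF8 database would encode it a second time.
double_encoded_hex = garbled.encode("utf-8").hex()
print(double_encoded_hex[:16])

# The hex reported in this message decodes to U+00A1..U+00AF, not to à, á, ...
reported = bytes.fromhex("c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af")
print(reported.decode("utf-8"))
```

If the data was misread once and then stored as UTF-8, byte pairs such as `c2a1` would appear in the stored value, which resembles the reported hex. That is only one possible explanation.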


Please advise. Thank you very much.


Regards,

Kiran


--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx





