Re: Issue with loading unicode characters with copy command

Kiran K V <kirankv.1982@xxxxxxxxx> · Fri, 12 Jan 2024 23:17:27 +0530

It's UTF-8. I also verified the load file, and it is UTF-8 as well.

Regards,
Kiran

On Fri, Jan 12, 2024 at 10:48 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 1/12/24 07:23, Kiran K V wrote:

> Hi,
>
> I have a UTF8 database and a simple table with two columns (integer and
> varchar). I created a CSV file with some multibyte characters and am trying
> to load it using the COPY command.

The multibyte characters come from what character set?
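
One way to answer that is to look at the raw bytes of the load file rather than at an editor's label. A minimal Python sketch, assuming the file fits in memory; `describe_encoding` is a hypothetical helper written for this illustration, not part of any library, and the sample bytes stand in for the real CSV:

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Give a rough label for how a byte string is encoded (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.isascii():
        return "ascii only"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf-8"
    return "valid utf-8"


# Sample bytes standing in for the CSV line; a real check would read the
# file in binary mode instead, e.g. open(path, "rb").read().
as_utf8 = '1|"àáâ"'.encode("utf-8")
as_cp1252 = '1|"àáâ"'.encode("cp1252")

print(describe_encoding(as_utf8), as_utf8.hex(" "))
print(describe_encoding(as_cp1252), as_cp1252.hex(" "))
```

A file that was already double-encoded before the load would still pass as valid UTF-8, so the hex dump matters as well: in a correctly encoded file `à` appears as `c3 a0`.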

> Database info:
>
> Postgresql database details:
>
>    Name    |  Owner   | Encoding |      Collate       |       Ctype        |   Access privileges
> -----------+----------+----------+--------------------+--------------------+-----------------------
>  postgres  | postgres | UTF8     | English_India.1252 | English_India.1252 |
>
> (Note: I also tried with collate utf8 and no luck)

> postgres=# set client_encoding='UTF8';
> SET
>
> Table:
>
> create table public.test ( PKCOL integer not null, STR1 character
> varying(64) null, primary key( PKCOL ));

> CSV contents:
>
> 1|"àáâãäåæçèéêëìíîï"
>
> After loading, the stored data becomes:
>
> Ã Ã¡Ã¢Ã£Ã¤Ã¥Ã¦Ã§Ã¨Ã©ÃªÃ«Ã¬ÃÃ®Ã¯
>
> The hex of this is:
> c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af

> The hex values are indeed the UTF-8 encodings of the characters in your
> expected string, and the presence of `C2` before each character is
> indicative of how UTF-8 represents certain characters.
>
> In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`,
> etc.) are represented as two bytes. The first byte `C2` or `C3`
> indicates that this is a two-byte character, and the second byte
> specifies the character. For example:
>
> - `à` is represented as `C3 A0`
> - `á` is `C3 A1`
> - `â` is `C3 A2`, and so on.
>
> In this case, the `C2` byte is getting interpreted as a separate
> character and that is the likely reason that an `Â` (which corresponds
> to `C2`) is seen before each intended character. Looks like UTF-8
> encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or
> Windows-1252, where each byte is treated as a separate character.
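
The misinterpretation described above can be reproduced outside the database. A minimal Python sketch, assuming the fault is a single extra Windows-1252 decode somewhere between the file and the table (that is a hypothesis, not something confirmed in this thread); all variable names are illustrative:

```python
# The string from the CSV line quoted in the thread.
expected = "àáâãäåæçèéêëìíîï"

# Correct UTF-8: every character is two bytes, the first of which is C3.
utf8_bytes = expected.encode("utf-8")
print(utf8_bytes.hex())

# Assumed fault: the UTF-8 bytes are read as Windows-1252, one character
# per byte, so each original character turns into two.
mojibake = utf8_bytes.decode("cp1252")

# Storing that text in a UTF8 database encodes it a second time.
double_encoded = mojibake.encode("utf-8")
print(double_encoded.hex())

# If exactly this happened, the damage is reversible.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == expected)
```

Under this assumption each original character is stored as four bytes, beginning `c3 83 c2 a0` for `à`. Comparing that pattern with the hex actually read back from the table should show whether a single extra decode explains the result or whether something else is going on.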

> Please advise. Thank you very much.
>
> Regards,
> Kiran

-- 
Adrian Klaver
adrian.klaver@xxxxxxxxxxx