Re: questions on multibyte

Michael Fuhr <mike@xxxxxxxx> · Sat, 3 Feb 2007 01:35:52 -0700

On Thu, Feb 01, 2007 at 09:32:22PM -0800, Karthikeyan Sundaram wrote:
> 1) I am using 8.2 version
> 2) My database ise encoded as UTF-8 and client encoding is as "SET 
> client_encoding = 'UNICODE';"
> 3) We have a portal that are proposed to accept multibyte.  We are using 
> PHP modules.  Do we need to convert to UTF-8 before inserting in PHP?

If you set client_encoding to whatever encoding the data is in then
you shouldn't need to convert it in the application.  See "Automatic
Character Set Conversion Between Server and Client" in the documentation:

http://www.postgresql.org/docs/8.2/interactive/multibyte.html#AEN24099

If the data is in an encoding that PostgreSQL doesn't support then
you'll need to convert it before inserting it.

> 4) We are using IBM redbrick data warehouse which is also set to UTF-8
> 5) I am using a select statement with an \o output to a flat file with pipe 
> as delimeter.  Then I am using a rb_tmu utility supplied by IBM to load 
> into the redbrick warehouse database.  When I use a select statement where 
> the column data contains Korean, Japanese and chinese character, they were 
> display fine in the PGAdmin GUI.  But when I use a select statement to 
> convert to flat file, then it's give junk chars spread across many lines. 
> My requirement is one line one record.

As for multiple lines, are you sure UTF-8 is the problem?  I don't
use pgAdmin much but I notice that it displays only the first line
of multi-line data.  For example:

CREATE TABLE test (id serial, address text);
INSERT INTO test (address) VALUES (E'Line 1\nLine 2\nLine 3');
SELECT * FROM test;

In pgAdmin the SELECT shows only 'Line 1' in the address column.
psql shows multiple lines:

 id | address 
----+---------
  1 | Line 1
    : Line 2
    : Line 3
(1 row)

You can get a single line by converting CR and LF to spaces:

SELECT id, regexp_replace(address, E'[\r\n]+', ' ', 'g') AS address FROM test;
 id |       address        
----+----------------------
  1 | Line 1 Line 2 Line 3
(1 row)

psql's \o option probably isn't the best choice for exporting data
that will be imported somewhere else.  Try COPY or psql's \copy:

\copy test to test.txt delimiter '|'

This should create a file named test.txt with the following contents:

1|Line 1\nLine 2\nLine 3

However, this output might be problematic if the import process
doesn't understand escapes like \r and \n so you might need to
convert CR and LF to spaces as above.

As for "junk chars," make sure that whatever terminal or application
you're using interprets the data as whatever encoding it's in.  For
example, if you exported the data with client_encoding set to UTF-8
(or one of its aliases like UNICODE) then make sure your display
interprets the data as UTF-8 and knows how to render it.

-- 
Michael Fuhr