May "PostgreSQL server-side GB18030 character set support" be reconsidered?

Han Parker <parker.han@xxxxxxxxxxx> · Mon, 5 Oct 2020 05:14:58 +0000

Hi,

Could "GB18030 server-side support" be reconsidered, now that about 15 years have passed since the release of GB18030-2005?

It may be one of the greenest features PostgreSQL could offer.

1. In this era of big data and mobile computing, in the country with the largest population, spending 50% more disk space and energy on Chinese characters (UTF-8 usually needs 3 bytes for a Chinese character, while GB18030 needs only 2) works against "carbon neutrality" and adds to polar ice melting.

https://www.nasa.gov/feature/goddard/2020/emissions-could-add-15-inches-to-2100-sea-level-rise-nasa-led-study-finds 
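
The byte counts in point 1 can be checked with a short Python sketch. It relies on Python's built-in `utf-8` and `gb18030` codecs; the sample string is an assumption chosen for illustration, and the 3:2 ratio holds only for characters in the 2-byte GB18030 range (rarer characters take 4 bytes in GB18030).

```python
# Compare the encoded size of the same text in UTF-8 and GB18030.
# The sample text is hypothetical; it contains three common Chinese characters.
sample = "数据库"

utf8_size = len(sample.encode("utf-8"))       # 3 bytes per character here
gb18030_size = len(sample.encode("gb18030"))  # 2 bytes per character here

# Relative extra space needed by UTF-8 for this sample.
overhead = (utf8_size - gb18030_size) / gb18030_size
print(f"UTF-8: {utf8_size} bytes, GB18030: {gb18030_size} bytes, "
      f"overhead: {overhead:.0%}")
```

ASCII text takes 1 byte per character in both encodings, so the saving for mixed content would be smaller than for pure Chinese text.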

2. "Setting the client side to UTF-8, just like setting the server side to UTF-8", as suggested in the mail below, is not practical for most Chinese IT projects, especially publicly funded ones, because GB18030 compatibility is required by law in Mainland China.

The client-side encoding setting is usually harder to hide in a GUI, and most MS Windows users are familiar with GB18030.

MySQL has supported GB18030 on the server side since version 5.7 in 2015, and I am not sure how much this feature has contributed to MySQL's greater popularity in Mainland China.

https://dev.mysql.com/doc/mysql-g11n-excerpt/5.7/en/charset-gb18030.html

Parker Han

From: pgsql-general-owner@xxxxxxxxxxxxxx <pgsql-general-owner@xxxxxxxxxxxxxx> on behalf of Arjen Nienhuis <a.g.nienhuis@xxxxxxxxx>

Sent: Saturday, March 7, 2015 8:18

To: lsliang <lsliang@xxxxxxxxxxxxxxx>

Cc: Adrian Klaver <adrian.klaver@xxxxxxxxxxx>; pgsql-general <pgsql-general@xxxxxxxxxxxxxx>

Subject: Re: Re: Re: [GENERAL] can postgresql supported utf8mb4 character sets?

On Fri, Mar 6, 2015 at 3:55 AM, lsliang <lsliang@xxxxxxxxxxxxxxx> wrote:

2015-03-06

From: Adrian Klaver
Sent: 2015-03-05 21:31:39
To: lsliang; pgsql-general
Cc:
Subject: Re: [GENERAL] can postgresql supported utf8mb4 character sets?

On 03/05/2015 01:45 AM, lsliang wrote:

> Can PostgreSQL support the utf8mb4 character set?
> Today mobile apps support 4-byte characters, and utf8 can only
> support 1-3 byte characters.

The docs would seem to indicate otherwise:

http://www.postgresql.org/docs/9.3/interactive/multibyte.html

http://en.wikipedia.org/wiki/UTF-8

> If a string containing a 4-byte character is loaded into the database,
> it will fail.

Have you actually tried to load strings into Postgres?

If so, and it failed, what method did you use and what was the error?

> MySQL has supported the utf8mb4 character set since 5.5.3.
> I can't find any information about PostgreSQL.
> Thanks

-- 
Adrian Klaver
adrian.klaver@xxxxxxxxxxx

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Thanks for your help.

PostgreSQL can support 4-byte characters:

test=> select * from utf8mb4_test ;
ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "GB18030"
test=> \encoding utf8 
test=> select * from utf8mb4_test ;
 content 
---------
 ðŸ˜„
 ðŸ˜„

pcauto=> 

UTF-8 support works fine. The 3-byte limit was something MySQL invented. But 4-byte characters only come through correctly if your client encoding is UTF-8. In your example, your terminal is not set to UTF-8.
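
The garbled output in the session above can be reproduced outside the database with a small Python sketch: the same four UTF-8 bytes, decoded with a single-byte code page, give four unrelated characters. The choice of `cp1252` is an assumption about the terminal, not something stated in the thread.

```python
# Reproduce the garbled output: UTF-8 bytes decoded with the wrong code page.
# cp1252 is an assumed terminal encoding, not something stated in the thread.
emoji = "\U0001F604"             # the character stored in the table
raw = emoji.encode("utf-8")      # the 4-byte sequence f0 9f 98 84
garbled = raw.decode("cp1252")   # each byte is rendered as its own character

print(raw.hex(" "))
print(garbled)
```

The data in the table is intact in this scenario; only the display step interprets the bytes incorrectly.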

create table test (glyph text);

insert into test values ('A'), ('馬'), ('𐁀'), ('😄'), ('🇪🇸');

select glyph, convert_to(glyph, 'utf-8'), length(glyph) FROM test;

 glyph |     convert_to     | length
-------+--------------------+--------
 A     | \x41               |      1
 馬    | \xe9a6ac           |      1
 𐁀     | \xf0908180         |      1
 😄     | \xf09f9884         |      1
 🇪🇸    | \xf09f87aaf09f87b8 |      2
(5 rows)

What doesn't work is GB18030:

select glyph, convert_to(glyph, 'GB18030'), length(glyph) FROM test;

ERROR:  character with byte sequence 0xf0 0x90 0x81 0x80 in encoding "UTF8" has no equivalent in encoding "GB18030"

I think that is a bug.
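
As a cross-check on that claim, here is a hedged sketch that uses Python's built-in `gb18030` codec as a stand-in for `convert_to`. It shows that the characters from the test table do have GB18030 encodings and round-trip cleanly, which suggests the limitation is in the conversion rather than in GB18030 itself. It says nothing about how PostgreSQL implements the conversion.

```python
# Encode the test glyphs with Python's gb18030 codec (a stand-in for
# PostgreSQL's convert_to) and confirm that each one round-trips.
glyphs = ["A", "馬", "\U00010040", "\U0001F604"]

sizes = []
for glyph in glyphs:
    encoded = glyph.encode("gb18030")
    assert encoded.decode("gb18030") == glyph  # lossless round trip
    sizes.append(len(encoded))
    print(f"U+{ord(glyph):04X}: {len(encoded)} byte(s) in GB18030")
```

Characters outside the Basic Multilingual Plane, such as the last two, use the 4-byte form of GB18030.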

Gr. Arjen