Re: 回复: May "PostgreSQL server side GB18030 character set support" reconsidered?

Tatsuo Ishii <ishii@xxxxxxxxxxxx> · Tue, 06 Oct 2020 09:04:10 +0900 (JST)

> TBH, even if you came up with a complete patch, we'd probably
> reject it as unmaintainable and a security hazard.  The problem
> is that code may scan a string looking for certain ASCII characters
> such as backslash (\), which up to now it's always been able to do
> byte-by-byte without fear that non-ASCII characters could confuse it.
> To support GB18030 (or other encodings with the same issue, such as
> SJIS), every such loop would have to be modified to advance character
> by character, thus roughly "p += pg_mblen(p)" instead of "p++".
> Anyplace that neglected to do that would have a bug --- one that
> could only be exposed by careful testing using GB18030 encoding.
> What's more, such bugs could easily be security problems.
> Mis-detecting a backslash, for example, could lead to wrong decisions
> about where string literals end, allowing SQL-injection exploits.

One of ideas to avoid the concern could be "shifting" GB18030 code
points into "ASCII safe" code range with some calculations so that
backend can handle them without worrying about the concern above. This
way, we could avoid a table lookup overhead which is necessary in
conversion between GB18030 and UTF8 and so on.

However I don't come up with such a mathematical conversion method for
now.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp