Re: String comparison problem in select - too many results

Francisco Olarte <folarte@xxxxxxxxxxxxxx> · Fri, 12 Jan 2018 11:08:39 +0100

Durumdara:

On Thu, Jan 11, 2018 at 1:23 PM, Durumdara <durumdara@xxxxxxxxx> wrote:
> Dear Francesco!

FrancIsco, with an I, Spanish, not Italian.

> My "bug" is that I commonly used Windows environment where the default
> collation is ANSI, based on Windows language which is Hungarian here
> (Windows1250).

I'm not sure ANSI is a collation in windows. You should read a bit
about locales, as they are not the same as collations. In fact I
suspect the order issues you've got are not UTF8 derived, as all your
data is ASCII (7 bits), which is represented the same way in ANSI,
UTF8 or ISO-8859-whatever ( IIRC win1250 is a bastard cousin of
ISO8859-2 ).

> But because of special characters we used UTF8 to store data in database.
> I supposed that UTF8.hu_HU is working like local natural order here, and the
> common ASCII chars are (like '/') in same position.

You must define "local natural order". But your problem is in the
locale, not in the encoding. If you managed to use win1250.hu_HU your
sorting would be the same; what you need is to use UTF8.C.

> Python/Delphi/LibreOffice can sort these numbers correctly (based on local
> ANSI sort).

ANSI does not define a sort order. And those three use, IIRC, a
C-locale-like sort. You are incorrectly assuming this is the correct
one.
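To illustrate what I mean by a C-locale-like sort, here is a minimal
sketch in Python. The sample strings are made up for illustration (they
are not your data); the point is only that Python's default sorted()
compares strings code point by code point, so '-' (0x2D) and '/' (0x2F)
land before the digits (0x30 and up) with no locale involved.

```python
# Made-up sample values, not taken from the original data.
values = ["10", "1/2", "1", "1-2"]

# sorted() on str compares code point by code point, ignoring any locale.
result = sorted(values)

# Code points of the characters that decide the order here.
code_points = {ch: hex(ord(ch)) for ch in "-/01"}

print(result)
print(code_points)
```

A locale-aware collation is free to weigh punctuation differently, so
the same values can come back in another order from the database.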

> I supposed that UTF8.hu_HU is using mostly same order which is valid here
> and it contains all ASCII + ANSI characters we are using here in daily work,
> and they are in very similar order.
> I never thought that it can't handle normal characters in 7 bit range...

It can. Your assumptions are BAD. You are not used to working with
collation-aware systems like postgres, and you assume they must use
the same order as non-locale-aware programs by default. This is
almost never true.

Also, windows is notoriously difficult to configure for locales. IIRC
( haven't used it in 15 years ) you had to go to keyboard preferences
to change it.

> For these numbers I can use C collation, it's ok.

C collation is like sorting raw bytes, it doesn't even sort
upper/lower case correctly ( Do not know how you do it in HU, but in
ES we sort aA before bB, while C locale normally sorts AB..ab.. ). It's
what non-locale-aware programs use because it's dirt cheap to
implement, just sort the data as unsigned byte arrays
lexicographically. And in fact, IIRC, utf-8 is designed in a way that
this works too ( You can sort by raw bytes and you get the data
sorted lexicographically by code points ).
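A small Python sketch of both points, with made-up sample strings.
Sorting by the UTF-8 encoded bytes stands in here for what a C
collation does; it is an approximation for illustration, not what
postgres runs internally.

```python
# Made-up samples: mixed case, accented letters, and a non-BMP character.
words = ["b", "A", "a", "B"]
samples = ["z", "é", "a", "ő", "€", "Z", "\U0001F600"]


def utf8_bytes(s: str) -> bytes:
    """Sort key: the raw UTF-8 bytes, compared as unsigned byte arrays."""
    return s.encode("utf-8")


# Byte-wise order puts all upper case before all lower case (AB..ab).
c_like = sorted(words, key=utf8_bytes)

# Sorting by UTF-8 bytes gives the same order as sorting by code points,
# which is how Python compares str values by default.
by_bytes = sorted(samples, key=utf8_bytes)
by_code_points = sorted(samples)

print(c_like)
print(by_bytes == by_code_points)
```

Note that neither order is what a Spanish or Hungarian reader would
call alphabetical, that is exactly what the locale is for.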

Another thing, if you have a column which you want sorted in C locale,
it can be defined as such so it does it by default ( but if you do it
with an alter column or similar you'll need a reindex probably ).

As a final note, please, DO NOT TOP POST and TRIM YOUR QUOTES.
In particular, I do not need my signature quoted.

Francisco Olarte.