Re: String comparison problem in select - too many results

Francisco Olarte <folarte@xxxxxxxxxxxxxx> · Sun, 14 Jan 2018 13:20:05 +0100

On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@xxxxxx> wrote:
> On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
>> C collation is like sorting raw bytes, it doesn't event sort
>> upper/lower case correctly
>
> Now you are falling into the same trap as Durumdara, calling an
> unintended sort order "not correct" ;-).

Well, had you quoted / read a little farther:
>>
C collation is like sorting raw bytes, it doesn't event sort
upper/lower case correctly ( Do not know how you do it in HU, but in
ES we sort aA before bB, while C locale normally sorts AB..ab.. It's
what non-locale aware programs use because it's dirt cheap to
implement, just sort the data as unsigned byte arrays
lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that
this works too ( You can sort by raw bytes and you get the data sorted
lexicographically sorted by code points ).
<<

You'll see I was trying to define "correctly" somehow. English is not
my native language and it seems I obviously failed at it, I'll try to
do it better next time.

> C collation is certainly not what a "normal user" expects. It is
> therefore wrong for many applications (e.g., you couldn't use it to
> sort a telephone book), but it might be correct for others (e.g., it you
> need a stable, unambiguous sort order but don't care much about the
> order itself).

stable / unambiguous is shared by a la lot of collation methods, but I
see what you try to say.

> (By coincidence, I also stumbled over the fact that the en_US.UTF-8
> collation ignores punctuation characters and spaces at least in the
> first pass - took me a minute or so to figure out why I got an
> "inconsistent" (read: unexpected and not obvious for me) sort order.)

Nearly all the locale-aware sorts do funny things with
punctuation/spaces. I've found sort more and more surprissing since I
started ( with the electromechanical IBM card sorters, those )

> Lexicographers are interested in sorting single words. Therefore they
> aren't interested in punctuation. They may also not be very interested
> in accents, because most languages have few words which differ only by
> an accent

We have loads of them in spanish, but they are normally easy and many
of them come from verbs conjugation which does not o into the
dictionary  ...

> (and it the dictionary is printed on paper, the user will
> easily be able to scan up and down a few centimeters and find the right
> entry without knowing whether "à" is sorted before or after "á").

and we have none of this, I only know French doing it.

> Somebody preparing an address list should care about punctuation: You
> would probably not expect to find "Smith-Jones, Amanda" between "Smith,
> John" and "Smith, Julia". And you probably want to sort "23 Main Street"
> before "180 Main Street".

> Which brings us back to Durumdara's example: I don't know his
> application, so I don't know what "normal users" of his application
> would expect, but to me those values look like two numbers separated by
> a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> '180/2010', but of course the C collation doesn't do that:

He does not seem to want this. As all his examples use the same prefix
I think he just want to extract a small range of keys with a common
prefix . I've had this problems when "augmenting" a part-number code
to be product-part, and using things like "between xxx- and xxx-zz" to
get all parts for product xxx ( it was changed to dual fields at the
next iteration, but sometimes you need these things for the interim ).

I mean, the fact that they are both numbers doesn't mean he wants /
need numerical ordering on them, for many purposes just collapsing
prefixes is enough.

>
> => select * from foo order by t collate "C";
> ╔══════════╗
> ║    t     ║
> ╟──────────╢
> ║ 18/0113  ║
> ║ 18/0212  ║
> ║ 180/2010 ║
> ║ 23/4567  ║
> ╚══════════╝
> (4 rows)
>
> It might be possible to define a custom collation for that, but in a
> case like this I would either split this field into two integer fields
> or use a function-based index.

Yep, but he may have a temporary problem. C collation puts all the
"prefixes" together, which normally is good enough.

Francisco Olarte.