Search Postgresql Archives

Re: String comparison problem in select - too many results

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@xxxxxx> wrote:
> On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
>> C collation is like sorting raw bytes, it doesn't event sort
>> upper/lower case correctly
>
> Now you are falling into the same trap as Durumdara, calling an
> unintended sort order "not correct" ;-).

Well, had you quoted / read a little farther:
>>
C collation is like sorting raw bytes, it doesn't event sort
upper/lower case correctly ( Do not know how you do it in HU, but in
ES we sort aA before bB, while C locale normally sorts AB..ab.. It's
what non-locale aware programs use because it's dirt cheap to
implement, just sort the data as unsigned byte arrays
lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that
this works too ( You can sort by raw bytes and you get the data sorted
lexicographically sorted by code points ).
<<

You'll see I was trying to define "correctly" somehow. English is not
my native language and it seems I obviously failed at it, I'll try to
do it better next time.

> C collation is certainly not what a "normal user" expects. It is
> therefore wrong for many applications (e.g., you couldn't use it to
> sort a telephone book), but it might be correct for others (e.g., it you
> need a stable, unambiguous sort order but don't care much about the
> order itself).

stable / unambiguous is shared by a la lot of collation methods, but I
see what you try to say.


> (By coincidence, I also stumbled over the fact that the en_US.UTF-8
> collation ignores punctuation characters and spaces at least in the
> first pass - took me a minute or so to figure out why I got an
> "inconsistent" (read: unexpected and not obvious for me) sort order.)

Nearly all the locale-aware sorts do funny things with
punctuation/spaces. I've found sort more and more surprissing since I
started ( with the electromechanical IBM card sorters, those )

> Lexicographers are interested in sorting single words. Therefore they
> aren't interested in punctuation. They may also not be very interested
> in accents, because most languages have few words which differ only by
> an accent

We have loads of them in spanish, but they are normally easy and many
of them come from verbs conjugation which does not o into the
dictionary  ...

> (and it the dictionary is printed on paper, the user will
> easily be able to scan up and down a few centimeters and find the right
> entry without knowing whether "à" is sorted before or after "á").

and we have none of this, I only know French doing it.

> Somebody preparing an address list should care about punctuation: You
> would probably not expect to find "Smith-Jones, Amanda" between "Smith,
> John" and "Smith, Julia". And you probably want to sort "23 Main Street"
> before "180 Main Street".


> Which brings us back to Durumdara's example: I don't know his
> application, so I don't know what "normal users" of his application
> would expect, but to me those values look like two numbers separated by
> a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> '180/2010', but of course the C collation doesn't do that:

He does not seem to want this. As all his examples use the same prefix
I think he just want to extract a small range of keys with a common
prefix . I've had this problems when "augmenting" a part-number code
to be product-part, and using things like "between xxx- and xxx-zz" to
get all parts for product xxx ( it was changed to dual fields at the
next iteration, but sometimes you need these things for the interim ).

I mean, the fact that they are both numbers doesn't mean he wants /
need numerical ordering on them, for many purposes just collapsing
prefixes is enough.

>
> => select * from foo order by t collate "C";
> ╔══════════╗
> ║    t     ║
> ╟──────────╢
> ║ 18/0113  ║
> ║ 18/0212  ║
> ║ 180/2010 ║
> ║ 23/4567  ║
> ╚══════════╝
> (4 rows)
>
> It might be possible to define a custom collation for that, but in a
> case like this I would either split this field into two integer fields
> or use a function-based index.

Yep, but he may have a temporary problem. C collation puts all the
"prefixes" together, which normally is good enough.

Francisco Olarte.





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux