On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@xxxxxx> wrote: > On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote: >> C collation is like sorting raw bytes, it doesn't event sort >> upper/lower case correctly > > Now you are falling into the same trap as Durumdara, calling an > unintended sort order "not correct" ;-). Well, had you quoted / read a little farther: >> C collation is like sorting raw bytes, it doesn't event sort upper/lower case correctly ( Do not know how you do it in HU, but in ES we sort aA before bB, while C locale normally sorts AB..ab.. It's what non-locale aware programs use because it's dirt cheap to implement, just sort the data as unsigned byte arrays lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that this works too ( You can sort by raw bytes and you get the data sorted lexicographically sorted by code points ). << You'll see I was trying to define "correctly" somehow. English is not my native language and it seems I obviously failed at it, I'll try to do it better next time. > C collation is certainly not what a "normal user" expects. It is > therefore wrong for many applications (e.g., you couldn't use it to > sort a telephone book), but it might be correct for others (e.g., it you > need a stable, unambiguous sort order but don't care much about the > order itself). stable / unambiguous is shared by a la lot of collation methods, but I see what you try to say. > (By coincidence, I also stumbled over the fact that the en_US.UTF-8 > collation ignores punctuation characters and spaces at least in the > first pass - took me a minute or so to figure out why I got an > "inconsistent" (read: unexpected and not obvious for me) sort order.) Nearly all the locale-aware sorts do funny things with punctuation/spaces. I've found sort more and more surprissing since I started ( with the electromechanical IBM card sorters, those ) > Lexicographers are interested in sorting single words. Therefore they > aren't interested in punctuation. They may also not be very interested > in accents, because most languages have few words which differ only by > an accent We have loads of them in spanish, but they are normally easy and many of them come from verbs conjugation which does not o into the dictionary ... > (and it the dictionary is printed on paper, the user will > easily be able to scan up and down a few centimeters and find the right > entry without knowing whether "à" is sorted before or after "á"). and we have none of this, I only know French doing it. > Somebody preparing an address list should care about punctuation: You > would probably not expect to find "Smith-Jones, Amanda" between "Smith, > John" and "Smith, Julia". And you probably want to sort "23 Main Street" > before "180 Main Street". > Which brings us back to Durumdara's example: I don't know his > application, so I don't know what "normal users" of his application > would expect, but to me those values look like two numbers separated by > a slash. So I would expect '23/4567' to be sorted between '18/0212' and > '180/2010', but of course the C collation doesn't do that: He does not seem to want this. As all his examples use the same prefix I think he just want to extract a small range of keys with a common prefix . I've had this problems when "augmenting" a part-number code to be product-part, and using things like "between xxx- and xxx-zz" to get all parts for product xxx ( it was changed to dual fields at the next iteration, but sometimes you need these things for the interim ). I mean, the fact that they are both numbers doesn't mean he wants / need numerical ordering on them, for many purposes just collapsing prefixes is enough. > > => select * from foo order by t collate "C"; > ╔══════════╗ > ║ t ║ > ╟──────────╢ > ║ 18/0113 ║ > ║ 18/0212 ║ > ║ 180/2010 ║ > ║ 23/4567 ║ > ╚══════════╝ > (4 rows) > > It might be possible to define a custom collation for that, but in a > case like this I would either split this field into two integer fields > or use a function-based index. Yep, but he may have a temporary problem. C collation puts all the "prefixes" together, which normally is good enough. Francisco Olarte.