> -----Original Message----- > From: pgsql-hackers-owner@xxxxxxxxxxxxxx [mailto:pgsql-hackers- > owner@xxxxxxxxxxxxxx] On Behalf Of Markus Schaber > Sent: Thursday, February 16, 2006 5:45 AM > To: pgsql-performance@xxxxxxxxxxxxxx; pgsql-hackers@xxxxxxxxxxxxxx > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index > > Hi, Ron, > > Ron wrote: > > > ...and of course if you know enough about the data to be sorted so as to > > constrain it appropriately, one should use a non comparison based O(N) > > sorting algorithm rather than any of the general comparison based > > O(NlgN) methods. > > Sounds interesting, could you give us some pointers (names, URLs, > papers) to such algorithms? He refers to counting sort and radix sort (which comes in most significant digit and least significant digit format). These are also called distribution (as opposed to comparison) sorts. These sorts are O(n) as a function of the data size, but really they are O(M*n) where M is the average key length and n is the data set size. (In the case of MSD radix sort M is the average length to completely differentiate strings) So if you have an 80 character key, then 80*log(n) will only be faster than n*log(n) when you have 2^80th elements -- in other words -- never. If you have short keys, on the other hand, distribution sorts will be dramatically faster. On an unsigned integer, for instance, it requires 4 passes with 8 bit buckets and so 16 elements is the crossover to radix is faster than an O(n*log(n)) sort. Of course, there is a fixed constant of proportionality and so it will really be higher than that, but for large data sets distribution sorting is the best thing that there is for small keys. You could easily have an introspective MSD radix sort. The nice thing about MSD radix sort is that you can retain the ordering that has occurred so far and switch to another algorithm. An introspective MSD radix sort could call an introspective quick sort algorithm once it processed a crossover point of buckets of key data. In order to have distribution sorts that will work with a database system, for each and every data type you will need a function to return the bucket of bits of significance for the kth bucket of bits. For a character string, you could return key[bucket]. For an unsigned integer it is the byte of the integer to return will be a function of the endianness of the CPU. And for each other distinct data type a bucket function would be needed or a sort could not be generated for that type using the distribution method.