Markus Schaber wrote:
Ron wrote:
...and of course if you know enough about the data to be sorted so as to constrain it appropriately, one should use a non-comparison-based O(N) sorting algorithm rather than any of the general comparison-based O(N lg N) methods.
Sounds interesting. Could you give us some pointers (names, URLs, papers) to such algorithms?
Most of these techniques boil down to good ol' "bucket sort". A simple example: suppose you have a column of integer percentages, range zero to 100. You know there are only 101 distinct values, so create 101 "buckets" (e.g. linked lists), make a single pass through your data dropping each row's ID into the right bucket, then make a second pass through the buckets and write the row IDs out in bucket order. This is an O(N) sort technique.
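To make the mechanics concrete, here's a minimal sketch of that 101-bucket example in C. The parallel row_ids/pcts arrays are made-up sample data, and real code would check the malloc return; appending at each bucket's tail keeps the sort stable.

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 101            /* one bucket per percentage, 0..100 */

typedef struct Node {
    int row_id;
    struct Node *next;
} Node;

int main(void)
{
    /* made-up sample rows: parallel arrays of row ID and percentage */
    int row_ids[] = { 1, 2, 3, 4, 5, 6 };
    int pcts[]    = { 42, 0, 100, 42, 7, 99 };
    int n = 6;
    Node *head[NBUCKETS] = { NULL };
    Node *tail[NBUCKETS] = { NULL };

    /* pass 1: drop each row's ID into the bucket for its percentage;
     * appending at the tail keeps the sort stable */
    for (int i = 0; i < n; i++) {
        Node *node = malloc(sizeof(Node));
        node->row_id = row_ids[i];
        node->next = NULL;
        if (tail[pcts[i]])
            tail[pcts[i]]->next = node;
        else
            head[pcts[i]] = node;
        tail[pcts[i]] = node;
    }

    /* pass 2: walk the buckets in order and write the row IDs out */
    for (int b = 0; b < NBUCKETS; b++)
        for (Node *p = head[b]; p != NULL; p = p->next)
            printf("row %d (pct %d)\n", p->row_id, b);

    return 0;
}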
Any time you have a restricted data range, you can do this. Say you have 100 million rows of scientific results known to be good to only three significant digits: that's at most 1,000 distinct values (regardless of the magnitude of the values), so you can do it with 1,000 buckets and just two passes through the data.
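When the distinct values aren't consecutive integers, one way to set up the buckets is to collect the distinct values in a first pass, qsort that tiny key array (sorting 1,000 keys costs almost nothing), and then count rows per key, which is just a counting-sort variant of the same idea. Here's a minimal sketch in C; the sample data and the MAX_DISTINCT constant are made up, and the linear-scan duplicate check stands in for a small hash table:

#include <stdio.h>
#include <stdlib.h>

#define MAX_DISTINCT 1000       /* assumed bound on distinct values */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* find the rank of v among the sorted distinct keys (binary search) */
static int rank_of(double v, const double *keys, int nkeys)
{
    int lo = 0, hi = nkeys - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (keys[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

int main(void)
{
    /* made-up sample values, few distinct keys */
    double data[] = { 3.14, 2.72, 3.14, 1.41, 2.72, 3.14 };
    int n = sizeof data / sizeof data[0];

    /* pass 1: collect the distinct values (linear scan here; a hash
     * table would do the same job for larger key counts) */
    double keys[MAX_DISTINCT];
    int nkeys = 0;
    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; j < nkeys && keys[j] != data[i]; j++)
            ;
        if (j == nkeys)
            keys[nkeys++] = data[i];
    }
    qsort(keys, nkeys, sizeof(double), cmp_double);  /* tiny sort */

    /* pass 2: count rows per key */
    int count[MAX_DISTINCT] = { 0 };
    for (int i = 0; i < n; i++)
        count[rank_of(data[i], keys, nkeys)]++;

    /* emit in sorted order straight from the counts */
    for (int k = 0; k < nkeys; k++)
        for (int c = 0; c < count[k]; c++)
            printf("%g\n", keys[k]);

    return 0;
}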
You can also use this trick when the optimizer is asked for "fastest first result." Say you have a cursor on a column of numbers with good distribution. If you do a bucket sort on the first two or three digits only, you know the first "page" of results will be in the first non-empty bucket. So you only need to apply qsort to that one bucket (which is very fast, since it's small), and you can deliver the first page of data to the application. This can be particularly effective in interactive situations, where the user typically looks at a few pages of data and then abandons the search.
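Here's a minimal sketch of that in C, bucketing on the top two decimal digits of a six-digit key. Everything here (the key range, the 100-bucket split, the page size, the toy fixed bucket capacity) is made up for illustration, and a real implementation would spill into the next bucket when the first one holds less than a full page:

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS  100   /* coarse buckets: first two digits of a 6-digit key */
#define PAGE_SIZE 4     /* rows per "page" shown to the user */

static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a;
    unsigned y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* made-up keys in [0, 1000000) */
    unsigned data[] = { 901230, 17, 555550, 42420, 999990, 3,
                        650000, 128, 7770, 31 };
    int n = sizeof data / sizeof data[0];
    static unsigned bucket[NBUCKETS][16];   /* toy fixed capacity */
    int count[NBUCKETS] = { 0 };
    int b, j;

    /* single pass: distribute keys by their top two decimal digits */
    for (j = 0; j < n; j++) {
        b = data[j] / 10000;
        bucket[b][count[b]++] = data[j];
    }

    /* qsort only the first non-empty bucket: it holds the smallest
     * keys, so the first page comes entirely from it (spill into the
     * next bucket if it holds fewer than a page) */
    for (b = 0; b < NBUCKETS && count[b] == 0; b++)
        ;
    qsort(bucket[b], count[b], sizeof(unsigned), cmp_uint);

    for (j = 0; j < count[b] && j < PAGE_SIZE; j++)
        printf("%u\n", bucket[b][j]);

    return 0;
}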
I doubt this is very relevant to Postgres. A relational database has to be general-purpose, and it's hard to give it "hints" that would tell it when to use this particular optimization.
Craig