At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote:
>On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
> > >For this mapping, you need a full table sort.
> > One physical IO pass should be all that's needed. However, let's
> > pretend you are correct and that we do need to sort the table to get
> > the key mapping. Even so, we would only need to do it =once= and
> > then we would be able to use and incrementally update the results
> > forever afterward. Even under this assumption, one external sort to
> > save all subsequent such sorts seems well worth it.
> >
> > IOW, even if I'm wrong about the initial cost to do this; it is still
> > worth doing ;-)
> I think you're talking about something different here. You're thinking
> of having the whole table sorted, and when you add a new value you add
> it in such a way as to keep it sorted. The problem is, what do you
> sort it by? If you've sorted the table by col1, then when the user
> does ORDER BY col2 it's useless.
No, I'm thinking about how to represent DB row data in such a way that
a= we use a compact enough representation that we can sort internally
rather than externally.
b= we do the sort once and avoid most of the overhead upon subsequent
similar requests.
I used the example of sorting on the entire row to show that the
approach works even when the value being sorted on is very large.
All my previous comments on this topic hold for the case where we are
sorting on only part of a row as well.
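To make that concrete, here's a toy C sketch of what I mean. None of
this is pg source code; the KeyEntry struct, the names, and the key
values are all made up for illustration, and it assumes the compact
keys have already been extracted from the rows:

/* Sort small (key, row id) pairs instead of 2-4KB records. The
   8-byte entries for millions of rows fit in RAM, so the sort
   stays internal; the result is a permutation we can keep and
   incrementally maintain afterward. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct
{
    uint32_t key;      /* compact key derived from the sort column(s) */
    uint32_t row_id;   /* which physical row the key came from */
} KeyEntry;

static int
cmp_key(const void *a, const void *b)
{
    const KeyEntry *ka = (const KeyEntry *) a;
    const KeyEntry *kb = (const KeyEntry *) b;

    if (ka->key < kb->key) return -1;
    if (ka->key > kb->key) return 1;
    return 0;
}

int
main(void)
{
    /* Pretend these keys were already extracted from the table. */
    KeyEntry entries[] = {{42, 0}, {7, 1}, {19, 2}, {7, 3}};
    size_t   n = sizeof(entries) / sizeof(entries[0]);
    size_t   i;

    qsort(entries, n, sizeof(KeyEntry), cmp_key);

    for (i = 0; i < n; i++)
        printf("row %u (key %u)\n",
               (unsigned) entries[i].row_id, (unsigned) entries[i].key);
    return 0;
}

The point is that the sort itself never touches the wide rows; it
only shuffles 8-byte entries.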
If all you are doing is sorting on a column or a few columns, what
I'm discussing is even easier, since mapping just the columns actually
used as sort criteria to integers, rather than treating the whole row
as an atomic unit, eats fewer resources during the key creation and
mapping process. If the row is 2-4KB in size, but we only care about
some combination of columns that takes on <= 2^8 or <= 2^16 distinct
values, then what I've described will work even better than in the
original example I gave (see the sketch below).
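In fact, if the sort columns really do collapse into <= 2^16 distinct
codes, you don't even need a comparison sort; a counting sort orders N
rows in O(N + 2^16). A rough sketch, again mine rather than anything
in pg, assuming the per-row codes have already been computed:

#include <stdint.h>
#include <stdlib.h>

/* Order n rows by 16-bit key codes without any comparisons.
   codes[i] is the compact key for row i; on return, order[]
   holds the row ids in sorted (and stable) key order. */
int
counting_sort_rows(const uint16_t *codes, uint32_t *order, size_t n)
{
    size_t *count = calloc(65536, sizeof(size_t));
    size_t  i, c, total = 0;

    if (count == NULL)
        return -1;
    for (i = 0; i < n; i++)      /* histogram of key codes */
        count[codes[i]]++;
    for (c = 0; c < 65536; c++)  /* prefix sums -> start offsets */
    {
        size_t  tmp = count[c];

        count[c] = total;
        total += tmp;
    }
    for (i = 0; i < n; i++)      /* scatter row ids into final order */
        order[count[codes[i]]++] = (uint32_t) i;
    free(count);
    return 0;
}

Two passes over the codes plus one over the 64K counters; no merge
passes, no temp files.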
Basically, the range of a key is going to be restricted by how
a= big the field is that represents the key (columns and such are
usually kept narrow for performance reasons),
b= big each row is (the more space each row takes, the fewer rows fit
into any given amount of storage), or
c= many rows there are in the table.
Between these conditions, the range of a key tends to be severely
restricted, and the keys therefore use much less space than sorting
the actual DB records would. ...and that gives us something we can
take advantage of.
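To put some purely illustrative numbers on it (round figures, not a
benchmark): a table of 2^22 (~4.2M) rows at 2KB each is 8GB of record
data, which forces an external sort on most boxes; the same 2^22 rows
as 8-byte (key, row id) pairs is only 32MB, which sorts comfortably
in RAM.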
> Indeed, this is what btrees do: you store the order of the table
> separate from the data. And you can store multiple orders. But even
> then, when someone does ORDER BY lower(col1), it's still useless.
> And you're right, we still need to do the single mass sort in the
> beginning, which is precisely what we're trying to optimise here.
Sigh. My points were:
1= we have information available to us that allows us to map the rows
in such a way as to turn most external sorts into internal sorts,
thereby avoiding the entire external sorting problem in those
cases. This is a huge performance improvement.
2= that an external sort is =NOT= required for initial key
assignment, but even if it were, it would be worth it.
3= that this is a variation of a well-known technique, so I'm not
suggesting heresy here.
Ron