Re: [HACKERS] qsort again (was Re: Strange Create

Ron <rjpeace@xxxxxxxxxxxxx> · Fri, 17 Feb 2006 01:20:58 -0500

At 01:47 PM 2/16/2006, Ron wrote:
At 12:19 PM 2/16/2006, Scott Lamb wrote:
On Feb 16, 2006, at 8:32 AM, Ron wrote:
Let's pretend that we have the typical DB table where rows are
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
optional pass to rearrange the actual rows if we so wish.

I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:
For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.

More later when I can get more time.
Ron

OK, so here's _a_ way (there are others) to obtain a mapping such that
 if a < b then f(a) < f (b) and
 if a == b then f(a) == f(b)

Pretend each row is a integer of row size (so a 2KB row becomes a 
16Kb integer; a 4KB row becomes a 32Kb integer; etc)
Since even a 1TB table made of such rows can only have 256M - 512M 
possible values even if each row is unique, a 28b or 29b key is large 
enough to represent each row's value and relative rank compared to 
all of the others even if all row values are unique.

By scanning the table once, we can map say 0000001h (Hex used to ease 
typing) to the row with the minimum value and 1111111h to the row 
with the maximum value as well as mapping everything in between to 
their appropriate keys.  That same scan can be used to assign a 
pointer to each record's location.

We can now sort the key+pointer pairs instead of the actual data and 
use an optional final pass to rearrange the actual rows if we wish.

That initial scan to set up the keys is expensive, but if we wish 
that cost can be amortized over the life of the table so we don't 
have to pay it all at once.  In addition, once we have created those 
keys, then can be saved for later searches and sorts.

Further space savings can be obtained whenever there are duplicate 
keys and/or when compression methods are used on the Key+pointer pairs.

Ron