Re: The need for clustered indexes to boost TPC-V performance

Claudio Freire <klaussfreire@xxxxxxxxx> · Wed, 4 Jul 2012 11:26:36 -0300

On Tue, Jul 3, 2012 at 8:13 PM, Reza Taheri <rtaheri@xxxxxxxxxx> wrote:
> So I looked more closely at the indexes. I chose the CASH_TRANSACTION
> table since it has a single index, and we can compare it more directly to the
> Dell data. If you look at page 34 of http://bit.ly/QeWXhE, the index size of CT
> is 1,278,720KB for 6,120,529,488 rows. That’s less than one byte of index
> per data row!  How could that be?  Well, MS SQL used a “clustered index”
> for CT, i.e., the data is held in the leaf pages of the index B-Tree.
> The data and index are in one data structure. Once you lookup the index,
> you also have the data at zero additional cost. For PGSQL, we had to create
> a regular index, which took up 55GB. Once you do the math, this works out
> to around 30 bytes per row. I imagine we have the 15-byte key along with a
> couple of 4-byte or 8-byte pointers.
...
> So MS SQL beats PGSQL by a) having a lower I/O rate due to no competition
> for the buffer pool from indexes (except for secondary indexes); and b) by
> getting the data with a free lookup, whereas we have to work our way down
> both the index and the data trees.

15-byte key?

What about not storing the keys, but a hash, for leaf nodes?

Assuming it can be made to work for both "range" and "equality" scans,
holding only hashes on leaf nodes would reduce index size, but how
much?

I think it's doable, and I could come up with a spec if it's worth it.
It would have to scan the heap for only two extra index pages (the
extremes that cannot be ruled out) and hash collisions, which doesn't
seem like a big loss versus the reduced index.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance