Re: The need for clustered indexes to boost TPC-V performance

Reza Taheri <rtaheri@xxxxxxxxxx> · Thu, 5 Jul 2012 12:10:04 -0700

Hi Daniel,
Yes, it sounds like GIT will take us half the way there by getting rid of much of the index I/O if we cluster the tables. We can set the fillfactor parameter to keep tables sorted after updates. I am not sure what impact inserts will have since the primary key keeps growing with new inserts, so perhaps the table will maintain the cluster order and the benefits of GIT for new rows, too. GIT won't save CPU cycles the way a clustered/integrated index would, and actually adds to the CPU cost since the data page has to be searched for the desired tuple.

Thanks,
Reza

> -----Original Message-----
> From: Daniel Farina [mailto:daniel@xxxxxxxxxx]
> Sent: Wednesday, July 04, 2012 6:40 AM
> To: Craig Ringer
> Cc: Reza Taheri; pgsql-performance@xxxxxxxxxxxxxx; Robert Haas
> Subject: Re:  The need for clustered indexes to boost TPC-V
> performance
> 
> On Tue, Jul 3, 2012 at 10:43 PM, Craig Ringer <ringerc@xxxxxxxxxxxxx> wrote:
> > On 07/04/2012 07:13 AM, Reza Taheri wrote:
> >
> > Following the earlier email introducing the TPC-V benchmark, and that
> > we are developing an industry standard benchmarking kit for TPC-V
> > using PostgreSQL, here is a specific performance issue we have run into.
> >
> >
> > Which version of PostgreSQL are you using?
> >
> > How has it been tuned beyond the defaults - autovacuum settings,
> > shared_buffers, effective_cache_size, WAL settings, etc?
> >
> > How much RAM is on the blade? What OS and version are on the blade?
> >
> >
> > Comparing the table sizes, we are close to 2X larger (more on this in
> > a later note). But the index size is what stands out. Our overall
> > index usage (again, after accounting for different numbers of rows) is
> > 4.8X times larger. 35% of our I/Os are to the index space. I am
> > guessing that the 4.8X ballooning has something to do with this, and
> > that in itself explains a lot about our high I/O rate, as well as
> > higher CPU/tran cycles compared to MS SQL (we are  2.5-3 times slower).
> >
> > This is making me wonder about bloat issues and whether proper
> > vacuuming is being done. If the visibility map and free space map
> > aren't maintained by proper vaccum operation everything gets messy,
> fast.
> >
> > Well, MS SQL used a "clustered index" for CT, i.e., the data is held
> > in the leaf pages of the index B-Tree. The data and index are in one
> > data structure. Once you lookup the index, you also have the data at
> > zero additional cost.
> >
> > [snip]
> >
> >
> >
> > Is the PGSQL community willing to invest in a feature that a) has been
> > requested by many others already; and b) can make a huge difference in
> > a benchmark that can lend substantial credibility to PGSQL performance?
> >
> >
> > while PostgreSQL doesn't support covering indexes or clustered indexes
> > at this point, 9.2 has added support for index-only scans, which are a
> > half-way point of sorts. See:
> >
> >   http://rhaas.blogspot.com.au/2011/10/index-only-scans-weve-got-
> em.html
> >   http://rhaas.blogspot.com.au/2010/11/index-only-scans.html
> >
> > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a2822fb9
> > 337a21f98ac4ce850bb4145acf47ca27
> >
> > If at all possible please see how your test is affected by this
> > PostgreSQL
> > 9.2 enhancement. It should make a big difference, and if it doesn't
> > it's important to know why.
> >
> > (CC'd Robert Haas)
> >
> > I'm not sure what the best option for getting a 9.2 beta build for
> > Windows is.
> >
> >
> > As for the "invest" side - that's really a matter for EnterpriseDB,
> > Command Prompt, Red Hat, and the other backers who're employing
> people
> > to work on the DB. Consider asking on pgsql-hackers, too; if nothing
> > else you'll get a good explanation of the current state and progress toward
> clustered indexes.
> >
> > Some links that may be useful to you are:
> >
> >   http://wiki.postgresql.org/wiki/Todo
> >   Things that it'd be good to support/implement at some point.
> > Surprisingly, covering/clustered indexes aren't on there or at least aren't
> easily found.
> > It's certainly a much-desired feature despite its apparent absence
> > from the TODO.
> 
> I think there is, deservingly, a lot of hesitation to implement a strictly
> ordered table construct.  A similar feature that didn't quite get finished --
> but maybe can be beaten into shape -- is the grouped-index-tuple
> implementation:
> 
> http://community.enterprisedb.com/git/
> 
> It is mentioned on the TODO page.  It's under the category that is perhaps
> poorly syntactically overloaded in the world "cluster".
> 
> --
> fdr

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance