Re: GiST index performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 21 Apr 2009, Matthew Wakeling wrote:
Unfortunately, it seems there is another bug in the picksplit function. My patch fixes a bug that reveals this new bug. The whole picksplit algorithm is fundamentally broken, and needs to be rewritten completely, which is what I am doing.

I have now rewritten the picksplit and penalty functions for the bioseg data type, and they perform much better. The index size is now 164MB, compared to 350MB or so originally, and 2400MB after my earlier bugfix. Execution time of one of our queries (basically a nested loop join over a sequential scan and an index lookup in this index type) has gone down from 45 minutes to two minutes.

I have abandoned "Guttman's poly time split algorithm". A fundamental flaw in the picksplit algorithm is that it would create two separate target sets, and incrementally add entries to whichever one would grow the least in range size. However, if the entries arrived in any sort of order, they would all be added to the one set, growing it by a small amount each time. This caused the picksplit algorithm to split a set of 367 entries into a set of 366 and a set of one a high proportion of the time.

I have replaced the picksplit algorithm with a simple one. For each range element, find the midpoint of the range. Then find the mean of all the midpoints. All elements with a midpoint below the mean go in one set, and the others go in the second set. This usually splits the entries in a meaningful way.

I have also changed the penalty function. Previously, the penalty was the amount that the range would have to expand. So, if a new element fitted inside the existing range, then the penalty is zero. I have changed it to create a tie-break between multiple index pages that the element would fit in without expanding the range - the element should be inserted into the index page with the smallest range. This prevents large elements from messing up the index by forcing a large index page range that sucks in all the elements in the whole area into a non-selective group.

I may experiment with improving these functions further. The main problem with this index is the fact that I need to index ranges with a wide variety of widths, and I have a couple more strategies yet to help with that.

I will post a patch when I have ported my bioseg code over to the seg data type.

Matthew

--
Riker: Our memory pathways have become accustomed to your sensory input.
Data:  I understand - I'm fond of you too, Commander. And you too Counsellor

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux