Re: [GENERAL] Creation of tsearch2 index is very

Ron <rjpeace@xxxxxxxxxxxxx> · Thu, 26 Jan 2006 19:18:35 -0500

At 01:27 PM 1/21/2006, Tom Lane wrote:
Ron <rjpeace@xxxxxxxxxxxxx> writes:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> Well, we're trying to split an index page that's gotten full into
>> two index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requirement though).

> Maybe we are over thinking this.  What happens if we do the obvious
> and just make a new page and move the "last" n/2 items on the full
> page to the new page?

Search performance will go to hell in a handbasket :-(.  We have to make
at least some effort to split the page in a way that will allow searches
to visit only one of the two child pages rather than both.

It's certainly true though that finding the furthest pair is not a
necessary component of that.  It's reasonable if you try to visualize
the problem in 2D or 3D, but I'm not sure that that geometric intuition
holds up in such a high-dimensional space as we have here.
After reading the various papers available on GiST and RD trees, I 
think I have a decent suggestion.

Since RD tree keys contain the keys of their descendents/components 
in them, they are basically a representation of a depth first 
search.  This can be useful for intra-document searches.

OTOH, inter-document searches are going to be more akin to breadth 
first searches using RD trees.

Thus my suggestion is that we maintain =two= index structures for text data.

The first contains as many keys and all their descendents as 
possible.  When we can no longer fit a specific complete "path" into 
a page, we start a new one; trying to keep complete top level to leaf 
key sets within a page.  This will minimize paging during 
intra-document searches.

The second index keeps siblings within a page and avoids putting 
parents or children within a page unless the entire depth first 
search can be kept within the page in addition to the siblings 
present.  This will minimize paging during inter-document searches.

Traditional B-tree ordering methods can be used to define the 
ordering/placement of pages within each index, which will minimize 
head seeks to find the correct page to scan.

Since the criteria for putting a key within a page or starting a new 
page is simple, performance for those tasks should be O(1).

The price is the extra space used for two indexes instead of one, but 
at first glance that seems well worth it.

Comments?
Ron