Re: [GENERAL] Creation of tsearch2 index is very

Ron <rjpeace@xxxxxxxxxxxxx> · Fri, 27 Jan 2006 02:52:45 -0500

At 11:13 PM 1/26/2006, Craig A. James wrote:
Ron,

I'll write to you privately, because these discussions can get messy 
"in public".

I'm responding to this missive publicly in an attempt to help the 
discussion along.  It is not my usual practice to respond to private 
messages publicly, but this seems a good case for an exception.

You seem to have missed my point.  I just gave a very clear 
description of how to "decide which bitmaps go in each of the two 
buckets" by reformulating the question into "decide which bitmaps 
go in each of =four= buckets".

Sorry to disagree, but here's the problem. It's not whether you put 
them into two, or four, or N buckets.  The problem is, how do you 
categorize them to begin with, so that you have some reasonable 
criterion for which item goes in which bucket?  THAT is the hard 
problem, not whether you have two or four buckets.

Agreed.  ...and I've given the answer to "how do you categorize them" 
using a general property of RD Trees which should result in "a 
reasonable criterion for which item goes in which bucket" when used 
for text searching.

The definition of RD tree keys being either "atomic" (non 
decomposable) or "molecular" (containing the keys of their 
descendents) is the one source of our current problems figuring out 
how to split them and, if I'm correct, a hint as to how to solve the 
splitting problem in O(1) time while helping to foster high 
performance during seearches.

Earlier, you wrote:
Traditional B-tree ordering methods can be used to define the 
ordering/placement of pages within each index, which will minimize 
head seeks to find the correct page to scan.
Since the criteria for putting a key within a page or starting a 
new page is simple, performance for those tasks should be O(1).

What are the "traditional B-tree ordering methods"?  That's the 
issue, right there.  The problem is that bitmaps HAVE NO ORDERING 
METHOD.  You can't sort them numerically or alphabetically.
The =bitmaps= have no ordering method.  =Pages= of bitmaps MUST have 
an ordering method or we have no idea which page to look at when 
searching for a key.

Treating the "root" bitmaps (those that may have descendents but have 
no parents) as numbers and ordering the pages using B tree creation 
methods that use those numbers as keys is a simple way to create a 
balanced data structure with high fan out.  IOW, a recipe for finding 
the proper page(s) to scan using minimal seeks.

Take a crowd of people.  It's easy to divide them in half by names: 
just alphabetize and split the list in half.  That's what your 
solution would improve on.

But imagine I gave you a group of people and told you to put them 
into two rooms, such that when you are through, the people in each 
room are maximally similar to each other and maximally dissimilar to 
the people in the other room.  How would you do it?

First of all, you'd have to define what "different" and "same" 
are.  Is your measure based on skin color, hair color, age, height, 
weight, girth, intelligence, speed, endurance, body odor, ... 
?  Suppose I tell you, "All of those".  You have to sort the people 
so that your two groups are separated such that the group 
differences are maximized in this N-dimensional 
space.  Computationally, it's a nearly impossible problem.
I'm =changing= the problem using the semantics of RD trees.  Using an 
RD tree representation, we'd create and assign a key for each person 
that ranked them compared to everyone else for each of the metrics we 
decided to differentiate on.

Then we start to form trees of keys to these people by creating 
"parent" keys as roots that contain the union of everyone with the 
same or possibly similar value for some quality.  By iterating this 
process, we end up with a bottom up construction method for an RD 
tree whose root key will the union of all the keys representing these 
people.  If this is one data structure, we end up with an efficient 
and effective way of answering not only Boolean but also ranking and 
similarity type queries.

The problem comes when we have to split this monolithic DS into 
pieces for best performance.  As many have noted, there is no 
simple  way to decide how to do such a thing.

OTOH, we =do= have the knowledge of how RD trees are built and what 
their keys represent, and we know that queries are going to tend 
strongly to either a) traverse the path from parent to child (depth 
first) or b) find all siblings with (dis)similar characteristics 
(breadth first), or c) contain a mixture of a) and b).

So I'm suggesting that conceptually we clone the original RD tree and 
we split each clone according to two different methods.
Method A results in pages that contain as many complete depth first 
paths from root to leave as possible on each page.
Method B results in pages that contain as many siblings as possible per page.
...and we use the appropriate index during each type of query or query part.

In return for using 2x as much space, we should have a general method 
that is O(1) for decomposing RD trees in such a way as to support 
high performance during searches.

Here's a much simpler version of the problem.  Suppose I give you 
1000 numbers, and tell you to divide them in half so that the two 
groups have the smallest standard deviation within each group 
possible, and the two average values of the groups of numbers are 
the farthest apart possible.  Pretty easy, right?

Now do it where "distance" is evaluated modulo(10), that is, 1 and 9 
are closer together than 1 and 3.  Now try it -- you'll find that 
this problem is almost impossible to solve.
This is orthogonal to the discussion at hand since the above is not 
akin to text searching nor best done with RD trees and that is 
exactly what this discussion is about.  We don't have to solve a 
general problem for all domains.  We only have to solve it for the 
specific domain of text search using the specific DS of RD trees.

The problem is that, as database designers, we're used to text and 
numbers, which have an inherent order.  Bitmaps have no ordering -- 
you can't say one is "greater than" or "less than" the other.  All 
you can do is determine that A and B are "more similar" or "less 
similar" than A and C.
Text and numbers only have an order because we deem them to.  There 
is even a very common default order we tend to use.  But even in the 
trivial case of ranking we've all said that "1" is better than "2" in 
one situation and "2" is better than "1" in another.

In the case of text searching, the bitmaps represent where to find 
legomena, AKA specific tokens.  Particularly hapax legomena, AKA 
unique tokens.  Hapax Legomena are particularly important because 
they are maximally efficient at driving the search needed to answer a query.

While absolute hapax legomena are great for quickly pruning things 
within a document or document region, relative hapax legomena can do 
the same thing when searching among multiple documents or document regions.

The two indexes I'm suggesting are designed to take advantage of this 
general property of text searching.

Hopefully this clarifies things and motivates a better discussion?
Ron