On 10/1/07, Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> This is what full text is used for with code:
> http://lxr.mozilla.org/
>
> It makes grep instant.

I'd thought that keeping a full-text index of all my program files was
my dirty little secret that shows I'm not a "pro" programmer ;-)

> For source code you can take the full text concept further and store
> parse trees.

[details snipped]

This sounds interesting in principle but is beyond what I'm thinking
of in practice (particularly since I'm not in the "C is the only
language worth ever using" camp).

> Full text indexing can also achieve high levels of compression as
> stated in the earlier threads. It is full scale dictionary
> compression. When it is being used for compression you want to apply
> it to all revisions.

Well, as I say, I'm not convinced it makes sense to integrate this
with the existing pack stuff precisely because I don't think it's
universally useful. So you seem to end up with all the usual tricks,
eg, Golomb coding the inverted indexes, etc, _if_ you treat each blob
as completely independent (the first sketch at the end of this mail
shows what I mean by that). I was wondering if there was anything else
you can do, given the special structure, that might be both more
useful and more compact?

> You would full text index the expanded source text for each revision,
> not the delta. There are forms of full text indexes that record the
> words position in the document. They let you search for "vision NEAR
> feedback"

Well, the kind of question I was thinking of was "clearly you can use
the existing sort of full text indexing (eg, the stuff covered in
Witten, Moffat & Bell's Managing Gigabytes), but is that the most
useful way of doing things in the context of an evolving database?"
If you treat every blob as essentially a different document there are
indexing tools out there already that you can use (the second sketch
below shows the positional-index idea behind NEAR queries). What I was
wondering was whether it's really that useful to a human user to
report every revision of a document containing those keywords, even
when the differences between revisions are in parts of the text far
removed from the text containing the keywords (the third sketch below
is one crude way of collapsing such hits). I don't know the answer.

> > (One question is "why do you want to build a table rather than
> > actively search the full git repo for each query (ie, combination of
> > words) as you make it?" My primary motivation is that I might in the
> > future like to do queries on some sort of low processor power
> > UMPC-type thing, having built the file containing a "full text index"
> > data structure for the index on a quite beefy desktop. The other point
> > is that when searching natural language text from a fallible memory
> > you're more likely to try different combinations of search terms
> > iteratively to try and hit the right one, so there might be some point
> > in trying to build an index.)
>
> I do admit that these indexes are used to make functions that can be
> done with brute force faster. As computers get faster the need for
> these decreases. Right now the size of the kernel repo is not growing
> faster than the progress of hardware. If you went back and tried to do
> these things on a 386 you'd be shouting for indexes tomorrow.

The other point is that direct searching is easier because you know
exactly what the query is at the point you have access to the full
text, whereas when building an index you want to extract no more and
no less information than is needed to answer all the allowed queries.
But I still like the idea of getting a UMPC-type thing if they become
affordable.
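To make "the usual tricks" concrete, here's a minimal sketch of
Golomb-coding the document-id gaps of a postings list. It's Python
rather than C purely for brevity, the b ~= 0.69 * N/f parameter choice
is just the standard rule of thumb from the literature, and all the
names are invented for illustration:

    import math

    def golomb_encode(gaps, b):
        """Encode positive gaps with Golomb parameter b (assumed >= 2),
        returning a string of '0'/'1' chars standing in for a bit-vector."""
        k = math.ceil(math.log2(b))         # truncated-binary code width
        cutoff = (1 << k) - b               # short codes cover [0, cutoff)
        bits = []
        for x in gaps:
            q, r = (x - 1) // b, (x - 1) % b
            bits.append('1' * q + '0')      # unary-coded quotient
            if r < cutoff:                  # remainder in truncated binary
                bits.append(format(r, '0%db' % (k - 1)))
            else:
                bits.append(format(r + cutoff, '0%db' % k))
        return ''.join(bits)

    def gaps(doc_ids):
        """Store differences between sorted doc ids, not the ids."""
        return [b - a for a, b in zip([0] + doc_ids, doc_ids)]

    # a term appearing in 4 documents out of N = 1000
    postings = [3, 7, 8, 20]
    b = round(0.69 * 1000 / len(postings))  # ~ 0.69 * N / f
    print(golomb_encode(gaps(postings), b))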
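And on the positional-index point: the structure that makes "vision
NEAR feedback" answerable without rescanning the text is just
term -> (document -> list of word positions). Again a toy sketch with
made-up names, not anything out of an actual indexing package:

    from collections import defaultdict

    def build_positional_index(docs):
        """Map term -> {doc_id: [word positions]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def near(index, t1, t2, window=5):
        """Doc ids where t1 and t2 occur within `window` words."""
        return {d for d in set(index[t1]) & set(index[t2])
                if any(abs(p - q) <= window
                       for p in index[t1][d] for q in index[t2][d])}

    docs = {1: "machine vision with visual feedback loops",
            2: "feedback on the budget came long after any mention "
               "of the computer vision work"}
    idx = build_positional_index(docs)
    print(near(idx, "vision", "feedback"))  # -> {1}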
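Finally, one crude way of making the "is it useful to report every
matching revision" question concrete: only report a revision when the
set of lines actually containing the keyword changed since the
previous revision. This is purely a thought-experiment sketch (the
name is invented, and "lines containing the keyword" is obviously a
very blunt notion of locality):

    def collapse_revisions(revisions, keyword):
        """revisions: ordered (rev_id, full_text) pairs for one file.
        Report only revisions where the keyword-bearing lines changed,
        rather than every revision whose text happens to match."""
        reported, prev_hits = [], None
        for rev_id, text in revisions:
            hits = frozenset(l for l in text.splitlines() if keyword in l)
            if hits and hits != prev_hits:
                reported.append(rev_id)
            prev_hits = hits
        return reported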
--
cheers, dave tweed__________________________
david.tweed@xxxxxxxxx
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a
Turing-complete compile-time language." -- C++ standardisation committee