On 10/1/07, Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> This is what full text is used for with code:
> http://lxr.mozilla.org/
>
> It makes grep instant.

I'd thought that keeping a full-text index of all my program files was
my dirty little secret that shows I'm not a "pro" programmer ;-)

> For source code you can take the full text concept further and store
> parse trees.

[details snipped]

This sounds interesting in principle but is beyond what I'm thinking
of in practice (particularly since I'm not in the "C is the only
language worth ever using" camp).

> Full text indexing can also achieve high levels of compression as
> stated in the earlier threads. It is full scale dictionary
> compression. When it is being used for compression you want to apply
> it to all revisions.

Well, as I say, I'm not convinced it makes sense to integrate this
with the existing pack stuff precisely because I don't think it's
universally useful. So you seem to end up with all the usual tricks,
eg, Golomb coding the inverted indexes, etc, _if_ you treat each blob
as completely independent (the first sketch at the end of this mail
shows what I mean by that). I was wondering if there was anything else
you can do, given the special structure, that might be both more
useful and more compact?

> You would full text index the expanded source text for each revision,
> not the delta. There are forms of full text indexes that record the
> words position in the document. They let you search for "vision NEAR
> feedback"

Well, the kind of question I was thinking of was "clearly you can use
the existing sort of full text indexing (eg, the stuff covered in
Witten, Moffat & Bell's Managing Gigabytes), but is that the most
useful way of doing things in the context of an evolving database?"
If you treat every blob as essentially a different document there are
indexing tools out there already that you can use (the second sketch
below shows the positional-index idea behind NEAR queries). What I was
wondering was whether it's really that useful to a human user to
report every revision of a document containing those keywords, even
when the differences between revisions are in parts of the text far
removed from the text containing the keywords (the third sketch below
is one crude way of collapsing such hits). I don't know the answer.

> > (One question is "why do you want to build a table rather than
> > actively search the full git repo for each query (ie, combination of
> > words) as you make it?" My primary motivation is that I might in the
> > future like to do queries on some sort of low processor power
> > UMPC-type thing, having built the file containing a "full text index"
> > data structure for the index on a quite beefy desktop. The other point
> > is that when searching natural language text from a fallible memory
> > you're more likely to try different combinations of search terms
> > iteratively to try and hit the right one, so there might be some point
> > in trying to build an index.)
>
> I do admit that these indexes are used to make functions that can be
> done with brute force faster. As computers get faster the need for
> these decreases. Right now the size of the kernel repo is not growing
> faster than the progress of hardware. If you went back and tried to do
> these things on a 386 you'd be shouting for indexes tomorrow.

The other point is that direct searching is easier because you know
exactly what the query is at the point you have access to the full
text, whereas when building an index you want to extract no more and
no less information than is needed to answer all the allowed queries.
But I still like the idea of getting a UMPC-type thing if they become
affordable.
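To make "the usual tricks" concrete, here's a minimal sketch of
Golomb-coding the document-id gaps of a postings list. It's Python
rather than C purely for brevity, the b ~= 0.69 * N/f parameter choice
is just the standard rule of thumb from the literature, and all the
names are invented for illustration:

    import math

    def golomb_encode(gaps, b):
        """Encode positive gaps with Golomb parameter b (assumed >= 2),
        returning a string of '0'/'1' chars standing in for a bit-vector."""
        k = math.ceil(math.log2(b))         # truncated-binary code width
        cutoff = (1 << k) - b               # short codes cover [0, cutoff)
        bits = []
        for x in gaps:
            q, r = (x - 1) // b, (x - 1) % b
            bits.append('1' * q + '0')      # unary-coded quotient
            if r < cutoff:                  # remainder in truncated binary
                bits.append(format(r, '0%db' % (k - 1)))
            else:
                bits.append(format(r + cutoff, '0%db' % k))
        return ''.join(bits)

    def gaps(doc_ids):
        """Store differences between sorted doc ids, not the ids."""
        return [b - a for a, b in zip([0] + doc_ids, doc_ids)]

    # a term appearing in 4 documents out of N = 1000
    postings = [3, 7, 8, 20]
    b = round(0.69 * 1000 / len(postings))  # ~ 0.69 * N / f
    print(golomb_encode(gaps(postings), b))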
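And on the positional-index point: the structure that makes "vision
NEAR feedback" answerable without rescanning the text is just
term -> (document -> list of word positions). Again a toy sketch with
made-up names, not anything out of an actual indexing package:

    from collections import defaultdict

    def build_positional_index(docs):
        """Map term -> {doc_id: [word positions]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def near(index, t1, t2, window=5):
        """Doc ids where t1 and t2 occur within `window` words."""
        return {d for d in set(index[t1]) & set(index[t2])
                if any(abs(p - q) <= window
                       for p in index[t1][d] for q in index[t2][d])}

    docs = {1: "machine vision with visual feedback loops",
            2: "feedback on the budget came long after any mention "
               "of the computer vision work"}
    idx = build_positional_index(docs)
    print(near(idx, "vision", "feedback"))  # -> {1}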
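Finally, one crude way of making the "is it useful to report every
matching revision" question concrete: only report a revision when the
set of lines actually containing the keyword changed since the
previous revision. This is purely a thought-experiment sketch (the
name is invented, and "lines containing the keyword" is obviously a
very blunt notion of locality):

    def collapse_revisions(revisions, keyword):
        """revisions: ordered (rev_id, full_text) pairs for one file.
        Report only revisions where the keyword-bearing lines changed,
        rather than every revision whose text happens to match."""
        reported, prev_hits = [], None
        for rev_id, text in revisions:
            hits = frozenset(l for l in text.splitlines() if keyword in l)
            if hits and hits != prev_hits:
                reported.append(rev_id)
            prev_hits = hits
        return reported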
--
cheers, dave tweed__________________________
david.tweed@xxxxxxxxx
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a
Turing-complete compile-time language." -- C++ standardisation committee