Basically a "blue sky" question about full-text indexing git repositories. A while back, whilst talking about overall git structure (see http://marc.info/?l=git&m=118891945402778&w=2 ), Jon Smirl raised the question of putting a full-text index on a repository. I doubt I full text index is of much use on a code repository because the question tends to be focussed around either released versions or immediate git-blame stuff. However, for repositories of things like evolving documents/presentations/notes where content is deleted because it doesn't fit the context rather than being superceded by better stuff, it might be more natural to search for "where's that paragraph I wrote on 'human' 'vision' and 'kinematic' 'feedback' ?". So I got to thinking about experimenting with a full text index and even started writing some code. However, I then realised that it's not obvious what the most useful definition of a full text index is on evolving files. (To be clear, I'm _not_ thinking about changing the database fundamentals as discussed in the referenced thread and indeed would put the full-text index into a different file that just references the existing git db stuff precisely because I doubt the text index will be of use to most people.) A "classical" full-text index seems inappropriate because, if I've got a long text document that in a blob in commit n1 uses word 'x' in one section and the corresponding file in descendant commit n2 has the same text using word 'x' but has changes to a different section of the document, there's probably no point showing me both documents (and I can always track through the history once I've got one (commit-id,file pair)). So in the case of a single word, a "useful" definition would be the entry for word w in the full-text index should consist of those (commit-id,file) pairs whose diff with their parent contains an addition of text containing word w. (This will catch the creation of the text containing w and then precisely those files which are close modifications of it but ignore changes to other areas.) This seems to make sense for a single word. Let's call this the "appearance diff" definition of a full text index. However, things become unclear if you consider a query with multiple words. Consider the simplest case of a linear history where commit n0 adds word "vision" to file p1.tex (with respect to its single parent), there are some intermediate commits and then commit n7 adds word "feedback" to p1.tex. Then there's no commit whose diff with its single parent contains both words "vision" and "feedback". In the linear history case you could imagine trying to find the first commit which is a child of _all_ the commits under the "appearance diff" definition. However, that clearly doesn't "obviously" extend to general full DAG histories, and in any case it's probably not fully correct even in the linear case. So maybe a different definition would be better. So I'm just throwing the question out to the list in case anyone has any better ideas for what a full-text index on an evolving set of files ought to be. (One question is "why do you want to build a table rather than actively search the full git repo for each query (ie, combination of words) as you make it?" My primary motivation is that I might in the future like to do queries on some sort of low processor power UMPC-type thing, having built the file containing a "full text index" data structure for the index on a quite beefy desktop. The other point is that searching natural language text based on a fallible memory you're more likely to try different combinations of search terms iteratively to try and hit the right one, so there might be some point in trying to build an index.) Anyway, it's currently an idle speculation, -- cheers, dave tweed__________________________ david.tweed@xxxxxxxxx Rm 124, School of Systems Engineering, University of Reading. "we had no idea that when we added templates we were adding a Turing- complete compile-time language." -- C++ standardisation committee - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html