Hi,

On Sun, 11 Mar 2007, Jon Smirl wrote:

> As for the part about 'git grep': Shawn and I have been talking off
> and on about experimenting with an inverted index for a packfile
> format. The basic idea is that you tokenize the input and turn a
> source file into a list of tokens. You diff with the list of tokens
> like you would normally do with text. There is a universal dictionary
> for tokens; a token's id is its position in the dictionary.

All in all, this is an interesting idea. However, I see some problems
I'd like to know solutions for:

- How do you determine the optimal length of the tokens? (It is easy
  if you tokenize on the word level, but you suggested that it is more
  efficient to have longer phrases.)

- A search term can be _part of_ a token. In fact, a search term can
  be the suffix of one token, followed by a list of whole tokens, and
  then a prefix of yet another token. It might not be really cheap to
  construct _all_ possible combinations of tokens which make up the
  search term...

- How do you want to cope with regular expressions? (The previous
  problem only addresses simple, constant search terms, i.e. no true
  regular expressions.)

- At the moment, most objects contained in a pack are relatively
  cheaply transported via the pack protocol. IIUC, your new pack
  format would need _exactly_ the same dictionary to be transmitted
  as-is. IOW, how do you want to keep on-the-fly pack generation
  cheap?

Don't get me wrong: I don't want to discourage you. But it is too easy
to optimize for the wrong use cases (I expect a repack, or a fetch, to
happen much more often than a grep). If you can address the
above-mentioned issues, I see no reason why the new pack format should
not be used instead of the current one.

Ciao,
Dscho
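To make the quoted proposal concrete, here is a minimal sketch of the
tokenize-and-intern idea: a universal dictionary assigns each token an
id equal to its position in the dictionary, so a source file becomes a
list of ids that can be diffed like lines of text. All names and the
word-level tokenization rule are illustrative assumptions, not taken
from any actual git code.

```python
import re

class TokenDictionary:
    def __init__(self):
        self.tokens = []   # id -> token string
        self.ids = {}      # token string -> id

    def intern(self, token):
        """Return the id for token, adding it to the dictionary if new.
        The id is simply the token's position in the dictionary."""
        if token not in self.ids:
            self.ids[token] = len(self.tokens)
            self.tokens.append(token)
        return self.ids[token]

def tokenize(source, dictionary):
    """Turn source text into a list of token ids (word-level tokens)."""
    return [dictionary.intern(t) for t in re.findall(r"\w+|[^\w\s]", source)]

d = TokenDictionary()
old = tokenize("int main(void) { return 0; }", d)
new = tokenize("int main(void) { return 1; }", d)
# Diffing the id lists finds the single changed token (0 -> 1).
diff = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
```

Note that the dictionary must be shared between the two versions for
the id lists to be comparable, which is exactly why the transport
question below matters: both sides of a fetch would need the same
dictionary.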
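The second concern, search terms crossing token boundaries, can be
illustrated with a toy check (a hypothetical helper, not real git
code): a term like "ain(vo" is a suffix of "main", then the whole
token "(", then a prefix of "void", so a per-token inverted index
alone cannot answer it without enumerating such combinations.

```python
def term_matches_at(tokens, i, term):
    """Check whether `term` matches starting somewhere inside tokens[i].
    For brevity this sketch concatenates tokens without the whitespace
    a real source file would contain between them."""
    text = "".join(tokens[i:])
    # Try every possible starting offset inside tokens[i]: the term may
    # begin mid-token (i.e. be a suffix of that token).
    for start in range(len(tokens[i])):
        if text[start:start + len(term)] == term:
            return True
    return False

tokens = ["main", "(", "void", ")"]
# "ain(vo" spans three tokens: suffix of "main", all of "(", prefix of "void".
found = term_matches_at(tokens, 0, "ain(vo")
```

A brute-force scan like this defeats the purpose of the index; the
open question in the mail is whether all suffix/whole/prefix
combinations of dictionary tokens can be generated cheaply enough to
keep lookups fast.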