Re: what's a useful definition of full text index on a repository?

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Tue, 2 Oct 2007 11:48:53 -0400

On 10/2/07, David Tweed <david.tweed@xxxxxxxxx> wrote:
> > Full text indexing can also achieve high levels of compression as
> > stated in the earlier threads. It is full scale dictionary
> > compression. When it is being used for compression you want to apply
> > it to all revisions.
>
> Well, as I say I'm not convinced it makes sense to integrate this with
> existing pack stuff precisely because I don't think it's universally
> useful. So you seem to end up with all the usual tricks, eg, Golomb
> coding inverted indexes, etc, _if_ you treat each blob as completely
> independent. I was wondering if there was anything else you can do
> given the special structure that might be both more useful and more
> compact?

Dictionary compression can be used without full-text indexes. It is
just really easy to build the full-text index if the data is already
dictionary compressed. Dictionary compression works for everything
except binary or random data.

Git is already using a small scale dictionary compressor via zip. I
suspect doing a full scale dictionary for a pack file and then using
arithmetic encoding of the tokens would provide substantially more
compression. The big win is having a single dictionary instead of a
new dictionary each time zip is used.

When we were working on Mozilla, Mozilla changed licenses three times.
The license text ended up taking about 30MB in the current scheme.
With full dictionary compression this would reduce down to a few kb.

More compression is good for git. It means we can keep more data in
RAM and reduce download times. With current hardware it is almost
always better to trade off CPU to reduce IO.

-- 
Jon Smirl
jonsmirl@xxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html