Re: Mozilla .git tree

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Tue, 29 Aug 2006 23:27:08 -0400

On 8/29/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> sha1s are effectively 20 byte pointer addresses into the pack. With 2M
> objects you can easily get away with 4 byte address and a mapping
> table. Another idea would be to replace the 20 byte sha1 in tree
> objects with 32b file offsets - requiring that anything the tree
> refers to has to already be in the pack before the tree entry can be
> written.

I've thought of that, but when you transfer a "thin" pack over the
wire the base object may not even be in the pack.  Thus you can't
use an offset to reference it.  Otherwise there's probably little
reason why the base couldn't be referenced by its 4 byte offset
rather than its full 20 byte object ID.  Added up over all deltas
in the mozilla pack it saves a whopping 23 MiB.

Every time an object goes on the wire these 'pack internal'
optimizations need to be undone. If you are sending the whole pack
everything can be sent as is.

These intense compression schemes are meant for archival level data.
Everybody should end up with a copy of the entire archive and that
will be the end of those objects moving on the wire.

From what I was able to gather I don't think Clucene stores the
documents themselves as the tokenized compressed data.  Or if it
does you lose everything between the tokens.  There's a number of
things we want to preserve in the original "document" like whitespace
that would be likely stripped when constructing tokens.

I can't remember if the Clucene code includes the ability to compress
using the dictionary. I had thought that the code was in there but
maybe not. Things that aren't in the dictionary use an escape code and
are copied intact. I have the Lucene book on my desk, I'll flip
through it and see what it says.

Might be worthwhile to poke around on the net and see if you can come
up with an existing dictionary based compressor. There has got to be
one out there, this is a 30 year old concept.

But it shouldn't be that difficult to produce a rough estimate of
what that storage size would be.

--
Jon Smirl
jonsmirl@xxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html