Re: Mozilla .git tree

Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> SHA-1s are effectively 20-byte pointer addresses into the pack. With 2M
> objects you can easily get away with a 4-byte address and a mapping
> table. Another idea would be to replace the 20-byte SHA-1 in tree
> objects with 32-bit file offsets, requiring that anything the tree
> refers to has to already be in the pack before the tree entry can be
> written.

I've thought of that, but when you transfer a "thin" pack over the
wire the base object may not even be in the pack.  Thus you can't
use an offset to reference it.  Otherwise there's probably little
reason why the base couldn't be referenced by its 4-byte offset
rather than its full 20-byte object ID.  Added up over all deltas
in the mozilla pack, that saves a whopping 23 MiB.
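
To put numbers on it: at 16 bytes saved per delta, 23 MiB works out
to roughly 1.5 million deltas in that pack.  A minimal sketch of the
two encodings, purely for illustration (the 4-byte offset form is
hypothetical; the pack format stores the full 20-byte object ID):

    /* Sketch only: contrasts the two ways a delta could name its
     * base object.  write_base_by_offset() is the hypothetical
     * 4-byte form; it cannot work for a "thin" pack whose base is
     * not in the pack at all. */
    #include <stdint.h>
    #include <string.h>

    size_t write_base_by_sha1(unsigned char *out,
                              const unsigned char sha1[20])
    {
        memcpy(out, sha1, 20);   /* current: full object ID */
        return 20;
    }

    size_t write_base_by_offset(unsigned char *out, uint32_t ofs)
    {
        out[0] = ofs >> 24;      /* hypothetical: big-endian    */
        out[1] = ofs >> 16;      /* offset of the base within   */
        out[2] = ofs >> 8;       /* this same pack file         */
        out[3] = ofs;
        return 4;
    }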

> The Mozilla license has changed at least five times. That makes 250K
> copies of licenses.

Cute.
 
> I suspect a tree-specific zlib dictionary will be a good win. But
> those trees contain a lot of incompressible data: the SHA-1s. Those
> SHA-1s are in binary, not hex, right?

Yup, binary.
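
For reference, each tree entry is "<octal mode> <name>\0" followed
by the 20 raw SHA-1 bytes, so close to half of a typical entry is
effectively random data.  A minimal parsing sketch, assuming that
layout:

    /* Walks one tree-object entry: "<mode> <name>\0" + 20 raw
     * SHA-1 bytes.  Returns a pointer past the entry, or NULL if
     * the buffer is truncated. */
    #include <stdio.h>
    #include <string.h>

    const unsigned char *parse_tree_entry(const unsigned char *p,
                                          const unsigned char *end)
    {
        const unsigned char *nul = memchr(p, '\0', end - p);
        if (!nul || end - nul <= 20)
            return NULL;                         /* truncated */
        printf("entry: %s\n", (const char *)p);  /* "mode name" */
        return nul + 1 + 20;                     /* skip NUL + SHA-1 */
    }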
 
> The git tools can be modified to set the compression level to 0 before
> compressing tree deltas. There is no need to change the decoding code.
> Even with compression level 0 they still get slightly larger because
> zlib tacks on a header.

See my follow-up email to myself; I think we're talking about a zlib
overhead of 9.2 bytes on average per tree delta.  That's with a
compression level of -1 (the default, which maps to level 6).
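
If you want to reproduce the measurement, zlib's one-shot API makes
it a few lines.  This is just a probe of the framing cost (2-byte
header plus 4-byte Adler-32 trailer, and 5 bytes of stored-block
framing at level 0), not anything git itself does:

    /* Deflate a tiny buffer at level 0 and at the default level
     * and compare sizes; the growth is pure zlib framing. */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        const unsigned char in[] = "tree delta payload";
        unsigned char out[128];
        int levels[] = { Z_NO_COMPRESSION, Z_DEFAULT_COMPRESSION };

        for (int i = 0; i < 2; i++) {
            uLongf outlen = sizeof(out);
            if (compress2(out, &outlen, in, sizeof(in) - 1,
                          levels[i]) != Z_OK)
                return 1;
            printf("level %2d: %u bytes in, %lu bytes out\n",
                   levels[i], (unsigned)(sizeof(in) - 1),
                   (unsigned long)outlen);
        }
        return 0;
    }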
 
> I'm still interested in getting an idea of how much a Clucene-type
> dictionary compression would help. It is hard to see how you can get
> smaller than that method. Note that you don't want to include the
> indexing portion of Clucene in the comparison, just the part where
> everything gets tokenized into a big dictionary, arithmetic-encoded
> based on usage frequency, and then the strings in the original
> documents are replaced with the codes. You want to do the diffs before
> replacing everything with codes. Encoding this way is a two-pass
> process, so it is easiest to work from an existing pack.

From what I was able to gather, I don't think Clucene stores the
documents themselves as the tokenized, compressed data.  Or if it
does, you lose everything between the tokens.  There are a number of
things we want to preserve in the original "document", like whitespace,
that would likely be stripped when constructing tokens.

But it shouldn't be that difficult to produce a rough estimate of
what that storage size would be.
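
Something along these lines could give a crude lower bound.  This is
not Clucene's pipeline, just a back-of-the-envelope sketch assuming
whitespace tokenization and Shannon entropy as the floor for any
frequency-based code:

    /* Toy estimator: intern whitespace-separated tokens, count
     * frequencies, and sum -log2(p) over the token stream as a
     * lower bound on the coded size.  Fixed-size table for brevity;
     * a real pass would hash into a growable one. */
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_TOKENS 4096

    static char   toks[MAX_TOKENS][64];
    static double freq[MAX_TOKENS];
    static int    ntoks;

    static int intern(const char *t)
    {
        for (int i = 0; i < ntoks; i++)
            if (!strcmp(toks[i], t))
                return i;
        if (ntoks == MAX_TOKENS)
            return 0;              /* toy: overflow folds to slot 0 */
        snprintf(toks[ntoks], sizeof(toks[0]), "%s", t);
        return ntoks++;
    }

    int main(void)
    {
        char buf[] = "the quick fox and the lazy fox";
        double total = 0, bits = 0;

        for (char *t = strtok(buf, " "); t; t = strtok(NULL, " ")) {
            freq[intern(t)] += 1;
            total += 1;
        }
        for (int i = 0; i < ntoks; i++)
            bits += freq[i] * -log2(freq[i] / total);
        printf("~%.1f bits for %.0f tokens\n", bits, total);
        return 0;
    }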
 
-- 
Shawn.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
