On 9/6/06, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
> Jon Smirl wrote:
> > On 9/6/06, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
> > > TREE objects do not delta or deflate well.
> >
> > I can understand why they don't deflate, the path names are pretty
> > much unique and the sha1s are incompressible. But why don't they
> > delta well? Does sorting them by size mess up the delta process?
>
> My guess would be the TREEs would only delta well against other TREE
> versions for the same path.
I would have thought that the TREE delta code already takes this into
account. Is there enough information around to match up all of the
TREEs for a specific path into a delta chain?

The repack code could build a model of the tree as it is repacking;
that is what fast-import does. If you have a model of the tree, then
when you change a TREE node you can track the last sha1 that
corresponded to that directory path. Now you know what to delta
against.
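A minimal sketch of that bookkeeping, in Python for illustration only;
none of these names are git's actual repack API:

last_tree_for_path = {}  # directory path -> sha1 of last tree written there

def pick_tree_delta_base(path, new_sha1):
    """Return the sha1 to delta the new tree against, or None."""
    base = last_tree_for_path.get(path)  # previous tree version at this path
    last_tree_for_path[path] = new_sha1  # remember it for the next version
    return base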
> > Shawn is doing some prototype work on true dictionary based
> > compression. I don't know how far along he is, but it has potential
> > for taking 30% off the Mozilla pack.
>
> Just looking at the structures in non-BLOBs, I see a lot of potential
> for using one set of dictionaries when deflating TREEs and another set
> of dictionaries when deflating COMMITs and TAGs. The low hanging fruit
> is to create dictionaries of the most referenced IDs across all TREE
> or COMMIT/TAG objects.
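For concreteness, the preset-dictionary mechanics described above look
roughly like this toy sketch using Python's zlib; the "hot" IDs are
invented stand-ins for the most referenced sha1s:

import zlib

hot_ids = [b"\xaa" * 20, b"\xbb" * 20, b"\xcc" * 20]
preset = b"".join(hot_ids)  # shared dictionary of hot sha1s

tree_payload = (b"100644 README\x00" + hot_ids[0] +
                b"40000 src\x00" + hot_ids[1])

# Compress with the shared dictionary...
c = zlib.compressobj(zdict=preset)
packed = c.compress(tree_payload) + c.flush()

# ...and supply the same dictionary when inflating on the other side.
d = zlib.decompressobj(zdict=preset)
assert d.decompress(packed) == tree_payload

The win comes from back-references into the dictionary: even the first
occurrence of a hot sha1 inside an object then costs only a few bits.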
Doing this will get 4-7% according to our small tests; to get more you
need a different type of compression algorithm.

For true dictionary based compression you parse all of the input
documents into words. The words are put into a single giant Huffman
table, and the Huffman codes for the words are concatenated together to
form the compressed document. This is one of the highest-ratio
compression methods there is, since it takes into account that you are
compressing something human readable. gzip is sort of like this, but
its dictionary is limited to a 32KB window.

Search engines use true dictionary compression since it is easy to add
search indices on top of the compressed data. For a search index you
make a bit vector for each word in the dictionary and set a bit if the
corresponding document contains the word. The vectors and the
dictionary are stored compressed. There are algorithms for doing
AND/OR directly on the compressed vectors. The AND/OR stage identifies
likely hits; you retrieve those and then do an exact match on the
search terms to be sure.
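A toy version of the word/Huffman scheme, just to make the idea
concrete (plain Python; this has nothing to do with Shawn's actual
prototype):

import heapq
from collections import Counter

def huffman_codes(freq):
    # Standard Huffman construction over word frequencies; the integer
    # tiebreaker keeps the heap from ever comparing tree nodes.
    heap = [(n, i, w) for i, (w, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, i, (a, b)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a dictionary word
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

docs = ["tree objects delta badly", "tree objects deflate badly"]
codes = huffman_codes(Counter(w for d in docs for w in d.split()))

# A compressed document is just the concatenated codes of its words;
# every document shares the one global dictionary.
packed = ["".join(codes[w] for w in d.split()) for d in docs]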
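And the bit-vector index on top of the same dictionary looks something
like this, again only a toy sketch:

docs = ["tree objects delta badly",
        "commit objects deflate well",
        "tree objects deflate badly"]

# One bit vector per dictionary word; bit d is set when document d
# contains the word. (A real engine stores these vectors compressed
# and does the AND/OR without fully decompressing them.)
vectors = {}
for d, text in enumerate(docs):
    for w in set(text.split()):
        vectors[w] = vectors.get(w, 0) | (1 << d)

def candidates(*terms):
    # AND the per-term vectors to get likely hits; you then do an
    # exact match on the retrieved documents to be sure.
    v = ~0
    for t in terms:
        v &= vectors.get(t, 0)
    return [d for d in range(len(docs)) if v >> d & 1]

print(candidates("tree", "deflate"))   # -> [2]

--
Jon Smirl
jonsmirl@xxxxxxxxx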