A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
> Just looking at the structures in non-BLOBS, I see a lot of potential
> for the use of a set dictionaries when deflating TREEs and another set
> of dictionaries when deflating COMMITs and TAGs. The low hanging fruit
> is to create dictionaries of the most referenced IDs across all TREE or
> COMMIT/TAG objects.

The most referenced IDs should be getting reused through deltas.
That is, IDs which are highly referenced are probably referenced in
the same tree over many versions of that tree. Since the data isn't
changing it should be getting copied by a delta copy command rather
than appearing as a literal.

The Mozilla pack appears to have the bulk of its storage taken up by
blobs (both bases and deltas). I suspect this is because the bases
compress to approx. 50% of their original size but share a lot of
common tokens. Those common tokens are being repeated in every
private zlib dictionary. The same thing happens in a delta, except
here we are probably copying a lot from the base so the average size
is greatly reduced but we are still repeating tokens in the zlib
dictionary for anything that is a literal in the delta (as it didn't
appear in the base).

A large dictionary containing all tokens for the project should
greatly reduce the size of each blob, base and delta alike. It also
lends itself to creating an efficient full text index.

--
Shawn.
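
As a rough illustration of the shared-dictionary idea above, here is a
minimal Python sketch using zlib's preset dictionary support (the zdict
argument). The sample blobs, the SHARED_TOKENS dictionary and the helper
names deflate/inflate are made up for this example. It is not how Git
writes packs, only a way to see the effect of not repeating common tokens
in every stream.

```python
import zlib

# Hypothetical sample data: two small "blobs" that share most of their tokens.
BLOBS = [
    b"#include <stdio.h>\nstatic int nsresult_ok(void) { return 0; }\n",
    b"#include <stdio.h>\nstatic int nsresult_fail(void) { return 1; }\n",
]

# Hypothetical project-wide dictionary holding the tokens common to the blobs.
SHARED_TOKENS = b"#include <stdio.h>\nstatic int nsresult_(void) { return ; }\n"


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate data, optionally priming zlib with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Inflate data; the same preset dictionary must be supplied again."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


# Each blob compressed on its own: common tokens are paid for in every stream.
private_total = sum(len(deflate(blob)) for blob in BLOBS)

# Each blob compressed against the shared dictionary: common tokens become
# back-references into the dictionary instead of literals.
shared_total = sum(len(deflate(blob, SHARED_TOKENS)) for blob in BLOBS)

print("private dictionaries:", private_total, "bytes")
print("shared dictionary:   ", shared_total, "bytes")
```

One caveat with this sketch: deflate only looks back 32 KiB, so a zlib
preset dictionary can't usefully be larger than that. A dictionary of all
tokens in a large project would need some other encoding on top.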