Linus Torvalds <torvalds@xxxxxxxx> wrote:
>
> On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> >
> > Good job indeed.  Oh and you probably should not bother trying to
> > deltify commit objects at all since that would be a waste of time.
>
> It might not necessarily always be a waste of time. Especially if you have
> multiple branches tracking a "maintenance" branch, you often end up having
> the same commit message repeated several times in "unrelated" commits
> (they're really the same commit, applied to another branch).
>
> Also, I could imagine that some automated system generates very verbose
> (and possibly very regular) commit messages, so under certain
> circumstances it may well make sense to see if the commits might delta
> against each other.
>
> But I'll agree that in normal use it's not likely to be a huge saving,
> though. It's probably not worth doing for the fast importer unless it just
> happens to fall out of the code very easily.

Does git-pack-objects attempt to delta commits against each other?

I've been thinking about applying a pack-local but zlib-stream global
dictionary.  If we added three global dictionaries to the front of
the pack file, one for commits, one for trees and one for blobs,
and used those as the global dictionaries for the zlib streams stored
within that pack, we could probably get good space savings for
trees and commits.

I'd suspect that for many projects the commit global dictionary
would contain the common required strings such as: 'tree ', 'parent ',
'committer ', 'author ', 'Signed-off-by: ' plus the top
author/committer name/email combination strings.  For GIT I'd expect
'Junio C Hamano <junkio@xxxxxxx>' to be way up there in terms of
frequency within commit objects.  Finding the most common author
and committer strings would be trivial, as would finding the most
common 'footer' strings such as 'Signed-off-by: ' and 'Acked-by: '.

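To make the commit dictionary idea concrete, here is a rough sketch
using the preset-dictionary support in Python's zlib module.  The
dictionary contents, the sample commit and the helper names deflate()
and inflate() are all made up for illustration; nothing like this
exists in the pack format today.

```python
import zlib

# Hypothetical commit dictionary.  zlib prefers matches close to the end
# of the dictionary, so the most frequent strings are placed last.
COMMIT_DICT = (
    b"Acked-by: "
    b"Signed-off-by: "
    b"Junio C Hamano <junkio@xxxxxxx>"
    b"\nauthor "
    b"\ncommitter "
    b"\nparent "
    b"tree "
)

# Made-up commit object body; the hashes and timestamps are placeholders.
SAMPLE_COMMIT = (
    b"tree 0123456789abcdef0123456789abcdef01234567\n"
    b"parent 89abcdef0123456789abcdef0123456789abcdef\n"
    b"author Junio C Hamano <junkio@xxxxxxx> 1156800000 -0700\n"
    b"committer Junio C Hamano <junkio@xxxxxxx> 1156800000 -0700\n"
    b"\n"
    b"Example commit message\n"
    b"\n"
    b"Signed-off-by: Junio C Hamano <junkio@xxxxxxx>\n"
)


def deflate(data, zdict=None):
    """Compress data, optionally priming zlib with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def inflate(data, zdict=None):
    """Decompress data; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


if __name__ == "__main__":
    plain = deflate(SAMPLE_COMMIT)
    primed = deflate(SAMPLE_COMMIT, COMMIT_DICT)
    print("without dictionary:", len(plain), "bytes")
    print("with dictionary:   ", len(primed), "bytes")
```

The stream produced with the dictionary can only be inflated by a
reader holding the same dictionary, which is why the dictionary would
have to be stored in the pack itself.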

I think the same is true of trees, with '100644 ', '100755 ', '40000 '
being way up there, but also file names that commonly appear within
trees, e.g. "Makefile.in\0".

Blobs would be more difficult to generate a reasonable global
dictionary for.  But for some projects a crude estimated dictionary
can shave off at least 4% of pack size (true in both GIT and Mozilla
sources it seems).

Of course the major problem with pack-local, stream-global
dictionaries is that they void the ability to reuse that zlib'd
content from that pack in another pack without wholesale copying the
dictionary as well.  This is an issue for servers which want to copy
out the pack entry without recompressing it but also want the storage
savings from the global dictionaries.

But then again, if we just delta against a commit which uses the
same author and committer, or against the same tree but a different
version, then there should be a lot of delta copying from the
base... which easily allows entry reuse and should provide similar
space savings, provided the delta depth is deep enough (or the delta
graph is wide enough) to minimize the number of base objects
containing repeated occurrences of the common strings.

-- 
Shawn.