On Sat, 1 Dec 2007, Mike Hommey wrote:
>
> While playing around with git-pack-objects, it seemed to me that the
> input it can take is not a simple list of object SHA1s.

Well, it *can* take a simple list of object SHA1's.

But yes, the preferred format is a list of "SHA1 <basename>", where the
basename is used as part of the heuristics on what other objects to try
doing a delta against.

But if you give no basename, that heuristic just won't have the name
hint, and things will still *work*, it's just more likely (but not
certain) that the resulting packfile will be larger.

> Could someone knowing the delta calculation internals enlighten me ?

The delta calculations simply create a small hash based on the basename,
and use that to clump blobs/trees with the same basename together.
That's *usually* a huge win in terms of finding good deltas, since the
most likely delta is for a previous version of the same file (or tree!)
and since we don't try to find deltas against *all* other blobs, but
just use a sliding window, having good delta candidates close to each
other is going to help a lot.

Without the basename information, the delta list will just be sorted by
type and size, which works fine, but generally finds fewer deltas.

But it's all a heuristic, and it can go both ways. If you have lots of
renames (which aren't just cross-directory ones, but actually change the
basename), then the basename information may actually hurt.

(Btw: the hash we generate is on purpose not a very good one. It
actually thinks that the last characters are "more important", so it
tends to hash files that end in the same few characters together. So
*.c files clump together etc. At least that's the intent).

See builtin-pack-objects.c:

 - type_size_sort(): this is the rule for sorting objects for deltaing.
   Type is most important (ie we always sort commits, trees, blobs
   separately and clump them together and effectively delta them only
   against objects of the same type).

   Then comes the basename hash (so that we sort objects with the same
   name together, and *.c files closer to each other than to *.h files,
   for example).

   Then comes the preferred_base (so that we sort things that already
   have specific delta bases together), and then the size (so that we
   sort files that are similar in size).

   And finally, if everything else is equal (the size will generally be
   identical for tree objects of the same directory with no new files
   but just SHA1 changes, for example) we sort by the order they were
   found in the history ("recency") by just comparing the pointer
   itself, since the original thing will be just one big array filled
   in by order of objects.

 - find_deltas() - this is the actual thing that does the "look through
   the object window and try to find good deltas", which operates on
   the array that was created by the type_size_sort.

Hope that clarified something.

		Linus
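
For illustration, the sort rule and the sliding window described above can be sketched roughly like this. This is a toy model in Python, not the code from builtin-pack-objects.c: the names Obj, TYPE_RANK, toy_name_hash, sort_for_delta and window_candidates are all made up for the example, the hash only imitates the "last characters count more" property, and the direction of each comparison (ascending here) is an assumption.

```python
from dataclasses import dataclass

# Assumed ranking; only the fact that types are grouped matters here.
TYPE_RANK = {"commit": 0, "tree": 1, "blob": 2}


@dataclass
class Obj:
    sha1: str                     # placeholder id, not a real SHA1
    type: str                     # "commit", "tree" or "blob"
    name: str = ""                # basename hint; empty if none was given
    size: int = 0
    preferred_base: bool = False


def toy_name_hash(name: str) -> int:
    """Hypothetical 32-bit hash in which later characters get more weight."""
    h = 0
    for ch in name:
        if ch.isspace():
            continue
        # Older characters are shifted down, the newest lands in the top byte.
        h = ((h >> 2) + ((ord(ch) & 0xFF) << 24)) & 0xFFFFFFFF
    return h


def sort_for_delta(objs):
    """Order by type, name hash, preferred_base, size, then input order."""
    def key(item):
        index, o = item
        return (TYPE_RANK[o.type], toy_name_hash(o.name),
                o.preferred_base, o.size, index)
    return [o for _, o in sorted(enumerate(objs), key=key)]


def window_candidates(ordered, i, window=10):
    """Same-type objects among the `window` entries just before position i."""
    start = max(0, i - window)
    return [o for o in ordered[start:i] if o.type == ordered[i].type]
```

With no basename given, every object gets the same (zero) hash in this model, so the order falls back to type and size, which matches the "still works, probably bigger pack" behaviour described above.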