Re: git pack-objects input list

On Sat, 1 Dec 2007, Mike Hommey wrote:
> 
> While playing around with git-pack-objects, it seemed to me that the
> input it can take is not a simple list of object SHA1s.

Well, it *can* take a simple list of object SHA1's. But yes, the preferred 
format is a list of "SHA1 <basename>", where the basename is used as part 
of the heuristics on what other objects to try doing a delta against.

But if you give no basename, that heuristic just won't have the name hint, 
and things will still *work*, it's just more likely (but not certain) that 
the resulting packfile will be larger.
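
To make the two accepted line shapes concrete, here is a minimal sketch of 
splitting one input line into its SHA1 and optional name hint. The helper 
parse_input_line() is hypothetical and written only for illustration; it is 
not how git itself reads the list.

```python
def parse_input_line(line):
    """Split one input line into (sha1, name).

    Hypothetical helper: `name` is None when the line carries only a SHA1,
    which corresponds to the "no name hint" case described above.
    """
    parts = line.rstrip("\n").split(" ", 1)
    sha1 = parts[0]
    # Anything after the first space is treated as the name hint.
    name = parts[1] if len(parts) > 1 and parts[1] else None
    return sha1, name
```

A line with only a SHA1 yields a name of None, so any later step that wants 
the hint has to cope with it being absent.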

> Could someone knowing the delta calculation internals enlighten me ?

The delta calculations simply create a small hash based on the basename, 
and use that to clump blobs/trees with the same basename together. That's 
*usually* a huge win in terms of finding good deltas, since the most 
likely delta is for a previous version of the same file (or tree!) and 
since we don't try to find deltas against *all* other blobs, but just use 
a sliding window, having good delta candidates close to each other is 
going to help a lot.
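
As a rough sketch of why ordering matters with a sliding window: each object 
below is only compared against the few objects just before it in the list. 
The function choose_bases(), its window parameter and the use of Python's 
difflib as the similarity measure are all assumptions made for the example; 
the real delta code is different.

```python
from difflib import SequenceMatcher


def choose_bases(contents, window=10):
    """For each item, pick the most similar of the `window` preceding items.

    Returns a list of base indices; an entry is None when nothing in the
    window shares any content with the item. The similarity score is a
    stand-in for "how good a delta would be", not git's real measure.
    """
    bases = []
    for i, data in enumerate(contents):
        best, best_score = None, 0.0
        # Only look back `window` entries, never at the whole list.
        for j in range(max(0, i - window), i):
            score = SequenceMatcher(None, contents[j], data).ratio()
            if score > best_score:
                best, best_score = j, score
        bases.append(best)
    return bases
```

If two versions of the same file end up far apart in the list, a small window 
never puts them side by side, which is why clumping likely candidates 
together pays off.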

Without the basename information, the delta list will just be sorted by 
type and size, which works fine, but generally finds fewer deltas.

But it's all a heuristic, and it can go both ways. If you have lots of 
renames (which aren't just cross-directory ones, but actually change the 
basename), then the basename information may actually hurt.

(Btw: the hash we generate is on purpose not a very good one. It actually 
thinks that the last characters are "more important", so it tends to hash 
files that end in the same few characters together. So *.c files clump 
together etc. At least that's the intent).
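
For illustration, here is a toy hash with that property: every new character 
lands in the top bits and everything seen earlier is shifted down, so the tail 
of the name dominates the result. The shift amounts and the 32-bit width are 
assumptions chosen for the sketch, so don't read this as the exact function in 
builtin-pack-objects.c.

```python
def name_hash(name):
    """Toy 32-bit hash in which later characters carry more weight.

    Illustrative only: the constants are assumptions, not taken from git.
    """
    h = 0
    for ch in name.encode():
        # Older characters decay by two bits; the newest takes the top byte.
        h = ((h >> 2) + (ch << 24)) & 0xFFFFFFFF
    return h


names = ["main.c", "util.h", "diff.c", "cache.h"]
by_hash = sorted(names, key=name_hash)
```

With these four names the two .c files sort next to each other, and so do the 
two .h files. Because earlier characters still contribute a little, that 
grouping is a tendency and not a guarantee.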

See builtin-pack-objects.c:
 - type_size_sort(): this is the rule for sorting objects for deltaing. 

   Type is most important (ie we always sort commits, trees, blobs 
   separately and clump them together and effectively delta them only 
   against objects of the same type)

   Then comes the basename hash (so that we sort objects with the same 
   name together, and *.c files closer to each other than to *.h files, 
   for example).

   Then comes the preferred_base (so that we sort things that already have 
   specific delta bases together), and then the size (so that we sort
   files that are similar in size).

   And finally, if everything else is equal (the size will generally be 
   identical for tree objects of the same directory with no new files but 
   just SHA1 changes, for example) we sort by the order they were found in 
   the history ("recency") by just comparing the pointer itself, since 
   the original thing will be just one big array filled in by order of 
   objects.

 - find_deltas() - this is the actual thing that does the "look through 
   the object window and try to find good deltas", which operates on the 
   array that was created by the type_size_sort.
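
The sort order described above can be sketched as a tuple key. The Entry 
record, its field names and the integer "order" field standing in for the 
pointer comparison are all hypothetical, and the relative order of the types 
and the direction of the recency tie-break are arbitrary choices here.

```python
from collections import namedtuple

# Hypothetical record; `order` is the position at which the object was found.
Entry = namedtuple("Entry", "type name_hash preferred_base size order")


def sort_for_deltas(entries):
    """Sort by type, then name hash, then preferred_base, then size,
    and finally by discovery order as the tie-break."""
    return sorted(
        entries,
        key=lambda e: (e.type, e.name_hash, e.preferred_base, e.size, e.order),
    )


entries = [
    Entry("blob", 2, False, 100, 0),
    Entry("tree", 1, False, 50, 1),
    Entry("blob", 1, False, 300, 2),
    Entry("blob", 1, False, 200, 3),
    Entry("blob", 1, True, 10, 4),
]
ordered = sort_for_deltas(entries)
```

A window-based search would then walk the sorted list in order, so objects of 
the same type with the same name hash and similar size end up as neighbours.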

Hope that clarified something.

		Linus