Re: [PATCH] git-pack-objects: cache small deltas between big objects

"Dana How" <danahow@xxxxxxxxx> writes:

> If I simply refuse to insert enormous blobs in the packfiles,  and keep
> them loose,  the performance is better.  More importantly,  my packfiles
> are now sized like everyone else's, so I'm in an operating regime which
> everyone is testing and optimizing.  This was not true with 12GB+ of packfiles.
> Of course, loose objects are slower, but slight extra overhead to access
> something large enough to be noticeable already doesn't bother me.
>
> Finally, loose objects don't get deltified.  This is a problem,  but I would
> need to repack at least every week,  and nonzero window/depth would
> be prohibitive with large objects included.

Here are a few quick comments before going to bed.

 * The objects in the packfile are ordered in "recency" order,
   as "rev-list --objects" feeds you, so it is expected that
   trees and blobs come out mixed.  It might be an interesting
   experiment, especially with a repository without huge blobs,
   to see how much improvement we might get if we keep the
   recency order _but_ emit tags, commits, trees, and then
   blobs, in this order.  In write_pack_file() we have a single
   loop to call write_one(), but we could make it a nested loop
   that writes only objects of each type.

 * Also, my earlier "nodelta" attribute patch would be worth
   trying on your repository with huge blobs, combined with the
   above "group by object type" ordering, with one further
   tweak: write the blobs without the "nodelta" marker first,
   and only then the blobs marked "nodelta".

I suspect the above two should help "git log" and "git log --
pathspec..." performance, as neither of these looks at blobs at
all (pathspec limiting does invoke the diff machinery, but only
at the tree level).

The wish to "have packs with reasonable size as everybody
else" is a reasonable thing to want, but it does not have as
much technical meaning as the other issues do, and by itself it
is not something we can _measure_ to judge pros and cons.  With
the above experiment, though, you could come up with three sets
of packs such that all three use "nodelta" to leave the huge
blobs undeltified and use the default window and depth for
everything else, and:

 (1) One set has trees and blobs mixed;

 (2) Another set has trees and blobs grouped, but "nodelta" blobs
     and others are not separated;

 (3) The third set has trees and blobs grouped, and "nodelta"
     blobs and others are separated.

Comparing (1) and (2) would show how bad it is to have huge
blobs in between trees (which are presumably accessed more
often).  I suspect that comparing (2) and (3) would show that
for most workloads, the split is not worth it.

And compare (3) with another case where you leave "nodelta"
blobs loose.  That's the true comparison that would demonstrate
why placing huge blobs in packs is bad and they should be left
loose.  I'm skeptical that there will be significant
differences, though.
