Re: [PATCH 09/16] documentation: add documentation for the bitmap format

On Tue, Jun 25, 2013 at 09:33:11PM +0200, Vicent Martí wrote:

> > One way we side-stepped the size inflation problem in JGit was to only
> > use the bitmap index information when sending data on the wire to a
> > client. Here delta reuse plays a significant factor in building the
> > pack, and we don't have to be as accurate on matching deltas. During
> > the equivalent of `git repack` bitmaps are not used, allowing the
> > traditional graph enumeration algorithm to generate path hash
> > information.
> 
> OH BOY HERE WE GO. This is worth its own thread, lots to discuss here.
> I think peff will have a patchset regarding this to upstream soon,
> we'll get back to it later.

We do the same thing (only use bitmaps during on-the-wire fetches).  But
there are a few problems with assuming delta reuse.

For us (GitHub), the foremost one is that we pack many "forks" of a
repository together into a single packfile. That means when you clone
torvalds/linux, an object you want may be stored in the on-disk pack
with a delta against an object that you are not going to get. So we have
to throw out that delta and find a new one.

I'm dealing with that by adding an option to respect "islands" during
packing, where an island is a set of common objects (we split it by
fork, since we expect those objects to be fetched together, but you
could use other criteria). The rule is that an object cannot delta
against another object that is not in all of its islands. So everybody
can delta against shared history, but objects in your fork can only
delta against other objects in the fork.  You are guaranteed to be able
to reuse such deltas during a full clone of a fork, and the on-disk pack
size does not suffer all that much (because there is usually a good
alternate delta base within your reachable history).
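
To make the rule concrete, here is a minimal sketch in Python. It is not
the code from the series; the function name and the island labels
("fork-a", "fork-b") are made up for illustration, and each object is
assumed to carry the set of islands it belongs to.

```python
def can_delta_against(target_islands, base_islands):
    """Return True if an object may use a given base under the island rule.

    The base must be a member of every island the target belongs to,
    i.e. the target's island set is a subset of the base's island set.
    """
    return set(target_islands) <= set(base_islands)


# Hypothetical island memberships for three kinds of objects.
shared = {"fork-a", "fork-b"}   # object in history common to both forks
only_a = {"fork-a"}             # object reachable only from fork A
only_b = {"fork-b"}             # object reachable only from fork B

print(can_delta_against(only_a, shared))   # fork object against shared history
print(can_delta_against(only_a, only_a))   # fork object against its own fork
print(can_delta_against(only_a, only_b))   # fork object against another fork
print(can_delta_against(shared, only_a))   # shared object against a fork-only base
```

The first two checks are allowed and the last two are not. The last one
is the case that matters for clones: a shared object stored as a delta
against a fork-only base could not be reused when serving the other fork.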

So with that series, we can get good reuse for clones. But there are
still two cases worth considering:

  1. When you fetch a subset of the commits, git marks only the edges as
     preferred bases, and does not walk the full object graph down to
     the roots. So any object you want that is delta'd against something
     older will not get reused. If you have reachability bitmaps, I
     don't think there is any reason that we cannot use the entire
     object graph (starting at the "have" tips, of course) as preferred
     bases.

  2. The server is not necessarily fully packed. In an active repo, you
     may have a large "base" pack with bitmaps, with several recently
     pushed packs on top. You still need to delta the recently pushed
     objects against the base objects.
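
For case 1, the idea can be sketched roughly as follows. This is a toy
model in Python, not how git implements it: bitmaps are plain integers,
and the names (reachable, reusable_deltas, position, delta_base) are
hypothetical. Tree objects are left out to keep it short.

```python
def reachable(tips, bitmaps):
    """OR together the reachability bitmaps of the given tip commits."""
    bits = 0
    for tip in tips:
        bits |= bitmaps[tip]
    return bits


def reusable_deltas(wants, haves, bitmaps, position, delta_base):
    """List objects to send whose on-disk delta can be reused as-is.

    A stored delta is reusable if its base is either being sent too or is
    anywhere in the history reachable from the client's "have" tips, not
    just on the edge.
    """
    have_bits = reachable(haves, bitmaps)
    send_bits = reachable(wants, bitmaps) & ~have_bits
    usable_bases = have_bits | send_bits
    reusable = []
    for name, pos in position.items():
        if not (send_bits >> pos) & 1:
            continue  # the client already has it, or it was not asked for
        base = delta_base.get(name)
        if base is not None and (usable_bases >> position[base]) & 1:
            reusable.append(name)
    return sorted(reusable)


# Assumed toy repository: commit c2 builds on c1, and blob_new is stored
# on disk as a delta against blob_old, which the client already has.
position = {"c1": 0, "blob_old": 1, "c2": 2, "blob_new": 3}
bitmaps = {"c1": 0b0011, "c2": 0b1111}
delta_base = {"blob_new": "blob_old"}

print(reusable_deltas(["c2"], ["c1"], bitmaps, position, delta_base))
```

In this model blob_new's delta is reusable because blob_old falls inside
the "have" bitmap, however deep in the history it sits.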

I don't have measurements on how much the deltas suffer in those two
cases. I know they suffered quite badly for clones without the name
hashes in our alternates repos, but that part should go away with my
patch series.

-Peff