On Tue, Jun 25, 2013 at 11:11 PM, Jeff King <peff@xxxxxxxx> wrote: > On Tue, Jun 25, 2013 at 09:33:11PM +0200, Vicent Martí wrote: > >> > One way we side-stepped the size inflation problem in JGit was to only >> > use the bitmap index information when sending data on the wire to a >> > client. Here delta reuse plays a significant factor in building the >> > pack, and we don't have to be as accurate on matching deltas. During >> > the equivalent of `git repack` bitmaps are not used, allowing the >> > traditional graph enumeration algorithm to generate path hash >> > information. >> >> OH BOY HERE WE GO. This is worth its own thread, lots to discuss here. >> I think peff will have a patchset regarding this to upstream soon, >> we'll get back to it later. > > We do the same thing (only use bitmaps during on-the-wire fetches). But > there a few problems with assuming delta reuse. > > For us (GitHub), the foremost one is that we pack many "forks" of a > repository together into a single packfile. That means when you clone > torvalds/linux, an object you want may be stored in the on-disk pack > with a delta against an object that you are not going to get. So we have > to throw out that delta and find a new one. Gerrit Code Review ran into the same problem a few years ago with the refs/changes namespace. Objects reachable from a branch were often delta compressed against dropped code review revisions, making for some slow transfers. We fixed this by creating a pack of everything reachable from refs/heads/* and then another pack of the other stuff. I would encourage you to do what you suggest... > I'm dealing with that by adding an option to respect "islands" during > packing, where an island is a set of common objects (we split it by > fork, since we expect those objects to be fetched together, but you > could use other criteria). The rule is that an object cannot delta > against another object that is not in all of its islands. So everybody > can delta against shared history, but objects in your fork can only > delta against other objects in the fork. You are guaranteed to be able > to reuse such deltas during a full clone of a fork, and the on-disk pack > size does not suffer all that much (because there is usually a good > alternate delta base within your reachable history). Yes, exactly. I want to do the same thing on our servers, as we have many forks of some popular open source repositories that are also not small (Linux kernel, WebKit). Unfortunately Google has not had the time to develop the necessary support into JGit. > So with that series, we can get good reuse for clones. But there are > still two cases worth considering: > > 1. When you fetch a subset of the commits, git marks only the edges as > preferred bases, and does not walk the full object graph down to > the roots. So any object you want that is delta'd against something > older will not get reused. If you have reachability bitmaps, I > don't think there is any reason that we cannot use the entire > object graph (starting at the "have" tips, of course) as preferred > bases. In JGit we use the reachability bitmap to provide proof a client has an object. Even if its not in the edges. This allows us much better delta reuse, as often frequently deltas will be available pointing to something behind the edge, but that the client certainly has given the edges we know about. We also use the reachability bitmap to provide proof a client does not need an object. We found a reduction in number of objects transferred because the "want AND NOT have" subtracted out a number of objects not in the edge. Apparently merges, reverts and cherry-picks happen often enough in the repositories we host that this particular optimization helps reduce data transfer, and work at both server and client ends of the connection. Its a nice freebie the bitmap algorithm gives us. > 2. The server is not necessarily fully packed. In an active repo, you > may have a large "base" pack with bitmaps, with several recently > pushed packs on top. You still need to delta the recently pushed > objects against the base objects. Yes, this is unfortunate. One way we avoid this in JGit is to keep everything in pack files, rather than exploding loose. The reachability bitmap often proves the client has the delta base the pusher used to make the object, allowing us to reuse the delta. It may not be the absolute best delta in the world, but reuse is faster than inflate()+delta()+deflate(), and the delta is probably "good enough" until the server can do a real GC in the background. We combine small packs from pushes together by almost literally just concat'ing the packs together and creating a new .idx. Newer pushed data is put in front of the older data, the pack is clustered by "commit, tree, blob" ordering, duplicates are removed, and its written back to disk. Typically we complete this "pack concat" operation mere seconds after a push finishes, so readers have very few packs to deal with. > I don't have measurements on how much the deltas suffer in those two > cases. I know they suffered quite badly for clones without the name > hashes in our alternates repos, but that part should go away with my > patch series. JGit doesn't poke objects into the object table (or even the object list) when a bitmap is used. We spool the bits out of the bitmap in bitmap order and write them to the wire in that order. Its way faster, but depends on the bitmap being in pack-ordering. So clones are crazy fast even though we don't have the path-hash table. JGit also has another optimization where we figure out based on the bitmap if the client needs *everything* in this pack. Which given a pack created only for refs/heads/* is the common case for a clone. If the client is getting all objects we essentially just do a sendfile() for the region starting at offset 12 through end-20. It can't be a sendfile() syscall because it has to be computed into the trailer SHA-1 the client sees, but its a crazy tight IO copy loop with no Git smarts beyond the SHA-1 updating. Like I said, there are a ton of optimizations you guys missed. And we think they make a bigger difference than screwing around with little-endian format to favor x86 CPUs. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html