Hi, Very insightful write-up. I'll need more time to read through again, just some initial opinions. On Sat, Apr 7, 2012 at 2:21 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote: > My officemate Colby and I came up with a better solution a few weeks > ago, but haven't really had a chance to discuss it in on the list. I > guess I should try to do that now. Like anything else, we went into > this work with some assumptions. > > There are two operations we really wanted to improve the performance > of, `git rev-list --objects` for the two commonly used cases from > pack-objects, notably `rev-list --objects $WANT` and `rev-list > --objects $WANT --not $HAVE`. That is, clone and incrementally > fetching something when you have a common ancestor. I'm currently > ignoring shallow clone in this work as it tends to be a bit less > expensive on the object enumeration part. > > Working from the linux repository, with roughly 2.2M objects, we can > assume the vast majority of these objects are stored in a pack file. > If we further assume these are mostly in a single pack file, we can > easily assign every packed object a unique integer. We do this by > assigning the N-th object in the pack integer N. You can already do > this by taking the pack index and computing the reverse index, sorted > by offset in pack. Finding the integer value for any SHA-1 is then a > matter of locating its offset in the normal index, and locating the > position of it in the reverse index... a O(2 log N) operation. > > With all of the packed objects named by an integer [0, N) we can build > a series of bitmaps representing reachability. Given a commit, its > bitmap has every bit set for every object that `git rev-list --objects > $COMMIT_SHA1` would output. If the pack is built from a single branch > (e.g. a repository with no tags and only a master branch), that tip > commit would have every bit set in its bitmap, as all objects in the > pack are contained in the bitmap. > > ... > > Having multiple packs is common, and does complicate this algorithm. > There are known ways to combine different bitmap indexes together to > create a single larger bitmap, mostly by applying a unique "base > prefix" to each bitmap's values. Its very common in the full text > search community to do this when incrementally updating a full text > index. Common repos usually have a big pack as a result of clone and several smaller packs. How about we create the bitmap for the largest pack only and fall back to normal rev walking for the rest? We need to deal with loose objects anyway. I wonder if we could also mark the boundary objects for a given commits (i.e. another bitmap) so we can start walking from there to get to other packs and loose objects. The second bitmap hopefully compresses well. Not sure how it complicates the want-have bitmap operations you describe above though. > A process can assign each pack it observes a unique base prefix, and > then join together bitmaps across those packs to get a more complete > picture. Its not entirely that simple though because a commit in a > newer pack probably still references a parent in an older pack, and so > that commit in the newer pack doesn't have a complete bitmap. > > One way out of this is to only produce bitmaps on a full GC, where the > entire repository is rewritten. If every 10k commits worth of history > costs about 100ms additional processing time to do object enumeration, > we only really have to do a major repack about every 100k commits when > processing is starting to come close to 1.2 seconds of CPU time. The > linux history has done ~220k commits in ~5 years, or 44k commits/year. > Asking a repository to do a full GC at least once per year so that > there only needs to be one set of bitmaps might be acceptable. :-) I'd be happy for it to run, even once a month, as long as it is not run automatically, unexpectedly and stops me from doing whatever I'm doing, like "gc --auto". -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html