On Thu, Apr 5, 2012 at 06:02, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On linux-2.6.git, one 521MB pack, it generates a 356MB cache and a
> 30MB index companion. Though if you are willing to pay extra 5 seconds
> for decompressing, then the cache can go down to 94MB. We can cut
> nearly half "rev-list --objects --all" time with this cache
> (uncompressed cache):
>
> $ time ~/w/git/git rev-list --objects --all --quiet </dev/null
> real    2m31.310s
> user    2m28.735s
> sys     0m1.604s
>
> $ time TREE_CACHE=cache ~/w/git/git rev-list --objects --all --quiet </dev/null
> real    1m6.810s
> user    1m6.091s
> sys     0m0.708s
>
> $ time ~/w/git/git rev-list --all --quiet </dev/null
> real    0m14.261s # should be cut down to one third with commit cache
> user    0m14.088s
> sys     0m0.171s
>
> Not really good. "rev-list --objects"'s taking less than 30s would be
> nicer. lookup_object() is on top from 'perf' report with cache on. Not
> sure what to do with it.

My officemate Colby and I came up with a better solution a few weeks
ago, but haven't really had a chance to discuss it on the list. I guess
I should try to do that now.

Like anything else, we went into this work with some assumptions. There
are two operations whose performance we really wanted to improve:
`git rev-list --objects` for the two cases pack-objects commonly uses,
namely `rev-list --objects $WANT` and `rev-list --objects $WANT --not
$HAVE`. That is, a clone, and an incremental fetch when there is a
common ancestor. I'm currently ignoring shallow clone in this work as
it tends to be a bit less expensive on the object enumeration part.

Working from the linux repository, with roughly 2.2M objects, we can
assume the vast majority of these objects are stored in a pack file. If
we further assume these are mostly in a single pack file, we can easily
assign every packed object a unique integer. We do this by assigning
the N-th object in the pack the integer N. You can already do this by
taking the pack index and computing the reverse index, sorted by offset
in the pack. Finding the integer value for any SHA-1 is then a matter
of locating its offset in the normal index, and locating the position
of that offset in the reverse index... an O(2 log N) operation.

With all of the packed objects named by an integer [0, N) we can build
a series of bitmaps representing reachability. Given a commit, its
bitmap has a bit set for every object that `git rev-list --objects
$COMMIT_SHA1` would output. If the pack is built from a single branch
(e.g. a repository with no tags and only a master branch), that tip
commit would have every bit set in its bitmap, as all objects in the
pack are reachable from it.

A bitmap of 2.2M objects is 2.2M bits in size, roughly 275 KiB worth of
data. But we can compress the bitmap using word aligned hybrid
compression (WAH) [1] and have it drop to about 1 KiB in size. Packs
have fairly good locality, given that they are roughly ordered by time.
The further back in history a commit is, the more its bitmap contains
zeros rather than ones, and those zeros are roughly consecutive because
whole regions of the pack are unreachable from it, so the bitmap still
compresses well. We actually did an experiment computing the bitmaps
for all of the commits in the kernel repository; IIRC the largest ones
were coming in around 40 KiB in the middle of history, shrinking again
as you got further back in history.
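To make the integer naming concrete, here is a minimal, self-contained
sketch of the offset-sorted reverse index and the two binary searches.
The `struct pack_entry` layout and function names are simplified
stand-ins for illustration, not git's actual .idx code:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pack_entry {
    unsigned char sha1[20];   /* object name, as stored in the .idx */
    uint64_t offset;          /* offset of the object in the .pack */
};

static int by_offset(const void *a, const void *b)
{
    const struct pack_entry *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Build the reverse index: the same entries, sorted by pack offset.
 * The N-th element of the result is the object named by integer N.
 */
static struct pack_entry *build_reverse_index(const struct pack_entry *idx,
                                              size_t nr)
{
    struct pack_entry *rev = malloc(nr * sizeof(*rev));
    if (!rev)
        return NULL;
    memcpy(rev, idx, nr * sizeof(*rev));
    qsort(rev, nr, sizeof(*rev), by_offset);
    return rev;
}

/*
 * Find the integer name of an object: binary search the normal index
 * (sorted by SHA-1) for its offset, then binary search the reverse
 * index (sorted by offset) for its position -- the O(2 log N) lookup.
 */
static long object_position(const struct pack_entry *idx,
                            const struct pack_entry *rev,
                            size_t nr, const unsigned char *sha1)
{
    size_t lo = 0, hi = nr;
    uint64_t offset = 0;
    int found = 0;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(sha1, idx[mid].sha1, 20);
        if (!cmp) {
            offset = idx[mid].offset;
            found = 1;
            break;
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    if (!found)
        return -1;

    lo = 0;
    hi = nr;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rev[mid].offset == offset)
            return (long)mid;
        if (rev[mid].offset < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}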
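And a toy illustration of why the reachability bitmaps compress so
well. This is a simplified run-length scheme over 64-bit words, only in
the spirit of the WAH encoding from [1]; it just counts how many words
the long runs collapse into:

#include <stdint.h>
#include <stdio.h>

#define NR_WORDS 4096   /* 4096 * 64 = ~262k objects worth of bits */

/* Count how many output words a simple run-length coding would need. */
static size_t rle_size(const uint64_t *bits, size_t nr)
{
    size_t out = 0, i = 0;

    while (i < nr) {
        uint64_t w = bits[i];
        if (w == 0 || w == ~(uint64_t)0) {
            size_t run = 1;
            while (i + run < nr && bits[i + run] == w)
                run++;
            out++;          /* one fill word encodes the whole run */
            i += run;
        } else {
            out++;          /* literal word, stored as-is */
            i++;
        }
    }
    return out;
}

int main(void)
{
    static uint64_t bits[NR_WORDS];
    size_t i;

    /* A toy bitmap: one long run of ones, long runs of zeros,
       and a single literal word in between. */
    for (i = 0; i < NR_WORDS / 4; i++)
        bits[i] = ~(uint64_t)0;
    bits[NR_WORDS / 2] = 0x10;

    printf("raw: %d words, run-length coded: %zu words\n",
           NR_WORDS, rle_size(bits, NR_WORDS));
    return 0;
}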
Assuming all bitmaps average around 20 KiB in size (most were in this
range), storing bitmaps for every commit costs around 4.2G. Not worth
it.

However. If we take the kernel history in rev-list and pick two commits
that are roughly ~10,000 commits apart from one another, JGit can
compute the `rev-list --objects` between these two commits in about 120
milliseconds (git-core should be faster, or at least comparable). Given
that, we don't have to store every bitmap. Instead you store a bitmap
about every 10,000 commits. Now your bitmap storage is closer to 1 MiB
in size for the entire pack. This can easily be appended to the end of
the pack-*.idx file as a new section when the pack is built. The packer
can construct the necessary data during packing with what I suspect is
relatively little additional cost over what it is already doing, as
most of the required data is in memory or being touched anyway.

Obviously that 10k distance between bitmaps is tuneable, and could be a
config variable. We could even make it a % of the pack size, with Git
automatically selecting equidistant points between commits in the
history such that the compressed bitmaps fit within the target
percentage. Repository owners could set this to e.g. 10% and let the
machine do the rest of the work. (10% of a 560M pack might be ~56M
worth of bitmaps or every ~2800 commits.)

Computing `rev-list --objects $WANT` is just a matter of OR-ing
together the bitmaps that correspond to what the client asked for. WAH
compressed bitmaps can apply OR without decompressing, in the ~5 ms
range, rather than 14.2s... or 90s. :-)

Computing `rev-list --objects $WANT --not $HAVE` is likewise an OR over
the $WANT bitmaps and an OR over the $HAVE bitmaps, followed by an
AND-NOT of the two results, which again can be done directly on the
compressed form.

When the client uses a $WANT or $HAVE that we don't have a direct
bitmap for, we build the bitmap on the fly by running `rev-list $WANT`
(or $HAVE) until the revision traversal produces a commit that does
have a bitmap. Objects that aren't in the pack are handled by extending
the bitmap, assigning them new temporary integers larger than the
number of objects in the pack. Then we finish the operation with the
bitmaps as above.

To output the pack we don't even need to build up a huge list of struct
packed_obj in memory. Instead we iterate the set bits in the bitmap,
working off the compressed format. When an object has its bit set, it
needs to be sent to the client. The object can be found in the pack by
looking at the N-th entry of the offset-sorted reverse index, which
gives its offset and, through the primary index, its SHA-1. Copying
slices of pack data should be a pretty simple task. The bits are
already sorted by preferred pack order, since they came from a pack, so
the writer still produces a good stream. Obviously if you want to
change the ordering, or the deltas, aka repack -f, we need to avoid
using the bitmap ordering here. There is some complication relating to
swapping out the delta compressed form for a non-delta compressed form
if you cannot prove the peer has the delta base that is currently being
used.

But with the bitmaps we have a much more accurate representation of
what the client *actually* has. Instead of being limited to the common
ancestor negotiation point's trees, the $HAVE bitmap covers *every
single object to the beginning of time*. This significantly increases
the number of candidates available for delta reuse, because there is a
better chance that the base we use is already available to the client.
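Here is a toy sketch of that set algebra using plain, uncompressed
64-bit words. The real proposal operates on the WAH-compressed form
directly, but the OR and AND-NOT steps are the same, and the final loop
stands in for walking the N-th entries of the reverse index:

#include <stdint.h>
#include <stdio.h>

#define NR_OBJECTS 512                      /* objects in the (toy) pack */
#define NR_WORDS ((NR_OBJECTS + 63) / 64)

typedef uint64_t bitmap[NR_WORDS];

/* dst |= src: union of everything reachable from a set of tips */
static void bitmap_or(bitmap dst, const bitmap src)
{
    int i;
    for (i = 0; i < NR_WORDS; i++)
        dst[i] |= src[i];
}

/* dst &= ~src: objects to send = reachable from $WANT minus $HAVE */
static void bitmap_and_not(bitmap dst, const bitmap src)
{
    int i;
    for (i = 0; i < NR_WORDS; i++)
        dst[i] &= ~src[i];
}

int main(void)
{
    bitmap want_a = { 0 }, want_b = { 0 }, have = { 0 };
    int i;

    /* Pretend these came from stored per-commit bitmaps. */
    want_a[0] = 0x0f0f;   /* reachable from the first $WANT tip  */
    want_b[0] = 0xf0f0;   /* reachable from the second $WANT tip */
    have[0]   = 0x00ff;   /* reachable from the client's $HAVE   */

    bitmap_or(want_a, want_b);
    bitmap_and_not(want_a, have);

    /*
     * Each surviving bit N is the N-th entry of the offset-sorted
     * reverse index: an object (and pack slice) to copy to the client.
     */
    for (i = 0; i < NR_OBJECTS; i++)
        if (want_a[i / 64] & ((uint64_t)1 << (i % 64)))
            printf("send object at pack position %d\n", i);
    return 0;
}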
The $HAVE bitmap covering a much bigger section of history also means
we transmit fewer objects, which makes for a faster transfer for the
client. This often matters with cherry-picks or reverts across the
common ancestor cut point in the graph, where the peer already has the
relevant object(s) but we can't prove it from the limited section of
history we look at today. The $HAVE bitmap is a much more comprehensive
picture of the client's state, making for a smaller transfer.

Having multiple packs is common, and does complicate this algorithm.
There are known ways to combine different bitmap indexes together into
a single larger bitmap, mostly by applying a unique "base prefix" to
each bitmap's values. It's very common in the full text search
community to do this when incrementally updating a full text index. A
process can assign each pack it observes a unique base prefix, and then
join together bitmaps across those packs to get a more complete
picture. It's not entirely that simple though, because a commit in a
newer pack probably still references a parent in an older pack, and so
that commit in the newer pack doesn't have a complete bitmap.

One way out of this is to only produce bitmaps on a full GC, where the
entire repository is rewritten. If every 10k commits worth of history
costs about 100ms of additional processing time to do object
enumeration, we only really have to do a major repack about every 100k
commits, when processing is starting to come close to 1.2 seconds of
CPU time. The linux history has done ~220k commits in ~5 years, or 44k
commits/year. Asking a repository to do a full GC at least once per
year so that there only needs to be one set of bitmaps might be
acceptable. :-)

The other operation that matters is `git rev-list --objects $NEW --not
--all`, which is done as a reachability test when receiving data into a
repository from the network (in either fetch or receive-pack). If we
know a pack was created locally by the git gc or git repack command, we
only need to run `git rev-list --objects $NEW` and stop traversal for a
section of the graph when we find a commit or object that already
exists in a locally created pack index. In other words we don't use the
"--not --all" bit; instead we look at each object coming out of the
traversal to see if it is in a locally created pack, and if it is we
mark it UNINTERESTING and continue the traversal. This would eliminate
the need to parse and load --all into the priority queue; instead we
parse and insert only the commits that are directly connected to the
part of the $NEW graph we were just given, and we only do that so we
can stop traversal at the common ancestor point(s) and avoid walking
back to the root. It probably isn't even necessary to parse or insert
these commits; we just have to tag them UNINTERESTING and avoid putting
them into the priority queue.

However, not all locally stored packs were locally created. Some are
created from the network (e.g. when the number of objects is > 100). So
for this optimization to work we need an additional chunk of metadata
written with the pack to say "this pack was made by git gc / git repack
and is trustworthy". This could be a new ".local" file alongside
.pack/.idx, or we could make a minor change to the .idx format to allow
a "source bit" to be written to say where the pack came from (network
or fast-import vs. local gc).
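A self-contained sketch of that connectivity check, with a toy
single-parent history and a flag standing in for "found in a locally
created pack"; none of these names are git's actual APIs:

#include <stdio.h>

#define SEEN          1
#define UNINTERESTING 2

struct commit {
    const char *name;
    struct commit *parent;   /* single-parent toy history */
    int flags;
    int in_local_pack;       /* stand-in for "found in a gc-created pack" */
};

/*
 * Walk from $NEW toward the roots, stopping as soon as we reach an
 * object we already trust, instead of loading "--not --all".
 */
static void check_connected(struct commit *new_tip)
{
    struct commit *c = new_tip;

    while (c && !(c->flags & SEEN)) {
        c->flags |= SEEN;
        if (c->in_local_pack) {
            c->flags |= UNINTERESTING;
            printf("%s: already in a local pack, stopping here\n", c->name);
            return;
        }
        printf("%s: new object, walking to its parent\n", c->name);
        c = c->parent;
    }
    if (!c)
        printf("reached a root without finding a trusted pack\n");
}

int main(void)
{
    struct commit base = { "base", NULL,  0, 1 };   /* already packed by gc */
    struct commit mid  = { "mid",  &base, 0, 0 };   /* just received */
    struct commit tip  = { "tip",  &mid,  0, 0 };   /* just received */

    check_connected(&tip);
    return 0;
}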
I think a lot of the other users of the commit graph / object graph are
looking at small enough slices of history that walking through 10k
commits in 120 ms is an acceptable response time to the human running
the command.

So it's not really pack v4. It only needs a few MiB of additional space
on top of the existing pack data, and is easily stored at the end of
the local index file. But it gets us a lot of improvement in some
pretty tough areas.

I don't have any code to share, because we haven't written it yet. But
this should shave off some of the big corners within Git, with
relatively little additional disk usage.

[1] "Sorting improves word-aligned bitmap indexes",
    Daniel Lemire, Owen Kaser, Kamel Aouiche
    http://arxiv.org/abs/0901.3751