On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote: > Yet, I think the biggest problem with pack v4 at the moment is the > packing algorithm for tree objects. We are piggy-backing on the pack v2 > object delta compression sorting and that produces suboptimal results > due to deep recursions. And it is the recursion that kills us. The pack > v4 requires a new packing algorithm for its tree objects. > > What I imagined is something like this: > > - Each canonical tree entry is made of a SHA1, mode and path. Let's > assume this is hashed into a 24-bit value. > > - Each tree object can therefore be represented as a string of 24-bit > "characters". > > - Delta-compressing a tree object becomes a substring search where we > try to replace a sequence of "characters" with the longest "string" > possible from another object. Repeat with the remaining sequences. Somewhat related to this, I was playing this weekend with the idea of generating fast tree diffs from our on-disk deltas. That is, given a base tree and a binary delta against it, could I reliably reproduce a diff (one way or the other) in O(size of diff), rather than O(size of tree)? I have two applications in mind: 1. Tree diffs for "git log -- pathspec" do the same diff (between a tree and its corresponding version in the parent) over and over. A significant percentage of the time (at least in linux.git and git.git), we store deltas between these trees. We could stop wasting time walking the common parts of the trees, and jump right to the differences. 2. Calculating reachability for packing[1] spends a lot of time in lookup_object, as we have to go through each tree saying "have we seen object 1234abcd yet?". If we could instead just view the differences, we would not have to make those hash lookups for entries whose objects we know we have seen. The conclusion I came to was "no, you cannot do it in the general case of byte-wise binary diffs"[2]. Conceptually, it's not too difficult. For example, if we know we have copied up to byte X from the source tree, then we skip to byte Y, we know that bytes Y-X did not make it, and are a deletion of entries. If those fall on whole-entry boundaries, it's pretty simple. But of course we have no guarantee that they do in a byte-wise diff. E.g., if entry "foo" changes its sha1, then we will generally copy up to and including the "foo\0" from the source, and then insert our own new sha1. Which means we need to walk backwards from point "Y" to find the filename of the entry we are in. Or worse, the sha1s might share one or more leading bytes, and that would be part of the copy, too. So in short, we may end up anywhere within a tree object and need to walk backwards to the start of the entry. But we cannot walk backwards looking for a NUL, because it may also be part of a sha1. You can only reliably parse a git tree by starting at the beginning of the stream and going forwards[3]. If we knew that our deltas were always produced on entry-boundaries (a "character" in your description above), this would be much simpler. Maybe this is all stuff you already thought through, but if not, food for thought. :) -Peff [1] Of course there are other reachability checks besides packing, like git-prune. But all of those are even better sped up by using reachability bitmaps. Packing, as the place where we generate the bitmaps, and which is sensitive to things like traversal order, remains the place where we will always need to actually walk. [2] One option, of course, is to generate byte-wise deltas, but with a promise to always align them on entry boundaries. I'm tempted by this, because the result would be readable by existing packv2 readers. We'd have to set a flag somewhere that indicates the pack was written with this property, though. [3] I suspect you could come up with some heuristic that finds tree entry boundaries with high probability, and in the low probability case does not produce a wrong answer, but instead has to walk all the way back to the beginning of the tree. That would be fine here. But frankly, this "walk backwards" thing was just the straw that broke the camel's back for me in working on this. Handling all the possible cases was ending up quite complicated. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html