On Wed, Oct 30, 2013 at 11:38 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote: >> Since (1) is relatively rare, I think we are using this as a proxy for >> (2), so that we can do a regular walk rather than looking around for >> bitmaps that don't exist. But I may be misremembering the reasoning. >> Vicent? > > Ah. I am not sure if we do this in JGit. I think JGit's approach is to > look if the have appears in a pack with bitmaps, this is a simple > lookup in the .idx file and does not require expanding any data from > the .bitmap file. No, you don't do this in JGit right now. This is a small optimization we implemented to prevent *loading* the bitmap file (and hence building the reverse index, which can be expensive) unless we're sure we can at least *attempt* a bitmap walk. > But it wasn't my question. :-) > > Client sends "want B ; have E". What if E appears in the bitmapped > pack, but does not itself have a bitmap? Do you walk backwards from B > and switch to the bitmap algorithm when you find a commit that has a > bitmap and that bitmap contains E? This is correct, we use the same heuristics as JGit. Once we have loaded a bitmap file we know we can attempt a bitmap walk; if E is on the bitmapped pack, we'll build a temporary bitmap using an extended index (with bits for commits that are not in the packfile) as we walk backwards until E. Once we find a commit with a bitmap, we'll OR that with the temporary bitmap to skip the full walk. >>> In JGit we write the to_pack list first, then the reuse pack. Our >>> rationale was the to_pack list is recent objects that are newer and >>> would appear first in a traditional traversal, so they should go at >>> the front of the stream. This does mean if they delta compress against >>> an object in that reuse_packfile slice they have to use REF_DELTA >>> instead of OFS_DELTA. >> >> That's a good point. In our case I think we do not delta against the >> reused packfile objects at all, as we simply send out the whole slice of >> packfile without making an entry for each object. > > JGit also doesn't use the reused packfile as delta bases, because > there are no object entries to shove through the delta window. So > there is never any risk of a reference to a base later in the file. It > also means that "thin pack" at the front of the stream is less > optimally compressed. At Google we side step that by doing GC at the > server very often, to try and keep the number of objects in that pack > low. > > It might make sense to use a commit that covers the majority of the > reused pack as the edge base candidate case during delta compression > here, as though the client had sent us a "have" for that commit. I > don't think we have tried this in JGit. It would make deltas use > REF_DELTA, but the delta size has to be smaller than the REF_DELTA > header to be used in the stream so its still a smaller overall > transfer. > >>> Is this series running on github.com/torvalds/linux? Last Saturday I >>> ran a live demo clone comparing github.com/torvalds/linux to a JGit >>> bitmap clone and some guy heckled me because GitHub was only a few >>> seconds slower. :-) >> >> It is. Use kernel.org if you want to make fun of someone. :) > > Hah. OK, so GitHub was only a few seconds slower only because my > desktop is better connected to our data centers than to yours. Nicely > done, this patch series really works. :) Thanks Shawn, your feedback was very helpful. Re. cloning speeds: we are actually bottlenecked by our routing layer. The CPUs in our new fileservers can clone `torvalds/linux` to /dev/null in 20s, but we're slowing down when routing the actual traffic back to the client. We're in the process of rewriting our Git reverse proxys so let's see what the future looks like. Love, vmg -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html