On Mon, Dec 9, 2019 at 9:11 AM Jeff King <peff@xxxxxxxx> wrote: > > On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote: > > > The old code to reuse deltas from an existing packfile > > just tried to dump a whole segment of the pack verbatim. > > That's faster than the traditional way of actually adding > > objects to the packing list, but it didn't kick in very > > often. This new code is really going for a middle ground: > > do _some_ per-object work, but way less than we'd > > traditionally do. > > This is a good intro to the general problem, but... > > > For instance, packing torvalds/linux on GitHub servers > > just now reused 6.5M objects, but only needed ~50k > > chunks. > > What the heck is a chunk? :) > > I know, because I wrote the patches, but I think we need to sketch out > the solution a bit for the reader. Ok with sketching it out. > I think the flow here is: > > - we want to do some per-object work (i.e., your intro above) > > - the general strategy is to make a bitmap of objects from the > packfile we'll include, and then iterate over it, writing out each > object exactly as it is in our on-disk pack, but _not_ adding it to > our packlist (which costs memory, and increases the search space for > deltas) > > - one complication is that if we're omitting some objects, we can't > set a delta against a base that we're not sending. So we have to > check each object in try_partial_reuse() to make sure we have its > delta (actually, that third big comment in that function is > incomplete, I think; it talks about sending the object later, not as > part of the reused pack, but we might not send it at all!). Are you talking about this comment: /* * And finally, if we're not sending the base as part of our * reuse chunk, then don't send this object either. The base * would come after us, along with other objects not * necessarily in the pack, which means we'd need to convert * to REF_DELTA on the fly. Better to just let the normal * object_entry code path handle it. */ ? I don't see how it's talking about sending the object later, so I have left it as is. Maybe you can check it again in the v4 series I am going to send. > - another complication is that when we omit parts of the packfile, > that screws up delta base offsets. So to handle that, we have to > keep track of which chunks we send (which is the bits you included > below about chunks) > > - and then we can talk about performance; worst case we might have > interleaved objects that we are sending or not sending, and we'd > have as many chunks as objects. But in practice we send big chunks, > so that's where the 6.5M objects vs 50k chunks comes in. Ok, I have added your points above to the commit message. > > Additional checks are added in have_duplicate_entry() > > and obj_is_packed() to avoid duplicate objects in the > > reuse bitmap. It was probably buggy to not have such a > > check before. > > > > If a client both asks for a tag by sha1 and specifies > > "include-tag", we may end up including the tag in the reuse > > bitmap (due to the first thing), and then later adding it to > > the packlist (due to the second). This results in duplicate > > objects in the pack, which git chokes on. We should notice > > that we are already including it when doing the include-tag > > portion, and avoid adding it to the packlist. > > > > The simplest place to fix this is right in add_ref_tag, > > where we could avoid peeling the tag at all if we know that > > we are already including it. However, I've pushed the check > > instead into have_duplicate_entry(). This fixes not only > > this case, but also means that we cannot have any similar > > problems lurking in other code. > > I think this part could go in its own commit. If we introduce > reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk > at the beginning of the file with the existing code, then we can > introduce the extra checks in have_duplicate_entry() on top of that. Ok, I have extracted this part into its own commit. > And then it would be safe to do the cleanup in show_objects_from_type() > that triggers the test failure in patch 4. Ok, I have eventually moved patch 4 at the end of the series. > > builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++--------- > > pack-bitmap.c | 150 ++++++++++++++++++++-------- > > pack-bitmap.h | 3 +- > > 3 files changed, 288 insertions(+), 87 deletions(-) > > It might be worth having a perf test here. The case this is helping the > most, I think, is when somebody cloning wants all of the objects from > the beginning of the pack, but then there's a bunch of _other_ stuff. > > That could be unreachable objects kept by "repack -k", or maybe objects > reachable outside of heads and tags. Or objects that are part of other > delta islands. This last is the most plausible, really, because we'll > put all of the root-island objects at the beginning of the pack. So > maybe a good perf test would be to take an existing repository add a > bunch of new commits in refs/foo/, Not sure how I could do so. How would you do that? I think it would be best to add completely new realistic commits that are changing the main code base. > and then repack with delta islands > such that refs/heads and refs/tags are in one (root) island, but > refs/foo is in another. > > The old code should fail to do the pack-reuse thing, but we should get > pretty good reuse with the new code (and less CPU and peak RAM, > hopefully, though the perf suite doesn't measure RAM directly). I haven't added a perf test. I may do it if I get an idea about how to add new commits in refs/foo/. > Answering some questions from Jonathan's response in the last round > (some of the quotes are his, some are yours): > > > I still don't understand this part. Going off what's written in the > > commit message here, it seems to me that the issue is that (1) an object > > can be added to the reuse bitmap without going through to_pack, and (2) > > this is done when the client asks for a tag by SHA-1. But all objects > > when specified by SHA-1 go through to_pack, don't they? > > No, if it's part of the reused chunk, we'd avoid adding it to to_pack at > all (and I think the commit message should make that more clear, as I > described above). I used your description above to improve the commit message, so I guess it now answers his question. > > >> No tests, because git does not actually exhibit this "ask > > >> for it and also include-tag" behavior. We do one or the > > >> other on clone, depending on whether --single-branch is set. > > >> However, libgit2 does both. > > > > So my wild guess sould be that maybe the reason is that some of this > > code was shared (or intended to be shared) with libgit2? > > No, the question is how the client behaves, and how we on the server > react to it. Git as a client would never ask for both include-tag and > for the tag itself by sha1. But libgit2 will, so a libgit2 client > cloning from a Git server would trigger the bug. Ok, I have made it explicit in the commit message that the bug would be triggered by a libgit2 client but not a Git client. > > > + if (pos >= bitmap_git->pack->num_objects) > > > + return; /* not actually in the pack */ > > > > Is this case possible? try_partial_reuse() is only called when there is > > a 1 at the bit location. > > Yes, it's possible. The result of a bitmap walk is larger than a given > pack, because we have to extend it to match whatever objects the caller > asked for. E.g., imagine we have commit A, repack into into a bitmapped > pack, and then somebody pushes up commit B. Now I want to clone, getting > both A and B. > > Our bitmap result represents the whole answer, and so must include both > objects. But since "B" isn't in the pack, it doesn't have an assigned > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a > bit (valid only for that walk result!). But of course for pack reuse, we > only care about the objects that were actually in the bitmapped pack. Not sure if these explanations should go into the commit message or a comment in the code. So I haven't added them for now. > > > + /* Don't mark objects not in the packfile */ > > > + if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD) > > > + i = bitmap_git->pack->num_objects / BITS_IN_EWORD; > > > > Why is this necessary? Let's say we have 42 fully-1 words, and therefore > > i is 42. Shouldn't we allocate 42 words in our reuse bitmap? > > This is the same issue as above. Imagine instead of two objects, imagine > we have 42 words worth. But if only 20 words worth are in the packfile, > then we have to stop there. > > Now we _could_ figure this out for each individual object through the > similar logic in try_partial_reuse(). But we try to take an even faster > path for the initial chunk of objects. We don't even walk over them > looking to see if they're deltas or not. We just send that chunk of the > pack verbatim. > > I don't have numbers for how often we hit this path versus the > individual try_partial_reuse() path. Possibly the stats could > differentiate that. Also not sure if these explanations should go into the commit message or a comment in the code. So I haven't added them for now. > Thinking on it more, though, I wonder if there's a bug hiding here. We > know that we can send the whole initial chunk verbatim for OFS_DELTA > objects, because the relative offsets will remain unchanged (i.e., there > are no "holes" that trigger our chunk code). But if there were a > REF_DELTA in that initial chunk, we have no certainty that the base is > being sent. > > In practice, this doesn't happen because the objects in question have to > be in a pack with bitmaps, which means it was written ourselves by > git-repack. And we'd never write REF_DELTA objects there. > > But I wonder if it's worth being a bit more careful (and what the > performance impact is; it would mean checking the type of every object > in that initial chunk). I haven't done anything related to that. Maybe something can be done in a follow up patch. Thanks a lot for the review! Christian.