On Thu, Dec 19, 2019 at 1:42 AM Jeff King <peff@xxxxxxxx> wrote: > > On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote: > > > > - one complication is that if we're omitting some objects, we can't > > > set a delta against a base that we're not sending. So we have to > > > check each object in try_partial_reuse() to make sure we have its > > > delta (actually, that third big comment in that function is > > > incomplete, I think; it talks about sending the object later, not as > > > part of the reused pack, but we might not send it at all!). > > > > Are you talking about this comment: > > > > /* > > * And finally, if we're not sending the base as part of our > > * reuse chunk, then don't send this object either. The base > > * would come after us, along with other objects not > > * necessarily in the pack, which means we'd need to convert > > * to REF_DELTA on the fly. Better to just let the normal > > * object_entry code path handle it. > > */ > > > > ? > > > > I don't see how it's talking about sending the object later, so I have > > left it as is. Maybe you can check it again in the v4 series I am > > going to send. > > Yes, that's the comment. It says "the base would come after us". That > could be true, but it also could be that we are not sending the object > at all (in fact, that seems more likely in practice). The outcome is the > same, though: we don't verbatim reuse the delta'd object if its base is > not also being reused. In the V4 (https://public-inbox.org/git/20191218112547.4974-11-chriscool@xxxxxxxxxxxxx/) there is now the following in the commit message: One complication is that if we're omitting some objects, we can't set a delta against a base that we're not sending. So we have to check each object in try_partial_reuse() to make sure we have its delta. Would you also want the comment to be improved to something like: /* * And finally, if we're not sending the base as part of our * reuse chunk, then don't send this object either. Because we * should not send the object at all. Or because the base * would come after us, along with other objects not * necessarily in the pack, which means we'd need to convert * to REF_DELTA on the fly. Better to just let the normal * object_entry code path handle it. */ > > > It might be worth having a perf test here. The case this is helping the > > > most, I think, is when somebody cloning wants all of the objects from > > > the beginning of the pack, but then there's a bunch of _other_ stuff. > > > > > > That could be unreachable objects kept by "repack -k", or maybe objects > > > reachable outside of heads and tags. Or objects that are part of other > > > delta islands. This last is the most plausible, really, because we'll > > > put all of the root-island objects at the beginning of the pack. So > > > maybe a good perf test would be to take an existing repository add a > > > bunch of new commits in refs/foo/, > > > > Not sure how I could do so. How would you do that? > > > > I think it would be best to add completely new realistic commits that > > are changing the main code base. > > I agree that would be the most realistic for the "forked repository > network" case that we originally wrote this for. But I think a more > mundane (and possibly easier to generate) example might be having a > bunch of refs/notes. > > So perhaps something like this: [...] Ok, I will add this test to my patch series if I need to resend a V5. Otherwise I will send it separately. > From my limited prodding, it doesn't trigger pack-reuse with the current > code, but would after your series. On linux.git, it produces: > > Test origin origin/jk/packfile-reuse-cleanup > ---------------------------------------------------------------------------------- > 5312.3: clone without notes 14.16(14.26+4.84) 10.80(10.40+0.44) -23.7% > 5312.4: clone size 1.4G 1.4G +0.0% Yeah, nice improvement. > though I suspect the more interesting savings is in RAM (since we don't > have to allocate a "struct object_entry" for most of the objects). I > don't know how hard it would be to collect that data in the perf suite. I don't know either. > Running the final pack-objects manually with massif, I get peak heap > usage dropping from ~1500MB to ~380MB. Very nice! > For git.git it seems less impressive (very little CPU saved, and the > resulting pack is slightly larger, perhaps due to not re-considering > some deltas whose on-disk versions had to be thrown away): > > Test origin origin/jk/packfile-reuse-cleanup > -------------------------------------------------------------------------------- > 5312.3: clone without notes 1.22(3.60+0.16) 1.20(3.56+0.16) -1.6% > 5312.4: clone size 65.3M 67.0M +2.5% Yeah, I think this is acceptable. > > > > Is this case possible? try_partial_reuse() is only called when there is > > > > a 1 at the bit location. > > > > > > Yes, it's possible. The result of a bitmap walk is larger than a given > > > pack, because we have to extend it to match whatever objects the caller > > > asked for. E.g., imagine we have commit A, repack into into a bitmapped > > > pack, and then somebody pushes up commit B. Now I want to clone, getting > > > both A and B. > > > > > > Our bitmap result represents the whole answer, and so must include both > > > objects. But since "B" isn't in the pack, it doesn't have an assigned > > > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a > > > bit (valid only for that walk result!). But of course for pack reuse, we > > > only care about the objects that were actually in the bitmapped pack. > > > > Not sure if these explanations should go into the commit message or a > > comment in the code. So I haven't added them for now. > > I think this is really outside the scope of this series, even, and gets > into how the bitmap traversal works in general. I'm sure the > documentation for that could be a bit better. Ok, I will take a look at improving the documentation, either in this patch series if I resend it, or in a follow up small series. > > > Thinking on it more, though, I wonder if there's a bug hiding here. We > > > know that we can send the whole initial chunk verbatim for OFS_DELTA > > > objects, because the relative offsets will remain unchanged (i.e., there > > > are no "holes" that trigger our chunk code). But if there were a > > > REF_DELTA in that initial chunk, we have no certainty that the base is > > > being sent. > > > > > > In practice, this doesn't happen because the objects in question have to > > > be in a pack with bitmaps, which means it was written ourselves by > > > git-repack. And we'd never write REF_DELTA objects there. > > > > > > But I wonder if it's worth being a bit more careful (and what the > > > performance impact is; it would mean checking the type of every object > > > in that initial chunk). > > > > I haven't done anything related to that. Maybe something can be done > > in a follow up patch. > > Hmm. I was thinking the problem was introduced in this series, but the > old code should suffer from this as well. I wondered if it might simply > be that the old code did not trigger often enough for anybody to notice, > but we have been running the code in this series for several years at > GitHub. So probably my reasoning above is sound (but it might still be > worth addressing). Ok, I will take a look. Thanks, Christian.