On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
>
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.

It may still be nice to do this optimization for plain "git merge" as
well.  I had it in my list of ideas somewhere to do a "fast-import-like"
thing to avoid writing loose objects, as I suspected they'd actually be
a performance impediment.

> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which
> > majority of the object their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
>
> Right.
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each came from such a trial merge.  Do we have a "gc" story offered
> > for these packs created by the new feature?  E.g., "once merge-tree
> > is done creating a trial merge, we can discard the objects created
> > in the pack, because we never expose new objects in the pack to the
> > outside, processes running simultaneously, so instead closing the
> > new packfile by calling flush_bulk_checkin_packfile(), we can safely
> > unlink the temporary pack.  We do not even need to spend cycles to
> > run a gc that requires cycles to enumerate what is still reachable",
> > or something like that?
>
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
>
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
>
> at the beginning of a merge operation, and:
>
>     tmp_objdir_discard_objects(tmp_objdir);
>
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during test
> merges available via refs/pull/N/{merge,rebase}.

Oh, at the contributor summit, Johannes said he only needed pass/fail,
not the actual commits, which is why I suggested this route.  If you
need to keep the actual commits, then this won't help.

I was interested in the same question as Junio, but from a different
angle.  The fast-import documentation points out that the packs it
creates are suboptimal, with poorer delta choices.  Are the packs
created by bulk-checkin prone to the same issues?  When I was thinking
in terms of having "git merge" use fast-import for pack creation
instead of writing loose objects (an idea I never investigated very
far), I was wondering if I'd need to mark those packs as "less optimal"
and do something to make sure they were more likely to be repacked.

I believe geometric repacking didn't exist back when I was thinking
about this, and perhaps geometric repacking automatically handles
things nicely for us.
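(For anyone following along: by geometric repacking I mean the
"git repack --geometric=<factor>" mode, which rolls the smaller packs in
a repository up together whenever their object counts fall outside a
geometric progression.  On the forge side I'd imagine the maintenance
step would be something like

    git repack --geometric=2 -d

but that invocation is only a sketch from memory on my part; I haven't
tried it in the trial-merge scenario being discussed here.)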
Does it, or are we risking retaining sub-optimal deltas from the
bulk-checkin code?  (I've never really cracked open the pack code, so I
have absolutely no idea; I'm just curious.)

> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

A couple of the comments earlier in the series suggested this was about
streaming blobs to a pack in the bulk checkin code.  Are tree and
commit objects also put in the pack, or will those continue to be
written loosely?
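(Tangentially: a crude way to watch where the new objects land on a
throwaway clone would be something like

    git count-objects -v    # note "count" (loose) and "in-pack"
    git merge-tree --write-tree side1 side2
    git count-objects -v    # see which of the two numbers grew

with "side1"/"side2" standing in for whatever branches you're
test-merging.  That only shows loose-vs-packed counts, not object
types, and I haven't actually run it against this series, so treat it
as a sketch rather than a result.)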