On Mon, Nov 13, 2023 at 2:34 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Mon, Nov 13, 2023 at 05:02:54PM -0500, Jeff King wrote:
> > On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
> >
> > > This is unsafe; the object may need to be read later within the same
> > > merge. [...]
> > >
> > > I think (untested) that you could fix this by creating two packs
> > > instead of just one. In particular, add a call to
> > > flush_odb_transaction() after the "redo_after_renames" block in
> > > merge_ort_nonrecursive_internal().
> >
> > It might not be too hard to just let in-process callers access the
> > objects we've written. A quick and dirty patch is below, which works
> > with the test you suggested (the test still fails because it finds a
> > conflict, but it gets past the "woah, I can't find that sha1" part).
>
> That's a very slick idea, and I think that this series has some legs to
> stand on now as a result.
>
> There are a few interesting conclusions we can draw here:
>
>   1. This solves the recursive merge problem that Elijah mentioned
>      earlier, where we could generate up to 2^N packs (where N is the
>      maximum depth of the recursive merge).
>
>   2. This also solves the case where the merge-ort code needs to read
>      an object that it wrote earlier in the same process, without
>      having to flush out intermediate packs. So in the best case we
>      need only a single pack (instead of two).
>
>   3. This also solves the 'replay' issue, *and* allows us to avoid the
>      tmp-objdir thing there entirely, since we can simulate object
>      reads in the bulk-checkin pack.
>
> So I think that this is a direction worth pursuing.

Agreed; this looks great. It's basically bringing fast-import-like
functionality -- writing objects to a single new packfile while making
previous objects accessible to subsequent ones -- to additional
callers.

> In terms of making those lookups faster, I think that what you'd want
> is an oidmap. The overhead is slightly unfortunate, but I think that
> any other solution which requires keeping the written array in sorted
> order would in practice be more expensive, as you have to constantly
> reallocate and copy portions of the array around to maintain its
> invariant.

When comparing the overhead of an oidmap to a bunch of inodes, however,
it seems relatively cheap. :-)
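
For concreteness, here's a minimal sketch of what that bookkeeping
might look like (this is not Peff's actual patch). oidmap_get() and
oidmap_put() are the real API from oidmap.h; the written_object struct,
its pack_offset field, and the two helpers are names I made up for
illustration:

        #include "git-compat-util.h"
        #include "hash.h"
        #include "oidmap.h"

        /*
         * Hypothetical bookkeeping for objects streamed into the
         * bulk-checkin pack, keyed by object ID via the embedded
         * oidmap_entry; O(1) lookups, no sorted "written" array to
         * maintain.
         */
        struct written_object {
                struct oidmap_entry entry; /* must be first; key is entry.oid */
                off_t pack_offset;         /* offset in the in-progress pack */
        };

        static struct oidmap written_objects = OIDMAP_INIT;

        /* Record each object as it is written to the bulk-checkin pack. */
        static void record_written_object(const struct object_id *oid,
                                          off_t pack_offset)
        {
                struct written_object *obj;

                if (oidmap_get(&written_objects, oid))
                        return; /* already recorded */
                obj = xmalloc(sizeof(*obj));
                oidcpy(&obj->entry.oid, oid);
                obj->pack_offset = pack_offset;
                oidmap_put(&written_objects, obj);
        }

        /*
         * Answer "did we already write this object to the not-yet-flushed
         * pack?", optionally reporting where, so the caller can read the
         * object back out of the pack in progress.
         */
        static int written_object_lookup(const struct object_id *oid,
                                         off_t *pack_offset)
        {
                struct written_object *obj =
                        oidmap_get(&written_objects, oid);

                if (!obj)
                        return 0;
                if (pack_offset)
                        *pack_offset = obj->pack_offset;
                return 1;
        }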

> > I don't know if that is sufficient, though. Would other spawned
> > processes (hooks, external merge drivers, and so on) need to be able
> > to see these objects, too?
>
> Interesting point. In theory those processes could ask about newly
> created objects, and if they were spawned before the bulk-checkin pack
> was flushed, those lookups would fail.

One of the big design differences that I was pushing really hard with
git-replay was performance and the things that came with it -- no
worktree, no per-commit hooks (which are nearly ruled out by the lack
of a worktree, but it's still worth calling out separately), etc. A
post-operation hook could be fine, but it would also not get to assume
a worktree. merge-tree is the same as far as hooks go; I'd rather just
not have them, but if we did, they'd be a post-operation hook. In both
cases, that makes hooks not much of a sticking point. External merge
drivers, however...

> But that requires, e.g. for hooks, that we know a priori the object
> IDs of some newly-written objects. If you wanted to make those
> lookups succeed, I think there are a couple of options:
>
>   - teach child processes about the bulk-checkin pack, and let them
>     perform the same fake lookup
>
>   - flush (but do not close) the bulk-checkin transaction
>
> In any event, I think that this is a sufficiently rare and niche case
> that we'd be OK to declare that you should not expect the above
> scenarios to work when using `--write-pack`. If someone does ask for
> that feature in the future, we could implement it relatively
> painlessly using one of the above options.

Seems reasonable to me.
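
To illustrate the second option, a minimal sketch of flushing without
closing, using the begin/flush/end_odb_transaction() API that already
exists in bulk-checkin.h; the surrounding function and the hook
invocation are hypothetical stand-ins:

        #include "git-compat-util.h"
        #include "bulk-checkin.h"

        /* Placeholder: spawn a hook or external merge driver here. */
        static void run_post_operation_hook(void)
        {
        }

        static void merge_then_spawn_child(void)
        {
                begin_odb_transaction();

                /* ... write merge results into the bulk-checkin pack ... */

                /*
                 * Flush, but do not close: the pack accumulated so far
                 * is finalized and becomes visible on disk, so a child
                 * process can see those objects; the transaction stays
                 * open and later writes accumulate into a new pack.
                 */
                flush_odb_transaction();

                run_post_operation_hook();

                /* ... any further object writes ... */

                end_odb_transaction();
        }

The trade-off is that each flush before spawning a child costs an
extra pack, which is the very thing `--write-pack` is trying to
minimize -- hence the suggestion to just document the limitation
unless someone actually asks for this.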