Re: What's cooking in git.git (Sep 2021, #08; Mon, 27)

Elijah Newren <newren@xxxxxxxxx> · Fri, 1 Oct 2021 11:12:22 -0700

On Fri, Oct 1, 2021 at 10:02 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> writes:
>
> > Said another way, I don't think anything should be writing a critical
> > file that needs to be durable when we're in the middle of a
> > "read-only" process like git-log.  The only things written should be
> > temporary stuff, like the automatic remerge calculation from
> > merge-ort, the textconv cache optimization stuff, or perhaps future
> > gitattributes transformation caching.  All that stuff can safely be
> > blown away at the completion of each merge.
>
> The textconv cache can be populated/written during "git log -p" into
> the object store to persist.  With or without "--remerge-diff", we
> can make design decision to either
>
>  - use temporary object store to discard everything we create at the
>    end in one-go, or
>
>  - write them into the object store to let later gc to handle the
>    crufts.
>
> The former will disable persistent write access to the cache.  It
> still allows accesses the cached data during the same process,
> though.  We so far deemed that textconv cache, when the user enables
> it, is valuable enough to make persistent.  Perhaps remerge-diff's
> tentative merge result may fall into the same category?  Some folks
> may want to cache, others may not.
>
> If we were to use the same notes-cache mechanism, we record the tree
> object (perhaps the object name) as the cached value for the merge
> commit in question.  Hopefully most of the merges are clean merges,
> and "caching" the results of the recreation of the merge would cost
> almost nothing.  We need objects to record the fact that "this merge
> has cached result" in the notes-cache, but the tree object that
> represents the cached result is already in the history the merge
> belongs to.

I'm not sure this performance caching for remerge-diff makes any
sense, for multiple reasons:

1) Does the cache become invalidated when the merge algorithm changes
or when config options change (--remerge-diff shows the difference
between what an automatic merge would do, but an automatic merge's
results depend on the current algorithm and various merge.*, diff.*,
etc. config options the user can set).  Do we re-cache each variant?
Do we expire the cache at some point?

2) The performance aspect seems suspect at best, and likely to backfire.  Badly:

2a) Let's say there's 100,000 merge commits in your history (so small
history roughly the size of linux.git).  `git log --remerge-diff` thus
repeats 100,000 merges, creating loose objects for them.  Perhaps only
1 in 10 merges needs any loose objects created (because the rest of
the merges were clean).  But those that do have conflicts will need to
create both new blobs and any necessary new trees that use those new
blobs.  Who knows how many blobs and trees are needed, but let's just
use 10 total blobs+trees per merge that has conflicts as a guesstimate
of the average.  Thus `git log --remerge-diff` will need to create
100,000 * 1/10 * 10 = 100,000 loose objects while it runs.  Preserving
all those files to cache the results slows things down
considerably...until they are gc'ed.  If we do the notes-cache thing,
then yes these objects would become packed.  Perhaps you're suggested
doing an automatic gc (or several) as part of running --remerge-diff,
but that's super slow itself.  When I implemented that, it felt worse
to me than just letting loose objects pile up.

2b) Note that my implementation didn't just clear out the tmp-objdir
at the _end_ of the remerge-diff (which was the easiest to implement),
but cleared it out after each merge, because that was much faster with
as many objects as can accumulate.

2c) Is --remerge-diff slow enough to even merit the complicated effort
of caching results?  --remerge-diff is faster than --cc in my testing,
so why is it even worth the pain to cache?

So with that out of the way, let's return to discussing the textconv
cache.  If the remerge-diff results aren't cached, isn't it unsafe to
allow the textconv cache to persist anything while remerge-diff is
running because it could create corruption?  We could still let the
textconv cache persist results when remerge-diff is not in use, but I
think we _want_ anything written by the textconv cache during
remerge-diff to be thrown away because there's a risk it references a
temporary object created by remerge-diff that will be deleted.

With all that in mind, to me, it seems like using the tmp-objdir as
the primary object store in combination with the refs quarantine is
actually the safest solution for this usecase.  Am I missing
something?