Re: [PATCH] refs: honor reference-transaction semantics when deleting refs

Patrick Steinhardt <ps@xxxxxx> · Mon, 9 May 2022 15:48:52 +0200

On Mon, May 09, 2022 at 12:18:51PM +0200, Michael Heemskerk wrote:
> On Mon, May 9, 2022 at 8:23 AM Patrick Steinhardt <ps@xxxxxx> wrote:
> >
> > I really like these changes given that they simplify things, but I
> > wonder whether we can do them. In the preimage we're eagerly removing
> > loose refs: any error encountered when deleting a reference is recorded,
> > but we keep on trying to remove the other refs, as well. With the new
> > behaviour we now create a single transaction for all refs and try to
> > commit it. This also means that we'll abort the transaction when locking
> > any of the refs fails, which is a change in behaviour.
> >
> > The current behaviour is explicitly documented in `refs.h:refs_delete_refs()`:
> >
> >     /*
> >      * Delete the specified references. If there are any problems, emit
> >      * errors but attempt to keep going (i.e., the deletes are not done in
> >      * an all-or-nothing transaction). msg and flags are passed through to
> >      * ref_transaction_delete().
> >      */
> >     int refs_delete_refs(struct ref_store *refs, const char *msg,
> >                          struct string_list *refnames, unsigned int flags);
> >
> > There are multiple callsites of this function via `delete_refs()`. Now
> > honestly, most of these callsites look somewhat broken:
> >
> >     - `bisect.c` simply does its best to clean up bisect state. This
> >       usecase looks fine to me.
> >
> >     - `builtin/branch.c` reports the branches as deleted even if
> >       `delete_refs()` failed.
> >
> >     - `builtin/remote.c` also misreports the deleted branches for the
> >       `prune` verb. The `rm` verb looks alright: if deletion of any
> >       branch failed then it doesn't prune the remote's config in the end
> >       and reports an error.
> >
> >     - `builtin/fetch.c` also misreports deleted branches with `--prune`.
> >
> > So most of these commands incorrectly handle the case where only a
> > subset of branches has been deleted. This raises the question whether
> > the interface provided by `refs_delete_refs()` is actually sensible if
> > it's so easy to get wrong. It doesn't even report which branches could
> > be removed and which couldn't. Furthermore, the question is whether new
> > backends like the reftable backend which write all refs into a single
> > slice would actually even be in a position to efficiently retain
> > semantics of this function.
> >
> > I'm torn. There are valid usecases for eagerly deleting refs even if a
> > subset of deletions failed, making this change a tough sell, but most of
> > the callsites don't actually handle this correctly in the first place.
> 
> Exactly; this is the reason that I initially suggested fixing the issue by
> just removing the upfront rewrite of packed-refs. With that rewrite removed,
> the refs-to-be-deleted are deleted in individual transactions, which may or
> may not rewrite packed-refs. The downside, as you correctly pointed out,
> is that we could end up rewriting packed-refs multiple times, which could
> come at a significant performance penalty for repositories with large
> packed-refs files.
> 
> Unfortunately, the current approach of updating packed-refs in one
> transaction and updating the loose refs in individual transactions doesn't
> work either.
> 
> So what are our options?
> 
> - delete each of the refs in a separate transaction and pay a (potentially
>   significant) performance penalty in repositories with large packed-refs
>   files when deleting many refs. I'll note that this scenario is similar to
>   deleting a set of refs through a non-atomic push.

By chance I know that on gitlab.com we've had huge performance issues
with access patterns like this. In some cases with repos that have
millions of refs I have seen that certain actions were completely
dominated by rewriting the packed-refs file. So I'd rather avoid going
there given that it would be a serious performance regression.

> - switch to a single transaction and update refs.h:refs_delete_refs to use
>   an all-or-nothing approach (the approach I've taken in my patch).

This is the easiest approach, but also backwards incompatible as I've
layed out above. I personally wouldn't mind, and as said I think that
most of the usecases are broken anyway because of implementations which
misreport the case where we only partially deleted refs. But it's not a
decision we can make without more discussion, I guess.

> - improve the reference-transaction mechanism to support the
>   'batch-of-transactions' mode more efficiently. If I remember correctly,
>   something like that has been suggested before, but I'm not sure if it's
>   actually been built or spiked. In this batch-of-transactions mode, git
>   could try to prepare all refs, and only invoke the hook for the refs that
>   could be successfully prepared. The hook should then be able to
>   reject individual ref updates, and git would then apply only the
>   non-rejected ref updates. While such a change would make many
>   scenarios where multiple refs are being updated more efficient, it's also
>   a much bigger change that's hard to make without breaking the current
>   reference-transaction protocol.

Peff and I had such a discussion in the past, and having transactions
with eager semantics would also fix some edge cases we have in other
parts of the codebase. You already mentioned git-fetch(1) without
`--atomic`, and there are likely others that could benefit.

I did have a go at this several times already, but none of the
approaches I took resulted in a clean-to-use API. I consider this to be
the best solution, but also the hardest one to implement, unfortunately.

> Sticking to a transaction per ref and rewriting packed-refs multiple times
> is the safer option. Deleting the refs in a single transaction is the more
> performant option, but changes the behavior. A stay/stale lock file could
> then make it impossible to remove a remote, or to prune /refs/remotes/
> refs.
> 
> My suggestion would be to stick to a transaction per ref and pay the
> same performance penalty as you'd get when deleting many refs through
> a non-atomic push.

I don't think we should do this. The one-transaction-per-ref issue is
why I added the `--atomic` flag to git-fetch(1) in the first place,
because it has been biting us in repos with millions of refs. Quoting
583bc41923 (fetch: make `--atomic` flag cover pruning of refs,
2022-02-17), which introduces this flag:

    Benchmark 1: git fetch --prune --atomic +refs/*:refs/* (HEAD~)
      Time (mean ± σ):      2.366 s ±  0.021 s    [User: 0.858 s, System: 1.508 s]
      Range (min … max):    2.328 s …  2.407 s    10 runs

    Benchmark 2: git fetch --prune --atomic +refs/*:refs/* (HEAD)
      Time (mean ± σ):      1.369 s ±  0.017 s    [User: 0.715 s, System: 0.641 s]
      Range (min … max):    1.346 s …  1.400 s    10 runs

    Summary
      'git fetch --prune --atomic +refs/*:refs/* (HEAD)' ran
        1.73 ± 0.03 times faster than 'git fetch --prune --atomic +refs/*:refs/* (HEAD~)'

And this is only with a 100k references. The issue gets a lot worse when
you're in the millions of refs, both because you have 10x more refs to
prune, but also because the packed-refs file is 10x larger. So
performance doesn't scale linearly with the number of refs, but is a
product of file size and number of refs.

[snip]
> > One thing that worries me is that these patches kind of set the current
> > behaviour of driving the reftx hook via both packed and loose backend
> > into stone. My patch series that got reverted is going to change that
> > behaviour though so that we don't execute the hook from the packed
> > backend, and consequentially we'd have to change all these tests again
> > to match the new behaviour. This makes it a lot harder to argue though
> > that we can safely switch to the new behaviour without breaking any
> > assumptions when we even codified our current assumptions into tests.
> 
> The counter argument to that is that it's kind of scary if you could remove
>  half of the reference-transaction callbacks without needing to update a
> test. I'd rather have tests that verify current behavior that you need to
> update when you intentionally change the behavior, then not have those
> tests?

Oh, I certainly agree there. And we have stated in the past that the
reftx hook is _not_ providing a stable interface with regards to exactly
when it is called. We only want to guarantee that it is called on a
transaction.

> > Taking a step back I wonder whether my previous approach to just hide
> > the hook for the packed backend was the right thing to do though. An
> > alternative would be to instead expose additional information to the
> > hook so that it can decide by itself whether it wants to execute the
> > hook or not. This could e.g. trivially be done by exposing a new
> > "GIT_REFS_BACKEND" environment variable to the reftx hook that gets set
> > to either "packed-refs", "loose-refs" or "reftables" depending on which
> > backend is currently in use. Equipped with this information a hook
> > script can then easily ignore any updates to the packed-refs file by
> > itself without having to change the way it is invoked right now and thus
> > we wouldn't regress any currently existing hooks.
> 
> From the reference-transaction hook writer's perspective, the backend
> involved is an implementation detail that the hook should not have to
> care about. Getting separate callbacks for the loose and the packed
> backends makes it a lot harder to write a good reference-transaction
> hook, especially when the callbacks differ if a ref is packed or not.
>
> IMO, there should really be a "files" backend transaction, that internally
> takes care of locking individual refs and possibly "packed-refs" in case
> of a deletion. In addition, not getting "artificial" packed callbacks also
> saves us a few extra reference-transaction callbacks when deleting refs.
> Even if those take only a few ms per invocation, when deleting hundreds
> of refs, it's still something that we'd like to avoid if we can.

True. In this case it's just an alternative way to tackle this specific
issue given that one wouldn't typically care about the packed-refs
changes. And it's very much an artifact of the files-backend really
being two backends in one. In any case I can see arguments for both
approaches, and ultimately I agree that it would be best if the
files-backend behaved as if it was a single one.

> > >  test_done
> > > diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> > > index 4620f0ca7fa..8b09a99c2e8 100755
> > > --- a/t/t5510-fetch.sh
> > > +++ b/t/t5510-fetch.sh
> > > @@ -169,10 +169,10 @@ test_expect_success REFFILES 'fetch --prune fails to delete branches' '
> > >       git clone . prune-fail &&
> > >       cd prune-fail &&
> > >       git update-ref refs/remotes/origin/extrabranch main &&
> > > -     : this will prevent --prune from locking packed-refs for deleting refs, but adding loose refs still succeeds  &&
> > > -     >.git/packed-refs.new &&
> > > +     : this will prevent --prune from locking refs/remotes/origin/extra for deletion &&
> > > +     >.git/refs/remotes/origin/extrabranch.lock &&
> > >
> > > -     test_must_fail git fetch --prune origin
> > > +     test_must_fail git fetch --prune origin > outputs 2> errors
> >
> > It would be nice to have an explanation why exactly this change is
> > needed, and why it is fine that the visible behaviour changes here.
> 
> The  test was relying on the fact that the packed-refs file was locked when
> git fetch --prune is called. My patch replaces that unconditional lock with
> a transaction. The transaction only takes out the lock if packed-refs
> actually needs to be updated, and since the ref being pruned only exists
> as a loose ref, git fetch --prune failed to fail.
> 
> I've replaced the lock on packed-refs with a lock on the loose ref that
> should be pruned.

Ah, thanks. I was missing the fact that the lock for the packed-refs
file was taken unconditionally before.

Patrick
Attachment:
signature.asc

Description: PGP signature