Re: [PATCH] refs: honor reference-transaction semantics when deleting refs

Michael Heemskerk <mheemskerk@xxxxxxxxxxxxx> · Mon, 9 May 2022 12:18:51 +0200

On Mon, May 9, 2022 at 8:23 AM Patrick Steinhardt <ps@xxxxxx> wrote:
>
> I really like these changes given that they simplify things, but I
> wonder whether we can do them. In the preimage we're eagerly removing
> loose refs: any error encountered when deleting a reference is recorded,
> but we keep on trying to remove the other refs, as well. With the new
> behaviour we now create a single transaction for all refs and try to
> commit it. This also means that we'll abort the transaction when locking
> any of the refs fails, which is a change in behaviour.
>
> The current behaviour is explicitly documented in `refs.h:refs_delete_refs()`:
>
>     /*
>      * Delete the specified references. If there are any problems, emit
>      * errors but attempt to keep going (i.e., the deletes are not done in
>      * an all-or-nothing transaction). msg and flags are passed through to
>      * ref_transaction_delete().
>      */
>     int refs_delete_refs(struct ref_store *refs, const char *msg,
>                          struct string_list *refnames, unsigned int flags);
>
> There are multiple callsites of this function via `delete_refs()`. Now
> honestly, most of these callsites look somewhat broken:
>
>     - `bisect.c` simply does its best to clean up bisect state. This
>       usecase looks fine to me.
>
>     - `builtin/branch.c` reports the branches as deleted even if
>       `delete_refs()` failed.
>
>     - `builtin/remote.c` also misreports the deleted branches for the
>       `prune` verb. The `rm` verb looks alright: if deletion of any
>       branch failed then it doesn't prune the remote's config in the end
>       and reports an error.
>
>     - `builtin/fetch.c` also misreports deleted branches with `--prune`.
>
> So most of these commands incorrectly handle the case where only a
> subset of branches has been deleted. This raises the question whether
> the interface provided by `refs_delete_refs()` is actually sensible if
> it's so easy to get wrong. It doesn't even report which branches could
> be removed and which couldn't. Furthermore, the question is whether new
> backends like the reftable backend which write all refs into a single
> slice would actually even be in a position to efficiently retain
> semantics of this function.
>
> I'm torn. There are valid usecases for eagerly deleting refs even if a
> subset of deletions failed, making this change a tough sell, but most of
> the callsites don't actually handle this correctly in the first place.

Exactly; this is the reason that I initially suggested fixing the issue by
just removing the upfront rewrite of packed-refs. With that rewrite removed,
the refs-to-be-deleted are deleted in individual transactions, which may or
may not rewrite packed-refs. The downside, as you correctly pointed out,
is that we could end up rewriting packed-refs multiple times, which could
come at a significant performance penalty for repositories with large
packed-refs files.

Unfortunately, the current approach of updating packed-refs in one
transaction and updating the loose refs in individual transactions doesn't
work either.

So what are our options?

- delete each of the refs in a separate transaction and pay a (potentially
  significant) performance penalty in repositories with large packed-refs
  files when deleting many refs. I'll note that this scenario is similar to
  deleting a set of refs through a non-atomic push.
- switch to a single transaction and update refs.h:refs_delete_refs to use
  an all-or-nothing approach (the approach I've taken in my patch).
- improve the reference-transaction mechanism to support the
  'batch-of-transactions' mode more efficiently. If I remember correctly,
  something like that has been suggested before, but I'm not sure if it's
  actually been built or spiked. In this batch-of-transactions mode, git
  could try to prepare all refs, and only invoke the hook for the refs that
  could be successfully prepared. The hook should then be able to
  reject individual ref updates, and git would then apply only the
  non-rejected ref updates. While such a change would make many
  scenarios where multiple refs are being updated more efficient, it's also
  a much bigger change that's hard to make without breaking the current
  reference-transaction protocol.

Sticking to a transaction per ref and rewriting packed-refs multiple times
is the safer option. Deleting the refs in a single transaction is the more
performant option, but changes the behavior. A stay/stale lock file could
then make it impossible to remove a remote, or to prune /refs/remotes/
refs.

My suggestion would be to stick to a transaction per ref and pay the
same performance penalty as you'd get when deleting many refs through
a non-atomic push.

> > +test_expect_success 'hook aborts deleting loose ref in prepared state' '
> > +     test_when_finished "rm actual" &&
> > +     test_when_finished "git branch -d to-be-deleted" &&
> > +     git branch to-be-deleted $PRE_OID &&
> > +     test_hook reference-transaction <<-\EOF &&
> > +             echo "$*" >>actual
> > +             exit 1
> > +     EOF
> > +     cat >expect <<-EOF &&
> > +             aborted
> > +             prepared
> > +             aborted
> > +     EOF
>
> It's not really clear why we get the first "aborted" result here. I know
> that it is because we didn't queue up any refs to the packed backend,
> and thus we don't even try to prepare the transaction. But it's likely
> confusing to a reader and might thus warrant a comment. On the other
> hand this is going away anyway if and when my patch series lands again
> that stops calling the hook from the nested packed backend.

I can add a comment explaining why we get the 'aborted' callback
and also add a comment that that 'aborted' callback is expected to be
removed in an upcoming version.

> > +     test_must_fail git branch -d to-be-deleted &&
> > +     test_cmp expect actual &&
> > +     git rev-parse refs/heads/to-be-deleted
> > +'
> > +
> > +test_expect_success 'hook aborts deleting packed ref in prepared state' '
> > +     test_when_finished "rm actual" &&
> > +     test_when_finished "git branch -d to-be-deleted" &&
> > +     git branch to-be-deleted $PRE_OID &&
> > +     git pack-refs --all --prune &&
> > +     test_hook reference-transaction <<-\EOF &&
> > +             echo "$*" >>actual
> > +             exit 1
> > +     EOF
> > +     cat >expect <<-EOF &&
> > +             prepared
> > +             aborted
> > +     EOF
> > +     test_must_fail git branch -d to-be-deleted &&
> > +     test_cmp expect actual &&
> > +     git rev-parse refs/heads/to-be-deleted
> > +'
> > +
>
> From your description one more interesting case is when the packed-refs
> transaction is committed, but the loose-refs backend is aborted. It's a
> bit harder to test given that we have no way to indicate what backend
> the reftx hook is being invoked from though.

That is indeed the more interesting case, and exactly the case we saw
fail when the packed-refs reference-transactions callbacks had been
suppressed in 2.36.0-rc1 and -rc2. But as you said, they're pretty hard to
test. In Bitbucket's tests we now verify from the reference-transaction
callback that the ref-updates have not been committed (yet) when the
prepared or aborted callback is received, and that they _have_ been
committed when the committed callback is received. Perhaps a similar
approach could be used in git's reference-transaction tests for more
thorough testing.

> One thing that worries me is that these patches kind of set the current
> behaviour of driving the reftx hook via both packed and loose backend
> into stone. My patch series that got reverted is going to change that
> behaviour though so that we don't execute the hook from the packed
> backend, and consequentially we'd have to change all these tests again
> to match the new behaviour. This makes it a lot harder to argue though
> that we can safely switch to the new behaviour without breaking any
> assumptions when we even codified our current assumptions into tests.

The counter argument to that is that it's kind of scary if you could remove
 half of the reference-transaction callbacks without needing to update a
test. I'd rather have tests that verify current behavior that you need to
update when you intentionally change the behavior, then not have those
tests?

> Taking a step back I wonder whether my previous approach to just hide
> the hook for the packed backend was the right thing to do though. An
> alternative would be to instead expose additional information to the
> hook so that it can decide by itself whether it wants to execute the
> hook or not. This could e.g. trivially be done by exposing a new
> "GIT_REFS_BACKEND" environment variable to the reftx hook that gets set
> to either "packed-refs", "loose-refs" or "reftables" depending on which
> backend is currently in use. Equipped with this information a hook
> script can then easily ignore any updates to the packed-refs file by
> itself without having to change the way it is invoked right now and thus
> we wouldn't regress any currently existing hooks.

>From the reference-transaction hook writer's perspective, the backend
involved is an implementation detail that the hook should not have to
care about. Getting separate callbacks for the loose and the packed
backends makes it a lot harder to write a good reference-transaction
hook, especially when the callbacks differ if a ref is packed or not.

IMO, there should really be a "files" backend transaction, that internally
takes care of locking individual refs and possibly "packed-refs" in case
of a deletion. In addition, not getting "artificial" packed callbacks also
saves us a few extra reference-transaction callbacks when deleting refs.
Even if those take only a few ms per invocation, when deleting hundreds
of refs, it's still something that we'd like to avoid if we can.

> >  test_done
> > diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> > index 4620f0ca7fa..8b09a99c2e8 100755
> > --- a/t/t5510-fetch.sh
> > +++ b/t/t5510-fetch.sh
> > @@ -169,10 +169,10 @@ test_expect_success REFFILES 'fetch --prune fails to delete branches' '
> >       git clone . prune-fail &&
> >       cd prune-fail &&
> >       git update-ref refs/remotes/origin/extrabranch main &&
> > -     : this will prevent --prune from locking packed-refs for deleting refs, but adding loose refs still succeeds  &&
> > -     >.git/packed-refs.new &&
> > +     : this will prevent --prune from locking refs/remotes/origin/extra for deletion &&
> > +     >.git/refs/remotes/origin/extrabranch.lock &&
> >
> > -     test_must_fail git fetch --prune origin
> > +     test_must_fail git fetch --prune origin > outputs 2> errors
>
> It would be nice to have an explanation why exactly this change is
> needed, and why it is fine that the visible behaviour changes here.

The  test was relying on the fact that the packed-refs file was locked when
git fetch --prune is called. My patch replaces that unconditional lock with
a transaction. The transaction only takes out the lock if packed-refs
actually needs to be updated, and since the ref being pruned only exists
as a loose ref, git fetch --prune failed to fail.

I've replaced the lock on packed-refs with a lock on the loose ref that
should be pruned.

Michael