Re: [RFC] fetch: update refs in a single transaction

On Wed, Dec 08, 2021 at 09:15:01AM +0100, Christian Couder wrote:
> On Tue, Dec 7, 2021 at 11:11 PM Patrick Steinhardt <ps@xxxxxx> wrote:
> >
> > When git-fetch(1) isn't called with the `--atomic` flag, then each
> > reference will be updated in a separate transaction. As a result, even
> > if we fail to apply a subset of the updates, the remaining refs will
> > still be modified after the command has finished. While this pattern is
> > reasonably efficient with the files backend where we'd lock and write
> > each file separately anyway, the upcoming reftable backend could handle
> > such an update a lot more efficiently if it was batched into a single
> > transaction given that it could then create a single new reftable slice
> > instead of creating one slice per updated ref. While this inefficiency
> > can be easily mitigated by using the `--atomic` flag, this flag cannot
> > be used in contexts where we want partial-update semantics.
> 
> So it seems to me that there are 3 possible behaviors/modes:
> 
>   - "atomic": a single transaction with all or nothing semantics,
> currently implemented with --atomic
>   - "partial-update": a single transaction with partial-update
> semantics, the new behavior this patch is implementing
>   - "non-atomic": many transactions (one per ref?), currently the
> default when --atomic isn't passed
> 
> Yeah, "partial-update" seems better than "non-atomic" when the
> reftable backend is used. But what happens when "partial-update" is
> used with the files backend?

Yeah, framing it as these three modes captures the gist of how it can
work. Currently, we only support "atomic" and "non-atomic", and this
patch proposes to replace "non-atomic" with "partial-update". I'd be
happy to instead implement it as a third mode, but if we can find a way
to "do the right thing" without having to introduce another option, then
that would be best. But given your comments below and my own concerns,
I'm not sure we can easily implement this without breaking existing
use cases.
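To make the two existing modes concrete, here is a small self-contained
sketch (all repo and branch names are made up for the demo; `git init -b`
needs v2.28 or newer and `git fetch --atomic` v2.31 or newer):

```shell
#!/bin/sh
# Sketch: one fast-forward update ('main') plus one rejected
# non-fast-forward update ('rewind'), fetched without and with --atomic.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

git init -q -b main up                  # needs git >= 2.28 for -b
git -C up config user.name t
git -C up config user.email t@example.com
git -C up commit -q --allow-empty -m c1
git -C up branch rewind
git -C up checkout -q rewind
git -C up commit -q --allow-empty -m c2
git -C up checkout -q main

# Seed two bare destinations with the current state of 'up'.
for d in nonatomic.git atomic.git; do
    git init -q --bare "$d"
    git -C "$d" fetch -q ../up 'refs/heads/*:refs/heads/*'
done

git -C up commit -q --allow-empty -m c3 # main: c1 -> c3, fast-forward
git -C up branch -f rewind main         # rewind: c2 -> c3, non-fast-forward

# Default mode: 'main' still updates although 'rewind' is rejected.
git -C nonatomic.git fetch -q ../up 'refs/heads/*:refs/heads/*' || true
# Atomic mode: the rejected 'rewind' aborts the whole transaction.
git -C atomic.git fetch -q --atomic ../up 'refs/heads/*:refs/heads/*' || true

up_main=$(git -C up rev-parse main)
non_main=$(git -C nonatomic.git rev-parse main)
atm_main=$(git -C atomic.git rev-parse main)
test "$non_main" = "$up_main"           # partial update happened
test "$atm_main" != "$up_main"          # all-or-nothing: nothing moved
echo "non-atomic updated main; --atomic updated nothing"
```

A "partial-update" mode would keep the first outcome (main moves, rewind
is rejected) while performing both updates inside one transaction.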

> > Convert git-fetch(1) to always use a single reference transaction,
> > regardless of whether it is called with `--atomic` or not. The only
> > remaining difference between both modes is that with `--atomic` set,
> > we'd abort the transaction in case at least one reference cannot be
> > updated.
> 
> Nit: I would say "as soon as one reference cannot be updated"
> 
> > Note that this slightly changes semantics of git-fetch(1): if we hit any
> > unexpected errors like the reference update racing with another process,
> > then we'll now fail to update any references, too.
> 
> So that's one difference between the "partial-update" and the
> "non-atomic" modes, but what you say doesn't really make me confident
> that it's the only one.

There can certainly be other edge cases which break, and I'm sure others
will be able to pinpoint exactly which these are.

> Also are there people who are in cases where a lot of reference
> updates are racing, and where it's important that in such cases at
> least some ref updates succeed?
>
> If yes, then maybe the 3 modes could be kept and options for
> "partial-update" and "non-atomic" could be added. "partial-update"
> could be the default in case the reftable backend is used, while
> "non-atomic" would still be the default when the files backend is
> used.

I bet there are.

> > Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
> > ---
> >  builtin/fetch.c | 78 ++++++++++++++++---------------------------------
> >  1 file changed, 25 insertions(+), 53 deletions(-)
> >
> > Hi,
> >
> > until now, we have been quite lenient with creating lots of reference
> > transactions even though they could've been bundled together into a
> > single transaction. After all, it didn't have much of a downside in most
> > contexts with the files backend: we'd have to lock each loose ref
> > separately anyway. I'd like to get some feedback on changing our stance
> > here, due to multiple reasons:
> >
> >     - The upcoming reftable backend will be more efficient if we use a
> >       single transaction to bundle together multiple ref updates given
> >       that it only needs to write one new slice instead of one per
> >       update.
> >
> >     - Syncing refs to disk can be batched more efficiently if we bundle
> >       ref updates. See my initial patch series to implement fsync-ing
> >       refs [1] and Peff's benchmarks [2] demonstrating that fetches may
> >       become a lot slower.
> 
> Maybe it's explained in the pointers, but is this in the case of the
> files backend or the reftable backend or both?

The discussion in [1] was in the case of the files backend: if we want
to guarantee that we never write corrupt refs to disk, then we must sync
written data to disk before moving refs into place. If using multiple
transactions, then we thus need to fsync(3P) in each commit. If we're
using a single transaction though, we can amortize the costs and batch
this, where we only fsync(3P) once after all references have been
prepared.

The same is effectively true for the reftable backend though: if we want
to ensure that reftable files are safe, then we'd have to sync them to
disk, as well.

> >     - The reference-transaction hook can be used more efficiently given
> >       that it would only need to execute twice, instead of twice per
> >       updated ref.
> 
> Yeah, so that could be a regression for people who expect it to run
> twice per updated ref.

The reference-transaction hook was always a mechanism which hooks into
internals. As such, how often the hook executes and in which order was
never guaranteed and is expected to change when implementation details
change. So this is not something I'd ever call a regression: the whole
intent of the hook is to be able to see what happens internally, not to
assert semantics or the order in which updates are performed for a
specific Git command. That's kind of the trade-off of using a low-level
hook like this one.

> >       It also has a more global view of what's happening.
> >       While this is a concern of mine, it's a non-reason in the context
> >       of the Git project given that we really ought not to change
> >       semantics only to appease this hook.
> 
> Not sure I understand what you are saying here.

Previously you saw `2*len(refs)` executions of the hook, so you'd have
to piece together the puzzle to see what happens on a more global level
yourself. With the change of using a single transaction, you'd get only
two executions of the hook, where each execution has the complete
picture of all refs which are about to be updated.
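Concretely (made-up paths and ref names; the reference-transaction hook
exists since v2.28): a hook that logs its state argument and the number
of queued updates it reads on stdin shows 2*len(refs) executions for
separate transactions versus 2 for one batched transaction:

```shell
#!/bin/sh
# Sketch: count reference-transaction hook executions for two ref
# updates done as separate transactions vs. one batched transaction.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

git init -q repo
git -C repo config user.name t
git -C repo config user.email t@example.com
git -C repo commit -q --allow-empty -m c1
head=$(git -C repo rev-parse HEAD)

# Hook: append "<state> <number of queued updates>" per execution.
mkdir -p repo/.git/hooks
cat >repo/.git/hooks/reference-transaction <<EOF
#!/bin/sh
echo "\$1 \$(wc -l | tr -d ' ')" >>"$tmp/hook.log"
EOF
chmod +x repo/.git/hooks/reference-transaction

# Two separate transactions: 2 executions (prepared, committed) each.
git -C repo update-ref refs/heads/one "$head"
git -C repo update-ref refs/heads/two "$head"
separate=$(wc -l <"$tmp/hook.log" | tr -d ' ')

# One batched transaction covering both refs: 2 executions total,
# each seeing both queued updates on stdin.
printf 'update refs/heads/three %s\nupdate refs/heads/four %s\n' \
    "$head" "$head" | git -C repo update-ref --stdin >/dev/null
total=$(wc -l <"$tmp/hook.log" | tr -d ' ')
batched=$((total - separate))

echo "separate: $separate executions; batched: $batched executions"
```

The batched executions each see the full list of updates on stdin, which
is the "more global view" mentioned above.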

> > With these reasons in mind, I'm wondering whether we want to accept
> > refactorings which convert existing code to use batched reference
> > transactions. While the potential performance improvements are a rather
> > obvious upside, the downside is that it would definitely change the
> > failure mode in many cases.
> 
> Another downside is that it changes how and when reference-transaction
> hooks are called.

As mentioned above, this is not something I'd call a downside. We cannot ever
guarantee stable execution order of this hook, or otherwise we paint
ourselves into a corner.

> > The example I have here with git-fetch(1) would mean that if we were to
> > race with any other process or if any other unexpected error occurs
> > which leads us to die, then we'll not commit any change to disk. This
> > can be seen as an improvement in consistency, but it can also be seen as
> > a change which breaks current semantics of trying to do as much work as
> > possible.
> 
> I like the idea, but it seems a bit safer to me to have 3 different
> modes, so that people can test them in real life first for some time.
> Then we might later be more confident with changing the default for
> some backends. I might be worrying too much though, as very few people
> probably use reference-transaction hooks.

Fair, I'll wait a bit for more feedback and then probably post a v2 with
a third mode.

I think changing behaviour depending on which ref backend is in use
would set a bad precedent though. Ultimately, the user shouldn't notice
which backend is active, except that things would ideally be faster with
the new reftable backend. But slightly changing the semantics of
commands depending on the backend is only going to lead to confusion.

Patrick


