Re: Distinguishing FF vs non-FF updates in the reflog?

Han-Wen Nienhuys <hanwen@xxxxxxxxxx> · Mon, 22 Mar 2021 15:40:46 +0100

On Thu, Mar 18, 2021 at 8:47 PM Jeff King <peff@xxxxxxxx> wrote:
> > I'm working on some extensions to Gerrit for which it would be very
> > beneficial if we could tell from the reflog if an update is a
> > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > were only FF updates since, we can be sure that the SHA1 is reachable
> > from the branch, without having to open packfiles and decode commits.
>
> I left some numbers in another part of the thread, but IMHO performance
> isn't that compelling a reason to do this these days, if you are using
> commit-graphs.
>
> Just walking the reflog might be _slightly_ faster, though not
> necessarily (it depends on whether the depth of the object graph or the
> depth of the reflog chain is deeper). It might matter more if you are
> using a more exotic storage scheme, where switching from accessing
> reflogs to objects implies extra round-trips to a server (e.g., custom
> storage backends with JGit; I don't know the state of the art in what
> Google is doing there).

JGit doesn't currently support commit-graph, so it's hard to predict
what performance will be like, but isn't commit-graph keyed by
SHA1? That makes it hard to do caching, especially when considering
large repositories.

AFAIU, commit-graph would help speed up reachability checks by being
able to shortcut cases where the generation number proves that one
commit cannot be an ancestor of the other, but you still have to do a
revwalk to conclusively prove reachability.
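
To make that shortcut concrete, here is a minimal Python sketch. It
assumes a toy in-memory history (a dict from commit id to parent ids)
rather than the real commit-graph file format, and all names in it are
made up for illustration. The number stored per commit (the generation
number) is taken to be one more than the largest generation among its
parents, so an ancestor always has a strictly smaller generation than
any of its descendants.

```python
from functools import lru_cache

# Toy history: commit id -> parent ids (hypothetical data, not a real repo).
PARENTS = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],        # sibling of c
    "e": ["c", "d"],   # merge of c and d
    "f": ["a"],        # side branch that never gets merged
}


@lru_cache(maxsize=None)
def generation(commit):
    """Generation number: 1 for roots, else 1 + max over the parents."""
    return 1 + max((generation(p) for p in PARENTS[commit]), default=0)


def is_ancestor(ancestor, descendant):
    """Return (answer, number of commits visited by the walk)."""
    if ancestor == descendant:
        return True, 0
    # Shortcut: an ancestor must have a strictly smaller generation,
    # so this case is answered without touching any commit data.
    if generation(ancestor) >= generation(descendant):
        return False, 0
    # Otherwise the numbers prove nothing; a walk is still required.
    floor = generation(ancestor)
    seen, stack, visited = set(), [descendant], 0
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        visited += 1
        if commit == ancestor:
            return True, visited
        # Commits below the ancestor's generation cannot lead to it.
        stack.extend(p for p in PARENTS[commit] if generation(p) >= floor)
    return False, visited
```

In this sketch is_ancestor("c", "d") is answered by the shortcut alone,
while is_ancestor("f", "e") has to walk several commits before it can
say no. With our storage, each visited commit is a potential fault-in.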

In our storage system, the revwalk runs on top of packfile data that
must be faulted-in (slow!) from datacenter-wide storage. It's made
worse because we don't support midx yet.

The application that I'm thinking of is a way for automation to
deal with lagging replicas. This could be done by specifying a

  X-Need-GitRef: $repositoryname~$refname~$SHA1

header on Gerrit requests, which specifies that the given $SHA1 must
have been in a recent ref update, and be reachable from $refname. The
reflog has this information organized in a form that is very well
suited to answering these questions quickly (assuming the reflog is
annotated such that we can distinguish FF and non-FF updates).
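
A rough Python sketch of the check, under stated assumptions: the
fast_forward flag on each entry is the hypothetical annotation being
discussed (nothing in Git records it today), entries are ordered newest
first, and the class and function names are invented for illustration.
The header is split on '~', which works because '~' is not allowed in
ref names.

```python
from dataclasses import dataclass
from typing import List, NamedTuple


class NeedGitRef(NamedTuple):
    repository: str
    refname: str
    sha1: str


def parse_need_gitref(value: str) -> NeedGitRef:
    """Split '$repositoryname~$refname~$SHA1' from the right."""
    repository, refname, sha1 = value.rsplit("~", 2)
    return NeedGitRef(repository, refname, sha1)


@dataclass
class ReflogEntry:
    old: str
    new: str
    fast_forward: bool  # hypothetical annotation, not in today's reflog


def proven_reachable(entries: List[ReflogEntry], sha1: str,
                     max_entries: int = 1000) -> bool:
    """True if the reflog alone proves sha1 is reachable from the ref.

    False only means "not proven"; the caller would then fall back to
    a normal revwalk (or report that the replica is still lagging).
    """
    for entry in entries[:max_entries]:  # newest first
        if entry.new == sha1:
            # Every newer update was a fast-forward, so the current tip
            # descends from this value.
            return True
        if not entry.fast_forward:
            # History was rewritten here; older values may be orphaned.
            return False
    return False
```

The max_entries bound stands in for "recent ref update"; the right
limit would depend on how far replicas are allowed to lag.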

> > Does this make sense, and if yes is it worth proposing a change?
>
> At GitHub we do something similar. We don't generally use reflogs much
> at all, but we keep a custom "audit log": a single append-only file that
> records every ref update in the repository. And its format just happens
> to be one reflog entry per line, prefixed by the updated ref.

The interest of having a standard/convention in Git would be to not
require reftable for folks that want to use this feature.

> And there we do generally annotate the FF-ness of an update by stuffing
> it into the free-form message field (in fact, we shove in a small JSON
> object, so we record multiple fields like the pushing id, IP, etc).
>
> But the main goal there isn't performance (and in fact we don't
> generally consult it for anything outside of debugging). The reason we
> record FF-ness is for later debugging or analysis. We don't prune from
> the audit log, and we don't consider it for reachability when we prune
> objects (since otherwise you'd never be able to prune anything!). So the
> objects sometimes aren't available later to compute, but we still want
> to know if the user did a force-push, etc.

We store reflogs in a global database table, which has this kind of
information, but the Google-specific format is harder to make work
with Gerrit, which is open source.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado