Re: [PATCH v2] refs: implement reference transaction hook

Taylor Blau <me@xxxxxxxxxxxx> · Mon, 15 Jun 2020 10:46:39 -0600

Hi Patrick,

Sorry for the slow response. I was out of the office last week and am
only just now getting a chance to catch up on the backlog of emails that
I missed.

On Thu, Jun 04, 2020 at 09:36:32AM +0200, Patrick Steinhardt wrote:
> On Wed, Jun 03, 2020 at 10:51:42AM -0600, Taylor Blau wrote:
> > Hi Patrick,
> >
> > On Wed, Jun 03, 2020 at 02:27:50PM +0200, Patrick Steinhardt wrote:
> > > In order to test the impact on the case where we don't have any
> > > "reference-transaction" hook installed in the repository, this commit
> > > introduces a new performance test for git-update-refs(1). Run against an
> > > empty repository, it produces the following results:
> > >
> > >   Test                                   HEAD~             HEAD
> > >   ------------------------------------------------------------------------------
> > >   1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
> > >   1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%
> > >
> > > So the overhead is around ~1.5%. Given that git-update-refs(1) is a
> > > near-direct wrapper around reference transactions, there likely is no
> > > other command that is impacted much worse than this.
> >
> > This is a serious performance regression that is worth considering. For
> > example, a 1.5% slow-down on average in reference transactions would
> > cause be enough for me to bisect the regression down to see what
> > changed.
> >
> > What are ways that this could be avoided? I bet that this would work
> > quite well with Emily's work on hooks, where we could check in the
> > configuration first whether a hook is even configured.
> >
> > Could this be integrated with that? If not, could you cache the result
> > of whether or not the hook exists, and/or implement some mechanism to
> > say something like, for eg., "only run reference transaction hooks
> > core.blah = 1" is true?
>
> I wasn't aware of her work until now, so I'll take a look.
>
> Meanwhile, I toyed around with performance and tried out two different
> caching mechanisms:
>
>     - map-cache: `find_hook()` uses a map of hook names mapped to their
>       resolved hook path (or `NULL` if none was found). This is a
>       generic mechanism working across all hooks, but has some overhead
>       in maintaining the map.
>
>     - reftx-cache: `run_transaction_hook()` caches the path returned by
>       `find_hook()`. It's got less overhead as it only caches the path,
>       but obviously only applies to the reference-transaction hook.
>
> In theory, we could go even further and cache the hook's file
> descriptor, executing it via fexecve(3P) whenever it's required, but I
> didn't go down that route yet. This would also solve the atomicity
> issue, though, if the administrator replaces the reference-transactions
> hook halfway through the transaction.
>
> Benchmarking results are mixed, mostly in the sense that I can choose
> which run of the benchmarks I take in order to have my own work look
> better or worse, as timings vary quite a lot between runs. Which
> probably hints at the fact that the benchmarks themselves aren't really
> reliable. The issue is that a single git-update-ref(1) run finishes so
> quick that it's hard to measure with our benchmarks, but spawning
> thousands of them to get something different than 0.0s very much depends
> on the operating system and thus fluctuates. On the other hand, if we
> only spawn a single git-update-refs and have it perform a few thousand
> ref updates, overhead of the hook will not show at all as it is only
> executed once per prepare/commit of the transaction.
>
> The following timings are for the case where we execute
>
>     git update-ref refs/heads/update-branch PRE POST &&
>     git update-ref refs/heads/update-branch POST PRE
>
> respectively
>
>     git update-ref refs/heads/new POST &&
>     git update-ref -d refs/heads/new
>
> a thousand times:
>
> Test                                   master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> ------------------------------------------------------------------------------------------------------------------------------
> 1400.2: update existing reference      1.96(1.50+0.53)   2.00(1.54+0.53) +2.0%   2.02(1.54+0.55) +3.1%   1.98(1.52+0.52) +1.0%
> 1400.3: create and destroy reference   1.74(1.33+0.49)   1.77(1.38+0.47) +1.7%   1.77(1.36+0.48) +1.7%   1.76(1.35+0.49) +1.1%

Huh. It is super interesting (to me, at least) that caching the set of
hooks that are on disk and should be run makes this *slower* than
without the cache. What's going on there? In p1400.2, I'd expect to see
'ref-hook-map-cache' attain at most a 2.0% slow-down, plus a little bit
to process the hook when it is there, but not much more than that.

I think that this just seems a little contrived to me. I can understand
why server administrators may want this feature, but the general
user-base of Git doesn't seem to stand to benefit much from this change
(in my own mind, at least).

So... I'm not sure where this leaves us. Maybe a 2.0% slow-down on an
already fast 'git update-ref' invocation for the average user won't be
noticeable. It certainly *will* be noticeable to administrators who
processes a much higher volume of such transactions.

> For such a short-lived program like git-update-refs(1), one can see that
> the overhead of using a map negatively impacts performance compared to
> the no-cache case. But using the reftx-cache roughly cuts the overhead
> in half as expected, as we only need to look up the hook once instead of
> twice.
>
> And here's the results if we use a single `git update-ref --stdin` with a
> thousand reference updates at once:
>
> Test                             master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> ------------------------------------------------------------------------------------------------------------------------
> 1400.2: git update-ref --stdin   0.21(0.09+0.12)   0.21(0.07+0.14) +0.0%   0.21(0.07+0.13) +0.0%   0.21(0.07+0.13) +0.0%
>
> As expected, there's nothing much to be seen here because there's only a
> single transaction for all ref updates and, as a result, at most two
> calls to `access(refhook_path, X_OK)`.

Better, but I have to imagine that real-world usage will look much more
like a thousand tiny transactions than one large one.

> Patrick

Thanks,
Taylor