Re: [PATCH v3 3/4] revision: avoid loading object headers multiple times

Patrick Steinhardt <ps@xxxxxx> · Fri, 6 Aug 2021 16:17:08 +0200

On Tue, Aug 03, 2021 at 11:07:29AM +0200, Patrick Steinhardt wrote:
> On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@xxxxxx> writes:
> > 
> > > When loading references, we try to optimize loading of commits by using
> > > the commit graph. To do so, we first need to determine whether the
> > > object actually is a commit or not, which is why we always execute
> > > `oid_object_info()` first. Like this, we'll unpack the object header of
> > > each object first.
> > >
> > > This pattern can be quite inefficient in case many references point to
> > > the same commit: if the object didn't end up in the cached objects, then
> > > we'll repeatedly unpack the same object header, even if we've already
> > > seen the object before.
> > > ...
> > > Assuming that in almost all repositories, most references will point to
> > > either a tag or a commit, we'd have a modest increase in memory
> > > consumption of about 12.5% here.
> > 
> > I wonder if we can also say almost all repositories, the majority of
> > refs point at the same object.  If that holds, this would certainly
> > be a win, but otherwise, it is not so clear.
> 
> I doubt that's the case in general. I rather assume that it's typically
> going to be a smallish subset that points to the same commit, but for
> these cases we at least avoid doing the lookup multiple times. As I
> said, it's definitely a tradeoff between memory and performance: in the
> worst case (all references point to different blobs) we allocate 33%
> more memory without having any speedups. A more realistic scenario would
> probably be something like a trunk-based development repo, where there's
> a single branch only and the rest is tags. There we'd allocate 11% more
> memory without any speedups. In general, it's going to be various shades
> of gray, where we allocate something from 0% to 11% more memory while
> getting some modest speedups in some cases.
> 
> So if we only inspect this commit as a standalone it's definitely
> debatable whether we'd want to take it or not. But one important thing
> is that it's a prerequisite for patch 4/4: in order to not parse commits
> in case they're part of the commit-graph, we need to first obtain an
> object such that we can fill it in via the graph. So we have to call
> `lookup_unknown_object()` anyway. Might be sensible to document this as
> part of the commit message.

Scratch that: we can just rewrite this to use `lookup_object()` to check
whether we already know the object, and then we can rewrite the
`parse_commit_in_graph_gently()` to be `lookup_commit_in_graph()`, which
does a `lookup_commit()` to create the object in case the OID was found
in the graph. That's also got nicer calling semantics..

It's negligibly slower (3.021s instead of 2.983s), but it doesn't need
me arguing about the performance/memory tradeoff. Which is a good thing
I guess: there will always be that one repo that's completely different
and where such assumptions don't hold.

Patrick
Attachment:
signature.asc

Description: PGP signature