Re: [PATCH v3 3/4] revision: avoid loading object headers multiple times

Patrick Steinhardt <ps@xxxxxx> · Tue, 3 Aug 2021 11:07:29 +0200

On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@xxxxxx> writes:
> 
> > When loading references, we try to optimize loading of commits by using
> > the commit graph. To do so, we first need to determine whether the
> > object actually is a commit or not, which is why we always execute
> > `oid_object_info()` first. Like this, we'll unpack the object header of
> > each object first.
> >
> > This pattern can be quite inefficient in case many references point to
> > the same commit: if the object didn't end up in the cached objects, then
> > we'll repeatedly unpack the same object header, even if we've already
> > seen the object before.
> > ...
> > Assuming that in almost all repositories, most references will point to
> > either a tag or a commit, we'd have a modest increase in memory
> > consumption of about 12.5% here.
> 
> I wonder if we can also say almost all repositories, the majority of
> refs point at the same object.  If that holds, this would certainly
> be a win, but otherwise, it is not so clear.

I doubt that's the case in general. I rather assume that it's typically
going to be a smallish subset that points to the same commit, but for
these cases we at least avoid doing the lookup multiple times. As I
said, it's definitely a tradeoff between memory and performance: in the
worst case (all references point to different blobs) we allocate 33%
more memory without having any speedups. A more realistic scenario would
probably be something like a trunk-based development repo, where there's
a single branch only and the rest is tags. There we'd allocate 11% more
memory without any speedups. In general, it's going to be various shades
of gray, where we allocate something from 0% to 11% more memory while
getting some modest speedups in some cases.

So if we only inspect this commit as a standalone it's definitely
debatable whether we'd want to take it or not. But one important thing
is that it's a prerequisite for patch 4/4: in order to not parse commits
in case they're part of the commit-graph, we need to first obtain an
object such that we can fill it in via the graph. So we have to call
`lookup_unknown_object()` anyway. Might be sensible to document this as
part of the commit message.

Patrick
Attachment:
signature.asc

Description: PGP signature