On Thu, Jan 14, 2021 at 11:53:39AM -0500, Taylor Blau wrote:

> On Wed, Jan 13, 2021 at 10:33:01PM -0800, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > > 'estimate_repack_memory()' takes into account the amount of memory
> > > required to load the reverse index in memory by multiplying the
> > > assumed number of objects by the size of the 'revindex_entry'
> > > struct.
> > >
> > > Prepare for hiding the definition of 'struct revindex_entry' by
> > > removing a 'sizeof()' of that type from outside of pack-revindex.c.
> > > Instead, guess that one off_t and one uint32_t are required per
> > > object. Strictly speaking, this is a worse guess than asking for
> > > 'sizeof(struct revindex_entry)' directly, since the true size of
> > > this struct is 16 bytes with padding on the end of the struct in
> > > order to align the offset field.
> >
> > Meaning that we under-estimate by 25%?
>
> In this area, yes. I'm skeptical that this estimate is all that
> important, since it doesn't seem to take into account the memory
> required to select delta/base candidates [1].

It has many other inaccuracies:

  - it assumes half of all objects are blobs, which is not really
    accurate (linux.git is more like 60% trees, 12% commits, 28%
    blobs). This under-estimates because blobs are the smallest struct
    (see the second sketch below).

  - since we moved a bunch of stuff out of "struct object_entry" into
    lazily-initialized auxiliary structures, we are under-counting the
    per-object cost when we have to spill into these structures.

So I'm rather skeptical that this number is close to accurate. But
since there's a bunch of leeway (we are looking to use half of the
system memory), I suspect it doesn't matter all that much. And I
definitely don't think it's worth trying to micro-optimize its
accuracy.

-Peff
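
To make the 25% exchange above concrete, here is a minimal sketch. It
assumes a typical LP64 ABI where off_t is 8 bytes; the struct is a
stand-in with the same shape as git's 'struct revindex_entry', not the
actual definition:

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/types.h>

  /*
   * Stand-in with the same shape as git's 'struct revindex_entry':
   * one off_t and one 32-bit counter. On LP64, off_t is 8 bytes, so
   * the compiler adds 4 bytes of tail padding to keep 'offset'
   * 8-byte aligned in arrays of this struct.
   */
  struct revindex_entry {
          off_t offset;
          unsigned int nr;
  };

  int main(void)
  {
          size_t guess = sizeof(off_t) + sizeof(uint32_t); /* 12 */
          size_t real = sizeof(struct revindex_entry);     /* 16 */

          printf("guess=%zu real=%zu under-estimate=%.0f%%\n",
                 guess, real, 100.0 * (real - guess) / real);
          return 0;
  }

On such a system this prints "guess=12 real=16 under-estimate=25%",
which is where Junio's figure comes from: the per-object guess counts
the members but not the tail padding.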
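
And a rough illustration of the type-mix bias from the first bullet.
The struct sizes here are invented stand-ins chosen only so that blobs
are the smallest type (git's real 'struct blob/tree/commit' sizes
differ), so the magnitude of the gap is illustrative, not exact:

  #include <stddef.h>
  #include <stdio.h>

  /* Illustrative sizes only; not git's actual object structs. */
  struct blob   { char pad[40]; };
  struct tree   { char pad[56]; };
  struct commit { char pad[72]; };

  /* The heuristic under discussion: assume half of all objects are
   * blobs and split the remainder between trees and commits. */
  static size_t assumed(size_t nr)
  {
          return sizeof(struct blob)   * (nr / 2) +
                 sizeof(struct tree)   * (nr / 4) +
                 sizeof(struct commit) * (nr / 4);
  }

  /* The linux.git mix cited above: ~28% blobs, ~60% trees,
   * ~12% commits. */
  static size_t actual(size_t nr)
  {
          return sizeof(struct blob)   * (nr * 28 / 100) +
                 sizeof(struct tree)   * (nr * 60 / 100) +
                 sizeof(struct commit) * (nr * 12 / 100);
  }

  int main(void)
  {
          size_t nr = 1000000;
          printf("assumed=%zu actual=%zu\n", assumed(nr), actual(nr));
          return 0;
  }

With these made-up sizes the assumed mix comes in low (52000000 vs
53440000 bytes) because it over-weights the smallest struct; how far
off the real estimate is depends entirely on the actual struct sizes
and object mix.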