Re: [PATCH] oidmap: map with OID as key

Jeff King <peff@xxxxxxxx> · Fri, 29 Sep 2017 15:26:25 -0400

On Fri, Sep 29, 2017 at 12:04:56PM -0700, Jonathan Tan wrote:

> > So depending how you count it, we're wasting between 28% (sha1 and no
> > extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> > it's probably not breaking the bank.
> 
> Hmm...how did you get the 16% figure? The values, as I see it, are:
>  - 32 for the sha256 hash itself
>  - 8 for the "next" chain pointer
>  - 8 for the padded hash
>  - 8 for the "util" pointer
> 
> For an oidset, the padded hash and the "util" pointer are wasted, which is
> 16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then
> 16/64=0.25.)

Sorry to be unclear. I was just counting the waste for the "util"
pointer, not the extra padded hash bit (which AFAIK is already wasted in
oidset). So I computed 48 bytes without the util pointer, which means we
waste an additional 16% to add it.

Anyway, my point was mostly to say that this is a fractional percentage
of the total memory. So it falls into the category of "this might help
in tight situations" and less "this will blow up in our faces".

> In a 100-million-object monorepo, we will probably end up only operating on the
> "frontier" objects (objects that we do not have but we know about because some
> object we have reference them) at the worst. I don't have numbers but I think
> that that is at least an order of magnitude less than 100M.

Agreed.

> > So I think we may be better off going with the solution here that's
> > simpler and requires introducing less code. If it does turn out to be a
> > memory problem in the future, this is a _really_ easy thing to optimize
> > after the fact, because we have these nice abstractions.
> 
> Optimizing away the padded hash should be straightforward. Optimizing away the
> "util" pointer after we have code that uses it is (slightly?) less
> straightforward, but still doable.

I was thinking here of just oidset. It has a limited API, so swapping
out the implementation for one that does not depend on oidmap and waste
the "util" pointer would be the "easy" part.

> I still lean towards saving memory by eliminating the padded hash and
> not using the "util" pointer, because the complication is contained
> within only two files (oidmap.{c,h}), and because the constant factors
> in memory savings might end up mattering. But I don't feel strongly
> about that - I think any of the oidmaps that we have discussed is an
> improvement over what we have now.

My main concern is not so much about complexity bleeding out of the
oidmap code, but that now we have two parallel near-identical hashmap
implementations.  When one gets an optimization, bugfix, or new feature,
we have to port it over manually to the other.

-Peff