Re: [PATCH] oidmap: map with OID as key

On Thu, Sep 28, 2017 at 10:46:16AM -0700, Jonathan Tan wrote:

> > To me it seems like a much simpler API for a map would be to just allow
> > callers to store a 'void *' as the value.
> 
> I agree that the API would be simpler.
> 
> My main motivation with this design is indeed to save memory, and not
> inconvenience the user too much (in the case where you're storing things
> larger than one pointer, you just need to remember to put the special
> struct at the beginning of your struct), but if memory is not so
> important, I agree that we can switch to the "util" design.

When I saw that you were implementing "oidset" in terms of "oidmap", I
was all ready to be crabby about this extra memory. But then I saw that
the implementation tries hard not to waste any memory. :)

All of which is to say I gave this some thought when I was in the "ready
to be crabby" phase, and came to the conclusion that it probably isn't
that painful. An unused pointer is 8 bytes per entry. We're already
spending 20 for the oid itself (which is likely to grow to 32
eventually), plus 8 for the chained "next" pointer. Plus potentially 8
for a padded version of the hash, if we just use a straight hashmap that
duplicates the hash field.

So depending on how you count it, we're wasting between 28% (sha1 and no
extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
it's probably not breaking the bank.
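
To put the byte accounting in concrete terms, here's roughly what an
entry would look like on a 64-bit build (the names are just for
illustration, not a proposed API; struct object_id is git's hash
container, 20 bytes today and 32 with sha256):

  struct oid_util_entry {
          struct oid_util_entry *next; /* chained "next" pointer:     8 bytes */
          unsigned int hash;           /* duplicated hash, padded to: 8 bytes */
          struct object_id oid;        /* 20 bytes (sha1), 32 (sha256) */
          void *util;                  /* the possibly-unused slot:   8 bytes */
  };
  /*
   * Waste relative to the rest of the entry: 8/28 =~ 28% with sha1 and
   * no duplicated hash, down to 8/48 =~ 16% with sha256 plus the hash.
   */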

Another way of thinking about it: large-ish (but not insane) repos have
on the order of 5-10 million objects. If we had an oidset that mentioned
every single object in the repository, that's 40-80MB wasted in the
worst case. For current uses of oidset, that's probably fine. It's
generally used only to collect ref tips (so probably two orders of
magnitude less).

If you're planning on using an oidset to mark every object in a
100-million-object monorepo, we'd probably care more. But I'd venture to
say that any scheme which involves generating that hash table on the fly
is doing it wrong. At that scale we'd want to look at compact
mmap-able on-disk representations.

So I think we may be better off going with the solution here that's
simpler and requires introducing less code. If it does turn out to be a
memory problem in the future, this is a _really_ easy thing to optimize
after the fact, because we have these nice abstractions.

-Peff


