Re: [PATCH] oidmap: map with OID as key

On Thu, 28 Sep 2017 16:05:57 -0400
Jeff King <peff@xxxxxxxx> wrote:

> When I saw that you were implementing "oidset" in terms of "oidmap", I
> was all ready to be crabby about this extra memory. But then I saw that
> the implementation tries hard not to waste any memory. :)
> 
> All of which is to say I gave this some thought when I was in the "ready
> to be crabby" phase, and came to the conclusion that it probably isn't
> that painful. An unused pointer is 8 bytes per entry. We're already
> spending 20 for the oid itself (which is likely to grow to 32
> eventually), plus 8 for the chained "next" pointer. Plus potentially 8
> for a padded version of the hash, if we just use a straight hashmap that
> duplicates the hash field.
> 
> So depending how you count it, we're wasting between 28% (sha1 and no
> extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> it's probably not breaking the bank.

Hmm... how did you get the 16% figure? The values, as I see it, are:
 - 32 for the sha256 hash itself
 - 8 for the "next" chain pointer
 - 8 for the padded hash
 - 8 for the "util" pointer

For an oidset, the padded hash and the "util" pointer are wasted, which works
out to 16/56 ≈ 28.6%. (If you assume, say, 8 bytes of malloc bookkeeping
overhead per entry, that becomes 16/64 = 25%.)

For other uses of an oidmap, the padded hash and "util" pointer are still
wasted, so the numbers are the same as above (except that the malloc
bookkeeping overhead is doubled, since the "util" pointer points at a
separately allocated payload).
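
To make the accounting concrete, here is roughly what an entry built on the
existing hashmap API would look like (just a sketch - the field names and
layout are illustrative, not necessarily what the patch does - assuming a
64-bit platform and a 32-byte object_id):

    struct oidmap_entry {
        struct hashmap_entry ent; /* 16 bytes: 8 for the "next" chain
                                   * pointer, 4 for the duplicated hash
                                   * code, 4 of alignment padding */
        struct object_id oid;     /* 32 bytes (sha256) */
        void *util;               /*  8 bytes; dead weight in an oidset */
    };                            /* sizeof == 56 */

The 16 wasted bytes are the padded hash inside "ent" plus the "util"
pointer.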

> Another way of thinking about it. Large-ish (but not insane) repos have
> on the order of 5-10 million objects. If we had an oidset that mentioned
> every single object in the repository, that's 40-80MB wasted in the
> worst case. For current uses of oidset, that's probably fine. It's
> generally used only to collect ref tips (so probably two orders of
> magnitude less).
> 
> If you're planning on using an oidset to mark every object in a
> 100-million-object monorepo, we'd probably care more. But I'd venture to
> say that any scheme which involves generating that hash table on the fly
> is doing it wrong. At that scale we'd want to look at compact
> mmap-able on-disk representations.

In a 100-million-object monorepo, we will probably end up operating only on the
"frontier" objects (objects that we do not have, but know about because some
object we do have references them) at worst. I don't have numbers, but I think
that is at least an order of magnitude smaller than 100M.

> So I think we may be better off going with the solution here that's
> simpler and requires introducing less code. If it does turn out to be a
> memory problem in the future, this is a _really_ easy thing to optimize
> after the fact, because we have these nice abstractions.

Optimizing away the padded hash should be straightforward. Optimizing away the
"util" pointer after we have code that uses it is (slightly?) less
straightforward, but still doable.

I still lean towards saving memory by eliminating the padded hash and
not using the "util" pointer, because the complication is contained
within only two files (oidmap.{c,h}), and because the constant factors
in memory savings might end up mattering. But I don't feel strongly
about that - I think any of the oidmaps that we have discussed is an
improvement over what we have now.
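
To illustrate what eliminating the padded hash would look like: an oid is
itself a cryptographic hash, so its leading bytes are already uniformly
distributed and can serve directly as the hashmap hash code. A compact entry
then only needs the chain pointer and the oid. A sketch (made-up names, not
the posted patch; struct object_id as declared in cache.h):

    #include <string.h>

    struct compact_oidmap_entry {
        struct compact_oidmap_entry *next; /*  8 bytes */
        struct object_id oid;              /* 32 bytes */
    };                                     /* sizeof == 40 */

    static unsigned int oid_hash(const struct object_id *oid)
    {
        unsigned int h;
        /* reuse the oid's own leading bytes as the hash code */
        memcpy(&h, oid->hash, sizeof(h));
        return h;
    }

That drops the per-entry cost from 56 bytes to 40, and the complication
stays contained in oidmap.{c,h}.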


